My Photo
Location: Downham Market, Norfolk, United Kingdom

Saturday, December 17, 2016

On the efficiency of word-extraction functions

A number of BASIC dialects provide a function which returns a single word from a string, selected according to its ordinal position. Typically the string is split either at a space character or at a specified delimiter. One such dialect is Liberty BASIC, in which the word$() function does this. However BBC BASIC has never supported this functionality, either natively or (in the case of BBC BASIC for Windows) in a library. This begs the question: "why not?".

It all comes down to efficiency. BBC BASIC, despite being a traditional interpreted language, has a reputation for relatively fast execution. Whilst in part this results from careful design of the interpreter (which is coded in assembly language) just as important is the choice of what functionality to provide. Consider this piece of Liberty BASIC code:

t$ = "The quick brown fox jumps over the lazy dog."
for i = 1 to 9
  print word$(t$, i)

This works fine, but when you think about exactly what is going on under the hood not everything in the garden is lovely. Each time round the loop (except the first) the word$() function has to repeat everything it did on every previous iteration!

So on the second time round, when it is finding the 2nd word in the string, it can only do that by first finding the first word. And on the third time round, it is having to do again everything it did on both the first and second iterations!

By the time it's got to the 9th time round it's had to find the 1st word 9 times, the 2nd word 8 times and the 3rd word 7 times (and so on). This is hopelessly inefficient: just imagine trying to extract every word in an 80,000-word novel this way.

It makes much more sense to split the string into its individual words in one operation, then there is no wasteful repetition. And that's what the function that BBC BASIC does provide (FN_split in the STRINGLIB library) is designed to do.

If for some reason you cannot live without the functionality of LB's word$() there is a way of improving its efficiency. The code listed below splits the entire string into its component parts on the first call, then on subsequent calls it checks whether the same string has been supplied. If it has it doesn't bother to repeat the splitting operation but just returns the word from the previously created array:

INSTALL @lib$+"stringlib"

t$ = "The quick brown fox jumps over the lazy dog."
FOR i% = 1 TO 9
  PRINT FNword$(t$, i%)

DEF FNword$(a$, n%)
PRIVATE a$(), p$, c%
IF p$<>a$ p$ = a$ : c% = FN_split(p$, " ", a$())
IF n%<1 OR n%>c% THEN = ""
= a$(n%-1) 

The price you pay for the improved speed is memory usage, because the array must be retained from one call to the next. What's more the array's memory is leaked each time the function is called with a string containing more words than the previous maximum.


Post a Comment

<< Home