Text Compression Alan Tobias goes in for a little letter crunching on the Spectrum. Text compression methods have been in use on computer sys- tems for a number of years and a variety of techniques are available. A recent article in Your Computer (I.F. Boulton, Vol 4 No 3) demonstrated such a technique for the ZX-81 where the unused character codes were used to represent the more commonly found character pairs. This type of tokenisa- tion has been used widely with the unused ASCII codes 128 to 255 being taken to represent groups of letters, words or even frequently occurring phrases. However, a limitation of this method us due to the fact that only 128, or at most 255, token values are possible, ie, the text may be only partially tokenised. I will describe a simple text com- pression technique which overcomes this limitation and enables any piece of text to be totally tokenised. The principle of tokenising text appears very attractive since it is easy to see that commonly occurring words or phrases can be replaced by tokens of far fewer bytes than the original text. Single byte tokens, having values in the range 0 to 255, impose severe limitations as we have already noted. Two byte tokens, with values in the range 0 to 65535, are obviously excessive. However, when we con- sider the number of different words likely to be found in an adventure text database, or even in every day usage, we will find that this number is only a few thousand. There- fore, if we use two bytes per token and limit the number of items in the dictionary to 2048, which is more than ade- quate for the majority of applications, then the token number may be stored in 11 bits only with the 5 remaining bits being used to convey additional information about the text. In developing this technique I decided to use these bits to describe details of the text punctuation and layout. Before proceeding, we should note that one of the most commonly occurring characters in any passage of text is a space. Furthermore, we can see that, with the exception of words which terminate exactly at the end of full width lines (including any commas or periods), all words may be regarded as having a trailing space. Thus, if our decoding/ expandsion routine is able to apply this simple rule then there is no need to code spaces explicitly. An obvious application of some of the unused token bits is to indicate the presence of commas or periods following any word. By using a separate bit for each of these it is possible to accomodate words which are followed by both. When we construct the dictionary of different words which appear in the text, it is obviously desirable that any word occurring both at the start and in the middle of a sentence is stored in the dictionary once only. Therefore, it would be advantageous to use one of the token bits to indicate that a word should be output with a capital letter at its start. This requires that all words stored in the dictio- nary should have their first letters converted to lower case. Again it is possible for our decoding routine to apply some simple rules, namely that a word should be auto- matically output with a capital letter at the start of any text message or following a period. It will therefore be necessary to code the capital letter flag bit only when these are required in the middle of a sentence. Another punctuation item which could usefully be coded in a token bit is the presence of a newline character follo- wing a word plus its trailing blank, period or comma. However, we can again minimise the text coding by having the decoding routine provide a newline character automati- cally if the next word to be output will not fit within the current line. Thus, it will be necessary only to code those newline characters which are specifically required at par- ticular points in the text. In setting up the two byte tokens I have taken care to minimise the number of bits which will be set for any word. This means that for the large majority of words in a passage of text the token will contain the word's dictio- nary number only. For dictionary numbers less than 256 a single byte token would thus suffice in many cases. How- ever, we need some means of telling the decoding routine whether it should expect a one- or two-byte token. This can be achieved by using the one remaining unused bit of the token which, through necessity, will be located in the first byte of the token together with the low order bits of the dictionary item number. This arrangement will permit single byte tokens for dictionary numbers less than 128 when the words require no punctuation flag bits. The one remaining item we need to define is the means of terminating each string of tokens which represents a message. For this we require some unique value. If in the tokens we add 1 to the dictionary numbers, so that there is a maximum of 2047 dictionary entries numbered from 1 to 2047, we find that a zero value byte will provide a suit- able message terminator. Thus, if we want to print out any text message then it is necessary only to locate the appro- priate occurrence of a zero byte and begin printing at the token which follows it. For the present it is assumed that the total number of different text messages within the data base will not exceed 256. The required message number for output may then be specified by a single byte. Applications having more than 256 different text messages may be accomodated by permitting a suitable number of message lists all based upon a single text dictionary. In storing the text dictionary we need some means of marking the end of each entry. A suitable method of doing this appears in the Spectrum ROM and has been used here. It is done by setting bit 7 of the final character in each entry thus giving it an ASCII code of greated than 127. Any required dictionary entry is then found by locating the appropriate occurrence of a dictionary byte in which bit 7 has been set. Earlier, we saw that it was not necessary to code trail- ing spaces explicitly. Therefore, if we want to define phrases which are to be tokenised then it is necessary to consider other ways of representing spaces within these phrases. A simple solution is to represent them by some other character, preferably one which is unlikely to be found elsewhere in the text. The system described here has been designed to interpret the character '@' (ASCII 64) as a 'phrase space'. When found in a phrase this character is stored explicitly in the dictionary entry whereas during output it is replaced by a true space. Using the hex loader shown in listing 1, you can load the Z-80 machine-code routines which will both compress and expand text according to the system described above. List- ing 2 gives a hexadecimal dump of the Z-80 machine-code routines. The decoding routine occupies 199 bytes beginning at location 31000 and the compression routine occupies 315 bytes starting at location 31220. The code beginning at location 31199 merely sets up the registers for the expan- sion routine as used by this overall program. The first location of the printer buffer (23296) holds the total number of characters in an input line of lines of text while the remainder of the buffer - 23297 to 23551 - is used to store the input text. Listing 3 gives the Basic program which will drive the text compression routines described above. It is menu driven and is simple to use. Option 1 enables you to reset the base addresses for both the dictionary and the token- ised messages. It is essential that you select this option prior to initial text input. During operation of the program the dictionary of words and phrases is built up as the text input is scanned. This method of operation is sensible since, for large text data- bases, it is unlikely that both the original and compressed text may be stored simultaneously. In order to take full advantage of the use of single byte tokens it is advisable to enter some dummy text messages initially which contain the words you believe to occur most frequently in your text. When you have done this, reset the message storage but retain the dictionary. Because the program given here has been devised to com- press text as it is input from the keyboard, it enables you to lay out the text as required. During input you may type in up to eight lines of text at once - maximum of 255 characters - and each section of input does not have to finish at the end of a sentence. It is only necessary to ensure that you leave a space between each word. When you have completed the input for a message type the character "_" (ASCII 95) as the first character of a new single item of input. As was noted above, newline charac- ters are automatically provided if the next word to be printed will not fit into the current line. Additional new- line characters can be inserted into the text by including a "^" character (ASCII 94) at the appropriate place. With the exception of the characters shown in table 1, for which special interpretation applies, all ASCII charac- ters with codes in the range 32 to 127 will be treated as normal text characters. Note that if you want to save some compressed text for subsequent extension then it is essential that this is done by the program so that all pointers are preserved. These pointers are required only for the compression system's book-keeping and are not needed by any program which will use the compressed text. Similarly, the compression rou- tine, stored in locations 31220 to 31534, may be omitted from the target program. You will, however, require the heart of the expansion routine - stored in locations 31000 to 31198, or suitably relocated as required. On entry to the start of this routine, register 'a' should contain the required message number and register pairs 'hl' and 'bc' should point to the start addresses of the text tokens and dictionary respectively. Since these are saved as 'code' and are position independent they may be loaded into any region of memory for your target program. ___________________________________________________________ Table 1. Special input characters Charac- ASCII Interpretation ter Code @ 64 Treat as a space within phrases ^ 94 Insert newline character after word _ 95 End of input for current message ___________________________________________________________ Figure 1. Text Example and corresponding tokens. Message: You are in a dark, damp cellar with a narrow passageway leading south. Word/Phrase Token Value(s) You are in 2 a 4 dark, 7, 64 damp 8 cellar 10 with 12 a 4 narrow 14 passageway 16 leading 18 south. 21, 32, 0 No. of token bytes in compressed message = 14