Text Compression

             Alan Tobias goes in for a little
             letter crunching on the Spectrum.

Text compression methods have been in use on computer sys-
tems for a number of years and a variety of techniques are
available. A recent article in Your Computer (I.F. Boulton,
Vol 4 No 3) demonstrated such a technique for the ZX-81
where the unused character codes were used to represent the
more commonly found character pairs. This type of tokenisa-
tion has been used widely with the unused ASCII codes 128
to 255 being taken to represent groups of letters, words or
even frequently occurring phrases. However, a limitation of
this method us due to the fact that only 128, or at most
255, token values are possible, ie, the text may be only
partially tokenised. I will describe a simple text com-
pression technique which overcomes this limitation and
enables any piece of text to be totally tokenised.
  The principle of tokenising text appears very attractive
since it is easy to see that commonly occurring words or
phrases can be replaced by tokens of far fewer bytes than
the original text. Single byte tokens, having values in the
range 0 to 255, impose severe limitations as we have
already noted. Two byte tokens, with values in the range 0
to 65535, are obviously excessive. However, when we con-
sider the number of different words likely to be found in
an adventure text database, or even in every day usage, we
will find that this number is only a few thousand. There-
fore, if we use two bytes per token and limit the number of
items in the dictionary to 2048, which is more than ade-
quate for the majority of applications, then the token
number may be stored in 11 bits only with the 5 remaining
bits being used to convey additional information about the
text. In developing this technique I decided to use these
bits to describe details of the text punctuation and
layout.
  Before proceeding, we should note that one of the most
commonly occurring characters in any passage of text is a
space. Furthermore, we can see that, with the exception of
words which terminate exactly at the end of full width
lines (including any commas or periods), all words may be
regarded as having a trailing space. Thus, if our decoding/
expandsion routine is able to apply this simple rule then
there is no need to code spaces explicitly.
  An obvious application of some of the unused token bits
is to indicate the presence of commas or periods following
any word. By using a separate bit for each of these it is
possible to accomodate words which are followed by both.
  When we construct the dictionary of different words which
appear in the text, it is obviously desirable that any word
occurring both at the start and in the middle of a sentence
is stored in the dictionary once only. Therefore, it would
be advantageous to use one of the token bits to indicate
that a word should be output with a capital letter at its
start. This requires that all words stored in the dictio-
nary should have their first letters converted to lower
case. Again it is possible for our decoding routine to
apply some simple rules, namely that a word should be auto-
matically output with a capital letter at the start of any
text message or following a period. It will therefore be
necessary to code the capital letter flag bit only when
these are required in the middle of a sentence.
  Another punctuation item which could usefully be coded in
a token bit is the presence of a newline character follo-
wing a word plus its trailing blank, period or comma.
However, we can again minimise the text coding by having
the decoding routine provide a newline character automati-
cally if the next word to be output will not fit within the
current line. Thus, it will be necessary only to code those
newline characters which are specifically required at par-
ticular points in the text.
  In setting up the two byte tokens I have taken care to
minimise the number of bits which will be set for any word.
This means that for the large majority of words in a
passage of text the token will contain the word's dictio-
nary number only. For dictionary numbers less than 256 a
single byte token would thus suffice in many cases. How-
ever, we need some means of telling the decoding routine
whether it should expect a one- or two-byte token. This can
be achieved by using the one remaining unused bit of the
token which, through necessity, will be located in the
first byte of the token together with the low order bits of
the dictionary item number. This arrangement will permit
single byte tokens for dictionary numbers less than 128
when the words require no punctuation flag bits.
  The one remaining item we need to define is the means of
terminating each string of tokens which represents a
message. For this we require some unique value. If in the
tokens we add 1 to the dictionary numbers, so that there is
a maximum of 2047 dictionary entries numbered from 1 to
2047, we find that a zero value byte will provide a suit-
able message terminator. Thus, if we want to print out any
text message then it is necessary only to locate the appro-
priate occurrence of a zero byte and begin printing at the
token which follows it.
  For the present it is assumed that the total number of
different text messages within the data base will not
exceed 256. The required message number for output may then
be specified by a single byte. Applications having more
than 256 different text messages may be accomodated by
permitting a suitable number of message lists all based
upon a single text dictionary.
  In storing the text dictionary we need some means of
marking the end of each entry. A suitable method of doing
this appears in the Spectrum ROM and has been used here.
It is done by setting bit 7 of the final character in each
entry thus giving it an ASCII code of greated than 127.
Any required dictionary entry is then found by locating the
appropriate occurrence of a dictionary byte in which bit 7
has been set.
  Earlier, we saw that it was not necessary to code trail-
ing spaces explicitly. Therefore, if we want to define
phrases which are to be tokenised then it is necessary to
consider other ways of representing spaces within these
phrases. A simple solution is to represent them by some
other character, preferably one which is unlikely to be
found elsewhere in the text. The system described here has
been designed to interpret the character '@' (ASCII 64) as
a 'phrase space'. When found in a phrase this character is
stored explicitly in the dictionary entry whereas during
output it is replaced by a true space.
  Using the hex loader shown in listing 1, you can load the
Z-80 machine-code routines which will both compress and
expand text according to the system described above. List-
ing 2 gives a hexadecimal dump of the Z-80 machine-code
routines. The decoding routine occupies 199 bytes beginning
at location 31000 and the compression routine occupies 315
bytes starting at location 31220. The code beginning at
location 31199 merely sets up the registers for the expan-
sion routine as used by this overall program.
  The first location of the printer buffer (23296) holds
the total number of characters in an input line of lines of
text while the remainder of the buffer - 23297 to 23551 -
is used to store the input text.
  Listing 3 gives the Basic program which will drive the
text compression routines described above. It is menu
driven and is simple to use. Option 1 enables you to reset
the base addresses for both the dictionary and the token-
ised messages. It is essential that you select this option
prior to initial text input.
  During operation of the program the dictionary of words
and phrases is built up as the text input is scanned. This
method of operation is sensible since, for large text data-
bases, it is unlikely that both the original and compressed
text may be stored simultaneously. In order to take full
advantage of the use of single byte tokens it is advisable
to enter some dummy text messages initially which contain
the words you believe to occur most frequently in your
text. When you have done this, reset the message storage
but retain the dictionary.
  Because the program given here has been devised to com-
press text as it is input from the keyboard, it enables you
to lay out the text as required. During input you may type
in up to eight lines of text at once - maximum of 255
characters - and each section of input does not have to
finish at the end of a sentence. It is only necessary to
ensure that you leave a space between each word.
  When you have completed the input for a message type the
character "_" (ASCII 95) as the first character of a new
single item of input. As was noted above, newline charac-
ters are automatically provided if the next word to be
printed will not fit into the current line. Additional new-
line characters can be inserted into the text by including
a "^" character (ASCII 94) at the appropriate place.
  With the exception of the characters shown in table 1,
for which special interpretation applies, all ASCII charac-
ters with codes in the range 32 to 127 will be treated as
normal text characters.
  Note that if you want to save some compressed text for
subsequent extension then it is essential that this is done
by the program so that all pointers are preserved. These
pointers are required only for the compression system's
book-keeping and are not needed by any program which will
use the compressed text. Similarly, the compression rou-
tine, stored in locations 31220 to 31534, may be omitted
from the target program. You will, however, require the
heart of the expansion routine - stored in locations 31000
to 31198, or suitably relocated as required. On entry to
the start of this routine, register 'a' should contain the
required message number and register pairs 'hl' and 'bc'
should point to the start addresses of the text tokens and
dictionary respectively. Since these are saved as 'code'
and are position independent they may be loaded into any
region of memory for your target program.
___________________________________________________________

 Table 1. Special input characters

 Charac-  ASCII  Interpretation
   ter    Code

    @      64    Treat as a space within phrases
    ^      94    Insert newline character after word
    _      95    End of input for current message
___________________________________________________________

 Figure 1. Text Example and corresponding tokens.

 Message: You are in a dark, damp cellar with a narrow
 passageway leading south.
       Word/Phrase   Token Value(s)
       You are in       2
       a                4
       dark,            7, 64
       damp             8
       cellar           10
       with             12
       a                4
       narrow           14
       passageway       16
       leading          18
       south.           21, 32, 0
 No. of token bytes in compressed message = 14