[Table 9.2 (Section 9.2, Data compression): Huffman coding of data files 1 ("zebra"), 2, 3 ("snakes"), and 4 ("parking"). For each symbol x the table lists the frequency f, the probability p(x), the Huffman codeword (CW), the codeword length l(x), and the contribution l(x)p(x), together with the mean codeword length (bit/symbol). The resulting compression ratios are 33.89%, 27.14%, 34.71%, and 20.00%, respectively.]

A closer look at the table data shows that the factor that appears to increase the compression ratio is the frequency spread in the top group of most frequent characters.

If the most frequent characters have dissimilar frequencies, then shorter codewords can be assigned to a larger number of symbols. We observe that the first three data files, corresponding to nontypical English texts, lend themselves to greater compression than the fourth data file, corresponding to ordinary English. There is no need to go through tedious statistics to conclude beforehand that increasing the length of such English-text data files would give compression ratios increasingly closer to the limit of r = 15.76%. Clearly, this is because the probability distribution of long English-text sequences will duplicate with increasing fidelity the standard distribution for which we have found this compression limit. On the other hand, shorter sequences of only a few characters might have significantly higher compression ratios.
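The dependence of the compression ratio on the symbol-frequency distribution is easy to explore numerically. The following sketch (illustrative Python, not part of the original text) derives Huffman codeword lengths from symbol counts by the standard merge-two-least-frequent construction, then computes the ratio r = 1 − (mean codeword length)/5 relative to the five-bit fixed-length code used in this chapter:

```python
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code built from a
    {symbol: count} map, by repeatedly merging the two lightest subtrees."""
    # Heap entries: (weight, tiebreak, {symbol: depth_so_far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Degenerate single-symbol file (e.g. "AAA"): one 1-bit codeword.
        return {s: 1 for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, g1 = heapq.heappop(heap)
        w2, _, g2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper.
        merged = {s: d + 1 for s, d in {**g1, **g2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def compression_ratio(text, fixed_bits=5):
    """r = 1 - (mean Huffman bits/symbol) / (fixed-length bits/symbol)."""
    freqs = Counter(text)
    lengths = huffman_lengths(freqs)
    mean_len = sum(freqs[s] * lengths[s] for s in freqs) / len(text)
    return 1 - mean_len / fixed_bits
```

As a check, the single-symbol file AAA yields r = 0.8, in agreement with the extreme example discussed in the text, while a file of four equally frequent symbols yields two-bit codewords and r = 1 − 2/5 = 60%.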

To take an extreme example, the data file AAA (for American Automobile Association) takes 1 bit/symbol and, thus, has a compression ratio of r = 1 − 1/5 = 80%. If we take the full ASCII code for reference (7 bits/character), the compression becomes r = 1 − 1/7 ≈ 85.7%.

The above examples have shown that, for any given data file, there exists an optimal (Huffman) code that achieves maximum data compression. As we have seen, the codeword assignment is different for each data file to be compressed. Therefore, one needs to keep track of which code was used for compression in order to be able to recover the original, uncompressed data.

This information, which we refer to as overhead, must then be transmitted along with the compressed data, which we refer to as payload. Since the overhead bits reduce the effective compression rate, it is clear that the overhead should be as small as possible relative to the payload. In the previous examples, the overhead is simply the one-to-one correspondence table between codewords and symbols.

Using five-bit (ASCII) codewords to designate each of the character symbols, and a five-bit field to designate the corresponding compressed codewords, makes a ten-bit overhead per data-file symbol. Taking, for instance, data file 3 (Table 9.2), there are 13 symbols, which produces 130 bits of overhead. It is easily calculated that the payload represents 111 bits, which leads to a total of 130 + 111 = 241 bits for the complete compressed file (overhead + payload). In contrast, a five-bit ASCII code for the same uncompressed data file would represent only 170 bits, as can also be easily verified. The compressed file thus turns out to be 40% bigger than the uncompressed one! The conclusion is that compressing a data file pays off only when the file is long enough for the overhead to represent a small fraction of the payload.
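The bit accounting for data file 3 can be verified in a few lines (the 13 distinct symbols, 34-character file length, and 111-bit payload are figures taken from Table 9.2; illustrative Python):

```python
# Overhead/payload accounting for data file 3 (figures from Table 9.2).
n_distinct = 13        # distinct symbols appearing in the file
n_symbols = 34         # total characters (170 bits / 5 bits per character)
payload = 111          # Huffman-coded bits, i.e. the sum of f(x) * l(x)

overhead = n_distinct * (5 + 5)    # 5-bit symbol + 5-bit codeword field
compressed_total = overhead + payload
uncompressed = n_symbols * 5

print(overhead, compressed_total, uncompressed)             # 130 241 170
print(f"{compressed_total / uncompressed - 1:.1%} bigger")  # 41.8% bigger
```

The compressed file (241 bits) indeed exceeds the uncompressed one (170 bits) by roughly 40%, entirely because of the codeword-table overhead.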

This consideration illustrates the interest of acronyms. Their primary use is to save text space, easing reading and avoiding burdensome redundancies. This is particularly true of technical papers, where publication space is usually limited.

An equally important use of acronyms is to capture abstract concepts in small groups of characters, for instance ADSL (asymmetric digital subscriber line) or HTML (hypertext markup language). The most popular acronyms are the ones that are easy to remember, such as FAQ (frequently asked questions), IMHO (in my humble opinion), WYSIWYG (what you see is what you get), NIMBY (not in my backyard), and the champion, VERONICA (very easy rodent-oriented netwide index to computerized archives). The repeated use of acronyms makes them progressively accepted as true English words or generic brand names, to the point that their original character-to-word correspondence is eventually forgotten by their users, for instance: PC for personal computer, GSM for global system for mobile [communications], LASER for light amplification by stimulated emission of radiation, NASDAQ for National Association of Securities Dealers Automated Quotations, and so on.

Language may thus act as a natural self-compression machine, which uses the human mind as a convenient dictionary. In practice, this dictionary is only rarely referred to, since the acronym gains its own meaning through repeated use.
