|
|
|
|
INTRODUCTION The purpose of this text file is to highlight the major features of location based encoding (LBE). LBE is a method of file compression - to the uninitiated -compression is a method of reducing the size of a file. There are two types of compression - lossy and lossless. Lossless compression is a method by which a file is encoded in such a way that when the file is finally decompressed all of the original information is still present. Lossy compression however is the method by which some information - mostly superficial - is lost. This method is obviously best suited for movie and audio files where a slight loss in resolution and picture quality hardly makes any difference. LBE is a lossless compression algorithm. In a pc based on ASCII - the American Standard Code for Information Interchange - all characters are represented using 8 bits. However it is to be noted that some characters in text and other files occur more often than others. For e.g. in a text file, the “space” character occurs more often than any other character. So this algorithm like the Huffman encoding scheme advocates the usage of lesser number of bits to represent the more frequently occurring characters and more number of bits to represent characters with comparatively lesser frequencies. How LBE works LBE accomplishes the task outlined above by first extracting the unique characters from the file to be compressed and then sorting them according to their frequency of occurrence. Once this is done the characters are then put in a matrix. But the characters are not simply inserted in the usual column or row major. They have to be arranged in such a way that the distance (in terms of number of characters) from the first (1,1) element is minimum. This is achieved by placing them in a particular order. This idea is explained in the figure below. Please note that the numbers 1,2,3 - represent the most frequently occurring character, the second most frequently occurring character and so on respectively and not the characters themselves.
1 2 4 7 11 3 5 8 12 6 9 13 10 14 and so on..... -- FIG1 15 It is pretty clear from the figure that the distance between the (1,1) element and the rest of the characters is the least when compared to putting the characters in rows or columns. To represent any character in this matrix - for e.g. the 9th element - its location is first obtained. It happens to be in the 3rd row and the 2nd column. This character is now represented as 00101- the first two zeros indicate that the character is not present in either the first or the second row. The third character - 1 indicates that the required character is present in the third row. The next zero indicates that the character isn’t in the 1st column. The next character, which happens to be one, tells the program that the character is in the 2nd column. Similarly the codes for the 6th, 7th and 8th most frequently occurring characters are 6 - 0011 7 - 10001 8 - 01001 Note that the number of bits required to represent the characters in each slanting row are the same - the number of bits required to represent 2 and 3 are the same. This applies to 4,5,6 and also to 7,8,9,10. This is quite important, as you will see later. As is expected all unique characters are encoded in this fashion and every time they are encountered in the text file to be compressed they are replaced by this variable length binary sequence rather than their ASCII equivalents. You will notice that the number of bits required to represent the less frequently occurring characters is a lot more than the no of bits for the first few characters. While decompressing, since it is not known which code sequence belongs to which character, all unique characters are stored in a header in the descending order of frequency. The maximum no of characters, which can be represented within 7 bits, is 22. For all other characters 8 bits or more are required. This obviously is not profitable, as the representation of such characters will only serve to increase the size of the file. So there is a need to increase the number of characters, which can be represented with seven bits. This is achieved by using the method of multiple tables. As the name implies it involves the creation of multiple tables to store the characters. The code for each of the characters in the new tables is generated the same way as above. The only difference is that code for each of the characters in tables are prefixed by 2(n)-2 number of ‘1’ bits where n =1,2,3... For e.g. - codes for the 1st table contain 0 prefixed ‘1’ bits Codes for the 2nd table contain 2 prefixed ‘1’ bits Codes for the 3rd table contain 4 prefixed ‘1’ bits and so on... To make this clearer, assume that the 9th character in FIG 1 was in the 3 rd table. Then the code for it is 111100101. Note the 4 ‘1’ bits in front of the usual code for the ninth element. Arranging the elements in multiple tables is a lot more complex. This is how we put our characters in multiple tables. As usual the numbers represent the rank of the characters in terms of frequency of occurrence I table II table III table...and so on * 1 3 6 12 20 32 * 10 17 26 39 * 30 44 2 4 7 13 21 33
11 18 27 40
31 45 5 8 14 22 34
19 28 41
46 9 15 23 35
29 42 16 24 36
43 25 37 38
-- FIG II Notice that the after the 9th character, the 10th and 11th characters are put in table number II before returning to table number I. Note the arrangement after the 16th character also. The reason for arranging characters in such a manner is to ensure that the more frequently occurring characters are represented by as few bits as possible. For example, the tenth character is represented by 11101 (the first 2 bits indicating that the character is in the 2nd table)- i.e. a total of 5 bits. If it had been left in table number I, then the code for character number 10 would have been 100001 – a total of 6 bits. Thus a new table is created if it is discovered that putting characters in the new table is more profitable than continuing to insert them into the present table. Insertion of characters into the new table is stopped when it is discovered that putting them into the older tables is more profitable. Decompressing files encoded in LBE is simple. From the header the unique characters are obtained in the descending order of frequency. With this information the tables are created again. The codes are then read from the file. To obtain the table in which the character is present the number of even number of ‘1’ bits present in the beginning of the sequence is obtained. This number is divided by 2 and then incremented by one to obtain the table. Then the rest of the sequence is obtained to locate the character required. For e.g. if the code encountered is 1111101(ref fig II)
1111
1
01
|
|
|
Reps
the table number indicates
the indicates the
column The
no of ‘1’ bits =4 row = 1
= 2 Therefore the table Number= 3
-- FIG III Therefore the required character is the 30th most repeating character. However, even by the method of multiple tables, the total number of unique characters that can be represented by 7 bits is just 31. PROFITABILITY Profitability is a major concern that I have not outlined before. Even the characters that are represented by 7 bits are not immediately profitable. This is because of the presence of the header in which the unique characters are listed. The characters in the header are coded in normal ASCII- 8 bit representation. Say during coding a character is to be encoded in LBE with 4 bits. The first time a character occurs in the file to be compressed it will be replaced with the 4 bit LBE encoding. But the total number of bits in the file referring to the existence of this character is 12 - (8 in the header and 4 in the actual coding). This is 4 more than in the original file. The second time the same character is encountered it is replaced again by 4 bits. The total number of bits this time is (8 - header + 4 - code for the first occurrence +4 - code for the second occurrence). This is equal to the number of bits in the original text file (=16). It’s only the next time the character appears (3rd occurrence) is the coding technique profitable. Therefore replacing this character is suitable only when there are at least 3 occurrences of the character. Thus, every character in the tables has a minimum requirement of character occurrences depending on the length of its bit sequence. As the length of the bit sequence increases the minimum requirement for every character also increases. FILE SPLITTING File splitting was proposed to serve various sections of the file better. By file splitting we mean splitting the file into many sections and then applying LBE to each of these sections. It is thought we will be able to achieve, for the entire file, a higher compression ratio. This is because the tables created for each of the sections will reflect better, the needs for that section. We have still not defined an efficient algorithm to split the file ADVANTAGES OF LBE OVER HUFFMAN ENCODING The main and obvious advantage of LBE is the sheer speed. LBE can immediately generate the codes for characters without being given the file to be compressed. This is in sharp contrast with Huffman in which a binary tree has to be constructed every time a new file is to be compressed - taking precious processing time. Once the program is loaded, LBE requires time only for code substitution - replacing characters with their bit sequences. The reduction in compression ratio with respect to Huffman is also negligible. For e.g. while LBE gives a compression ratio of 36.95% for a 1.3 MB file, Huffman gives a ratio of 38.77% for the same file. MODIFICATIONS FOR EFFICIENT IMPLEMENTATION
In a program, we don’t actually put the characters in the matrix. Instead we automatically generate the code for each of the characters. All characters are arranged in a linear array. A function is first used to assign the table numbers for each character. Depending on the table number and the number of characters already present in the “table”, the appropriate code is generated. Then the file to be compressed is read. As the characters present in the file are read, binary search is used to locate the character in the array and the code associated with it is then written into the compressed file. RATIOS FOR SOME FILE TYPES
All sizes in bytes. The following figures don’t represent either the best or the worst ratios for the files listed below.
Size
compressed size
ratios (%) Pdf 77012 58620 23.88 Htm 171764 107797 37.24 Bmp 139926 54222 61.24 Mdb 2471936 1893296 23.408 Doc 36352 19165 47.27
Txt 1395473 879798 36.95 Exe 22196 18676 15.85
kindly note that these are neither the best or the worst results for these
type of files. |
|
Members Archive How LBE WORKSHome
For problems or questions regarding this project contact iamthebiggestone@yahoo.co.in.
|