I'm revamping my programming skills and implemented the Huffman algorithm. For now, I'm just considering [a-z] with no special characters. The probability values for a-z have been used from wikipedia.
When I run it, I get roughly 2x compression for random paragraphs. But for this calculation I assume original letters require 8 bits each (ASCII).
But if I think about it, to represent 26 items, i just need 5 bits. If I calculate based on this fact, then compression factor drops to almost 1.1
So my question is, how is the compression factor determined in real world applications?
2nd question - if I write an encoder / decoder which uses 5 bits for representing a-z ( say a=0, b=1, etc) - is this also a considered a valid "compression" algorithm?
You have essentially the right answer, which is that you can't expect a lot of compression if all that you're working with is the letter frequencies of the English language.
The correct way to calculate the gain resulting from knowledge of the letter frequencies is to consider the entropy of a 26-symbol alphabet of equal probabilities with the entropy of the letters in English.
(I wish stackoverflow allowed TeX equations like math.stackexchange.com does. Then I could write decent equations here. Oh well.)
The key formula is -p log(p), where p is the probability of that symbol and the log is in base 2 to get the answer in bits. You calculate this for each symbol and then sum over all symbols.
Then in an ideal arithmetic coding scheme, an equiprobable set of 26-symbols would be coded in 4.70 bits per symbol. For the distribution in English (using the probabilities from the Wikipedia article), we get 4.18 bits per symbol. A reduction of only about 11%.
So that's all the frequency bias by itself can buy you. (It buys you a lot more in Scrabble scores, but I digress.)
We can also look at the same thing in the approximate space of Huffman coding, where each code is an integral number of bits. In this case you would not assume five bits per letter (with six codes wasted). Applying Huffman coding to 26 symbols of equal probability gives six codes that are four bits in length and 20 codes that are five bits in length. This results in 4.77 bits per letter on average. Huffman coding using the letter frequencies occurring in English gives an average of 4.21 bits per letter. A reduction of 12%, which is about the same as the entropy calculation.
There are many ways that real compressors do much better than this. First, they code what is actually in the file, using the frequencies of what's there instead of what they are across the English language. This makes it language independent, optimizes for the actual contents, and doesn't even code symbols that are not present. Second, you can break up the input into pieces and make a new code for each. If the pieces are big enough, then the overhead of transmitting a new code is small, and the gain is usually larger to optimize on a smaller chunk. Third, you can look for higher order effects. Instead of the frequency of single letters, you can take into account the previous letter and look at the probability of the next letter given its predecessor. Now you have 26^2 probabilities (for just letters) to track. These can also be generated dynamically for the actual data at hand, but now you need more data to get a gain, more memory, and more time. You can go to third order, fourth order, etc. for even greater compression performance at the cost of memory and time.
There are other schemes to pre-process the data by, for example, doing run-length encoding, looking for matching strings, applying block transforms, tokenizing XML, delta-coding audio or images, etc., etc. to further expose redundancies for an entropy coder to then take advantage of. I alluded to arithmetic coding, which can be used instead of Huffman to code very probable symbols in less than a bit and all symbols to fractional bit accuracy for better performance in the entropy step.
Back to your question of what constitutes compression, you can begin with any starting point you like, e.g. one eight-bit byte per letter, make assertions about your input, e.g. all lower case letters (accepting that if the assertion is false, the scheme fails), and then assess the compression effectiveness. So long as you use all of the same assumptions when comparing two different compression schemes. You must be careful that anything that is data dependent must also be considered part of the compressed data. E.g. a custom Huffman code derived from a block of data must be sent with that block of data.