Compress big integers into the smallest possible string


I have a bunch of 10 digit integers that I'm passing in a URL. Something like: "4294965286", "2292964213". They will always be positive and always be 10 digits.

I'd like to compress those integers into the smallest possible form that can still be used in a URL (letters and numbers are perfectly fine) and then uncompress them later. I've looked at using GZipStream, but it creates larger strings, not shorter ones.

I'm currently using ASP.NET, so a VB.NET or C# solution would be best.

Thanks


Yes. GZIP is a compression algorithm which both requires compressible data and has an overhead (framing and dictionaries, etc). An encoding algorithm should be used instead.

The "simple" method is to use base-64 encoding.

That is, convert the number (which the string represents in base 10) to the actual series of bytes that represent it (5 bytes will cover any 10-digit decimal number) and then base-64 encode that result. Each base-64 character stores 6 bits of information, whereas a decimal digit stores only ~3.3 bits, so the output is just over half the length of the input (in this case, 6* base-64 output characters are required).

Additionally, since the input/output lengths are obtainable from the data itself, "123" could be converted to 1 byte before being base-64 encoded, "30000" to 2 bytes, etc. This would be advantageous if the numbers are not all approximately the same length.
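The variable-length idea can be sketched roughly as follows. This is only an illustration, not code from any of the answers: the class name `CompactBase64` and its methods are made up here, it uses the standard base-64 alphabet (so `+` and `/` can still appear), and it works on whole bytes, which means a full 10-digit value of 2^32 or more comes out as 7 characters rather than 6.

```csharp
using System;

static class CompactBase64   // hypothetical helper, not part of any library
{
    // Encode a non-negative value using only as many big-endian bytes as it needs,
    // then base-64 those bytes and drop the '=' padding.
    public static string Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));

        byte[] bytes = BitConverter.GetBytes(value);            // 8 bytes, machine order
        if (BitConverter.IsLittleEndian) Array.Reverse(bytes);  // make big-endian

        // Skip leading zero bytes, but always keep at least one byte.
        int start = 0;
        while (start < bytes.Length - 1 && bytes[start] == 0) start++;

        return Convert.ToBase64String(bytes, start, bytes.Length - start).TrimEnd('=');
    }

    public static long Decode(string text)
    {
        // Restore the padding that Encode removed (length must be a multiple of 4).
        string padded = text.PadRight((text.Length + 3) / 4 * 4, '=');
        byte[] bytes = Convert.FromBase64String(padded);

        long value = 0;
        foreach (byte b in bytes) value = (value << 8) | b;     // big-endian accumulate
        return value;
    }
}
```

With this sketch, `CompactBase64.Encode(123)` gives `"ew"` (1 byte) and `CompactBase64.Encode(30000)` gives `"dTA"` (2 bytes); the decoder recovers the byte count from the length of the text.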

Happy coding.


* Using base-64 requires 6 output characters.

Edit: I was wrong initially where I said "2.3 bits/char" for decimal and proposed that less than half the characters were required. I have updated the answer above and show the (should be correct) math here, where lg(n) is log to the base 2.

The number of input bits required to represent the input number is bits/char * chars -> lg(10) * 10 (or just lg(9999999999)) -> ~33.2 bits. Using jball's manipulation to shift the number first (subtracting 1000000000, since every value has exactly 10 digits), the number of bits required is lg(8999999999) -> ~33.06 bits. However, this transformation isn't able to increase the efficiency in this particular case (the number of input bits would need to be reduced to 30 or below to make a difference here).

So we try to find an x (number of characters in base-64 encoding) such that:

lg(64) * x = 33.2 -> 6 * x = 33.2 -> x ~ 5.53. Of course five and a half characters is nonsensical, so we choose 6 as the maximum number of characters required to encode a value up to 9999999999 in base-64 encoding. This is slightly more than half of the original 10 characters.
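For anyone who wants to replay that arithmetic, here is a small sketch. The names `BitMath`, `BitsFor` and `CharsFor` are invented for this example; it relies on floating-point logarithms, so it is only meant for eyeballing the numbers and could round the wrong way when the ratio lands exactly on a whole number.

```csharp
using System;

static class BitMath   // hypothetical helper for checking the numbers above
{
    // Bits needed to distinguish `count` different values: lg(count).
    public static double BitsFor(double count) => Math.Log(count, 2);

    // Characters needed, in an alphabet of `alphabetSize` symbols, to carry `bits` bits.
    public static int CharsFor(double bits, int alphabetSize) =>
        (int)Math.Ceiling(bits / Math.Log(alphabetSize, 2));
}
```

`BitMath.BitsFor(1e10)` comes out at roughly 33.2, and `BitMath.CharsFor(BitMath.BitsFor(1e10), 64)` rounds the ~5.5 characters up to 6.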

However, it should be noted that to obtain only 6 characters in base-64 output requires a non-standard base-64 encoder or a little bit of manipulation (most base-64 encoders only work on whole bytes). This works because out of the original 5 "required bytes" only 34 of the 40 bits are used (the top 6 bits are always 0). It would require 7 base-64 characters to encode all 40 bits.

Here is a modification of the code that Guffa posted in his answer (if you like it, go give him an up-vote) that only requires 6 base-64 characters. Please see other notes in Guffa's answer and Base64 for URL applications as the method below does not use a URL-friendly mapping.

long value = 4294965286L; // example input from the question
byte[] data = BitConverter.GetBytes(value);
// make data big-endian if needed
if (BitConverter.IsLittleEndian) {
   Array.Reverse(data);
}
// the first 5 base-64 characters are always "A" (the first 30 bits are always zero),
// so only the 6 characters (36 bits) that follow need to be kept;
// the 12th character is "=" padding
string base64 = Convert.ToBase64String(data, 0, 8).Substring(5,6);

byte[] data2 = new byte[8];
// add back in all the characters removed during encoding
Convert.FromBase64String("AAAAA" + base64 + "=").CopyTo(data2, 0);
// reverse again from big to little-endian
if (BitConverter.IsLittleEndian) {
   Array.Reverse(data2);
}
long decoded = BitConverter.ToInt64(data2, 0);
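Since the snippet above can still produce `+` and `/`, here is one possible way to wrap it with a URL-friendly substitution. This is a sketch under assumptions: the class name `UrlSafeSix` is invented, the `-` and `_` replacements follow the common "base64url" convention, and the input is assumed to fit in 34 bits (which every 10-digit number does).

```csharp
using System;

static class UrlSafeSix   // hypothetical wrapper around the 6-character trick above
{
    // Assumes 0 <= value < 2^34, which covers every 10-digit decimal number.
    public static string Encode(long value)
    {
        if (value < 0 || value >= (1L << 34)) throw new ArgumentOutOfRangeException(nameof(value));

        byte[] data = BitConverter.GetBytes(value);
        if (BitConverter.IsLittleEndian) Array.Reverse(data);   // big-endian

        // Keep the 6 characters that carry the low 34 bits, then swap the two
        // characters that are awkward in URLs ('+' and '/') for '-' and '_'.
        return Convert.ToBase64String(data).Substring(5, 6)
                      .Replace('+', '-').Replace('/', '_');
    }

    public static long Decode(string text)
    {
        if (text == null || text.Length != 6)
            throw new ArgumentException("expected exactly 6 characters", nameof(text));

        // Undo the substitution, then add back what Encode removed.
        string standard = text.Replace('-', '+').Replace('_', '/');
        byte[] data = Convert.FromBase64String("AAAAA" + standard + "=");
        if (BitConverter.IsLittleEndian) Array.Reverse(data);   // back to machine order
        return BitConverter.ToInt64(data, 0);
    }
}
```

As an example of why the substitution matters, 9999999999 encodes to `lQL4/8` with the standard alphabet, and this sketch returns `lQL4_8` instead.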


Making it "prettier"

Since base-64 has been shown to need 6 characters, any encoding variant that still fits the input bits into 6 characters will produce output that is just as small. A base-32 encoding won't quite make the cut, as in base-32 encoding 6 characters can only store 30 bits of information (lg(32) * 6).

However, the same output size could be achieved with a custom base-48 (or 52/62) encoding. (The advantage of a base between 48 and 62 is that it only requires a subset of the alphanumeric characters and does not need symbols; optionally, "ambiguous" characters like "1" and "I" can be avoided in such variants.) With a base-48 system the 6 characters can encode ~33.5 bits (lg(48) * 6) of information, which is just above the ~33.2 (or ~33.06) bits (lg(10) * 10) required.
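The base-32 versus base-48 comparison can also be checked with exact integer arithmetic instead of logarithms. The sketch below is just that check; `Capacity.MinChars` is a name made up for this example.

```csharp
using System;

static class Capacity   // hypothetical helper, for checking alphabet sizes
{
    // Smallest number of characters needed to write every value in 0..maxValue
    // using an alphabet of `alphabetSize` symbols (exact integer arithmetic).
    public static int MinChars(long maxValue, int alphabetSize)
    {
        if (maxValue < 0 || alphabetSize < 2) throw new ArgumentOutOfRangeException();

        int chars = 1;
        long remaining = maxValue / alphabetSize;
        while (remaining > 0)
        {
            remaining /= alphabetSize;   // strip one more digit in this base
            chars++;
        }
        return chars;
    }
}
```

For the largest 10-digit value, `Capacity.MinChars(9999999999, 32)` is 7, while bases 48, 62 and 64 all give 6.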

Here is a proof-of-concept:

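// Needs: using System; using System.Collections.Generic; using System.Diagnostics; using System.Linq;
// This does not "pad" values, so small inputs produce fewer than 6 characters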
string Encode(long inp, IEnumerable<char> map) {
    Debug.Assert(inp >= 0, "not implemented for negative numbers");

    var b = map.Count();
    // value -> character
    var toChar = map.Select((v, i) => new {Value = v, Index = i}).ToDictionary(i => i.Index, i => i.Value);
    var res = "";
    if (inp == 0) {
      return "" + toChar[0];
    }
    while (inp > 0) {
      // encoded least-to-most significant
      var val = (int)(inp % b);
      inp = inp / b;
      res += toChar[val];
    }
    return res;
}

long Decode(string encoded, IEnumerable<char> map) {
    var b = map.Count();
    // character -> value
    var toVal = map.Select((v, i) => new {Value = v, Index = i}).ToDictionary(i => i.Value, i => i.Index);
    long res = 0;
    // go in reverse to mirror encoding
    for (var i = encoded.Length - 1; i >= 0; i--) {
      var ch = encoded[i];
      var val = toVal[ch];
      res = (res * b) + val;
    }
    return res;
}

void Main()
{
    // a base-48 alphabet; omits l/L, 1, i/I, o/O, 0 (w and 5-9 are also unused in this list)
    var map = new char [] {
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K',
        'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
        'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
        'h', 'j', 'k', 'm', 'n', 'p', 'q', 'r', 's', 't',
        'u', 'v', 'x', 'y', 'z', '2', '3', '4',
    };
    var test = new long[] {0, 1, 9999999999, 4294965286, 2292964213, 1000000000};
    foreach (var t in test) {
        var encoded = Encode(t, map);
        var decoded = Decode(encoded, map);
        Console.WriteLine(string.Format("value: {0} encoded: {1}", t, encoded));
        if (t != decoded) {
            throw new Exception("failed for " + t);
        }
    }
}

The result is:

value: 0 encoded: A
value: 1 encoded: B
value: 9999999999 encoded: SrYsNt
value: 4294965286 encoded: ZNGEvT
value: 2292964213 encoded: rHd24J
value: 1000000000 encoded: TrNVzD
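If a constant-length token is preferred (the proof-of-concept deliberately does not pad), one option is to always emit 6 characters. The following is a sketch of that variation, not part of the original proof-of-concept: the class name `FixedWidthBase48` is invented, it reuses the same 48-character alphabet and the same least-significant-first order, and it fills unused high positions with `A` (the zero digit).

```csharp
using System;

static class FixedWidthBase48   // hypothetical fixed-width variant of the code above
{
    // Same 48-symbol alphabet as the proof-of-concept, in the same order.
    const string Map = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvxyz234";
    const int Width = 6;   // 48^6 is larger than 9,999,999,999

    public static string Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));

        var chars = new char[Width];
        for (int i = 0; i < Width; i++)
        {
            // Least significant digit first; exhausted values keep yielding 'A' (zero).
            chars[i] = Map[(int)(value % Map.Length)];
            value /= Map.Length;
        }
        if (value != 0) throw new ArgumentOutOfRangeException(nameof(value), "does not fit in 6 characters");
        return new string(chars);
    }

    public static long Decode(string text)
    {
        if (text == null || text.Length != Width)
            throw new ArgumentException("expected exactly 6 characters", nameof(text));

        long value = 0;
        for (int i = text.Length - 1; i >= 0; i--)   // most significant digit is last
        {
            int digit = Map.IndexOf(text[i]);
            if (digit < 0) throw new FormatException("unexpected character: " + text[i]);
            value = value * Map.Length + digit;
        }
        return value;
    }
}
```

Values that already needed 6 characters are unchanged (1000000000 is still `TrNVzD`), while 0 and 1 become `AAAAAA` and `BAAAAA`.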

The above considers the case where the numbers are "random and opaque"; that is, there is nothing that can be determined about the internals of the number. However, if there is a defined structure (e.g. 7th, 8th, and 9th bits are always zero and 2nd and 15th bits are always the same) then -- if and only if 4 or more bits of information can be eliminated from the input -- only 5 base-64 characters would be required. The added complexities and reliance upon the structure very likely outweigh any marginal gain.