How can I determine the statistical randomness of a binary string?
In other words, how can I code my own test that returns a single value between 0 and 1.0 corresponding to the statistical randomness (0 being not random, 1.0 being random)?
The test would need to work on binary strings of any size.
When you do it with pen and paper, you might explore strings like this:
0 (arbitrary randomness, the only other choice is 1)
00 (not random, it's a repeat and matches the size)
01 (better, two different values)
010 (less random, palindrome)
011 (less random, more 1's, still acceptable)
0101 (less random, pattern)
0100 (better, less ones, but any other distribution causes patterns)
Size: 1, Possibilities: 2
0: 1.0 (random)
1: 1.0 (random)
Size: 2, P:4
01: 1.0 (random)
10: 1.0 (random)
Size: 3, P:8
000: ? non-random
001: 1.0 (random)
010: ? less random
011: 1.0 (random)
100: 1.0 (random)
101: ? less random
110: 1.0 (random)
111: ? non-random
And so on.
I suspect this comes down to breaking the string into all possible substrings and comparing their frequencies, but it seems like this sort of groundwork should already have been done in the early days of computer science.
This will give you an entropy count from 0 to 1.0:
You might want to look into Shannon entropy, which is a measure of entropy as applied to data and information. It is a close analogue of the formula for entropy in statistical thermodynamics.
More specifically, in your case, with a binary string, you can use the binary entropy function, which is the special case of Shannon entropy for a variable with only two possible values.
This is calculated by
H(p) = -p*log(p) - (1-p)*log(1-p)
(logarithms are base 2; take 0*log(0) to be 0)
where p is the proportion of 1's in the string (or of 0's; the graph is symmetrical, so your answer is the same either way).
Plotted against p, the function is a symmetric arch. If p is 0.5 (same amount of 1's as 0's), your entropy is at the maximum (1.0). If p is 0 or 1.0, the entropy is 0.
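The formula above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name binary_entropy is my own, and it assumes the input is a non-empty string made up only of the characters '0' and '1'.

```python
import math


def binary_entropy(bits: str) -> float:
    """Binary entropy H(p), base 2, where p is the proportion of 1's in bits.

    Assumes bits is a non-empty string containing only '0' and '1'.
    """
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of '0' and '1'")
    p = bits.count("1") / len(bits)
    # By convention 0*log(0) is taken to be 0, so all-0 and all-1 strings score 0.
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


if __name__ == "__main__":
    for s in ("00", "01", "010", "0100"):
        print(s, round(binary_entropy(s), 2))
```

The score depends only on how many 1's the string contains, so "0101" and "0011" both come out as 1.0.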
This appears to be just what you want, right?
The only exception is your size-1 cases: you rate them 1.0, but the formula gives 0, so you could handle them as a special case. However, 100% 0's and 100% 1's don't seem too entropic to me. But implement them as you'd like.
Also, this does not take into account any "ordering" of the bits, only the total count of each value. So repetition and palindromes won't change the score. You might want to add an extra heuristic for this.
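One possible heuristic for ordering, offered as an assumption rather than as part of the formula above, is to apply the same entropy idea to overlapping blocks of k bits instead of single bits. The sketch below uses a hypothetical function block_entropy with a hypothetical parameter k; it divides by k so the result stays between 0 and 1.0.

```python
import math
from collections import Counter


def block_entropy(bits: str, k: int = 2) -> float:
    """Entropy of the overlapping k-bit blocks in bits, divided by k.

    Illustrative heuristic only: a string that repeats a short pattern
    uses few distinct blocks and therefore scores low.
    """
    if k < 1 or len(bits) < k:
        raise ValueError("string must contain at least k bits")
    # Slide a window of width k across the string, one bit at a time.
    blocks = [bits[i:i + k] for i in range(len(bits) - k + 1)]
    total = len(blocks)
    counts = Counter(blocks)
    # Shannon entropy of the block distribution, in bits (at most k).
    h = sum((c / total) * math.log2(total / c) for c in counts.values())
    return h / k


if __name__ == "__main__":
    for s in ("0000", "0101", "01010101", "00101101"):
        print(s, round(block_entropy(s, 2), 2))
```

With k = 2, "0101" has a single-bit entropy of 1.0 but scores about 0.46 here, because only the blocks 01 and 10 ever occur. Note that short strings are biased downward: a string with only three blocks cannot reach 1.0 no matter how its bits are arranged, so this works better on longer inputs.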
Here are your other case examples:
00: -0*log(0) - (1-0)*log(1-0) = 0.0
01: -0.5*log(0.5) - (1-0.5)*log(1-0.5) = 1.0
010: -(1/3)*log(1/3) - (2/3)*log(2/3) = 0.92
0100: -0.25*log(0.25) - (1-0.25)*log(1-0.25) = 0.81