There’s Guessing and There’s Guessing

Colorful assortment of six-sided dice

This post was inspired by a blog entry titled Playing with Randomness by Johannes Weber.

What does random actually mean? That’s a philosophical question, and the definition probably depends on the context.

When it comes to random numbers, I like to define it as a contrast to biased: an unknown number within a known range that cannot be predicted more accurately than an uneducated guess.

For example, if you have a six-sided die that is fair, you cannot predict the outcome with more accuracy than just picking a number because you like it. However, if you happen to know the die lands on 5 more often than the other numbers, you can make an educated guess that it will land on five, and you will be right more often than if you just picked a number. In that case the result of rolling the die has a bias towards landing on a five and thus is not random.

With computers generating a number via a software algorithm, the result is never actually random. The algorithm determines the result, and if you know what the seed was and where you are in the sequence, you can predict what the next number will be.
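
You can see this determinism with PHP’s built-in Mersenne Twister functions. This is just a minimal illustration, but any seeded algorithmic generator behaves the same way:

function predictable_sequence(int $seed): array {
    // Anyone who knows the seed can reproduce the whole "random" sequence.
    mt_srand($seed);
    return [mt_rand(), mt_rand(), mt_rand()];
}

var_dump(predictable_sequence(42) === predictable_sequence(42)); // bool(true)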

Going back to the example of the die: if you do not know it has a bias towards landing on five, you may figure it out rather quickly because there will be a pattern that you can recognize. If it is a fair die with truly random results, the odds are the same for any given roll and there is no relationship between past rolls and the next roll.

When software generates a random number, that is not the case. However, what software can do is attempt to mimic the same kind of distribution of results that would occur if it were truly random.

In essence, what the software is doing is attempting to obfuscate the algorithm used so that a pattern cannot be found that allows an educated prediction to be more accurate than an uneducated guess. It still is not actually random even when it achieves that goal, which is why an algorithm-based random number generator is called a pseudo-Random Number Generator (pRNG). Security by Obscurity, eh? 😉

There are several methods that are used to obfuscate the pattern that results from an algorithm. One method used by the Linux kernel is to mix actual random data in with the generated random data. The timing of interrupts from the keyboard and mouse is often used for this. The haveged daemon gathers an incredible amount of entropy from hardware timing variances that is mixed with the generated data.
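
As a toy illustration of the mixing idea (this is not how the kernel or haveged actually work, and timer jitter gathered this way is a weak entropy source), you could XOR generated bytes against bytes derived from unpredictable event timing:

function mix_with_timing_jitter(string $generated): string {
    $jitter = '';
    $len = strlen($generated);
    for ($i = 0; $i < $len; $i++) {
        $before = hrtime(true);        // nanosecond-resolution clock
        usleep(1);                     // the scheduler adds unpredictable delay
        $delta = hrtime(true) - $before;
        $jitter .= chr($delta & 0xff); // keep only the noisiest low byte
    }
    return $generated ^ $jitter;       // byte-wise XOR mix
}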

Still, it is obfuscation.

When reading the blog post linked at the top, which investigates testing several different pRNGs, I started thinking about the obfuscation aspect, and what I could do to potentially obfuscate a source of data that is not very random so that it would appear to be more random than it actually is. This is what I came up with, implemented in PHP:

function filter_results($rnd) {
    static $salt = null;
    static $nonce = null;
    static $counter = 0;
    // First call: start with an empty salt.
    if(is_null($salt)) {
        $salt = '';
    }
    // First call: create a random 16-byte nonce.
    // Every call after that: increment the nonce.
    if(is_null($nonce)) {
        $nonce = random_bytes(16);
    } else {
        sodium_increment($nonce);
    }
    // Hash the salt and nonce, then XOR the hash against the input.
    $rawhash = hash('sha256', $salt . $nonce, true);
    $res = $rnd ^ $rawhash;
    $counter++;
    // Every 32nd call, rotate the salt.
    if($counter === 32) {
        $str = $res . $rnd . $nonce . $salt;
        $hash = hash('sha384', $str, true);
        $salt = base64_encode($hash);
        $counter = 0;
    }
    return $res;
}
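
Feeding it 32 bytes from any source looks like this:

$obfuscated = filter_results(random_bytes(32));
var_dump(strlen($obfuscated)); // int(32)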

That function takes 32 bytes of pseudo-Random data (the $rnd argument) and modifies it, trying to reduce or hide any patterns left by the algorithm that generated it.

The very first time the function is called, it initializes a 16-byte nonce using the PHP random_bytes() function, which allegedly is suitable for cryptographic purposes. On subsequent calls, that nonce is incremented.
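
For what it is worth, sodium_increment() treats the string as a little-endian counter, so the nonce will not repeat until all 16 bytes wrap:

$n = "\xff\x00";
sodium_increment($n);
var_dump(bin2hex($n)); // string(4) "0001" - little-endian +1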

When the function is first called, it initializes itself with an empty salt. It also keeps a counter of how many times it has been called. Every 32nd call, it creates a new salt by running the SHA-384 algorithm over a string built from the current obfuscated result, the original input, the current nonce, and the old salt. I am hoping that using a salt that changes often will obfuscate any patterns that may exist in the SHA-256 algorithm that is used to help obfuscate the pseudo-Random data.

With respect to obfuscating the pseudo-Random data, I could have hashed the pseudo-Random data together with the nonce and the salt, but I decided instead to create a hash from just the salt and nonce and then use that to XOR against the pseudo-Random data. I did this because if SHA-256 itself has biases, those biases would be detectable as a pattern if I just output the result of a SHA-256 hash, even if the input data was truly random. Just taking a hash could potentially reduce how random the result is!

However, if instead I take the original $rnd and XOR it against the hash, I suspect (I’m not an expert) that the result will never be less random than the input. Please, do not use my function to try and make something for cryptography. I am fascinated by cryptography but I have never studied it at the academic level.

But think about it. If SHA-256 were so biased that it always resulted in 32 bytes of 0s, it would be extremely predictable, so using its output as random data would be worthless. But XOR that same all-zero output against the input, and the result is simply the input. That is not a loss of any randomness.
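
A quick sanity check of that degenerate case, with an all-zero string standing in for the hypothetically broken hash:

$input  = random_bytes(32);
$zeroes = str_repeat("\x00", 32);
var_dump(($input ^ $zeroes) === $input); // bool(true) - nothing lost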

The only way I can think of that a biased hash function being XORed against the random data could reduce the randomness would be if it was a hash of that same random data, but it is not. The input is used to generate the salt every 32nd time the function is called, but that salt is then only used to obfuscate input that comes after it was generated. The data being obfuscated is in no way a part of the creation of the hash used to obfuscate it with an XOR.

So how well does it work?

Test One – Intentionally Crappy Random Data

I wrote the following function to create a very very very bad RNG:

function shitty_random_32_bytes() {
    $rand = rand(0,1);
    if($rand === 0) {
        $n = '00';
    } else {
        $n = 'ff';
    }
    $return = '';
    for($i=0; $i<16; $i++) {
        $return = $return . $n . $n;
    }
    return hex2bin($return);
}

Sometimes that function produces 32 bytes that are all 0x00 and sometimes it produces 32 bytes that are all 0xff.

I called that function in a loop 16,777,216 times. After each call, the very bad data was written to one file, then sent through the filter function and the obfuscated result written to a second file. Both files were 537.9 MB according to the Linux GUI file manager (MATE desktop).
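
The harness was essentially a loop like this sketch (the file names here are made up, but the shape is the same):

$raw      = fopen('unfiltered.bin', 'wb');
$filtered = fopen('filtered.bin', 'wb');
for ($i = 0; $i < 16777216; $i++) {
    $bytes = shitty_random_32_bytes();
    fwrite($raw, $bytes);                      // the bad data, as-is
    fwrite($filtered, filter_results($bytes)); // the same data, obfuscated
}
fclose($raw);
fclose($filtered);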

Using the rngtest utility to run the FIPS 140-2 tests with 10,000 iterations, the unfiltered data had 10,000 failures and 0 successes. However, the same data that passed through the filter function before being written to file had only 9 failures and 9991 successes.

I also ran both files through DieHarder. Without the filtering, there were 88 failures, 20 tests reported as weak, and only 6 passes. With the filter, it failed only 8 tests and reported 13 as weak. It passed 93 tests.
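
For reference, the invocations look something like this, assuming those made-up file names:

rngtest -c 10000 < unfiltered.bin
dieharder -a -g 201 -f unfiltered.bin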

Dramatic improvement. However, the improvement does NOT mean the data is more random, just that it hides its lack of randomness better.

Test Two – Less Crappy but still Extremely Crappy Random Data

For the second test, I wrote this very very bad RNG that is only a little bit better:

function crappy_random_32_bytes(int $whatever=0)
{
    if ($whatever === 0) {
        $whatever = random_int(0,52);
    }
    // generates 32 bytes of data one byte at a time
    $return = '';
    static $seed = null;
    if(is_null($seed)) {
        $seed = ($whatever % 53);
    } else {
        $seed = $seed + 7;
    }
    for($i=0; $i<32; $i++) {
        $seed = $seed + 67;
        if($seed >= 256) {
            $seed = $seed - 256;
        }
        $byte = dechex($seed);
        $byte = str_pad($byte, 2, '0', STR_PAD_LEFT);
        $return .= $byte;
    }
    return hex2bin($return);
}

Note that it only uses the data from random_int the first time it is called. Calling random_int every time is just sloppy code on my part because it is only needed once. Anyway…

I did the same testing. The results?

With rngtest, the random data that didn’t pass through the filter had 9995 failures and 5 successes. Better, but still crap. With filtering, it had 9994 successes and only 4 failures.

With DieHarder, without filtering it had 114 failures, 0 weak, and 0 passes. It actually did worse than the first test. After filtering though, only 2 tests failed, 4 were weak, and 108 passed. The filtered version of this test did better.

Test Three – A Proper pRNG

For the third round, I tested what is allegedly a cryptographically suitable RNG – the random_bytes() function built into PHP 7+.

Same testing procedure.

With the unfiltered data, rngtest had only 9 failures and 9991 successes. The same data filtered: 12 failures and 9988 successes.

With DieHarder, the unfiltered test only had 4 failures, 4 identified as weak, and 106 passed. With filtering, only 3 failed tests, 3 as weak, and 108 identified as passed.

Conclusion

Using a filter that uses a SHA-256 hash of an incremented deterministic nonce and a somewhat frequently changing salt does a very good job of obfuscating how bad a really bad RNG is, at least from test suites. With a proper pRNG, the test differences are probably within normal variance, so no claims can be made.

Something like that filter, but written in C, might be useful for obfuscating non-random data harvested from listening to TLS traffic on the network before using it as an additional source to mix in with other sources of random data, but I doubt it would be wise to use it on its own.
