Introduction to random number generation in c++

This article is intended as a concise introduction into using the (pseudo)-random number generation facilities present in c++11. What I write about certainly isn’t new, but I need an easy-reading place I can point people to. That does not mean there are no problems with the standard services, there are some are fundamental, and are mentioned in the articles I cite, others are programmatic and could use some, well, circumventing, that comes with all the c++ intricacy delight, but I will leave that out waiting for a future second installment on the subject.

Structure of the problem


So you want a random number from a computer, right? You think it really is possible? After all, if your computer, a deterministic machine designed to follow your progam to the letter, started doing things at random, you would most likely declare it broken and either try to repair it or scrap it. No way. And yet, you play games, you do your Monte-Carlo simulations and all that claims to be using some sort of randomness. So where is it to be found in your computer? The answer has three parts.

Entropy


One source of randomness is not the computer, it is you or the surrounding world. You turn your computer on or run your application at a certain date and time. There is variation in the way you move your mouse or in the way you hit the keys of your keyboard. There are network packets that come at unpredicted times and with some variation in their contents. All that can be registered, operating systems do it to a varying degree. There are, however, several problems with using this data directly. One is that usually there will be too little data. A Monte-Carlo simulation for an option pricing exercise, where an entire price path over time is needed to produce just one figure, and that has to be repeated time and again, calls for thousands if not millions of floating point random numbers. The second is that the spectral properties of these numbers collected from the outside world will be unknown, in other words, it will not be possible to know upfront what the probability distribution that they are governed with is, nor whether subsequent numbers come from independent realizations.

Chaos


There exist functions, mathematical deterministic functions with all the precision that comes with their definition, that behave in a chaotic way. A chaotic function means that a small difference in initial conditions will result in wildly different behaviour of its evolution. Evolution of function f usually means iteration, in the simplest form we start with x, then have f(x), then we use that as the new starting point and have f(f(x)) and so on. Obviously, if this evolution runs in a finite set, then it is periodic, so the values are only useful until they wind around. The simplest such function is linear congruential: f(x)=ax+b mod n for carefully chosen a,b and n. How does the designation `chaotic’ relate to `random’? The sequence is fully deterministic, but if you try to forget about it for a while, and apply your statistical toolbox tests, you will find out that most will tell you these numbers come from realisation of independent random variables with uniform distribution over the numbers 1,2,…,n-1. So, such a sequence is just as useful as a truly random sequence (unless it isn’t, see for instance this paper by George Marsaglia, your statistical toolbox might not capture certain things, so for serious work you need to be careful).

More involved functions need more data than the last produced number to generate the next one. Symbolically, we would denote it by saying that f: S → S × V, where S is the set of possible states, V is the set of possible values, and of this compud outcome the value is given to the client, whereas the state needs to be kept to serve as argument for the subsequent calculation.

The caveat with chaotic sequences, however, is that being deterministic, they always produce the same sequence if they are started with the same initial value called seed (sometimes this is considered a feature though, as it allows for a full replay of an experiment). But hey, we have entropy, which, as we know already, is not good for massive generation, but is perfect for producing inital seeds.

Yet another problem with such sequences is when you need random values in many places of your code. Should you use one central instance of a chaotic sequence for all of them, or should you use separate instances? Usually, with ordinary generators not specifically designed for the purpose, it is better to have one instance, but one should be careful to avoid the naive implementation of `one instance’ that usually goes along the lines of global things, functions or objects.

Probability distributions


In the previous section we have established a way of generating integer numbers in the range 1,2,…n-1, for some n, that look random enough. What do we do if we want a sample from a different distribution, corresponding, say, to the result of rolling a die, that is, evenly distributed in the set 1,2,…,6? The naive way is to produce a chaotic number x, and calculate 1+x%6. That is not a good idea, though. First, your die is most likely loaded, by not much, but still, it all depends on whether n-1 is divisible by 6. Second, and more serious, you are using the least significant digits of your chaotic number, and the least significant digits are, well, less random than the most significant digits, on prominent examples of chaotic sequences. So it would be nice to have a decent library of probability distributions that would produce suitable values by transforming values of a chaotic sequence and avoiding such traps.

#include<random>


There are two sets of library solutions to the above constructs. One is the old rand together with srand functions available in cstdlib. The only advice about using them is: don’t. There’s no standard saying how they are supposed to be implemented, there is no probability distribution layer, and there have been implementations that were seriously flawed, see for instance here. Hopefully, with c++11 there comes a revamped random number generation library called random.

The backbone of the library are the chaotic sequence generators, called engines. They come as class templates and correspond to various algorithms known to produce decent chaotic sequences. The template parameters are in principle the integer type they are supposed to operate on, and the numeric parameters to the algorithms, like the a,b,n mentioned above where I discussed the linear congruential algorithm. However, good parameters are not easy to come by, so the library provides standard typedefed choices that can be used as type names. The most standard nowadays seems to be the Mersenne Twister, either the 32 bit version mt19937 or the 64 bit version mt19937_64. An older standard is the linear congruential algorithm minstd_rand. More can be found on cppreference.com. The constructors of all these classes can be given an integer, this allows for controlled and repeatable runs (this works even if the internal state is much bigger than just one number, the constructor takes care of that):

std::mt19937 seq1(17);
std::cout << seq1();

The entropy source is implemented as a class called random_device. It is supposed to hide the low-level details of getting entropy data from the operating system. It may get its figures from something like /dev/random if it is available, or by sampling the system timer if nothing more is present, we are supposed not to care about the details (again, unless we do, be careful if you’re into cryptography). One creates a default-constructed variable of this type and then one feeds it into the constructor of the engine:

std::random_device rd;
std::mt19937 seq2(rd());
std::cout << seq2();

You probably noticed the expression `supposed to’ above. In fact, this service is misspecified by the standard. First, it only produces a single 32 bit number, and this may be subpar to initialize a large state like the one in the Mersenne Twister (see for instance this post by Melissa E. O’Neill). Second, it is not required to be entropic. And while it works as expected on linux, on Windows with mingw TDM-gcc as of version 4.81, that most of my students now use, it is completely deterministic, it reproduces the same sequence everytime your application is started. So, one is better off using system time directly, by something like

auto s = std::chrono::high_resolution_clock::now().time_since_epoch().count();
std::mt19937 seq3(s);
std::cout << seq3();

One needs to remember though, that for serious work one would have to address the too-short-seed issue, by following the advice of another post of prof. O’Neill, or by switching to a different random number generation library suite, like boost or PCG. (I omit the discussion of the attempt of the standard at solving this issue, that is the std::seed_seq class, the first cited post by prof. O’Neill suggests it is so flawed as to be unusable altogether).

The above gives us a chaotic sequence with an entropic or deterministic start, but what we really need in real life are values from various probability distributions. In the standard library they come as a collection of classes and template classes, the template parameter determines the number type that is generated. They have constructors in which the parameters suitable for each distrubution kind are given, and an operator operator() that is given an engine and returns a random value from the said distribution. For instance, values that represent outcomes of rolls of a fair die can be obtained like that:

auto s = std::chrono::high_resolution_clock::now().time_since_epoch().count();
std::mt19937 seq3(s);
std::uniform_int_distribution<int> ui(1,6);
std::cout << ui(seq3) << "," << ui(seq3);

Functor calls ui(seq3) produce the numbers with the help of numbers that are obtained from seq3. Other distributions can be found at cppreference.com.

Avoiding global implementations


Global variables create dependencies in your code, that have various negative side-effects, difficult testability among them. So, how to provide a centralized instance of a chaotic sequence without a global variable? The standard makes it difficult to look at it from this point of view. The reason is that enven though the call seq3() requires just operator() to work, passing an instance of std::function<int(void)> to a distribution object fails to compile, as engines are supposed to provide several other services. Similarly, distributions decoupled from engines also cannot be put through std::function, since their operator() is a template that adapts to the engine type. A full parametrization at all the levels of abstraction thus calls for a solution that is either templated or using services similar to std::function but tailored to the specific types of objects. Unfortunately, these tailored services are not part of the standard, one has to write them oneself if needed, maybe with help of libraries such as Boost.TypeErasure.

But there is an easier way of looking at it. If in your application you agree to forget about the full structure of the solution, entropy, engines and distributions, and accept being given black-boxes as sources of pseudo-random numbers, then all you need is just a functor. Let’s assume your application is modelled by class objects, then the solution might look like that:

class Application
  {
  public:
    typedef std::function<int(void)> RNG;
    Application(RNG rng)
      : rng_(rng)
      {}
    auto Run(void) -> void
      {
      // application runs here
      std::cout << rng_();
      }
  private:
    RNG rng_;
  };

auto main(void) -> int
  {
  auto s = std::chrono::high_resolution_clock::now().time_since_epoch().count();
  std::mt19937 seq3(s);
  std::uniform_int_distribution<int> ui(1,6);
  std::function<int(void)> g = [&ui,&seq3](void){return ui(seq3);};
  Application app(g);
  app.Run();
  }

The expression [&ui,&seq3](void){return ui(seq3);} is a lambda, that is an ad-hoc function object. It has a function parameter, in this case it is void, and a return type deduced from the returned expression, here it is int. It also depends by reference on two other variables, ui and seq3, and that means that copies of that function object will depend by reference on the same variables ui and seq3, so there will be only one instance of the engine object used throughout the program. And you are free to replace those random numbers by something different, for instance distributed in another way, or, for testing purposes, by a deterministic sequence:

auto test(void) -> void
  {
  int i=5;
  std::function<int(void)> g = [&i](void){return 1+(++i%=6);};
  Application app(g);
  app.Run();
  }

This code feeds the application with values 1,2,3,4,5,6 in a cyclic way, where random are expected, and this test happens without changing a single semicolon in the code of your application.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: