Commons Math

2 Data Generation

2.1 Overview

The Commons Math o.a.c.m.random package includes utilities for

  • generating random numbers
  • generating random vectors
  • generating random strings
  • generating cryptographically secure sequences of random numbers or strings
  • generating random samples and permutations
  • analyzing distributions of values in an input file and generating values "like" the values in the file
  • generating data for grouped frequency distributions or histograms

These utilities rely on an underlying "source of randomness", which in most cases is a pseudo-random number generator (PRNG) that produces sequences of numbers that are uniformly distributed within their range. Commons Math depends on Commons Rng for the PRNG implementations.

A PRNG algorithm is often deterministic, i.e. it produces the same sequence when initialized with the same "seed". This property is important for some applications like Monte-Carlo simulations, but makes such a PRNG often unsuitable for cryptographic purposes.
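
The determinism described above can be checked with a few lines of code. The sketch below is only an illustration: it uses the JDK's java.util.Random as a stand-in for a Commons Rng generator, and the class and method names (SeedDeterminismDemo, draw) are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Shows that a PRNG seeded twice with the same value replays the same sequence. */
final class SeedDeterminismDemo {

    /** Draws {@code count} uniform doubles from a generator initialized with {@code seed}. */
    static double[] draw(long seed, int count) {
        Random rng = new Random(seed);
        double[] values = new double[count];
        for (int i = 0; i < count; i++) {
            values[i] = rng.nextDouble();
        }
        return values;
    }

    public static void main(String[] args) {
        double[] first = draw(17399225432L, 5);
        double[] second = draw(17399225432L, 5);
        // Both arrays hold identical values because the seed is identical.
        System.out.println("same seed, same sequence: " + Arrays.equals(first, second));
    }
}
```

Reproducibility of this kind makes simulation results repeatable, and it is also the reason an observer who learns the seed can predict every later value.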

2.2 Random Deviates

Random sequence of numbers from a probability distribution
There is no such thing as a single "random number." What can be generated are sequences of numbers that appear to be random. When using the built-in JDK function Math.random(), sequences of values generated follow the Uniform Distribution, which means that the values are evenly spread over the interval between 0 and 1, with no sub-interval having a greater probability of containing generated values than any other interval of the same length. The mathematical concept of a probability distribution basically amounts to asserting that different ranges in the set of possible values of a random variable have different probabilities of containing the value. Commons Math supports generating random sequences from each of the distributions defined in the o.a.c.m.distribution package. Please refer to the specific documentation for more details.
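As an illustration of how uniform deviates relate to deviates from another distribution, the following sketch applies the textbook inverse transform method to the exponential distribution. It relies on the JDK only and is not meant to describe how the Commons Math samplers are implemented; the class and method names are hypothetical.

```java
import java.util.Random;

/** Inverse transform sampling: turns uniform deviates into exponential deviates. */
final class InverseTransformDemo {

    /** Maps a uniform deviate u in [0, 1) to an exponential deviate with the given mean. */
    static double exponentialFromUniform(double u, double mean) {
        // Inverse of the exponential CDF F(x) = 1 - exp(-x / mean).
        return -mean * Math.log(1.0 - u);
    }

    /** Averages n exponential deviates; the result should be close to {@code mean}. */
    static double sampleMean(long seed, double mean, int n) {
        Random rng = new Random(seed);
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += exponentialFromUniform(rng.nextDouble(), mean);
        }
        return sum / n;
    }

    public static void main(String[] args) {
        System.out.println("sample mean (target 2.0): " + sampleMean(12345L, 2.0, 200_000));
    }
}
```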
Cryptographically secure random sequences
It is possible for a sequence of numbers to appear random, but nonetheless to be predictable based on the algorithm used to generate the sequence. When strong unpredictability is required in addition to randomness, a secure random number generator should be used to generate values (or strings), for example an instance of the JDK-provided SecureRandom generator. In general, such secure generators produce sequences based on a source of true randomness, and sequences started with the same seed will diverge. The RandomUtils class provides a method for wrapping a java.util.Random or java.security.SecureRandom instance in an object that implements the UniformRandomProvider interface:
UniformRandomProvider rg = RandomUtils.asUniformRandomProvider(new java.security.SecureRandom());

2.3 Random Vectors

Some algorithms require random vectors instead of random scalars. When the components of these vectors are uncorrelated, they may be generated one at a time and packed together in the vector. The UncorrelatedRandomVectorGenerator class simplifies this process by setting the mean and standard deviation of each component once and generating complete vectors. When the components are correlated, however, generating them is much more difficult. The CorrelatedRandomVectorGenerator class provides this service. In this case, the user must set up a complete covariance matrix instead of a simple vector of standard deviations. This matrix gathers both the variance and the correlation information of the probability law.

The main use for correlated random vector generation is for Monte-Carlo simulation of physical problems with several variables, for example to generate error vectors to be added to a nominal vector. A particularly common case is when the generated vector should be drawn from a Multivariate Normal Distribution.

Generating random vectors from a bivariate normal distribution
// Import common PRNG interface and factory class that instantiates the PRNG.
import org.apache.commons.rng.UniformRandomProvider;
import org.apache.commons.rng.simple.RandomSource;

// Create (and possibly seed) a PRNG (could use any of the CM-provided generators).
long seed = 17399225432L; // Fixed seed means same results every time 
UniformRandomProvider rg = RandomSource.create(RandomSource.MT, seed);

// Create a GaussianRandomGenerator using "rg" as its source of randomness.
GaussianRandomGenerator rawGenerator = new GaussianRandomGenerator(rg);

// Create a CorrelatedRandomVectorGenerator using "rawGenerator" for the components.
CorrelatedRandomVectorGenerator generator = 
    new CorrelatedRandomVectorGenerator(mean, covariance, 1.0e-12 * covariance.getNorm(), rawGenerator);

// Use the generator to generate correlated vectors.
double[] randomVector = generator.nextVector();
The mean argument is a double[] array holding the means of the random vector components. In the bivariate case, it must have length 2. The covariance argument is a RealMatrix, which has to be 2 x 2. The main diagonal elements are the variances of the vector components and the off-diagonal elements are the covariances. For example, if the means are 1 and 2 respectively, and the desired standard deviations are 3 and 4, respectively, then we need to use
double[] mean = {1, 2};
double[][] cov = {{9, c}, {c, 16}};
RealMatrix covariance = MatrixUtils.createRealMatrix(cov); 
where "c" is the desired covariance. If you are starting with a desired correlation, you need to translate this to a covariance by multiplying it by the product of the standard deviations. For example, if you want to generate data that will give Pearson's R of 0.5, you would use c = 3 * 4 * 0.5 = 6.
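
To make the relation between correlation, covariance and the generated vectors concrete, here is a minimal JDK-only sketch for the 2 x 2 case. It factors the covariance matrix with a plain Cholesky decomposition and applies the factor to independent standard normal deviates. This is a common textbook construction and should not be read as the implementation of CorrelatedRandomVectorGenerator; all names in the sketch are hypothetical.

```java
import java.util.Random;

/** Bivariate normal sampling from a covariance matrix (textbook sketch). */
final class BivariateNormalSketch {

    /** Converts a correlation into a covariance: rho * sigmaX * sigmaY. */
    static double covariance(double rho, double sigmaX, double sigmaY) {
        return rho * sigmaX * sigmaY;
    }

    /** Lower-triangular factor L of a 2 x 2 covariance matrix, with cov = L * L^T. */
    static double[][] cholesky2x2(double[][] cov) {
        double l00 = Math.sqrt(cov[0][0]);
        double l10 = cov[1][0] / l00;
        double l11 = Math.sqrt(cov[1][1] - l10 * l10);
        return new double[][] {{l00, 0.0}, {l10, l11}};
    }

    /** Draws one vector mean + L * z, where z holds two independent standard normals. */
    static double[] nextVector(double[] mean, double[][] l, Random rng) {
        double z0 = rng.nextGaussian();
        double z1 = rng.nextGaussian();
        return new double[] {
            mean[0] + l[0][0] * z0,
            mean[1] + l[1][0] * z0 + l[1][1] * z1
        };
    }

    /** Estimates Pearson's R from n generated vectors. */
    static double estimateCorrelation(double[] mean, double[][] cov, long seed, int n) {
        double[][] l = cholesky2x2(cov);
        Random rng = new Random(seed);
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double[] v = nextVector(mean, l, rng);
            sx += v[0];
            sy += v[1];
            sxx += v[0] * v[0];
            syy += v[1] * v[1];
            sxy += v[0] * v[1];
        }
        double mx = sx / n;
        double my = sy / n;
        double varX = sxx / n - mx * mx;
        double varY = syy / n - my * my;
        return (sxy / n - mx * my) / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double c = covariance(0.5, 3.0, 4.0); // 6.0, as in the text
        double[] mean = {1, 2};
        double[][] cov = {{9, c}, {c, 16}};
        System.out.println("estimated R: " + estimateCorrelation(mean, cov, 17399225432L, 100_000));
    }
}
```

With the values from the text, the estimated correlation of the generated pairs should come out close to 0.5.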

In addition to multivariate normal distributions, correlated vectors from multivariate uniform distributions can be generated by creating a UniformRandomGenerator in place of the GaussianRandomGenerator above. More generally, any NormalizedRandomGenerator may be used.

Low discrepancy sequences
There exist several quasi-random sequences with the property that for all values of N, the subsequence x1, ..., xN has low discrepancy, which results in equi-distributed samples. While their quasi-randomness makes them unsuitable for most applications (i.e. the sequence of values is completely deterministic), their unique properties give them an important advantage for quasi-Monte Carlo simulations.
Currently, the following low-discrepancy sequences are supported:
// Create a Sobol sequence generator for 2-dimensional vectors
RandomVectorGenerator generator = new SobolSequence(2);

// Use the generator to generate vectors
double[] randomVector = generator.nextVector();
The figure below illustrates the unique properties of low-discrepancy sequences when generating N samples in the interval [0, 1]. Roughly speaking, such sequences "fill" the respective space more evenly which leads to faster convergence in quasi-Monte Carlo simulations.
[Figure: Comparison of low-discrepancy sequences]
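
For readers who want to see how such a sequence can be built, the sketch below computes the first points of a 2-dimensional Halton sequence from the radical inverse (van der Corput) function in bases 2 and 3. It is the plain textbook construction using the JDK only; the generators shipped with Commons Math may differ in details such as scrambling or the starting index, and the names used here are hypothetical.

```java
/** Textbook Halton points built from the radical inverse function. */
final class HaltonSketch {

    /** Radical inverse of {@code index} in {@code base}: mirrors the digits around the radix point. */
    static double radicalInverse(int index, int base) {
        double result = 0.0;
        double fraction = 1.0 / base;
        int i = index;
        while (i > 0) {
            result += (i % base) * fraction;
            i /= base;
            fraction /= base;
        }
        return result;
    }

    /** Point number {@code index} (starting at 1) of the 2-D Halton sequence with bases 2 and 3. */
    static double[] haltonPoint(int index) {
        return new double[] {radicalInverse(index, 2), radicalInverse(index, 3)};
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 8; i++) {
            double[] p = haltonPoint(i);
            System.out.println(i + ": (" + p[0] + ", " + p[1] + ")");
        }
    }
}
```

In base 2 the first coordinates are 0.5, 0.25, 0.75, 0.125, and so on: each new point lands in the largest gap left by the previous ones, which is the "even filling" property mentioned above.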

2.4 Random Strings

The method nextHexString in RandomUtils.DataGenerator can be used to generate random strings of hexadecimal characters. It produces sequences of strings with good dispersion properties. A string can be generated in two different ways, depending on the value of the boolean argument passed to the method (see the Javadoc for more details).
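
The idea behind generating a hexadecimal string can be sketched with the JDK alone: draw one of the 16 hex digits per character. This is not the algorithm used by nextHexString (the Javadoc describes the two variants it offers); the class and the method randomHex are hypothetical names used only for illustration.

```java
import java.security.SecureRandom;
import java.util.Random;

/** Builds random hexadecimal strings one digit at a time. */
final class HexStringSketch {

    private static final char[] HEX_DIGITS = "0123456789abcdef".toCharArray();

    /** Returns a string of {@code length} hex characters drawn from {@code rng}. */
    static String randomHex(Random rng, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(HEX_DIGITS[rng.nextInt(16)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A seeded generator gives a reproducible string.
        System.out.println(randomHex(new Random(1L), 16));
        // SecureRandom extends Random, so it can be passed when unpredictability matters.
        System.out.println(randomHex(new SecureRandom(), 16));
    }
}
```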

2.5 Random Permutations, Combinations, Sampling

To select a random sample of objects in a collection, you can use the nextSample method provided by RandomUtils.DataGenerator. Specifically, if c is a java.util.Collection<T> containing at least k objects, and randomData is a RandomUtils.DataGenerator instance, randomData.nextSample(c, k) will return a List<T> instance of size k consisting of elements randomly selected from the collection. If c contains duplicate references, there may be duplicate references in the returned list; otherwise returned elements will be unique (i.e. the sampling is without replacement among the object references in the collection).

If n and k are integers with k < n, then randomData.nextPermutation(n, k) returns an int[] array of length k whose entries are selected randomly, without repetition, from the integers 0 through n-1 (inclusive).
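
One standard way to obtain such a selection without repetition is a partial Fisher-Yates shuffle, sketched below with the JDK only. The sketch illustrates the technique and is not a description of how nextPermutation is implemented; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Picks k distinct integers from 0..n-1 in random order (partial Fisher-Yates shuffle). */
final class PartialShuffleSketch {

    static int[] permutation(int n, int k, Random rng) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        int[] pool = new int[n];
        for (int i = 0; i < n; i++) {
            pool[i] = i;
        }
        // After step i, pool[0..i] holds i + 1 distinct, uniformly chosen values.
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(n - i); // uniform index in [i, n)
            int tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
        }
        return Arrays.copyOf(pool, k);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(permutation(10, 4, new Random(1L))));
    }
}
```

The same index array can be used to sample from a list: pick the elements at the returned positions to obtain k items without replacement.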

2.6 Generating data like an input file

Using the EmpiricalDistribution class, you can generate data based on the values in an input file:

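// Number of bins used to group the input values.
int binCount = 500;
EmpiricalDistribution empDist = new EmpiricalDistribution(binCount);
// Load the input data into empDist here (see the EmpiricalDistribution Javadoc) before sampling.
// Create a sampler backed by a Commons Rng generator and draw a value.
RealDistribution.Sampler sampler = empDist.createSampler(RandomSource.create(RandomSource.MT));
double value = sampler.nextDouble();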
The entire input file is read and a probability density function is estimated based on data from the file. The estimation method is essentially the Variable Kernel Method with Gaussian smoothing. The created sampler will return random values whose probability distribution matches the empirical distribution (i.e. if you generate a large number of such values, their distribution should "look like" the distribution of the values in the input file). The values are not stored in memory, so there is no limit to the size of the input file.
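
The following JDK-only sketch conveys the general idea of sampling from binned data. It is deliberately simplified: values are drawn uniformly within a bin instead of using Gaussian smoothing, and the data are passed as an in-memory array rather than read from a file. It should therefore be read as an illustration under those assumptions, not as the EmpiricalDistribution algorithm; all names are hypothetical.

```java
import java.util.Random;

/** Simplified binned sampler: picks a bin by its observed frequency, then a point inside it. */
final class BinnedSamplerSketch {

    private final double min;
    private final double binWidth;
    private final double[] cumulative; // cumulative[i] = fraction of data in bins 0..i

    BinnedSamplerSketch(double[] data, int binCount) {
        if (data.length == 0 || binCount <= 0) {
            throw new IllegalArgumentException("need data and a positive bin count");
        }
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double v : data) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        min = lo;
        binWidth = (hi - lo) / binCount;
        int[] counts = new int[binCount];
        for (double v : data) {
            // The maximum value falls into the last bin.
            int bin = Math.min((int) ((v - lo) / binWidth), binCount - 1);
            counts[bin]++;
        }
        cumulative = new double[binCount];
        int running = 0;
        for (int i = 0; i < binCount; i++) {
            running += counts[i];
            cumulative[i] = (double) running / data.length;
        }
    }

    /** Draws one value whose distribution follows the histogram of the input data. */
    double sample(Random rng) {
        double u = rng.nextDouble();
        int bin = 0;
        // Find the first bin whose cumulative mass exceeds u; empty bins are skipped.
        while (bin < cumulative.length - 1 && u >= cumulative[bin]) {
            bin++;
        }
        return min + (bin + rng.nextDouble()) * binWidth;
    }

    public static void main(String[] args) {
        double[] data = {0.0, 0.1, 0.2, 0.3, 9.7, 9.8, 9.9, 10.0};
        BinnedSamplerSketch sampler = new BinnedSamplerSketch(data, 10);
        Random rng = new Random(7L);
        for (int i = 0; i < 5; i++) {
            System.out.println(sampler.sample(rng));
        }
    }
}
```

With the bimodal input used in main, every generated value falls near one of the two clusters, which is the sense in which the output "looks like" the input.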