Apache Commons Statistics User Guide

Overview

Apache Commons Statistics provides utilities for statistical applications. The code originated in the commons-math project but was pulled out into a separate project for better maintainability and has since undergone numerous improvements.

Commons Statistics is divided into a number of submodules:

Example Modules

In addition to the modules above, the Commons Statistics source distribution contains example code demonstrating library functionality and/or providing useful development utilities. These modules are not part of the public API of the library and no guarantees are made concerning backwards compatibility. The example module parent page contains a listing of available modules.


Probability Distributions

Overview

The commons-statistics-distribution module provides a framework and implementations for some commonly used probability distributions. Continuous univariate distributions are represented by implementations of the ContinuousDistribution interface. Discrete distributions implement DiscreteDistribution (values must be mapped to integers).

API

The distribution framework provides the means to compute probability density, probability mass and cumulative probability functions for several well-known discrete (integer-valued) and continuous probability distributions. The API also allows for the computation of inverse cumulative probabilities and sampling from distributions.

For an instance f of a distribution F, and a domain value, x, f.cumulativeProbability(x) computes P(X <= x) where X is a random variable distributed as F. The complement of the cumulative probability, f.survivalProbability(x), computes P(X > x). The survival probability is mathematically equal to 1 - P(X <= x), but computing it directly avoids the cancellation error that occurs as the cumulative probability approaches 1. This cancellation error may cause a total loss of accuracy when P(X <= x) ~ 1 (see complementary probabilities).

TDistribution t = TDistribution.of(29);
double lowerTail = t.cumulativeProbability(-2.656);   // P(T(29) <= -2.656)
double upperTail = t.survivalProbability(2.75);       // P(T(29) > 2.75)

For discrete F, the probability mass function is given by f.probability(x). For continuous F, the probability density function is given by f.density(x). Distributions also implement f.probability(x1, x2) for computing P(x1 < X <= x2).

PoissonDistribution pd = PoissonDistribution.of(1.23);
double p1 = pd.probability(5);
double p2 = pd.probability(5, 5);
double p3 = pd.probability(4, 5);
// p2 == 0
// p1 == p3
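The half-open range convention can be checked without the library. The following is a minimal sketch, assuming a hand-written Poisson mass function; the class PoissonSketch and its methods are hypothetical names used for illustration and are not part of Commons Statistics.

```java
final class PoissonSketch {
    /** Probability mass P(X = k) for a Poisson distribution with the given mean. */
    static double pmf(int k, double mean) {
        if (k < 0) {
            return 0; // below the support
        }
        // exp(-mean) * mean^k / k!, accumulated in log space
        double logP = -mean + k * Math.log(mean);
        for (int i = 2; i <= k; i++) {
            logP -= Math.log(i);
        }
        return Math.exp(logP);
    }

    /** P(x1 < X <= x2): the lower bound is excluded, the upper bound is included. */
    static double probability(int x1, int x2, double mean) {
        double sum = 0;
        for (int k = x1 + 1; k <= x2; k++) {
            sum += pmf(k, mean);
        }
        return sum;
    }
}
```

With this convention probability(5, 5, mean) is an empty sum and probability(4, 5, mean) contains the single term pmf(5, mean), matching p2 and p3 above.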

Inverse distribution functions can be computed using the inverseCumulativeProbability and inverseSurvivalProbability methods. For continuous f and p a probability, f.inverseCumulativeProbability(p) returns

\[ x = \begin{cases} \inf \{ x \in \mathbb R : P(X \le x) \ge p\} & \text{for } 0 \lt p \le 1 \\ \inf \{ x \in \mathbb R : P(X \le x) \gt 0 \} & \text{for } p = 0 \end{cases} \]

where X is distributed as F.
Likewise f.inverseSurvivalProbability(p) returns

\[ x = \begin{cases} \inf \{ x \in \mathbb R : P(X \gt x) \le p\} & \text{for } 0 \le p \lt 1 \\ \inf \{ x \in \mathbb R : P(X \gt x) \lt 1 \} & \text{for } p = 1 \end{cases} \]

NormalDistribution n = NormalDistribution.of(0, 1);
double x1 = n.inverseCumulativeProbability(1e-300);
double x2 = n.inverseSurvivalProbability(1e-300);
// x1 == -x2 ~ -37.0471

For discrete F, the definition is the same, with \( \mathbb Z \) (the integers) in place of \( \mathbb R \) (but note that, in the discrete case, the ≥ in the definition can make a difference when p is an attained value of the distribution).
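The role of the ≥ for an attained value can be seen with a small search over a discrete distribution. This is a sketch only, assuming a three-point distribution (the number of heads in two fair coin tosses) whose probabilities are exactly representable; DiscreteInverseSketch is a hypothetical class, and the choice of IllegalArgumentException for an invalid p is an assumption.

```java
final class DiscreteInverseSketch {
    /** P(X = x) for x = 0, 1, 2: the number of heads in two fair coin tosses. */
    private static final double[] PMF = {0.25, 0.5, 0.25};

    /** P(X <= x); returns 0 below the support and 1 above it. */
    static double cumulativeProbability(int x) {
        double sum = 0;
        for (int k = 0; k <= x && k < PMF.length; k++) {
            sum += PMF[k];
        }
        return sum;
    }

    /** Smallest integer x with P(X <= x) >= p, or with P(X <= x) > 0 when p = 0. */
    static int inverseCumulativeProbability(double p) {
        if (!(p >= 0 && p <= 1)) {
            throw new IllegalArgumentException("p out of [0, 1]: " + p);
        }
        for (int x = 0; x < PMF.length - 1; x++) {
            double c = cumulativeProbability(x);
            if (p == 0 ? c > 0 : c >= p) {
                return x;
            }
        }
        return PMF.length - 1; // upper bound of the support
    }
}
```

Here P(X <= 0) = 0.25 is an attained value, so inverseCumulativeProbability(0.25) returns 0; a strict inequality in the definition would return 1 instead.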

All distributions provide accessors for the parameters used to create the distribution, and for the mean and variance. The return value when the mean or variance is undefined is noted in the class javadoc.

ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42);
double df = chi2.getDegreesOfFreedom();    // 42
double mean = chi2.getMean();              // 42
double variance = chi2.getVariance();      // 84

CauchyDistribution cauchy = CauchyDistribution.of(1.23, 4.56);
double location = cauchy.getLocation();    // 1.23
double scale = cauchy.getScale();          // 4.56
double undefined1 = cauchy.getMean();      // NaN
double undefined2 = cauchy.getVariance();  // NaN

The supported domain of the distribution is provided by the getSupportLowerBound and getSupportUpperBound methods.

BinomialDistribution b = BinomialDistribution.of(13, 0.15);
int lower = b.getSupportLowerBound();  // 0
int upper = b.getSupportUpperBound();  // 13

All distributions implement a createSampler(UniformRandomProvider rng) method to support random sampling from the distribution, where UniformRandomProvider is an interface defined in Commons RNG. The sampler is a functional interface whose functional method is sample(), suitable for generation of double or int samples. Default samples() methods are provided to create a DoubleStream or IntStream.

// From Commons RNG Simple
UniformRandomProvider rng = RandomSource.KISS.create(123L);

NormalDistribution n = NormalDistribution.of(0, 1);
double x = n.createSampler(rng).sample();

// Generate a number of samples
GeometricDistribution g = GeometricDistribution.of(0.75);
int[] k = g.createSampler(rng).samples(100).toArray();
// k.length == 100
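The shape of such a sampler can be sketched in plain Java. The example below is an illustration under stated assumptions, not the library code: IntSampler is a hypothetical stand-in for the library's sampler interface, java.util.Random stands in for UniformRandomProvider, and the geometric samples (the number of failures before the first success) are generated by simple inversion, which may differ from the algorithm the library uses.

```java
import java.util.Random;
import java.util.stream.IntStream;

final class SamplerSketch {
    /** Hypothetical sampler: one functional method plus a default stream method. */
    @FunctionalInterface
    interface IntSampler {
        int sample();

        /** A sequential stream of n samples. */
        default IntStream samples(long n) {
            return IntStream.generate(this::sample).limit(n);
        }
    }

    /** Geometric sampler for success probability p, using inversion. */
    static IntSampler geometric(double p, Random rng) {
        if (!(p > 0 && p <= 1)) {
            throw new IllegalArgumentException("p out of (0, 1]: " + p);
        }
        if (p == 1) {
            return () -> 0; // success is certain on the first trial
        }
        final double logQ = Math.log1p(-p); // log(1 - p), negative
        // 1 - nextDouble() lies in (0, 1], so the logarithm is finite
        return () -> (int) Math.floor(Math.log(1 - rng.nextDouble()) / logQ);
    }
}
```

Because the lambda captures the generator, the sampler carries the generator's mutable state, as with a sampler created from a UniformRandomProvider.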

Note that even when distributions are immutable, the sampler is not immutable as it depends on the instance of the mutable UniformRandomProvider. Generation of many samples in a multi-threaded application should use a separate instance of UniformRandomProvider per thread. Any synchronization should be avoided for best performance. By default the streams returned from the samples() methods are sequential.
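One way to follow this advice is to give each task its own generator so that no state is shared and no locking is needed. The sketch below assumes java.util.Random as a stand-in for UniformRandomProvider and uses consecutive seeds purely for brevity; a real application would want a more careful way of creating independent generators. PerThreadSampling is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerThreadSampling {
    /** Runs the given number of tasks, each drawing Gaussian samples from its own generator. */
    static double[][] sample(int tasks, int samplesPerTask, long seed) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            List<Future<double[]>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final long taskSeed = seed + i; // assumption: distinct seed per task
                futures.add(pool.submit(() -> {
                    Random rng = new Random(taskSeed); // never shared between tasks
                    double[] x = new double[samplesPerTask];
                    for (int j = 0; j < x.length; j++) {
                        x[j] = rng.nextGaussian();
                    }
                    return x;
                }));
            }
            double[][] result = new double[tasks][];
            for (int i = 0; i < tasks; i++) {
                result[i] = futures.get(i).get();
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Since each task owns its generator, the output of a task depends only on its seed and not on thread scheduling, so repeated runs with the same seed give the same samples.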

Implementation Details

Instances are constructed using factory methods, typically a static method in the distribution class named of. This allows the returned instance to be specialised to the distribution parameters.

Exceptions will be raised by the factory method when constructing the distribution using invalid parameters. See the class javadoc for exception conditions.

Unless otherwise noted, distribution instances are immutable. This allows sharing an instance between threads for computations.

Exceptions will not be raised by distributions for an invalid x argument to probability functions. Typically the cumulative probability functions will return 0 or 1 for an out-of-domain argument, depending on which side of the domain bound the argument falls, and the density or probability mass functions return 0. Return values for x arguments when the result is undefined should be documented in the class javadoc. For example the beta distribution is undefined for x = 0, alpha < 1 or x = 1, beta < 1. Note: This out-of-domain behaviour may be different from distributions in the org.apache.commons.math3.distribution package. Users upgrading from commons-math should check the appropriate class javadoc.

An exception will be raised by distributions for an invalid p argument to inverse probability functions. The argument must be in the range [0, 1].
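The conventions in this section can be summarised with a small stand-alone class. This is a sketch of the contract only, assuming an exponential distribution written by hand: ExponentialSketch is a hypothetical class, and the use of IllegalArgumentException is an assumption, since the text above states only that an exception is raised.

```java
final class ExponentialSketch {
    private final double rate;

    private ExponentialSketch(double rate) {
        this.rate = rate;
    }

    /** Factory method: invalid parameters are rejected at construction. */
    static ExponentialSketch of(double rate) {
        if (!(rate > 0)) {
            throw new IllegalArgumentException("rate must be positive: " + rate);
        }
        return new ExponentialSketch(rate);
    }

    /** Density; returns 0 below the support instead of raising an exception. */
    double density(double x) {
        return x < 0 ? 0 : rate * Math.exp(-rate * x);
    }

    /** P(X <= x); returns 0 below the support and approaches 1 above it. */
    double cumulativeProbability(double x) {
        return x <= 0 ? 0 : -Math.expm1(-rate * x);
    }

    /** Inverse of the cumulative probability; p outside [0, 1] is an error. */
    double inverseCumulativeProbability(double p) {
        if (!(p >= 0 && p <= 1)) {
            throw new IllegalArgumentException("p out of [0, 1]: " + p);
        }
        return -Math.log1p(-p) / rate;
    }
}
```

The instance holds only a final field, so it is immutable and can be shared between threads for computations.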

Complementary Probabilities

The distributions provide the cumulative probability p and its complement, the survival probability, q = 1 - p. When the probability q is small, using the cumulative probability to compute q can result in a dramatic loss of accuracy. This is because floating-point numbers are distributed approximately log-uniformly: there are far more representable numbers as the probability value approaches zero than when it approaches one.

The difference is illustrated with the result of computing the upper tail of a probability distribution.

ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42);
double q1 = 1 - chi2.cumulativeProbability(168);
double q2 = chi2.survivalProbability(168);
// q1 == 0
// q2 != 0

In this case the value 1 - p has only a single bit of information as x approaches 168. For example the value 1 - p(x=167) is \( 2^{-53} \) (or approximately 1.11e-16). The complement q retains information much further into the long tail as shown in the following table:

Chi-squared distribution, 42 degrees of freedom

x      1 - p       q
166    1.11e-16    1.16e-16
167    1.11e-16    7.96e-17
168    0           5.43e-17
...
200    0           1.19e-22
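The 1 - p column can be reproduced without the library, because it depends only on how double values are spaced just below one. The following sketch takes the q values from the table, forms p = 1 - q in double precision, and then computes 1 - p again; FloatingPointSpacing is a hypothetical helper class.

```java
final class FloatingPointSpacing {
    /** Gap between 1.0 and the largest double below it, which is 2^-53. */
    static double gapBelowOne() {
        return 1.0 - Math.nextDown(1.0);
    }

    /** Round trip q -> p = 1 - q -> 1 - p, all in double precision. */
    static double roundTrip(double q) {
        double p = 1 - q; // rounded to the nearest representable value near 1
        return 1 - p;
    }

    public static void main(String[] args) {
        System.out.println(gapBelowOne());      // 2^-53, approximately 1.11e-16
        System.out.println(roundTrip(7.96e-17)); // 2^-53: p rounds to the double just below 1
        System.out.println(roundTrip(5.43e-17)); // 0.0: p rounds to exactly 1
    }
}
```

Any q smaller than half the gap, \( 2^{-54} \) (approximately 5.55e-17), is lost entirely when stored as p, while q itself remains representable down to far smaller magnitudes.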

Probability computations should use the appropriate cumulative or survival function to calculate the lower or upper tail respectively. The same care should be applied when inverting probability distributions. It is preferable to compute whichever of p or q is ≤ 0.5, which can be done without loss of accuracy, and then obtain x by inverting the cumulative probability using p or the survival probability using q respectively.

ChiSquaredDistribution chi2 = ChiSquaredDistribution.of(42);
double q = 5.43e-17;
// Incorrect: p = 1 - q == 1.0 !!!
double x1 = chi2.inverseCumulativeProbability(1 - q);
// Correct: invert q
double x2 = chi2.inverseSurvivalProbability(q);
// x1 == +infinity
// x2 ~ 168.0
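The same effect appears for any distribution with a long upper tail. As a sketch that does not depend on the library, the exponential distribution has closed-form inverses, so the two routes can be compared directly; TailInversion and its methods are hypothetical names.

```java
final class TailInversion {
    /** Inverse cumulative probability of an exponential distribution: solves 1 - exp(-rate * x) = p. */
    static double inverseCumulative(double p, double rate) {
        return -Math.log1p(-p) / rate;
    }

    /** Inverse survival probability of an exponential distribution: solves exp(-rate * x) = q. */
    static double inverseSurvival(double q, double rate) {
        return -Math.log(q) / rate;
    }

    public static void main(String[] args) {
        double q = 1e-20;
        // 1 - q rounds to exactly 1.0, so the upper tail information is lost
        System.out.println(inverseCumulative(1 - q, 1.0)); // Infinity
        // Inverting q directly keeps the information
        System.out.println(inverseSurvival(q, 1.0));       // approximately 46.05
    }
}
```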

Note: The survival probability functions were not present in the org.apache.commons.math3.distribution package. Users upgrading from commons-math should update usage of the cumulative probability functions where appropriate.