org.apache.commons.math4.distribution

## Class EmpiricalDistribution

• All Implemented Interfaces:
Serializable, org.apache.commons.statistics.distribution.ContinuousDistribution

public class EmpiricalDistribution
extends AbstractRealDistribution
implements org.apache.commons.statistics.distribution.ContinuousDistribution

Represents an empirical probability distribution -- a probability distribution derived from observed data without making any assumptions about the functional form of the population distribution that the data come from.

An EmpiricalDistribution maintains data structures, called distribution digests, that describe empirical distributions and support the following operations:

• dividing the input data into "bin ranges" and reporting bin frequency counts (data for histogram)
• reporting univariate statistics describing the full set of data values as well as the observations within each bin
• generating random values from the distribution
Applications can use EmpiricalDistribution to build grouped frequency histograms representing the input data or to generate random values "like" those in the input file -- i.e., the values generated will follow the distribution of the values in the file.

The implementation uses what amounts to the Variable Kernel Method with Gaussian smoothing:

Digesting the input file

1. Pass the file once to compute min and max.
2. Divide the range from min to max into binCount bins.
3. Pass the data file again, computing bin counts and univariate statistics (mean, std dev.) for each of the bins.
4. Divide the interval (0,1) into subintervals associated with the bins, with the length of a bin's subinterval proportional to its count.
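
Steps 1 to 3 above can be sketched with the JDK alone. This is a minimal illustration, not the library's code: the class name `BinDigestSketch` and its fields are hypothetical, the data come from an in-memory array rather than a file, and per-bin statistics use a Welford running update as an assumed stand-in for the library's statistics classes.

```java
/** Illustrative two-pass digest over an in-memory array (not the library's code). */
class BinDigestSketch {
    final double min;
    final double max;
    final long[] counts;     // observations per bin
    final double[] means;    // per-bin mean
    final double[] stdDevs;  // per-bin sample std dev (0 for fewer than 2 values)

    BinDigestSketch(double[] data, int binCount) {
        // Pass 1: find min and max.
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double v : data) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        min = lo;
        max = hi;
        double delta = (max - min) / binCount;  // common bin width

        // Pass 2: per-bin count, mean, and sum of squared deviations.
        counts = new long[binCount];
        means = new double[binCount];
        stdDevs = new double[binCount];
        double[] m2 = new double[binCount];
        for (double v : data) {
            // Bins are [min, b0], (b0, b1], ...; clamp to guard against rounding.
            int b = (int) Math.ceil((v - min) / delta) - 1;
            b = Math.min(Math.max(b, 0), binCount - 1);
            counts[b]++;
            double d = v - means[b];
            means[b] += d / counts[b];
            m2[b] += d * (v - means[b]);
        }
        for (int b = 0; b < binCount; b++) {
            stdDevs[b] = counts[b] > 1 ? Math.sqrt(m2[b] / (counts[b] - 1)) : 0.0;
        }
    }
}
```

For the values 1 through 10 and three bins, this assigns 4, 3 and 3 observations to the bins [1,4], (4,7] and (7,10].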
Generating random values from the distribution

1. Generate a uniformly distributed value in (0,1).
2. Select the subinterval to which the value belongs.
3. Generate a random Gaussian value with mean = mean of the associated bin and std dev = std dev of associated bin.
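
The three generation steps can be sketched as follows. This is a JDK-only illustration under stated assumptions: `EmpiricalSamplerSketch`, `selectBin` and `sample` are hypothetical names, the subinterval upper bounds and per-bin statistics are passed in as plain arrays, and `java.util.Random` stands in for whatever random source the library uses.

```java
import java.util.Random;

class EmpiricalSamplerSketch {
    /**
     * Step 2: find the subinterval of (0,1) containing u.
     * upperBounds[i] is the cumulative fraction of observations in bins 0..i.
     */
    static int selectBin(double u, double[] upperBounds) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (u < upperBounds[i]) {
                return i;  // empty bins have zero-length subintervals and are skipped
            }
        }
        return upperBounds.length - 1;  // guard against rounding in the last bound
    }

    /** Steps 1 to 3: uniform draw, bin selection, Gaussian draw from that bin. */
    static double sample(Random rng, double[] upperBounds,
                         double[] binMeans, double[] binStdDevs) {
        double u = rng.nextDouble();
        int b = selectBin(u, upperBounds);
        return binMeans[b] + binStdDevs[b] * rng.nextGaussian();
    }
}
```

Note that a Gaussian draw is not confined to its bin, so a value generated this way may fall outside the bin that was selected.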

EmpiricalDistribution implements the ContinuousDistribution interface as follows. Given x within the range of values in the dataset, let B be the bin containing x and let K be the within-bin kernel for B. Let P(B-) be the sum of the probabilities of the bins below B, let K(B-) be the kernel distribution evaluated at the lower endpoint of B, and let K(B) be the mass of B under K (i.e., the integral of the kernel density over B). Then set P(X < x) = P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the kernel distribution evaluated at x. This results in a cdf that matches the grouped frequency distribution at the bin endpoints and interpolates within bins using within-bin kernels.

USAGE NOTES:
• The binCount is set by default to 1000. A good rule of thumb is to set the bin count to approximately the length of the input file divided by 10.
• The input file must be a plain text file containing one valid numeric entry per line.
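
As a rough illustration of these two notes, the sketch below parses one numeric entry per line and applies the divide-by-10 rule of thumb. The class and method names (`InputFileSketch`, `parseLines`, `readValues`, `suggestedBinCount`) are hypothetical helpers, not part of the library; the result could then be passed to the `EmpiricalDistribution(int binCount)` constructor and `load(double[])`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class InputFileSketch {
    /** Parses one numeric entry per line; throws NumberFormatException on an invalid line. */
    static double[] parseLines(List<String> lines) {
        double[] values = new double[lines.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = Double.parseDouble(lines.get(i).trim());
        }
        return values;
    }

    /** Reads a plain text file with one numeric entry per line. */
    static double[] readValues(Path file) throws IOException {
        return parseLines(Files.readAllLines(file));
    }

    /** Rule of thumb: about one bin per ten observations, and at least one bin. */
    static int suggestedBinCount(int observationCount) {
        return Math.max(1, observationCount / 10);
    }
}
```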

• ### Nested classes/interfaces inherited from interface org.apache.commons.statistics.distribution.ContinuousDistribution

org.apache.commons.statistics.distribution.ContinuousDistribution.Sampler
• ### Field Summary

Fields
Modifier and Type Field and Description
static int DEFAULT_BIN_COUNT
Default bin count
• ### Fields inherited from class org.apache.commons.math4.distribution.AbstractRealDistribution

SOLVER_DEFAULT_ABSOLUTE_ACCURACY
• ### Constructor Summary

Constructors
Constructor and Description
EmpiricalDistribution()
Creates a new EmpiricalDistribution with the default bin count.
EmpiricalDistribution(int binCount)
Creates a new EmpiricalDistribution with the specified bin count.
• ### Method Summary

All Methods
Modifier and Type Method and Description
org.apache.commons.statistics.distribution.ContinuousDistribution.Sampler createSampler(org.apache.commons.rng.UniformRandomProvider rng)
double cumulativeProbability(double x)
double density(double x)
int getBinCount()
Returns the number of bins.
List<SummaryStatistics> getBinStats()
Returns a List of SummaryStatistics instances containing statistics describing the values in each of the bins.
double[] getGeneratorUpperBounds()
Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution.
protected org.apache.commons.statistics.distribution.ContinuousDistribution getKernel(SummaryStatistics bStats)
The within-bin smoothing kernel.
double getMean()
StatisticalSummary getSampleStats()
Returns a StatisticalSummary describing this distribution.
double getSupportLowerBound()
double getSupportUpperBound()
double[] getUpperBounds()
Returns a fresh copy of the array of upper bounds for the bins.
double getVariance()
double inverseCumulativeProbability(double p)
The default implementation returns ContinuousDistribution.getSupportLowerBound() for p = 0, ContinuousDistribution.getSupportUpperBound() for p = 1.
boolean isLoaded()
Property indicating whether or not the distribution has been loaded.
boolean isSupportConnected()
void load(double[] in)
Computes the empirical distribution from the provided array of numbers.
void load(File file)
Computes the empirical distribution from the input file.
void load(URL url)
Computes the empirical distribution using data read from a URL.
double probability(double x)
• ### Methods inherited from class org.apache.commons.math4.distribution.AbstractRealDistribution

getSolverAbsoluteAccuracy, logDensity, probability, sample
• ### Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
• ### Methods inherited from interface org.apache.commons.statistics.distribution.ContinuousDistribution

logDensity, probability
• ### Field Detail

• #### DEFAULT_BIN_COUNT

public static final int DEFAULT_BIN_COUNT
Default bin count
• ### Constructor Detail

• #### EmpiricalDistribution

public EmpiricalDistribution()
Creates a new EmpiricalDistribution with the default bin count.
• #### EmpiricalDistribution

public EmpiricalDistribution(int binCount)
Creates a new EmpiricalDistribution with the specified bin count.
Parameters:
binCount - number of bins. Must be strictly positive.
Throws:
NotStrictlyPositiveException - if binCount <= 0.
• ### Method Detail

• #### load

public void load(double[] in)
throws NullArgumentException
Computes the empirical distribution from the provided array of numbers.
Parameters:
in - the input data array
Throws:
NullArgumentException - if in is null

• #### load

public void load(URL url)
throws IOException,
NullArgumentException,
ZeroException
Computes the empirical distribution using data read from a URL.

The input file must be an ASCII text file containing one valid numeric entry per line.

Parameters:
url - url of the input file
Throws:
IOException - if an IO error occurs
NullArgumentException - if url is null
ZeroException - if URL contains no data

• #### load

public void load(File file)
throws IOException,
NullArgumentException
Computes the empirical distribution from the input file.

The input file must be an ASCII text file containing one valid numeric entry per line.

Parameters:
file - the input file
Throws:
IOException - if an IO error occurs
NullArgumentException - if file is null
• #### getSampleStats

public StatisticalSummary getSampleStats()
Returns a StatisticalSummary describing this distribution.

Preconditions:
• the distribution must be loaded before invoking this method
Returns:
the sample statistics
Throws:
IllegalStateException - if the distribution has not been loaded
• #### getBinCount

public int getBinCount()
Returns the number of bins.
Returns:
the number of bins.
• #### getBinStats

public List<SummaryStatistics> getBinStats()
Returns a List of SummaryStatistics instances containing statistics describing the values in each of the bins. The list is indexed on the bin number.
Returns:
List of bin statistics.
• #### getUpperBounds

public double[] getUpperBounds()

Returns a fresh copy of the array of upper bounds for the bins. Bins are:
[min,upperBounds[0]],(upperBounds[0],upperBounds[1]],..., (upperBounds[binCount-2], upperBounds[binCount-1] = max].
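
A minimal sketch of how such an array of bin upper bounds could be built for equal-width bins is shown below. `BinBoundsSketch.upperBounds` is a hypothetical helper and assumes binCount is strictly positive; the last bound is set to max directly so that rounding in the width computation cannot move it.

```java
class BinBoundsSketch {
    /** Upper bounds of binCount equal-width bins covering [min, max]. */
    static double[] upperBounds(double min, double max, int binCount) {
        double delta = (max - min) / binCount;  // common bin width
        double[] bounds = new double[binCount];
        for (int i = 0; i < binCount - 1; i++) {
            bounds[i] = min + delta * (i + 1);
        }
        bounds[binCount - 1] = max;  // pin the last bound to max
        return bounds;
    }
}
```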

Note: In versions 1.0-2.0 of commons-math, this method incorrectly returned the array of probability generator upper bounds now returned by getGeneratorUpperBounds().

Returns:
array of bin upper bounds
Since:
2.1
• #### getGeneratorUpperBounds

public double[] getGeneratorUpperBounds()

Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution. Subintervals correspond to bins with lengths proportional to bin counts.
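
To make "lengths proportional to bin counts" concrete, the following sketch derives the subinterval upper bounds from per-bin counts as cumulative fractions. `GeneratorBoundsSketch` is a hypothetical name and the counts array is an assumed input; this is not the library's code.

```java
class GeneratorBoundsSketch {
    /** Cumulative fraction of observations in bins 0..i, for each bin i. */
    static double[] generatorUpperBounds(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no observations");
        }
        double[] bounds = new double[counts.length];
        long running = 0;
        for (int i = 0; i < counts.length; i++) {
            running += counts[i];
            bounds[i] = (double) running / total;
        }
        bounds[counts.length - 1] = 1.0;  // the subintervals cover all of [0,1]
        return bounds;
    }
}
```

With bin counts 4, 3 and 3, the subintervals are [0, 0.4], (0.4, 0.7] and (0.7, 1].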

Preconditions:
• the distribution must be loaded before invoking this method

In versions 1.0-2.0 of commons-math, this array was (incorrectly) returned by getUpperBounds().

Returns:
array of upper bounds of subintervals used in data generation
Throws:
NullPointerException - unless a load method has been called beforehand.
Since:
2.1

• #### isLoaded

public boolean isLoaded()
Property indicating whether or not the distribution has been loaded.
Returns:
true if the distribution has been loaded
• #### probability

public double probability(double x)
Specified by:
probability in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Overrides:
probability in class AbstractRealDistribution
Returns:
zero.
Since:
3.1
• #### density

public double density(double x)

Returns the kernel density normalized so that its integral over each bin equals the bin mass.

Algorithm description:

1. Find the bin B that x belongs to.
2. Compute K(B) = the mass of B with respect to the within-bin kernel (i.e., the integral of the kernel density over B).
3. Return k(x) * P(B) / K(B), where k is the within-bin kernel density and P(B) is the mass of B.
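
The density steps above can be sketched for a single bin with a Gaussian kernel. This is an illustration under assumptions, not the library's implementation: `KernelDensitySketch` is a hypothetical class, the bin endpoints, mass, mean and standard deviation are passed in directly, and the Gaussian cdf uses the Abramowitz and Stegun 7.1.26 approximation of erf (absolute error about 1.5e-7) because the JDK has no erf function.

```java
class KernelDensitySketch {
    /** Abramowitz-Stegun 7.1.26 approximation of erf. */
    static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        double y = 1.0 - poly * Math.exp(-x * x);
        return x >= 0 ? y : -y;
    }

    static double gaussianCdf(double x, double mean, double sd) {
        return 0.5 * (1.0 + erf((x - mean) / (sd * Math.sqrt(2.0))));
    }

    static double gaussianPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Returns k(x) * P(B) / K(B) for x in the bin (lower, upper]; requires binSd > 0. */
    static double density(double x, double lower, double upper,
                          double binMass, double binMean, double binSd) {
        // K(B): mass that the kernel itself assigns to the bin.
        double kernelMass = gaussianCdf(upper, binMean, binSd)
                - gaussianCdf(lower, binMean, binSd);
        return gaussianPdf(x, binMean, binSd) * binMass / kernelMass;
    }
}
```

Dividing by K(B) is what makes the density integrate to P(B) over the bin, up to the accuracy of the erf approximation.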
Specified by:
density in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Since:
3.1
• #### cumulativeProbability

public double cumulativeProbability(double x)

Algorithm description:

1. Find the bin B that x belongs to.
2. Compute P(B) = the mass of B and P(B-) = the combined mass of the bins below B.
3. Compute K(B) = the probability mass of B with respect to the within-bin kernel and K(B-) = the kernel distribution evaluated at the lower endpoint of B.
4. Return P(B-) + P(B) * [K(x) - K(B-)] / K(B) where K(x) is the within-bin kernel distribution function evaluated at x.
If K is a constant distribution, we return P(B-) + P(B) (counting the full mass of B).
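
The same computation for one bin can be sketched as below, again with a Gaussian kernel. `KernelCdfSketch` is a hypothetical class, the bin endpoints and masses are assumed inputs, a zero standard deviation is used here to represent the constant-kernel case, and the erf approximation (Abramowitz and Stegun 7.1.26) is a stand-in for a library-grade Gaussian cdf.

```java
class KernelCdfSketch {
    /** Abramowitz-Stegun 7.1.26 approximation of erf (absolute error about 1.5e-7). */
    static double erf(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        double y = 1.0 - poly * Math.exp(-x * x);
        return x >= 0 ? y : -y;
    }

    static double gaussianCdf(double x, double mean, double sd) {
        return 0.5 * (1.0 + erf((x - mean) / (sd * Math.sqrt(2.0))));
    }

    /**
     * Returns P(B-) + P(B) * [K(x) - K(B-)] / K(B) for x in the bin (lower, upper].
     * massBelow is P(B-) and binMass is P(B).
     */
    static double cumulativeProbability(double x, double lower, double upper,
                                        double massBelow, double binMass,
                                        double binMean, double binSd) {
        if (binSd == 0.0) {
            return massBelow + binMass;  // constant kernel: count the full mass of B
        }
        double kLower = gaussianCdf(lower, binMean, binSd);               // K(B-)
        double kernelMass = gaussianCdf(upper, binMean, binSd) - kLower;  // K(B)
        double kx = gaussianCdf(x, binMean, binSd);                       // K(x)
        return massBelow + binMass * (kx - kLower) / kernelMass;
    }
}
```

At the lower endpoint of the bin the result is P(B-), and at the upper endpoint it is P(B-) + P(B), so the values agree with the grouped frequency distribution at bin endpoints.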
Specified by:
cumulativeProbability in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Since:
3.1
• #### inverseCumulativeProbability

public double inverseCumulativeProbability(double p)
throws OutOfRangeException
The default implementation returns
• ContinuousDistribution.getSupportLowerBound() for p = 0,
• ContinuousDistribution.getSupportUpperBound() for p = 1.

Algorithm description:

1. Find the smallest i such that the sum of the masses of the bins through i is at least p, and let B denote bin i.
2. Let K be the within-bin kernel distribution for bin i.
Let K(B) be the mass of B under K.
Let K(B-) be K evaluated at the lower endpoint of B (the combined mass of the bins below B under K).
Let P(B) be the probability of bin i.
Let P(B-) be the sum of the bin masses below bin i.
Let pCrit = p - P(B-)
3. Return the inverse of K evaluated at
K(B-) + pCrit * K(B) / P(B)
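
A sketch of these steps follows. It is kernel-agnostic and makes two assumptions that are not taken from the library: the kernel cdf is supplied as a `DoubleUnaryOperator`, and the kernel is inverted by bisection over the bin rather than by calling a kernel's own inverse function. `InverseCdfSketch`, `findBin` and `inverseWithinBin` are hypothetical names.

```java
import java.util.function.DoubleUnaryOperator;

class InverseCdfSketch {
    /** Step 1: smallest i such that the mass of bins 0..i is at least p. */
    static int findBin(double p, double[] binMasses) {
        double cumulative = 0.0;
        for (int i = 0; i < binMasses.length; i++) {
            cumulative += binMasses[i];
            if (cumulative >= p) {
                return i;
            }
        }
        return binMasses.length - 1;  // guard against rounding in the sum
    }

    /** Steps 2 and 3 for the selected bin (lower, upper]. */
    static double inverseWithinBin(double p, double lower, double upper,
                                   double massBelow, double binMass,
                                   DoubleUnaryOperator kernelCdf) {
        double kLower = kernelCdf.applyAsDouble(lower);               // K(B-)
        double kernelMass = kernelCdf.applyAsDouble(upper) - kLower;  // K(B)
        double pCrit = p - massBelow;
        double target = kLower + pCrit * kernelMass / binMass;

        // Invert K on [lower, upper] by bisection (assumes K is nondecreasing).
        double lo = lower;
        double hi = upper;
        for (int iter = 0; iter < 200 && hi - lo > 1e-12; iter++) {
            double mid = 0.5 * (lo + hi);
            if (kernelCdf.applyAsDouble(mid) < target) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return 0.5 * (lo + hi);
    }
}
```

For example, with bin masses 0.4, 0.3 and 0.3, p = 0.55 selects the second bin and pCrit is half of that bin's mass, so the result is the point that splits the kernel's mass within the bin in half.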
Specified by:
inverseCumulativeProbability in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Overrides:
inverseCumulativeProbability in class AbstractRealDistribution
Throws:
OutOfRangeException
Since:
3.1
• #### getMean

public double getMean()
Specified by:
getMean in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Since:
3.1
• #### getVariance

public double getVariance()
Specified by:
getVariance in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Since:
3.1
• #### getSupportLowerBound

public double getSupportLowerBound()
Specified by:
getSupportLowerBound in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Since:
3.1
• #### getSupportUpperBound

public double getSupportUpperBound()
Specified by:
getSupportUpperBound in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Since:
3.1
• #### isSupportConnected

public boolean isSupportConnected()
Specified by:
isSupportConnected in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Since:
3.1
• #### createSampler

public org.apache.commons.statistics.distribution.ContinuousDistribution.Sampler createSampler(org.apache.commons.rng.UniformRandomProvider rng)
Specified by:
createSampler in interface org.apache.commons.statistics.distribution.ContinuousDistribution
Overrides:
createSampler in class AbstractRealDistribution
• #### getKernel

protected org.apache.commons.statistics.distribution.ContinuousDistribution getKernel(SummaryStatistics bStats)
The within-bin smoothing kernel. Returns a Gaussian distribution parameterized by bStats, unless the bin contains only one observation, in which case a constant distribution is returned.
Parameters:
bStats - summary statistics for the bin
Returns:
within-bin kernel parameterized by bStats