org.apache.commons.math3.stat.inference

## Class KolmogorovSmirnovTest

• java.lang.Object
• org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest

• public class KolmogorovSmirnovTest
extends Object
Implementation of the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.

The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follow a given distribution, the test statistic is $$D_n=\sup_x |F_n(x)-F(x)|$$, where $$F$$ is the expected distribution and $$F_n$$ is the empirical distribution of the $$n$$ sample data points. The distribution of $$D_n$$ is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].

Two-sample tests are also supported, evaluating the null hypothesis that the two samples x and y come from the same underlying distribution. In this case, the test statistic is $$D_{n,m}=\sup_t | F_n(t)-F_m(t)|$$ where $$n$$ is the length of x, $$m$$ is the length of y, $$F_n$$ is the empirical distribution that puts mass $$1/n$$ at each of the values in x and $$F_m$$ is the empirical distribution of the y values. The default 2-sample test method, kolmogorovSmirnovTest(double[], double[]) works as follows:

• For very small samples (where the product of the sample sizes is less than 200), the exact distribution is used to compute the p-value for the 2-sample test.
• For mid-size samples (product of sample sizes greater than or equal to 200 but less than 10000), Monte Carlo simulation is used to compute the p-value. The simulation randomly generates partitions of $$m + n$$ into an $$m$$-set and an $$n$$-set and reports the proportion that give $$D$$ values exceeding the observed value.
• When the product of the sample sizes exceeds 10000, the asymptotic distribution of $$D_{n,m}$$ is used. See approximateP(double, int, int) for details on the approximation.

In the two-sample case, $$D_{n,m}$$ has a discrete distribution. This makes the p-value associated with the null hypothesis $$H_0 : D_{n,m} \ge d$$ differ from $$H_0 : D_{n,m} > d$$ by the mass of the observed value $$d$$. To distinguish these, the two-sample tests use a boolean strict parameter. This parameter is ignored for large samples.

The methods used by the 2-sample default implementation are also exposed directly:

References:

Note that [1] contains an error in computing h, refer to MATH-437 for details.

Since:
3.3
Version:
$Id: KolmogorovSmirnovTest.html 908881 2014-05-15 07:10:28Z luc$
• ### Field Summary

Fields
Modifier and Type Field and Description
protected static double KS_SUM_CAUCHY_CRITERION
protected static int LARGE_SAMPLE_PRODUCT
When product of sample sizes exceeds this value, 2-sample K-S test uses asymptotic distribution for strict inequality p-value.
protected static int MAXIMUM_PARTIAL_SUM_COUNT
Bound on the number of partial sums in ksSum(double, double, int)
protected static int MONTE_CARLO_ITERATIONS
protected static int SMALL_SAMPLE_PRODUCT
When product of sample sizes is less than this value, 2-sample K-S test is exact
• ### Constructor Summary

Constructors
Constructor and Description
KolmogorovSmirnovTest()
Construct a KolmogorovSmirnovTest instance with a default random data generator.
KolmogorovSmirnovTest(RandomGenerator rng)
Construct a KolmogorovSmirnovTest with the provided random data generator.
• ### Method Summary

Methods
Modifier and Type Method and Description
double approximateP(double d, int n, int m)
Uses the Kolmogorov-Smirnov distribution to approximate $$P(D_{n,m} > d)$$ where $$D_{n,m}$$ is the 2-sample Kolmogorov-Smirnov statistic.
double cdf(double d, int n)
Calculates $$P(D_n < d)$$ using the method described in [1] with quick decisions for extreme values given in [2] (see above).
double cdf(double d, int n, boolean exact)
Calculates P(D_n < d) using method described in [1] with quick decisions for extreme values given in [2] (see above).
double cdfExact(double d, int n)
Calculates P(D_n < d).
double exactP(double d, int n, int m, boolean strict)
Computes $$P(D_{n,m} > d)$$ if strict is true; otherwise $$P(D_{n,m} \ge d)$$, where $$D_{n,m}$$ is the 2-sample Kolmogorov-Smirnov statistic.
double kolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, $$D_{n,m}=\sup_x |F_n(x)-F_m(x)|$$ where $$n$$ is the length of x, $$m$$ is the length of y, $$F_n$$ is the empirical distribution that puts mass $$1/n$$ at each of the values in x and $$F_m$$ is the empirical distribution of the y values.
double kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, $$D_n=\sup_x |F_n(x)-F(x)|$$ where $$F$$ is the distribution (cdf) function associated with distribution, $$n$$ is the length of data and $$F_n$$ is the empirical distribution that puts mass $$1/n$$ at each of the values in data.
double kolmogorovSmirnovTest(double[] x, double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
double kolmogorovSmirnovTest(RealDistribution distribution, double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
double kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
boolean kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha)
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
double ksSum(double t, double tolerance, int maxIterations)
Computes $$1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}$$ stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed.
double monteCarloP(double d, int n, int m, boolean strict, int iterations)
Uses Monte Carlo simulation to approximate $$P(D_{n,m} > d)$$ where $$D_{n,m}$$ is the 2-sample Kolmogorov-Smirnov statistic.
• ### Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
• ### Constructor Detail

• #### KolmogorovSmirnovTest

public KolmogorovSmirnovTest()
Construct a KolmogorovSmirnovTest instance with a default random data generator.
• #### KolmogorovSmirnovTest

public KolmogorovSmirnovTest(RandomGenerator rng)
Construct a KolmogorovSmirnovTest with the provided random data generator.
Parameters:
rng - random data generator used by monteCarloP(double, int, int, boolean, int)
• ### Method Detail

• #### kolmogorovSmirnovTest

public double kolmogorovSmirnovTest(RealDistribution distribution,
double[] data,
boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. If exact is true, the distribution used to compute the p-value is computed using extended precision. See cdfExact(double, int).
Parameters:
distribution - reference distribution
data - sample being being evaluated
exact - whether or not to force exact computation of the p-value
Returns:
the p-value associated with the null hypothesis that data is a sample from distribution
Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
• #### kolmogorovSmirnovStatistic

public double kolmogorovSmirnovStatistic(RealDistribution distribution,
double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, $$D_n=\sup_x |F_n(x)-F(x)|$$ where $$F$$ is the distribution (cdf) function associated with distribution, $$n$$ is the length of data and $$F_n$$ is the empirical distribution that puts mass $$1/n$$ at each of the values in data.
Parameters:
distribution - reference distribution
data - sample being evaluated
Returns:
Kolmogorov-Smirnov statistic $$D_n$$
Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
• #### kolmogorovSmirnovTest

public double kolmogorovSmirnovTest(double[] x,
double[] y,
boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly selected partition of the combined sample into subsamples of sizes x.length and y.length will strictly exceed (if strict is true) or be at least as large as strict = false) as kolmogorovSmirnovStatistic(x, y).
• For very small samples (where the product of the sample sizes is less than 200), the exact distribution is used to compute the p-value. This is accomplished by enumerating all partitions of the combined sample into two subsamples of the respective sample sizes, computing $$D_{n,m}$$ for each partition and returning the proportion of partitions that give $$D$$ values exceeding the observed value.
• For mid-size samples (product of sample sizes greater than or equal to 200 but less than 10000), Monte Carlo simulation is used to compute the p-value. The simulation randomly generates partitions and reports the proportion that give $$D$$ values exceeding the observed value.
• When the product of the sample sizes exceeds 10000, the asymptotic distribution of $$D_{n,m}$$ is used. See approximateP(double, int, int) for details on the approximation.
Parameters:
x - first sample dataset
y - second sample dataset
strict - whether or not the probability to compute is expressed as a strict inequality (ignored for large samples)
Returns:
p-value associated with the null hypothesis that x and y represent samples from the same distribution
Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
• #### kolmogorovSmirnovTest

public double kolmogorovSmirnovTest(double[] x,
double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. See kolmogorovSmirnovTest(RealDistribution, double[], boolean).
Parameters:
x - first sample dataset
y - second sample dataset
Returns:
p-value associated with the null hypothesis that x and y represent samples from the same distribution
Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
• #### kolmogorovSmirnovStatistic

public double kolmogorovSmirnovStatistic(double[] x,
double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, $$D_{n,m}=\sup_x |F_n(x)-F_m(x)|$$ where $$n$$ is the length of x, $$m$$ is the length of y, $$F_n$$ is the empirical distribution that puts mass $$1/n$$ at each of the values in x and $$F_m$$ is the empirical distribution of the y values.
Parameters:
x - first sample
y - second sample
Returns:
test statistic $$D_{n,m}$$ used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
• #### kolmogorovSmirnovTest

public double kolmogorovSmirnovTest(RealDistribution distribution,
double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
Parameters:
distribution - reference distribution
data - sample being being evaluated
Returns:
the p-value associated with the null hypothesis that data is a sample from distribution
Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
• #### kolmogorovSmirnovTest

public boolean kolmogorovSmirnovTest(RealDistribution distribution,
double[] data,
double alpha)
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
Parameters:
distribution - reference distribution
data - sample being being evaluated
alpha - significance level of the test
Returns:
true iff the null hypothesis that data is a sample from distribution can be rejected with confidence 1 - alpha
Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
• #### cdf

public double cdf(double d,
int n)
throws MathArithmeticException
Calculates $$P(D_n < d)$$ using the method described in [1] with quick decisions for extreme values given in [2] (see above). The result is not exact as with cdfExact(double, int) because calculations are based on double rather than BigFraction.
Parameters:
d - statistic
n - sample size
Returns:
$$P(D_n < d)$$
Throws:
MathArithmeticException - if algorithm fails to convert h to a BigFraction in expressing d as $$(k - h) / m$$ for integer k, m and $$0 \le h < 1$$
• #### cdfExact

public double cdfExact(double d,
int n)
throws MathArithmeticException
Calculates P(D_n < d). The result is exact in the sense that BigFraction/BigReal is used everywhere at the expense of very slow execution time. Almost never choose this in real applications unless you are very sure; this is almost solely for verification purposes. Normally, you would choose cdf(double, int). See the class javadoc for definitions and algorithm description.
Parameters:
d - statistic
n - sample size
Returns:
$$P(D_n < d)$$
Throws:
MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as $$(k - h) / m$$ for integer k, m and $$0 \le h < 1$$
• #### cdf

public double cdf(double d,
int n,
boolean exact)
throws MathArithmeticException
Calculates P(D_n < d) using method described in [1] with quick decisions for extreme values given in [2] (see above).
Parameters:
d - statistic
n - sample size
exact - whether the probability should be calculated exact using BigFraction everywhere at the expense of very slow execution time, or if double should be used convenient places to gain speed. Almost never choose true in real applications unless you are very sure; true is almost solely for verification purposes.
Returns:
$$P(D_n < d)$$
Throws:
MathArithmeticException - if algorithm fails to convert h to a BigFraction in expressing d as $$(k - h) / m$$ for integer k, m and $$0 \le h < 1$$.
• #### ksSum

public double ksSum(double t,
double tolerance,
int maxIterations)
Computes $$1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}$$ stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. If the sum does not converge before maxIterations iterations a TooManyIterationsException is thrown.
Parameters:
t - argument
tolerance - Cauchy criterion for partial sums
maxIterations - maximum number of partial sums to compute
Returns:
Kolmogorov sum evaluated at t
Throws:
TooManyIterationsException - if the series does not converge
• #### exactP

public double exactP(double d,
int n,
int m,
boolean strict)
Computes $$P(D_{n,m} > d)$$ if strict is true; otherwise $$P(D_{n,m} \ge d)$$, where $$D_{n,m}$$ is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of $$D_{n,m}$$.

The returned probability is exact, obtained by enumerating all partitions of m + n into m and n sets, computing $$D_{n,m}$$ for each partition and counting the number of partitions that yield $$D_{n,m}$$ values exceeding (resp. greater than or equal to) d.

USAGE NOTE: Since this method enumerates all combinations in $${m+n} \choose {n}$$, it is very slow if called for large m, n. For this reason, kolmogorovSmirnovTest(double[], double[]) uses this only for m * n <  200.

Parameters:
d - D-statistic value
n - first sample size
m - second sample size
strict - whether or not the probability to compute is expressed as a strict inequality
Returns:
probability that a randomly selected m-n partition of m + n generates $$D_{n,m}$$ greater than (resp. greater than or equal to) d
• #### approximateP

public double approximateP(double d,
int n,
int m)
Uses the Kolmogorov-Smirnov distribution to approximate $$P(D_{n,m} > d)$$ where $$D_{n,m}$$ is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of $$D_{n,m}$$.

Specifically, what is returned is $$1 - k(d \sqrt{mn / (m + n)})$$ where $$k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}$$. See ksSum(double, double, int) for details on how convergence of the sum is determined. This implementation passes ksSum 1.0E-20 as tolerance and 100000 as maxIterations.

Parameters:
d - D-statistic value
n - first sample size
m - second sample size
Returns:
approximate probability that a randomly selected m-n partition of m + n generates $$D_{n,m}$$ greater than d
• #### monteCarloP

public double monteCarloP(double d,
int n,
int m,
boolean strict,
int iterations)
Uses Monte Carlo simulation to approximate $$P(D_{n,m} > d)$$ where $$D_{n,m}$$ is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of $$D_{n,m}$$.

The simulation generates iterations random partitions of m + n into an n set and an m set, computing $$D_{n,m}$$ for each partition and returning the proportion of values that are greater than d, or greater than or equal to d if strict is false.

Parameters:
d - D-statistic value
n - first sample size
m - second sample size
iterations - number of random partitions to generate
strict - whether or not the probability to compute is expressed as a strict inequality
Returns:
proportion of randomly generated m-n partitions of m + n that result in $$D_{n,m}$$ greater than (resp. greater than or equal to) d