Class KolmogorovSmirnovTest
 java.lang.Object

 org.apache.commons.math4.legacy.stat.inference.KolmogorovSmirnovTest

public class KolmogorovSmirnovTest extends Object
Implementation of the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions. The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follow a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
Two-sample tests are also supported, evaluating the null hypothesis that the two samples x and y come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t |F_n(t)-F_m(t)|\), where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x, and \(F_m\) is the empirical distribution of the y values. The default 2-sample test method, kolmogorovSmirnovTest(double[], double[]), works as follows:
 When the product of the sample sizes is less than 10000, the method presented in [4] is used to compute the exact p-value for the 2-sample test.
 When the product of the sample sizes is larger, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.
For small samples (the former case), if the data contains ties, random jitter is added to the sample data to break ties before applying the algorithm above. Alternatively, the bootstrap(double[],double[],int,boolean,UniformRandomProvider) method, modeled after ks.boot in the R Matching package [3], can be used if ties are known to be present in the data.
In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \ge d\) differ from \(H_0 : D_{n,m} > d\) by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean strict parameter. This parameter is ignored for large samples.
The methods used by the 2-sample default implementation are also exposed directly:
 exactP(double, int, int, boolean) computes exact 2-sample p-values
 approximateP(double, int, int) uses the asymptotic distribution
The boolean strict argument of exactP allows the probability used to estimate the p-value to be expressed using strict or non-strict inequality. See kolmogorovSmirnovTest(double[], double[], boolean).
References:
 [1] Evaluating Kolmogorov's Distribution by George Marsaglia, Wai Wan Tsang, and Jingbo Wang
 [2] Computing the Two-Sided Kolmogorov-Smirnov Distribution by Richard Simard and Pierre L'Ecuyer
 [3] Jasjeet S. Sekhon. 2011. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7): 1-52.
 [4] Wilcox, Rand. 2012. Introduction to Robust Estimation and Hypothesis Testing, Chapter 5, 3rd Ed. Academic Press.
Note that [1] contains an error in computing h; refer to MATH-437 for details.
Since:
 3.3


Constructor Summary
Constructor
 KolmogorovSmirnovTest()

Method Summary
Modifier and Type, Method, Description
double approximateP(double d, int n, int m)
 Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
double bootstrap(double[] x, double[] y, int iterations, boolean strict, org.apache.commons.rng.UniformRandomProvider rng)
 Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
double cdf(double d, int n)
 Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
double cdf(double d, int n, boolean exact)
 Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
double cdfExact(double d, int n)
 Calculates \(P(D_n < d)\).
double exactP(double d, int n, int m, boolean strict)
 Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
double kolmogorovSmirnovStatistic(double[] x, double[] y)
 Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\), where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x, and \(F_m\) is the empirical distribution of the y values.
double kolmogorovSmirnovStatistic(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data)
 Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data, and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
double kolmogorovSmirnovTest(double[] x, double[] y)
 Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
 Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data)
 Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, boolean exact)
 Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
boolean kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, double alpha)
 Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
double ksSum(double t, double tolerance, int maxIterations)
 Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed.
double monteCarloP(double d, int n, int m, boolean strict, int iterations, org.apache.commons.rng.UniformRandomProvider rng)
 Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
double pelzGood(double d, int n)
 Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.



Constructor Detail

KolmogorovSmirnovTest
public KolmogorovSmirnovTest()


Method Detail

kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. If exact is true, the distribution used to compute the p-value is computed using extended precision. See cdfExact(double, int).
Parameters:
 distribution - reference distribution
 data - sample being evaluated
 exact - whether or not to force exact computation of the p-value
Returns:
 the p-value associated with the null hypothesis that data is a sample from distribution
Throws:
 InsufficientDataException - if data does not have length at least 2
 NullArgumentException - if data is null

kolmogorovSmirnovStatistic
public double kolmogorovSmirnovStatistic(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data, and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
Parameters:
 distribution - reference distribution
 data - sample being evaluated
Returns:
 Kolmogorov-Smirnov statistic \(D_n\)
Throws:
 InsufficientDataException - if data does not have length at least 2
 NullArgumentException - if data is null
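
The definition of \(D_n\) can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class name OneSampleKsSketch and the helper uniformCdf are hypothetical, and a plain DoubleUnaryOperator stands in for the ContinuousDistribution argument. Because \(F_n\) jumps by \(1/n\) at each sorted data point, the supremum is attained just before or just at one of those points, so it is enough to compare \(F(x_{(i)})\) with \((i-1)/n\) and \(i/n\).

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of the one-sample statistic; not the Commons Math code. */
final class OneSampleKsSketch {

    /** Illustrative reference cdf: the uniform distribution on [0, 1]. */
    static double uniformCdf(double x) {
        return Math.min(1.0, Math.max(0.0, x));
    }

    /** Returns sup_x |F_n(x) - F(x)| for a continuous cdf F and a non-empty sample. */
    static double statistic(DoubleUnaryOperator cdf, double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0.0;
        for (int i = 1; i <= n; i++) {
            double f = cdf.applyAsDouble(sorted[i - 1]);
            // F_n equals (i - 1) / n just below the i-th order statistic and i / n at it.
            double below = f - (i - 1) / (double) n;
            double above = i / (double) n - f;
            d = Math.max(d, Math.max(below, above));
        }
        return d;
    }
}
```

For the sample {0.25, 0.75} against the uniform cdf, every candidate gap is 0.25, so the sketch returns 0.25.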

kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly selected partition of the combined sample into subsamples of sizes x.length and y.length will strictly exceed (if strict is true) or be at least as large as (if strict is false) kolmogorovSmirnovStatistic(x, y).
Parameters:
 x - first sample dataset.
 y - second sample dataset.
 strict - whether or not the probability to compute is expressed as a strict inequality (ignored for large samples).
Returns:
 p-value associated with the null hypothesis that x and y represent samples from the same distribution.
Throws:
 InsufficientDataException - if either x or y does not have length at least 2.
 NullArgumentException - if either x or y is null.
 NotANumberException - if the input arrays contain NaN values.
See Also:
 bootstrap(double[],double[],int,boolean,UniformRandomProvider)

kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(double[] x, double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. See kolmogorovSmirnovTest(double[], double[], boolean).
Parameters:
 x - first sample dataset
 y - second sample dataset
Returns:
 p-value associated with the null hypothesis that x and y represent samples from the same distribution
Throws:
 InsufficientDataException - if either x or y does not have length at least 2
 NullArgumentException - if either x or y is null

kolmogorovSmirnovStatistic
public double kolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\), where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x, and \(F_m\) is the empirical distribution of the y values.
Parameters:
 x - first sample
 y - second sample
Returns:
 test statistic \(D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
Throws:
 InsufficientDataException - if either x or y does not have length at least 2
 NullArgumentException - if either x or y is null
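
As an illustration of the definition above, the following self-contained sketch computes \(D_{n,m}\) by walking the two sorted samples in merged order. It is an assumption-laden example rather than the library's code: the class name TwoSampleKsSketch is hypothetical, inputs are assumed non-empty, and NaN values are rejected with a plain IllegalArgumentException instead of the library's exception types.

```java
import java.util.Arrays;

/** Hypothetical sketch of the two-sample statistic; not the Commons Math code. */
final class TwoSampleKsSketch {

    /** Returns sup_t |F_n(t) - F_m(t)| for two non-empty samples without NaN. */
    static double statistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        // Arrays.sort places NaN last, so checking the final elements is enough.
        if (Double.isNaN(sx[n - 1]) || Double.isNaN(sy[m - 1])) {
            throw new IllegalArgumentException("NaN in input");
        }
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < n && j < m) {
            // Advance both empirical cdfs past the next distinct value, so tied
            // observations are counted together before the gap is measured.
            double z = Math.min(sx[i], sy[j]);
            while (i < n && sx[i] == z) {
                i++;
            }
            while (j < m && sy[j] == z) {
                j++;
            }
            d = Math.max(d, Math.abs(i / (double) n - j / (double) m));
        }
        // Once one sample is exhausted its cdf is 1 and the gap can only shrink.
        return d;
    }
}
```

With x = {1, 2} and y = {3, 4} the samples do not overlap and the sketch returns 1; with x = {1, 3} and y = {2, 4} it returns 0.5.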

kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
Parameters:
 distribution - reference distribution
 data - sample being evaluated
Returns:
 the p-value associated with the null hypothesis that data is a sample from distribution
Throws:
 InsufficientDataException - if data does not have length at least 2
 NullArgumentException - if data is null

kolmogorovSmirnovTest
public boolean kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, double alpha)
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
Parameters:
 distribution - reference distribution
 data - sample being evaluated
 alpha - significance level of the test
Returns:
 true iff the null hypothesis that data is a sample from distribution can be rejected with confidence 1 - alpha
Throws:
 InsufficientDataException - if data does not have length at least 2
 NullArgumentException - if data is null

bootstrap
public double bootstrap(double[] x, double[] y, int iterations, boolean strict, org.apache.commons.rng.UniformRandomProvider rng)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. When strict is true, this is equivalent to the algorithm implemented in the R function ks.boot, described in Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.' Journal of Statistical Software, 42(7): 1-52.
Parameters:
 x - First sample.
 y - Second sample.
 iterations - Number of bootstrap resampling iterations.
 strict - Whether or not the null hypothesis is expressed as a strict inequality.
 rng - RNG for creating the sampling sets.
Returns:
 the estimated p-value.
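
The resampling scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the class name BootstrapKsSketch is hypothetical, java.util.Random stands in for UniformRandomProvider, and the interpretation of strict (count resampled statistics strictly greater than the observed one, otherwise greater than or equal) follows the description of strict elsewhere on this page.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of a bootstrap two-sample K-S p-value; not the Commons Math code. */
final class BootstrapKsSketch {

    /** Two-sample statistic that tolerates ties, which resampling produces. */
    static double statistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < sx.length && j < sy.length) {
            double z = Math.min(sx[i], sy[j]);
            while (i < sx.length && sx[i] == z) {
                i++;
            }
            while (j < sy.length && sy[j] == z) {
                j++;
            }
            d = Math.max(d, Math.abs(i / (double) sx.length - j / (double) sy.length));
        }
        return d;
    }

    /** Estimates the p-value by resampling with replacement from the pooled sample. */
    static double bootstrap(double[] x, double[] y, int iterations, boolean strict, Random rng) {
        double[] pooled = new double[x.length + y.length];
        System.arraycopy(x, 0, pooled, 0, x.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        double observed = statistic(x, y);
        double[] bx = new double[x.length];
        double[] by = new double[y.length];
        int count = 0;
        for (int it = 0; it < iterations; it++) {
            for (int k = 0; k < bx.length; k++) {
                bx[k] = pooled[rng.nextInt(pooled.length)];
            }
            for (int k = 0; k < by.length; k++) {
                by[k] = pooled[rng.nextInt(pooled.length)];
            }
            double d = statistic(bx, by);
            if (strict ? d > observed : d >= observed) {
                count++;
            }
        }
        return count / (double) iterations;
    }
}
```

When the two samples do not overlap, the observed statistic is 1, no resampled statistic can exceed it, and the strict estimate is exactly 0.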

cdf
public double cdf(double d, int n)
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above). The result is not exact as with cdfExact(double, int) because calculations are based on double rather than BigFraction.
Parameters:
 d - statistic
 n - sample size
Returns:
 \(P(D_n < d)\)
Throws:
 MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)

cdfExact
public double cdfExact(double d, int n)
Calculates \(P(D_n < d)\). The result is exact in the sense that BigFraction/BigReal is used everywhere at the expense of very slow execution time. Almost never choose this in real applications unless you are very sure; this is almost solely for verification purposes. Normally, you would choose cdf(double, int). See the class javadoc for definitions and algorithm description.
Parameters:
 d - statistic
 n - sample size
Returns:
 \(P(D_n < d)\)
Throws:
 MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)

cdf
public double cdf(double d, int n, boolean exact)
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
Parameters:
 d - statistic
 n - sample size
 exact - whether the probability should be calculated exactly using BigFraction everywhere at the expense of very slow execution time, or whether double should be used in convenient places to gain speed. Almost never choose true in real applications unless you are very sure; true is almost solely for verification purposes.
Returns:
 \(P(D_n < d)\)
Throws:
 MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\).

pelzGood
public double pelzGood(double d, int n)
Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
Parameters:
 d - value of d-statistic (x in [2])
 n - sample size
Returns:
 \(P(D_n < d)\)
Since:
 3.4

ksSum
public double ksSum(double t, double tolerance, int maxIterations)
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. If the sum does not converge before maxIterations iterations, a TooManyIterationsException is thrown.
Parameters:
 t - argument
 tolerance - Cauchy criterion for partial sums
 maxIterations - maximum number of partial sums to compute
Returns:
 Kolmogorov sum evaluated at t
Throws:
 TooManyIterationsException - if the series does not converge
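
The stopping rule can be made concrete with a short sketch. It is an illustration only, not the library's code: the class name KsSumSketch is hypothetical, and IllegalStateException stands in for TooManyIterationsException. Two successive partial sums differ by exactly the magnitude of the newest term, so the Cauchy criterion reduces to comparing that term with tolerance.

```java
/** Hypothetical sketch of the Kolmogorov series; not the Commons Math code. */
final class KsSumSketch {

    /** Evaluates 1 + 2 * sum_{i>=1} (-1)^i exp(-2 i^2 t^2) with a Cauchy stopping rule. */
    static double ksSum(double t, double tolerance, int maxIterations) {
        double sum = 1.0;
        for (int i = 1; i <= maxIterations; i++) {
            // The leading -2.0 forces double arithmetic, so i * i cannot overflow.
            double term = 2.0 * Math.exp(-2.0 * i * i * t * t);
            sum += (i % 2 == 0) ? term : -term;
            // |S_i - S_{i-1}| equals the magnitude of the i-th term.
            if (term <= tolerance) {
                return sum;
            }
        }
        throw new IllegalStateException("series did not converge in " + maxIterations + " terms");
    }
}
```

At t = 1 the series is 1 - 2e^{-2} + 2e^{-8} - ..., which evaluates to roughly 0.7300003. At t = 0 every term has magnitude 2, so the sketch reports non-convergence.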

exactP
public double exactP(double d, int n, int m, boolean strict)
Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] (class javadoc).
Parameters:
 d - D-statistic value
 n - first sample size
 m - second sample size
 strict - whether or not the probability to compute is expressed as a strict inequality
Returns:
 probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than (resp. greater than or equal to) d
approximateP
public double approximateP(double d, int n, int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See ksSum(double, double, int) for details on how convergence of the sum is determined.
Parameters:
 d - D-statistic value
 n - first sample size
 m - second sample size
Returns:
 approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
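
The formula above translates almost directly into code. The sketch below is illustrative and not the library's implementation: the class name ApproximatePSketch and the helper kolmogorovSum are hypothetical, and the tolerance of 1e-15 and the cap of 100000 terms are assumed values, not the constants the library uses.

```java
/** Hypothetical sketch of the asymptotic two-sample p-value; not the Commons Math code. */
final class ApproximatePSketch {

    /** k(t) = 1 + 2 * sum_{i>=1} (-1)^i exp(-2 i^2 t^2), with assumed stopping constants. */
    static double kolmogorovSum(double t) {
        double sum = 1.0;
        for (int i = 1; i <= 100_000; i++) {
            double term = 2.0 * Math.exp(-2.0 * i * i * t * t);
            sum += (i % 2 == 0) ? term : -term;
            if (term <= 1e-15) {
                return sum;
            }
        }
        throw new IllegalStateException("series did not converge");
    }

    /** Returns 1 - k(d * sqrt(m n / (m + n))). */
    static double approximateP(double d, int n, int m) {
        // Convert to double first so that m * n cannot overflow int.
        double dn = n;
        double dm = m;
        return 1.0 - kolmogorovSum(d * Math.sqrt(dm * dn / (dm + dn)));
    }
}
```

For d = 1 and n = m = 2 the scaled argument is exactly 1, and the sketch returns roughly 0.27. The exact value for those sizes is 0, which shows why the asymptotic form is reserved for large samples.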

monteCarloP
public double monteCarloP(double d, int n, int m, boolean strict, int iterations, org.apache.commons.rng.UniformRandomProvider rng)
Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The simulation generates iterations random partitions of m + n into an n set and an m set, computing \(D_{n,m}\) for each partition and returning the proportion of values that are greater than d, or greater than or equal to d if strict is false.
Parameters:
 d - D-statistic value.
 n - First sample size.
 m - Second sample size.
 iterations - Number of random partitions to generate.
 strict - whether or not the probability to compute is expressed as a strict inequality
 rng - RNG used for generating the partitions.
Returns:
 proportion of randomly generated m-n partitions of m + n that result in \(D_{n,m}\) greater than (resp. greater than or equal to) d.
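
The simulation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the class name MonteCarloPSketch is hypothetical, java.util.Random stands in for UniformRandomProvider, and a random partition is generated by shuffling membership labels over the m + n ranks of a tie-free pooled sample.

```java
import java.util.Random;

/** Hypothetical Monte Carlo sketch of the two-sample tail probability. */
final class MonteCarloPSketch {

    /** Proportion of random partitions with D > d (strict) or D >= d (non-strict). */
    static double monteCarloP(double d, int n, int m, boolean strict, int iterations, Random rng) {
        // isX[k] is true when the k-th smallest pooled value belongs to the n-set.
        boolean[] isX = new boolean[n + m];
        for (int k = 0; k < n; k++) {
            isX[k] = true;
        }
        int hits = 0;
        for (int it = 0; it < iterations; it++) {
            // Fisher-Yates shuffle yields a uniformly random partition.
            for (int k = isX.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                boolean tmp = isX[k];
                isX[k] = isX[r];
                isX[r] = tmp;
            }
            // Walk the ranks in order and track the largest gap between the two cdfs.
            int i = 0;
            int j = 0;
            double dPartition = 0.0;
            for (boolean fromX : isX) {
                if (fromX) {
                    i++;
                } else {
                    j++;
                }
                dPartition = Math.max(dPartition, Math.abs(i / (double) n - j / (double) m));
            }
            if (strict ? dPartition > d : dPartition >= d) {
                hits++;
            }
        }
        return hits / (double) iterations;
    }
}
```

For n = m = 2 every partition has a statistic of at least 0.5 and none exceeds 1, so the non-strict estimate at d = 0.5 is exactly 1 and the strict estimate at d = 1 is exactly 0 regardless of the seed. The comparison with d is done in floating point, so a d that is not exactly representable may be classified differently from an integer-based implementation.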

