The statistics package provides frameworks and implementations for basic descriptive statistics, frequency distributions, bivariate regression, and t-, chi-square and ANOVA test statistics.
Descriptive statistics
Frequency distributions
Simple Regression
Multiple Regression
Rank transformations
Covariance and correlation
Statistical Tests
The stat package includes a framework and default implementations for the following descriptive statistics: arithmetic and geometric means, variance and standard deviation, sum, product, log sum, sum of squared values, minimum, maximum, median and percentiles, skewness and kurtosis, and the first, second, third and fourth moments.
With the exception of percentiles and the median, all of these statistics can be computed without maintaining the full list of input data values in memory. The stat package provides interfaces and implementations that do not require value storage as well as implementations that operate on arrays of stored values.
The top level interface is UnivariateStatistic. This interface, implemented by all statistics, consists of evaluate() methods that take double[] arrays as arguments and return the value of the statistic. This interface is extended by StorelessUnivariateStatistic, which adds increment(), getResult() and associated methods to support "storageless" implementations that maintain counters, sums or other state information as values are added using the increment() method.
Abstract implementations of the top level interfaces are provided in AbstractUnivariateStatistic and AbstractStorelessUnivariateStatistic respectively.
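To make the "storageless" contract concrete, here is a minimal self-contained sketch (illustration only, not the library's StorelessUnivariateStatistic implementation) of a statistic that maintains only a counter and a running value as data are added via increment():

```java
// Minimal sketch of the storageless pattern: state is updated per value,
// and no input values are retained.
public class StorelessMean {
    private long n = 0;
    private double mean = 0.0;

    // Update internal state from one new value; nothing is stored.
    public void increment(double d) {
        n++;
        mean += (d - mean) / n;   // running-mean update
    }

    public double getResult() {
        return n == 0 ? Double.NaN : mean;
    }

    public long getN() {
        return n;
    }

    public static void main(String[] args) {
        StorelessMean mean = new StorelessMean();
        for (double d : new double[] {1.0, 2.0, 3.0, 4.0}) {
            mean.increment(d);
        }
        System.out.println(mean.getResult()); // 2.5
    }
}
```

The real interface also adds clear() and an incrementAll() convenience; the point here is only that evaluate-over-an-array and increment-one-value-at-a-time are two different contracts.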
Each statistic is implemented as a separate class, in one of the subpackages (moment, rank, summary) and each extends one of the abstract classes above (depending on whether or not value storage is required to compute the statistic). There are several ways to instantiate and use statistics. Statistics can be instantiated and used directly, but it is generally more convenient (and efficient) to access them using the provided aggregates, DescriptiveStatistics and SummaryStatistics.
DescriptiveStatistics maintains the input data in memory and has the capability of producing "rolling" statistics computed from a "window" consisting of the most recently added values.
SummaryStatistics does not store the input data values in memory, so the statistics included in this aggregate are limited to those that can be computed in one pass through the data without access to the full array of values.
Aggregate | Statistics Included | Values stored? | "Rolling" capability? |
---|---|---|---|
DescriptiveStatistics | min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis, median | Yes | Yes |
SummaryStatistics | min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance | No | No |
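To see why SummaryStatistics can report a standard deviation without keeping the data in memory, here is a self-contained sketch (not the library's code) of a one-pass, Welford-style variance update:

```java
// Illustrative sketch: the sample variance computed in a single pass,
// keeping only a count, a running mean and a running sum of squared
// deviations -- the constraint SummaryStatistics operates under.
public class OnePassVariance {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // sum of squared deviations from the running mean

    public void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);  // Welford's update
    }

    // Bias-corrected (sample) variance
    public double getVariance() {
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }

    public static void main(String[] args) {
        OnePassVariance v = new OnePassVariance();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            v.increment(x);
        }
        System.out.println(v.getVariance()); // sample variance = 32/7
    }
}
```

Percentiles and the median have no such one-pass update, which is why they appear only in the DescriptiveStatistics row of the table above.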
SummaryStatistics can be aggregated using AggregateSummaryStatistics. This class can be used to concurrently gather statistics for multiple datasets as well as for a combined sample including all of the data.
MultivariateSummaryStatistics is similar to SummaryStatistics but handles n-tuple values instead of scalar values. It can also compute the full covariance matrix for the input data.
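The covariance matrix can likewise be built up one n-tuple at a time. The following is an illustrative sketch (not the library's implementation) of a one-pass update for a single covariance entry; MultivariateSummaryStatistics maintains one such co-moment per pair of components:

```java
// Sketch: one-pass covariance between two components of a tuple stream.
public class OnePassCovariance {
    private long n = 0;
    private double meanX = 0.0, meanY = 0.0;
    private double c = 0.0;   // co-moment: sum of (x - meanX) * (y - meanY)

    public void increment(double x, double y) {
        n++;
        double dx = x - meanX;          // deviation from the OLD x mean
        meanX += dx / n;
        meanY += (y - meanY) / n;
        c += dx * (y - meanY);          // uses old dx and the NEW y mean
    }

    // Bias-corrected (sample) covariance
    public double getCovariance() {
        return n < 2 ? Double.NaN : c / (n - 1);
    }

    public static void main(String[] args) {
        OnePassCovariance cov = new OnePassCovariance();
        cov.increment(1, 2);
        cov.increment(2, 4);
        cov.increment(3, 6);
        System.out.println(cov.getCovariance()); // 2.0
    }
}
```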
Neither DescriptiveStatistics nor SummaryStatistics is thread-safe. SynchronizedDescriptiveStatistics and SynchronizedSummaryStatistics, respectively, provide thread-safe versions for applications that require concurrent access to statistical aggregates by multiple threads. SynchronizedMultivariateSummaryStatistics provides thread-safe MultivariateSummaryStatistics.
There is also a utility class, StatUtils, that provides static methods for computing statistics directly from double[] arrays.
Here are some examples showing how to compute descriptive statistics.
// Get a DescriptiveStatistics instance
DescriptiveStatistics stats = new DescriptiveStatistics();

// Add the data from the array
for (int i = 0; i < inputArray.length; i++) {
    stats.addValue(inputArray[i]);
}

// Compute some statistics
double mean = stats.getMean();
double std = stats.getStandardDeviation();
double median = stats.getPercentile(50);
// Get a SummaryStatistics instance
SummaryStatistics stats = new SummaryStatistics();

// Read data from an input stream,
// adding values and updating sums, counters, etc.
String line = in.readLine();
while (line != null) {
    stats.addValue(Double.parseDouble(line.trim()));
    line = in.readLine();
}
in.close();

// Compute the statistics
double mean = stats.getMean();
double std = stats.getStandardDeviation();
// double median = stats.getMedian(); <-- NOT AVAILABLE
// Compute statistics directly from the array
// assume values is a double[] array
double mean = StatUtils.mean(values);
double variance = StatUtils.variance(values);
double median = StatUtils.percentile(values, 50);

// Compute the mean of the first three values in the array
mean = StatUtils.mean(values, 0, 3);
// Create a DescriptiveStatistics instance and set the window size to 100
DescriptiveStatistics stats = new DescriptiveStatistics();
stats.setWindowSize(100);

// Read data from an input stream,
// displaying the mean of the most recent 100 observations
// after every 100 observations
long nLines = 0;
String line = in.readLine();
while (line != null) {
    stats.addValue(Double.parseDouble(line.trim()));
    if (++nLines == 100) {
        nLines = 0;
        System.out.println(stats.getMean());
    }
    line = in.readLine();
}
in.close();
// Create a SynchronizedDescriptiveStatistics instance and
// use it as any other DescriptiveStatistics instance
DescriptiveStatistics stats = new SynchronizedDescriptiveStatistics();
// Create an AggregateSummaryStatistics instance to accumulate the overall
// statistics and contributing SummaryStatistics instances for the subsamples
AggregateSummaryStatistics aggregate = new AggregateSummaryStatistics();
SummaryStatistics setOneStats = aggregate.createContributingStatistics();
SummaryStatistics setTwoStats = aggregate.createContributingStatistics();

// Add values to the subsample aggregates
setOneStats.addValue(2);
setOneStats.addValue(3);
setTwoStats.addValue(2);
setTwoStats.addValue(4);
...

// Full sample data is reported by the aggregate
double totalSampleSum = aggregate.getSum();
// Create SummaryStatistics instances for the subsample data
SummaryStatistics setOneStats = new SummaryStatistics();
SummaryStatistics setTwoStats = new SummaryStatistics();

// Add values to the subsample SummaryStatistics instances
setOneStats.addValue(2);
setOneStats.addValue(3);
setTwoStats.addValue(2);
setTwoStats.addValue(4);
...

// Aggregate the subsample statistics
Collection<SummaryStatistics> aggregate = new ArrayList<SummaryStatistics>();
aggregate.add(setOneStats);
aggregate.add(setTwoStats);
StatisticalSummary aggregatedStats = AggregateSummaryStatistics.aggregate(aggregate);

// Full sample data is reported by aggregatedStats
double totalSampleSum = aggregatedStats.getSum();
Frequency provides a simple interface for maintaining counts and percentages of discrete values.
Strings, integers, longs and chars are all supported as value types, as well as instances of any class that implements Comparable. The ordering of values used in computing cumulative frequencies is by default the natural ordering, but this can be overridden by supplying a Comparator to the constructor. Adding values that are not comparable to those that have already been added results in an IllegalArgumentException.
Here are some examples.
Frequency f = new Frequency();
f.addValue(1);
f.addValue(new Integer(1));
f.addValue(new Long(1));
f.addValue(2);
f.addValue(new Integer(-1));
System.out.println(f.getCount(1));            // displays 3
System.out.println(f.getCumPct(0));           // displays 0.2
System.out.println(f.getPct(new Integer(1))); // displays 0.6
System.out.println(f.getCumPct(-2));          // displays 0
System.out.println(f.getCumPct(10));          // displays 1
Frequency f = new Frequency();
f.addValue("one");
f.addValue("One");
f.addValue("oNe");
f.addValue("Z");
System.out.println(f.getCount("one"));  // displays 1
System.out.println(f.getCumPct("Z"));   // displays 0.5
System.out.println(f.getCumPct("Ot"));  // displays 0.25
Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER);
f.addValue("one");
f.addValue("One");
f.addValue("oNe");
f.addValue("Z");
System.out.println(f.getCount("one"));  // displays 3
System.out.println(f.getCumPct("z"));   // displays 1
SimpleRegression provides ordinary least squares regression with one independent variable estimating the linear model:
y = intercept + slope * x
or
y = slope * x
Standard errors for intercept and slope are available as well as ANOVA, r-square and Pearson's r statistics.
Observations (x,y pairs) can be added to the model one at a time or they can be provided in a 2-dimensional array. The observations are not stored in memory, so there is no limit to the number of observations that can be added to the model.
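The reason no observations need to be stored is that the OLS slope and intercept depend only on running sums. The following self-contained sketch (not the library's implementation, which updates centered co-moments for better numerical stability) illustrates the idea:

```java
// Sketch: simple regression from running sums; each (x, y) pair updates
// four accumulators and is then discarded.
public class RunningRegression {
    private long n = 0;
    private double sumX, sumY, sumXX, sumXY;

    public void addData(double x, double y) {
        n++;
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    public double getSlope() {
        double sxx = sumXX - sumX * sumX / n;  // centered sum of squares
        double sxy = sumXY - sumX * sumY / n;  // centered cross-product
        return sxy / sxx;
    }

    public double getIntercept() {
        return (sumY - getSlope() * sumX) / n;
    }

    public static void main(String[] args) {
        RunningRegression r = new RunningRegression();
        r.addData(1, 2);
        r.addData(2, 4);
        r.addData(3, 6);
        System.out.println(r.getSlope());     // 2.0
        System.out.println(r.getIntercept()); // 0.0
    }
}
```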
Usage Notes:
Implementation Notes:
Here are some examples.
SimpleRegression regression = new SimpleRegression();
regression.addData(1d, 2d);
// At this point, with only one observation,
// all regression statistics will return NaN

regression.addData(3d, 3d);
// With only two observations,
// slope and intercept can be computed
// but inference statistics will return NaN

regression.addData(3d, 3d);
// Now all statistics are defined.
System.out.println(regression.getIntercept());   // displays intercept of regression line
System.out.println(regression.getSlope());       // displays slope of regression line
System.out.println(regression.getSlopeStdErr()); // displays slope standard error
System.out.println(regression.predict(1.5d)); // displays predicted y value for x = 1.5
double[][] data = {{1, 3}, {2, 5}, {3, 7}, {4, 14}, {5, 11}};
SimpleRegression regression = new SimpleRegression();
regression.addData(data);
System.out.println(regression.getIntercept());   // displays intercept of regression line
System.out.println(regression.getSlope());       // displays slope of regression line
System.out.println(regression.getSlopeStdErr()); // displays slope standard error
double[][] data = {{1, 3}, {2, 5}, {3, 7}, {4, 14}, {5, 11}};
SimpleRegression regression = new SimpleRegression(false);
// the argument, false, tells the class not to include a constant
regression.addData(data);
System.out.println(regression.getIntercept());
// displays intercept of regression line: since we have constrained the constant, 0.0 is returned

System.out.println(regression.getSlope());       // displays slope of regression line
System.out.println(regression.getSlopeStdErr()); // displays slope standard error

System.out.println(regression.getInterceptStdErr());
// returns Double.NaN, since we constrained the parameter to zero
OLSMultipleLinearRegression and GLSMultipleLinearRegression provide least squares regression to fit the linear model:
Y=X*b+u
where Y is an n-vector regressand, X is an [n,k] matrix whose k columns are called regressors, b is a k-vector of regression parameters, and u is an n-vector of error terms or residuals.
OLSMultipleLinearRegression provides Ordinary Least Squares Regression, and GLSMultipleLinearRegression implements Generalized Least Squares. See the javadoc for these classes for details on the algorithms and formulas used.
Data for OLS models can be loaded in a single double[] array, consisting of concatenated rows of data, each containing the regressand (Y) value, followed by regressor values; or using a double[][] array with rows corresponding to observations. GLS models also require a double[][] array representing the covariance matrix of the error terms. See AbstractMultipleLinearRegression#newSampleData(double[],int,int), OLSMultipleLinearRegression#newSampleData(double[], double[][]) and GLSMultipleLinearRegression#newSampleData(double[],double[][],double[][]) for details.
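The concatenated-row layout can be pictured with a short sketch. The unpacking code below is illustrative only (it shows the layout that newSampleData(double[], int, int) is documented to expect, not the library's internal code):

```java
// Sketch: a flat array of concatenated { y, x1, ..., xk } rows, and how
// such an array maps back to a regressand vector y and design matrix x.
public class FlatLayout {
    public static double[] unpackY(double[] data, int nobs, int nvars) {
        double[] y = new double[nobs];
        for (int i = 0; i < nobs; i++) {
            y[i] = data[i * (nvars + 1)];          // first entry of each row
        }
        return y;
    }

    public static double[][] unpackX(double[] data, int nobs, int nvars) {
        double[][] x = new double[nobs][nvars];
        for (int i = 0; i < nobs; i++) {
            for (int j = 0; j < nvars; j++) {
                x[i][j] = data[i * (nvars + 1) + 1 + j];  // remaining entries
            }
        }
        return x;
    }

    public static void main(String[] args) {
        // three observations, two regressors: rows are { y, x1, x2 }
        double[] data = {
            11.0, 1.0, 2.0,
            12.0, 3.0, 4.0,
            13.0, 5.0, 6.0
        };
        System.out.println(unpackY(data, 3, 2)[1]);    // 12.0
        System.out.println(unpackX(data, 3, 2)[1][0]); // 3.0
    }
}
```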
Usage Notes:
Here are some examples.
OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression();
double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
double[][] x = new double[6][];
x[0] = new double[]{0, 0, 0, 0, 0};
x[1] = new double[]{2.0, 0, 0, 0, 0};
x[2] = new double[]{0, 3.0, 0, 0, 0};
x[3] = new double[]{0, 0, 4.0, 0, 0};
x[4] = new double[]{0, 0, 0, 5.0, 0};
x[5] = new double[]{0, 0, 0, 0, 6.0};
regression.newSampleData(y, x);
double[] beta = regression.estimateRegressionParameters();
double[] residuals = regression.estimateResiduals();
double[][] parametersVariance = regression.estimateRegressionParametersVariance();
double regressandVariance = regression.estimateRegressandVariance();
double rSquared = regression.calculateRSquared();
double sigma = regression.estimateRegressionStandardError();
GLSMultipleLinearRegression regression = new GLSMultipleLinearRegression();
double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
double[][] x = new double[6][];
x[0] = new double[]{0, 0, 0, 0, 0};
x[1] = new double[]{2.0, 0, 0, 0, 0};
x[2] = new double[]{0, 3.0, 0, 0, 0};
x[3] = new double[]{0, 0, 4.0, 0, 0};
x[4] = new double[]{0, 0, 0, 5.0, 0};
x[5] = new double[]{0, 0, 0, 0, 6.0};
double[][] omega = new double[6][];
omega[0] = new double[]{1.1, 0, 0, 0, 0, 0};
omega[1] = new double[]{0, 2.2, 0, 0, 0, 0};
omega[2] = new double[]{0, 0, 3.3, 0, 0, 0};
omega[3] = new double[]{0, 0, 0, 4.4, 0, 0};
omega[4] = new double[]{0, 0, 0, 0, 5.5, 0};
omega[5] = new double[]{0, 0, 0, 0, 0, 6.6};
regression.newSampleData(y, x, omega);
Some statistical algorithms require that input data be replaced by ranks. The org.apache.commons.math3.stat.ranking package provides rank transformation. RankingAlgorithm defines the interface for ranking. NaturalRanking provides an implementation that has two configuration options.
Examples:
NaturalRanking ranking = new NaturalRanking(NaNStrategy.MINIMAL, TiesStrategy.MAXIMUM);
double[] exampleData = {20, 17, 30, 42.3, 17, 50, Double.NaN, Double.NEGATIVE_INFINITY, 17};
double[] ranks = ranking.rank(exampleData);
new NaturalRanking(NaNStrategy.REMOVED, TiesStrategy.SEQUENTIAL).rank(exampleData);
The default NaNStrategy is NaNStrategy.MAXIMAL. This makes NaN values larger than any other value (including Double.POSITIVE_INFINITY). The default TiesStrategy is TiesStrategy.AVERAGE, which assigns tied values the average of the ranks applicable to the sequence of ties. See the NaturalRanking javadoc for more examples, and the TiesStrategy and NaNStrategy javadoc for details on these configuration options.
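The default TiesStrategy.AVERAGE behavior can be sketched in a few lines (illustration only, not the NaturalRanking implementation, and ignoring NaN handling):

```java
import java.util.Arrays;

// Sketch: 1-based ranking where tied values all receive the average of the
// ranks they jointly occupy (TiesStrategy.AVERAGE).
public class AverageTiesRanking {
    public static double[] rank(double[] data) {
        // Sort indices by value so the original positions are preserved.
        Integer[] idx = new Integer[data.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(data[a], data[b]));

        double[] ranks = new double[data.length];
        int i = 0;
        while (i < idx.length) {
            // Find the run of tied values starting at sorted position i.
            int j = i;
            while (j + 1 < idx.length && data[idx[j + 1]] == data[idx[i]]) j++;
            double avg = (i + j) / 2.0 + 1;   // average of 1-based ranks i+1 .. j+1
            for (int k = i; k <= j; k++) ranks[idx[k]] = avg;
            i = j + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // 17 occupies sorted positions 1, 2 and 3, so each 17 gets rank 2
        System.out.println(Arrays.toString(rank(new double[] {20, 17, 30, 17, 17})));
        // [4.0, 2.0, 5.0, 2.0, 2.0]
    }
}
```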
The org.apache.commons.math3.stat.correlation package computes covariances and correlations for pairs of arrays or columns of a matrix. Covariance computes covariances, PearsonsCorrelation provides Pearson's Product-Moment correlation coefficients, SpearmansCorrelation computes Spearman's rank correlation and KendallsCorrelation computes Kendall's tau rank correlation.
Implementation Notes
Examples:
new Covariance().covariance(x, y)
covariance(x, y, false)
new Covariance().computeCovarianceMatrix(data)
computeCovarianceMatrix(data, false)
new PearsonsCorrelation().correlation(x, y)
new PearsonsCorrelation().computeCorrelationMatrix(data)
PearsonsCorrelation correlation = new PearsonsCorrelation(data);
correlation.getCorrelationStandardErrors();
correlation.getCorrelationPValues()
new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1)
new SpearmansCorrelation().correlation(x, y)
RankingAlgorithm ranking = new NaturalRanking(); new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
new KendallsCorrelation().correlation(x, y)
The org.apache.commons.math3.stat.inference package provides implementations for Student's t, chi-square, G, one-way ANOVA, Mann-Whitney U, Wilcoxon signed-rank and binomial test statistics, as well as p-values associated with the t-, chi-square, G, one-way ANOVA, Mann-Whitney U and Wilcoxon signed-rank tests. The respective test classes are TTest, ChiSquareTest, GTest, OneWayAnova, MannWhitneyUTest, WilcoxonSignedRankTest and BinomialTest. The TestUtils class provides static methods to get test instances or to compute test statistics directly. The examples below all use the static methods in TestUtils to execute tests. To get test object instances, either use, e.g., TestUtils.getTTest(), or use the implementation constructors directly, e.g., new TTest().
Implementation Notes
Examples:
double[] observed = {1d, 2d, 3d};
double mu = 2.5d;
System.out.println(TestUtils.t(mu, observed));
double[] observed = {1d, 2d, 3d};
double mu = 2.5d;
SummaryStatistics sampleStats = new SummaryStatistics();
for (int i = 0; i < observed.length; i++) {
    sampleStats.addValue(observed[i]);
}
System.out.println(TestUtils.t(mu, sampleStats));
double[] observed = {1d, 2d, 3d};
double mu = 2.5d;
System.out.println(TestUtils.tTest(mu, observed));
TestUtils.tTest(mu, observed, alpha);
To compute the t-statistic:
TestUtils.pairedT(sample1, sample2);
To compute the p-value:
TestUtils.pairedTTest(sample1, sample2);
To perform a fixed significance level test with alpha = .05:
TestUtils.pairedTTest(sample1, sample2, .05);
First create the StatisticalSummary instances. Both DescriptiveStatistics and SummaryStatistics implement this interface. Assume that summary1 and summary2 are SummaryStatistics instances, each of which has had at least 2 values added to the (virtual) dataset that it describes. The sample sizes do not have to be the same -- all that is required is that both samples have at least 2 elements.
Note: The SummaryStatistics class does not store the dataset that it describes in memory, but it does compute all statistics necessary to perform t-tests, so this method can be used to conduct t-tests with very large samples. One-sample tests can also be performed this way. (See Descriptive statistics for details on the SummaryStatistics class.)
To compute the t-statistic:
TestUtils.t(summary1, summary2);
To compute the p-value:
TestUtils.tTest(summary1, summary2);
To perform a fixed significance level test with alpha = .05:
TestUtils.tTest(summary1, summary2, .05);
In each case above, the test does not assume that the subpopulation variances are equal. To perform the tests under this assumption, replace "t" at the beginning of the method name with "homoscedasticT"
long[] observed = {10, 9, 11};
double[] expected = {10.1, 9.8, 10.3};
System.out.println(TestUtils.chiSquare(expected, observed));
TestUtils.chiSquareTest(expected, observed);
TestUtils.chiSquareTest(expected, observed, alpha);
TestUtils.chiSquare(counts);
TestUtils.chiSquareTest(counts);
TestUtils.chiSquareTest(counts, alpha);
double[] expected = new double[]{0.54d, 0.40d, 0.05d, 0.01d};
long[] observed = new long[]{70, 79, 3, 4};
System.out.println(TestUtils.g(expected, observed));
TestUtils.gTest(expected, observed);
TestUtils.gTest(expected, observed, alpha);
long[] obs1 = new long[]{268, 199, 42};
long[] obs2 = new long[]{807, 759, 184};
System.out.println(TestUtils.gDataSetsComparison(obs1, obs2));     // G statistic
System.out.println(TestUtils.gTestDataSetsComparison(obs1, obs2)); // p-value
new GTest().rootLogLikelihoodRatio(5, 1995, 0, 100000);
double[] classA = {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0};
double[] classB = {99.0, 92.0, 102.0, 100.0, 102.0, 89.0};
double[] classC = {110.0, 115.0, 111.0, 117.0, 128.0, 117.0};
List<double[]> classes = new ArrayList<double[]>();
classes.add(classA);
classes.add(classB);
classes.add(classC);
double fStatistic = TestUtils.oneWayAnovaFValue(classes); // F-value
double pValue = TestUtils.oneWayAnovaPValue(classes);     // p-value
TestUtils.oneWayAnovaTest(classes, 0.01); // returns a boolean;
// true means reject the null hypothesis