org.apache.commons.math3.stat.regression
Class SimpleRegression

java.lang.Object
  extended by org.apache.commons.math3.stat.regression.SimpleRegression
All Implemented Interfaces:
Serializable, UpdatingMultipleLinearRegression

public class SimpleRegression
extends Object
implements Serializable, UpdatingMultipleLinearRegression

Estimates an ordinary least squares regression model with one independent variable.

y = intercept + slope * x

Standard errors for intercept and slope are available as well as ANOVA, r-square and Pearson's r statistics.

Observations (x,y pairs) can be added to the model one at a time or they can be provided in a 2-dimensional array. The observations are not stored in memory, so there is no limit to the number of observations that can be added to the model.
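To make the model concrete, the following is a minimal, dependency-free sketch of a one-variable least squares fit computed from two arrays. The class OlsSketch and its methods are hypothetical names chosen for illustration; this is not the library's implementation, which accumulates its sums incrementally instead of reading stored arrays.

```java
/** Illustrative sketch only (hypothetical helper, not part of Commons Math). */
final class OlsSketch {

    /** Returns {intercept, slope} of the least squares line y = intercept + slope * x. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        int n = x.length;
        double xBar = 0.0;
        double yBar = 0.0;
        for (int i = 0; i < n; i++) {
            xBar += x[i];
            yBar += y[i];
        }
        xBar /= n;
        yBar /= n;

        double sxx = 0.0; // sum of squared x deviations about the mean
        double sxy = 0.0; // sum of products of x and y deviations
        for (int i = 0; i < n; i++) {
            sxx += (x[i] - xBar) * (x[i] - xBar);
            sxy += (x[i] - xBar) * (y[i] - yBar);
        }
        // With no variation in x this is 0/0, so both estimates become NaN.
        double slope = sxy / sxx;
        double intercept = yBar - slope * xBar;
        return new double[] {intercept, slope};
    }

    /** Evaluates the fitted line at x. */
    static double predict(double[] fit, double x) {
        return fit[0] + fit[1] * x;
    }

    public static void main(String[] args) {
        double[] fit = fit(new double[] {1, 2, 3, 4}, new double[] {3, 5, 7, 9});
        System.out.println("intercept=" + fit[0] + ", slope=" + fit[1]);
    }
}
```

The two-pass form is used here only because it is easy to read; it requires the data to be held in memory, which SimpleRegression avoids.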

Usage Notes:

Version:
$Id: SimpleRegression.java 1416643 2012-12-03 19:37:14Z tn $
See Also:
Serialized Form

Constructor Summary
SimpleRegression()
          Creates an empty SimpleRegression instance.
SimpleRegression(boolean includeIntercept)
          Create a SimpleRegression instance, specifying whether or not to estimate an intercept.
 
Method Summary
 void addData(double[][] data)
          Adds the observations represented by the elements in data.
 void addData(double x, double y)
          Adds the observation (x,y) to the regression data set.
 void addObservation(double[] x, double y)
          Adds one observation to the regression model.
 void addObservations(double[][] x, double[] y)
          Adds a series of observations to the regression model.
 void clear()
          Clears all data from the model.
 double getIntercept()
          Returns the intercept of the estimated regression line, if hasIntercept() is true; otherwise 0.
 double getInterceptStdErr()
          Returns the standard error of the intercept estimate, usually denoted s(b0).
 double getMeanSquareError()
          Returns the sum of squared errors divided by the degrees of freedom, usually abbreviated MSE.
 long getN()
          Returns the number of observations that have been added to the model.
 double getR()
          Returns Pearson's product moment correlation coefficient, usually denoted r.
 double getRegressionSumSquares()
          Returns the sum of squared deviations of the predicted y values about their mean (which equals the mean of y).
 double getRSquare()
          Returns the coefficient of determination, usually denoted r-square.
 double getSignificance()
          Returns the significance level of the slope (equivalently, of the correlation).
 double getSlope()
          Returns the slope of the estimated regression line.
 double getSlopeConfidenceInterval()
          Returns the half-width of a 95% confidence interval for the slope estimate.
 double getSlopeConfidenceInterval(double alpha)
          Returns the half-width of a (100-100*alpha)% confidence interval for the slope estimate.
 double getSlopeStdErr()
          Returns the standard error of the slope estimate, usually denoted s(b1).
 double getSumOfCrossProducts()
          Returns the sum of crossproducts, xi*yi.
 double getSumSquaredErrors()
          Returns the sum of squared errors (SSE) associated with the regression model.
 double getTotalSumSquares()
          Returns the sum of squared deviations of the y values about their mean.
 double getXSumSquares()
          Returns the sum of squared deviations of the x values about their mean.
 boolean hasIntercept()
          Returns true if the model includes an intercept term.
 double predict(double x)
          Returns the "predicted" y value associated with the supplied x value, based on the data that has been added to the model when this method is invoked.
 RegressionResults regress()
          Performs a regression on data present in buffers and outputs a RegressionResults object.
 RegressionResults regress(int[] variablesToInclude)
          Performs a regression on data present in buffers, including only regressors indexed in variablesToInclude, and outputs a RegressionResults object.
 void removeData(double[][] data)
          Removes observations represented by the elements in data.
 void removeData(double x, double y)
          Removes the observation (x,y) from the regression data set.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SimpleRegression

public SimpleRegression()
Creates an empty SimpleRegression instance.


SimpleRegression

public SimpleRegression(boolean includeIntercept)
Create a SimpleRegression instance, specifying whether or not to estimate an intercept.

Use false to estimate a model with no intercept. When the hasIntercept property is false, the model is estimated without a constant term and getIntercept() returns 0.

Parameters:
includeIntercept - whether or not to include an intercept term in the regression model
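For a model estimated without a constant term, the standard least squares slope for a line through the origin is the sum of x*y divided by the sum of x*x. The sketch below illustrates that textbook formula; NoInterceptSketch is a hypothetical name and the code is not taken from the library.

```java
/** Illustrative sketch only (hypothetical helper, not part of Commons Math). */
final class NoInterceptSketch {

    /** Least squares slope of y = slope * x, i.e. a line forced through the origin. */
    static double slopeThroughOrigin(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double sumXY = 0.0; // raw (uncentered) cross products
        double sumXX = 0.0; // raw (uncentered) squares of x
        for (int i = 0; i < x.length; i++) {
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        // 0/0 yields NaN when there is no data or every x is zero.
        return sumXY / sumXX;
    }
}
```

Note that the sums are not centered about the means here, which is the main difference from the model that includes an intercept.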
Method Detail

addData

public void addData(double x,
                    double y)
Adds the observation (x,y) to the regression data set.

Uses updating formulas for means and sums of squares defined in "Algorithms for Computing the Sample Variance: Analysis and Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. 1983, American Statistician, vol. 37, pp. 242-247, referenced in Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985.

Parameters:
x - independent variable value
y - dependent variable value
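The idea behind updating formulas of this kind can be sketched as follows: each new observation adjusts the running means and the centered sums of squares and cross products, so no earlier observation has to be revisited. RunningSums and its members are hypothetical names, and the sketch assumes a model with an intercept; it is written in the spirit of the cited algorithm and is not a copy of the library source.

```java
/** Illustrative sketch only (hypothetical class, not part of Commons Math). */
final class RunningSums {
    private long n;        // observations seen so far
    private double xBar;   // running mean of x
    private double yBar;   // running mean of y
    private double sumXX;  // sum of squared x deviations about xBar
    private double sumYY;  // sum of squared y deviations about yBar
    private double sumXY;  // sum of products of x and y deviations

    /** Folds one observation into the running statistics in O(1) time and memory. */
    void add(double x, double y) {
        if (n == 0) {
            xBar = x;
            yBar = y;
        } else {
            double dx = x - xBar;      // deviation from the old mean
            double dy = y - yBar;
            double w = n / (n + 1.0);  // weight that corrects for the mean shift
            sumXX += dx * dx * w;
            sumYY += dy * dy * w;
            sumXY += dx * dy * w;
            xBar += dx / (n + 1.0);
            yBar += dy / (n + 1.0);
        }
        n++;
    }

    long getN() { return n; }
    double getSumXX() { return sumXX; }
    double getSumYY() { return sumYY; }
    double getSumXY() { return sumXY; }
    double getSlope() { return sumXY / sumXX; }
    double getIntercept() { return yBar - getSlope() * xBar; }
}
```

Because only six numbers are kept, memory use is constant regardless of how many observations are added.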

removeData

public void removeData(double x,
                       double y)
Removes the observation (x,y) from the regression data set.

Mirrors the addData method. This method permits the use of SimpleRegression instances in streaming mode, where the regression is applied to a sliding "window" of observations; however, the caller is responsible for maintaining the set of observations in the window.

The method has no effect if there are no data points (i.e. n = 0).

Parameters:
x - independent variable value
y - dependent variable value
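The sliding-window pattern described above can be sketched without the library: the caller keeps the window contents in a queue, and each time the window is full the oldest pair is subtracted from the running sums before the new pair is added. SlidingWindowSketch is a hypothetical class; its private add and remove methods stand in for addData and removeData and use the standard downdating identities, which may differ in detail from the library's code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch only (hypothetical class, not part of Commons Math). */
final class SlidingWindowSketch {
    private final int capacity;
    private final Deque<double[]> window = new ArrayDeque<>(); // caller-side storage
    private long n;
    private double xBar, yBar, sumXX, sumXY;

    SlidingWindowSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Adds a pair, evicting the oldest pair first if the window is full. */
    void offer(double x, double y) {
        if (window.size() == capacity) {
            double[] oldest = window.removeFirst();
            remove(oldest[0], oldest[1]);
        }
        window.addLast(new double[] {x, y});
        add(x, y);
    }

    private void add(double x, double y) {
        if (n == 0) {
            xBar = x;
            yBar = y;
        } else {
            double dx = x - xBar, dy = y - yBar, w = n / (n + 1.0);
            sumXX += dx * dx * w;
            sumXY += dx * dy * w;
            xBar += dx / (n + 1.0);
            yBar += dy / (n + 1.0);
        }
        n++;
    }

    private void remove(double x, double y) {
        if (n == 0) {
            return; // nothing to remove
        }
        if (n == 1) { // removing the last point resets everything
            n = 0;
            xBar = yBar = sumXX = sumXY = 0.0;
            return;
        }
        double dx = x - xBar, dy = y - yBar, w = n / (n - 1.0);
        sumXX -= dx * dx * w;
        sumXY -= dx * dy * w;
        xBar -= dx / (n - 1.0);
        yBar -= dy / (n - 1.0);
        n--;
    }

    long size() { return n; }
    double slope() { return sumXY / sumXX; }
}
```

The queue is what "maintaining the set of observations" amounts to in practice: the running sums alone cannot tell which pair is the oldest.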

addData

public void addData(double[][] data)
             throws ModelSpecificationException
Adds the observations represented by the elements in data.

(data[0][0],data[0][1]) will be the first observation, then (data[1][0],data[1][1]), etc.

This method does not replace data that has already been added. The observations represented by data are added to the existing dataset.

To replace all data, use clear() before adding the new data.

Parameters:
data - array of observations to be added
Throws:
ModelSpecificationException - if the length of any data[i] is less than 2
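Since each row of data holds one (x, y) pair, data that starts out as two parallel arrays has to be reshaped first. The helper below is a small sketch of that reshaping; PairLayout and toPairs are hypothetical names, not library methods.

```java
/** Illustrative sketch only (hypothetical helper, not part of Commons Math). */
final class PairLayout {

    /** Builds rows of the form {x[i], y[i]} so that row i is the i-th observation. */
    static double[][] toPairs(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double[][] data = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            data[i] = new double[] {x[i], y[i]}; // column 0 is x, column 1 is y
        }
        return data;
    }
}
```

The resulting array has the row layout described for addData(double[][]).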

addObservation

public void addObservation(double[] x,
                           double y)
                    throws ModelSpecificationException
Adds one observation to the regression model.

Specified by:
addObservation in interface UpdatingMultipleLinearRegression
Parameters:
x - the independent variables which form the design matrix
y - the dependent or response variable
Throws:
ModelSpecificationException - if the length of x does not equal the number of independent variables in the model

addObservations

public void addObservations(double[][] x,
                            double[] y)
                     throws ModelSpecificationException
Adds a series of observations to the regression model. The lengths of x and y must be the same and x must be rectangular.

Specified by:
addObservations in interface UpdatingMultipleLinearRegression
Parameters:
x - a series of observations on the independent variables
y - a series of observations on the dependent variable; the lengths of x and y must be the same
Throws:
ModelSpecificationException - if x is not rectangular, does not match the length of y or does not contain sufficient data to estimate the model

removeData

public void removeData(double[][] data)
Removes observations represented by the elements in data.

If the array is larger than the current n, only the first n elements are processed. This method permits the use of SimpleRegression instances in streaming mode, where the regression is applied to a sliding "window" of observations; however, the caller is responsible for maintaining the set of observations in the window.

To remove all data, use clear().

Parameters:
data - array of observations to be removed

clear

public void clear()
Clears all data from the model.

Specified by:
clear in interface UpdatingMultipleLinearRegression

getN

public long getN()
Returns the number of observations that have been added to the model.

Specified by:
getN in interface UpdatingMultipleLinearRegression
Returns:
the number of observations that have been added

predict

public double predict(double x)
Returns the "predicted" y value associated with the supplied x value, based on the data that has been added to the model when this method is invoked.

predict(x) = intercept + slope * x

Preconditions:

Parameters:
x - input x value
Returns:
predicted y value

getIntercept

public double getIntercept()
Returns the intercept of the estimated regression line, if hasIntercept() is true; otherwise 0.

The least squares estimate of the intercept is computed using the normal equations. The intercept is sometimes denoted b0.

Preconditions:

Returns:
the intercept of the regression line if the model includes an intercept; 0 otherwise
See Also:
SimpleRegression(boolean)

hasIntercept

public boolean hasIntercept()
Returns true if the model includes an intercept term.

Specified by:
hasIntercept in interface UpdatingMultipleLinearRegression
Returns:
true if the regression includes an intercept; false otherwise
See Also:
SimpleRegression(boolean)

getSlope

public double getSlope()
Returns the slope of the estimated regression line.

The least squares estimate of the slope is computed using the normal equations. The slope is sometimes denoted b1.

Preconditions:

Returns:
the slope of the regression line

getSumSquaredErrors

public double getSumSquaredErrors()
Returns the sum of squared errors (SSE) associated with the regression model.

The sum is computed using the computational formula

SSE = SYY - (SXY * SXY / SXX)

where SYY is the sum of the squared deviations of the y values about their mean, SXX is similarly defined and SXY is the sum of the products of x and y mean deviations.

The sums are accumulated using the updating algorithm referenced in addData(double, double).

The return value is constrained to be non-negative - i.e., if due to rounding errors the computational formula returns a negative result, 0 is returned.

Preconditions:

Returns:
sum of squared errors associated with the regression model
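The computational formula and the non-negativity constraint can be sketched in a few lines. SseSketch is a hypothetical name; the three arguments are the centered sums SYY, SXY and SXX defined above.

```java
/** Illustrative sketch only (hypothetical helper, not part of Commons Math). */
final class SseSketch {

    /** SSE = SYY - SXY^2 / SXX, clamped at zero to absorb rounding error. */
    static double sse(double syy, double sxy, double sxx) {
        // Math.max propagates NaN, so an undefined result (e.g. sxx == 0 with sxy == 0) stays NaN.
        return Math.max(0.0, syy - sxy * sxy / sxx);
    }
}
```

For the points (1,1), (2,2), (3,4) the sums are SYY = 14/3, SXY = 3 and SXX = 2, which gives SSE = 1/6, the same value obtained by summing the squared residuals directly.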

getTotalSumSquares

public double getTotalSumSquares()
Returns the sum of squared deviations of the y values about their mean.

This is defined as SSTO here.

If n < 2, this returns Double.NaN.

Returns:
sum of squared deviations of y values

getXSumSquares

public double getXSumSquares()
Returns the sum of squared deviations of the x values about their mean. If n < 2, this returns Double.NaN.

Returns:
sum of squared deviations of x values

getSumOfCrossProducts

public double getSumOfCrossProducts()
Returns the sum of crossproducts, xi*yi.

Returns:
sum of cross products

getRegressionSumSquares

public double getRegressionSumSquares()
Returns the sum of squared deviations of the predicted y values about their mean (which equals the mean of y).

This is usually abbreviated SSR or SSM. It is defined as SSM here.

Preconditions:

Returns:
sum of squared deviations of predicted y values

getMeanSquareError

public double getMeanSquareError()
Returns the sum of squared errors divided by the degrees of freedom, usually abbreviated MSE.

If there are fewer than three data pairs in the model, or if there is no variation in x, this returns Double.NaN.

Returns:
sum of squared errors divided by the degrees of freedom (MSE)

getR

public double getR()
Returns Pearson's product moment correlation coefficient, usually denoted r.

Preconditions:

Returns:
Pearson's r

getRSquare

public double getRSquare()
Returns the coefficient of determination, usually denoted r-square.

Preconditions:

Returns:
r-square
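The relationship between the sums of squares, r-square and Pearson's r can be sketched as below, assuming the usual definitions: r-square is the share of the total sum of squares explained by the regression, and r is its square root carrying the sign of the slope. CorrelationSketch is a hypothetical name.

```java
/** Illustrative sketch only (hypothetical helper, not part of Commons Math). */
final class CorrelationSketch {

    /** r-square = (SSTO - SSE) / SSTO, the explained share of total variation in y. */
    static double rSquare(double ssto, double sse) {
        return (ssto - sse) / ssto;
    }

    /** Pearson's r: the square root of r-square, signed like the slope. */
    static double r(double slope, double rSquare) {
        double root = Math.sqrt(rSquare);
        return slope < 0 ? -root : root;
    }
}
```

For the points (1,1), (2,2), (3,4), SSTO = 14/3 and SSE = 1/6, so r-square is 27/28.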

getInterceptStdErr

public double getInterceptStdErr()
Returns the standard error of the intercept estimate, usually denoted s(b0).

If there are fewer than three observations in the model, or if there is no variation in x, this returns Double.NaN.

Additionally, Double.NaN is returned when the intercept is constrained to be zero.

Returns:
standard error associated with intercept estimate

getSlopeStdErr

public double getSlopeStdErr()
Returns the standard error of the slope estimate, usually denoted s(b1).

If there are fewer than three data pairs in the model, or if there is no variation in x, this returns Double.NaN.

Returns:
standard error associated with slope estimate
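The standard errors follow from the mean square error by the usual textbook formulas for a model with an intercept: s(b1) = sqrt(MSE / SXX) and s(b0) = sqrt(MSE * (1/n + xbar^2 / SXX)), with MSE = SSE / (n - 2). The sketch below illustrates them; StdErrSketch is a hypothetical name, and the code assumes SXX > 0.

```java
/** Illustrative sketch only (hypothetical helper, not part of Commons Math). */
final class StdErrSketch {

    /** MSE = SSE / (n - 2) for a model with an intercept; NaN with fewer than three points. */
    static double meanSquareError(double sse, long n) {
        return n < 3 ? Double.NaN : sse / (n - 2);
    }

    /** s(b1) = sqrt(MSE / SXX); assumes sxx > 0. */
    static double slopeStdErr(double mse, double sxx) {
        return Math.sqrt(mse / sxx);
    }

    /** s(b0) = sqrt(MSE * (1/n + xBar^2 / SXX)); assumes sxx > 0. */
    static double interceptStdErr(double mse, long n, double xBar, double sxx) {
        return Math.sqrt(mse * (1.0 / n + xBar * xBar / sxx));
    }
}
```

The n - 2 divisor reflects the two estimated parameters, which is why at least three observations are needed before these statistics are defined.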

getSlopeConfidenceInterval

public double getSlopeConfidenceInterval()
                                  throws OutOfRangeException
Returns the half-width of a 95% confidence interval for the slope estimate.

The 95% confidence interval is

(getSlope() - getSlopeConfidenceInterval(), getSlope() + getSlopeConfidenceInterval())

If there are fewer than three observations in the model, or if there is no variation in x, this returns Double.NaN.

Usage Note:
The validity of this statistic depends on the assumption that the observations included in the model are drawn from a Bivariate Normal Distribution.

Returns:
half-width of 95% confidence interval for the slope estimate
Throws:
OutOfRangeException - if the confidence interval cannot be computed

getSlopeConfidenceInterval

public double getSlopeConfidenceInterval(double alpha)
                                  throws OutOfRangeException
Returns the half-width of a (100-100*alpha)% confidence interval for the slope estimate.

The (100-100*alpha)% confidence interval is

(getSlope() - getSlopeConfidenceInterval(alpha), getSlope() + getSlopeConfidenceInterval(alpha))

To request, for example, a 99% confidence interval, use alpha = 0.01.

Usage Note:
The validity of this statistic depends on the assumption that the observations included in the model are drawn from a Bivariate Normal Distribution.

Preconditions:

Parameters:
alpha - the desired significance level
Returns:
half-width of (100-100*alpha)% confidence interval for the slope estimate
Throws:
OutOfRangeException - if the confidence interval cannot be computed
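As a worked statement of what the half-width represents, the usual construction for a model with an intercept multiplies the slope's standard error by a Student's t quantile with n - 2 degrees of freedom. The notation below (h for the half-width, b1 for the slope estimate) is introduced here for illustration and assumes that standard construction.

```latex
h(\alpha) = s(b_1)\, t_{\,n-2,\;1-\alpha/2},
\qquad
\text{CI}_{1-\alpha} = \bigl(b_1 - h(\alpha),\; b_1 + h(\alpha)\bigr)
```

With alpha = 0.05 this is the 95% interval returned by the no-argument overload; a smaller alpha gives a larger quantile and therefore a wider interval.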

getSignificance

public double getSignificance()
Returns the significance level of the slope (equivalently, of the correlation).

Specifically, the returned value is the smallest alpha such that the slope confidence interval with significance level equal to alpha does not include 0. On regression output, this is often denoted Prob(|t| > 0)

Usage Note:
The validity of this statistic depends on the assumption that the observations included in the model are drawn from a Bivariate Normal Distribution.

If there are fewer than three observations in the model, or if there is no variation in x, this returns Double.NaN.

Returns:
significance level for slope/correlation
Throws:
MaxCountExceededException - if the significance level cannot be computed

regress

public RegressionResults regress()
                          throws ModelSpecificationException,
                                 NoDataException
Performs a regression on data present in buffers and outputs a RegressionResults object.

If there are fewer than 3 observations in the model and hasIntercept is true, a NoDataException is thrown. If there is no intercept term, the model must contain at least 2 observations.

Specified by:
regress in interface UpdatingMultipleLinearRegression
Returns:
RegressionResults acts as a container of regression output
Throws:
ModelSpecificationException - if the model is not correctly specified
NoDataException - if there is not sufficient data in the model to estimate the regression parameters

regress

public RegressionResults regress(int[] variablesToInclude)
                          throws MathIllegalArgumentException
Performs a regression on data present in buffers, including only regressors indexed in variablesToInclude, and outputs a RegressionResults object.

Specified by:
regress in interface UpdatingMultipleLinearRegression
Parameters:
variablesToInclude - an array of indices of regressors to include
Returns:
RegressionResults acts as a container of regression output
Throws:
MathIllegalArgumentException - if the variablesToInclude array is null or zero length
OutOfRangeException - if a requested variable is not present in model


Copyright © 2003-2012 The Apache Software Foundation. All Rights Reserved.