  • NLREG has been selected as the "Editor's Pick" by SoftSeek.

  • NLREG is in use at hundreds of universities, laboratories, and government agencies around the world (over 20 countries). For a list of a few organizations using NLREG click here.

  • If you have categorical variables, you may want to use a Decision Tree to model your data. Check out the DTREG Decision Tree Builder.

  • You also should check out the News Rover program that automatically scans Usenet newsgroups, downloads messages of interest to you, decodes binary file attachments, reconstructs files split across multiple messages, and eliminates spam and duplicate files.

    NLREG -- Understanding the Results

    Understanding the Results of an Analysis


    Descriptive Statistics for Variables

    NLREG prints a variety of statistics at the end of each analysis. For each variable, NLREG lists the minimum value, the maximum value, the mean value, and the standard deviation. You should confirm that these values are within the ranges you expect.

    Parameter Estimates

    For each parameter, NLREG displays the initial parameter estimate (which you specified on the PARAMETER statement, or 1 by default), the final (maximum likelihood) estimate, the standard error of the estimated parameter value, the "t" statistic comparing the estimated parameter value with zero, and the significance of the t statistic. Nine significant digits are displayed for the parameter estimates. If you need to determine the parameters to greater precision, use the POUTPUT statement.


    The final estimated parameter values are the results of the analysis. By substituting these values in the equation you specified to be fitted to the data, you will have a function that can be used to predict the value of the dependent variable based on a set of values for the independent variables. For example, if the equation being fitted is


    y = p0 + p1*x


    and the final estimates are 1.5 for p0 and 3 for p1, then the equation


    y = 1.5 + 3*x


    is the best equation of this form that will predict the value of y based on the value of x.
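    For a straight line, the least-squares estimates can be computed in closed form. The sketch below uses made-up data that happens to lie exactly on the example line y = 1.5 + 3*x; NLREG's iterative algorithm would converge to the same values for this linear case.

```python
# Closed-form least-squares fit of y = p0 + p1*x.
# The data values here are illustrative, not from an NLREG example file.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.5, 4.5, 7.5, 10.5, 13.5]   # exactly y = 1.5 + 3*x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# For a straight line: p1 = cov(x, y) / var(x), p0 = mean(y) - p1*mean(x)
p1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
p0 = mean_y - p1 * mean_x

print(p0, p1)  # 1.5 3.0
```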


    t Statistic

    The "t'' statistic is computed by dividing the estimated value of the parameter by its standard error. This statistic is a measure of the likelihood that the actual value of the parameter is not zero. The larger the absolute value of t, the less likely that the actual value of the parameter could be zero.


    The "Prob(t)'' value is the probability of obtaining the estimated value of the parameter if the actual parameter value is zero. The smaller the value of Prob(t), the more significant the parameter and the less likely that the actual parameter value is zero. For example, assume the estimated value of a parameter is 1.0 and its standard error is 0.7. Then the t value would be 1.43 (1.0/0.7). If the computed Prob(t) value was 0.05 then this indicates that there is only a 0.05 (5%) chance that the actual value of the parameter could be zero. If Prob(t) was 0.001 this indicates there is only 1 chance in 1000 that the parameter could be zero. If Prob(t) was 0.92 this indicates that there is a 92% probability that the actual value of the parameter could be zero; this implies that the term of the regression equation containing the parameter can be eliminated without significantly affecting the accuracy of the regression.


    One thing that can cause Prob(t) to be 1.00 (or near 1.00) is having redundant parameters. If at the end of an analysis several parameters have Prob(t) values of 1.00, check the function carefully to see if one or more of the parameters can be removed. Also try using a DOUBLE statement to set one or more of the parameters to a reasonable fixed value; if the other parameters suddenly become significant (i.e., Prob(t) much less than 1.00) then the parameters are mutually dependent and one or more should be removed.


    The t statistic probability is computed using a two-sided test. The CONFIDENCE statement can be used to cause NLREG to print confidence intervals for parameter values. The SQUARE.NLR example regression includes an extraneous parameter (p0) whose estimated value is much smaller than its standard error; the Prob(t) value is 0.99982 indicating that there is a high probability that the value is zero.

    Final Sum of Squared Deviations

    In addition to the variable and parameter values, NLREG displays several statistics that indicate how well the equation fits the data. The "Final sum of squared deviations" is the sum of the squared differences between the actual value of the dependent variable for each observation and the value predicted by the function, using the final parameter estimates.

    Average and Maximum Deviation


    The "Average deviation'' is the average over all observations of the absolute value of the difference between the actual value of the dependent variable and its predicted value.


    The "Maximum deviation for any observation'' is the maximum difference (ignoring sign) between the actual and predicted value of the dependent variable for any observation.

    Proportion of Variance Explained

    The "Proportion of variance explained (R2)'' indicates how much better the function predicts the dependent variable than just using the mean value of the dependent variable. This is also known as the "coefficient of multiple determination.'' It is computed as follows: Suppose that we did not fit an equation to the data and ignored all information about the independent variables in each observation. Then, the best prediction for the dependent variable value for any observation would be the mean value of the dependent variable over all observations. The "variance'' is the sum of the squared differences between the mean value and the value of the dependent variable for each observation. Now, if we use our fitted function to predict the value of the dependent variable, rather than using the mean value, a second kind of variance can be computed by taking the sum of the squared difference between the value of the dependent variable predicted by the function and the actual value. Hopefully, the variance computed by using the values predicted by the function is better (i.e., a smaller value) than the variance computed using the mean value. The "Proportion of variance explained'' is computed as 1 (variance using predicted value / variance using mean). If the function perfectly predicts the observed data, the value of this statistic will be 1.00 (100%). If the function does no better a job of predicting the dependent variable than using the mean, the value will be 0.00.

    Adjusted Coefficient of Multiple Determination

    The "adjusted coefficient of multiple determination (Ra2)'' is an R2 statistic adjusted for the number of parameters in the equation and the number of data observations. It is a more conservative estimate of the percent of variance explained, especially when the sample size is small compared to the number of parameters.

    Durbin-Watson Statistic

    The "Durbin-Watson test for autocorrelation'' is a statistic that indicates the likelihood that the deviation (error) values for the regression have a first-order autoregression component. The regression models assume that the error deviations are uncorrelated.


    In business and economics, many regression applications involve time series data. If a non-periodic function, such as a straight line, is fitted to periodic data, the deviations have a periodic form and are positively correlated over time; these deviations are said to be "autocorrelated" or "serially correlated." Autocorrelated deviations may also indicate that the form (shape) of the function being fitted is inappropriate for the data values (e.g., a linear equation fitted to quadratic data).


    If the deviations are autocorrelated, there may be a number of consequences for the computed results: 1) The estimated regression coefficients no longer have the minimum variance property; 2) the mean square error (MSE) may seriously underestimate the variance of the error terms; 3) the computed standard error of the estimated parameter values may underestimate the true standard error, in which case the t values and confidence intervals may be incorrect. Note that if an appropriate periodic function is fitted to periodic data, the deviations from the regression will be uncorrelated because the cycle of the data values is accounted for by the fitted function.


    Small values of the Durbin-Watson statistic indicate the presence of autocorrelation. Consult significance tables in a good statistics book for exact interpretations; however, a value less than 0.80 usually indicates that autocorrelation is likely. If the Durbin-Watson statistic indicates that the residual values are autocorrelated, it is recommended that you use the RPLOT and/or NPLOT statements to display a plot of the residual values.


    If the data has a regular, periodic component you can try including a sin term in your function. The TREND.NLR example fits a function with a sin term to data that has a linear growth with a superimposed sin component. With the sin term the function has a residual value of 29.39 and a Durbin-Watson value of 2.001; without the sin term (i.e., fitting only a linear function) the residual value is 119.16 and the Durbin-Watson value is 0.624 indicating strong autocorrelation. The general form of a sin term is

    amplitude * sin(2*pi*(x-phase)/period)

    where amplitude is a parameter that determines the magnitude of the sin component, period determines the period of the oscillation, and phase determines the phase relative to the starting value. If you know the period (e.g., 12 for monthly data with an annual cycle) you should specify it rather than having NLREG attempt to determine it.
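    The general sin term above can be evaluated directly. The parameter values in this sketch are made up (they are not from the TREND.NLR example); period 12 matches the monthly-data case mentioned in the text.

```python
import math

# The general sin term: amplitude * sin(2*pi*(x - phase)/period).
# Parameter values here are illustrative only.
def sin_term(x, amplitude, phase, period):
    return amplitude * math.sin(2 * math.pi * (x - phase) / period)

# With period 12 (monthly data, annual cycle), the term repeats every
# 12 observations:
a = sin_term(3, amplitude=2.0, phase=1.0, period=12)
b = sin_term(3 + 12, amplitude=2.0, phase=1.0, period=12)
```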


    If an NPLOT statement is used to produce a normal probability plot of the residuals, the correlation between the residuals and their expected values (assuming they are normally distributed) is printed in the listing. If the residuals are normally distributed, the correlation should be close to 1.00. A correlation less than 0.94 suggests that the residuals are not normally distributed.

    Analysis of Variance Table

    An "Analysis of Variance'' table provides statistics about the overall significance of the model being fitted.


    F Value and Prob(F)


    The "F value'' and "Prob(F)'' statistics test the overall significance of the regression model. Specifically, they test the null hypothesis that all of the regression coefficients are equal to zero. This tests the full model against a model with no variables and with the estimate of the dependent variable being the mean of the values of the dependent variable. The F value is the ratio of the mean regression sum of squares divided by the mean error sum of squares. Its value will range from zero to an arbitrarily large number.


    The value of Prob(F) is the probability that the null hypothesis for the full model is true (i.e., that all of the regression coefficients are zero). For example, if Prob(F) has a value of 0.01000 then there is 1 chance in 100 that all of the regression parameters are zero. Such a low value implies that at least some of the regression parameters are nonzero and that the regression equation does have some validity in fitting the data (i.e., the independent variables are not purely random with respect to the dependent variable).
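    The F value itself is a simple ratio of mean squares. The degrees of freedom in this sketch follow the usual linear-model convention (p parameters besides the mean, n observations), which is an assumption since the text does not spell out NLREG's exact accounting; the sum-of-squares inputs are made up.

```python
# F value: mean regression sum of squares over mean error sum of squares.
# Degrees of freedom follow the usual linear-model convention
# (an assumption of this sketch).
def f_value(ss_regression, ss_error, n, p):
    ms_regression = ss_regression / p
    ms_error = ss_error / (n - p - 1)
    return ms_regression / ms_error

print(f_value(ss_regression=90.0, ss_error=10.0, n=12, p=1))  # 90.0
```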

    Correlation Matrix

    The CORRELATE statement can be used to cause NLREG to print a correlation matrix. A "correlation coefficient" is a value that indicates whether there is a linear relationship between two variables. The absolute value of the correlation coefficient will be in the range 0 to 1. A value of 0 indicates that there is no relationship, whereas a value of 1 indicates that there is a perfect correlation and the two variables vary together. The sign of the correlation coefficient will be negative if there is an inverse relationship between the variables (i.e., as one increases the other decreases).


    For example, consider a study measuring the height and weight of a group of individuals. The correlation coefficient between height and weight will likely have a positive value somewhat less than one because tall people tend to weigh more than short people. A study comparing number of cigarettes smoked with age at death will probably have a negative correlation value.


    A correlation matrix shows the correlation between each pair of variables. The diagonal of the matrix has values of 1.00 because a variable always has a perfect correlation with itself. The matrix is symmetric about the diagonal because X correlated with Y is the same as Y correlated with X.
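    A small sketch of building such a matrix, using made-up height and weight data in the spirit of the example above; it shows the unit diagonal and the symmetry just described:

```python
# Pearson correlation and a correlation matrix for two illustrative
# variables (the data values are made up).
def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

height = [150.0, 160.0, 170.0, 180.0]
weight = [55.0, 62.0, 70.0, 80.0]

variables = [height, weight]
matrix = [[correlation(a, b) for b in variables] for a in variables]
# The diagonal is 1.00, and the matrix is symmetric about it.
```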


    Problems occur in regression analysis when a function is specified that has multiple independent variables that are highly correlated. The computed regression parameters are commonly interpreted as measuring the change in the expected value of the dependent variable when the corresponding independent variable is varied while all other independent variables are held constant. This interpretation is not fully applicable when a high degree of correlation exists, because with highly correlated independent variables it is difficult to attribute changes in the dependent variable to one of the independent variables rather than another. The following are effects of fitting a function with highly correlated independent variables:


    1. Large changes in the estimated regression parameters may occur when a variable is added or deleted, or when an observation is added or deleted.


    2. Individual tests on the regression parameters may show the parameters to be nonsignificant.


    3. Regression parameters may have the opposite algebraic sign than expected from theoretical or practical considerations.


    4. The confidence intervals for important regression parameters may be much wider than would otherwise be the case.


    The solution to these problems may be to select the most significant of the correlated variables and use only it in the function.


    Note: the correlation coefficients indicate the degree of linear association between variables. Variables may be highly related in a nonlinear fashion and still have a correlation coefficient near 0.
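    A classic illustration of this note: y = x*x over a range symmetric about zero is perfectly (but nonlinearly) related to x, yet its linear covariance with x, and hence its correlation coefficient, is exactly zero.

```python
# Perfectly related but nonlinear: y = x*x over a symmetric range has
# zero linear covariance with x, so the correlation coefficient is 0.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
print(cov)  # 0.0 -- no linear association despite the exact relationship
```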


    NLREG home page