Regression analysis is a tough subject. Learning regression in a college statistics class is often made more challenging by the corequisite task of learning a software package to manage the data. I myself was subjected to Minitab, which can best be compared to what Excel would look like if you ran it on an Atari. I would recommend that anybody studying this subject consider using R: it’s free, not terribly difficult to use, and runs easily on most computers without a lot of system drag. This post is written for someone who may not have used R before, or who is not a programmer, or who is just trying to pass a stats class. Or maybe you just want a set of tools to help follow a textbook or other statistics tutorial. I hope it’s helpful.
In order to sharpen my chops at regression analysis, I took a look at the standby text “Regression Analysis by Example” by Chatterjee and Price. I thought I would take their example data (“computer repair times”) and run it in R. These notes and code turn the regression components discussed in Chapter One into R vectors. Writing the code helped me nail down the meaning of each variable and get the basic regression output down more solidly. Source the code into R and you can rerun the analyses on your own data just by changing the data vectors.
The data consists of repair times and quantities of computers serviced during 14 service visits by a fictional computer repair firm. Let’s get it into a data frame.
# regressionexample.r
# regression analysis by example, by chatterjee and price
# example uses data of service call length as a function of number of units
# table 1.2, chapter 1
units <- c(1,2,3,4,4,5,6,6,7,8,9,9,10,10)
units
## [1] 1 2 3 4 4 5 6 6 7 8 9 9 10 10
minutes <- c(23,29,49,64,74,87,96,97,109,119,149,145,154,166)
minutes
## [1] 23 29 49 64 74 87 96 97 109 119 149 145 154 166
service <- data.frame(units, minutes)
service
## units minutes
## 1 1 23
## 2 2 29
## 3 3 49
## 4 4 64
## 5 4 74
## 6 5 87
## 7 6 96
## 8 6 97
## 9 7 109
## 10 8 119
## 11 9 149
## 12 9 145
## 13 10 154
## 14 10 166
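Before fitting anything, it doesn’t hurt to confirm that the data frame looks the way we expect. The str() and summary() functions (both base R) give a quick check; this is a sanity-check aside of my own, not part of the book’s example.
# quick look at the structure and ranges of the data frame
str(service)     # should show 14 obs. of 2 numeric variables
summary(service) # five-number summaries of units and minutes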
Now that we have the data in place we can use the generic linear model function, lm(), to run the least squares regression. The model is stored in a variable; typing its name prints the call and the fitted coefficients. Then we can create objects for the ANOVA table and the summary() output.
model1 <- lm(minutes~units, service)
model1
##
## Call:
## lm(formula = minutes ~ units, data = service)
##
## Coefficients:
## (Intercept) units
## 4.162 15.509
anova1 <- anova(model1)
anova1
## Analysis of Variance Table
##
## Response: minutes
## Df Sum Sq Mean Sq F value Pr(>F)
## units 1 27419.5 27419.5 943.2 8.916e-13 ***
## Residuals 12 348.8 29.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary1 <- summary(model1)
summary1
##
## Call:
## lm(formula = minutes ~ units, data = service)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2318 -3.3415 -0.7143 4.7769 7.8033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.162 3.355 1.24 0.239
## units 15.509 0.505 30.71 8.92e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.392 on 12 degrees of freedom
## Multiple R-squared: 0.9874, Adjusted R-squared: 0.9864
## F-statistic: 943.2 on 1 and 12 DF, p-value: 8.916e-13
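As a quick sanity check, and to connect this output back to the formulas in Chapter One, the slope and intercept can be computed by hand from the sample covariance and variance. The variable names below are my own, not from the book.
# least squares by hand: slope = Sxy / Sxx, intercept = ybar - slope * xbar
b1_byhand <- cov(minutes, units) / var(units)
b0_byhand <- mean(minutes) - b1_byhand * mean(units)
c(intercept = b0_byhand, slope = b1_byhand)
# should match coef(model1): roughly 4.162 and 15.509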
The fitted() function allows me to create a vector of fitted values.
# extracting elements from a linear fit
modelfits <- round(fitted(model1), 2)
modelfits
## 1 2 3 4 5 6 7 8 9 10
## 19.67 35.18 50.69 66.20 66.20 81.71 97.21 97.21 112.72 128.23
## 11 12 13 14
## 143.74 143.74 159.25 159.25
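The fitted values are nothing more than the regression line evaluated at each observed x, which we can confirm directly from the coefficients (this check is my own addition).
# fitted values are b0 + b1 * x at each observation
round(coef(model1)[1] + coef(model1)[2] * units, 2)
# should reproduce the numbers in modelfits above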
The resid() function creates a vector of residuals. I use the round() function to trim the output a little.
modelresids <- round(resid(model1), 4)
modelresids
## 1 2 3 4 5 6 7 8 9
## 3.3296 -6.1792 -1.6880 -2.1967 7.8033 5.2945 -1.2143 -0.2143 -3.7231
## 10 11 12 13 14
## -9.2318 5.2594 1.2594 -5.2494 6.7506
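Equivalently, each residual is just the observed value minus the fitted value, which is easy to verify.
# residuals are observed minus fitted
round(minutes - fitted(model1), 4)
# should reproduce the numbers in modelresids above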
It’s helpful to have a table of observations, fitted values, and residuals, and this command combines our original data frame with those two vectors to build exactly that.
service2 <- cbind(service, modelfits, modelresids)
service2
## units minutes modelfits modelresids
## 1 1 23 19.67 3.3296
## 2 2 29 35.18 -6.1792
## 3 3 49 50.69 -1.6880
## 4 4 64 66.20 -2.1967
## 5 4 74 66.20 7.8033
## 6 5 87 81.71 5.2945
## 7 6 96 97.21 -1.2143
## 8 6 97 97.21 -0.2143
## 9 7 109 112.72 -3.7231
## 10 8 119 128.23 -9.2318
## 11 9 149 143.74 5.2594
## 12 9 145 143.74 1.2594
## 13 10 154 159.25 -5.2494
## 14 10 166 159.25 6.7506
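Two properties of least squares are worth confirming from this table: the residuals sum to (essentially) zero, and their sum of squares is the 348.8 reported on the Residuals line of the ANOVA table. These checks are my own additions.
# least squares residuals sum to ~0; their sum of squares matches the ANOVA Residuals line
sum(resid(model1))
sum(resid(model1)^2)  # should be about 348.8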
The summary() function stores the coefficients in a matrix. We can store this matrix in an object of its own and then access its components when we want to evaluate the adequacy of the explanatory variable.
modelcoefs <- summary(model1)$coefficients
modelcoefs
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.161654 3.3551004 1.240396 2.385344e-01
## units 15.508772 0.5049813 30.711576 8.916254e-13
# returns a 2x4 matrix: rows [1:2] = (Intercept), units; columns [1:4] = estimate, std. error, t value, Pr(>|t|)
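Since the matrix carries row and column names, the same entries can also be pulled out by name rather than by numeric index, which makes later code easier to read. For example:
# indexing the coefficient matrix by name instead of position
modelcoefs["units", "Estimate"]    # the slope
modelcoefs["units", "Std. Error"]  # its standard error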
Having viewed the basic output, we can employ some tests to help decide whether this is a “good” regression.
Chatterjee and Price run through some basic procedures to check the validity of the regression model (i.e., making sure you don’t overlook glaring features of the data that call the use of the model into question). Chapter One in their book is, in fact, all about using basic plot output to evaluate whether it is appropriate to use a regression method at all (before delving into the matter of selecting between regression methods).
The first set of assumptions to test revolves around the existence and relative linearity of a basic relationship between the variables. Plotting the data lets us visually assess the correlation between units and minutes.
Let’s plot the data, and see if there’s something to this. We will simply plot the “service” object we created when we set up the data frame.
plot(service, pch=20, col="blue", ylab="length of service call (minutes)")
title(main=list("Repair Times for Computers", cex=1.5,
col="blue", font=3))
# returns main plot
There appears to be a distinct relationship, so we can move on. The relationship looks linear, and the high R-squared value (in the summary output above) indicates that “units” is a variable worthy of consideration, whatever we ultimately conclude about the model.
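As a side note, in a simple regression with one predictor the R-squared is just the square of the correlation between the two variables, which is easy to confirm:
# in simple linear regression, R-squared equals the squared correlation of x and y
cor(units, minutes)^2  # should match the Multiple R-squared of 0.9874 above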
Next let’s get the four diagnostic plots that are “contained” within the model. We do this by plotting the model itself (not the data frame object). We will only be concerned with the first of these plots for now: the plot of residuals vs. fitted values. We want to look at the spread of the points and see whether it is free of obvious patterns. The expectation (if this regression is to be useful) is that the residuals will be randomly scattered. If the residuals showed a trend of some sort (say, large deviations at high x values and small deviations at low x values), we would need to reconsider the formula used to derive the regression line. In that scenario a regression might still be appropriate, but some transformation of the variables (the subject of Chapter Two) would be required.
You will also notice that certain data points are identified by number in the plots. Disregard this for now. It will matter later.
plot(model1, pch=20, col="blue", main=list("Repair Times for Computers", cex=1.5,
col="blue", font=3))
# note: original ratio is 665 X 513
# returns 4 plots
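If you only want the residuals-vs-fitted plot rather than all four panels, plot.lm() accepts a which argument; which = 1 selects just that first plot.
# draw only the residuals vs. fitted values plot
plot(model1, which = 1, pch = 20, col = "blue")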
As it turns out, the residuals form a nice undifferentiated mass of points on the plot. We are free to move on. As the authors put it, “we conclude that the model specification is satisfactory,” and that will work for me.
The analysis proceeds to the step of evaluating the explanatory variable. The high R-squared value (in the summary) tells us that nearly all of the variance in the length of service calls can be attributed to the number of computers repaired. Another way to state this would be that the “B1” value in the regression equation should not equal zero. If it could equal zero, that would suggest the intercept (the “B0” value) is adequate as a predictor on its own, which would mean repair time is constant regardless of the number of computers fixed, and that way lies madness.
In this spirit, the authors propose a test of the null hypothesis “B1” = 0. This test relies on values that are part of the summary output for our model (the t statistic and the standard error of “B1”) but which are not immediately recognizable as objects to mess with in our R code. Those coefficients are accessible as objects, but it will be helpful to formally name them as such in case we want to write R code that runs the test programmatically.
The test also requires a critical value of t. If you’re in a stats class, this value is arrived at by consulting a table: you use the degrees of freedom (df, which annoyingly can be conflated with “df” for “data frame” in R parlance) and the desired significance level to look it up. Use the column for a two-tailed test (alpha/2 in each tail, not alpha, as far as consulting that grid is concerned). Here I use the qt() function instead, with .975 (that is, 1 - alpha/2) as the probability argument and length(units) - 2 as the degrees-of-freedom argument.
I would go as far as to say the main goal of this post is to break down the authors’ material into usable code chunks as much as possible. The more you look at regression equations (especially the equations that define the components) the more they begin to look like some hellish algebraic stew. It can be very handy to have all of the components and subcomponents of the equations easily identified and encapsulated in variables so we can plug and play and let R do the heavy lifting. All of this is accomplished on the mathematical end already (what else is math at its heart if not a way to “name” the previously unknown?), but writing code to handle these tests requires coming up with our own variables in which to store the components. Let’s do that now with the components of the hypothesis test.
# test explanatory variable
# test Ho(B1=0) where B1 denotes the regression coefficient of x on y
b1coef <- modelcoefs[2,1] # returns a single number (B1 value)
b1coef
## [1] 15.50877
b1se <- modelcoefs[2,2] # returns s.e. of B1
b1se
## [1] 0.5049813
b1tstat <- modelcoefs[2,3] # returns t value for testing Ho(B1=0)
b1tstat
## [1] 30.71158
b1pval <- modelcoefs[2,4] # returns p value
b1pval
## [1] 8.916254e-13
b1tcrit <- qt(.975, length(units)-2)
b1tcrit
## [1] 2.178813
# returns critical t value at 95% level of significance (two-tailed test, 1-(alpha/2) = .975)
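For what it’s worth, the p-value reported in the summary is just the two-tailed tail probability of that t statistic, which we can reproduce with pt():
# reproduce the reported p-value from the t statistic
2 * pt(-abs(b1tstat), df = length(units) - 2)
# should match b1pval (8.916254e-13)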
I created a variable named “adequate” that holds the result of that test. It returns TRUE if the t statistic exceeds the critical value of t, and FALSE otherwise.
adequate <- b1tstat > b1tcrit # returns TRUE if explanatory variable is appropriate
adequate
## [1] TRUE
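An equivalent decision rule (my own variation, not from the book) compares the p-value with alpha; wrapping the t statistic in abs() also keeps the comparison honest if a slope estimate happens to be negative.
# equivalent checks: p-value versus alpha, and a two-sided comparison of |t| with the critical value
adequate_pval <- b1pval < 0.05
adequate_twosided <- abs(b1tstat) > b1tcrit
adequate_pval
adequate_twosided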
In the follow-up post I will wrap that test in a function that can be used on any data set resembling the toy example from Chapter One. Then I’ll get into the last bit of the chapter, which covers confidence and prediction intervals and an expanded case of the repair time data (in which additional variables call the model into question).