Linear Regression
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
x<-cars$speed
y<-cars$dist
cars.lm <- lm(dist ~ speed, data = cars)
b<-cars.lm$coefficients[1]
m<-cars.lm$coefficients[2]
# note we can reproduce cars.lm$fitted.values like this...
# fitted_recalc<-(m * x) + b
# Visualize the regression line
title<-sprintf(" y = %.2fx + %.2f ",m,b)
plot(y ~ x, main=title, xlab="speed", ylab="distance")
abline(cars.lm) We denote \(\hat{y}\) for the fitted or predicted values and \(y_i\) for the actual observed values
The residuals are the differences between \(y_i\) and \(\hat{y}\)
note that \(y_i\) equals cars.lm\(fitted.values + cars.lm\)residuals
sum_cars.lm<-summary(cars.lm)
sum_cars.lm##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
One indicator of a good fit is that the residuals are normal.
hist(sum_cars.lm$residuals,breaks=20)Another way to visualize normality is through a qq plot.
qqnorm(resid(sum_cars.lm))
qqline(resid(sum_cars.lm))One indicator is that the Standard Error column is at least 5 times smaller than the Estimate column.
The Pr(> |t|) column indicates the probability that the coefficient is “not relevant”.
The significant codes are :
\[** \ \ \ 0.001 < p \le 0.01\]
\[*** \ \ \ 0 < p \le 0.001\]
The t statistic divides the coefficient estimate by the standard error and represents the number of deviations on the T distribution.
You can reproduce this like this …
sum_cars.lm$coefficients[,1]/sum_cars.lm$coefficients[,2]## (Intercept) speed
## -2.601058 9.463990
The 48 degrees of freedom is the number of observations minus the number of coefficients (2).
The F-statistic compares the current model to a model that has one fewer parameters. It is only relevant for multiple category models like Anova.
Here we compare the fitted to the residuals. A good model would display this as small but uniform dots around 0. This model appears ok.
plot(fitted(cars.lm),resid(cars.lm))The Residual Standard Error (RSE) is the RSS divided by the degrees of freedom. Its stored in sum_cars.lm$sigma
\[SSTot \ = \ \sum_{i=1}^{n}{(y_i \ - \ \bar{y} )^2}\]
calcSSTot <- function(observations,mean_y) {
SSTot<-sum((observations - mean_y)^2)
return(SSTot)
}\[RSS \ = \ \sum_{i=1}^{n}{(y_i \ - \ \hat{y} )^2}\]
# RSS aka SSE aka Unexplained aka Error
calcRSS <- function(observations,predicted){
diffs<-observations-predicted
RSS<-sum(diffs^2)
return(RSS)
}\[SSReg \ = \ \sum_{i=1}^{n}{(\hat{y} \ - \ \bar{y} )^2}\]
calcSSReg <- function(fittedvalues,mean_y) {
mean_yy<-rep(mean_y,length(mean_y))
SSReg<-sum((fittedvalues - mean_yy)^2)
return(SSReg)
}calcRSE <- function(observations, fittedvalues,df) {
RSE<-sqrt(calcRSS(observations, fittedvalues) / df)
return(RSE)
}
calcRSE(cars$dist, cars.lm$fitted.values,cars.lm$df.residual)## [1] 15.37959
sum_cars.lm$sigma## [1] 15.37959
#Multiple R-Squared (Coefficient of Determination)
# residuals squared minus
calc_MRSquared <-function(fittedvalues, residuals) {
SSreg<-sum((fittedvalues-mean(fittedvalues))^2)
SSE=sum(residuals^2)
return((SSreg-SSE)/SSreg)
}
sum_cars.lm$r.squared## [1] 0.6510794
calc_MRSquared(cars$dist, cars.lm$residuals)## [1] 0.6510794
cor(cars$dist, cars$speed)^2## [1] 0.6510794
The adjusted R2 is always smaller than the R2 value. The more coefficients you have the smaller it will be.
sum_cars.lm$adj.r.squared## [1] 0.6438102
n=length(cars$dist)
k=length(sum_cars.lm$coefficients[1,])-1 #Subtract one to ignore intercept
SSE=sum(sum_cars.lm$residuals^2)
SSreg=sum((cars$dist-mean(cars$dist))^2)
1-((SSE/SSreg)*(n-1)/(n-(k-1)))## [1] 0.6438102
This is a very good document for explaining the data behind the summary function.