Linear Regression






Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed, and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).



Cars Linear Model



x <- cars$speed
y <- cars$dist

# Fit the simple linear model: dist ~ speed
cars.lm <- lm(dist ~ speed, data = cars)

b <- cars.lm$coefficients[1]   # intercept
m <- cars.lm$coefficients[2]   # slope

# note we can reproduce cars.lm$fitted.values like this (verified below)...
# fitted_recalc <- (m * x) + b

# Visualize the data with the fitted regression line
title <- sprintf("y = %.2fx + %.2f", m, b)
plot(y ~ x, main = title, xlab = "speed", ylab = "distance")
abline(cars.lm)
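
As a quick check, the commented-out recalculation above does reproduce the fitted values (a minimal sketch, using the fitted_recalc name from the comment; unname() drops the names attribute so the comparison is value-by-value):

# recompute the fitted values by hand from slope and intercept
fitted_recalc <- (m * x) + b
all.equal(unname(fitted(cars.lm)), fitted_recalc)
## [1] TRUE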


We write \(\hat{y}_i\) for the fitted (predicted) values and \(y_i\) for the actual observed values.

The residuals are the differences between \(y_i\) and \(\hat{y}_i\).

Note that \(y_i\) equals cars.lm$fitted.values + cars.lm$residuals.
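
We can verify this identity directly (a quick sketch):

# observed = fitted + residual, for every observation
all.equal(cars$dist, unname(cars.lm$fitted.values + cars.lm$residuals))
## [1] TRUE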


sum_cars.lm<-summary(cars.lm)

sum_cars.lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12


One indicator of a good fit is that the residuals are approximately normally distributed.


hist(sum_cars.lm$residuals,breaks=20)


Another way to assess normality visually is with a Q-Q plot.


qqnorm(resid(cars.lm))
qqline(resid(cars.lm))


Another indicator of a good fit is that each coefficient's standard error is small relative to its estimate; a common rule of thumb is that the Std. Error column should be at least 5 times smaller than the Estimate column (in absolute value).


The Pr(>|t|) column gives the p-value for the null hypothesis that the coefficient is zero, i.e., the probability of observing a t statistic at least this extreme if the predictor were irrelevant.

The significance codes are:

\[*** \ \ \ 0 < p \le 0.001\]

\[** \ \ \ 0.001 < p \le 0.01\]

The t statistic is the coefficient estimate divided by its standard error; it measures how many standard errors the estimate lies away from zero on a t distribution.


You can reproduce it like this:

sum_cars.lm$coefficients[,1]/sum_cars.lm$coefficients[,2]
## (Intercept)       speed 
##   -2.601058    9.463990
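
The Pr(>|t|) column follows in turn from those t statistics and the 48 residual degrees of freedom. A sketch using the two-sided tail probability of the t distribution:

# two-sided p-values; matches the Pr(>|t|) column above (0.0123 and 1.49e-12)
2 * pt(-abs(sum_cars.lm$coefficients[, 3]), df = 48)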


The 48 degrees of freedom is the number of observations (50) minus the number of estimated coefficients (2).
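
This can be confirmed from the model object:

nrow(cars) - length(coef(cars.lm))
## [1] 48
cars.lm$df.residual
## [1] 48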


The F-statistic compares the fitted model to the intercept-only model (no predictors). With a single predictor it carries the same information as the slope's t test; it becomes more informative when there are several predictors or multi-level categorical terms, as in ANOVA.
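
In the single-predictor case, the slope's t value squared reproduces the F-statistic. A quick sketch:

# both are approximately 89.57, matching the summary output above
sum_cars.lm$coefficients[2, 3]^2
sum_cars.lm$fstatistic[1]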


Here we plot the residuals against the fitted values. For a good model, the points should scatter uniformly around 0 with no visible pattern. This model appears OK.

plot(fitted(cars.lm), resid(cars.lm), xlab = "fitted values", ylab = "residuals")
abline(h = 0)   # reference line at zero


The Residual Standard Error (RSE) is the square root of the RSS divided by the residual degrees of freedom. It's stored in sum_cars.lm$sigma.



Sum-of-Squares Functions




\[SSTot \ = \ \sum_{i=1}^{n}{(y_i \ - \ \bar{y} )^2}\]

calcSSTot <- function(observations,mean_y) {
  SSTot<-sum((observations - mean_y)^2)
  return(SSTot)
}


\[RSS \ = \ \sum_{i=1}^{n}{(y_i \ - \ \hat{y}_i )^2}\]

# RSS aka SSE aka Unexplained aka Error
calcRSS <- function(observations,predicted){
  diffs<-observations-predicted
  RSS<-sum(diffs^2)
  return(RSS)
}


\[SSReg \ = \ \sum_{i=1}^{n}{(\hat{y}_i \ - \ \bar{y} )^2}\]

calcSSReg <- function(fittedvalues, mean_y) {
  # mean_y is a scalar; R recycles it across the vector of fitted values
  SSReg <- sum((fittedvalues - mean_y)^2)
  return(SSReg)
}
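
These satisfy the decomposition \(SSTot \ = \ SSReg \ + \ RSS\), which we can verify on the cars fit (a quick sketch; y_bar is just a local name for the mean response):

y_bar <- mean(cars$dist)
all.equal(calcSSTot(cars$dist, y_bar),
          calcSSReg(cars.lm$fitted.values, y_bar) + calcRSS(cars$dist, cars.lm$fitted.values))
## [1] TRUE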



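The RSE helper below follows the same pattern (here \(df\) is the residual degrees of freedom, \(n - 2\) for this model):

\[RSE \ = \ \sqrt{\frac{RSS}{df}}\]
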
calcRSE <- function(observations, fittedvalues,df) {
  RSE<-sqrt(calcRSS(observations, fittedvalues) / df)
  return(RSE)
}


calcRSE(cars$dist, cars.lm$fitted.values,cars.lm$df.residual)
## [1] 15.37959
sum_cars.lm$sigma
## [1] 15.37959


# Multiple R-squared (coefficient of determination):
# R^2 = (SSTot - RSS) / SSTot = 1 - RSS / SSTot
calc_MRSquared <- function(observations, residuals) {
  SSTot <- sum((observations - mean(observations))^2)
  RSS <- sum(residuals^2)
  return((SSTot - RSS) / SSTot)
}

sum_cars.lm$r.squared
## [1] 0.6510794
calc_MRSquared(cars$dist, cars.lm$residuals)
## [1] 0.6510794
cor(cars$dist, cars$speed)^2
## [1] 0.6510794


The adjusted \(R^2\) is always smaller than the \(R^2\) value (for any model with at least one predictor): it penalizes model complexity, so coefficients that add little explanatory power push it further below \(R^2\).
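
In formula form, with \(n\) observations and \(k\) predictors (the standard definition, which the code below implements):

\[R^2_{adj} \ = \ 1 \ - \ (1 - R^2)\frac{n-1}{n-k-1}\]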


sum_cars.lm$adj.r.squared
## [1] 0.6438102
n <- length(cars$dist)
k <- nrow(sum_cars.lm$coefficients) - 1   # number of predictors (exclude the intercept row)

RSS <- sum(sum_cars.lm$residuals^2)
SSTot <- sum((cars$dist - mean(cars$dist))^2)
1 - (RSS / SSTot) * (n - 1) / (n - k - 1)
## [1] 0.6438102


This is a very good document explaining the data behind the summary() function.