We fit a multiple regression model to predict Sales using Price, Urban, and US.
library(ISLR)
library(readxl)
library(readr)
library(boot)
data(Carseats)
lmfit=lm(Sales~Price+Urban+US, data=Carseats)
summary(lmfit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
For every $1 increase in Price, Sales decreases on average by approximately 0.054 units (about 54 car seats, since Sales is recorded in thousands of units), holding the other predictors fixed. Stores in urban locations sell on average approximately 0.022 units fewer than non-urban stores, holding the other predictors fixed. Stores in the US sell on average approximately 1.2 units (about 1,200 car seats) more than stores outside the US, holding the other predictors fixed.
The fitted equation for this model is: Sales = 13.043469 - 0.054459(Price) - 0.021916(UrbanYes) + 1.200573(USYes) + error
Here UrbanYes = 1 for a store in an urban location and 0 otherwise, and USYes = 1 for a store in the US and 0 otherwise. We can reject the null hypothesis that the coefficient is zero for Price and US, since their p-values are far below 0.05; we cannot reject it for Urban (p = 0.936).
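As a quick illustration of the fitted equation and the dummy coding, here is a minimal sketch using the lmfit object above and made-up store characteristics (the prices and locations are hypothetical, not from the data).
# Hypothetical new stores: same price, differing only in Urban and US status.
newstores <- data.frame(Price = c(120, 120),
                        Urban = c("Yes", "No"),
                        US    = c("Yes", "No"))
predict(lmfit, newdata = newstores)   # predicted Sales, in thousands of units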
Here, we drop the Urban variable, since it was not a significant predictor of Sales.
lmfitnew=lm(Sales~Price+US, data=Carseats)
summary(lmfitnew)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The smaller model fits slightly better: the residual standard error drops (2.469 vs. 2.472), the adjusted R-squared rises (0.2354 vs. 0.2335), and the F-statistic is larger. Dropping Urban removes a predictor that added noise without any explanatory power.
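To make this comparison formal, a hedged sketch of a partial F-test between the two nested fits above; a large p-value for the Urban term confirms it can be dropped.
anova(lmfitnew, lmfit)   # tests whether adding Urban significantly improves the fit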
Here we compute 95% confidence intervals for the coefficients of the reduced model.
confint(lmfitnew)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Based on the diagnostic plots generated below, we can identify outliers and high-leverage points. The standardized-residual plot shows several potential outliers: observations whose standardized residuals exceed 2 in absolute value. There are also some high-leverage points, i.e., observations whose leverage exceeds the average value of (p + 1)/n.
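As a numeric complement to the diagnostic plots generated below, here is a minimal sketch using the usual rules of thumb: standardized residuals beyond plus or minus 2 flag potential outliers, and leverage above (p + 1)/n flags high-leverage points.
std.res <- rstandard(lmfitnew)
lev     <- hatvalues(lmfitnew)
sum(abs(std.res) > 2)                                # number of potential outliers
sum(lev > length(coef(lmfitnew)) / nrow(Carseats))   # points with above-average leverage, threshold (p + 1)/n = 3/400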
par(mfrow = c(2, 2))
plot(lmfitnew)
Looking at the data-generating equation below, the true intercept is 2, beta1 = 2, and beta2 = 0.3; the rnorm(100) term is the irreducible error.
set.seed(1)
x1 = runif (100)
x2 = 0.5*x1+rnorm (100)/10
y = 2+2*x1+0.3*x2+rnorm (100)
We compute the correlation between x1 and x2 and plot them against each other. The correlation coefficient is 0.8351212, indicating a strong positive (collinear) relationship between the two predictors.
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)
From the coefficient estimates and p-values below, the fitted model recovers the true coefficients poorly: betahat0 = 2.1305, betahat1 = 1.4396, and betahat2 = 1.0097. At the 5% level we can just barely reject the null hypothesis that beta1 = 0, since the p-value for x1 is 0.0487. For beta2 we fail to reject the null hypothesis, since its p-value (0.375) is well above 0.05.
lm14=lm(y~x1+x2)
summary(lm14)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
We now fit a regression using the x1 variable alone to predict y. We can reject the null hypothesis that its coefficient is zero, since the p-value is far below 0.05. The model explains approximately 20% of the variation in y (R-squared = 0.2024).
lmx1=lm(y~x1)
summary(lmx1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
We now fit a regression using the x2 variable alone to predict y. Again we can reject the null hypothesis that its coefficient is zero, since the p-value is far below 0.05. The model explains approximately 18% of the variation in y (R-squared = 0.1763).
lmx2=lm(y~x2)
summary(lmx2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
The results from (c)-(e) do appear to contradict each other. In the regression with both predictors, x1 is only marginally significant and x2 is not significant at all, suggesting they are poor predictors. Yet when each is used alone, both are highly significant predictors of y. The contradiction is only apparent: because x1 and x2 are strongly collinear, they carry largely the same information, so once one is in the model the other adds little, and their coefficient standard errors are inflated.
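One way to quantify the collinearity driving this behavior is a variance inflation factor; a minimal sketch computing it by hand in base R (no extra packages assumed):
# VIF = 1 / (1 - R^2) from regressing each predictor on the other.
vif.x1 <- 1/(1 - summary(lm(x1 ~ x2))$r.squared)
vif.x2 <- 1/(1 - summary(lm(x2 ~ x1))$r.squared)
c(vif.x1, vif.x2)   # values well above 1 indicate inflated coefficient variances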
Next, we add a mismeasured observation (x1 = 0.1, x2 = 0.8, y = 6) and refit the two-predictor model below. The R-squared value improves slightly, so a bit more of the variation in y is explained, and x2 is now the significant predictor instead of x1. No outliers appear: there are no points with standardized residuals above 2 in the diagnostic plots. There does, however, appear to be a high-leverage point (the added observation) in the residuals-vs-leverage plot.
x3=c(x1, 0.1)
x4=c(x2, 0.8)
y=c(y,6)
lm.fitg =lm(y~x3+x4)
summary(lm.fitg)
##
## Call:
## lm(formula = y ~ x3 + x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x3 0.5394 0.5922 0.911 0.36458
## x4 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
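A quick numeric check (a sketch) of whether the added observation, row 101, behaves as an outlier or a high-leverage point in this two-predictor refit:
rstudent(lm.fitg)[101]    # studentized residual; |value| > 2 would suggest an outlier
hatvalues(lm.fitg)[101]   # leverage; compare with the average (p + 1)/n = 3/101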
par(mfrow = c(2, 2))
plot(lm.fitg)
We next regress y on x3 (x1 with the new observation appended) alone. This model performs worse than the original y ~ x1 fit: its R-squared is lower (0.1562 vs. 0.2024), so it accounts for less of the variance in y. The x1 term is the one that remains statistically significant in both fits. There appear to be outliers, since some points exceed a standardized residual of 2, and the residuals-vs-leverage plot also suggests some higher-leverage points.
lmxg1=lm(y~x3)
summary(lmxg1)
##
## Call:
## lm(formula = y ~ x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x3 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
par(mfrow = c(2, 2))
plot(lmxg1)
Finally, we regress y on x4 (x2 with the new observation appended) alone. The x2 term is significant in explaining the variation in y in both this fit and the original y ~ x2 fit. The R-squared here (0.2122) is greater than in the original y ~ x2 model (0.1763), so this fit explains the variability in y to a greater extent. There are no clear outliers, although the added observation (x4 = 0.8) lies beyond the range of the original x2 values and so likely acts as a high-leverage point.
lmxg4=lm(y~x4)
summary(lmxg4)
##
## Call:
## lm(formula = y ~ x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x4 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
par(mfrow = c(2, 2))
plot(lmxg4)
Here we generate the full scatterplot matrix for the Auto dataset. Several of the pairwise relationships are negative and clearly curved, which makes us suspect non-linear relationships (for example between mpg and displacement).
library(ISLR)
data(Auto)
pairs(Auto)
Below, we use 10-fold cross-validation to see which degree of polynomial in displacement gives the smallest test MSE for predicting mpg.
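Before running the cross-validation, a numeric complement to the scatterplot matrix above; a minimal sketch that drops the non-numeric name column and prints the pairwise correlations:
round(cor(subset(Auto, select = -name)), 2)   # correlations among the numeric Auto variables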
deltas <- rep(NA, 15)
set.seed(1)
for (i in 1:15) {
fit <- glm(mpg ~ poly(displacement, i), data = Auto)
deltas[i] <- cv.glm(Auto, fit, K = 10)$delta[1]
}
plot(1:15, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(deltas)
points(which.min(deltas), deltas[which.min(deltas)], col = "red", cex = 2, pch = 20)
Next, we fit a degree-4 polynomial in displacement and then a cubic regression spline with knots at displacement values of 129, 258, and 387.
library(ISLR)
attach(Auto)
fit=lm(mpg~poly(displacement,4),data=Auto)
summary(fit)
##
## Call:
## lm(formula = mpg ~ poly(displacement, 4), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7755 -2.3666 -0.2723 2.1005 20.4053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2207 106.217 < 2e-16 ***
## poly(displacement, 4)1 -124.2585 4.3704 -28.432 < 2e-16 ***
## poly(displacement, 4)2 31.0895 4.3704 7.114 5.5e-12 ***
## poly(displacement, 4)3 -4.4655 4.3704 -1.022 0.308
## poly(displacement, 4)4 0.7747 4.3704 0.177 0.859
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.37 on 387 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6865
## F-statistic: 215 on 4 and 387 DF, p-value: < 2.2e-16
coef(summary(fit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459184 0.2207366 106.2167374 2.323731e-288
## poly(displacement, 4)1 -124.2584881 4.3703611 -28.4320870 8.253315e-97
## poly(displacement, 4)2 31.0895299 4.3703611 7.1137210 5.504761e-12
## poly(displacement, 4)3 -4.4655059 4.3703611 -1.0217705 3.075278e-01
## poly(displacement, 4)4 0.7747124 4.3703611 0.1772651 8.593929e-01
library(splines)
fit=lm(mpg~bs(displacement,knots=c(129,258,387)),data=Auto)
summary(fit)
##
## Call:
## lm(formula = mpg ~ bs(displacement, knots = c(129, 258, 387)),
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8578 -2.4612 -0.2946 2.2582 19.7510
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 30.559 1.490 20.514
## bs(displacement, knots = c(129, 258, 387))1 3.288 2.180 1.508
## bs(displacement, knots = c(129, 258, 387))2 -10.203 1.639 -6.227
## bs(displacement, knots = c(129, 258, 387))3 -10.574 2.385 -4.434
## bs(displacement, knots = c(129, 258, 387))4 -18.896 2.077 -9.099
## bs(displacement, knots = c(129, 258, 387))5 -15.574 2.739 -5.685
## bs(displacement, knots = c(129, 258, 387))6 -17.455 2.546 -6.857
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## bs(displacement, knots = c(129, 258, 387))1 0.132
## bs(displacement, knots = c(129, 258, 387))2 1.25e-09 ***
## bs(displacement, knots = c(129, 258, 387))3 1.21e-05 ***
## bs(displacement, knots = c(129, 258, 387))4 < 2e-16 ***
## bs(displacement, knots = c(129, 258, 387))5 2.58e-08 ***
## bs(displacement, knots = c(129, 258, 387))6 2.82e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.331 on 385 degrees of freedom
## Multiple R-squared: 0.6969, Adjusted R-squared: 0.6921
## F-statistic: 147.5 on 6 and 385 DF, p-value: < 2.2e-16
range(displacement)
## [1] 68 455
dim(bs(displacement,knots=c(129,258,387))) #computing dimension of basis functions
## [1] 392 6
dim(bs(displacement,knots=c(129,258,387),degree=1))
## [1] 392 4
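For comparison, a hedged sketch of a natural cubic spline with the same interior knots; ns() additionally constrains the fit to be linear beyond the boundary knots, which tends to stabilize the curve at the extremes of displacement.
fit.ns <- lm(mpg ~ ns(displacement, knots = c(129, 258, 387)), data = Auto)
summary(fit.ns)$adj.r.squared   # compare with the adjusted R-squared of the bs() fit above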
displacementlims=range(displacement)
displacement.grid=seq(from=displacementlims[1],to=displacementlims[2])
preds=predict(fit,newdata=list(displacement=displacement.grid),se=TRUE)
se.bands=cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)
pred=predict(fit,newdata=list(displacement=displacement.grid),se=T)
plot(displacement,mpg,col="gray")
lines(displacement.grid,pred$fit,lwd=2)
lines(displacement.grid,pred$fit+2*pred$se,lty="dashed")
lines(displacement.grid,pred$fit-2*pred$se,lty="dashed")
Here we fit a regression spline with six degrees of freedom, letting bs() place the knots at the quartiles of displacement.
library(ISLR)
attach(Auto)
## The following objects are masked from Auto (pos = 4):
##
## acceleration, cylinders, displacement, horsepower, mpg, name,
## origin, weight, year
library(splines)
fit=lm(mpg~bs(displacement,df=6),data=Auto)
summary(fit)
##
## Call:
## lm(formula = mpg ~ bs(displacement, df = 6), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6466 -2.4539 -0.3409 2.1079 20.4956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.5533 1.8142 13.534 < 2e-16 ***
## bs(displacement, df = 6)1 14.2102 2.5554 5.561 5.03e-08 ***
## bs(displacement, df = 6)2 0.5911 1.8515 0.319 0.749697
## bs(displacement, df = 6)3 0.3101 2.4356 0.127 0.898770
## bs(displacement, df = 6)4 -11.9498 2.5442 -4.697 3.68e-06 ***
## bs(displacement, df = 6)5 -9.5233 2.7208 -3.500 0.000519 ***
## bs(displacement, df = 6)6 -11.6205 2.4565 -4.731 3.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.169 on 385 degrees of freedom
## Multiple R-squared: 0.719, Adjusted R-squared: 0.7147
## F-statistic: 164.2 on 6 and 385 DF, p-value: < 2.2e-16
dim(bs(displacement,df=6))
## [1] 392 6
attr(bs(displacement,df=6),"knots")
## 25% 50% 75%
## 105.00 151.00 275.75
displacementlims=range(displacement)
displacement.grid=seq(from=displacementlims[1],to=displacementlims[2])
preds=predict(fit,newdata=list(displacement=displacement.grid),se=TRUE)
se.bands=cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)
pred=predict(fit,newdata=list(displacement=displacement.grid),se=T)
plot(displacement,mpg,col="gray")
lines(displacement.grid,pred$fit,lwd=2)
lines(displacement.grid,pred$fit+2*pred$se,lty="dashed")
lines(displacement.grid,pred$fit-2*pred$se,lty="dashed")
Here we fit smoothing splines to the data with df = 60, df = 16, and a df chosen by cross-validation, and compare their predictions with a cubic polynomial fit.
library(ISLR)
attach(Auto)
## The following objects are masked from Auto (pos = 3):
##
## acceleration, cylinders, displacement, horsepower, mpg, name,
## origin, weight, year
## The following objects are masked from Auto (pos = 5):
##
## acceleration, cylinders, displacement, horsepower, mpg, name,
## origin, weight, year
library(splines)
plot(displacement,mpg)
fit=smooth.spline(displacement,mpg,df=60)
fit2=smooth.spline(displacement,mpg,df=16)
fit3=smooth.spline(displacement,mpg,cv=TRUE)
## Warning in smooth.spline(displacement, mpg, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
fit3$df
## [1] 20.0332
#Comparison to Polynomial Regression
fit4=lm(mpg~I(displacement)+I(displacement^2)+I(displacement^3),data=Auto)
lines(fit,col="red",lwd=2)
lines(fit2,col="green",lwd=2)
lines(fit3,col="blue",lwd=2)
#lines(fit4,col="yellow",lwd=2)
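A small sketch adding a legend so the three smoothing-spline fits drawn above can be told apart (the colors match the lines() calls):
legend("topright",
       legend = c("df = 60", "df = 16", paste0("CV, df = ", round(fit3$df, 1))),
       col = c("red", "green", "blue"), lty = 1, lwd = 2, cex = 0.8)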
new=data.frame(displacement=65)
#predict(fit, new, se.fit=T, interval='prediction')
#predict(fit2, new, se.fit=T, interval='prediction')
#predict(fit3, new, se.fit=T, interval='prediction')
predict(fit, new) #df=60
## $x
## displacement
## 1 65
##
## $y
## displacement
## 1 214.7417
predict(fit2, new) #df=16
## $x
## displacement
## 1 65
##
## $y
## displacement
## 1 25.50258
predict(fit3, new) #CV-chosen df (approximately 20)
## $x
## displacement
## 1 65
##
## $y
## displacement
## 1 23.55645
predict(fit4, new) #cubic polynomial
## 1
## 34.34687
Note that displacement = 65 lies just below the smallest observed value (68), so all of these predictions are mild extrapolations. The df = 60 smoothing spline is so wiggly that its extrapolated prediction (about 215 mpg) is nonsensical, while the lower-df fits give plausible values in the mid-20s to mid-30s.
We first import the dataset.
library(readr)
Speed <- read_csv("~/Downloads/Speed.csv")
## Parsed with column specification:
## cols(
## Year = col_double(),
## FatalityRate = col_double(),
## StateControl = col_double()
## )
attach(Speed)
show(Speed)
## # A tibble: 21 x 3
## Year FatalityRate StateControl
## <dbl> <dbl> <dbl>
## 1 1987 2.41 0
## 2 1988 2.32 0
## 3 1989 2.17 0
## 4 1990 2.08 0
## 5 1991 1.91 0
## 6 1992 1.75 0
## 7 1993 1.75 0
## 8 1994 1.73 0
## 9 1995 1.72 1
## 10 1996 1.69 1
## # … with 11 more rows
We take the summary to get the slope coefficient. The slope of the least-squares regression line is -0.044870, meaning the fatality rate falls on average by roughly 0.045 per additional year over this period. The residuals-vs-fitted plot is V-shaped, which suggests that a single straight line misses a change in the trend partway through the sample.
lmfit<-lm(FatalityRate~Year, data=Speed)
summary(lmfit)
##
## Call:
## lm(formula = FatalityRate ~ Year, data = Speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18959 -0.07550 -0.02576 0.09346 0.24606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.320887 8.374227 10.9 1.28e-09 ***
## Year -0.044870 0.004193 -10.7 1.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1164 on 19 degrees of freedom
## Multiple R-squared: 0.8577, Adjusted R-squared: 0.8502
## F-statistic: 114.5 on 1 and 19 DF, p-value: 1.75e-09
par(mfrow = c(2, 2))
plot(lmfit)
Here we run a regression that includes an interaction between Year and StateControl. From the summary below, the interaction term is highly significant (p < 0.05), so there is a significant change in the relationship between fatality rate and year starting in 1995, when StateControl switches from 0 to 1.
lmfit1<-lm(FatalityRate~Year+StateControl+Year:StateControl, data=subset(Speed))
summary(lmfit1)
##
## Call:
## lm(formula = FatalityRate ~ Year + StateControl + Year:StateControl,
## data = subset(Speed))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.103571 -0.020769 0.004048 0.022473 0.091667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.162e+02 1.303e+01 16.59 6.19e-12 ***
## Year -1.076e-01 6.548e-03 -16.44 7.19e-12 ***
## StateControl -1.614e+02 1.447e+01 -11.15 3.07e-09 ***
## Year:StateControl 8.097e-02 7.264e-03 11.15 3.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04243 on 17 degrees of freedom
## Multiple R-squared: 0.9831, Adjusted R-squared: 0.9801
## F-statistic: 329 on 3 and 17 DF, p-value: 2.998e-15
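As a quick check, a sketch computing the year slopes implied by the interaction model in each period. The pre-1995 slope matches the separate pre-1995 regression fit below exactly; the post-1995 slope differs slightly from the separate post-1995 fit because that subset excludes the year 1995 itself.
coef(lmfit1)["Year"]                                       # slope while StateControl = 0 (before 1995)
coef(lmfit1)["Year"] + coef(lmfit1)["Year:StateControl"]   # slope once StateControl = 1 (1995 onward)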
par(mfrow = c(2, 2))
plot(lmfit1)
We now fit separate regressions to the observations before 1995 and after 1995 (the year 1995 itself falls in neither subset, since the code uses strict inequalities).
Post1995<-lm(FatalityRate[Year>1995]~ Year[Year>1995] )
Pre1995<-lm(FatalityRate[Year<1995]~ Year[Year<1995] )
Here we show the diagnostic plots and summary for the post-1995 fit.
par(mfrow = c(2, 2))
plot(Post1995)
summary(Post1995)
##
## Call:
## lm(formula = FatalityRate[Year > 1995] ~ Year[Year > 1995])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.028287 -0.018986 -0.004371 0.019458 0.035769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.182343 4.051988 12.88 1.50e-07 ***
## Year[Year > 1995] -0.025315 0.002024 -12.50 1.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02421 on 10 degrees of freedom
## Multiple R-squared: 0.9399, Adjusted R-squared: 0.9339
## F-statistic: 156.4 on 1 and 10 DF, p-value: 1.982e-07
par(mfrow = c(2, 2))
plot(Pre1995)
summary(Pre1995)
##
## Call:
## lm(formula = FatalityRate[Year < 1995] ~ Year[Year < 1995])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.103571 -0.017619 0.007619 0.022738 0.091667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 216.23071 19.24719 11.23 2.97e-05 ***
## Year[Year < 1995] -0.10762 0.00967 -11.13 3.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06267 on 6 degrees of freedom
## Multiple R-squared: 0.9538, Adjusted R-squared: 0.9461
## F-statistic: 123.9 on 1 and 6 DF, p-value: 3.136e-05
The fitted equations relating fatality rate to year in the two periods are:
Before 1995: FatalityRate = 216.23071 - 0.10762(Year) + error
After 1995: FatalityRate = 52.182343 - 0.025315(Year) + error
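To close, a sketch overlaying the two fitted segments on the data (this assumes the attached Speed data and the Pre1995/Post1995 fits above):
par(mfrow = c(1, 1))
plot(Year, FatalityRate, pch = 19,
     xlab = "Year", ylab = "Fatality rate")
lines(Year[Year < 1995], fitted(Pre1995), col = "red",  lwd = 2)   # pre-1995 fit
lines(Year[Year > 1995], fitted(Post1995), col = "blue", lwd = 2)  # post-1995 fit
abline(v = 1995, lty = "dashed")   # year when StateControl switches to 1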