Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

library(dplyr)   # needed for glimpse() below

df_uk <- data.frame(datasets::Seatbelts)

summary(df_uk)
##  DriversKilled      drivers         front             rear      
##  Min.   : 60.0   Min.   :1057   Min.   : 426.0   Min.   :224.0  
##  1st Qu.:104.8   1st Qu.:1462   1st Qu.: 715.5   1st Qu.:344.8  
##  Median :118.5   Median :1631   Median : 828.5   Median :401.5  
##  Mean   :122.8   Mean   :1670   Mean   : 837.2   Mean   :401.2  
##  3rd Qu.:138.0   3rd Qu.:1851   3rd Qu.: 950.8   3rd Qu.:456.2  
##  Max.   :198.0   Max.   :2654   Max.   :1299.0   Max.   :646.0  
##       kms         PetrolPrice        VanKilled           law        
##  Min.   : 7685   Min.   :0.08118   Min.   : 2.000   Min.   :0.0000  
##  1st Qu.:12685   1st Qu.:0.09258   1st Qu.: 6.000   1st Qu.:0.0000  
##  Median :14987   Median :0.10448   Median : 8.000   Median :0.0000  
##  Mean   :14994   Mean   :0.10362   Mean   : 9.057   Mean   :0.1198  
##  3rd Qu.:17202   3rd Qu.:0.11406   3rd Qu.:12.000   3rd Qu.:0.0000  
##  Max.   :21626   Max.   :0.13303   Max.   :17.000   Max.   :1.0000
glimpse(df_uk)
## Rows: 192
## Columns: 8
## $ DriversKilled <dbl> 107, 97, 102, 87, 119, 106, 110, 106, 107, 134, 147, 180…
## $ drivers       <dbl> 1687, 1508, 1507, 1385, 1632, 1511, 1559, 1630, 1579, 16…
## $ front         <dbl> 867, 825, 806, 814, 991, 945, 1004, 1091, 958, 850, 1109…
## $ rear          <dbl> 269, 265, 319, 407, 454, 427, 522, 536, 405, 437, 434, 4…
## $ kms           <dbl> 9059, 7685, 9963, 10955, 11823, 12391, 13460, 14055, 121…
## $ PetrolPrice   <dbl> 0.10297181, 0.10236300, 0.10206249, 0.10087330, 0.101019…
## $ VanKilled     <dbl> 12, 6, 12, 8, 10, 13, 11, 6, 10, 16, 13, 14, 14, 6, 8, 1…
## $ law           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
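Before modelling, it is worth confirming how the dichotomous law variable is coded. A minimal sketch (the law_f factor is an optional convenience for labelling, not used in the models below):

# law is a 0/1 dummy: 1 for months when the compulsory seat-belt law was in effect
table(df_uk$law)

# optional labelled factor version, handy for plots and tables
df_uk$law_f <- factor(df_uk$law, levels = c(0, 1), labels = c("before law", "after law"))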

Exploratory plot

A scatterplot of drivers killed against petrol price gives a first look at the relationship:

plot(df_uk$PetrolPrice, df_uk$DriversKilled)
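One optional refinement, not part of the plot above, is to colour the points by law; this gives a visual hint of whether the price-fatality relationship differs before and after the law, which is what the interaction term will test. A minimal sketch in base R:

plot(df_uk$PetrolPrice, df_uk$DriversKilled,
     col = ifelse(df_uk$law == 1, "red", "black"),
     xlab = "PetrolPrice", ylab = "DriversKilled")
legend("topright", legend = c("law = 0", "law = 1"),
       col = c("black", "red"), pch = 1)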

model <- lm(DriversKilled ~ PetrolPrice + I(PetrolPrice^2) + law + PetrolPrice*law, data=df_uk)
summary(model)
## 
## Call:
## lm(formula = DriversKilled ~ PetrolPrice + I(PetrolPrice^2) + 
##     law + PetrolPrice * law, data = df_uk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.264 -16.141  -4.538  13.755  61.445 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        382.08     122.01   3.131  0.00202 **
## PetrolPrice      -4423.33    2398.92  -1.844  0.06678 . 
## I(PetrolPrice^2) 18480.39   11679.10   1.582  0.11526   
## law                 63.24     292.79   0.216  0.82922   
## PetrolPrice:law   -692.24    2514.91  -0.275  0.78342   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.98 on 187 degrees of freedom
## Multiple R-squared:  0.1974, Adjusted R-squared:  0.1802 
## F-statistic:  11.5 on 4 and 187 DF,  p-value: 2.29e-08

In the full model, PetrolPrice falls just short of significance at the 95% level (p ≈ 0.067) and becomes significant once the quadratic term is removed (shown below). The positive quadratic coefficient would imply that the decline in fatalities flattens out at higher petrol prices, but it is not significant either. The dichotomous variable "law", indicating whether the compulsory seat-belt law was in effect, is surprisingly not significant here, yet it too becomes significant once the quadratic and interaction terms are dropped (shown below). The PetrolPrice:law interaction is likewise not significant, so there is no evidence that the effect of petrol price on driver deaths differs before and after the law.
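One way to see how borderline these estimates are is to look at their 95% confidence intervals; a minimal sketch using base R's confint():

confint(model, level = 0.95)  # intervals for all coefficients in the full model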

model2 <- lm(DriversKilled ~ PetrolPrice + law, data=df_uk)
summary(model2)
## 
## Call:
## lm(formula = DriversKilled ~ PetrolPrice + law, data = df_uk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.805 -17.280  -5.101  14.178  62.703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  190.591     15.236  12.509  < 2e-16 ***
## PetrolPrice -635.306    148.549  -4.277 3.01e-05 ***
## law          -16.326      5.556  -2.939  0.00371 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.01 on 189 degrees of freedom
## Multiple R-squared:  0.1866, Adjusted R-squared:  0.178 
## F-statistic: 21.68 on 2 and 189 DF,  p-value: 3.329e-09
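Before relying on the reduced model, a partial F-test can check whether dropping the quadratic and interaction terms loses any explanatory power. A minimal sketch comparing the two nested fits above:

anova(model2, model)  # H0: the quadratic and interaction terms add nothing jointly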
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a grid
plot(model2)

Nothing too alarming shows up in the residual plots. The Residuals vs Fitted plot does not suggest heteroskedasticity, the Q-Q plot shows the residuals are approximately normal, and the Residuals vs Leverage plot shows no high-leverage outliers.
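These visual checks could be backed up with formal tests; a minimal sketch (the Breusch-Pagan test requires the lmtest package, an extra dependency not loaded above):

library(lmtest)                   # for bptest()
bptest(model2)                    # Breusch-Pagan test for heteroskedasticity
shapiro.test(residuals(model2))   # Shapiro-Wilk test of residual normality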

Conclusion: the linear model was appropriate. The scatterplot shows a roughly linear negative relationship between petrol price and driver deaths, the coefficients in the reduced model are statistically significant, and the residual diagnostics show no serious violations of the model assumptions.