Multiple Regression

We can use the Seatbelts dataset built into R. First, we can set it up as a dataframe in R

# Load the dataset into a dataframe
df <- as.data.frame(Seatbelts)

We can use the law variable as our dichotomous value, as this takes a value of 0 or 1, depending on if the seatbelt law was in effect that month. We want to predict the value of DriversKilled, which will be our dependent variable. We can use the front variable as our quadratice term, and the kms (distance driven) as our quantitative term

# set up the number of front-seat passengers killed as our quantitative term
df$front2 <- df$front ** 2

Now, we can create our linear model

mlm <- lm(DriversKilled ~ front2 + kms + law, df)

summary(mlm)
## 
## Call:
## lm(formula = DriversKilled ~ front2 + kms + law, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.748 -12.231  -1.382  13.969  41.662 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.260e+01  8.944e+00  10.353   <2e-16 ***
## front2       6.050e-05  5.058e-06  11.963   <2e-16 ***
## kms         -9.865e-04  5.095e-04  -1.936   0.0544 .  
## law          6.193e+00  4.969e+00   1.246   0.2142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.86 on 188 degrees of freedom
## Multiple R-squared:  0.5126, Adjusted R-squared:  0.5048 
## F-statistic: 65.91 on 3 and 188 DF,  p-value: < 2.2e-16

The coefficients represent the weights for each term in our model. For each input variable, an increase of one “unit” will increase our dependent variable by \(9.260x10^1\). The \(R^2\) value tells us the amount of variability in our dependent variable that can be accounted for by our independent variables.

Our F-statistic compares our model to a model with no independent variables. Since our p-value for this F-statistic is less than the typical significance level (\(\alpha=0.05\)), we can accept the alternative hypothesis that our model outperforms a model with no independent variables

Testing our Regression Assumptions

# Plotting our variables to see if there's a general linear relationship
df$x <- df$front2 + df$law + df$front

plot(df$x, df$DriversKilled)

The above plot looks to show a general linear relationship between our dependent variable (DriversKilled) and our independent variables

We can use the plot.lm function to plot the needed graphs for testing our regression assumptions.

plot(mlm)

Our QQ-plot is relatively linear from an eye test, and there does not appear to be any pattern in our plot of residuals vs fitted values as well.

Normality of residuals

hist(mlm$residuals)

This plot looks linear, but it’d be nice to quantify this a bit more. From last week’s discussion board, we can use a Schapiro-Wilk test to

shapiro.test(mlm$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mlm$residuals
## W = 0.99086, p-value = 0.2639

Our p-value is higher than our significance level of 0.05, so we can not reject the null hypothesis that our residuals are not normally distributed

Homoscedasticity

From last week’s discussion, we can test the assumption of homoscedasticity (equal variance of residuals for each value of our predictor variables) with a Breusch-Pagan test. We can use the bptest in R from the lmtest library to test this assumption at a significance level of \(\alpha=0.05\)

bptest(mlm)
## 
##  studentized Breusch-Pagan test
## 
## data:  mlm
## BP = 21.776, df = 3, p-value = 7.261e-05

Since our p-value is less than 0.05, we can assume homoscedasticity.