We can use the Seatbelts dataset built into R. First, we
can set it up as a dataframe in R
# Load the dataset into a dataframe
df <- as.data.frame(Seatbelts)
We can use the law variable as our dichotomous value, as
this takes a value of 0 or 1, depending on if the seatbelt law was in
effect that month. We want to predict the value of
DriversKilled, which will be our dependent
variable. We can use the front variable as our
quadratice term, and the kms (distance driven) as our
quantitative term
# set up the number of front-seat passengers killed as our quantitative term
df$front2 <- df$front ** 2
Now, we can create our linear model
mlm <- lm(DriversKilled ~ front2 + kms + law, df)
summary(mlm)
##
## Call:
## lm(formula = DriversKilled ~ front2 + kms + law, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.748 -12.231 -1.382 13.969 41.662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.260e+01 8.944e+00 10.353 <2e-16 ***
## front2 6.050e-05 5.058e-06 11.963 <2e-16 ***
## kms -9.865e-04 5.095e-04 -1.936 0.0544 .
## law 6.193e+00 4.969e+00 1.246 0.2142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.86 on 188 degrees of freedom
## Multiple R-squared: 0.5126, Adjusted R-squared: 0.5048
## F-statistic: 65.91 on 3 and 188 DF, p-value: < 2.2e-16
The coefficients represent the weights for each term in our model. For each input variable, an increase of one “unit” will increase our dependent variable by \(9.260x10^1\). The \(R^2\) value tells us the amount of variability in our dependent variable that can be accounted for by our independent variables.
Our F-statistic compares our model to a model with no independent variables. Since our p-value for this F-statistic is less than the typical significance level (\(\alpha=0.05\)), we can accept the alternative hypothesis that our model outperforms a model with no independent variables
# Plotting our variables to see if there's a general linear relationship
df$x <- df$front2 + df$law + df$front
plot(df$x, df$DriversKilled)
The above plot looks to show a general linear relationship between our
dependent variable (
DriversKilled) and our independent
variables
We can use the plot.lm function to plot the needed
graphs for testing our regression assumptions.
plot(mlm)
Our QQ-plot is relatively linear from an eye test, and there does not appear to be any pattern in our plot of residuals vs fitted values as well.
hist(mlm$residuals)
This plot looks linear, but it’d be nice to quantify this a bit more. From last week’s discussion board, we can use a Schapiro-Wilk test to
shapiro.test(mlm$residuals)
##
## Shapiro-Wilk normality test
##
## data: mlm$residuals
## W = 0.99086, p-value = 0.2639
Our p-value is higher than our significance level of 0.05, so we can not reject the null hypothesis that our residuals are not normally distributed
From last week’s discussion, we can test the assumption of
homoscedasticity (equal variance of residuals for each value of our
predictor variables) with a Breusch-Pagan
test. We can use the bptest in R from the
lmtest library to test this assumption at a significance
level of \(\alpha=0.05\)
bptest(mlm)
##
## studentized Breusch-Pagan test
##
## data: mlm
## BP = 21.776, df = 3, p-value = 7.261e-05
Since our p-value is less than 0.05, we can assume homoscedasticity.