Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
# Load libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
For this assignment, I will use the built-in dataset
USArrests to build a multiple regression model. The dataset
contains statistics, in arrests per 100,000 residents for assault,
murder and rape in each of the US states. It also contains the
percentage of the population living in Urban Areas.
I will try to predict the amount of arrests for murder based on the Urban Population Percentage and the amount of arrests due to rape. We will also add a dichotomous variable and include that in our model.
# View data
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
# Check for missing data
USArrests[!complete.cases(USArrests),]
## [1] Murder Assault UrbanPop Rape
## <0 rows> (or 0-length row.names)
str(USArrests)
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
We have no missing values in our dataset and all our columns are numeric. This should make it easy to build linear models.
We will now add a dichotomous variable to our data to use in the model. We will add whether a State is in the South of the US or not. Whether a State is considered part of the Southern US was sourced from Wikipedia.
southern_states <- c(
"Alabama", "Arkansas", "Delaware", "Florida", "Georgia",
"Kentucky", "Louisiana", "Maryland", "Mississippi", "North Carolina",
"Oklahoma", "South Carolina", "Tennessee", "Texas", "Virginia",
"West Virginia"
)
USArrests$South <- ifelse(row.names(USArrests) %in% southern_states, 1, 0)
head(USArrests)
## Murder Assault UrbanPop Rape South
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 0
## Arizona 8.1 294 80 31.0 0
## Arkansas 8.8 190 50 19.5 1
## California 9.0 276 91 40.6 0
## Colorado 7.9 204 78 38.7 0
model <- lm(Murder ~ UrbanPop + I(UrbanPop^2) + Rape + South + South:Rape, data = USArrests)
summary(model)
##
## Call:
## lm(formula = Murder ~ UrbanPop + I(UrbanPop^2) + Rape + South +
## South:Rape, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4998 -1.7083 -0.3685 1.5834 5.8105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2743126 6.5504100 0.653 0.517
## UrbanPop -0.1191110 0.2081713 -0.572 0.570
## I(UrbanPop^2) 0.0009314 0.0016046 0.580 0.565
## Rape 0.2476449 0.0420637 5.887 4.94e-07 ***
## South 2.5811772 2.7819081 0.928 0.359
## Rape:South 0.1542135 0.1244193 1.239 0.222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.429 on 44 degrees of freedom
## Multiple R-squared: 0.7207, Adjusted R-squared: 0.689
## F-statistic: 22.71 on 5 and 44 DF, p-value: 3.436e-11
The model is pretty good. It has an adjusted r-squared of 0.69, meaning 69% of the variance is explained by these factors. It has a very low p-value as well, indicating a statistically significant model. The coefficients of both UrbanPop and and the quadratic termof UrbanPop are very low and do not contribute much to the target variable. Rape, South and the interaction of the two all have relatively large coefficients. However, Rape on its own is the only variable with a statistically very significant p-value.
Interestingly, only the Rape variable has a statistically significant coefficient but recreating the model using only the Rape variable loses most of the accuracy of the model and lowers the r-squared value to 0.30.
test <- lm(Murder ~ Rape, data = USArrests)
summary(test)
##
## Call:
## lm(formula = Murder ~ Rape, data = USArrests)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0900 -2.4025 -0.5731 1.7095 9.3949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.22367 1.28456 1.731 0.0899 .
## Rape 0.26207 0.05544 4.727 2.03e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.635 on 48 degrees of freedom
## Multiple R-squared: 0.3176, Adjusted R-squared: 0.3034
## F-statistic: 22.34 on 1 and 48 DF, p-value: 2.031e-05
We now evaluate the model’s assumptions by examining residuals.
par(mfrow=c(2,2))
hist(model$residuals)
plot(model$fitted.values, model$residuals)
abline(h=0)
qqnorm(model$residuals)
qqline(model$residuals)
plot(model, which = 5)
The histogram and the Q-Q plot of the residuals show that the residuals are pretty normally distributed, albeit with a bit of a skew. The residuals vs fitted values plot shows that the residuals are homoscedastic, and the residuals are independent. The assumptions of linearity, independence, homoscedasticity, and normality are met. We can conclude that the transformed linear regression model is a much better fit for the data. However, the transformation of the variables makes it difficult to interpret the coefficients of the model and how to make predictions on new data. Since the UrbanPop coefficients don’t seem to be contributing much, recreating the model without those variables might be a good idea.