Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

# Load libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

For this assignment, I will use the built-in USArrests dataset to build a multiple regression model. The dataset contains arrest statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states. It also contains the percentage of each state's population living in urban areas.

I will try to predict the number of murder arrests from the urban population percentage and the number of rape arrests. I will also add a dichotomous variable and include it in the model.
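
Before modeling, a quick scatterplot matrix gives a feel for the pairwise relationships (a sketch; plot not shown):

pairs(USArrests)  # pairwise scatterplots of Murder, Assault, UrbanPop, and Rape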

# View data
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
# Check for missing data
USArrests[!complete.cases(USArrests),]
## [1] Murder   Assault  UrbanPop Rape    
## <0 rows> (or 0-length row.names)
str(USArrests)
## 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

We have no missing values in our dataset and all our columns are numeric. This should make it easy to build linear models.

We will now add a dichotomous variable to the data: whether or not a state is in the Southern US. The list of states considered part of the Southern US was sourced from Wikipedia.

southern_states <- c(
  "Alabama", "Arkansas", "Delaware", "Florida", "Georgia", 
  "Kentucky", "Louisiana", "Maryland", "Mississippi", "North Carolina", 
  "Oklahoma", "South Carolina", "Tennessee", "Texas", "Virginia", 
  "West Virginia"
)
USArrests$South <- ifelse(row.names(USArrests) %in% southern_states, 1, 0)
head(USArrests)
##            Murder Assault UrbanPop Rape South
## Alabama      13.2     236       58 21.2     1
## Alaska       10.0     263       48 44.5     0
## Arizona       8.1     294       80 31.0     0
## Arkansas      8.8     190       50 19.5     1
## California    9.0     276       91 40.6     0
## Colorado      7.9     204       78 38.7     0
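
As a quick sanity check, the indicator should flag every state in the list above (a sketch; the list contains 16 states, so the sum should be 16):

sum(USArrests$South)  # expect 16 flagged Southern states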

Build model

We fit a model with a quadratic term (UrbanPop²), a dichotomous term (South), and a dichotomous vs. quantitative interaction term (South:Rape):

model <- lm(Murder ~ UrbanPop + I(UrbanPop^2) + Rape + South + South:Rape, data = USArrests)
summary(model)
## 
## Call:
## lm(formula = Murder ~ UrbanPop + I(UrbanPop^2) + Rape + South + 
##     South:Rape, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4998 -1.7083 -0.3685  1.5834  5.8105 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.2743126  6.5504100   0.653    0.517    
## UrbanPop      -0.1191110  0.2081713  -0.572    0.570    
## I(UrbanPop^2)  0.0009314  0.0016046   0.580    0.565    
## Rape           0.2476449  0.0420637   5.887 4.94e-07 ***
## South          2.5811772  2.7819081   0.928    0.359    
## Rape:South     0.1542135  0.1244193   1.239    0.222    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.429 on 44 degrees of freedom
## Multiple R-squared:  0.7207, Adjusted R-squared:  0.689 
## F-statistic: 22.71 on 5 and 44 DF,  p-value: 3.436e-11

The model fits reasonably well: the adjusted R-squared of 0.69 means about 69% of the variance in murder arrests is explained by the predictors, and the overall F-test p-value (3.4e-11) indicates a statistically significant model. Interpreting the coefficients: the intercept (4.27) is the predicted murder arrest rate for a non-Southern state with UrbanPop and Rape both at zero, which lies well outside the range of the data. The UrbanPop (-0.119) and UrbanPop² (0.0009) coefficients are small and not significant, so urbanization contributes little here. The Rape coefficient (0.248) means that, in non-Southern states, each additional rape arrest per 100,000 residents is associated with roughly 0.25 additional murder arrests per 100,000. The South coefficient (2.58) shifts the predicted murder rate upward for Southern states at a given Rape level, and the Rape:South interaction (0.154) steepens the Rape slope in Southern states to about 0.40. However, Rape is the only term whose coefficient is statistically significant.
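
To make the interaction concrete, the region-specific Rape slopes can be recovered from the fitted coefficients (a quick sketch; the values follow from the summary above):

coefs <- coef(model)
coefs["Rape"]                        # Rape slope outside the South: ~0.248
coefs["Rape"] + coefs["Rape:South"]  # Rape slope in the South: ~0.248 + 0.154 ≈ 0.402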

Interestingly, refitting the model with Rape as the sole predictor loses most of the explanatory power, dropping the adjusted R-squared from 0.69 to 0.30.

test <- lm(Murder ~ Rape, data = USArrests)
summary(test)
## 
## Call:
## lm(formula = Murder ~ Rape, data = USArrests)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0900 -2.4025 -0.5731  1.7095  9.3949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.22367    1.28456   1.731   0.0899 .  
## Rape         0.26207    0.05544   4.727 2.03e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.635 on 48 degrees of freedom
## Multiple R-squared:  0.3176, Adjusted R-squared:  0.3034 
## F-statistic: 22.34 on 1 and 48 DF,  p-value: 2.031e-05
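
To check whether the dropped terms matter jointly, the two nested models could be compared with an F-test (a sketch; output not shown here):

anova(test, model)  # joint F-test of UrbanPop, UrbanPop^2, South, and Rape:South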

Residual analysis

We now evaluate the model’s assumptions by examining residuals.

# Diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
hist(model$residuals, main = "Histogram of residuals", xlab = "Residual")
plot(model$fitted.values, model$residuals,
     xlab = "Fitted values", ylab = "Residuals")  # check for patterns/heteroscedasticity
abline(h = 0)
qqnorm(model$residuals)  # check normality of residuals
qqline(model$residuals)
plot(model, which = 5)   # residuals vs. leverage with Cook's distance

The histogram and Q-Q plot show that the residuals are approximately normally distributed, with a slight right skew. The residuals vs. fitted values plot shows no obvious pattern or funneling, suggesting the residuals are homoscedastic and that no nonlinearity remains. The assumptions of linearity, independence, homoscedasticity, and normality therefore appear reasonably met, so the linear model is appropriate for this data. Since the UrbanPop terms contribute little, recreating the model without those variables might be a good idea.
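
A sketch of that reduced model, keeping the dichotomous and interaction terms but dropping the UrbanPop terms (results not shown):

reduced <- lm(Murder ~ Rape * South, data = USArrests)  # Rape*South expands to Rape + South + Rape:South
summary(reduced)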