Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitaive interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
We’ll be looking at the air quality data that comes with R.
library(dplyr)
library(GGally)
library(car)
df <- airquality
# Separate Summer Months (June, July, August)
df$Summer <- ifelse(df$Month %in% c(6,7,8), 1, 0)
head(df)
## Ozone Solar.R Wind Temp Month Day Summer
## 1 41 190 7.4 67 5 1 0
## 2 36 118 8.0 72 5 2 0
## 3 12 149 12.6 74 5 3 0
## 4 18 313 11.5 62 5 4 0
## 5 NA NA 14.3 56 5 5 0
## 6 28 NA 14.9 66 5 6 0
It looks like there are missing values.
# exclude any rows that don't have ozone measurments
df <- df[!is.na(df$Ozone),] %>%
select(-Month, -Day)
# check for nulls in other columns
sum(is.na(df$Solar.R))
## [1] 5
sum(is.na(df$Wind))
## [1] 0
sum(is.na(df$Temp))
## [1] 0
# fill missing solar values with average
avg_solar <- mean(df$Solar.R, na.rm = T)
df[is.na(df$Solar.R),]$Solar.R <- avg_solar
Lets take a quick look at how the variables
ggpairs(df)
It looks like both Wind and Temp both show curves when comparing to Ozone – using a quadratic forumula would lead to a more accurate model.
Let’s create a linear regression model and check the results.
lm <- lm(Ozone ~ Solar.R + Wind + Temp + Summer, data = df)
summary(lm)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Summer, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.061 -13.698 -3.007 11.241 95.376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.86965 25.45140 -2.313 0.02256 *
## Solar.R 0.06218 0.02341 2.656 0.00907 **
## Wind -3.15869 0.65105 -4.852 4.02e-06 ***
## Temp 1.52076 0.30399 5.003 2.14e-06 ***
## Summer 4.25315 4.91478 0.865 0.38870
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.34 on 111 degrees of freedom
## Multiple R-squared: 0.5959, Adjusted R-squared: 0.5814
## F-statistic: 40.93 on 4 and 111 DF, p-value: < 2.2e-16
It looks like the Summer season affects the air quality, so we can remove that variable. Let’s also use quadratic terms for Wind and Temp.
powerTransform(select(df, -Summer))
## Estimated transformation parameters
## Ozone Solar.R Wind Temp
## 0.2709045 0.9675028 0.3723312 2.4243091
df$Wind2 <- df$Wind ^ 0.3723312
df$Temp2 <- df$Temp ^ 2.4243091
lm2 <- lm(Ozone ~ Solar.R + Wind2 + Temp2, data = df)
summary(lm2)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind2 + Temp2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.163 -12.500 -2.906 11.264 90.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.555e+01 2.110e+01 3.107 0.00240 **
## Solar.R 6.045e-02 2.184e-02 2.767 0.00661 **
## Wind2 -3.853e+01 6.810e+00 -5.657 1.19e-07 ***
## Temp2 1.376e-03 2.041e-04 6.745 7.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.04 on 112 degrees of freedom
## Multiple R-squared: 0.6405, Adjusted R-squared: 0.6309
## F-statistic: 66.52 on 3 and 112 DF, p-value: < 2.2e-16
Finally, let’s check the residuals.
plot(fitted(lm2), resid(lm2))
abline(h=0)
qqnorm(resid(lm2))
qqline(resid(lm2))
The model seems okay, but the residuals and qq plot show that the variance widens at higher ozone values.