Topic

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitaive interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Source

We’ll be looking at the air quality data that comes with R.

library(dplyr)
library(GGally)
library(car)

df <- airquality

# Separate Summer Months (June, July, August)
df$Summer <- ifelse(df$Month %in% c(6,7,8), 1, 0)

head(df)
##   Ozone Solar.R Wind Temp Month Day Summer
## 1    41     190  7.4   67     5   1      0
## 2    36     118  8.0   72     5   2      0
## 3    12     149 12.6   74     5   3      0
## 4    18     313 11.5   62     5   4      0
## 5    NA      NA 14.3   56     5   5      0
## 6    28      NA 14.9   66     5   6      0

It looks like there are missing values.

# exclude any rows that don't have ozone measurments
df <- df[!is.na(df$Ozone),] %>%
  select(-Month, -Day)

# check for nulls in other columns
sum(is.na(df$Solar.R))
## [1] 5
sum(is.na(df$Wind))
## [1] 0
sum(is.na(df$Temp))
## [1] 0
# fill missing solar values with average
avg_solar <- mean(df$Solar.R, na.rm = T)

df[is.na(df$Solar.R),]$Solar.R <- avg_solar

Lets take a quick look at how the variables

ggpairs(df)

It looks like both Wind and Temp both show curves when comparing to Ozone – using a quadratic forumula would lead to a more accurate model.


Let’s create a linear regression model and check the results.

lm <- lm(Ozone ~ Solar.R + Wind + Temp + Summer, data = df)
summary(lm)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp + Summer, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.061 -13.698  -3.007  11.241  95.376 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -58.86965   25.45140  -2.313  0.02256 *  
## Solar.R       0.06218    0.02341   2.656  0.00907 ** 
## Wind         -3.15869    0.65105  -4.852 4.02e-06 ***
## Temp          1.52076    0.30399   5.003 2.14e-06 ***
## Summer        4.25315    4.91478   0.865  0.38870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.34 on 111 degrees of freedom
## Multiple R-squared:  0.5959, Adjusted R-squared:  0.5814 
## F-statistic: 40.93 on 4 and 111 DF,  p-value: < 2.2e-16

It looks like the Summer season affects the air quality, so we can remove that variable. Let’s also use quadratic terms for Wind and Temp.

powerTransform(select(df, -Summer))
## Estimated transformation parameters 
##     Ozone   Solar.R      Wind      Temp 
## 0.2709045 0.9675028 0.3723312 2.4243091
df$Wind2 <- df$Wind ^ 0.3723312
df$Temp2 <- df$Temp ^ 2.4243091

lm2 <- lm(Ozone ~ Solar.R + Wind2 + Temp2, data = df)
summary(lm2)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind2 + Temp2, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.163 -12.500  -2.906  11.264  90.560 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.555e+01  2.110e+01   3.107  0.00240 ** 
## Solar.R      6.045e-02  2.184e-02   2.767  0.00661 ** 
## Wind2       -3.853e+01  6.810e+00  -5.657 1.19e-07 ***
## Temp2        1.376e-03  2.041e-04   6.745 7.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.04 on 112 degrees of freedom
## Multiple R-squared:  0.6405, Adjusted R-squared:  0.6309 
## F-statistic: 66.52 on 3 and 112 DF,  p-value: < 2.2e-16

Finally, let’s check the residuals.

plot(fitted(lm2), resid(lm2))
abline(h=0)

qqnorm(resid(lm2))
qqline(resid(lm2))

The model seems okay, but the residuals and qq plot show that the variance widens at higher ozone values.