Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
# Summary statistics of the dataset
library(datasets)
summary(airquality)
##      Ozone           Solar.R           Wind             Temp
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
##  NA's   :37       NA's   :7
##      Month            Day
##  Min.   :5.000   Min.   : 1.0
##  1st Qu.:6.000   1st Qu.: 8.0
##  Median :7.000   Median :16.0
##  Mean   :6.993   Mean   :15.8
##  3rd Qu.:8.000   3rd Qu.:23.0
##  Max.   :9.000   Max.   :31.0
##
pairs(airquality)
# Fitting a linear regression model with the three weather predictors (Solar.R, Wind, Temp); Month and Day are not included
airq_lm <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
summary(airq_lm)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -40.485 -14.219  -3.551  10.097  95.619
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208   23.05472  -2.791  0.00623 **
## Solar.R       0.05982    0.02319   2.580  0.01124 *
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
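Each coefficient in the additive model above can be read as the change in predicted Ozone for a one-unit increase in that predictor, with the other two held fixed. The sketch below checks this reading numerically for Wind. The reference day `base_day` and its values are made up purely for illustration, and `fit` is just a local refit of the same formula.

```r
library(datasets)

# Refit the additive model shown above
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Hypothetical reference day (values chosen only for illustration)
base_day  <- data.frame(Solar.R = 200, Wind = 10, Temp = 80)
windy_day <- transform(base_day, Wind = Wind + 1)

# Change in predicted Ozone for one extra unit of Wind, other predictors fixed
wind_effect <- unname(predict(fit, windy_day) - predict(fit, base_day))
wind_effect           # matches the Wind coefficient
coef(fit)["Wind"]

# 95% confidence intervals for all coefficients
confint(fit)
```

The same reading applies to Solar.R and Temp. The intercept is the predicted Ozone when all three predictors are zero, which lies far outside the observed data and so has little practical meaning here.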
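# Backward elimination
# Dropping Solar.R, which has the largest p-value (0.011) among the predictors, although it is still significant at the 0.05 level
# Note: this fit uses 116 rows instead of 111, because rows missing only Solar.R are no longer dropped
airq_lm2 <- lm(Ozone ~ Wind + Temp, data = airquality)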
summary(airq_lm2)
##
## Call:
## lm(formula = Ozone ~ Wind + Temp, data = airquality)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -41.251 -13.695  -2.856  11.390 100.367
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -71.0332    23.5780  -3.013   0.0032 **
## Wind         -3.0555     0.6633  -4.607 1.08e-05 ***
## Temp          1.8402     0.2500   7.362 3.15e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.85 on 113 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.5687, Adjusted R-squared: 0.5611
## F-statistic: 74.5 on 2 and 113 DF, p-value: < 2.2e-16
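The two fits above are not based on the same rows (42 versus 37 observations dropped), so their R-squared values are not directly comparable. A minimal sketch of a like-for-like comparison follows. It assumes complete cases are an acceptable basis for both models; the names `aq_complete`, `full`, `reduced`, and `cmp` are illustrative.

```r
library(datasets)

# Keep only rows with no missing values so both models use the same observations
aq_complete <- na.omit(airquality)

full    <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq_complete)
reduced <- lm(Ozone ~ Wind + Temp, data = aq_complete)

# Partial F-test: does Solar.R explain variation beyond Wind and Temp?
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared of each model on the same rows
c(full = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

With a single term removed, the partial F-test gives the same p-value as the t-test for Solar.R in the full model, so it does not by itself support dropping the variable at the 0.05 level.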
# Adding a Wind-by-Temp interaction (Wind*Temp expands to Wind + Temp + Wind:Temp)
airq_lm3 <- lm(Ozone ~ Wind * Temp, data = airquality)
summary(airq_lm3)
##
## Call:
## lm(formula = Ozone ~ Wind * Temp, data = airquality)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -39.906 -13.048  -2.263   8.726  99.306
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -248.51530   48.14038  -5.162 1.07e-06 ***
## Wind          14.33503    4.23874   3.382 0.000992 ***
## Temp           4.07575    0.58754   6.937 2.73e-10 ***
## Wind:Temp     -0.22391    0.05399  -4.147 6.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.44 on 112 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.6261, Adjusted R-squared: 0.6161
## F-statistic: 62.52 on 3 and 112 DF, p-value: < 2.2e-16
# Refitting on complete cases only: na.omit() drops every row with an NA in any column, leaving 111 rows
airquality2 <- na.omit(airquality)
airq_lm4 <- lm(Ozone ~ Wind * Temp, data = airquality2)
summary(airq_lm4)
##
## Call:
## lm(formula = Ozone ~ Wind * Temp, data = airquality2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -40.930 -11.193  -3.034   8.193  97.456
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -239.8918    48.6200  -4.934 2.97e-06 ***
## Wind          13.5975     4.2835   3.174 0.001961 **
## Temp           4.0005     0.5935   6.741 8.26e-10 ***
## Wind:Temp     -0.2173     0.0545  -3.987 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.37 on 107 degrees of freedom
## Multiple R-squared: 0.6355, Adjusted R-squared: 0.6253
## F-statistic: 62.19 on 3 and 107 DF, p-value: < 2.2e-16
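With the interaction in the model, the Wind and Temp coefficients are no longer stand-alone slopes: the fitted Temp slope is the Temp coefficient plus the Wind:Temp coefficient times the wind speed, and likewise for Wind. The sketch below evaluates the Temp slope at the quartiles of Wind. The helper `temp_slope` and the names `fit_int`, `wind_q`, and `wind_zero` are introduced here for illustration only.

```r
library(datasets)

airquality2 <- na.omit(airquality)
fit_int <- lm(Ozone ~ Wind * Temp, data = airquality2)
b <- coef(fit_int)

# Slope of Ozone with respect to Temp at a given wind speed:
#   b["Temp"] + b["Wind:Temp"] * wind
temp_slope <- function(wind) unname(b["Temp"] + b["Wind:Temp"] * wind)

# Evaluate at the quartiles of Wind in the complete-case data
wind_q <- quantile(airquality2$Wind, c(0.25, 0.5, 0.75))
data.frame(Wind = wind_q, Temp_slope = temp_slope(wind_q))

# Wind speed at which the fitted Temp slope crosses zero
wind_zero <- unname(-b["Temp"] / b["Wind:Temp"])
wind_zero
```

Using the estimates printed above, the Temp slope is about 4.00 at zero wind and shrinks by about 0.22 for each additional unit of Wind, reaching zero near a wind speed of 18.4, which is close to the upper end of the observed range.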
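# Residual analysis: residuals vs. fitted values, then a normal Q-Q plot
plot(fitted(airq_lm4), resid(airq_lm4), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
qqnorm(resid(airq_lm4))
qqline(resid(airq_lm4))
# The residuals plot shows relatively constant variability with no clearly defined pattern
# The Q-Q plot shows the residuals following the theoretical line closely except at the ends, which suggests they are approximately normal; the residual summary (min -40.9, max 97.5) points to a longer right tail
# On this evidence the linear model looks broadly appropriate, with the tail departures as the main caveat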
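The models above use only quantitative predictors, whereas the prompt asks for a quadratic term, a dichotomous term, and a dichotomous-by-quantitative interaction. One possible specification is sketched below. The indicator `MidSummer` (1 for July and August, 0 otherwise), the choice to square Wind, and the model name `airq_quad` are all assumptions made for illustration and are not part of the original analysis.

```r
library(datasets)

airquality2 <- na.omit(airquality)

# Hypothetical dichotomous predictor: 1 for July-August, 0 otherwise
airquality2$MidSummer <- as.integer(airquality2$Month %in% c(7, 8))

# Quadratic term: I(Wind^2)
# Dichotomous term: MidSummer
# Dichotomous-by-quantitative interaction: Temp:MidSummer
airq_quad <- lm(Ozone ~ Wind + I(Wind^2) + Temp + MidSummer + Temp:MidSummer,
                data = airquality2)
summary(airq_quad)

# Numerical companion to the Q-Q plot: Shapiro-Wilk test on the residuals
shapiro.test(resid(airq_quad))
```

In such a model the `MidSummer` coefficient shifts the intercept for July and August, the `Temp:MidSummer` coefficient is the difference in the Temp slope between those months and the rest, and the sign of the `I(Wind^2)` coefficient indicates whether the Wind effect flattens or steepens as wind speed increases.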