Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Air Quality Dataset - New York Air Quality Measurements
Data Visualization
# Summary statistics of the dataset
library(datasets)
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
pairs(airquality)

Modeling & Evaluation
# Creating the linear regression model (weather predictors Solar.R, Wind, and Temp; Month and Day are left out)
airq_lm <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

summary(airq_lm)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16
# Backward elimination

# Solar.R has the largest p-value (0.011) among the predictors, so it is dropped first,
# although it is still significant at the 5% level

airq_lm2 <- lm(Ozone ~ Wind + Temp, data = airquality)

summary(airq_lm2)
## 
## Call:
## lm(formula = Ozone ~ Wind + Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.251 -13.695  -2.856  11.390 100.367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -71.0332    23.5780  -3.013   0.0032 ** 
## Wind         -3.0555     0.6633  -4.607 1.08e-05 ***
## Temp          1.8402     0.2500   7.362 3.15e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.85 on 113 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.5611 
## F-statistic:  74.5 on 2 and 113 DF,  p-value: < 2.2e-16
# Adding a Wind-by-Temp interaction: Wind*Temp expands to Wind + Temp + Wind:Temp

airq_lm3 <- lm(Ozone ~ Wind * Temp, data = airquality)

summary(airq_lm3)
## 
## Call:
## lm(formula = Ozone ~ Wind * Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.906 -13.048  -2.263   8.726  99.306 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -248.51530   48.14038  -5.162 1.07e-06 ***
## Wind          14.33503    4.23874   3.382 0.000992 ***
## Temp           4.07575    0.58754   6.937 2.73e-10 ***
## Wind:Temp     -0.22391    0.05399  -4.147 6.57e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.44 on 112 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.6261, Adjusted R-squared:  0.6161 
## F-statistic: 62.52 on 3 and 112 DF,  p-value: < 2.2e-16
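# Removing rows with NAs: na.omit() drops any row with a missing value in any column,
# including rows missing only Solar.R, so the sample shrinks from 116 to 111 observations

airquality2 <- na.omit(airquality)
airq_lm4 <- lm(Ozone ~ Wind * Temp, data = airquality2)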

summary(airq_lm4)
## 
## Call:
## lm(formula = Ozone ~ Wind * Temp, data = airquality2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.930 -11.193  -3.034   8.193  97.456 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -239.8918    48.6200  -4.934 2.97e-06 ***
## Wind          13.5975     4.2835   3.174 0.001961 ** 
## Temp           4.0005     0.5935   6.741 8.26e-10 ***
## Wind:Temp     -0.2173     0.0545  -3.987 0.000123 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.37 on 107 degrees of freedom
## Multiple R-squared:  0.6355, Adjusted R-squared:  0.6253 
## F-statistic: 62.19 on 3 and 107 DF,  p-value: < 2.2e-16
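Because the model contains a Wind:Temp interaction, the Wind and Temp coefficients cannot be read on their own. The Wind coefficient (13.60) is the Wind slope at Temp = 0, far outside the observed range of 56 to 97. A minimal sketch of how the Wind slope could be evaluated at realistic temperatures follows; it refits airq_lm4 so the block is self-contained, and the helper name wind_slope is hypothetical, introduced here only for illustration:

```r
# Refit the interaction model on complete cases, as in airq_lm4 above
library(datasets)
airquality2 <- na.omit(airquality)
airq_lm4 <- lm(Ozone ~ Wind * Temp, data = airquality2)

b <- coef(airq_lm4)

# Slope of Ozone with respect to Wind at a given temperature:
#   d(Ozone)/d(Wind) = b["Wind"] + b["Wind:Temp"] * Temp
# (wind_slope is a hypothetical helper, not part of the original analysis)
wind_slope <- function(temp) unname(b["Wind"] + b["Wind:Temp"] * temp)

# Evaluate at the quartiles of Temp reported by summary(airquality)
temps <- c(72, 79, 85)
data.frame(Temp = temps, Wind_slope = wind_slope(temps))

# Temperature at which the Wind slope changes sign
unname(-b["Wind"] / b["Wind:Temp"])
```

Using the printed coefficients, the Wind slope is about 13.60 - 0.2173 * 72 = -2.0 at Temp = 72 and about -4.9 at Temp = 85, so higher wind is associated with lower Ozone across most of the observed temperatures, and more strongly on hotter days. The slope changes sign near Temp = 62.6. The Temp coefficient (4.00) is likewise the Temp slope at Wind = 0, and the intercept (-239.9) is the fitted Ozone at Wind = 0 and Temp = 0, which has no practical meaning.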
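Model 2: dropping Solar.R lowered R-squared from 0.6059 to 0.5687 (adjusted: 0.5948 to 0.5611). The two fits use different rows (111 vs. 116 observations), so the comparison is only approximate.
Model 3: adding the Wind:Temp interaction raised R-squared from 0.5687 to 0.6261 (adjusted: 0.5611 to 0.6161), and the interaction term is significant (p = 6.57e-05).
Model 4: refitting model 3 on the 111 complete cases gave an R-squared of 0.6355 (adjusted: 0.6253). This is the same model on a smaller sample, because the 5 rows missing only Solar.R are dropped, so the change from 0.6261 reflects the different rows rather than a better model.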
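R-squared values are only directly comparable when the models are fitted to the same observations. A minimal sketch of a like-for-like comparison, assuming all three specifications are refitted on the complete cases (the object names cc, m_full, m_add, and m_inter are introduced here for illustration):

```r
library(datasets)

# Use one common set of rows for every model
cc <- na.omit(airquality)

m_full  <- lm(Ozone ~ Solar.R + Wind + Temp, data = cc)
m_add   <- lm(Ozone ~ Wind + Temp, data = cc)
m_inter <- lm(Ozone ~ Wind * Temp, data = cc)

# Adjusted R-squared penalizes extra terms, so it is the fairer summary here
sapply(list(full = m_full, additive = m_add, interaction = m_inter),
       function(m) summary(m)$adj.r.squared)

# Nested F-test: does the interaction term improve on the additive model?
anova(m_add, m_inter)
```

The F-test applies only to nested models such as m_add and m_inter; m_full and m_inter are not nested, so for that pair adjusted R-squared or AIC() is the more appropriate comparison.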
Residual Analysis
plot(airq_lm4$fitted.values, airq_lm4$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0, col="red")

qqnorm(airq_lm4$residuals)
qqline(airq_lm4$residuals)

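# The residuals-vs-fitted plot shows roughly constant spread with no clearly defined pattern
# The Q-Q plot follows the reference line through the middle of the distribution but departs from it at the ends;
# together with the residual summary above (min -40.9, median -3.0, max 97.5), this suggests residuals that are
# approximately normal in the center with a heavier upper tail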
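The visual checks can be backed up with a formal test and with R's built-in diagnostic plots. A minimal sketch, refitting airq_lm4 so the block is self-contained; the choice of the Shapiro-Wilk test is an addition here, not part of the original analysis:

```r
library(datasets)
airquality2 <- na.omit(airquality)
airq_lm4 <- lm(Ozone ~ Wind * Temp, data = airquality2)

res <- residuals(airq_lm4)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
shapiro.test(res)

# Built-in diagnostic plots: residuals vs fitted (1), normal Q-Q (2),
# and scale-location (3), which helps reveal non-constant variance
par(mfrow = c(1, 3))
plot(airq_lm4, which = 1:3)
par(mfrow = c(1, 1))
```

If the test or the scale-location plot points to skewed residuals or non-constant variance, a transformed response such as log(Ozone) is one common alternative to try.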
Conclusion
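The interaction model fits reasonably well and explains about 63.5% of the variability in Ozone (R-squared = 0.6355). Both Wind and Temp are significantly associated with Ozone concentration, and the significant Wind:Temp interaction means the effect of each depends on the level of the other. Restricting the data to complete cases changed R-squared only slightly (0.6261 to 0.6355), which reflects a different set of rows rather than a real gain in performance. The large positive residuals (up to about 97) indicate that the linear model underpredicts the highest Ozone readings.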
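The prompt also asks for a quadratic term, a dichotomous term, and a dichotomous-by-quantitative interaction, none of which appear in the models above. A minimal sketch of one way to include them, assuming a hypothetical indicator Hot defined here as "warmer than the median Temp of the complete cases"; both the indicator and the object names aq and aq_lm are illustrative choices, not part of the original analysis:

```r
library(datasets)
aq <- na.omit(airquality)

# Hypothetical dichotomous predictor: 1 if the day is warmer than the
# median temperature of the complete cases, 0 otherwise
aq$Hot <- as.integer(aq$Temp > median(aq$Temp))

# I(Wind^2) is the quadratic term, Hot is the dichotomous term, and
# Hot:Wind is the dichotomous-by-quantitative interaction
aq_lm <- lm(Ozone ~ Wind + I(Wind^2) + Hot + Hot:Wind, data = aq)
summary(aq_lm)
```

In such a model the Hot coefficient would be the difference in fitted Ozone between hot and cooler days at Wind = 0, the interaction coefficient would be the difference in the linear Wind effect between the two groups, and the I(Wind^2) coefficient would describe the curvature of the Wind effect.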