Data 605 Discussion week 12

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Data:

New York Air Quality Measurements

Ozone    numeric  Ozone (ppb)
Solar.R  numeric  Solar R (lang)
Wind     numeric  Wind (mph)
Temp     numeric  Temperature (degrees F)
Month    numeric  Month (1–12)
Day      numeric  Day of month (1–31)

# airquality is a built-in R dataset (datasets package), so no file import is needed
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
pairs(airquality)
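
The summary reports 37 missing values in Ozone and 7 in Solar.R, and lm() drops any row that has a missing value in a model variable. A minimal sketch of counting the rows that remain, assuming the four variables used in the regression below (the names `model_vars`, `n_complete`, and `n_dropped` are illustrative, not part of the original analysis):

```r
# Variables used in the regression (illustrative helper name)
model_vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# complete.cases() flags rows with no NA in the selected columns
n_complete <- sum(complete.cases(airquality[, model_vars]))
n_dropped  <- nrow(airquality) - n_complete

cat("rows used:", n_complete, " rows dropped:", n_dropped, "\n")
```

The dropped count should agree with the "observations deleted due to missingness" line in the model summary.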

Multiple Regression Model

The response is Ozone (ppb), a numeric variable.

One quadratic term: Temp^2

# create quadratic term
temp_squared <- (airquality$Temp)^2

# create model
airqualitylm <- lm(airquality$Ozone ~ temp_squared + airquality$Solar.R + airquality$Wind, data = airquality)
summary(airqualitylm)
## 
## Call:
## lm(formula = airquality$Ozone ~ temp_squared + airquality$Solar.R + 
##     airquality$Wind, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.830 -13.799  -3.252  10.089  96.996 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -5.718665  14.186985  -0.403   0.6877    
## temp_squared        0.011208   0.001621   6.916 3.53e-10 ***
## airquality$Solar.R  0.059356   0.022733   2.611   0.0103 *  
## airquality$Wind    -3.218499   0.644853  -4.991 2.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.81 on 107 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.6195, Adjusted R-squared:  0.6089 
## F-statistic: 58.08 on 3 and 107 DF,  p-value: < 2.2e-16
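
The call above mixes `airquality$` references with `data = airquality`. A sketch of an equivalent specification, assuming the same three predictors, builds the quadratic term inside the formula with `I()` and refers to columns by name (`fit_a` and `fit_b` are illustrative names). Writing the model this way also lets predict() accept new data.

```r
# Original style: external vector plus airquality$ references
temp_squared <- airquality$Temp^2
fit_a <- lm(airquality$Ozone ~ temp_squared + airquality$Solar.R + airquality$Wind)

# Formula style: I() protects the arithmetic, columns are looked up in `data`
fit_b <- lm(Ozone ~ I(Temp^2) + Solar.R + Wind, data = airquality)

# The two fits should give the same estimates (only the labels differ)
print(all.equal(unname(coef(fit_a)), unname(coef(fit_b))))
```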

The fitted model is: \(Ozone = -5.718665 + 0.011208 \cdot Temp^{2} + 0.059356 \cdot Solar.R - 3.218499 \cdot Wind\)
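
As a worked example of reading the equation, the sketch below plugs in one hypothetical day (Temp = 80, Solar.R = 200, Wind = 10, values chosen only for illustration) using the rounded coefficients printed in the summary output.

```r
# Coefficients copied from the summary output
b0      <- -5.718665
b_temp2 <-  0.011208
b_solar <-  0.059356
b_wind  <- -3.218499

# Hypothetical inputs, for illustration only
temp  <- 80
solar <- 200
wind  <- 10

ozone_hat <- b0 + b_temp2 * temp^2 + b_solar * solar + b_wind * wind
print(round(ozone_hat, 1))  # about 45.7 ppb
```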

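All three predictors are statistically significant predictors of Ozone at the 0.05 level (p = 3.53e-10 for Temp squared, 0.0103 for Solar.R, and 2.34e-06 for Wind); the intercept is not (p = 0.6877). Holding the other predictors fixed, each additional mph of wind is associated with about 3.22 ppb less ozone, and each additional unit of solar radiation with about 0.059 ppb more. Because temperature enters as a square, its effect depends on the temperature itself: the slope is roughly 2 * 0.011208 * Temp, or about 1.79 ppb per degree at 80 degrees F. The Multiple R-squared of 0.6195 means the model explains about 62% of the variance in Ozone (Adjusted R-squared 0.6089, about 61%).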
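
The prompt also asks for a dichotomous term and a dichotomous-by-quantitative interaction, which the model above does not include. A minimal sketch of one way to add them, assuming a hypothetical indicator `hot` (1 when Temp is above its median, 0 otherwise) and its interaction with Wind. The choice of indicator and the names `aq` and `fit_ext` are assumptions made for illustration, not part of the original analysis.

```r
# Work on a copy so the built-in dataset is left untouched
aq <- airquality

# Hypothetical dichotomous term: 1 if hotter than the median day, else 0
aq$hot <- as.integer(aq$Temp > median(aq$Temp))

# Wind * hot expands to Wind + hot + Wind:hot (the interaction term)
fit_ext <- lm(Ozone ~ I(Temp^2) + Solar.R + Wind * hot, data = aq)

summary(fit_ext)
```

In such a model the `hot` coefficient would shift the intercept for hot days, and the `Wind:hot` coefficient would be the difference in the Wind slope between hot and other days.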

Residual Analysis

plot(airqualitylm$fitted.values, airqualitylm$residuals)
abline(0,0)

# qqplot
qqnorm(airqualitylm$residuals)
qqline(airqualitylm$residuals)

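The residuals are not uniformly scattered around zero in the residuals versus fitted plot, and the points on the Q-Q plot deviate from the reference line at the lower and upper quantiles. The residuals therefore do not look random or approximately normal, which suggests the linear model as specified is not fully appropriate for these data.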
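
A numeric check can complement the visual reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test to the residuals. This test is one common choice and was not run in the original analysis; the model is refit here (as `fit`, an illustrative name) so the block runs on its own.

```r
# Refit the same model so this block is self-contained
fit <- lm(Ozone ~ I(Temp^2) + Solar.R + Wind, data = airquality)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
print(sw)
```

A small p-value would count as evidence against normally distributed residuals, in line with the departures seen in the Q-Q plot.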