Airquality Dataset

I picked airquality, one of R's built-in datasets. It contains daily air quality measurements in New York recorded from May to September 1973. The model regresses Ozone (mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island) on Temp (maximum daily temperature in degrees Fahrenheit at La Guardia Airport).
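Before plotting, it may help to check how large the dataset is and how many values are missing. The following is a minimal sketch using only base R; the name `n_missing` is introduced here for illustration.

```r
# airquality ships with base R (datasets package), so nothing needs to be loaded
dim(airquality)                              # number of rows and columns
summary(airquality[, c("Ozone", "Temp")])    # ranges, quartiles, NA counts

# Count missing values in the two columns used by the model;
# lm() drops rows with a missing response under the default na.action
n_missing <- colSums(is.na(airquality[, c("Ozone", "Temp")]))
n_missing
```

Any days with a missing Ozone reading will not contribute to the regression, so the effective sample size is smaller than the number of rows.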

plot(airquality$Temp, airquality$Ozone, 
     xlab='Max Daily Temperature at La Guardia Airport (F)',
     ylab='Ozone Concentration at Roosevelt Island (ppb)',
     main='Temperature vs Ozone (May 1 - Sept 30, 1973)')

The scatter plot suggests a positive association: ozone tends to be higher on hotter days.
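One way to put a number on that visual impression is the Pearson correlation coefficient. This is a small sketch, not part of the original analysis; `use = "complete.obs"` restricts the calculation to days where both values were recorded, and `r` is just an illustrative name.

```r
# Pearson correlation between temperature and ozone,
# computed only on days where both measurements are present
r <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
r

# For a linear model with a single predictor, the squared correlation
# equals the model's Multiple R-squared
r^2
```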

# Simple linear regression model
model <- lm(airquality$Ozone ~ airquality$Temp)
summary(model)
## 
## Call:
## lm(formula = airquality$Ozone ~ airquality$Temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.729 -17.409  -0.587  11.306 118.271 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -146.9955    18.2872  -8.038 9.37e-13 ***
## airquality$Temp    2.4287     0.2331  10.418  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.71 on 114 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.4877, Adjusted R-squared:  0.4832 
## F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16

# Ozone/Temp plot with regression line
plot(airquality$Ozone ~ airquality$Temp, 
     xlab='Max Daily Temperature at La Guardia Airport (F)',
     ylab='Mean Ozone at Roosevelt Island (ppb)',
     main='Temperature vs Ozone (May 1 - Sept 30, 1973)')
abline(model)
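The fitted line can also be used to predict ozone at a given temperature. The sketch below refits the same regression with the `data =` form of the formula, because `predict()` can only match `newdata` columns when the formula uses bare variable names; with `airquality$Temp` in the formula, `newdata` is effectively ignored. The names `fit`, `new_days`, and `pred` and the chosen temperatures are illustrative assumptions.

```r
# Same model as above, written so that predict() can look up 'Temp' in newdata
fit <- lm(Ozone ~ Temp, data = airquality)

# Hypothetical temperatures (degrees F) at which to predict ozone
new_days <- data.frame(Temp = c(70, 80, 90))

# Point predictions with 95% prediction intervals (columns: fit, lwr, upr)
pred <- predict(fit, newdata = new_days, interval = "prediction")
pred

# 95% confidence intervals for the intercept and slope
confint(fit)
```

Using the coefficients printed in the summary, the prediction at 80 F works out to about -146.9955 + 2.4287 * 80, or roughly 47 ppb. The prediction intervals are wide, which is consistent with a residual standard error of 23.71.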

# Residuals (plotted against observation index)
plot(model$residuals, ylab='Residuals')
abline(h=0)

# Q-Q plot
qqnorm(model$residuals)
qqline(model$residuals)
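The residuals plot above uses the observation index on the x-axis. A common complement is to plot residuals against fitted values, which can make curvature or changing spread easier to see. This is a sketch under the assumption that the same regression is refit as `fit` (an illustrative name); the lowess smoother is an optional visual aid.

```r
# Refit the regression so this block runs on its own
fit <- lm(Ozone ~ Temp, data = airquality)

# Residuals against fitted values
plot(fitted(fit), resid(fit),
     xlab='Fitted Ozone (ppb)',
     ylab='Residuals',
     main='Residuals vs Fitted')
abline(h=0, lty=2)

# Smoothed trend through the residuals; a clearly curved line
# would hint that a straight-line model is missing some structure
lines(lowess(fitted(fit), resid(fit)), col='red')
```

Base R offers a similar built-in view through `plot(fit, which = 1)`.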

Conclusion

With an R-squared of about 0.49, the model explains almost 50% of the variability in ozone. In the residuals plot, the spread looks roughly constant with no obvious pattern. The Q-Q plot follows the reference line well except in both tails. Temperature is a statistically significant predictor of ozone concentration at all conventional significance levels (p < 2e-16). Overall, I think the linear model fits reasonably well. There are a few outliers, though (the largest residual is about 118 ppb), and the main plot suggests that a non-linear model may fit the data better.
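One simple way to explore the non-linear idea is to add a squared temperature term and compare the two nested models. This is only a sketch of one possible follow-up, not a result from the analysis above; the quadratic form and the names `lin` and `quad` are assumptions made for illustration.

```r
# Straight-line model and a quadratic alternative, fit on the same data
lin  <- lm(Ozone ~ Temp, data = airquality)
quad <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)

# F-test for whether the squared term improves the fit
anova(lin, quad)

# Adjusted R-squared penalizes the extra parameter, so it is a fairer comparison
c(linear    = summary(lin)$adj.r.squared,
  quadratic = summary(quad)$adj.r.squared)
```

A small p-value in the `anova()` table would support keeping the squared term; the residual plots for `quad` should then be checked in the same way as for the linear model.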