Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Continuing on my work from discussion 11 on predicting average city high temperatures from average city low temperatures, I had mentioned that other factors were likely important. Examples include latitude and the region of the city within the US (northwest, northeast, southwest, southeast). I added latitude and CoastalState (vs. Not Coastal) to the data set and will use it for this discussion.
The data set containing average high and low temperatures for large US cities is here (https://www.currentresults.com/Weather/US/average-annual-temperatures-large-cities.php). The data for latitude came from here (https://en.wikipedia.org/wiki/List_of_cities_by_latitude). I used my knowledge to identify coastal and non-coastal states. Below is the head of the dataframe after being loaded from a CSV copy of the data. There are 51 US cities in the dataset.
## High..F Low..F City High..C Low..C LatitudeShort
## 1 72 53 Atlanta, Georgia 22 12 33
## 2 80 59 Austin, Texas 27 15 30
## 3 65 45 Baltimore, Maryland 18 7 39
## 4 74 53 Birmingham, Alabama 23 12 33
## 5 59 44 Boston, Massachusetts 15 7 42
## 6 56 40 Buffalo, New York 14 5 42
## CoastalState
## 1 1
## 2 0
## 3 1
## 4 1
## 5 1
## 6 1
Let’s create a multiple regression model to predict the average high temperature according to the specifications of the prompt.
#create needed variables
CityTemperatures$Low..F2<-(CityTemperatures$Low..F)^2
#create model
model<-lm(CityTemperatures$High..F ~ CityTemperatures$Low..F2 + CityTemperatures$CoastalState + CityTemperatures$CoastalState:CityTemperatures$LatitudeShort)
What do the coefficients mean? Roughly, that we start with an estimate of 46 degrees F. Then for every Low degree Farenheit squared, we add 0.009 degrees F. If the city is in a coastal state, we add 4.1 degrees F. Finally, if the city is a coastal state, we subtract 0.13 degrees for every degree increase in latitude (which makes sense since latitude increases as one goes north).
#summary of model
summary(model)
##
## Call:
## lm(formula = CityTemperatures$High..F ~ CityTemperatures$Low..F2 +
## CityTemperatures$CoastalState + CityTemperatures$CoastalState:CityTemperatures$LatitudeShort)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5423 -2.3619 0.1608 2.0266 9.4803
##
## Coefficients:
## Estimate
## (Intercept) 46.2150999
## CityTemperatures$Low..F2 0.0090770
## CityTemperatures$CoastalState 4.1086285
## CityTemperatures$CoastalState:CityTemperatures$LatitudeShort -0.1303418
## Std. Error
## (Intercept) 2.2609697
## CityTemperatures$Low..F2 0.0009545
## CityTemperatures$CoastalState 6.5759538
## CityTemperatures$CoastalState:CityTemperatures$LatitudeShort 0.1723673
## t value
## (Intercept) 20.440
## CityTemperatures$Low..F2 9.510
## CityTemperatures$CoastalState 0.625
## CityTemperatures$CoastalState:CityTemperatures$LatitudeShort -0.756
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## CityTemperatures$Low..F2 1.58e-12 ***
## CityTemperatures$CoastalState 0.535
## CityTemperatures$CoastalState:CityTemperatures$LatitudeShort 0.453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.468 on 47 degrees of freedom
## Multiple R-squared: 0.8406, Adjusted R-squared: 0.8304
## F-statistic: 82.61 on 3 and 47 DF, p-value: < 2.2e-16
Let’s do residual analysis. The residuals are pretty well distributed throughout the range of fitted values and are also pretty well distributed on both sides of the X axis. There is no obvious pattern that we missed, and while the tails are a bit off on a QQplot, they aren’t too bad, so the residuals appear to be roughly normally distributed. So the linear model assumptions have roughly been met.
#1-3
plot(model$fitted.values, model$residuals)
abline(0,0)
#4
qqnorm(model$residuals)
qqline(model$residuals)
Was the linear model appropriate? Yes, I’d say so. The R^2 is high at 0.83 and it meets the assumptions. However, only the low temperature variable is really doing any predictive work, which we can see from the p-value. It is the only p-value that is significant. So it would be wise to try other variable combinations to see if there is a better combination here.
Let’s try to adjust the model a little bit to see if we can get a better combination. Through a process of trial and error, we arrive at the below model, which uses Latitude and the interaction of the Low temperature with Latitude. The adjusted R^2 = 0.88, and the other statistics look good.
#create model
model_adj<-lm(CityTemperatures$High..F ~ CityTemperatures$LatitudeShort + CityTemperatures$Low..F:CityTemperatures$LatitudeShort)
#summary
summary(model_adj)
##
## Call:
## lm(formula = CityTemperatures$High..F ~ CityTemperatures$LatitudeShort +
## CityTemperatures$Low..F:CityTemperatures$LatitudeShort)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3864 -1.4059 -0.0928 1.1334 7.1210
##
## Coefficients:
## Estimate
## (Intercept) 93.336677
## CityTemperatures$LatitudeShort -1.549873
## CityTemperatures$LatitudeShort:CityTemperatures$Low..F 0.018128
## Std. Error t value
## (Intercept) 6.362809 14.669
## CityTemperatures$LatitudeShort 0.081776 -18.953
## CityTemperatures$LatitudeShort:CityTemperatures$Low..F 0.003205 5.656
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## CityTemperatures$LatitudeShort < 2e-16 ***
## CityTemperatures$LatitudeShort:CityTemperatures$Low..F 8.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.872 on 48 degrees of freedom
## Multiple R-squared: 0.8883, Adjusted R-squared: 0.8837
## F-statistic: 190.9 on 2 and 48 DF, p-value: < 2.2e-16
However, the residuals don’t appear to be as normally distributed as before, as we can see from looking at the QQplot. So this may not in fact be an improvement, even though the R^2 is a bit higher.
#1-3
plot(model_adj$fitted.values, model_adj$residuals)
abline(0,0)
#4
qqnorm(model_adj$residuals)
qqline(model_adj$residuals)