Using R, build a simple linear model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I found a data set containing average high and low temperatures for large US cities here (https://www.currentresults.com/Weather/US/average-annual-temperatures-large-cities.php). Below is the head of the dataframe after being loaded from a CSV copy of the data. There are 51 US cities in the dataset.
## # A tibble: 6 x 5
## High_F Low_F City High_C Low_C
## <int> <int> <chr> <int> <int>
## 1 72 53 Atlanta, Georgia 22 12
## 2 80 59 Austin, Texas 27 15
## 3 65 45 Baltimore, Maryland 18 7
## 4 74 53 Birmingham, Alabama 23 12
## 5 59 44 Boston, Massachusetts 15 7
## 6 56 40 Buffalo, New York 14 5
Is there a linear relationship between average high and low temperatures for large US cities? Let’s use the average low temperature to predict the average high temperature using a simple linear model.
#fit the simple linear model
model <- lm(Discussion10_CityTemperatures$High_F ~ Discussion10_CityTemperatures$Low_F)
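The same kind of model can also be written with the `data` argument, which keeps the formula short and lets `predict()` work on new values. The sketch below is only an illustration: it uses the six rows shown in the head of the data frame above rather than all 51 cities, so its coefficients will not match the full model, and the names `temps` and `fit` are hypothetical.

```r
# First six rows of the dataset, copied from the head() output above.
# The full dataset has 51 cities, so these coefficients will differ
# from the ones reported for the full model.
temps <- data.frame(
  City   = c("Atlanta, Georgia", "Austin, Texas", "Baltimore, Maryland",
             "Birmingham, Alabama", "Boston, Massachusetts", "Buffalo, New York"),
  High_F = c(72, 80, 65, 74, 59, 56),
  Low_F  = c(53, 59, 45, 53, 44, 40)
)

# Same type of model, written with bare column names and a data argument.
fit <- lm(High_F ~ Low_F, data = temps)
coef(fit)

# Because the formula refers to columns by name, predict() can match
# a new data frame that has a Low_F column.
new_city <- data.frame(Low_F = 50)
predict(fit, newdata = new_city)
```

When the formula contains `Discussion10_CityTemperatures$Low_F` instead of a bare column name, `predict()` cannot match the predictor to a column in `newdata`, so the `data` form is usually the more convenient one for later prediction.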
Is this an appropriate model? I believe so.
First, we intuitively expect some sort of relationship between low and high temperatures in a city.
Second, a simple plot of the average low temperature vs. the average high temperature seems to indicate a linear relationship. As the low temperature goes up, the high temperature also goes up, and it seems that we could draw a straight line through the points that does a fairly good job of describing that relationship.
#plot low vs. high
plot(Discussion10_CityTemperatures$Low_F,Discussion10_CityTemperatures$High_F)
Third, when we actually create that simple model and look at its results, we see that it returns an \(r^2\) value of about 0.85 (0.84 adjusted), which means roughly that the average low temperature explains about 85% of the variance in the average high temperature. This is a pretty high \(r^2\) value for just a single variable. In other words, if we only knew the average low temperature and we had this model, we could do a pretty good job of estimating the average high temperature.
#summary of model
summary(model)
##
## Call:
## lm(formula = Discussion10_CityTemperatures$High_F ~ Discussion10_CityTemperatures$Low_F)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.052 -1.978 -0.187 1.910 8.873
##
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          19.9205     3.0035   6.632 2.46e-08 ***
## Discussion10_CityTemperatures$Low_F   0.9850     0.0597  16.501  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.322 on 49 degrees of freedom
## Multiple R-squared: 0.8475, Adjusted R-squared: 0.8444
## F-statistic: 272.3 on 1 and 49 DF, p-value: < 2.2e-16
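The figures in this summary can be cross-checked by hand. The sketch below only re-uses numbers printed in the output above; the variable names are hypothetical, and the city with an average low of 50°F is an invented input for illustration.

```r
# Values copied from the summary() output above.
intercept <- 19.9205
slope     <- 0.9850
t_slope   <- 16.501
f_stat    <- 272.3
df_resid  <- 49

# In simple linear regression, the F-statistic is the square of the
# slope's t value.
t_slope^2                      # approximately 272.3

# With one predictor, R-squared can be recovered from F and the
# residual degrees of freedom: R^2 = F / (F + df_resid).
r_squared <- f_stat / (f_stat + df_resid)
r_squared                      # approximately 0.8475

# Point estimate of the average high for a hypothetical city whose
# average low is 50 degrees F.
estimate_50 <- intercept + slope * 50
estimate_50                    # 69.1705
```

The slope of 0.9850 says that each additional degree of average low temperature is associated with roughly one additional degree of average high temperature, and the residual standard error of 3.322 gives a sense of how far individual cities typically fall from the line.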
Finally, a plot of the residuals shows that they are scattered fairly randomly around zero, which suggests that there isn’t an underlying pattern in the data that the linear model is missing. Thus, a linear model is a reasonable fit for the data.
#residuals vs. fitted values, with a reference line at zero
plot(model$fitted.values, model$residuals)
abline(h = 0, lty = 2)
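A residual check is usually drawn against the fitted values (or the predictor) rather than the observed response. Least-squares residuals are uncorrelated with the fitted values by construction, but they are always positively correlated with the observed response, so a plot against the response shows an upward drift even when the model is adequate. The sketch below illustrates this with only the six rows shown in the head of the data frame, so it is not the full 51-city analysis; `temps`, `fit`, and `res` are hypothetical names.

```r
# First six rows of the dataset, copied from the head() output above.
temps <- data.frame(
  High_F = c(72, 80, 65, 74, 59, 56),
  Low_F  = c(53, 59, 45, 53, 44, 40)
)

fit <- lm(High_F ~ Low_F, data = temps)
res <- resid(fit)

# Residuals are uncorrelated with the fitted values
# (zero up to floating-point error).
cor(res, fitted(fit))

# Residuals are correlated with the observed response; in simple
# linear regression this correlation equals sqrt(1 - R^2).
cor(res, temps$High_F)
sqrt(1 - summary(fit)$r.squared)

# Residuals vs. fitted values: look for curvature or a funnel shape.
plot(fitted(fit), res,
     xlab = "Fitted average high (F)", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly
# normal residuals.
qqnorm(res)
qqline(res)
```

With all 51 cities, the same two plots would show whether the spread of the residuals stays roughly constant across the fitted values and whether the largest residuals (about -7 and +9 in the summary) stand out as unusual.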
In short, the linear model is an appropriate model for this data. Of course, we could make it better by using other variables to predict the average high temperature. Perhaps latitude, the region of the city within the US (northwest, northeast, southwest, southeast), or other factors (plains, mountains) would be important. But this is a good start for predicting the average high temperature in terms of other variables.