Discussion Prompt

Using R, build a simple linear model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Discussion Solution

I found a dataset of average high and low temperatures for large US cities (https://www.currentresults.com/Weather/US/average-annual-temperatures-large-cities.php). Below is the head of the data frame after loading it from a CSV copy of the data. The dataset covers 51 US cities.

## # A tibble: 6 x 5
##   High_F Low_F                  City High_C Low_C
##    <int> <int>                 <chr>  <int> <int>
## 1     72    53      Atlanta, Georgia     22    12
## 2     80    59         Austin, Texas     27    15
## 3     65    45   Baltimore, Maryland     18     7
## 4     74    53   Birmingham, Alabama     23    12
## 5     59    44 Boston, Massachusetts     15     7
## 6     56    40     Buffalo, New York     14     5

Is there a linear relationship between average high and low temperatures for large US cities? Let’s use the average low temperature to predict the average high temperature using a simple linear model.

#fit simple linear model: high ~ low
model <- lm(Discussion10_CityTemperatures$High_F ~ Discussion10_CityTemperatures$Low_F)
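As a side note, the same kind of model can be written with the `data` argument of `lm()`, which keeps the coefficient names short and lets `predict()` accept new values. The sketch below is only an illustration: `temps` and `fit6` are hypothetical names, and the data are just the six rows shown in the head above, not the full 51-city dataset, so its estimates will differ from the real model.

```r
# Stand-in data: only the six rows printed above, not the full dataset
temps <- data.frame(
  High_F = c(72, 80, 65, 74, 59, 56),
  Low_F  = c(53, 59, 45, 53, 44, 40)
)

# Same model form, written with the data argument instead of $ lookups
fit6 <- lm(High_F ~ Low_F, data = temps)

# Coefficient names are now simply "(Intercept)" and "Low_F"
coef_names <- names(coef(fit6))

# Because the formula uses bare column names, predict() can take new data
p <- predict(fit6, newdata = data.frame(Low_F = 50))
```

With the `$` form of the formula, `predict()` cannot match a `newdata` column to the predictor, which is the main practical reason to prefer the `data` argument.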

Is this an appropriate model? I believe so.

First, we intuitively believe that there is some sort of relationship between low and high temperatures in a city.

Second, a simple plot of the average low temperature vs. the average high temperature seems to indicate a linear relationship. As the low temperature goes up, the high temperature also goes up, and it seems that we could draw a straight line through the points that does a fairly good job of explaining that relationship.

#plot low vs. high
plot(Discussion10_CityTemperatures$Low_F, Discussion10_CityTemperatures$High_F)

Third, when we actually create that simple model and look at its results, we see that it returns an \(r^2\) value of about 0.85 (0.8475; adjusted 0.8444), which means roughly that the average low temperature explains about 85% of the variance in the average high temperature. This is a pretty high \(r^2\) value for just a single variable. In other words, if we only knew the low temperature value and we had this model, we could do a pretty good job of estimating the average high temperature value.
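To make "explains the variance" concrete, here is a minimal sketch of how \(r^2\) is computed: one minus the ratio of the residual sum of squares to the total sum of squares. It assumes a stand-in data frame (`temps`, a hypothetical name) holding only the six rows printed in the head of the data, so the number it yields is not the value for the full dataset.

```r
# Stand-in data: the six rows printed in the head of the data frame, not all 51 cities
temps <- data.frame(
  High_F = c(72, 80, 65, 74, 59, 56),
  Low_F  = c(53, 59, 45, 53, 44, 40)
)
fit6 <- lm(High_F ~ Low_F, data = temps)

# r^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(resid(fit6)^2)
ss_tot <- sum((temps$High_F - mean(temps$High_F))^2)
r2_manual <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
agrees <- isTRUE(all.equal(r2_manual, summary(fit6)$r.squared))
```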

#summary of model
summary(model)
## 
## Call:
## lm(formula = Discussion10_CityTemperatures$High_F ~ Discussion10_CityTemperatures$Low_F)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.052 -1.978 -0.187  1.910  8.873 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          19.9205     3.0035   6.632 2.46e-08
## Discussion10_CityTemperatures$Low_F   0.9850     0.0597  16.501  < 2e-16
##                                        
## (Intercept)                         ***
## Discussion10_CityTemperatures$Low_F ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.322 on 49 degrees of freedom
## Multiple R-squared:  0.8475, Adjusted R-squared:  0.8444 
## F-statistic: 272.3 on 1 and 49 DF,  p-value: < 2.2e-16
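The coefficients in this output can be read as a prediction rule: predicted high = 19.9205 + 0.9850 × low. The sketch below just applies that arithmetic; `predict_high` is a hypothetical helper name, and the coefficients are copied by hand from the summary above rather than pulled from the fitted object.

```r
# Coefficients copied from the summary output above
intercept <- 19.9205
slope     <- 0.9850

# Hypothetical helper: predicted average high (F) for a given average low (F)
predict_high <- function(low_f) intercept + slope * low_f

# A city with an average low of 50 F
pred_50 <- predict_high(50)        # 19.9205 + 0.9850 * 50 = 69.1705

# Atlanta (first row of the data): observed high 72, low 53
pred_atlanta  <- predict_high(53)  # 19.9205 + 0.9850 * 53 = 72.1255
resid_atlanta <- 72 - pred_atlanta # observed minus predicted, about -0.13
```

A slope close to 1 means each extra degree in the average low goes with roughly one extra degree in the average high, with the intercept of about 20 °F acting as the typical gap between the two.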

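Finally, a plot of the residuals shows no obvious pattern, suggesting that there isn’t an underlying structure in the data which the linear model is missing. The residual summary above is also roughly centered on zero (median -0.187, quartiles -1.978 and 1.910). Thus, a linear model is a reasonable fit for the data.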

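#residuals vs. fitted values (residuals are always correlated with the observed response, so High_F is not used on the x-axis)
plot(model$fitted.values, model$residuals)
abline(h = 0, lty = 2)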
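A residual analysis usually looks at more than one view. The sketch below wraps two common checks, residuals against fitted values and a normal Q-Q plot, in a hypothetical helper called `residual_checks`. To stay self-contained it is demonstrated on R's built-in `cars` data, which is only a stand-in; the same call should work on `model` from above.

```r
# Hypothetical helper: basic residual checks for any fitted lm object
residual_checks <- function(fit) {
  res <- resid(fit)
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))

  # Residuals vs. fitted values: look for curvature or a funnel shape
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)

  # Normal Q-Q plot: points near the line suggest roughly normal residuals
  qqnorm(res)
  qqline(res)

  # For least squares with an intercept, residuals are uncorrelated
  # with the fitted values, so this correlation is numerically zero
  invisible(cor(fitted(fit), res))
}

# Stand-in example using R's built-in cars data; replace with `model` from above
demo_fit <- lm(dist ~ speed, data = cars)
r <- residual_checks(demo_fit)
```

Because the correlation between residuals and fitted values is zero by construction, any visible trend in the left-hand panel points to structure the straight line is missing rather than to an artifact of the plot.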

In short, the linear model is an appropriate model for this data. Of course, we could make it better by using other variables to predict the average high temperature. Perhaps latitude, the region of the city within the US (northwest, northeast, southwest, southeast), or other factors (plains, mountains) would be important. But this is a good start for predicting the average high temperature in terms of other variables.