This analysis shows that average monthly temperatures in Chicago are driven mainly by a monthly (seasonal) effect. Adding Average Wind Speed as an explanatory variable does not help: its coefficient is not significant in the linear regression, and only the monthly factors are. In the SARIMA analysis, an \(ARIMA(1,0,0)*(2,1,0)_{12}\) model was the strongest in terms of AIC, BIC, and the diagnostic plots, and its forecast for the next 5 months gives satisfactory results. This helps answer the research question: from 2000 to 2021 there is no significant year-over-year change in mean temperature in the Chicagoland area, and most of the behavior of the series is captured by a 12-month seasonal lag pattern.
Over my lifetime, it has felt like temperatures in the Chicagoland area have become less predictable. With this research project, I wanted to answer the question: have any significant changes occurred over time in Chicago, or is the variation in average temperature merely seasonal?
This data set is a time series of monthly mean temperature (TAVG) and average wind speed (AWND) from the CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US weather station, from 01-01-2000 to 11-01-2021. The data comes from the National Centers for Environmental Information.
In summary, I will conduct a regression analysis and a SARIMA analysis on this data set to see whether either method can capture the behavior of monthly average temperatures in Chicago.
STATION | NAME | DATE | AWND | TAVG |
---|---|---|---|---|
USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-01-01 | 10.7 | 23.0 |
USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-02-01 | 11.4 | 31.7 |
USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-03-01 | 9.4 | 43.0 |
USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-04-01 | 10.1 | 47.0 |
USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-05-01 | 9.6 | 61.2 |
From past knowledge, the mean temperature in Chicago should be fairly seasonal: temperatures 12 months apart should be fairly similar.
The plot of the time series shows that the data is clearly seasonal, with waves of low and high average monthly temperatures that repeat every 12 months. The series does not appear to increase or decrease over the time period, and its variance does not appear to change over time either. The main takeaway is the yearly pattern of temperatures.
I looked at the pairs plot to see the relationships among Date, Average Wind Speed, and Average Temperature and to help find a strong linear regression. Average Wind Speed shows a fairly strong negative relationship with Average Temperature: as wind speed increases, average temperature decreases.
This ACF plot of Average Temperature again shows the 12-month seasonal pattern, which I will now compare to the ACF of the series differenced at lag 12.
Here we see that most of the autocorrelation is removed by differencing at lag 12, which accounts for the seasonality.
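The effect of the lag-12 difference can be sketched in base R. This is only an illustration: the series below is synthetic (a fixed seasonal cycle plus white noise, not the NOAA data), and the names `tavg_sim` and `tavg_sdiff` are made up for the example.

```r
# Synthetic monthly series (NOT the NOAA data): seasonal cycle plus white noise
set.seed(1)
n <- 264
month_idx <- rep(1:12, length.out = n)
tavg_sim <- ts(50 - 25 * cos(2 * pi * month_idx / 12) + rnorm(n, sd = 4),
               start = c(2000, 1), frequency = 12)

# Seasonal difference at lag 12: y_t = x_t - x_{t-12}
tavg_sdiff <- diff(tavg_sim, lag = 12)

# Sample autocorrelations for lags 0..24 months (plot = FALSE returns values only)
acf_raw  <- acf(tavg_sim,   lag.max = 24, plot = FALSE)$acf[, 1, 1]
acf_diff <- acf(tavg_sdiff, lag.max = 24, plot = FALSE)$acf[, 1, 1]

# Element 1 is lag 0, so elements 2:12 are lags 1..11 months
print(round(rbind(raw = acf_raw[2:12], differenced = acf_diff[2:12]), 2))
```

In a series like this one, the raw autocorrelations swing between strongly negative and strongly positive across the year, while the differenced ones stay much closer to zero at lags 1 to 11.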
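For this data set I will compare three regression models. Model 1 uses time, Average Wind Speed, and monthly factors; Model 2 uses time and monthly factors; Model 3 uses monthly factors alone. Their summaries and Ljung-Box residual tests follow in that order.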
## Call:
## lm(formula = TAVG ~ 0 + time(DATE) + AWND + M, data = monthly_chicago_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5068 -2.3401 0.0082 2.2002 14.4513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## time(DATE) 0.004322 0.003242 1.333 0.184
## AWND -0.253093 0.300914 -0.841 0.401
## M1 25.410340 3.290648 7.722 2.82e-13 ***
## M2 27.366428 3.308598 8.271 8.15e-15 ***
## M3 39.901753 3.297674 12.100 < 2e-16 ***
## M4 51.117019 3.429999 14.903 < 2e-16 ***
## M5 61.260331 2.999019 20.427 < 2e-16 ***
## M6 70.958107 2.670714 26.569 < 2e-16 ***
## M7 74.610877 2.388385 31.239 < 2e-16 ***
## M8 73.002399 2.221986 32.855 < 2e-16 ***
## M9 66.320053 2.459838 26.961 < 2e-16 ***
## M10 53.959707 2.843130 18.979 < 2e-16 ***
## M11 41.427928 3.146922 13.165 < 2e-16 ***
## M12 29.733990 3.180628 9.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.914 on 248 degrees of freedom
## Multiple R-squared: 0.9948, Adjusted R-squared: 0.9945
## F-statistic: 3401 on 14 and 248 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TAVG ~ 0 + time(DATE) + M, data = monthly_chicago_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3384 -2.3568 0.1321 2.2856 14.2653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## time(DATE) 0.004753 0.003199 1.486 0.139
## M1 22.754745 0.926491 24.560 <2e-16 ***
## M2 24.695446 0.927889 26.615 <2e-16 ***
## M3 37.240693 0.929296 40.074 <2e-16 ***
## M4 48.340485 0.930712 51.939 <2e-16 ***
## M5 58.863004 0.932137 63.148 <2e-16 ***
## M6 68.853705 0.933570 73.753 <2e-16 ***
## M7 72.762588 0.935013 77.820 <2e-16 ***
## M8 71.307835 0.936464 76.146 <2e-16 ***
## M9 64.407627 0.937923 68.670 <2e-16 ***
## M10 51.702873 0.939392 55.039 <2e-16 ***
## M11 38.908669 0.964558 40.338 <2e-16 ***
## M12 27.181380 0.951086 28.579 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.912 on 249 degrees of freedom
## Multiple R-squared: 0.9948, Adjusted R-squared: 0.9945
## F-statistic: 3666 on 13 and 249 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TAVG ~ 0 + M, data = monthly_chicago_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9048 -2.2455 0.0205 2.4774 14.3500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## M1 23.3545 0.8359 27.94 <2e-16 ***
## M2 25.3000 0.8359 30.27 <2e-16 ***
## M3 37.8500 0.8359 45.28 <2e-16 ***
## M4 48.9545 0.8359 58.56 <2e-16 ***
## M5 59.4818 0.8359 71.16 <2e-16 ***
## M6 69.4773 0.8359 83.11 <2e-16 ***
## M7 73.3909 0.8359 87.79 <2e-16 ***
## M8 71.9409 0.8359 86.06 <2e-16 ***
## M9 65.0455 0.8359 77.81 <2e-16 ***
## M10 52.3455 0.8359 62.62 <2e-16 ***
## M11 39.5762 0.8556 46.25 <2e-16 ***
## M12 27.8048 0.8556 32.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.921 on 250 degrees of freedom
## Multiple R-squared: 0.9948, Adjusted R-squared: 0.9945
## F-statistic: 3953 on 12 and 250 DF, p-value: < 2.2e-16
##
## Ljung-Box test
##
## data: Residuals
## Q* = 26.449, df = 3, p-value = 7.681e-06
##
## Model df: 14. Total lags used: 17
##
## Ljung-Box test
##
## data: Residuals
## Q* = 26.062, df = 3, p-value = 9.258e-06
##
## Model df: 13. Total lags used: 16
##
## Ljung-Box test
##
## data: Residuals
## Q* = 28.262, df = 3, p-value = 3.2e-06
##
## Model df: 12. Total lags used: 15
Model | Model 1 | Model 2 | Model 3 |
---|---|---|---|
AIC | 3.78860663808559 | 3.7838214627611 | 3.78501613331949 |
BIC | 3.99290117074367 | 3.97449635990864 | 3.96207139495649 |
This concludes the model comparison for Part A. The regression model I have chosen is Model 3, which uses the monthly factors alone. Model 3 has the lowest BIC, although its AIC is slightly higher than Model 2's, and it has the highest F statistic. Every term in Model 3 is significant, so it contains no redundant variables, whereas the time and Average Wind Speed coefficients in Models 1 and 2 are not significant. Model 3 is therefore preferable to the models that add Average Wind Speed and time to the monthly factors. Its residuals also look the closest to normally distributed. One limitation is that the residuals of all three models still show a pattern in the ACF, with a significant peak as the lag approaches 24, and the Ljung-Box tests reject the hypothesis of uncorrelated residuals for all three models (p < 0.001).
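The three-way comparison can be sketched as follows. The data frame `sim` is synthetic (not the NOAA data), the column `t` is a hypothetical stand-in for `time(DATE)`, and the per-observation scaling `AIC/n - log(2*pi)` is an assumption on my part that happens to give values on the same scale as the table above; it may not be exactly how those numbers were produced.

```r
# Synthetic data (NOT the NOAA data): temperature depends on month only
set.seed(42)
n <- 262
month_idx <- rep(1:12, length.out = n)
sim <- data.frame(
  t    = seq_len(n),                      # stand-in for time(DATE)
  M    = factor(month_idx),               # monthly factor, levels 1..12
  AWND = 9 + 1.5 * cos(2 * pi * month_idx / 12) + rnorm(n, sd = 1)
)
sim$TAVG <- 50 - 25 * cos(2 * pi * month_idx / 12) + rnorm(n, sd = 4)

# No-intercept fits, so each month gets its own coefficient
fits <- list(
  model1 = lm(TAVG ~ 0 + t + AWND + M, data = sim),
  model2 = lm(TAVG ~ 0 + t + M, data = sim),
  model3 = lm(TAVG ~ 0 + M, data = sim)
)

# Information criteria on a per-observation scale (assumed scaling)
ic <- sapply(fits, function(f) {
  c(AIC = AIC(f) / n - log(2 * pi),
    BIC = BIC(f) / n - log(2 * pi))
})
print(round(ic, 4))
```

Because BIC penalizes each extra coefficient more heavily than AIC at this sample size, it tends to favor the smaller model when the extra terms add little, which matches the choice of Model 3 above.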
From this analysis, the time series looks the most stationary after one regular (lag-1) difference and one seasonal difference at lag 12. We can also read off which AR and MA terms best describe the series. For the non-seasonal part, the PACF tails off and the ACF has one main significant peak, which suggests p = 0, d = 1, and q = 1. For the seasonal part, the PACF has two significant seasonal peaks and the ACF tails off, which suggests P = 2, D = 1, and Q = 0. I will compare this model to one built from the lag-12 difference alone, which yields p = 1, d = 0, q = 0, P = 2, D = 1, Q = 0, S = 12.
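A minimal sketch of fitting and forecasting the two candidate models with base R's `arima()` is given here. It is not the exact call behind the output in this report: the series is synthetic (not the NOAA data), all parameter values are illustrative, `method = "ML"` is my choice for the example, and base `arima()` drops the constant term whenever the series is differenced.

```r
# Synthetic series (NOT the NOAA data), built so that its lag-12 difference
# follows AR(1) x seasonal AR(2); parameter values are illustrative only
set.seed(7)
phi <- 0.3
Phi <- c(-0.45, -0.35)
# Expand (1 - phi B)(1 - Phi1 B^12 - Phi2 B^24) into a single AR polynomial
ar_full <- numeric(25)
ar_full[c(1, 12, 13, 24, 25)] <- c(phi, Phi[1], -phi * Phi[1],
                                   Phi[2], -phi * Phi[2])
y <- arima.sim(list(ar = ar_full), n = 250, sd = 5)

# Undo the lag-12 difference, starting from an assumed first-year profile
first_year <- 50 - 25 * cos(2 * pi * (1:12) / 12)
x <- ts(diffinv(as.numeric(y), lag = 12, xi = first_year),
        start = c(2000, 1), frequency = 12)

# Candidate A: ARIMA(0,1,1)x(2,1,0)_12; candidate B: ARIMA(1,0,0)x(2,1,0)_12
fit_a <- arima(x, order = c(0, 1, 1),
               seasonal = list(order = c(2, 1, 0), period = 12), method = "ML")
fit_b <- arima(x, order = c(1, 0, 0),
               seasonal = list(order = c(2, 1, 0), period = 12), method = "ML")
print(fit_a)
print(fit_b)

# Five-step-ahead forecasts from candidate B with approximate 95% intervals
fc <- predict(fit_b, n.ahead = 5)
print(cbind(forecast = fc$pred,
            lower = fc$pred - 1.96 * fc$se,
            upper = fc$pred + 1.96 * fc$se))
```

The `$pred` and `$se` components returned by `predict()` correspond to the prediction and standard-error blocks shown in the output that follows.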
## initial value 1.971483
## iter 2 value 1.703163
## iter 3 value 1.678613
## iter 4 value 1.660890
## iter 5 value 1.660214
## iter 6 value 1.660062
## iter 7 value 1.660059
## iter 7 value 1.660059
## iter 7 value 1.660059
## final value 1.660059
## converged
## initial value 1.677324
## iter 2 value 1.677099
## iter 3 value 1.676970
## iter 4 value 1.676965
## iter 5 value 1.676964
## iter 5 value 1.676964
## iter 5 value 1.676964
## final value 1.676964
## converged
## $fit
##
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S),
## include.mean = !no.constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## ma1 sar1 sar2
## -0.7732 -0.4504 -0.3404
## s.e. 0.1018 0.0622 0.0635
##
## sigma^2 estimated as 28.01: log likelihood = -770.88, aic = 1549.76
##
## $degrees_of_freedom
## [1] 246
##
## $ttable
## Estimate SE t.value p.value
## ma1 -0.7732 0.1018 -7.5917 0
## sar1 -0.4504 0.0622 -7.2406 0
## sar2 -0.3404 0.0635 -5.3600 0
##
## $AIC
## [1] 5.960614
##
## $AICc
## [1] 5.960974
##
## $BIC
## [1] 6.014728
## $pred
## Time Series:
## Start = 263
## End = 267
## Frequency = 1
## [1] 32.72198 26.17481 22.12114 41.19856 51.62507
##
## $se
## Time Series:
## Start = 263
## End = 267
## Frequency = 1
## [1] 5.292172 5.426581 5.557740 5.685875 5.811184
## initial value 1.753013
## iter 2 value 1.617216
## iter 3 value 1.596653
## iter 4 value 1.594603
## iter 5 value 1.594597
## iter 6 value 1.594595
## iter 7 value 1.594595
## iter 7 value 1.594595
## iter 7 value 1.594595
## final value 1.594595
## converged
## initial value 1.609779
## iter 2 value 1.609763
## iter 3 value 1.609759
## iter 4 value 1.609759
## iter 4 value 1.609759
## iter 4 value 1.609759
## final value 1.609759
## converged
## $fit
##
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S),
## xreg = constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.3241 -0.4784 -0.3671 0.0071
## s.e. 0.0607 0.0622 0.0614 0.0216
##
## sigma^2 estimated as 24.51: log likelihood = -757.17, aic = 1524.35
##
## $degrees_of_freedom
## [1] 246
##
## $ttable
## Estimate SE t.value p.value
## ar1 0.3241 0.0607 5.3392 0.0000
## sar1 -0.4784 0.0622 -7.6924 0.0000
## sar2 -0.3671 0.0614 -5.9806 0.0000
## constant 0.0071 0.0216 0.3306 0.7412
##
## $AIC
## [1] 5.840416
##
## $AICc
## [1] 5.841015
##
## $BIC
## [1] 5.907877
## $pred
## Time Series:
## Start = 263
## End = 267
## Frequency = 1
## [1] 30.77098 24.24195 20.69853 39.35133 49.95956
##
## $se
## Time Series:
## Start = 263
## End = 267
## Frequency = 1
## [1] 4.950325 5.203849 5.229769 5.232484 5.232769
Comparing the two models, \(ARIMA(0,1,1)*(2,1,0)_{12}\) and \(ARIMA(1,0,0)*(2,1,0)_{12}\), the second is the more attractive. The summary tables show that the AR and MA coefficients of both models are significant; only the constant in the second model is not (p = 0.7412). Model 2 has a lower AIC (5.840 vs. 5.961), AICc (5.841 vs. 5.961), and BIC (5.908 vs. 6.015). The residuals of the two models look fairly similar, but Model 2 has more non-significant p-values for the Ljung-Box statistic. Therefore Model 2 (\(ARIMA(1,0,0)*(2,1,0)_{12}\)) is the preferred model from this analysis.
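The Ljung-Box check on the residuals can be reproduced numerically with base R's `Box.test()`. The sketch below uses a synthetic series (not the NOAA data), and the set of lags is an arbitrary choice for illustration; `fitdf` removes one degree of freedom per estimated ARMA coefficient.

```r
# Synthetic series (NOT the NOAA data): seasonal cycle plus AR(1) noise
set.seed(11)
n <- 262
month_idx <- rep(1:12, length.out = n)
x <- ts(50 - 25 * cos(2 * pi * month_idx / 12) +
          as.numeric(arima.sim(list(ar = 0.3), n = n, sd = 4)),
        start = c(2000, 1), frequency = 12)

fit <- arima(x, order = c(1, 0, 0),
             seasonal = list(order = c(2, 1, 0), period = 12), method = "ML")

# Ljung-Box p-values at several lags (lags chosen arbitrarily for illustration)
k <- length(coef(fit))          # number of estimated ARMA coefficients (3 here)
lags <- c(6, 12, 18, 24)
p_values <- sapply(lags, function(h) {
  Box.test(residuals(fit), lag = h, type = "Ljung-Box", fitdf = k)$p.value
})
print(data.frame(lag = lags, p_value = signif(p_values, 3)))
```

Large p-values mean the test finds no evidence of leftover autocorrelation up to that lag, which is the sense in which more non-significant p-values count in a model's favor.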
# Results
From my analysis, the best regression model for this data set is Model 3 (incorporating only monthly factors as predictors):
\(y_t = \alpha_1 M_1(t) + \cdots + \alpha_{12} M_{12}(t) + w_t\)
This result implies that, within this data set, the best way to predict average temperatures in Chicago is to look at the monthly effects alone rather than adding average wind speed or a time trend. There are surely other variables outside this data set that would help explain temperature, but within the confines of this project this is what we can conclude. The result may not be very practical in reality.
From the SARIMA analysis, the best model for average monthly temperature in Chicago from 2000 to 2021 is \(ARIMA(1,0,0)*(2,1,0)_{12}\). This result was somewhat unexpected. I initially thought that a model with a regular difference (d = 1) would explain the data better, but a seasonal difference combined with both seasonal and non-seasonal AR terms captures the behavior of the series much better. The forecasts from the two models do not differ greatly, but this model performed better overall than the other model presented in the statistical methods section.
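For reference, the preferred model can be written out in backshift notation. This is a sketch that assumes the usual R `arima()` sign convention for AR coefficients, plugs in the rounded estimates from the fit above (ar1 = 0.3241, sar1 = -0.4784, sar2 = -0.3671), and leaves out the non-significant constant. Here \(B\) is the backshift operator, \(B x_t = x_{t-1}\), and \(w_t\) is white noise:

\((1 - 0.3241B)(1 + 0.4784B^{12} + 0.3671B^{24})(1 - B^{12})x_t = w_t\)

Read this way, each month's change from the same month a year earlier depends mainly on the previous month's change and on the corresponding changes one and two years back.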
Overall, monthly factors captured most of the structure in the Chicago monthly average temperature and average wind speed data set from 2000 to 2021; this was seen in both the regression analysis and the SARIMA analysis. One limitation of this study is the small scope of the data. Although the series has 262 monthly observations, a longer time period might reveal a long-term trend of increasing average temperatures and help answer my question over a longer horizon. Another limitation is that average wind speed may carry little information about temperature beyond what the month already explains: there are windy days that are warm and calm days that are cold in Chicago, and that could have affected the regression analysis. Overall, I think this project was very informative in terms of how to build a regression and a SARIMA model for a data set that is not just a homework assignment.