Abstract

This analysis shows that average monthly temperature in Chicago is driven mainly by a monthly (seasonal) effect. Adding average wind speed as an explanatory variable does not help: in the linear regression its coefficient is not significant, and only the monthly factors are. In the SARIMA analysis, an \(ARIMA(1,0,0)\times(2,1,0)_{12}\) model was the strongest in terms of AIC, BIC, and the diagnostic plots, and its forecast for the next 5 months gives satisfactory results. This answers the motivating question: from 2000 to 2021 there is no statistically significant year-over-year change in mean average temperature in the Chicagoland area, and most of the behavior can be captured by a 12-month seasonal lag pattern.

Introduction

Motivation

Over my lifetime, it has felt as though temperatures in the Chicagoland area have become less predictable. With this research project, I wanted to answer the question: have any significant changes occurred over time in Chicago, or is the average temperature merely seasonal?

Data Content & Acknowledgements

This data set is a time series of monthly mean average temperature (TAVG) and average wind speed (AWND) from the CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US weather station, from 01-01-2000 to 11-01-2021. The data come from the National Centers for Environmental Information.

To be addressed

In summary, I will conduct a regression analysis and a SARIMA analysis on this data set to see whether these two methods can capture the behavior of monthly average temperatures in Chicago.

Preliminary Look at the data

| STATION | NAME | DATE | AWND | TAVG |
| --- | --- | --- | --- | --- |
| USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-01-01 | 10.7 | 23.0 |
| USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-02-01 | 11.4 | 31.7 |
| USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-03-01 | 9.4 | 43.0 |
| USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-04-01 | 10.1 | 47.0 |
| USW00094892 | CHICAGO WEST CHICAGO DUPAGE AIRPORT, IL US | 2000-05-01 | 9.6 | 61.2 |

Statistical Methods

Part A. Regression analysis:

Preliminary analysis

From prior knowledge, the mean temperature in Chicago should be strongly seasonal: temperatures 12 months apart should be fairly similar.

Plot of Time series

From the plot of the time series, we can see that the data are strongly seasonal: there are clear waves of low and high average monthly temperatures that repeat every 12 months. The series does not appear to be increasing or decreasing over the period, and its variance does not appear to change over time either. The main takeaway is the yearly pattern of temperatures.

Pairs Plot

I looked at the pairs plot to see the relationships among date, average wind speed, and average temperature, as a guide to building the linear regression. Average wind speed shows a fairly strong negative relationship with average temperature: as wind speed increases, average temperature decreases.

ACF Plot

This ACF plot of average temperature again shows the 12-month seasonal pattern, which I will now compare to the ACF of the series differenced at a seasonal lag of 12.

Here we see that most of the autocorrelation has been removed by differencing at lag 12, which accounts for the seasonality.
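The comparison above can be sketched in a few lines. The output in this report comes from R; this is a standalone Python illustration on a synthetic series, not the Chicago data, and the names `seasonal_difference`, `sample_acf`, and `toy` are hypothetical. It shows a seasonal series with strong autocorrelation at lags 6 and 12, most of which disappears once the series is differenced at lag 12.

```python
import math
import random


def seasonal_difference(series, lag=12):
    """Return x[t] - x[t - lag] for every t that has a value `lag` steps back."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag, using the divisor n at every lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    c0 = sum(v * v for v in centered) / n
    acf = []
    for k in range(max_lag + 1):
        ck = sum(centered[t] * centered[t - k] for t in range(k, n)) / n
        acf.append(ck / c0)
    return acf


# Synthetic stand-in for a monthly temperature series (NOT the Chicago data):
# a 12-month sine wave plus Gaussian noise, 22 years long.
rng = random.Random(0)
toy = [50 + 25 * math.sin(2 * math.pi * (t - 3) / 12) + rng.gauss(0, 4)
       for t in range(264)]

raw_acf = sample_acf(toy, 24)
diff_acf = sample_acf(seasonal_difference(toy, lag=12), 24)

print("raw series, lags 6 and 12:", round(raw_acf[6], 2), round(raw_acf[12], 2))
print("differenced series, lag 6:", round(diff_acf[6], 2))
```

On the real data, the same two calls would take the TAVG column in place of `toy`.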

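For this data set I will compare three regression models, each fit without an intercept. Model 1 uses time, average wind speed (AWND), and a monthly factor (M); Model 2 uses time and the monthly factor; Model 3 uses the monthly factor alone.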

Model comparison and analysis

## 
## Call:
## lm(formula = TAVG ~ 0 + time(DATE) + AWND + M, data = monthly_chicago_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5068  -2.3401   0.0082   2.2002  14.4513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## time(DATE)  0.004322   0.003242   1.333    0.184    
## AWND       -0.253093   0.300914  -0.841    0.401    
## M1         25.410340   3.290648   7.722 2.82e-13 ***
## M2         27.366428   3.308598   8.271 8.15e-15 ***
## M3         39.901753   3.297674  12.100  < 2e-16 ***
## M4         51.117019   3.429999  14.903  < 2e-16 ***
## M5         61.260331   2.999019  20.427  < 2e-16 ***
## M6         70.958107   2.670714  26.569  < 2e-16 ***
## M7         74.610877   2.388385  31.239  < 2e-16 ***
## M8         73.002399   2.221986  32.855  < 2e-16 ***
## M9         66.320053   2.459838  26.961  < 2e-16 ***
## M10        53.959707   2.843130  18.979  < 2e-16 ***
## M11        41.427928   3.146922  13.165  < 2e-16 ***
## M12        29.733990   3.180628   9.348  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.914 on 248 degrees of freedom
## Multiple R-squared:  0.9948, Adjusted R-squared:  0.9945 
## F-statistic:  3401 on 14 and 248 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TAVG ~ 0 + time(DATE) + M, data = monthly_chicago_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3384  -2.3568   0.1321   2.2856  14.2653 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## time(DATE)  0.004753   0.003199   1.486    0.139    
## M1         22.754745   0.926491  24.560   <2e-16 ***
## M2         24.695446   0.927889  26.615   <2e-16 ***
## M3         37.240693   0.929296  40.074   <2e-16 ***
## M4         48.340485   0.930712  51.939   <2e-16 ***
## M5         58.863004   0.932137  63.148   <2e-16 ***
## M6         68.853705   0.933570  73.753   <2e-16 ***
## M7         72.762588   0.935013  77.820   <2e-16 ***
## M8         71.307835   0.936464  76.146   <2e-16 ***
## M9         64.407627   0.937923  68.670   <2e-16 ***
## M10        51.702873   0.939392  55.039   <2e-16 ***
## M11        38.908669   0.964558  40.338   <2e-16 ***
## M12        27.181380   0.951086  28.579   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.912 on 249 degrees of freedom
## Multiple R-squared:  0.9948, Adjusted R-squared:  0.9945 
## F-statistic:  3666 on 13 and 249 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TAVG ~ 0 + M, data = monthly_chicago_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9048  -2.2455   0.0205   2.4774  14.3500 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## M1   23.3545     0.8359   27.94   <2e-16 ***
## M2   25.3000     0.8359   30.27   <2e-16 ***
## M3   37.8500     0.8359   45.28   <2e-16 ***
## M4   48.9545     0.8359   58.56   <2e-16 ***
## M5   59.4818     0.8359   71.16   <2e-16 ***
## M6   69.4773     0.8359   83.11   <2e-16 ***
## M7   73.3909     0.8359   87.79   <2e-16 ***
## M8   71.9409     0.8359   86.06   <2e-16 ***
## M9   65.0455     0.8359   77.81   <2e-16 ***
## M10  52.3455     0.8359   62.62   <2e-16 ***
## M11  39.5762     0.8556   46.25   <2e-16 ***
## M12  27.8048     0.8556   32.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.921 on 250 degrees of freedom
## Multiple R-squared:  0.9948, Adjusted R-squared:  0.9945 
## F-statistic:  3953 on 12 and 250 DF,  p-value: < 2.2e-16

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 26.449, df = 3, p-value = 7.681e-06
## 
## Model df: 14.   Total lags used: 17

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 26.062, df = 3, p-value = 9.258e-06
## 
## Model df: 13.   Total lags used: 16

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 28.262, df = 3, p-value = 3.2e-06
## 
## Model df: 12.   Total lags used: 15

| Criterion | Model 1 | Model 2 | Model 3 |
| --- | --- | --- | --- |
| AIC | 3.78860663808559 | 3.7838214627611 | 3.78501613331949 |
| BIC | 3.99290117074367 | 3.97449635990864 | 3.96207139495649 |
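The AIC and BIC values above are on a per-observation scale. As a sketch, they can be approximately reproduced from the residual standard errors and degrees of freedom in the regression summaries. This assumes the criteria were computed as AIC/n - log(2*pi) and BIC/n - log(2*pi), counting the error variance as one extra parameter; that form is an assumption, not something the report states. The function name `normalized_ic` is hypothetical, and the example is in Python rather than the R used for the output.

```python
import math


def normalized_ic(resid_se, resid_df, n_coef):
    """Per-observation AIC and BIC for a Gaussian linear model.

    Assumed form: IC/n - log(2*pi), with n_coef + 1 parameters
    (the regression coefficients plus the error variance).
    """
    n = resid_df + n_coef            # number of observations used in the fit
    sse = resid_se ** 2 * resid_df   # residual sum of squares
    base = 1 + math.log(sse / n)
    k = n_coef + 1
    aic = base + 2 * k / n
    bic = base + k * math.log(n) / n
    return aic, bic


# (residual standard error, residual df, number of coefficients),
# copied from the three lm summaries.
models = {
    "Model 1": (3.914, 248, 14),
    "Model 2": (3.912, 249, 13),
    "Model 3": (3.921, 250, 12),
}

for name, args in models.items():
    aic, bic = normalized_ic(*args)
    print(name, "AIC", round(aic, 3), "BIC", round(bic, 3))
```

Because the residual standard errors are rounded to three decimals in the summaries, the results agree with the table only to about three decimal places.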

This concludes the model comparison for Part A. The regression model I have chosen is Model 3, the model that accounts for monthly factors alone. Model 3 has the lowest BIC, although its AIC is slightly higher than that of Model 2, and it has the highest F statistic. Every factor in this model is significant, so it carries no redundant variables, whereas the time and average wind speed coefficients in Models 1 and 2 are not significant. Model 3 is therefore a better choice than the models that add average wind speed or time to the monthly factor. Its residuals also look the closest to normally distributed. One limitation is that all three models leave a pattern in the residual ACF, with a significant peak as the lag approaches 24; the Ljung-Box tests agree, rejecting uncorrelated residuals for all three models (p < 0.001).
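For readers who want to see what the Ljung-Box output above measures, here is a minimal Python sketch (the report itself uses R). `ljung_box_q` computes the statistic Q = n(n+2) * sum of r_k^2/(n-k) over the first h lags, and `chi2_sf` gives the upper-tail chi-square probability for whole-number degrees of freedom. Both function names are hypothetical, and the degrees of freedom are taken as reported in the output (df = 3).

```python
import math


def ljung_box_q(resid, max_lag):
    """Ljung-Box statistic: n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(resid)
    mean = sum(resid) / n
    centered = [r - mean for r in resid]
    c0 = sum(v * v for v in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / c0
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    half = x / 2.0
    if df % 2 == 1:
        sf, a = math.erfc(math.sqrt(half)), 0.5   # df = 1
    else:
        sf, a = math.exp(-half), 1.0              # df = 2
    # Step the shape parameter up by 1 (df up by 2) until it matches df.
    while 2 * a < df:
        sf += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1
    return sf


# Q* values and df copied from the three Ljung-Box outputs in this section.
for q_star in (26.449, 26.062, 28.262):
    print(q_star, chi2_sf(q_star, df=3))
```

The printed p-values should be close to the reported 7.681e-06, 9.258e-06, and 3.2e-06, all far below 0.05.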

Part B. SARIMA Model

From these plots, the time series looks most stationary after a first difference (lag 1) combined with a seasonal difference at lag 12. The ACF and PACF of the differenced series then suggest which AR and MA terms to include. For the non-seasonal part, the PACF tails off and the ACF has one main significant peak, which points to p = 0, d = 1, and q = 1. For the seasonal part, the PACF has two significant seasonal peaks and the ACF tails off, which points to P = 2, D = 1, and Q = 0. I will compare this model to one built from the series differenced at lag 12 only, which yields a p = 1, d = 0, q = 0, P = 2, D = 1, Q = 0, S = 12 model.

## initial  value 1.971483 
## iter   2 value 1.703163
## iter   3 value 1.678613
## iter   4 value 1.660890
## iter   5 value 1.660214
## iter   6 value 1.660062
## iter   7 value 1.660059
## iter   7 value 1.660059
## iter   7 value 1.660059
## final  value 1.660059 
## converged
## initial  value 1.677324 
## iter   2 value 1.677099
## iter   3 value 1.676970
## iter   4 value 1.676965
## iter   5 value 1.676964
## iter   5 value 1.676964
## iter   5 value 1.676964
## final  value 1.676964 
## converged

## $fit
## 
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S), 
##     include.mean = !no.constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc, 
##         REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ma1     sar1     sar2
##       -0.7732  -0.4504  -0.3404
## s.e.   0.1018   0.0622   0.0635
## 
## sigma^2 estimated as 28.01:  log likelihood = -770.88,  aic = 1549.76
## 
## $degrees_of_freedom
## [1] 246
## 
## $ttable
##      Estimate     SE t.value p.value
## ma1   -0.7732 0.1018 -7.5917       0
## sar1  -0.4504 0.0622 -7.2406       0
## sar2  -0.3404 0.0635 -5.3600       0
## 
## $AIC
## [1] 5.960614
## 
## $AICc
## [1] 5.960974
## 
## $BIC
## [1] 6.014728

## $pred
## Time Series:
## Start = 263 
## End = 267 
## Frequency = 1 
## [1] 32.72198 26.17481 22.12114 41.19856 51.62507
## 
## $se
## Time Series:
## Start = 263 
## End = 267 
## Frequency = 1 
## [1] 5.292172 5.426581 5.557740 5.685875 5.811184
## initial  value 1.753013 
## iter   2 value 1.617216
## iter   3 value 1.596653
## iter   4 value 1.594603
## iter   5 value 1.594597
## iter   6 value 1.594595
## iter   7 value 1.594595
## iter   7 value 1.594595
## iter   7 value 1.594595
## final  value 1.594595 
## converged
## initial  value 1.609779 
## iter   2 value 1.609763
## iter   3 value 1.609759
## iter   4 value 1.609759
## iter   4 value 1.609759
## iter   4 value 1.609759
## final  value 1.609759 
## converged

## $fit
## 
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S), 
##     xreg = constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc, 
##         REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1     sar1     sar2  constant
##       0.3241  -0.4784  -0.3671    0.0071
## s.e.  0.0607   0.0622   0.0614    0.0216
## 
## sigma^2 estimated as 24.51:  log likelihood = -757.17,  aic = 1524.35
## 
## $degrees_of_freedom
## [1] 246
## 
## $ttable
##          Estimate     SE t.value p.value
## ar1        0.3241 0.0607  5.3392  0.0000
## sar1      -0.4784 0.0622 -7.6924  0.0000
## sar2      -0.3671 0.0614 -5.9806  0.0000
## constant   0.0071 0.0216  0.3306  0.7412
## 
## $AIC
## [1] 5.840416
## 
## $AICc
## [1] 5.841015
## 
## $BIC
## [1] 5.907877

## $pred
## Time Series:
## Start = 263 
## End = 267 
## Frequency = 1 
## [1] 30.77098 24.24195 20.69853 39.35133 49.95956
## 
## $se
## Time Series:
## Start = 263 
## End = 267 
## Frequency = 1 
## [1] 4.950325 5.203849 5.229769 5.232484 5.232769

Comparing the two SARIMA models, \(ARIMA(0,1,1)\times(2,1,0)_{12}\) and \(ARIMA(1,0,0)\times(2,1,0)_{12}\), the second is the more attractive. From the summary tables, the AR, MA, and seasonal AR coefficients of both models are significant, although the constant in the second model is not (p = 0.74). The second model has a lower AIC, AICc, and BIC (5.840, 5.841, and 5.908, versus 5.961, 5.961, and 6.015 for the first). The residuals of the two models look fairly similar, but the second model has more non-significant p-values for the Ljung-Box statistic. Therefore the second model, \(ARIMA(1,0,0)\times(2,1,0)_{12}\), is preferred in this analysis.
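For reference, the two fitted candidates can be written out in backshift notation, where \(Bx_t = x_{t-1}\). This is a sketch that assumes the sign convention of R's arima() (AR polynomials written as \(1 - \phi B\), MA polynomials as \(1 + \theta B\)) and that leaves out the small, non-significant constant of the second model. The first line is the model with both a first and a seasonal difference; the second is the model with the seasonal difference only.

\[
(1 + 0.4504B^{12} + 0.3404B^{24})(1 - B)(1 - B^{12})\,x_t = (1 - 0.7732B)\,w_t
\]

\[
(1 - 0.3241B)(1 + 0.4784B^{12} + 0.3671B^{24})(1 - B^{12})\,x_t = w_t
\]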

Results

Part A.

From my analysis, the best regression model for this data set is Model 3 (incorporating only monthly factors as predictors):

\(y_t = \alpha_1M_1(t) + \dots + \alpha_{12}M_{12}(t) + w_t\)

The implication is that the best way to predict average temperature in Chicago is to look mainly at monthly effects rather than at average wind speed or a time trend. There are surely other variables outside this data set that would help explain the data, but within the confines of this project this is what we can conclude. The result may not be very practical in reality.
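One way to read the Model 3 equation: with a dummy variable for each month and no intercept, each least-squares coefficient is simply the mean temperature of that month, and its standard error is the residual standard error divided by the square root of the number of observations in that month. The Python sketch below illustrates this (the report's output comes from R). The names `monthly_means` and `coef_se` are hypothetical, and the two 2001 temperatures are made up for the example.

```python
import math
from collections import defaultdict


def monthly_means(dates, temps):
    """Coefficients of a no-intercept regression on month dummies (one mean per month)."""
    by_month = defaultdict(list)
    for date, temp in zip(dates, temps):
        month = int(date[5:7])  # expects ISO dates such as "2000-01-01"
        by_month[month].append(temp)
    return {m: sum(v) / len(v) for m, v in sorted(by_month.items())}


def coef_se(resid_se, n_in_month):
    """Standard error of one monthly coefficient."""
    return resid_se / math.sqrt(n_in_month)


# The two 2000 values are from the data preview; the 2001 values are invented.
dates = ["2000-01-01", "2001-01-01", "2000-02-01", "2001-02-01"]
temps = [23.0, 25.0, 31.7, 30.5]
print(monthly_means(dates, temps))

# Residual standard error 3.921 is taken from the Model 3 summary.
print(round(coef_se(3.921, 22), 4), round(coef_se(3.921, 21), 4))
```

The two standard errors printed at the end are consistent with the 0.8359 and 0.8556 in the Model 3 summary, which suggests 22 observations for months 1 to 10 and 21 for months 11 and 12.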

Part B.

From the SARIMA analysis, the best model for explaining average monthly temperature in Chicago from 2000 to 2021 is \(ARIMA(1,0,0)\times(2,1,0)_{12}\). This result was a little unexpected. I initially thought that a first difference in the model would explain the data better, but a seasonal difference together with seasonal and non-seasonal AR terms captures the overall behavior of the data much better. The forecasts from the two models did not differ greatly, but this model performed better overall than the other model presented in the Statistical Methods section.
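To give the 5-month forecast a sense of scale, the point forecasts and standard errors printed for the chosen model can be turned into approximate 95% intervals. This Python sketch assumes normally distributed forecast errors, so it uses the usual multiplier 1.96; that assumption and the name `forecast_intervals` are not part of the original analysis.

```python
def forecast_intervals(pred, se, z=1.96):
    """Approximate intervals pred +/- z * se (normal forecast errors assumed)."""
    return [(p - z * s, p + z * s) for p, s in zip(pred, se)]


# $pred and $se copied from the ARIMA(1,0,0)x(2,1,0)_12 forecast output.
pred = [30.77098, 24.24195, 20.69853, 39.35133, 49.95956]
se = [4.950325, 5.203849, 5.229769, 5.232484, 5.232769]

for step, (low, high) in enumerate(forecast_intervals(pred, se), start=1):
    print(f"{step} month(s) ahead: {low:.1f} to {high:.1f}")
```

Each interval is roughly 20 degrees wide, so the month-to-month seasonal swing is much larger than the forecast uncertainty.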

Discussion

Overall, monthly factors captured most of the seasonality in the Chicago monthly average temperature and average wind speed data set from 2000 to 2021; this was seen in both the regression analysis and the SARIMA analysis. One limitation of this study is the small scope of the data. Although the data set covers more than 260 months between 2000 and 2021, a longer time period might reveal a long-term trend of increasing average temperatures and help answer my question over a longer horizon. Another limitation is that average wind speed may be only weakly related to temperature: there are windy days that are warm and calm days that are cold in Chicago, which could have affected the regression analysis. Overall, I think this project was very informative about how to build a regression and a SARIMA model for a data set that is not just a homework assignment.