Question 1a: Exploratory Data Analysis

Plot the Time Series and ACF plots. Comment on the main features, and identify what (if any) assumptions of stationarity are violated. Hint: Before plotting, can you infer anything from the nature of the data?

Inference

Given temperature data I would assume 1) seasonality and 2) small trend higher due to the assumption of global warming. I think the seasonal effects are stronger than the trend effect. Thus, best to fit a seasonal model.

Main Features

It appears that the monthly temperature is slowly rising over time. Additionally, the lows in temperature seemed to occur more often in the 1950’s. (Higher lows and higher highs as times passes) The highs seem to occur more frequently now than in the 1950’s. There does not appear to be any sharp changes in behavior or any outlying observations.

par(mfrow=c(1,1))
plot(data,type="l")

acf(data,lag=12*4,main="")

What assumptions of stationarity are violated?

Stationary Assumptions

Staionary Assumptions = The time series statistical properties: mean, variance and autocorrelation do not change over time.

The ACH chart above shows that autocorrelation does vary over time. There clearly is a seasonal pattern. The correlations are beyond the blue lines, meaning they are significantly different from zero.

The plots below show temperature by year, which highlight the slight trend higher. The box plot show temperature by month. This box plot highlights the seasonality of the data.

#This will aggregate the cycles and display a year on year trend
plot(aggregate(data,FUN=mean),xlab="Year")

Box plot across months will give us a sense of seasonal effects.


```r
 boxplot(data~cycle(data),xlab="Month",ylab="Temp")

The decompose plots below show a trend with an increasing mean over time.

plot(decompose(data))

While it is safe to say the series is not stationary, we can use the Augmented Dickey-Fuller test to check is the data set is stationary. The p-value of .09 is higher than the significance level(.05) means we accept the null hypothesis that the time series is non stationary.

tseries::adf.test(data,k=48)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data
## Dickey-Fuller = -3.178, Lag order = 48, p-value = 0.09172
## alternative hypothesis: stationary

Question 1b: Trend Estimation

Fit the following trend estimation models:

Moving average Parametric quadratic polynomial Local Polynomial Splines

The following uses parametric regression (green) to determine the trend. This method shows an increasing trend as well. The predictors are not statistically signifcant.

## parametric regression

x1=time.pts
x2=time.pts^2

lm.fit=lm(temp~x1+x2)
summary(lm.fit)
## 
## Call:
## lm(formula = temp ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9226  -4.4207  -0.2526   4.6385  13.4337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.7226     0.5489 108.809   <2e-16 ***
## x1            3.1639     2.5384   1.246    0.213    
## x2            1.4030     2.4606   0.570    0.569    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.277 on 825 degrees of freedom
## Multiple R-squared:  0.05925,    Adjusted R-squared:  0.05697 
## F-statistic: 25.98 on 2 and 825 DF,  p-value: 1.143e-11
## Is there a trend? 
temp.fit.lm= ts(fitted(lm.fit),start=1950,frequency=12)
ts.plot(temp,ylab="Temperature")
lines(temp.fit.lm,lwd=2,col="green")
abline(temp.fit.lm[1],0,lwd=2,col="blue")

The trend fitted using local polynomial regression (brown) and splines (yellow) are higher as well.

## Local Polynomial Trend Estimation
loc.fit= loess(temp~time.pts)
temp.fit.loc= ts(fitted(loc.fit),start=1950,frequency=12)
## Splines Trend Estimation
library(mgcv)
gam.fit= gam(temp~s(time.pts))
temp.fit.gam= ts(fitted(gam.fit),start=1950,frequency=12)
## Is there a trend? 
ts.plot(temp,ylab="Temperature")
lines(temp.fit.loc,lwd=2,col="yellow")
lines(temp.fit.gam,lwd=2,col="red")
abline(temp.fit.loc[1],0,lwd=2,col="blue")

Seasonality using Seasonal Means Model (Avg temp) without January and with January

## Estimate seasonality using Seasonal Means Model (Avg temp)
## Drop January/with intercept --avg tem per month 
model1<-dynlm(temp~season(temp))
summary(model1)
## 
## Time series regression with "ts" data:
## Start = 1950(1), End = 2018(12)
## 
## Call:
## dynlm(formula = temp ~ season(temp))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7145 -1.5855 -0.0319  1.6000  8.6768 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      55.5145     0.2864 193.867  < 2e-16 ***
## season(temp)Feb   0.7420     0.4050   1.832 0.067266 .  
## season(temp)Mar   1.5333     0.4050   3.786 0.000164 ***
## season(temp)Apr   3.6188     0.4050   8.936  < 2e-16 ***
## season(temp)May   6.0797     0.4050  15.013  < 2e-16 ***
## season(temp)Jun   9.1014     0.4050  22.475  < 2e-16 ***
## season(temp)Jul  12.4420     0.4050  30.724  < 2e-16 ***
## season(temp)Aug  13.4681     0.4050  33.258  < 2e-16 ***
## season(temp)Sep  12.7855     0.4050  31.572  < 2e-16 ***
## season(temp)Oct   9.8087     0.4050  24.221  < 2e-16 ***
## season(temp)Nov   4.9609     0.4050  12.250  < 2e-16 ***
## season(temp)Dec   0.5188     0.4050   1.281 0.200487    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.379 on 816 degrees of freedom
## Multiple R-squared:  0.811,  Adjusted R-squared:  0.8084 
## F-statistic: 318.3 on 11 and 816 DF,  p-value: < 2.2e-16
## Seasonal mean effects/without intercept --avg tem per month
model2<-dynlm(temp~ season(temp)- 1)
summary(model2)
## 
## Time series regression with "ts" data:
## Start = 1950(1), End = 2018(12)
## 
## Call:
## dynlm(formula = temp ~ season(temp) - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7145 -1.5855 -0.0319  1.6000  8.6768 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## season(temp)Jan  55.5145     0.2864   193.9   <2e-16 ***
## season(temp)Feb  56.2565     0.2864   196.5   <2e-16 ***
## season(temp)Mar  57.0478     0.2864   199.2   <2e-16 ***
## season(temp)Apr  59.1333     0.2864   206.5   <2e-16 ***
## season(temp)May  61.5942     0.2864   215.1   <2e-16 ***
## season(temp)Jun  64.6159     0.2864   225.7   <2e-16 ***
## season(temp)Jul  67.9565     0.2864   237.3   <2e-16 ***
## season(temp)Aug  68.9826     0.2864   240.9   <2e-16 ***
## season(temp)Sep  68.3000     0.2864   238.5   <2e-16 ***
## season(temp)Oct  65.3232     0.2864   228.1   <2e-16 ***
## season(temp)Nov  60.4754     0.2864   211.2   <2e-16 ***
## season(temp)Dec  56.0333     0.2864   195.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.379 on 816 degrees of freedom
## Multiple R-squared:  0.9985, Adjusted R-squared:  0.9985 
## F-statistic: 4.682e+04 on 12 and 816 DF,  p-value: < 2.2e-16

COS-SIN Model

model3<-dynlm (temp~harmon(temp))
summary(model3)
## 
## Time series regression with "ts" data:
## Start = 1950(1), End = 2018(12)
## 
## Call:
## dynlm(formula = temp ~ harmon(temp))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.789 -1.742 -0.109  1.573  9.391 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     61.76944    0.08786  703.06   <2e-16 ***
## harmon(temp)cos -6.18023    0.12425  -49.74   <2e-16 ***
## harmon(temp)sin -2.83955    0.12425  -22.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.528 on 825 degrees of freedom
## Multiple R-squared:  0.7841, Adjusted R-squared:  0.7836 
## F-statistic:  1498 on 2 and 825 DF,  p-value: < 2.2e-16
model4<-dynlm(temp~harmon(temp,2))
summary(model4)
## 
## Time series regression with "ts" data:
## Start = 1950(1), End = 2018(12)
## 
## Call:
## dynlm(formula = temp ~ harmon(temp, 2))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.498 -1.653 -0.148  1.572  9.099 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         61.76944    0.08308 743.484   <2e-16 ***
## harmon(temp, 2)cos1 -6.18023    0.11749 -52.600   <2e-16 ***
## harmon(temp, 2)cos2 -0.29167    0.11749  -2.482   0.0132 *  
## harmon(temp, 2)sin1 -2.83955    0.11749 -24.168   <2e-16 ***
## harmon(temp, 2)sin2  1.13566    0.11749   9.666   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.391 on 823 degrees of freedom
## Multiple R-squared:  0.8074, Adjusted R-squared:  0.8065 
## F-statistic: 862.6 on 4 and 823 DF,  p-value: < 2.2e-16

Overfit of models

##compare models
## Seasonal Means Model
st2 <-coef(model2)
## Cos-Sin Model
st3<-fitted(model3)[1:12]
st4<-fitted(model4)[1:12]
## Compare Seasonality Estimates
plot(1:12,st2,lwd=2,type="l",xlab ="Month", ylab ="Seasonality")
lines(1:12,st3,lwd=2, col="brown")
lines(1:12,st4,lwd=2, col="blue")

Construct and plot respect to time and ACf plots

resid.2s<-temp-st2
resid.3s<-temp-st3
resid.4s<-temp-st4


checkresiduals(resid.2s)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

checkresiduals(resid.3s)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

checkresiduals(resid.4s)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

Comment on how the two models fit and on the appropriateness of the stationarity assumption of the residuals. Also compare the fits to those in part B and comment if your initial prediction was correct.

Neither model is great.