Question 1a: Exploratory Data Analysis
Plot the Time Series and ACF plots. Comment on the main features, and identify what (if any) assumptions of stationarity are violated. Hint: Before plotting, can you infer anything from the nature of the data?
Inference
Given temperature data I would assume 1) seasonality and 2) small trend higher due to the assumption of global warming. I think the seasonal effects are stronger than the trend effect. Thus, best to fit a seasonal model.
Main Features
It appears that the monthly temperature is slowly rising over time. Additionally, the lows in temperature seemed to occur more often in the 1950’s. (Higher lows and higher highs as times passes) The highs seem to occur more frequently now than in the 1950’s. There does not appear to be any sharp changes in behavior or any outlying observations.
par(mfrow=c(1,1))
plot(data,type="l")

acf(data,lag=12*4,main="")

What assumptions of stationarity are violated?
Stationary Assumptions
Staionary Assumptions = The time series statistical properties: mean, variance and autocorrelation do not change over time.
The ACH chart above shows that autocorrelation does vary over time. There clearly is a seasonal pattern. The correlations are beyond the blue lines, meaning they are significantly different from zero.
The plots below show temperature by year, which highlight the slight trend higher. The box plot show temperature by month. This box plot highlights the seasonality of the data.
#This will aggregate the cycles and display a year on year trend
plot(aggregate(data,FUN=mean),xlab="Year")

Box plot across months will give us a sense of seasonal effects.
```r
boxplot(data~cycle(data),xlab="Month",ylab="Temp")

The decompose plots below show a trend with an increasing mean over time.
plot(decompose(data))

While it is safe to say the series is not stationary, we can use the Augmented Dickey-Fuller test to check is the data set is stationary. The p-value of .09 is higher than the significance level(.05) means we accept the null hypothesis that the time series is non stationary.
tseries::adf.test(data,k=48)
##
## Augmented Dickey-Fuller Test
##
## data: data
## Dickey-Fuller = -3.178, Lag order = 48, p-value = 0.09172
## alternative hypothesis: stationary
Question 1b: Trend Estimation
Fit the following trend estimation models:
Moving average Parametric quadratic polynomial Local Polynomial Splines
The plot below shows the moving avergae(purple) trending higher over time.
#create time points
temp=data
time.pts=c(1:length(temp))
time.pts=c(time.pts-min(time.pts))/max(time.pts)
### Fit a moving average
mav.fit = ksmooth(time.pts, temp, kernel = "box")
temp.fit.mav = ts(mav.fit$y,start= 1950,frequency=12)
## Is there a trend?
ts.plot(temp,ylab="Temperature")
lines(temp.fit.mav,lwd=2,col="purple")
abline(temp.fit.mav[1],0,lwd=2,col="blue")

The following uses parametric regression (green) to determine the trend. This method shows an increasing trend as well. The predictors are not statistically signifcant.
## parametric regression
x1=time.pts
x2=time.pts^2
lm.fit=lm(temp~x1+x2)
summary(lm.fit)
##
## Call:
## lm(formula = temp ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9226 -4.4207 -0.2526 4.6385 13.4337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.7226 0.5489 108.809 <2e-16 ***
## x1 3.1639 2.5384 1.246 0.213
## x2 1.4030 2.4606 0.570 0.569
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.277 on 825 degrees of freedom
## Multiple R-squared: 0.05925, Adjusted R-squared: 0.05697
## F-statistic: 25.98 on 2 and 825 DF, p-value: 1.143e-11
## Is there a trend?
temp.fit.lm= ts(fitted(lm.fit),start=1950,frequency=12)
ts.plot(temp,ylab="Temperature")
lines(temp.fit.lm,lwd=2,col="green")
abline(temp.fit.lm[1],0,lwd=2,col="blue")

The trend fitted using local polynomial regression (brown) and splines (yellow) are higher as well.
## Local Polynomial Trend Estimation
loc.fit= loess(temp~time.pts)
temp.fit.loc= ts(fitted(loc.fit),start=1950,frequency=12)
## Splines Trend Estimation
library(mgcv)
gam.fit= gam(temp~s(time.pts))
temp.fit.gam= ts(fitted(gam.fit),start=1950,frequency=12)
## Is there a trend?
ts.plot(temp,ylab="Temperature")
lines(temp.fit.loc,lwd=2,col="yellow")
lines(temp.fit.gam,lwd=2,col="red")
abline(temp.fit.loc[1],0,lwd=2,col="blue")

Summary of all trends
## Compare all estimated trends
all.val= c(temp.fit.mav,temp.fit.lm,temp.fit.gam,temp.fit.loc)
ylim= c(min(all.val),max(all.val))
ts.plot(temp.fit.lm,lwd=2,col="green",ylim=ylim,ylab="Temperature")
lines(temp.fit.mav,lwd=2,col="purple")
lines(temp.fit.gam,lwd=2,col="red")
lines(temp.fit.loc,lwd=2,col="brown")
legend(x=1900,y=64,legend=c("MAV","LM","GAM","LOESS"),lty= 1, col=c("purple","green","red","brown"))

All four residuals violate rules of staionarity. The residual models are not a good fit. All four residuals show strong seasonal patters as shown in the ACF charts The residuals do not look normally distributed either.
par(mfrow=c(1,1))
#acf(resid.1,lag=12*4,main="")
#acf(resid.2,lag=12*4,main="")
#acf(resid.3,lag=12*4,main="")
#acf(resid.4,lag=12*4,main="")
checkresiduals(resid.1)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

checkresiduals(resid.2)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

checkresiduals(resid.3)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

checkresiduals(resid.4)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

Question 1c: Seasonality Estimation
Seasonality Estimation:
Fit the following seasonality estimation models. Categorical Linear Regression (ANOVA) COS-SIN Overlay the fitted values on the original time series. Construct and plot the residuals with respect to time and ACF plots. Comment on how the two models fit and on the appropriateness of the stationarity assumption of the residuals. Also compare the fits to those in part B and comment if your initial prediction was correct.
Seasonality using Seasonal Means Model (Avg temp) without January and with January
## Estimate seasonality using Seasonal Means Model (Avg temp)
## Drop January/with intercept --avg tem per month
model1<-dynlm(temp~season(temp))
summary(model1)
##
## Time series regression with "ts" data:
## Start = 1950(1), End = 2018(12)
##
## Call:
## dynlm(formula = temp ~ season(temp))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7145 -1.5855 -0.0319 1.6000 8.6768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.5145 0.2864 193.867 < 2e-16 ***
## season(temp)Feb 0.7420 0.4050 1.832 0.067266 .
## season(temp)Mar 1.5333 0.4050 3.786 0.000164 ***
## season(temp)Apr 3.6188 0.4050 8.936 < 2e-16 ***
## season(temp)May 6.0797 0.4050 15.013 < 2e-16 ***
## season(temp)Jun 9.1014 0.4050 22.475 < 2e-16 ***
## season(temp)Jul 12.4420 0.4050 30.724 < 2e-16 ***
## season(temp)Aug 13.4681 0.4050 33.258 < 2e-16 ***
## season(temp)Sep 12.7855 0.4050 31.572 < 2e-16 ***
## season(temp)Oct 9.8087 0.4050 24.221 < 2e-16 ***
## season(temp)Nov 4.9609 0.4050 12.250 < 2e-16 ***
## season(temp)Dec 0.5188 0.4050 1.281 0.200487
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.379 on 816 degrees of freedom
## Multiple R-squared: 0.811, Adjusted R-squared: 0.8084
## F-statistic: 318.3 on 11 and 816 DF, p-value: < 2.2e-16
## Seasonal mean effects/without intercept --avg tem per month
model2<-dynlm(temp~ season(temp)- 1)
summary(model2)
##
## Time series regression with "ts" data:
## Start = 1950(1), End = 2018(12)
##
## Call:
## dynlm(formula = temp ~ season(temp) - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7145 -1.5855 -0.0319 1.6000 8.6768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## season(temp)Jan 55.5145 0.2864 193.9 <2e-16 ***
## season(temp)Feb 56.2565 0.2864 196.5 <2e-16 ***
## season(temp)Mar 57.0478 0.2864 199.2 <2e-16 ***
## season(temp)Apr 59.1333 0.2864 206.5 <2e-16 ***
## season(temp)May 61.5942 0.2864 215.1 <2e-16 ***
## season(temp)Jun 64.6159 0.2864 225.7 <2e-16 ***
## season(temp)Jul 67.9565 0.2864 237.3 <2e-16 ***
## season(temp)Aug 68.9826 0.2864 240.9 <2e-16 ***
## season(temp)Sep 68.3000 0.2864 238.5 <2e-16 ***
## season(temp)Oct 65.3232 0.2864 228.1 <2e-16 ***
## season(temp)Nov 60.4754 0.2864 211.2 <2e-16 ***
## season(temp)Dec 56.0333 0.2864 195.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.379 on 816 degrees of freedom
## Multiple R-squared: 0.9985, Adjusted R-squared: 0.9985
## F-statistic: 4.682e+04 on 12 and 816 DF, p-value: < 2.2e-16
COS-SIN Model
model3<-dynlm (temp~harmon(temp))
summary(model3)
##
## Time series regression with "ts" data:
## Start = 1950(1), End = 2018(12)
##
## Call:
## dynlm(formula = temp ~ harmon(temp))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.789 -1.742 -0.109 1.573 9.391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.76944 0.08786 703.06 <2e-16 ***
## harmon(temp)cos -6.18023 0.12425 -49.74 <2e-16 ***
## harmon(temp)sin -2.83955 0.12425 -22.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.528 on 825 degrees of freedom
## Multiple R-squared: 0.7841, Adjusted R-squared: 0.7836
## F-statistic: 1498 on 2 and 825 DF, p-value: < 2.2e-16
model4<-dynlm(temp~harmon(temp,2))
summary(model4)
##
## Time series regression with "ts" data:
## Start = 1950(1), End = 2018(12)
##
## Call:
## dynlm(formula = temp ~ harmon(temp, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.498 -1.653 -0.148 1.572 9.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.76944 0.08308 743.484 <2e-16 ***
## harmon(temp, 2)cos1 -6.18023 0.11749 -52.600 <2e-16 ***
## harmon(temp, 2)cos2 -0.29167 0.11749 -2.482 0.0132 *
## harmon(temp, 2)sin1 -2.83955 0.11749 -24.168 <2e-16 ***
## harmon(temp, 2)sin2 1.13566 0.11749 9.666 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.391 on 823 degrees of freedom
## Multiple R-squared: 0.8074, Adjusted R-squared: 0.8065
## F-statistic: 862.6 on 4 and 823 DF, p-value: < 2.2e-16
Overfit of models
##compare models
## Seasonal Means Model
st2 <-coef(model2)
## Cos-Sin Model
st3<-fitted(model3)[1:12]
st4<-fitted(model4)[1:12]
## Compare Seasonality Estimates
plot(1:12,st2,lwd=2,type="l",xlab ="Month", ylab ="Seasonality")
lines(1:12,st3,lwd=2, col="brown")
lines(1:12,st4,lwd=2, col="blue")

Construct and plot respect to time and ACf plots
resid.2s<-temp-st2
resid.3s<-temp-st3
resid.4s<-temp-st4
checkresiduals(resid.2s)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

checkresiduals(resid.3s)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

checkresiduals(resid.4s)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

Comment on how the two models fit and on the appropriateness of the stationarity assumption of the residuals. Also compare the fits to those in part B and comment if your initial prediction was correct.
Neither model is great.