Bellingham_Weather

Bellingham Weather Data

library(repmis)
library(moments)

githubURL <- "https://github.com/AndyBunn/TeachingData/raw/master/kbli.Rdata"
source_data(githubURL)

## Downloading data from: https://github.com/AndyBunn/TeachingData/raw/master/kbli.Rdata

## SHA-1 hash of the downloaded data file is:
## e61c66faa0cd4f23f6bf7e96d860b6b591c39cef

## [1] "kbli.zoo" "kbli.ts"

tAvgF <- kbli.ts[,"tAvgF"]

Decompose the data

tAvgF.d <- decompose(tAvgF)
plot(tAvgF.d)

Standard deviation of the deseasonalized data

sd(tAvgF)

## [1] 8.908073

Taking the seasonal component out, I made a plot of the deseasonalized series (trend + residuals)

tAvgF.ag<-tAvgF.d$trend+tAvgF.d$random
plot(tAvgF.ag,ylab="Average Temp (F)")

Standard deviation of the de-seasonalized series

sd(tAvgF.ag,na.rm=T)

## [1] 2.404545

The s.d for the decomposed data is 8.9 (with seasonality), and the s.d. for the deseasonalized series is 2.4…considerably less with season taken out. This means that the variability within the dataset is greatly increased when seasonality is included.

How strongly correlated is the average temperature at time x with the temperature at time x-1?

cor(tAvgF[2:480],tAvgF[1:(480-1)])

## [1] 0.8114186

This means the autocorrelation value at the first lag (k=1) is phi=0.811.

Plot the acf and pacf of both series

par(mar=c(1,1,1,1))

acf(tAvgF,na.action=na.pass,main="Deseasonalized")

pacf(tAvgF, na.action = na.pass, main="")

acf(tAvgF.ag, na.action = na.pass, main="Residual")

pacf(tAvgF.ag, na.action = na.pass, main="")

The decomposed data follows an AF(1) model. This is obvious in the pacf plot where the 2nd lag is negatively related to the 1st lag. The cyclinc nature in the ACF plot definitely shows the strong seasonal component.

The de-seasonalized data also follows an AR(1) model. This is obvious in the pacf plot where the 2nd lag is negatively related to the 1st lag.

Part 2: my own AR(2) model of the form: xi=phi1(x-1) + phi2(x-2) + episolon(i)

With phi1 and phi2 set to 0.5 and 0.25 respectively

n <- 300
epsilon <- rnorm(n = n,mean = 0,sd = 1) 
x <- numeric(length = n)
phi1 <- 0.5
phi2<-0.25
for(i in 3:n){
   x[i] <- phi1 * x[i-1] + phi2 *x[i-2] + epsilon[i]  
  }
plot(x,type="l")

Because I’ve created an AR(2) model, I would anticipate that the acf would show that the 2nd lag is lower than the 1st lag, but still high enough to be considered affected by the previous lag.

acf and pacf

acf(x)

pacf(x)

The pacf graph indicates that the 2nd lag falls within the range of similarity with the 1st lag (outside the 95% confidence interval)

With phi1 and phi2 set to 0.6 and 0.4 respectively

n <- 300
epsilon <- rnorm(n = n,mean = 0,sd = 1) 
x <- numeric(length = n)
phi1 <- 0.6
phi2<-0.4
  for(i in 3:n){
   x[i] <- phi1 * x[i-1] + phi2 *x[i-2] + epsilon[i]  
  }
plot(x,type="l")

##acf and pacf

acf(x)

pacf(x)

With phi1 and phi2 set to 0.6 and 0.4 respectively and s.d of 2.

n <- 300
epsilon <- rnorm(n = n,mean = 0,sd = 2) 
x <- numeric(length = n)
phi1 <- 0.6
phi2<-0.4
for(i in 3:n){
   x[i] <- phi1 * x[i-1] + phi2 *x[i-2] + epsilon[i]  
  }
plot(x,type="l")

acf and pacf

acf(x)

pacf(x)

By changing the s.d. from 1 to 2, I noticed an interesting difference in the acf graphs…an s.d. of 2 shows a much less steep decline than the s.d. of 1.

Part 3: New Time Series data–“Orange Data”

data("Seatbelts")
tsp(Seatbelts)

## [1] 1969.000 1984.917   12.000

plot(Seatbelts)

DK<-Seatbelts[,"DriversKilled"]

plot(DK)

Histogram of data on Drivers killed

kt <-kurtosis(DK)
sk <- skewness(DK)
my.xlim <- range(DK)
h<-hist(DK, breaks=10, col="lightblue", xlab="Number of Deaths per year",
        main="",xlim=my.xlim) 
xfit<-seq(min(DK),max(DK),length=100) 
yfit<-dnorm(xfit,mean=mean(DK),sd=sd(DK)) 
yfit <- yfit*diff(h$mids[1:2])*length(DK) 
lines(xfit, yfit, col="darkblue", lwd=2)
boxplot(DK, horizontal=TRUE,  outline=TRUE,  axes=FALSE,
        ylim=my.xlim, col = "green", add = TRUE, boxwex=3)
text(x = 145, y=33, labels = paste("Kurtosis=",round(kt,2)),pos = 4)
text(x = 145, y=29, labels = paste("Skewness=",round(sk,2)),pos = 4)

Is there a linear trend in time?

Annualdeaths <- time(DK)
DK.lm <- lm(DK ~ Annualdeaths)
summary(DK.lm)

## 
## Call:
## lm(formula = DK ~ Annualdeaths)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.934 -17.733  -2.797  15.238  67.865 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3709.4345   741.7417   5.001 1.29e-06 ***
## Annualdeaths   -1.8142     0.3752  -4.835 2.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.01 on 190 degrees of freedom
## Multiple R-squared:  0.1096, Adjusted R-squared:  0.1049 
## F-statistic: 23.38 on 1 and 190 DF,  p-value: 2.736e-06

The coefficient shows that the number of deaths decreased by ~2 people annually over the 16 year time period. This makes some sense…as seatbelt laws most likely became more enforced over that time period.

acf(DK)

pacf(DK)

According to these acf/pacf results, the data fits and AR(1) model–the lag at (k-1) in the pacf figure indicates that the number of deaths at time Y is not effected by the time at Y-1. This makes perfect sense because car crashes are typically unaffected by each other and don’t follow many identifiable trends over time.