Changes in the eigenvalues (joint and component-wise), as well as changes in the trace and the total variation explained (TVE), are explored using the annual temperature curves from the Gayndah post office, Australia. The analysis is based on de-trending via moving average (MA) smoothing, since the raw data appear to have an increasing trend, i.e., the average temperature curves seem to increase over time. For the analysis, make sure to use the updated R package fChange, as most of the new methods are implemented only in the updated version.

library(fChange) # updated version
library(reshape2) # melt() for reshaping the heat-map data
library(fda) # functional data objects and FPCA
library(plotly) # interactive heat map
library(ggplot2) # plotting

Here the minimum temperature curves are observed daily over the period 1893 to 2009; however, the first and the last years are only partially observed, i.e., each contains about 5 months of information. For convenience and completeness, we take the years 1894 to 2008, and the discretized minimum temperatures are converted into functions using \(D=21\) Fourier basis functions.

D=21
fun_data_S = Australian_Temp$Gayndah
basis = create.fourier.basis(rangeval = c(0, 1), nbasis = D)
# vectorize the data based on year-month
fun_data_S$md = with(fun_data_S, paste(Month, Day, sep="_"))
# get rid of NAs
nas = which(is.na(fun_data_S$Days.of.accumulation.of.minimum.temperature))
fun_data_S = fun_data_S[-nas, ]
yy = unique(fun_data_S$Year) # unique observed years
mat.S = matrix(0, D, length(yy))
for (i in 1:length(yy)){
  aa = subset(fun_data_S, Year == yy[i])
  cc = aa$Minimum.temperature..Degree.C.
  cc = cc[!is.na(cc)] # drop missing daily observations
  f_Obs = Data2fd(argvals = seq(0, 1, length = length(cc)), y = cc, basisobj = basis)
  mat.S[, i] = f_Obs$coefs
}
fdata_1 = fd(mat.S, basis)
fdata = fdata_1[2:(ncol(fdata_1$coefs) - 1)] # drop the partially observed first and last years

The resulting data set fdata contains the 115 annual temperature curves (1894 to 2008) for the Gayndah post office, Australia.
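A quick sanity check (not part of the original pipeline) confirms the sample size:

ncol(fdata$coefs) # one column per annual curve; expect 115 (years 1894 to 2008)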

invisible(plot(fdata, col="black", main="Annual Temperature curves, Gayndah Postoffice"))

Note that we first tried to de-mean the functional data via segmentation, i.e., by detecting changes in the mean function; however, there appears to be an increasing trend rather than discrete jumps. In the figure below, the mean functions of 5 consecutive segments of the data are shown; the later the segment, the darker the mean function. The means suggest an increasing trend.

colr = c("gray80", "gray60", "gray40", "gray20", "gray0")
invisible(plot(fdata, col="pink"))
# overlay the mean curve of each of the 5 consecutive segments (23 curves each)
for (i in 1:5){
  lines(mean(fdata[((i-1)*23+1):(i*23)]), col = colr[i], lwd = 3)
}

Therefore, we de-trend the functional data using moving average (MA) smoothing. Assume that \(X_1, \dots, X_{n_x}\) represent the annual Gayndah temperature curves, where \(n_x=115\). For a fixed \(q\in\mathbb{N}_0\), define \[m_i = \frac{X_{i-q}+\dots+X_i+\dots+X_{i+q}}{2q+1}, \qquad i = q+1,\dots,n_x-q,\] and represent the de-meaned data as \[Y_i = X_i - m_i, \qquad i = q+1,\dots,n_x-q.\] Here we choose \(q=5\):

q = 5
D = nrow(fdata$coefs)
N = ncol(fdata$coefs)
Y_t = m_t = matrix(0, D, N - 2*q)
for (i in 1:(N - 2*q)){
  mu = mean(fdata[i:(i + 2*q)]) # MA mean m_{i+q} over the window of 2q+1 curves
  Res = fdata[i + q] - mu       # de-meaned curve Y_{i+q} = X_{i+q} - m_{i+q}
  m_t[, i] = mu$coefs
  Y_t[, i] = Res$coefs
}
# de-trended data
Y_fun = fd(Y_t, fdata$basis)
m_fun = fd(m_t, fdata$basis)
par(mfrow=c(1,2))
invisible(plot(m_fun, main="MA Smoothed, m_i", col="black"))
invisible(plot(Y_fun, main="MA De-trended, Y_i", col="black"))
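As a quick sanity check (not part of the original analysis), the de-trended sample should contain \(n_x - 2q = 105\) curves:

ncol(Y_fun$coefs) # expect 115 - 2*5 = 105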

Note that the resulting de-meaned data \((Y_i,\ i\in\{1,\dots,105\})\) contain no change in the mean function. One can verify this with the mean-change test (here h is the window parameter used in the long-run covariance estimation):

change_FF(Y_fun, h=3)$pvalue
## [1] 1

However, there is a significant change in the joint eigenvalues. First we need to pick the dimension \(d\) on which the joint test will be performed. This can be chosen based on the total variation explained (TVE), i.e.,

pick_dim(Y_fun, 0.85)
## $d
## [1] 10
## 
## $TVEs
##  [1] 0.2390554 0.3951538 0.5054762 0.5903471 0.6486547 0.7042966 0.7546311
##  [8] 0.7967673 0.8337584 0.8672138 0.8975882 0.9250492 0.9446979 0.9627833
## [15] 0.9771330 0.9853906 0.9932811 0.9959501 0.9983714 0.9994308 1.0000000
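For intuition, the reported TVE values can presumably be reproduced from the empirical eigenvalues of the pooled sample. A minimal sketch, assuming pick_dim computes the TVE as the cumulative proportion of variance:

# sketch: cumulative proportion of variance from the pooled empirical eigenvalues
evals = pca.fd(Y_fun, nharm = D, centerfns = TRUE)$values
cumsum(evals) / sum(evals) # should roughly match the TVEs reported above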

So \(d\) is picked to be 10. Then, performing the joint test with \(d=10\) yields:

# change in the joint eigenvalue
E_test_joint = eval_joint(Y_fun, d=10)
E_test_joint
## $change
## [1] 0.5047619
## 
## $pvalue
## [1] 0.005
## 
## $eval_before
##  [1] 0.617271707 0.404279217 0.296218564 0.231298943 0.166091052
##  [6] 0.117897601 0.099821239 0.091361547 0.085124445 0.075446118
## [11] 0.051533994 0.045730887 0.036353696 0.027619360 0.020312449
## [16] 0.012885097 0.010195579 0.007171372 0.004684396 0.001982020
## [21] 0.001201868
## 
## $eval_after
##  [1] 0.431869816 0.257930451 0.208384702 0.141108249 0.121778341
##  [6] 0.104097624 0.094427788 0.083357778 0.077474481 0.055830981
## [11] 0.044576037 0.037589100 0.027099121 0.022040875 0.019705836
## [16] 0.013307508 0.011969501 0.003879597 0.002666663 0.001394526
## [21] 0.000536911

The joint test indicates a significant change. One might then be interested in which of the 10 eigenvalues contribute to the joint test, i.e., which eigenvalues are changing marginally:

# component-wise eigenvalue tests
sapply(1:10, function(d) eval_component(Y_fun, component = d)$pvalue)
##  [1] 0.235 0.137 0.095 0.361 0.011 1.000 0.994 0.991 0.883 0.627

It turns out that the fifth eigenvalue is changing (p-value below 0.05). The joint test should then be able to capture the change as long as the fifth eigenvalue is included in the analysis, i.e., \(d\geq 5\). This turns out to be true; moreover, when \(d\) is chosen between 5 and 10, the estimated change location remains the same as well.

# pvalues of the joint test for d=5,...,10
sapply(5:10, function(d) eval_joint(Y_fun, d)$pvalue)
## [1] 0.000 0.005 0.001 0.003 0.004 0.007
# change locations (theta) of the joint test for d=5,...,10
sapply(5:10, function(d) eval_joint(Y_fun, d)$change)
## [1] 0.5047619 0.5047619 0.5047619 0.5047619 0.5047619 0.5047619

Rescaling the estimated change fraction by the sample size gives the change location:

N_y = N - 2*q # sample size after de_meaning
loc = E_test_joint$change * N_y
loc
## [1] 53

Hence the change point location is estimated to be observation 53. Below is a plot of the functional data before and after the change:

invisible(plot(Y_fun, col="grey", main="De-meaned functional data"))
lines(Y_fun[1:loc], col="red")
lines(Y_fun[(loc+1):N_y], col="blue")

This suggests looking into changes in the total variation. Below we test for changes in the trace of the covariance operator of the functional data, i.e., the total variation.

trace_change(Y_fun)
## $change
## [1] 0.5047619
## 
## $pvalue
## [1] 0.009
## 
## $trace_before
## [1] 2.404481
## 
## $trace_after
## [1] 1.761026
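The reported traces can be sanity-checked against the empirical eigenvalues of each segment. A minimal sketch, assuming trace_change estimates each trace as the sum of that segment's empirical eigenvalues (with each segment centered at its own mean):

sum(pca.fd(Y_fun[1:loc], nharm = D)$values)        # compare to trace_before
sum(pca.fd(Y_fun[(loc+1):N_y], nharm = D)$values)  # compare to trace_after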

The change in the total variation also comes out to be significant, and the change location is the same as for the change in the joint eigenvalues. Tracing back the year in which the change happens:

yy[loc + q + 1] # +q for the MA window trimmed at the start, +1 for the removed first year
## [1] 1951

The year right before this change shows some interesting features:

# observed daily minimum temperatures in 1950
aa = subset(fun_data_S, Year == "1950")$Minimum.temperature..Degree.C.
invisible(plot(fdata, col = "grey"))
lines(fdata[loc + q - 1], lwd = 2) # the fitted 1950 curve
points(seq(0, 1, length.out = length(aa)), aa, col = "red", cex = 0.5)

It seems that in June 1950 the weather was unexpectedly hotter than in any other June, inflating the variation before this year; after 1950, the temperature curves appear to have less variation. To confirm, below is a heat map of the monthly average temperatures from 1894 to 2008:

dat = fun_data_S
drop_rows = which(dat$Year %in% c(yy[1], yy[117])) # partially observed first and last years
dat = dat[-drop_rows, ]
year = unique(dat$Year)
month = unique(dat$Month)
mean_t = matrix(0, length(month), length(year))
for (i in 1:length(month)){
  for (j in 1:length(year)){
    dd = subset(dat, Year == year[j] & Month == month[i])
    mean_t[i, j] = mean(dd$Minimum.temperature..Degree.C.)
  }
}
m_t = t(mean_t)
colnames(m_t) = month
rownames(m_t) = year
ddd = melt(m_t) # long format: one row per (year, month) cell
colnames(ddd) = c("year", "month", "avg.Temp")
p <- ggplot(ddd, aes(year, month)) +
  geom_tile(aes(fill = avg.Temp), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue")
ggplotly(p)

It seems that even though the average temperature is increasing annually, the variation within the annual temperature curves is decreasing, resulting in less and less dramatic temperature swings.

To explore whether the eigenfunctions are changing or not, we first need to remove the effect of the eigenvalue change by standardizing the data.
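In symbols, with \(\hat\phi_j\) the eigenfunctions estimated from the pooled sample, \(\hat\lambda_j^{(1)}\) and \(\hat\lambda_j^{(2)}\) the eigenvalues estimated before and after the change point \(k^*\), and \(\hat\eta_{ij} = \langle Y_i, \hat\phi_j\rangle\) the scores, the helper below computes (the factor \(1/d\) comes from averaging over the \(d\) components) \[\tilde{Y}_i = \frac{1}{d}\sum_{j=1}^{d} \frac{\hat\eta_{ij}}{\sqrt{\hat\lambda_j^{(1)}}}\,\hat\phi_j, \quad i\leq k^*, \qquad \tilde{Y}_i = \frac{1}{d}\sum_{j=1}^{d} \frac{\hat\eta_{ij}}{\sqrt{\hat\lambda_j^{(2)}}}\,\hat\phi_j, \quad i> k^*.\]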

Eval_transform = function(fdobj, k_star, d){
  N = ncol(fdobj$coefs)
  D = nrow(fdobj$coefs)
  fdata = center.fd(fdobj)
  data_b = fdata[1:k_star]       # sample before the change
  data_a = fdata[(k_star+1):N]   # sample after the change
  l1 = pca.fd(data_b, nharm = D, centerfns = TRUE)$values # eigenvalues before
  l2 = pca.fd(data_a, nharm = D, centerfns = TRUE)$values # eigenvalues after
  phi = pca.fd(fdata, nharm = D, centerfns = TRUE)$harmonics # pooled eigenfunctions
  scores = inprod(fdata, phi) # projection scores

  # rebuild each curve from its first d scores, rescaled by the
  # segment-specific eigenvalues (averaged over the d components)
  Y_mat = matrix(0, D, N)
  for (i in 1:k_star){
    Y_mat[, i] = rowMeans(sapply(1:d, function(j) scores[i, j]/sqrt(l1[j]) * phi[j]$coefs))
  }
  for (i in (k_star+1):N){
    Y_mat[, i] = rowMeans(sapply(1:d, function(j) scores[i, j]/sqrt(l2[j]) * phi[j]$coefs))
  }
  fd(Y_mat, fdata$basis)
}

YY = Eval_transform(Y_fun, loc, 5)
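As an informal check (exact equality is not expected, since the eigenfunctions come from the pooled sample), the leading eigenvalues of the two standardized segments should now be of comparable magnitude:

pca.fd(YY[1:loc], nharm = 5)$values[1:5]        # leading eigenvalues before
pca.fd(YY[(loc+1):N_y], nharm = 5)$values[1:5]  # leading eigenvalues after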

Then we apply the two-sample test for changes in the covariance operator, where the two samples are the standardized functional data before and after the change:

Cov_test(YY[1:loc], YY[(1+loc):N_y], 5)
## [1] 0.8212429

This test suggests (considering possible deviations from Gaussianity and independence) that the covariance structure, and hence the eigenfunctions, do not change.