Toy Example
Data
Let’s generate a toy dense functional dataset!
##### SET #####
M <- 200 #number of measurements per subjects
N <- 100 # the number of subjects
set.seed(123) #for bootstrap
##### DEFINE ####
s <- seq(0,10,length.out = M) #from 0 to 10, time length 200(=M)
# generating functions
meanFunct <- function(s) s+10*exp(-(s-5)^2)
eigFunct1 <- function(s) cos(2*s*pi/10)/sqrt(5)
eigFunct2 <- function(s) sin(2*s*pi/10)/sqrt(5)
meanFunct \(y=s+10{e}^{{-(s-5)}^{2}}\)
plot(s,meanFunct(s),type='l')

The first eigen function is \(y=\frac{cos(\frac{2s\pi}{10})}{\sqrt{5}}\)
plot(s,eigFunct1(s),type='l',col='red')

The second eigen function is \(\frac{-sin(\frac{2s\pi}{10})}{\sqrt{5}}\)
plot(s,eigFunct2(s),type='l',col='blue')

Random residuals for eigen effects
Ksi <- matrix(rnorm(N*2),ncol=2)
head(Ksi)
[,1] [,2]
[1,] -0.56047565 -0.71040656
[2,] -0.23017749 0.25688371
[3,] 1.55870831 -0.24669188
[4,] 0.07050839 -0.34754260
[5,] 0.12928774 -0.95161857
[6,] 1.71506499 -0.04502772
- 1st column = eigen 1
- 2nd column = eigne 2
- for N subjects (counting number, of course)
- Distributions are from Normal PDF
Centering of matrix
Ksi <- apply(Ksi,2,scale)
head(Ksi)
[,1] [,2]
[1,] -0.71304802 -0.6234417
[2,] -0.35120270 0.3768723
[3,] 1.60854170 -0.1438956
[4,] -0.02179795 -0.2481894
[5,] 0.04259548 -0.8728888
[6,] 1.77983218 0.0646535
Mean = 0
round(colMeans(Ksi))
[1] 0 0
Scale up by 5 and 2 for each (just for fun…)
Ksi <- Ksi %*% diag(c(5,2))
head(Ksi,n=10)
[,1] [,2]
[1,] -3.5652401 -1.2468834
[2,] -1.7560135 0.7537447
[3,] 8.0427085 -0.2877911
[4,] -0.1089898 -0.4963787
[5,] 0.2129774 -1.7457776
[6,] 8.8991609 0.1293070
[7,] 2.0294909 -1.4009660
[8,] -7.4246470 -3.2273356
[9,] -4.2574783 -0.5639783
[10,] -2.9363418 2.1231802
Creating \(Y_{true}\)
\[
Y_{true}=E_{Njk}+\mu_{Nk}+\epsilon_{Nj}
\]
yTrue <- Ksi %*% t(matrix(c(eigFunct1(s),eigFunct2(s)),ncol=2)) + t(matrix(rep(meanFunct(s),N),nrow=M))
- Effects from eigen vectors
- Effects from a mean latent function
- Effects from random residuals
The shape is 100(=N, subjects) by 200(=M, time length data, i.e.,
measurements)
dim(yTrue)
[1] 100 200
Running FPCA on a dense dataset
L3 <- MakeFPCAInputs(
IDs = rep(1:N, each=M), # like, 1 1 1 2 2 2 3 3 3 for rep(1:3, each=3)
tVec = rep(s,N), # like 1 2 1 2 1 2 for rep(1:2,3)
t(yTrue) #row = time, column = subject
)
Let’s do it!
FPCAdense <- FPCA(L3$Ly,L3$Lt)
Results
plot(FPCAdense)

As we expected, * mean- and eigen functiosn are separaeted and all
predicted as they are * In the scree plot, two principal components
explain almost everything
Let’s extract those.
meanOfResult <- FPCAdense$mu
phi1 <- FPCAdense$phi[,1]
phi2 <- FPCAdense$phi[,2]
library(scales) #for alpha()
par(mfrow=c(2,2)) # 2 by 2 plot areas
# area 1,1 (for all subject data)
plot(1,type='n',xlab='time',ylab='value',xlim=c(min(s),max(s)),ylim=c(min(yTrue),max(yTrue)))
for(i in 1:100) {
lines(x=s,y=yTrue[i,],col=alpha('gray',0.2))
}
plot(s,meanOfResult,type='l',col='red')
plot(s,phi1,type='l',col='blue')
plot(s,phi2,type='l',col='green')
par(mfrow=c(1,1)) #back to normal

This package works like magic!
Running FPCA on a sparse and noisy dataset
Now, let’s see how the PACE algorithm works.
ySparse <- Sparsify(yTrue,s,sparsity = c(1:5)) #generating sparse data (measurements from 1 to 5 points)
Now noise.
ySparse$yNoisy <- lapply(ySparse$Ly,function(x) x+0.5*rnorm(length(x)))
Run
FPCAsparse <- FPCA(ySparse$yNoisy, ySparse$Lt)
Result
plot(FPCAsparse)

Now let’s consider cumulative functional variance to determine a
proper number of principal factors.
FPCAsparse$cumFVE
[1] 0.7704850 0.9332069 0.9891502 0.9993241
For your information, the sparse data look like this:
par(mfrow=c(1,2))
#for dense data
par(mfrow=c(2,2)) # 2 by 2 plot areas
# area 1,1 (for all subject data)
plot(1,type='n',xlab='time',ylab='value',xlim=c(min(s),max(s)),ylim=c(min(yTrue),max(yTrue)),
main="Dense Data")
for(i in 1:100) {
lines(x=s,y=yTrue[i,],col=alpha('green',0.2))
}
#for sparse data
plot(1,type='n',xlab='time',ylab='value',
xlim=c(min(unlist(ySparse$Lt)),max(unlist(ySparse$Lt))),
ylim=c(min(unlist(ySparse$Ly)),max(unlist(ySparse$Ly))),
main="Sparse Data")
for(i in 1:100) {
x=ySparse$Lt[[i]]
y=ySparse$Ly[[i]]
lines(x,y,col=alpha('red',0.2))
}
par(mfrow=c(1,1))

Estimate path plots: let’s create the fitted sample path plot based
on the results from the parse data.
CreatePathPlot(FPCAsparse)

Prediction grid:
SelectK(FPCAsparse,criterion='AIC')
$K
[1] 2
$criterion
[1] 515.5021
# K=SelectK(FPCAsparse,criterion='AIC')$K
predSparse <- predict(FPCAsparse,ySparse$Ly,ySparse$Lt,K=2) #number of PCA = 2
Number of samples in ySparse:
length(ySparse$Lt)
[1] 100
Number of predictions:
nrow(predSparse$scores)
[1] 100
head(predSparse$scores)
[,1] [,2]
[1,] 1.8118913 -1.25381426
[2,] 0.6427877 -0.18594988
[3,] -5.8391469 -0.74846297
[4,] 1.9361811 0.07853147
[5,] 3.0165165 -0.43987666
[6,] -6.1766873 -1.29560020
Real-world Example
Introduction
- medfly25 dataset: this is a dataset containing the eggs laid from
789 medflies during the first 25 days of their lives. (see also, Carey
et al. 1998))
Carey, J. R., Liedo, P., Müller, H. G., Wang, J. L., & Chiou, J.
M. (1998). Relationship of age patterns of fecundity to mortality,
longevity, and lifetime reproduction in a large cohort of Mediterranean
fruit fly females. The Journals of Gerontology Series A: Biological
Sciences and Medical Sciences, 53(4), B245-B251.
- The data are rather noisy, dense and with a characteristic flat
start.
Data
head(medfly25)
N <- nrow(medfly25)
library(scales)
library(ggplot2)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
g1 <- medfly25 %>%
ggplot(aes(x=Days,y=nEggs,group=ID)) +
geom_line(alpha=0.2,color='gray') +
geom_smooth(aes(group=1),se=FALSE)+
theme_bw() +
labs(x="Days",y="Number of Eggs") +
ggtitle("Medflies - Time Series Plot")
g2 <-medfly25 %>%
ggplot(aes(x=Days,y=nEggs)) +
geom_density_2d_filled() +
scale_fill_brewer()+
theme_bw() +
labs(x="Days",y="Number of Eggs") +
ggtitle("Medflies - Density Plot") +
theme(legend.position='none')
grid.arrange(g1,g2,ncol=2)

FPCA
Flies <- MakeFPCAInputs(medfly25$ID,medfly25$Days,medfly25$nEggs)
fpcaObjFlies <- FPCA(Flies$Ly,Flies$Lt,
list(plot=TRUE,
methodMuCovEst='smooth',
userBwCov=2))

Select number of K (see, Yao et al., 2005)
Yao, F., Müller, H. G., & Wang, J. L. (2005). Functional data
analysis for sparse longitudinal data. Journal of the American
statistical association, 100(470), 577-590.
numK <- SelectK(fpcaObjFlies,criterion = "AIC")
numK
$K
[1] 1
$criterion
[1] 29270.35
K <- numK$K
g<-CreatePathPlot(fpcaObjFlies)

g
NULL
medfly25[medfly25$ID==unique(medfly25$ID)[c(3,5,135)],] %>%
ggplot(aes(x=Days,y=nEggs,color=as.factor(ID))) +
geom_line()

par(mfrow=c(2,2))
CreatePathPlot(fpcaObjFlies,subset=c(3,5,135))
CreatePathPlot(fpcaObjFlies,subset=c(3,5,135),K=1,main="K=1",pch=4)
CreatePathPlot(fpcaObjFlies,subset=c(3,5,135),K=3,main="K=3",pch=4)
CreatePathPlot(fpcaObjFlies,subset=c(3,5,135),K=4,main="K=4",pch=4)
par(mfrow=c(1,1))

The following sections contain descriptions about optional functions
in fdapace.
- Outlier detection (Febrero et al., 2007)
Febrero, M., Galeano, P., & González-Manteiga, W. (2007). A
functional analysis of NOx levels: location and scale estimation and
outlier detection. Computational Statistics, 22(3), 411-427.
- Functional box-plot (Hyndman and Shang, 2010)
Hyndman, R. J., & Shang, H. L. (2010). Rainbow plots, bagplots,
and boxplots for functional data. Journal of Computational and Graphical
Statistics, 19(1), 29-45.
- KDE: Kernel density estimation: it is a non-parametric method to
estimate the PDF of a random variable based on kernels as weights.
CreateOutliersPlot(fpcaObjFlies,optns=list(K=3,variant='KDE'))

- The green line corresponds to
the functional median,
the dark gray area to the area spanned by the curves within the 25th and
75th percentile and the light gray to the area spanned by the curves
within the 2.5th and 97.5th percentile.
CreateFuncBoxPlot(fpcaObjFlies,xlab="Days",ylab="Number of eggs",
optns=list(K=3,variant="bagplot"))
Warning: Cannot use bagplot because aplpack::compute.bagplot is unavailable; reverting to point-wise

---
title: "Functional PCA Analysis Using PACE Approach"
output: html_notebook
---

# Introduction
* A brief introduction to the package fdapace (Gajardo et al. 2021)

> Gajardo, Alvaro, Satarupa Bhattacharjee, Cody Carroll, Yaqing Chen, Xiongtao Dai, Jianing Fan, Pantelis Z. Hadjipantelis, et al. (2021). fdapace: Functional Data Analysis and Empirical Dynamics. https://CRAN.R-project.org/package=fdapace 

* PACE is an algorithm for sparse data cases in FDA.
* https://www.jstor.org/stable/27590579 
* Author: Taekyung Kim (PhD.), Associate Professor, Kwangwoon University, Korea
* This document is originated from `https://cran.r-project.org/web/packages/fdapace/vignettes/fdapaceVig.html`.

# FPCA in R using fdapace

## Install
Let's install the `fdapace` package first.
```{r message=FALSE, warning=FALSE}
install.packages('fdapace')
```

## Load the Package

You know that the package is called before using it.
```{r}
library(fdapace)
```

## Basic Notation

### Data
* one has two lists `yList` and `tList` 
* `yList` = a list of vectors each containing the observed values $Y_{ij}$ for the _ith_ subject
* `tList` = a list of vectors containing corresponding time points

### Model Function
```
FPCAobj <- FPCA(Ly = yList, Lt = tList)
```

# Toy Example

## Data
Let's generate a toy dense functional dataset!

```{r}
##### SET #####
M <- 200 #number of measurements per subjects
N <- 100 # the number of subjects
set.seed(123) #for bootstrap
##### DEFINE ####
s <- seq(0,10,length.out = M) #from 0 to 10, time length 200(=M)
# generating functions
meanFunct <- function(s) s+10*exp(-(s-5)^2)
eigFunct1 <- function(s) cos(2*s*pi/10)/sqrt(5)
eigFunct2 <- function(s) sin(2*s*pi/10)/sqrt(5)

```

`meanFunct` $y=s+10{e}^{{-(s-5)}^{2}}$
```{r}
plot(s,meanFunct(s),type='l')
```

The first eigen function is $y=\frac{cos(\frac{2s\pi}{10})}{\sqrt{5}}$
```{r}
plot(s,eigFunct1(s),type='l',col='red')
```

The second eigen function is $\frac{-sin(\frac{2s\pi}{10})}{\sqrt{5}}$

```{r}
plot(s,eigFunct2(s),type='l',col='blue')
```

Random residuals for eigen effects

```{r}
Ksi <- matrix(rnorm(N*2),ncol=2)
head(Ksi)
```

* 1st column = eigen 1
* 2nd column = eigne 2
* for N subjects (counting number, of course)
* Distributions are from Normal PDF

Centering of matrix
```{r}
Ksi <- apply(Ksi,2,scale)
head(Ksi)
```

Mean = 0
```{r}
round(colMeans(Ksi))
```

Scale up by 5 and 2 for each (just for fun...)
```{r}
Ksi <- Ksi %*% diag(c(5,2))
head(Ksi,n=10)
```

Creating $Y_{true}$

$$
Y_{true}=E_{Njk}+\mu_{Nk}+\epsilon_{Nj}
$$
```{r}
yTrue <- Ksi %*% t(matrix(c(eigFunct1(s),eigFunct2(s)),ncol=2)) + t(matrix(rep(meanFunct(s),N),nrow=M))
```

* Effects from eigen vectors
* Effects from a mean latent function
* Effects from random residuals

The shape is 100(=N, subjects) by 200(=M, time length data, i.e., measurements)

```{r}
dim(yTrue)
```

## Running FPCA on a dense dataset

```{r}
L3 <- MakeFPCAInputs(
  IDs = rep(1:N, each=M), # like, 1 1 1 2 2 2 3 3 3 for rep(1:3, each=3)
  tVec = rep(s,N), # like 1 2 1 2 1 2 for rep(1:2,3)
  t(yTrue) #row = time, column = subject
)
```

Let's do it!
```{r}
FPCAdense <- FPCA(L3$Ly,L3$Lt)
```

### Results

```{r}
plot(FPCAdense)
```
As we expected,
* mean- and eigen functiosn are separaeted and all predicted as they are
* In the scree plot, two principal components explain almost everything

Let's extract those.

```{r}
meanOfResult <- FPCAdense$mu
phi1 <- FPCAdense$phi[,1]
phi2 <- FPCAdense$phi[,2]
```

```{r}
library(scales) #for alpha()
par(mfrow=c(2,2)) # 2 by 2 plot areas
# area 1,1 (for all subject data)
plot(1,type='n',xlab='time',ylab='value',xlim=c(min(s),max(s)),ylim=c(min(yTrue),max(yTrue)))
for(i in 1:100) {
  lines(x=s,y=yTrue[i,],col=alpha('gray',0.2))
}
plot(s,meanOfResult,type='l',col='red')
plot(s,phi1,type='l',col='blue')
plot(s,phi2,type='l',col='green')

par(mfrow=c(1,1)) #back to normal
```

This package works like magic!

## Running FPCA on a sparse and noisy dataset

Now, let's see how the PACE algorithm works.

```{r}
ySparse <- Sparsify(yTrue,s,sparsity = c(1:5)) #generating sparse data (measurements from 1 to 5 points)
```

Now noise.

```{r}
ySparse$yNoisy <- lapply(ySparse$Ly,function(x) x+0.5*rnorm(length(x)))
```

Run
```{r}
FPCAsparse <- FPCA(ySparse$yNoisy, ySparse$Lt)
```

Result
```{r}
plot(FPCAsparse)
```

Now let's consider cumulative functional variance to determine a proper number of principal factors.

```{r}
FPCAsparse$cumFVE
```

For your information, the sparse data look like this:

```{r}
par(mfrow=c(1,2))
#for dense data
par(mfrow=c(2,2)) # 2 by 2 plot areas
# area 1,1 (for all subject data)
plot(1,type='n',xlab='time',ylab='value',xlim=c(min(s),max(s)),ylim=c(min(yTrue),max(yTrue)),
     main="Dense Data")
for(i in 1:100) {
  lines(x=s,y=yTrue[i,],col=alpha('green',0.2))
}
#for sparse data
plot(1,type='n',xlab='time',ylab='value',
     xlim=c(min(unlist(ySparse$Lt)),max(unlist(ySparse$Lt))),
     ylim=c(min(unlist(ySparse$Ly)),max(unlist(ySparse$Ly))),
     main="Sparse Data")
for(i in 1:100) {
  x=ySparse$Lt[[i]]
  y=ySparse$Ly[[i]]
  lines(x,y,col=alpha('red',0.2))
}
par(mfrow=c(1,1))
```

Estimate path plots: let's create the fitted sample path plot based on the results from the parse data.

```{r}
CreatePathPlot(FPCAsparse)
```

Prediction grid:

```{r}
SelectK(FPCAsparse,criterion='AIC')
# K=SelectK(FPCAsparse,criterion='AIC')$K
predSparse <- predict(FPCAsparse,ySparse$Ly,ySparse$Lt,K=2) #number of PCA = 2
```

Number of samples in ySparse:
```{r}
length(ySparse$Lt)
```

Number of predictions:
```{r}
nrow(predSparse$scores)
```

```{r}
head(predSparse$scores)
```

# Real-world Example
## Introduction
* medfly25 dataset: this is a dataset containing the eggs laid from 789 medflies during the first 25 days of their lives. (see also, Carey et al. 1998))

> Carey, J. R., Liedo, P., Müller, H. G., Wang, J. L., & Chiou, J. M. (1998). Relationship of age patterns of fecundity to mortality, longevity, and lifetime reproduction in a large cohort of Mediterranean fruit fly females. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences, 53(4), B245-B251.

* The data are rather noisy, dense and with a characteristic flat start.

## Data
```{r}
data("medfly25")
```

```{r}
head(medfly25)
```


```{r}
N <- nrow(medfly25)
library(scales)
library(ggplot2)
library(dplyr)
library(gridExtra)

g1 <- medfly25 %>% 
  ggplot(aes(x=Days,y=nEggs,group=ID)) +
  geom_line(alpha=0.2,color='gray') +
  geom_smooth(aes(group=1),se=FALSE)+
  theme_bw() +
  labs(x="Days",y="Number of Eggs") +
  ggtitle("Medflies - Time Series Plot")

g2 <-medfly25 %>% 
  ggplot(aes(x=Days,y=nEggs)) +
  geom_density_2d_filled() +
  scale_fill_brewer()+
  theme_bw() +
  labs(x="Days",y="Number of Eggs") +
  ggtitle("Medflies - Density Plot") +
  theme(legend.position='none')

grid.arrange(g1,g2,ncol=2)

```

## FPCA

```{r}
Flies <- MakeFPCAInputs(medfly25$ID,medfly25$Days,medfly25$nEggs)
fpcaObjFlies <- FPCA(Flies$Ly,Flies$Lt,
                     list(plot=TRUE,
                          methodMuCovEst='smooth',
                          userBwCov=2))
```

Select number of K (see, Yao et al., 2005)

> Yao, F., Müller, H. G., & Wang, J. L. (2005). Functional data analysis for sparse longitudinal data. Journal of the American statistical association, 100(470), 577-590.

```{r}
numK <- SelectK(fpcaObjFlies,criterion = "FVE")
numK
K <- numK$K
```


```{r}
CreatePathPlot(fpcaObjFlies)
```

```{r}
medfly25[medfly25$ID==unique(medfly25$ID)[c(3,5,135)],] %>% 
  ggplot(aes(x=Days,y=nEggs,color=as.factor(ID))) +
  geom_line()
```


```{r}
par(mfrow=c(2,2))
  CreatePathPlot(fpcaObjFlies,subset=c(3,5,135))
  CreatePathPlot(fpcaObjFlies,subset=c(3,5,135),K=1,main="K=1",pch=4)
  CreatePathPlot(fpcaObjFlies,subset=c(3,5,135),K=3,main="K=3",pch=4)
  CreatePathPlot(fpcaObjFlies,subset=c(3,5,135),K=4,main="K=4",pch=4)
par(mfrow=c(1,1))

```

The following sections contain descriptions about optional functions in `fdapace`.

* Outlier detection (Febrero et al., 2007)

> Febrero, M., Galeano, P., & González-Manteiga, W. (2007). A functional analysis of NOx levels: location and scale estimation and outlier detection. Computational Statistics, 22(3), 411-427.

* Functional box-plot (Hyndman and Shang, 2010)

> Hyndman, R. J., & Shang, H. L. (2010). Rainbow plots, bagplots, and boxplots for functional data. Journal of Computational and Graphical Statistics, 19(1), 29-45.

* KDE: Kernel density estimation: it is a non-parametric method to estimate the PDF of a random variable based on kernels as weights.


```{r}
CreateOutliersPlot(fpcaObjFlies,optns=list(K=3,variant='KDE'))

```

* The green line corresponds to `the functional median`, the dark gray area to the area spanned by the curves within the 25th and 75th percentile and the light gray to the area spanned by the curves within the 2.5th and 97.5th percentile.

```{r}
CreateFuncBoxPlot(fpcaObjFlies,xlab="Days",ylab="Number of eggs",
                  optns=list(K=3,variant="bagplot"))
```

