This vignette describes the usage of fda.usc in R. fda.usc package carries out exploratory and descriptive analysis of functional data exploring its most important features such as:
The co-author of fda.usc is Manuel Febrero-Bande, the contributors are Pedro Galeano, Alicia Nieto and Eduardo Garcia-Portugues.
R task view devoted to FDA: https://cran.r-project.org/web/views/FunctionalData.html
is included in the set of the 7 recommended core packages.
Dependencies:
fda.usc package download by month:
fda It is a basic reference to work in R with functional data, see (Ramsay and Silverman 2005), (http://ego.psych.mcgill.ca/misc/fda/)
Ferray and Vieu, (2006) processed FD from a nonparametric point of view (normed or semi–normed functional spaces). These authors are part of the French group STAPH maintaining the page {http://www.lsp.ups-tlse.fr/staph/}}
Core packages:
4.Interactive tools: + StatFda: exploratory analysis and functional regression models,http://www.statfda.com forand refund.shiny package for interactive plotting. + refund.shiny: Interactive Plotting for Functional Data Analyses + tidyfun: makes data wrangling and exploratory analysis of functional data easier, https://fabian-s.github.io/tidyfun
Like many other R packages, the simplest way to obtain fda.usc is to install it directly from CRAN. Type the following command in R console:
install.packages("fda.usc", repos = "http://cran.us.r-project.org")
Users may change the repos options depending on their locations and preferences. Other options such as the directories where to install the packages can be altered in the command. For more details, see help(install.packages).
Here the R package has been downloaded and installed to the default directories.
Alternatively, users can download the package source at http://cran.r-project.org/web/packages/fda.usc/index.html and type Unix commands to install it to the desired location.
The purpose of this section is to give users a general sense of the package, including the components, what they do and some basic usage. We will briefly go over the main functions, see the basic operations and have a look at the outputs. Users may have a better idea after this section what functions are available, which one to choose, or at least where to seek help. More details are given in later sections.
First, we load the fda.usc package:
library(fda.usc)
Functional data analysis is a branch of statistics that analyzes data providing information about curves, surfaces or anything else varying over a continuum. The continuum is often time, but may also be spatial location, wavelength, probability, etc.
Functional data analysis is a branch of statistics concerned with analysing data in the form of functions.
Definition 1: A random variable \(\mathcal{X}\) is called a functional variable if it takes values in a functional space \(\mathcal{E}\) –complete normed (or seminormed) space–, (Ferraty and Vieu 2006)
Definition 2: A functional dataset \(\{\mathcal{X}_1,\ldots,\mathcal{X}_n\}\) is the observation of *n} functional variables \(\mathcal{X}_1,\ldots,\mathcal{X}_n\) identically distributed as \(\mathcal{X}\), (Ferraty and Vieu 2006).
\(X_i(t)\) represents the mean temperature (averaged over 1980-2009 years) at the ith weather station in Spain, and at time time t during the year. In R:
But what code has been used to obtain the graph?
data(aemet)
par(mfrow=c(1,2))
CanaryIslands= ifelse(aemet$df$latitude<31,2,4)
plot(aemet$df[,c("longitude","latitude")],col=CanaryIslands,lwd=2)
plot(aemet$temp,col=CanaryIslands,lwd=2)
Definition of fdata class object: An object called fdata as a list of the following components:
lapply(aemet,class)
names(aemet$df)
lapply(aemet$temp,class)
temp <- aemet$temp
names(temp)
dim(temp)
length(argvals(temp))
rangeval(temp)
temp$names
is.fdata(temp)
is.fdata(aemet$temp)
is.fdata(aemet$df)
dim(aemet$df)
dim(aemet$temp)
temp.fd=fdata2fd(temp,type.basis="fourier",nbasis=15)
temp.fdata=fdata(temp.fd) #back to fdata
class(temp.fd)
## [1] "fd"
class(temp.fdata)
## [1] "fdata"
# Canary Islands in red vs Iberian Peninsula in blue
l1<-split.fdata(temp,CanaryIslands)
par(mfrow=c(1,2))
plot(l1[[1]],col=2,ylim=c(0,30))
plot(l1[[2]],col=4,ylim=c(0,30))
dim(l1[[1]]);dim(l1[[2]])
## [1] 9 365
## [1] 64 365
# internal function plot.lfdata
plot.lfdata<-fda.usc:::plot.lfdata
plot.lfdata(l1)
l1[[1]]==l1[[2]]
## [1] FALSE
temp2<-order.fdata(order(CanaryIslands),temp)
rproc2fdata() function generates Functional data from: Ornstein Uhlenbeck process, Brownian process, Gaussian process or Exponential variogram process.
par(mfrow=c(1,2))
lent<-30
tt<-seq(0,1,len=lent)
mu<-fdata(sin(2*pi*tt),tt)
plot(rproc2fdata(200,t=tt,sigma="OU",par.list=list("scale"=1)))
plot(rproc2fdata(200,mu=mu,sigma="OU",par.list=list("scale"=1)))
Other functons: gridfdata() generates n curves as lineal combination of the original curves fdataobj plus a functional trend mu.
rcombfdata() generates n random linear combinations of the fdataobj curves plus a functional trend mu. The coefficients of the combinations follows a normal distribution with zero mean and standard deviation sdarg.
If we supposed that our functional data \(Y(t)\) is observed through the model \(Y(t_i)=X(t_i)+\varepsilon(t_i)\) where the residuals \(\varepsilon(t)\) are independent with \(X(t)\).
We can to get back the original signal \(X(t)\) using a linear smoother: \[\hat{X}(t_i)=\sum_{i=1}^{n} s_{i}(t_j)Y(t_i) \Rightarrow \mathbf{\hat{X}}=\mathbf{S}\mathbf{Y} \] where \(s_{i}(t_j)\) is the weight that the point \(t_j\) gives to the point \(t_i\).
We use two methods to estimate the smoothing matrix \(S\).
Let \(X(t)\in \mathcal{L}_2\), \[ X(t)= \sum_{k\in\mathbb{N}}{c_k\phi_k(t)}\approx\sum_{k=1}^K c_k\phi_k(t)=c^{\top}\Phi\]
The smoothing matrix is given by: \(\mathbf{S}=\Phi(\Phi^{\top}W\Phi+\lambda R)^{-1}\Phi^{\top}W\) where \(\lambda\) is the penalty parameter.
Type of basis:
Fourier: Design to represent periodic functions. Orthonormal
BSplines: Set of polynomials (of order m) dened in subintervals constructed in such a way that in the border of the subintervals the polynomials coincide (up to \(m - 2\) derivative). Banded
Wavelets: Powerful especially when the grid is a power of 2. Orthonormal
Polynomials: Adjust a polynomial to the whole curve.
bsp11<-create.bspline.basis(temp$rangeval,nbasis=11)
bsp111<-create.bspline.basis(temp$rangeval,nbasis=111)
S.bsp11 <- S.basis(temp$argvals, bsp11)
S.bsp111 <- S.basis(temp$argvals, bsp111)
temp.bsp11<-temp.bsp111<-temp
temp.bsp11$data <- temp$data%*%S.bsp11
temp.bsp111$data <- temp$data%*%S.bsp111
par(mfrow=c(1,3))
plot(temp)
plot(temp.bsp111,main="111 Bspline basis elements")
plot(temp.bsp11,main="11 Bspline basis elements")
# Another way
out1<-optim.basis(temp,type.CV=CV.S)
names(out1)
## [1] "gcv" "numbasis" "lambda" "fdataobj"
## [5] "fdata.est" "gcv.opt" "numbasis.opt" "lambda.opt"
## [9] "S.opt" "base.opt"
out1$numbasis.opt
## [1] 76
The problem is to estimate the smoothing parameter or bandwidth \(\nu=h\) that better represents the functional data using kernel smoothing. Now,the nonparametric smoothing of functional data is given by the smoothing matrix \(S\): \[s_{ij}=\frac{1}{h}K\left(\frac{t_i-t_j}{h}\right)\]
\[S(h)=(s_j(t_i))=\frac{K\left(\frac{t_i-t_j}{h}\right)}{\sum_{k=1}^{\top}K\left(\frac{t_k-t_j}{h}\right)}\] where \(h\) is the bandwidth and \(K()\) the Kernel function.
Different types of kernels \(K()\) are defined in the package, see Kernel function.
S<-S.NW(temp$argvals,h=1)
temp<-aemet$temp
temp.hat<-temp
temp.hat$data<-temp$data%*%S
args(optim.np)
## function (fdataobj, h = NULL, W = NULL, Ker = Ker.norm, type.CV = GCV.S,
## type.S = S.NW, par.CV = list(trim = 0, draw = FALSE), par.S = list(),
## correl = TRUE, verbose = FALSE, ...)
## NULL
# Another way
temp.est<-optim.np(temp,h=1)$fdata.est
par(mfrow=c(1,3))
plot(temp.est,main="Kernel smooth - h=1")
temp.est==temp.hat
## [1] FALSE
plot(temp.hat,main="Kernel smooth - h=10")
plot(optim.np(temp)$fdata.est,main="Kernel smooth- h-GCV")
The functional data can be rewritten as a decomposition in an orthonormal PC basis: maximizing the variance of \(X(t)\):
\[ \hat{X}_{i}(t)=\sum_{i=1}^{K}f_{ik}\xi_{k}(t) \]
where \(f_{i,k}\) is the score of the principal component PC, \(\xi_k\).
y<-rowSums(aemet$logprec$data)
par(mfrow=c(1,3))
plot(aemet$temp)
plot((pc<-create.pc.basis(aemet$temp,l=1:2))[[1]],main="Temperature, 2 FPC")
pc2<-fdata2pc(aemet$temp)
plot(pc2$rotation,main="2 FPC basis",xlab="day")
summary(pc2)
##
## - SUMMARY: fdata2pc object -
##
## -With 2 components are explained 98.78 %
## of the variability of explicative variables.
##
## -Variability for each component (%):
## PC1 PC2
## 85.57 13.21
args(create.pls.basis)
args(fdata2pls)
Example: Exploratory analysis for spectrometric curves
Tecator dataset: 215 spectrometric curves of meat samples also with Fat, Water and Protein contents obtained by analytic procedures.
Tecator Goal: Explain the fat content through spectrometric curves.
## [1] "absorp.fdata" "y"
## [1] "Fat" "Water" "Protein"
As shown in tecator Figure, the plot of \(X(t)\) against \(t\) is not necessarily the most informative and other semi-metric can allow to extract much information from functional variables.
The information about fat seems to be contained in the shape of the curves, so, the semimetric of derivatives could be preferred to \(\mathcal{L}_2\). It is difficult to find the best plot given a particular functional dataset because the shape of graphics depends strongly on the proximity measure.
To compute derivatives there are several options (see fdata.deriv function):
Utilities for computing distances, norm and inner products are included in the package. Below are described those but of course, many others can be built with the only restriction that the first two arguments correspond to class –fdata–. In addition, the procedures of the package fda.usc that contains the argument –metric– allow the use of metric or semi-metric functions as shown in the following sections.
fda.usc package collects several metric and semi-metric functions which allow to extract as much information possible from the functional variable.
Option 1: If the data is in a Hilbert space, represent your data in a basis, and compute all the things accordingly. From a practical point of way, the representation of a functional datum must be approximated by a nite number of terms.
Option 2: Approximates all quantities using numerical approximations (Valid for Hilbert and non-Hilbert spaces). This usually involves numerical approximations of integrals, derivatives and so on. A collection of semi-metrics proposed by (Ferraty and Vieu 2006) are also included in the package.
Example of computing norms
Consider \(X(t) = t^2; t\in[0, 1]\). In this case \(||X_1|| =\sqrt{1/5}=0.4472136\) and \(||X_2|| =1/3\)
t = seq(0, 1, len = 101)
x2 = t^2
x2 = fdata(x2, t)
(c(norm.fdata(x2), norm.fdata(x2, lp = 1))) # L2, L1 norm
## [1] 0.4472509 0.3333500
f1 = fdata(rep(1, length(t)), t)
f2 = fdata(sin(2 * pi * t), t) * sqrt(2)
f3 = fdata(cos(2 * pi * t), t) * sqrt(2)
f4 = fdata(sin(4 * pi * t), t) * sqrt(2)
f5 = fdata(cos(4 * pi * t), t) * sqrt(2)
fou5 = c(f1, f2, f3, f4, f5) # Fourier basis is Orthonormal
round(inprod.fdata(fou5), 5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
Example of computing distances
set.seed(1:4)
p<-1001
r<-rnorm(p,sd=.1)
x<-seq(0,2*pi,length=p)
fx<-fdata((sin(x)+r)/sqrt(pi),x)
fx0<-fdata(r,x)
plot(c(fx,fx0))
metric.lp(fx,fx0)[1,1]
## [1] 1.003144
semimetric.basis(fx,fx0,nbasis1=5,nbasis2=5)[1,1]
## [1] 0.9927386
semimetric.basis(fx,fx0,nbasis1=11,nbasis2=111)[1,1]
## [1] 0.9992571
semimetric.fourier(fx,fx0,nbasis=5)[1,1]
## [1] 0.9972634
semimetric.fourier(fx,fx0,nbasis=11)[1,1]
## [1] 0.9992104
semimetric.deriv(fx,fx0,nderiv=0,nknot=5)[1,1]
## [1] 0.9934345
semimetric.deriv(fx,fx0,nderiv=0,nknot=11)[1,1]
## [1] 0.9984524
integrate(function(x){(sin(x)/sqrt(pi))^2},0,2*pi)
## 1 with absolute error < 7.3e-10
Example of semi-metric as classification rule
data(tecator)
set.seed(4:1)
#ind<-sample(215,30)
#xx<-tecator$absorp[ind]
yy<-ifelse(tecator$y$Fat<20,0,1)
ind<-c(sample(which(yy==0),10),sample(which(yy==1),10))
xx<-fdata.deriv(tecator$absorp[ind])
yy<-yy[ind]
par(mfrow=c(1,1))
d1<-as.dist(metric.lp(xx),T,T)
c1<-hclust(d1)
c1$labels=yy
plot(c1,main="Raw data -- metric.lp",xlab="Class of eah leaf")
Try with the derivative of the curves:
Extension: Combining the information of two components using a \(p\)-dimensional metric such as, for example, the Euclidean:
\[m\left(\left(x_0^1,\ldots,x_0^p\right),\left(x_i^1,\ldots,x_i^p\right)\right):=\sqrt{m_1\left(x_0^1,x_i^1\right)^2+\cdots+m_p\left(x_0^p,x_i^p\right)^2}\] where \(m_{i}\) denotes the metric in the \(i\)-component of the product space, see fda.usc:::metric.ldata code.
It is important here to ensure that the different metrics of the spaces have similar scales to avoid one single component dominating the overall distance.
range(d1)
## [1] 0.005531143 0.091797359
range(d2)
## [1] 0.0003143458 0.0105466510
NOTE: Dependent data in tecator dataset?
ar(tecator$y$Fat)
##
## Call:
## ar(x = tecator$y$Fat)
##
## Coefficients:
## 1 2
## 0.6552 0.1161
##
## Order selected 2 sigma^2 estimated as 72.82
plot(tecator$y,col=1:3)
par(mfrow=c(1,1))
ts.plot(tecator$y$Fat)
#barplot(t(tecator$y))
The correlation Distances characterizes independence between vectors of arbitrary finite dimensions. Recently, in (Székely and Rizzo 2013), a bias-corrected version is considered and a test of independence developed.
data(tecator)
xx<-tecator$absorp
xx.d2<-fdata.deriv(xx,nderiv=2)
dcor.xy(xx,xx.d2)
##
## dcor t-test of independence
##
## data: D1 and D2
## T = 35.933, df = 22789, p-value < 2.2e-16
## sample estimates:
## Bias corrected dcor
## 0.2315581
d1<-metric.lp(xx)
d2<-metric.lp(xx.d2)
bcdcor.dist(d1,d2)
## [1] 0.2315581
Fat <-tecator$y[,"Fat",drop=FALSE]
d3<-metric.dist(Fat)
bcdcor.dist(d1,d3)
## [1] 0.1963254
bcdcor.dist(d2,d3)
## [1] 0.9147076
# Functional / Factor
Fat.cut<-as.matrix(cut(Fat[,1],labels=1:4,breaks=4))
dcor.xy(Fat.cut,xx.d2)
##
## dcor t-test of independence
##
## data: D1 and D2
## T = 252.57, df = 22789, p-value < 2.2e-16
## sample estimates:
## Bias corrected dcor
## 0.8583661
Depth (in univariate or multivariate context) is a statistical tool that provides a center-outward ordering of data points. This ordering can be employed to dene location measures (and by oposition outliers). Also, some measures can be dened as maximizers of a particular depth.
y12<-tecator$y[,1:2]
dep<-mdepth.LD(x=y12,scale=T)$dep
plot(y12,col=grey(1-dep))
points(y12[which.max(dep),],col=4,lwd=3)
Many of the preceding depth functions dened in a multivariate contex cannot be applied to functional data. In any case, given a depth for functional data.
The deepest point is a location measure and based on the depth function can have a dierent interpretation: Mean, Median, Mode …
Computing the depths of the whole sample and ordering them (in decreasing order) gives a rank from the most central point (\(x_{[1]}\)) to the most outlying one (\(x_{[n]}\)). So, those points with larger rank are the candidates to be outliers w.r.t. the data cloud (in the sense of the depth function).
Summarizing the \((1-\alpha)\) deepest points could lead also to a robust location measure.
Fraiman-Muniz Depth (Fraiman and Muniz 2001) depth.FM
Modal Depth (Cuevas, Febrero, and Fraiman 2007) depth.mode
Random Projection Depth (Cuevas, Febrero, and Fraiman 2007) depth.RP
Random Tukey Depth (Cuesta-Albertos and Nieto-Reyes 2008) depth.RT
Band Depth (López-Pintado and Romo 2009) *depth.MBD** (private function)
x<-tecator$absorp
depth.mode(x,draw=TRUE)
Modify the procedure to incorporate the extended information.
x.d1<-fdata.deriv(x)
depth.modep(list(x=x,x.d1=x.d1),draw=T)
Example poblenou dataset
Hourly levels of nitrogen oxides in Poblenou (Barcelona). This dataset has 127 daily records (2005/01/06-2005/06/26).
Objective: Explain the diferences in NOx levels as a function of day.
data(poblenou)
dayw = ifelse(poblenou$df$day.week == 7 | poblenou$df$day.festive ==1, 3, ifelse(poblenou$df$day.week == 6, 2, 1))
plot(poblenou$nox,col=dayw)
The \(\mathcal{L}_2\) space (distance is area between curves) seems appropriate although other possibilities could be take into account (\(\mathcal{L}_1\), for example).
fmd = depth.FM(poblenou$nox,draw=T)
md = depth.mode(poblenou$nox)
rpd = depth.RP(poblenou$nox, nproj = 50)
rtd = depth.RT(poblenou$nox)
print(cur <- c(fmd$lmed, md$lmed, rpd$lmed, rtd$lmed))
## 2005-05-06 2005-05-06 2005-05-06 2005-03-04
## 63 63 63 10
plot(poblenou$nox,col="grey")
lines(poblenou$nox[cur], lwd = 2, lty = 1:4, col = 1:4)
legend("topleft", c("FMD", "MD", "RPD", "RTD"), lwd = 2, lty = 1:4,col = 1:4)
There is no general accepted denition of outliers in Functional data so, we dene outlier as a datum generated from a dierent process than the rest of the sample with the following characteristics Its number in the sample is unknown but probably low.
An outlier will have low depth and it will be an outlier in the sense of the depth used.
md1 = depth.mode(lab <- poblenou$nox[dayw == 1]) #Labour days
md2 = depth.mode(sat <- poblenou$nox[dayw == 2]) #Saturdays
md3 = depth.mode(sun <- poblenou$nox[dayw == 3]) #Sundays/Festive
rbind(poblenou$df[dayw == 1, ][which.min(md1$dep), ], poblenou$df[dayw ==
2, ][which.min(md2$dep), ], poblenou$df[dayw == 3, ][which.min(md3$dep),])
## date day.week day.festive
## 22 2005-03-18 5 0
## 57 2005-04-30 6 0
## 58 2005-05-01 7 0
plot(poblenou$nox,col="grey")
lines(poblenou$nox[c(22,57,58),], lwd = 2, lty = 2:4, col = 2:4)
# Method for detecting outliers, see Febrero-Bande, 2008
#out1<-outliers.depth.trim(poblenou$nox[dayw == 1],nb=100)$outliers
Two procedures for detecting outliers are implemented in the package. (Febrero-Bande, Galeano, and González-Manteiga 2008)
Detecting outliers based on trimming: outliers.depth.trim()
Detecting outliers based on weighting: outliers.depth.pond()
# Method for detecting outliers, see Febrero-Bande, 2008
#out1<-outliers.depth.trim(poblenou$nox[dayw == 1],nb=100)$outliers
#out2<-outliers.depth.pond(poblenou$nox[dayw == 1],nb=100)$outliers
RPubs documents
fda.usc vignette: Installation and Descriptive Statistics
fda.usc vignette:Functional Regression
fda.usc vignette: Functional Classification and ANOVA
fda.usc vignette: Variable Selection in Functional Regression (and Classification)
fda.usc reference card Rcard
See reference cad (PDF document) at: https://cran.r-project.org/web/packages/fda.usc/fda.usc.pdf
Links:
[fda.usc R Manual] Reference help document (221 pages) at:(https://cran.r-project.org/web/packages/fda.usc/fda.usc.pdf)
Cardot, Hervé, Frédéric Ferraty, and Pascal Sarda. 1999. “Functional Linear Model.” Statist. Probab. Lett. 45 (1): 11–22.
Cuesta-Albertos, Juan, and Alicia Nieto-Reyes. 2008. “The Random Tukey Depth.” Computational Statistics and Data Analysis 52 (11): 4979–88.
Cuevas, Antonio, Manuel Febrero, and Ricardo Fraiman. 2007. “Robust Estimation and Classification for Functional Data via Projection-Based Depth Notions.” Comput. Statist. 22 (3): 481–96. http://link.springer.com/article/10.1007/s00180-007-0053-0.
Febrero-Bande, Manuel, Pedro Galeano, and Wenceslao González-Manteiga. 2008. “Outlier Detection in Functional Data by Depth Measures, with Application to Identify Abnormal \({\rm NO}_x\) Levels.” Environmetrics 19 (4): 331–45.
Ferraty, Frédéric, and Philippe Vieu. 2006. Nonparametric Functional Data Analysis. Springer Series in Statistics. New York: Springer-Verlag.
Fraiman, Ricardo, and Graciela Muniz. 2001. “Trimmed Means for Functional Data.” Test 10 (2): 419–40.
Krämer, Nicole, and Masashi Sugiyama. 2011. “The Degrees of Freedom of Partial Least Squares Regression.” Journal of the American Statistical Association 106 (494): 697–705.
López-Pintado, Satan, and Juan Romo. 2009. “On the Concept of Depth for Functional Data.” Journal of the American Statistical Association 104: 718–34.
Ramsay, J. O., and B. W. Silverman. 2005. Functional Data Analysis. Springer.
Székely, Gábor J, and Maria L Rizzo. 2013. “The Distance Correlation \(t\)-Test of Independence in High Dimension.” J. Multivariate Anal. 117: 193–213.
Székely, Gábor J, Maria L Rizzo, and Nail K Bakirov. 2007. “Measuring and Testing Dependence by Correlation of Distances.” Ann. Statist. 35 (6): 2769–94. http://projecteuclid.org/euclid.aos/1201012979.