本数据是典型的函数数据,函数数据的聚类分析有若干中算法,本文将尝试用函数数据聚类分析包去对fitbit band data 进行聚类。
\(\\\)
url1 <- "/home/ghy/Dropbox/User1/cleaned_data/detailed-data-fitbit-aug"
url3 <- "_16-HR.csv"
day <- list()
for (i in 1:31) {
url2 <- paste0(i,"_",i)
url <- paste0(url1, url2, url3)
day[[i]] = read.csv(url)
}
save(day, file = "/home/ghy/Dropbox/User1/cleaned_data/day.Rdata")
save(day, file = "/home/ghy/Dropbox/User1/gonghy_R_data_analysis/heartrate_app/day.Rdata")
我们画图看看每天心率变化。
var.rename <- function(x){
names(x) <- c("time", "rate")
x
}
# 画图函数
plot.data <- function(test){
names(test) <- c("time", "rate")
test.time <-strptime(test$time, format = "%Y-%m-%d %H:%M:%S")
qplot(test.time, test$rate,xlab="时间",ylab="心率",main="心率随时间变化图", geom = "line")
}
# Aug 17 的心率图
plot.data(day[[17]])
更进一步,我们研究的是八月份的数据,我们希望有个 App 可以查看每天的心率图。
# 如果电脑安装了Rstudio直接运行如下代码
library(shiny)
shinyAppDir("/home/ghy/Dropbox/User1/gonghy_R_data_analysis/heartrate_app/")
此app的运行需要Rstudio。如果没有此环境,可以访问网址 https://heyanggong.shinyapps.io/heartrate_app/
分析的基本步骤:
这里我们用函数数据分析的方式来处理数据, 下面把数据转换为函数数据类型。
我们得到了函数对象的list rate.fd, 然后用R package 将它进行聚类。
方法一,用 Funclustering 包,代码繁杂,可以打开源文件查看源代码。
方法二,用fda.usc包,K-means方法进行聚类
set.seed(100)
library(fda.usc)
res <- kmeans.fd(fd, ncl = 2)
res$cluster
## [1] 1 2 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1
res <- kmeans.fd(fd, ncl = 3)
res$cluster
## [1] 1 3 1 2 2 2 1 1 3 1 3 1 3 1 1 1 1 1 3 2 3 2 1 2 1 1 2
第二种方法简单方便。聚类分析的基本方法至此完成。
我们可以根据心率图来预测是周几?
minutebasis <- create.bspline.basis(rangeval = c(0, 1440), norder = 4, nbasis = 20)
wed <- list()
wed[[1]] <- day[[3]]
wed[[2]] <- day[[10]]
wed[[3]] <- day[[17]]
wed[[4]] <- day[[24]]
sat <- list()
sat[[1]] <- day[[5]]
sat[[2]] <- day[[12]]
sat[[3]] <- day[[19]]
sat[[4]] <- day[[26]]
test <- list()
coef <- c()
for (i in 1:4){
test[[i]] <- with(wed[[i]] %>% var.rename, smooth.basis(time.to.minute(time), rate, minutebasis)$fd)
coef <- cbind(coef, coef(test[[i]]))
}
for (i in 5:8){
test[[i]] <- with(sat[[i-4]] %>% var.rename, smooth.basis(time.to.minute(time), rate, minutebasis)$fd)
coef <- cbind(coef, coef(test[[i]]))
}
fd <- fd(coef, basisobj = minutebasis)
set.seed(10)
res <- kmeans.fd(fd, ncl = 2)
res$cluster
## [1] 1 2 2 2 2 1 1 2
minutebasis <- create.fourier.basis(rangeval = c(0, 1440), nbasis = 10)
wed <- list()
wed[[1]] <- day[[3]]
wed[[2]] <- day[[10]]
wed[[3]] <- day[[17]]
wed[[4]] <- day[[24]]
sat <- list()
sat[[1]] <- day[[5]]
sat[[2]] <- day[[12]]
sat[[3]] <- day[[19]]
sat[[4]] <- day[[26]]
test <- list()
coef <- c()
for (i in 1:4){
test[[i]] <- with(wed[[i]] %>% var.rename, smooth.basis(time.to.minute(time), rate, minutebasis)$fd)
coef <- cbind(coef, coef(test[[i]]))
}
for (i in 5:8){
test[[i]] <- with(sat[[i-4]] %>% var.rename, smooth.basis(time.to.minute(time), rate, minutebasis)$fd)
coef <- cbind(coef, coef(test[[i]]))
}
fd <- fd(coef, basisobj = minutebasis)
set.seed(100)
res <- kmeans.fd(fd, ncl = 2)
res$cluster
par(mfrow = c(1,1))
plot(fd)
此方法下预测分类的正确率为 5/8。 正确比较低,可能原因有
可以尝试如下途径改进结果
我再试试其他的。
minutebasis <- create.fourier.basis(rangeval = c(0, 1440), nbasis = 10)
wed <- list()
wed[[1]] <- day[[1]]
wed[[2]] <- day[[8]]
wed[[3]] <- day[[15]]
wed[[4]] <- day[[22]]
sat <- list()
sat[[1]] <- day[[5]]
sat[[2]] <- day[[12]]
sat[[3]] <- day[[19]]
sat[[4]] <- day[[26]]
test <- list()
coef <- c()
for (i in 1:4){
test[[i]] <- with(wed[[i]] %>% var.rename, smooth.basis(time.to.minute(time), rate, minutebasis)$fd)
coef <- cbind(coef, coef(test[[i]]))
}
for (i in 5:8){
test[[i]] <- with(sat[[i-4]] %>% var.rename, smooth.basis(time.to.minute(time), rate, minutebasis)$fd)
coef <- cbind(coef, coef(test[[i]]))
}
fd <- fd(coef, basisobj = minutebasis);plot(fd)
set.seed(10000)
res <- kmeans.fd(fd, ncl = 2)
res$cluster
par(mfrow = c(1,1))
plot(fd)