Get full data set here: [UCI’s Center for Machine Learning and Intelligent Systems] (http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones)
The study creating this database involved 30 volunteers "performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors. … Each person performed six activities … wearing a smartphone (Samsung Galaxy S II) on the waist. … The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.
load("C:/Users/angul/OneDrive/R/ExploreData/Data/samsungData.rda")
names(samsungData)[1:12]
## [1] "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y" "tBodyAcc-mean()-Z"
## [4] "tBodyAcc-std()-X" "tBodyAcc-std()-Y" "tBodyAcc-std()-Z"
## [7] "tBodyAcc-mad()-X" "tBodyAcc-mad()-Y" "tBodyAcc-mad()-Z"
## [10] "tBodyAcc-max()-X" "tBodyAcc-max()-Y" "tBodyAcc-max()-Z"
table(samsungData$activity)
##
## laying sitting standing walk walkdown walkup
## 1407 1286 1374 1226 986 1073
myplclust <- function( hclust, lab=hclust$labels, lab.col=rep(1,length(hclust$labels)), hang=0.1,...){
## modifiction of plclust for plotting hclust objects *in colour*!
## Copyright Eva KF Chan 2009
## Arguments:
## hclust: hclust object
## lab: a character vector of labels of the leaves of the tree
## lab.col: colour for the labels; NA=default device foreground colour
## hang: as in hclust & plclust
## Side effect:
## A display of hierarchical cluster with coloured leaf labels.
y <- rep(hclust$height,2)
x <- as.numeric(hclust$merge)
y <- y[which(x<0)]
x <- x[which(x<0)]
x <- abs(x)
y <- y[order(x)]
x <- x[order(x)]
plot( hclust, labels=FALSE, hang=hang, ... )
text( x=x, y=y[hclust$order]-(max(hclust$height)*hang), labels=lab[hclust$order], col=lab.col[hclust$order], srt=90, adj=c(1,0.5), xpd=NA, ... )}
par(mfrow=c(1,2))
numericActivity <- as.numeric(as.factor(samsungData$activity))[samsungData$subject==1]
plot(samsungData[samsungData$subject==1,1],pch=19,col=numericActivity,ylab=names(samsungData)[1])
plot(samsungData[samsungData$subject==1,2],pch=19,col=numericActivity,ylab=names(samsungData)[2])
legend(150,-0.1,legend=unique(samsungData$activity),col=unique(numericActivity),pch=19)
Plotting average acceleration for first subject Plotting average acceleration for first subject does not show any clear patterns, the activities involving movement show more variation.
source("myplclust.R")
distanceMatrix <- dist(samsungData[samsungData$subject==1,1:3])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering,lab.col=numericActivity)
The dendrogram is not very informative as the average acceleration feature does not seem to be able to discriminate between the six variables here.
par(mfrow=c(1,2))
plot(samsungData[samsungData$subject==1,10],pch=19,col=numericActivity,ylab=names(samsungData)[10])
plot(samsungData[samsungData$subject==1,11],pch=19,col=numericActivity,ylab=names(samsungData)[11])
here we see a separation of two type of actives, in movement (walk - walk down -walk up) & static (laying - sitting - standing). So we could
distanceMatrix <- dist(samsungData[samsungData$subject==1,10:12])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering,lab.col=numericActivity)
separation was done at Height = 1.25 into moving & not moving.
on the matrix, expect for the last two columns which identify the participants.
svd1 = svd(scale(samsungData[samsungData$subject==1,-c(562,563)]))
par(mfrow=c(1,2))
plot(svd1$u[,1],col=numericActivity,pch=19)
plot(svd1$u[,2],col=numericActivity,pch=19)
we see the same pattern as before in the first vector, the second vector groups data at the top, then it thins down except for the purple dots (walk up) so it could be ordering by change in altitude. So we can use the second right singular vector to see the feature which impacts the distribution the most.
plot(svd1$v[,2],pch=19)
#New clustering with maximum contributer
maxContrib <- which.max(svd1$v[,2])
distanceMatrix <- dist(samsungData[samsungData$subject==1,c(10:12,maxContrib)])
hclustering <- hclust(distanceMatrix)
myplclust(hclustering,lab.col=numericActivity)
maxContrib, separated the three movement activities.
names(samsungData)[maxContrib]
## [1] "fBodyAcc-meanFreq()-Z"
#K-means clustering (nstart=1, first try) K-means clustering (nstart=1, first try)
kClust <- kmeans(samsungData[samsungData$subject==1,-c(562,563)],centers=6)
table(kClust$cluster,samsungData$activity[samsungData$subject==1])
##
## laying sitting standing walk walkdown walkup
## 1 0 0 0 0 23 0
## 2 5 0 0 0 0 53
## 3 19 13 5 0 0 0
## 4 26 34 48 0 0 0
## 5 0 0 0 95 0 0
## 6 0 0 0 0 26 0
Cluster 3 is a mix of laying, sitting and standing, the rest are for single factors, repeating the clustering should give similar results.
##K-means clustering (nstart=100, first try) K-means clustering (nstart=100, first try)
kClust <- kmeans(samsungData[samsungData$subject==1,-c(562,563)],centers=6,nstart=100)
table(kClust$cluster,samsungData$activity[samsungData$subject==1])
##
## laying sitting standing walk walkdown walkup
## 1 0 37 51 0 0 0
## 2 0 0 0 95 0 0
## 3 29 0 0 0 0 0
## 4 18 10 2 0 0 0
## 5 0 0 0 0 49 0
## 6 3 0 0 0 0 53
###K-means clustering (nstart=100, second try)
kClust <- kmeans(samsungData[samsungData$subject==1,-c(562,563)],centers=6,nstart=100)
table(kClust$cluster,samsungData$activity[samsungData$subject==1])
##
## laying sitting standing walk walkdown walkup
## 1 0 37 51 0 0 0
## 2 0 0 0 0 49 0
## 3 29 0 0 0 0 0
## 4 0 0 0 95 0 0
## 5 18 10 2 0 0 0
## 6 3 0 0 0 0 53
The second attempt gives better results, in both cases we did not specify the starting point so K-means choose a random point by default.
#Cluster 1 Variable Centers (Laying) Cluster 1 Variable Centers (Laying)
kClust <- kmeans(samsungData[samsungData$subject==1,-c(562,563)],centers=6,nstart=100)
table(kClust$cluster,samsungData$activity[samsungData$subject==1])
##
## laying sitting standing walk walkdown walkup
## 1 29 0 0 0 0 0
## 2 0 0 0 0 49 0
## 3 18 10 2 0 0 0
## 4 3 0 0 0 0 53
## 5 0 0 0 95 0 0
## 6 0 37 51 0 0 0
so we see that walking down is cluster 6 & walking up is cluster 2, we can see can now use this to find the features that class the activity.
plot(kClust$center[6,1:10],pch=19,ylab="Cluster Center",xlab="")
here we see of the 10 features plotted, features 1 & 10 are weigh more in clasifying walking down is very different, most features have
plot(kClust$center[2,1:10],pch=19,ylab="Cluster Center",xlab="")
moved up proportionally, expect for feature 7 and 10 which have dropped alot & feature 6 which shot up. These differences in patterns are useful for modeling.