Practical Data Analysis

Data Clearing.

The data comes from http://groupware.les.inf.puc-rio.br/har.

setwd("C:\\")
if(!file.exists("Wgtlft")){
        dir.create("Wgtlft")
}else message("We already have one")

## We already have one

TrainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(TrainUrl, destfile = "./Wgtlft/train.csv")
Train.dateDownloaded <- date()

TestUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(TestUrl, destfile = "./Wgtlft/test.csv")
Test.dateDownloaded <- date()

Train <- read.csv("./Wgtlft/train.csv")
Test <- read.csv("./Wgtlft/test.csv")

This dataset includes other summary statistics of each variable, such as mean, and kurtosis, and skewness. During the data mining, I discovered the summary statistics can be useful to predict the outcome, becuase each participant did the toy excercise accroding to the class in each different time space. And the summary data takes the feature of the data in the given time set. But, since our analysis is to predict the outcome with Test file, we omit this process - for doing that, we need to reprocess Test data in the same passion as Train set.

n <- names(Train)

Exclude <-!grepl("max_",n) & !grepl("min_",n) &
        !grepl("amplitude_", n) & !grepl("avg_", n) & !grepl("var_", n) &
        !grepl("stddev_", n) & !grepl("kurtosis_", n) & !grepl("skewness_", n) &
        !grepl("X", n) & !grepl("user_name", n) & !grepl("_timestamp", n) & 
        !grepl("total_", n)& !grepl("_window", n)

Tr <- Train[, Exclude]
Tst <- Train[, Exclude]

One problem of this data analysis is that the data is so huge that it takes so much time to do some sophiscated algorithm even with help of parralel computing. So, I randomly sample 2,500 data among the data set. It turned out it worked preety handsomely.

set.seed(5697)
s <- sample(c(1:nrow(Tr)),2500,replace = F)
Str <- Tr[s,]

My plan is to seperate the sampled data set into both train and test dataset, then validate the model into Test set which we seperated earlier.

set.seed(2587)
iTrain <- createDataPartition(y = Str$classe, p = 0.75, list = FALSE)
Sttr <- Str[iTrain, ]
Stts <- Str[-iTrain, ]

Among many different algoritms, Random Forest, or Boosting are good match to the analysis, since the perculiar density shape of the data. the density shape of the data follows several-bell shape pattern.

g1 <- ggplot(data = Train, aes(x = roll_belt)) + geom_density() + ggtitle("Roll Belt")
g2 <- ggplot(data = Train, aes(x = roll_arm)) + geom_density() + ggtitle("Roll Arm")
g3 <- ggplot(data = Train, aes(x = roll_forearm)) + geom_density() + ggtitle("Roll forearm")
g4 <- ggplot(data = Train, aes(x = roll_dumbbell)) + geom_density() + ggtitle("Roll dumbbell")
grid.arrange(g1,g2,g3,g4, ncol = 2, nrow = 2)

It’s not a supprise that the data shape is bizzar, since the data collects up-and-down movement of body in the time space. For example, when we lift up the dumbbell, we lift the fore arm, then, halt for a while to squeeze the muscel, then lift down the dumbbell. So increase-stop-decrease pattern has to be shown in the data. Actually it might be better to analyse the data with time series, because the each pattern appear its type as the time goes. But hopefully, the mean of each value for each class shows some distinct pattern, so it’s ok not to use the time series model.

For parallel computing, I used two packages, parallel, and doParallel.

no_core <- detectCores()-1
cl <- makeCluster(no_core)

To validate the data process, I used the cross-validation with 10 folds.

ctrl <- trainControl(method = "cv", number = 10)
registerDoParallel(cl)
rf.fit <- train(classe ~., data = Sttr, trControl = ctrl,
                method = "rf", tuneLength = 10)

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

stopCluster(cl)

For the testset seperated from the sampled data shows the high accuracy, 93.42%. The 95% confidence level on the accuracy is from 91.118% to 95.24%.

pred.Stts <- predict(rf.fit,Stts)
conf.Stts <- confusionMatrix(pred.Stts,Stts$classe)
conf.Stts$table

##           Reference
## Prediction   A   B   C   D   E
##          A 184  10   0   0   0
##          B   0 107   9   2   2
##          C   0   3  97   5   2
##          D   0   0   1  90   1
##          E   0   0   4   2 104

kable(conf.Stts$table)

	A	B	C	D	E
A	184	10	0	0	0
B	0	107	9	2	2
C	0	3	97	5	2
D	0	0	1	90	1
E	0	0	4	2	104

d <- data.frame(round(conf.Stts$overall, digits = 4))
colnames(d) <- "Prob"
kable(d)

	Prob
Accuracy	0.9342
Kappa	0.9163
AccuracyLower	0.9118
AccuracyUpper	0.9524
AccuracyNull	0.2953
AccuracyPValue	0.0000
McnemarPValue	NaN

Among the class, the Sensitivity rate differs, for class A, the rate is 1, highest among all the class. The lowewst one is class C.

ds <- data.frame(conf.Stts$byClass)
kable(ds[,c(1:2)])

	Sensitivity	Specificity
Class: A	1.0000000	0.9772210
Class: B	0.8916667	0.9741551
Class: C	0.8738739	0.9804688
Class: D	0.9090909	0.9961832
Class: E	0.9541284	0.9883268

The next one, as I mentioned, is the prediction result on Test data set which we got before sampling the data. We can see not much difference on the validation data. But the confidence inteval gets narrower from 94.25% to 94.89%, which is aspiring result, although the Specifictity on class A goes down to 98.76%.

pred.Test <- predict(rf.fit,Tst)
conf.Test <- confusionMatrix(pred.Test,Tst$classe)
conf.Test$table

##           Reference
## Prediction    A    B    C    D    E
##          A 5512  210    4   14    2
##          B   33 3398  161   13   78
##          C   15  150 3194  152   56
##          D   18   22   45 3015   34
##          E    2   17   18   22 3437

kable(conf.Test$table)

	A	B	C	D	E
A	5512	210	4	14	2
B	33	3398	161	13	78
C	15	150	3194	152	56
D	18	22	45	3015	34
E	2	17	18	22	3437

d <- data.frame(round(conf.Test$overall, digits = 4))
colnames(d) <- "Prob"
kable(d)

	Prob
Accuracy	0.9457
Kappa	0.9312
AccuracyLower	0.9424
AccuracyUpper	0.9488
AccuracyNull	0.2844
AccuracyPValue	0.0000
McnemarPValue	0.0000

ds <- data.frame(conf.Test$byClass)
kable(ds[,c(1:2)])

	Sensitivity	Specificity
Class: A	0.9878136	0.9836206
Class: B	0.8949170	0.9819905
Class: C	0.9333723	0.9769753
Class: D	0.9375000	0.9927466
Class: E	0.9528694	0.9963160

Before predicting the data on the test data set we downloaded, first, we need to find the best tune parameter of the model, in this case, the number of variable used in tree, mtry. Second, we need to varify which is the most important variables to predict.

First, as you see below, The random forest model gives the highest accuracy when the number of the sampled varaible is 17 among 49 variables, the best model comes out.

kable(rf.fit$bestTune)

	mtry
4	17

dim(Tr)

## [1] 19622    49

Second, as you see in the graph below, ‘roll belt’, ‘pitch forearm’, ‘magnet dumbel z’, ‘yaw belt’, and ‘magnet dumbbell y’ significantly decrease gini mean. Among those, ‘roll belt’ is dominant given the amount of decrease.

varImpPlot(rf.fit$finalModel)

Practical Data Analysis

Bruce Kim

October 17, 2016

Synopsis.

Data Clearing.