Overview

This report predicts human activity. The data for training and testing come from the ‘Weight Lifting Exercises Dataset’ on the Human Activity Recognition website, and a paper about this dataset can be found here.

The code and test results of this report can be found on the GitHub homepage.

Data exploration and pre-processing

The data was recorded by four 9-degrees-of-freedom Razor inertial measurement units (IMUs), which provide three-axis acceleration, gyroscope and magnetometer data at a joint sampling rate of 45 Hz. The sensors were mounted on the users’ glove, armband, lumbar belt and dumbbell. In this dataset, the sensor locations are descriptively marked as ‘forearm’, ‘arm’, ‘belt’ and ‘dumbbell’, respectively. For each sensor location, features are extracted from the three Euler angles (roll, pitch and yaw), and for each angle, eight features are calculated: mean, variance, standard deviation, max, min, amplitude, kurtosis and skewness.
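To make the feature extraction concrete, here is a minimal sketch of how those eight features could be computed for one window of one angle (windowFeatures is our own illustrative helper, not part of the dataset’s code; skewness and kurtosis come from the e1071 package used later in this report):

library(e1071)

# the eight descriptive features for one window of angle readings
windowFeatures <- function(x) {
    c(mean = mean(x), var = var(x), sd = sd(x),
      max = max(x), min = min(x),
      amplitude = max(x) - min(x),
      kurtosis = kurtosis(x), skewness = skewness(x))
}
windowFeatures(rnorm(50))    # e.g. one 50-sample window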

# read the original training and testing sets
trainSetOri <- read.csv('pml-training.csv', header=T)
testSetOri <- read.csv('pml-testing.csv', header=T)

There are two major types of ‘unknown’ values in the training set: ordinary missing entries (NA), and cells holding the string ‘#DIV/0!’, a spreadsheet division-by-zero artifact.

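A quick way to confirm both kinds of entries in the raw data (a sketch; the counts themselves are omitted here):

# count plain NA entries and '#DIV/0!' strings in the original train set
sum(is.na(trainSetOri))
sum(trainSetOri == "#DIV/0!", na.rm = TRUE)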
So we preprocess the data as follows:

# remove useless columns: the row ID and the three timestamps
trainSet <- trainSetOri[,c(-1,-3,-4,-5)]
testSet <- testSetOri[,c(-1,-3,-4,-5)]

# replace the '#DIV/0!' string by NA
trainSet[trainSet == "#DIV/0!"] = NA
testSet[testSet == "#DIV/0!"] = NA

# keep only the rows with 'new_window' == 'yes', which mark the
# end point of a moving processing window
trainSetMin <- trainSet[trainSet$new_window == 'yes', ]
testSetClean <- testSet[,!apply(testSet,2,function(x) {sum(is.na(x))>0})]

# we can now find the parameter names shared by the test and train sets;
# the remaining training-set parameters are the eight categories of features
# calculated from the raw sensor readings
trainSetMinVs <- names(trainSetMin)
testSetCleanVs <- names(testSetClean)
trainSetRawVs <- trainSetMinVs[sapply(trainSetMinVs, function(x){sum(testSetCleanVs==x)>0})]   # parameter names in group A
trainSetDedVs <- trainSetMinVs[!sapply(trainSetMinVs, function(x){sum(testSetCleanVs==x)>0})]  # parameter names in group B
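As a quick check of the split (a sketch; output omitted — note that the outcome classe, which is also absent from the test set, falls into group B by this definition):

# number of parameter names in group A and group B
length(trainSetRawVs)
length(trainSetDedVs)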

Now we have several data sets. trainSetMin contains all records from the original training set taken at the end of each moving window. testSetClean contains all test records, with the columns (parameters) containing missing values (including the converted ‘#DIV/0!’ entries) removed. We further separate the parameters in the training set into two groups. One (group A) is shared with the test set and holds the raw measurements from the sensors. The other (group B) is the derived set, containing the descriptive statistics computed over each moving window.

For classification on the test set, note that each data sample comes from a different moving window, which means we can only use the parameters in group A to train our model. Before we preprocess and train, however, we can use the calculated descriptive statistics in group B to learn about the data itself.

Among these descriptive statistics, skewness is a measure of the symmetry of the data in a moving window, and kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. For a symmetric distribution, skewness = 0, and for a standard normal distribution, kurtosis = 3.
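As a quick illustration (a sketch, using the e1071 package as later in this report; note that e1071::kurtosis returns excess kurtosis, i.e. kurtosis minus 3, so a Gaussian sample gives a value near 0):

library(e1071)
set.seed(1)
x <- rnorm(10000)    # a symmetric, Gaussian sample
skewness(x)          # approximately 0
kurtosis(x)          # excess kurtosis, approximately 0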

library(reshape)
library(ggplot2)
library(grid)
library(gridExtra)

# columns: user_name (1), num_window (3), the kurtosis (resp. skewness)
# features of the four sensor locations, and classe (156)
trainSetKurtosis <- trainSetMin[,c(1,3,8,9,10,65,66,67,83,84,85,121,122,123,156)]
trainSetSkewness <- trainSetMin[,c(1,3,11,12,13,68,69,70,86,87,88,124,125,126,156)]
trainSetKurtosisM <- melt(data = trainSetKurtosis,
                          id.vars = c("num_window","user_name", "classe"),
                          na.rm = T)
trainSetSkewnessM <- melt(data = trainSetSkewness,
                          id.vars = c("num_window","user_name", "classe"),
                          na.rm = T)
trainSetKurtosisM$value <- as.numeric(as.character(trainSetKurtosisM$value))
trainSetSkewnessM$value <- as.numeric(as.character(trainSetSkewnessM$value))
ggplot(data = trainSetKurtosisM, 
       mapping = aes(x = num_window, y = value)) +
    geom_point(aes(colour = variable, shape = classe)) + 
    facet_wrap(~user_name) + theme_bw() +
    ggtitle("Kurtosis")

ggplot(data = trainSetSkewnessM, 
       mapping = aes(x = num_window, y = value)) +
    geom_point(aes(colour = variable, shape = classe)) + 
    facet_wrap(~user_name) + theme_bw() +
    ggtitle("Skewness")

From the figures, we can see that most of the moving windows, across the different experiment targets, are neither symmetric nor Gaussian. Extreme cases occur at random across targets and motions. We now check further whether the non-symmetry and non-Gaussianity come from chunking the data into moving windows or from the population itself.

# columns: user_name (1), the window-average (avg_*) features of the
# four sensor locations, and classe (156)
trainSetAvg <- trainSetMin[,c(1,24,27,30,47,50,53,100,103,106,138,141,144,156)]
trainSetAvgM <- melt(data = trainSetAvg,
                          id.vars = c("user_name", "classe"),
                          na.rm = T)
p1 <- ggplot(trainSetAvgM[trainSetAvgM$user_name=="carlitos",])+
    geom_boxplot(aes(x=variable, y=value,fill=classe))+
    theme_bw() + ggtitle("measurement from carlitos") +
    theme(axis.text.x = element_text(angle=90, vjust=0.5, size=11))

p2 <- ggplot(trainSetAvgM[trainSetAvgM$user_name=="pedro",])+
    geom_boxplot(aes(x=variable, y=value,fill=classe))+
    theme_bw() + ggtitle("measurement from pedro") +
    theme(axis.text.x = element_text(angle=90, vjust=0.5, size=11))

grid.arrange(p1,p2,nrow=1)

We plot boxplots of the window averages of the measurements from the different sensors, over all times in the training set. The result shows that some parameters (e.g. the average arm roll from carlitos) have obvious skewness. So we can now claim that the population itself is at least non-symmetric, and we need to standardize the data before training our prediction model.

As we found previously, the test set (after we clean up all NAs) only has the raw measurements from the sensors (group A), so we can only include that part of the data in model training.

library(lattice)
library(caret)
# take all parameters in group A, further remove
# 'new_window' and 'num_window', and use the rest to build our train set
trainSetRawVs2 = trainSetRawVs[c(-2,-3)]
trainSet2 <- trainSet[,c(trainSetRawVs2,"classe")]
names(trainSet2)
 [1] "user_name"            "roll_belt"            "pitch_belt"          
 [4] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
 [7] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
[10] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
[13] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
[16] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
[19] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
[22] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
[25] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
[28] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
[31] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
[34] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
[37] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
[40] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
[43] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
[46] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
[49] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
[52] "magnet_forearm_y"     "magnet_forearm_z"     "classe"              
# further partition the training set into train and test subsets;
# the original test set from the database is reserved for validation
set.seed(108)
inTrain <- createDataPartition(trainSet2$classe, p = 0.75, list = FALSE)

yTrain <- trainSet2$classe[inTrain]
yTest <- trainSet2$classe[-inTrain]

preObj <- preProcess(trainSet2[inTrain,c(-54,-1)], method = c("center","scale"))
trainSetPp <- cbind(user_name = trainSet2[inTrain,1],
                    predict(preObj, trainSet2[inTrain,c(-54,-1)]))
testSetPp <- cbind(user_name = trainSet2[-inTrain,1], 
                   predict(preObj, trainSet2[-inTrain,c(-54,-1)]))
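As a sanity check (a sketch; output omitted), the centered and scaled training columns should now have mean roughly 0 and standard deviation roughly 1:

# user_name (column 1) is a factor and is excluded from the check
round(colMeans(trainSetPp[,-1]), 3)
round(apply(trainSetPp[,-1], 2, sd), 3)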

Model training and result analysis

Model 1, random forest

library(e1071)
library(randomForest)
set.seed(108)

# fit a random forest via caret
rfP <- train(x = trainSetPp, y = yTrain, proximity = FALSE, method = 'rf')
print(rfP$finalModel)

Call:
 randomForest(x = x, y = y, mtry = param$mtry, proximity = FALSE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 27

        OOB estimate of  error rate: 0.63%
Confusion matrix:
     A    B    C    D    E  class.error
A 4181    2    2    0    0 0.0009557945
B   19 2824    5    0    0 0.0084269663
C    0   14 2547    6    0 0.0077911959
D    0    0   30 2379    3 0.0136815920
E    0    1    5    6 2694 0.0044345898
rfPred <- predict(rfP, cbind(testSetPp, classe=yTest))
#table(rfPred,yTest)
confusionMatrix(rfPred, yTest)
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1393   10    0    0    0
         B    1  937    4    1    0
         C    0    2  847    8    0
         D    0    0    4  794    2
         E    1    0    0    1  899

Overall Statistics
                                          
               Accuracy : 0.9931          
                 95% CI : (0.9903, 0.9952)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9912          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9986   0.9874   0.9906   0.9876   0.9978
Specificity            0.9972   0.9985   0.9975   0.9985   0.9995
Pos Pred Value         0.9929   0.9936   0.9883   0.9925   0.9978
Neg Pred Value         0.9994   0.9970   0.9980   0.9976   0.9995
Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
Detection Rate         0.2841   0.1911   0.1727   0.1619   0.1833
Detection Prevalence   0.2861   0.1923   0.1748   0.1631   0.1837
Balanced Accuracy      0.9979   0.9929   0.9941   0.9930   0.9986
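Before moving to the next model, one may also inspect which predictors the forest relies on most (a sketch using caret’s varImp; output omitted here):

# rank the predictors by random forest importance
varImp(rfP)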

Model 2, random forest after preprocessing the data per target

One may notice that the test set, the validation set, and the train set share the same set of targets, and the earlier exploratory boxplots indicated that different targets have different parameter distributions. So we can consider doing the pre-processing per target.

set.seed(108)
targets <- levels(trainSet2$user_name)
preObj2 <- list()
trainSetPp2 <- data.frame()
testSetPp2 <- data.frame()
trainSet2Tr <- trainSet2[inTrain,]
trainSet2Te <- trainSet2[-inTrain,]
# center and scale each target's data with its own preProcess object;
# caution: rbind-ing the per-target subsets groups the rows by user_name,
# which may no longer match the original row order of yTrain / yTest
for (target in targets) {
    subTrainSet2 <- trainSet2Tr[trainSet2Tr$user_name == target,]
    subTestSet2 <- trainSet2Te[trainSet2Te$user_name == target,]
    preObj2[[target]] <- preProcess(subTrainSet2[,c(-54,-1)], 
                                    method = c("center","scale"))
    trainSetPp2Tmp <- cbind(user_name = subTrainSet2[,1],
                            predict(preObj2[[target]], subTrainSet2[,c(-54,-1)]))
    testSetPp2Tmp <- cbind(user_name = subTestSet2[,1], 
                           predict(preObj2[[target]], subTestSet2[,c(-54,-1)]))
    
    trainSetPp2 <- rbind(trainSetPp2, trainSetPp2Tmp)
    testSetPp2 <- rbind(testSetPp2, testSetPp2Tmp)
}

# train with the newly preprocessed data
set.seed(108)

rfP2 <- train(x = trainSetPp2, y = yTrain, proximity = FALSE, method = 'rf')
print(rfP2$finalModel)

Call:
 randomForest(x = x, y = y, mtry = param$mtry, proximity = FALSE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 27

        OOB estimate of  error rate: 0.37%
Confusion matrix:
     A    B    C    D    E  class.error
A 4182    3    0    0    0 0.0007168459
B    2 2836   10    0    0 0.0042134831
C    0   12 2549    6    0 0.0070120764
D    0    0   14 2392    6 0.0082918740
E    0    0    0    2 2704 0.0007390983
rfPred2 <- predict(rfP2, cbind(testSetPp2, classe=yTest))
#table(rfPred2,yTest)
confusionMatrix(rfPred2, yTest)
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1395   50    0    0    0
         B    0  899   26    0    0
         C    0    0  829   47    0
         D    0    0    0  757   38
         E    0    0    0    0  863

Overall Statistics
                                         
               Accuracy : 0.9672         
                 95% CI : (0.9618, 0.972)
    No Information Rate : 0.2845         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9584         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            1.0000   0.9473   0.9696   0.9415   0.9578
Specificity            0.9858   0.9934   0.9884   0.9907   1.0000
Pos Pred Value         0.9654   0.9719   0.9463   0.9522   1.0000
Neg Pred Value         1.0000   0.9874   0.9935   0.9886   0.9906
Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
Detection Rate         0.2845   0.1833   0.1690   0.1544   0.1760
Detection Prevalence   0.2947   0.1886   0.1786   0.1621   0.1760
Balanced Accuracy      0.9929   0.9704   0.9790   0.9661   0.9789

Model 3, prediction by tree

For the third model, we try a prediction tree and compare its performance with model 1.

library(rpart)
set.seed(108)

trP <- train(x = trainSetPp, y = yTrain, method = 'rpart')
print(trP$finalModel)
n= 14718 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 14718 10533 A (0.28 0.19 0.17 0.16 0.18)  
   2) roll_belt< 1.054659 13518  9341 A (0.31 0.21 0.19 0.18 0.11)  
     4) pitch_forearm< -1.590591 1173     5 A (1 0.0043 0 0 0) *
     5) pitch_forearm>=-1.590591 12345  9336 A (0.24 0.23 0.21 0.2 0.12)  
      10) magnet_dumbbell_y< 0.6653191 10409  7462 A (0.28 0.18 0.24 0.19 0.11)  
        20) roll_forearm< 0.8270286 6476  3833 A (0.41 0.18 0.18 0.17 0.062) *
        21) roll_forearm>=0.8270286 3933  2634 C (0.077 0.18 0.33 0.23 0.19) *
      11) magnet_dumbbell_y>=0.6653191 1936   964 B (0.032 0.5 0.044 0.23 0.19) *
   3) roll_belt>=1.054659 1200     8 E (0.0067 0 0 0 0.99) *
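Since the fitted tree is shallow, it can also be drawn directly (a sketch, assuming the rpart.plot package, which is not otherwise used in this report, is installed):

library(rpart.plot)
rpart.plot(trP$finalModel)    # visualize the fitted classification tree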
trPred <- predict(trP, cbind(testSetPp, classe=yTest))
#table(trPred,yTest)
confusionMatrix(trPred, yTest)
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1265  398  404  368  123
         B   25  321   23  131  115
         C   99  230  428  305  224
         D    0    0    0    0    0
         E    6    0    0    0  439

Overall Statistics
                                          
               Accuracy : 0.5002          
                 95% CI : (0.4861, 0.5143)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3466          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9068  0.33825  0.50058   0.0000  0.48724
Specificity            0.6315  0.92566  0.78810   1.0000  0.99850
Pos Pred Value         0.4945  0.52195  0.33281      NaN  0.98652
Neg Pred Value         0.9446  0.85358  0.88198   0.8361  0.89639
Prevalence             0.2845  0.19352  0.17435   0.1639  0.18373
Detection Rate         0.2580  0.06546  0.08728   0.0000  0.08952
Detection Prevalence   0.5216  0.12541  0.26223   0.0000  0.09074
Balanced Accuracy      0.7692  0.63196  0.64434   0.5000  0.74287

Comparing models 1, 2 and 3, we can see that the tree has the lowest accuracy and the random forest in model 1 the highest. The per-target preprocessing in model 2 actually reports a lower accuracy than model 1. One reason may be that we preprocess each target separately but then train on the whole pooled preprocessed data as a single set; another possibility is that the per-target rbind in the preprocessing loop reordered the rows, so that trainSetPp2 and testSetPp2 no longer align with yTrain and yTest.
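If the row ordering is indeed the problem, one possible fix is to re-sort the labels the same way the preprocessing loop re-sorted the data (a sketch; order() on the factor is stable and matches the loop's iteration over the factor levels):

# re-align the class labels with the user_name-grouped rows of trainSetPp2 / testSetPp2
yTrain2 <- trainSet2Tr$classe[order(trainSet2Tr$user_name)]
yTest2  <- trainSet2Te$classe[order(trainSet2Te$user_name)]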

Use model 1 on the validation set

# write each prediction to its own text file for submission
pml_write_files = function(x) {
    n = length(x)
    for (i in 1:n) {
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, 
                    row.names = FALSE, col.names = FALSE)
    }
}

# drop user_name, new_window and num_window (columns 1-3) and problem_id
# (column 56), then apply the training-set centering/scaling (preObj)
validateSet <- cbind(user_name = testSetClean[,1],
                     predict(preObj, testSetClean[,c(-56,-1,-2,-3)]))
rfPredV <- predict(rfP, validateSet)
rfPredV
 [1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
pml_write_files(rfPredV)