Background and objectives

Inertial detectors are used to monitor subjects' activities. Data could be modeled to predict the type and quality of subjects' activity. 
In this project, dumbbell lifting subjects were monitored by 9 degrees inertial measurement unit(IMU) attatched to 1: arm, 2: belt; 3:dumbbell. Sujbects are asked to perform the "right" lifting and "wrong" lifting. 
Further information for the data generation could be found:http://groupware.les.inf.puc-rio.br/har. 
The objectives of this analysis is to select features from hundreds of predictors and model the type of activity with key predictors. 
Further, Model will be applied to test dataset.

Executive summary

Data preprocessing:
    * remove zero and near-zero value; 
    * remove highly correlated variables using cor function; 
    * further feature plot to remove variables that are dominated by certain value. 
Modeling: 
    * K-fold random forest was used with repeated cross validation. 
Result: 
    * The model is very effective to predict; 
    * Based on 24 predictors (see below section "the variable importance."), the model accuracy is 0.98 and the test set accuracy is 0.98; 
    * Based on the model, the prediction of 20 sample is :B A C A A E D B A A B C B A E E A B B B, with one error.

Getting the data

<<<<<<< HEAD

# make this an external chunk that can be included in any file
library(knitr)
options(width = 100)
opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')

options(xtable.type = 'html')
knit_hooks$set(inline = function(x) {
  if(is.numeric(x)) {
    round(x, getOption('digits'))
  } else {
    paste(as.character(x), collapse = ', ')
  }
})
knit_hooks$set(plot = knitr:::hook_plot_html)

if (!file.exists("data")) {
    dir.create("data")
}

urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(urlTrain, destfile = "./data/pml-training.csv", method = "curl")
download.file(urlTest, destfile = "./data/pml-testing.csv", method = "curl")

pmlTrain <- read.table("/Volumes/s/R programming/data/pml-training.csv", header = TRUE, sep = ",")
pmlTest <- read.table("/Volumes/s/R programming/data/pml-testing.csv", header = TRUE, sep = ",")
pmlTrain <- read.table("./data/pml-training.csv", header = TRUE, sep = ",")
pmlTest <- read.table("./data/pml-testing.csv", header = TRUE, sep = ",")
dim(pmlTrain);dim(pmlTest);table(pmlTrain$classe)

[1] 19622   160

[1]  20 160


   A    B    C    D    E 
5580 3797 3422 3216 3607

data preparation: zero, near-near removing.

set.seed(38360)
# nzv in training set. 
nzvTrain <- nearZeroVar(pmlTrain)
# nzv in testing set. 
nzvTest <- nearZeroVar(pmlTest)
# merge the two nzv index. 
nzv <- unique(c(nzvTrain, nzvTest))
# remove nzv values in training an test of 20 samples. 
rDat <- pmlTrain[, -nzv]
test20 <- pmlTest[, -nzv]
rDatCls <- as.factor(rDat$classe)
# remove non-relevant factors. 
rDat <- rDat[, -c(2, 5)]
test20 <- test20[, -c(2, 5)]

feature selection using correlation.

set.seed(38360)
rDatCor <- cor(rDat[,-57])
highlyCorrelated <- findCorrelation(rDatCor, cutoff=0.75)
rDatFinal <- rDat[, -highlyCorrelated]
test20Final <- test20[, -highlyCorrelated]

further feature selection using feature plot.

<<<<<<< HEAD
=======
names(rDatFinal)
>>>>>>> 639b3dd68fb9cbd68a80dd5ff1b0ad08e367ab5b
for (i in 1:34){
        fplot <- featurePlot(x=rDatFinal[,i],
            y = rDatFinal$classe,
            plot="density", 
            alpha = 0.8)
        print(fplot)
        
}

further remove values of highly skewed .

rDatFinal <- rDatFinal[, -c(1, 2, 3, 4, 5, 6, 23, 25, 28, 29)]
test20Final <- test20Final[, -c(1, 2, 3, 4, 5, 6, 23, 25, 28, 29)]

data partition

inTrain <- createDataPartition(rDatFinal$classe, p = 0.75, list = FALSE)
train <- rDatFinal[inTrain,]
test <- rDatFinal[-inTrain,]

training control.

fitControl <- trainControl(method = "repeatedcv",
                           number = 3,
                           repeats = 10)

random forest modeling.

rf_Fit <- train(classe ~ ., data = train, 
                 method = "rf", 
                 trControl = fitControl,
                 proximity = TRUE)

evaluate the model accuracy

rf_Fit

Random Forest 

14718 samples
   24 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (3 fold, repeated 10 times) 
Summary of sample sizes: 9812, 9811, 9813, 9812, 9812, 9812, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9757304  0.9692930
  13    0.9706482  0.9628686
  24    0.9614079  0.9511839

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2.

the variable importance.

vImp <- varImp(rf_Fit)
order(vImp$importance);names(train)

 [1] 10  8 19  2  1  9 18  6 12 22  7 23 11 21 14 24 20  5  3 16 15 13 17  4

 [1] "gyros_belt_x"         "gyros_belt_y"         "gyros_belt_z"         "magnet_belt_z"       
 [5] "roll_arm"             "pitch_arm"            "yaw_arm"              "total_accel_arm"     
 [9] "gyros_arm_y"          "gyros_arm_z"          "magnet_arm_x"         "magnet_arm_z"        
[13] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
[17] "roll_forearm"         "yaw_forearm"          "total_accel_forearm"  "accel_forearm_x"     
[21] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"     "magnet_forearm_z"    
[25] "classe"

evaluate the partitioned test set.

predictions <- predict(rf_Fit, newdata = test[, -35])

calculate the out-of sample error.

confusionMatrix(predictions, test$classe)

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1389   22    1    1    0
         B    0  921   14    0    2
         C    1    3  837   24    1
         D    5    0    3  776    4
         E    0    3    0    3  894

Overall Statistics
                                          
               Accuracy : 0.9823          
                 95% CI : (0.9782, 0.9858)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9775          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9957   0.9705   0.9789   0.9652   0.9922
Specificity            0.9932   0.9960   0.9928   0.9971   0.9985
Pos Pred Value         0.9830   0.9829   0.9665   0.9848   0.9933
Neg Pred Value         0.9983   0.9929   0.9955   0.9932   0.9983
Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
Detection Rate         0.2832   0.1878   0.1707   0.1582   0.1823
Detection Prevalence   0.2881   0.1911   0.1766   0.1607   0.1835
Balanced Accuracy      0.9944   0.9832   0.9859   0.9811   0.9954

predict the 20 samples.

test20predict1 <- predict(rf_Fit, newdata = test20Final[, -28])
test20predict1

 [1] B A C A A E D B A A B C B A E E A B B B
Levels: A B C D E

Discussion

If more predictors were included(Model not shown here),  100% accuracy of the 20 sample could be attained. But I'm use a "loose" model here to explore the flexibility of the model. 
In five experiments with different in-sample accuracy, every sample except No. 3 could be well predicted. For example, in above model with in-sample error of 0.9747, 19 test samples are predicted correctly. But sample No3 was mis-predicted from "classe B" to "classe C". Why is sample No3 vulnerable?

# data of No3 sample in test set, 
sample3 <- test20Final[3, -25]
# generate two same row and labeled each row with "B" and "C"
sample3 <- rbind(sample3, sample3)
sample3$classe <- c("B", "C")

# melt sample 3 data. 
sample3.melt <- melt(sample3, id.vars = "classe", 
                     value.name = "var", 
                     variable.name = "predictors")
# melt train data. 
train.melt <- melt(train, id.vars = "classe",
                   value.name = "var", 
                   variable.name = "predictors")
# plotting to observe the data of sample3 in classe B and classe C
g <- ggplot(data = train.melt, 
            aes(x = predictors, y= var))
g <- g + geom_boxplot()
g <- g + geom_point(data = sample3.melt, 
                    aes(x = predictors, y = var, colour = classe), 
                    size = 5, alpha = 0.5)
g <- g + facet_grid(classe~.)
g

From this plot, we notice that there maybe two reasons for the mis-prediction of sample 3. 1) Data in classe B and C are very similar; 2) the value of predictors of sample3 are in-between the “common” range of both B and C. So the model makes mistakes.

Practical Machine Learning Project

Rui Lian

Sep, 2017