In the Loading the Data and Required Packages section, I load the packages and data needed for data cleaning and model fitting, and set the seed to 156 for reproducibility. In the Training Model section, I fit a random forest with cross-validation. Finally, in the Prediction section, I use the fitted model to predict the test set.
Loading the Data and Required Packages:
library(caret)
library(dplyr)
# Set the seed for reproducibility
set.seed(156)
# Read the training and testing sets
training = read.csv("pml-training.csv", header = TRUE)
testing = read.csv("pml-testing.csv", header = TRUE)
Viewing the attribute names shows several bookkeeping columns that cannot be used to train a model, such as the row index X, user_name, raw_timestamp_part_1, etc.
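As a quick check (a minimal sketch; the column names in the comment assume the standard pml-training.csv layout), we can list the first few columns:
# Inspect the first few column names; the leading columns hold
# bookkeeping fields rather than sensor measurements.
# With the standard dataset layout these are: "X", "user_name",
# "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp",
# "new_window", "num_window"
names(training)[1:7]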
Also notice that many columns in the testing data consist entirely of missing values; we need to drop those attributes before fitting.
# Keep only the columns that contain no missing values in the testing set
selected.attributes = c()
for (i in 1:ncol(testing)){
  if (!any(is.na(testing[, i]))){
    selected.attributes = c(selected.attributes, i)
  }
}
training = training[, selected.attributes]
testing = testing[, selected.attributes]
# Drop the first five bookkeeping columns (row index, user name, timestamps)
training = training[, 6:ncol(training)]
testing = testing[, 6:ncol(testing)]
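As a sanity check (a sketch; the exact counts depend on the raw files, though with the standard dataset the model summary below implies 54 predictors plus the classe outcome), we can confirm the cleaned dimensions:
# Verify the cleaned dimensions; with the standard dataset this should
# leave 19622 training rows and 55 columns (54 predictors + classe)
dim(training)
dim(testing)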
Since the data contains both discrete and continuous variables, I decide to fit a random forest, trained with 5-fold cross-validation.
# Fit a random forest with 5-fold cross-validation
fitRF = train(classe ~ ., data = training, method = 'rf',
              trControl = trainControl(method = 'cv', number = 5),
              na.action = na.omit)
Take a look at the summary of the model:
fitRF
## Random Forest
##
## 19622 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15699, 15698, 15697, 15695
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9957702 0.9946494
## 28 0.9981142 0.9976147
## 54 0.9960761 0.9950364
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 28.
varImp(fitRF)
## rf variable importance
##
## only 20 most important variables shown (out of 54)
##
## Overall
## num_window 100.000
## roll_belt 63.362
## pitch_forearm 38.560
## yaw_belt 31.631
## magnet_dumbbell_z 29.225
## magnet_dumbbell_y 27.680
## pitch_belt 27.424
## roll_forearm 22.301
## accel_dumbbell_y 13.370
## magnet_dumbbell_x 11.189
## accel_forearm_x 10.515
## roll_dumbbell 10.306
## accel_belt_z 9.721
## total_accel_dumbbell 9.520
## accel_dumbbell_z 8.230
## magnet_belt_z 7.787
## magnet_forearm_z 7.322
## magnet_belt_y 6.782
## magnet_belt_x 5.900
## roll_arm 5.736
The results show that the final value used for the model was mtry = 28, with a cross-validated accuracy of about 0.998. Because this accuracy comes from 5-fold cross-validation rather than resubstitution, it also serves as an estimate of the out-of-sample accuracy, so the expected out-of-sample error is well below 1%.
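As an additional sanity check (a sketch using caret's confusionMatrix; resubstitution accuracy is optimistic, so the cross-validated figure above remains the better error estimate), the in-sample fit can be inspected:
# In-sample confusion matrix; expect near-perfect resubstitution accuracy
confusionMatrix(predict(fitRF, newdata = training), factor(training$classe))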
We can also plot accuracy against the number of randomly selected predictors:
plot(fitRF)
And we notice that randomly selecting 28 predictors at each split works best, matching the mtry = 28 chosen above.
# Predict the 20 test cases with the fitted model
pred = predict(fitRF, newdata = testing)
pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
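Finally, if the predictions need to be saved for submission, each answer can be written to its own text file. This is a hypothetical helper, not part of the original analysis; the file-naming scheme is an assumption.
# Hypothetical helper: write each prediction to problem_id_<n>.txt
# (the file names are an assumption, not part of the original analysis)
write_predictions = function(preds) {
  for (i in seq_along(preds)) {
    filename = paste0("problem_id_", i, ".txt")
    write.table(preds[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(pred)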