In the Loading the Data and Required Packages section, I load the packages and data needed for data cleaning and model fitting, and set the seed to 156 for reproducibility. In the Training Model section, I fit a random forest with cross-validation. Finally, in the Prediction section, I use the fitted model to predict the test set.
Loading the Data and Required Packages:
library(caret)
library(dplyr)
# Set the seed for reproducibility
set.seed(156)
# Read the training and testing sets
training = read.csv("pml-training.csv", header = TRUE)
testing = read.csv("pml-testing.csv", header = TRUE)
Viewing the attribute names shows several bookkeeping columns that cannot be used to train a model, such as the row index X, user_name, raw_timestamp_part_1, etc.
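As a quick check (a minimal sketch; the column names in the comment assume the standard pml-training.csv layout), we can list the first few columns:
# Inspect the first few column names; the leading columns hold
# bookkeeping fields rather than sensor measurements.
# With the standard dataset layout these are: "X", "user_name",
# "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp",
# "new_window", "num_window"
names(training)[1:7]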
Also notice that many columns in the testing data consist entirely of missing values; we need to drop those attributes before fitting.
# Keep only the columns that contain no missing values in the testing set
selected.attributes = c()
for (i in 1:ncol(testing)){
  if (!any(is.na(testing[, i]))){
    selected.attributes = c(selected.attributes, i)
  }
}
training = training[, selected.attributes]
testing = testing[, selected.attributes]
# Drop the first five bookkeeping columns (row index, user name, timestamps)
training = training[, 6:ncol(training)]
testing = testing[, 6:ncol(testing)]
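As a sanity check (a sketch; the exact counts depend on the raw files, though with the standard dataset the model summary below implies 54 predictors plus the classe outcome), we can confirm the cleaned dimensions:
# Verify the cleaned dimensions; with the standard dataset this should
# leave 19622 training rows and 55 columns (54 predictors + classe)
dim(training)
dim(testing)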
Since the data contains both discrete and continuous variables, I decide to fit a random forest, trained with 5-fold cross-validation.
# Fit a random forest with 5-fold cross-validation
fitRF = train(classe ~ ., data = training, method = 'rf',
              trControl = trainControl(method = 'cv', number = 5),
              na.action = na.omit)
Take a look at the summary of the model:
fitRF
## Random Forest
##
## 19622 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15699, 15698, 15697, 15695
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9957702 0.9946494
## 28 0.9981142 0.9976147
## 54 0.9960761 0.9950364
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 28.
varImp(fitRF)
## rf variable importance
##
## only 20 most important variables shown (out of 54)
##
## Overall
## num_window 100.000
## roll_belt 63.362
## pitch_forearm 38.560
## yaw_belt 31.631
## magnet_dumbbell_z 29.225
## magnet_dumbbell_y 27.680
## pitch_belt 27.424
## roll_forearm 22.301
## accel_dumbbell_y 13.370
## magnet_dumbbell_x 11.189
## accel_forearm_x 10.515
## roll_dumbbell 10.306
## accel_belt_z 9.721
## total_accel_dumbbell 9.520
## accel_dumbbell_z 8.230
## magnet_belt_z 7.787
## magnet_forearm_z 7.322
## magnet_belt_y 6.782
## magnet_belt_x 5.900
## roll_arm 5.736
The results show that the final value used for the model was mtry = 28, with a cross-validated accuracy of about 0.998. Because this accuracy comes from 5-fold cross-validation rather than resubstitution, it also serves as an estimate of the out-of-sample accuracy, so the expected out-of-sample error is well below 1%.
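As an additional sanity check (a sketch using caret's confusionMatrix; resubstitution accuracy is optimistic, so the cross-validated figure above remains the better error estimate), the in-sample fit can be inspected:
# In-sample confusion matrix; expect near-perfect resubstitution accuracy
confusionMatrix(predict(fitRF, newdata = training), factor(training$classe))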
We can also plot accuracy against the number of randomly selected predictors:
plot(fitRF)
And we notice that randomly selecting 28 predictors at each split works best, matching the mtry = 28 chosen above.
# Predict the 20 test cases with the fitted model
pred = predict(fitRF, newdata = testing)
pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
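Finally, if the predictions need to be saved for submission, each answer can be written to its own text file. This is a hypothetical helper, not part of the original analysis; the file-naming scheme is an assumption.
# Hypothetical helper: write each prediction to problem_id_<n>.txt
# (the file names are an assumption, not part of the original analysis)
write_predictions = function(preds) {
  for (i in seq_along(preds)) {
    filename = paste0("problem_id_", i, ".txt")
    write.table(preds[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(pred)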