Experiment

Loading packages

To begin, the necessary packages are loaded into R, and a seed is set for reproducibility.

  library(caret);
  library(randomForest);
  library(doMC);
  set.seed(125);

Collecting the data

Two data sets are used in this project. The first is labeled and is used to build a multiclass classification model; the second is unlabeled, and its predicted values are submitted for automated grading. The following R code downloads both files into the current working directory.

download.file(url =  "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                destfile = "pml-training.csv", method = "curl");
download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
              destfile = "pml-testing.csv" , method = "curl");
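
Re-running the report downloads both files every time. A minimal sketch of a guard that skips files already on disk follows; the helper name `fetchIfMissing` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: download a file only if it is not already on disk.
# Extra arguments (for example method = "curl") are passed to download.file().
fetchIfMissing <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile, ...)
  }
  invisible(destfile)
}

# Possible usage with the URLs from this report (commented out to avoid a download):
# fetchIfMissing("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
#                "pml-training.csv", method = "curl")
```

The guard only checks that the file exists; it does not verify that an earlier download completed successfully.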

Data Reading

With the data in the current working directory, the data set used to build the model is loaded into R. The strings "NA", "#DIV/0!", and the empty string are all read as NA. The training data set has 19622 samples and 160 variables.

missing_values_flag = c("NA","#DIV/0!", "");
training <- read.csv(file = "pml-training.csv", na.strings = missing_values_flag);
dim(training);

## [1] 19622   160

# Shows the first 12 variable names
colnames(training[ , 1:12]);

##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "kurtosis_roll_belt"
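
To illustrate what the na.strings argument does, here is a small sketch on made-up CSV text (the column names and values are invented for the example). Without the extra markers, the "#DIV/0!" entry would prevent column b from being read as numeric.

```r
# Toy CSV text (made up for illustration) containing the three missing-value markers.
csvText <- "a,b,c\n1,#DIV/0!,x\nNA,2.5,\n3,4.0,z"

toy <- read.csv(text = csvText, na.strings = c("NA", "#DIV/0!", ""))

str(toy)               # column b is read as numeric
colSums(is.na(toy))    # one NA in each of a, b and c
```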

Data Preprocessing

The next step is to pre-process the data. First, the first six variables (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window) are removed from the training data because they are not considered useful for prediction. Next, predictors with near-zero variance are removed. Finally, predictors in which at least 70% of the values are missing are removed. After this step, no remaining predictor has a missing value.

# Removes the first six variables
# from the data because they are not
# considered useful for prediction
training = training[ , -c(1, 2, 3, 4, 5, 6)];

# remove near zero variance predictors
nzvar <- nearZeroVar(training);
training = training[ , -nzvar];

# Check if a given column has more than a threshold (70%) of its values as NA
thresholdNA <- 0.7;
checkNA <- function(col) {
  (sum(is.na(col)) >= (thresholdNA * length(col)))
}

# get variables to remove based on training set
lotNas <- sapply(training, checkNA);

# remove variables with a large share of NA values from the training set
training <- training[ , !lotNas];

# Reduced the number of predictors
dim(training);

## [1] 19622    54

# Now the training set has no missing value
any(is.na(training));

## [1] FALSE
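
The missing-value filter can be tried on a toy data frame. The sketch below uses invented data and applies the same rule as checkNA (drop a column when at least 70% of its values are NA), written with colMeans instead of sapply.

```r
# Toy data frame (made up): "mostlyNA" is 80% missing, "someNA" is 20% missing.
toy <- data.frame(
  full     = 1:10,
  someNA   = c(NA, NA, 3:10),
  mostlyNA = c(rep(NA, 8), 9, 10)
)

naShare <- colMeans(is.na(toy))   # fraction of NA values per column
keep    <- naShare < 0.7          # same rule as checkNA: drop when share >= 0.7

filtered <- toy[, keep, drop = FALSE]
names(filtered)                   # "full" "someNA"
anyNA(filtered)                   # TRUE: the threshold alone does not remove every NA
```

As the last line shows, a threshold filter can leave some missing values behind, which is why the report checks any(is.na(training)) explicitly afterwards.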

Splitting into training and testing

After pre-processing, the provided training data set is split into two sets. One set is used to build a random forest model and the other is used to validate it. The split is based on the outcome variable (classe), which has 5 distinct values (A, B, C, D, E) that correspond to the exercise classes, with 60% of the samples used for training the model and 40% for testing. The following R code shows how the data was split.

###### Training data split based on the outcome
# 60% for training
# 40% for testing
trainIndex <- createDataPartition(training$classe, p = .6,
                                  list = FALSE,
                                  times = 1);

inTrain <- training[trainIndex, ];
inTest <- training[-trainIndex, ];
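
The idea of splitting on the outcome can be illustrated in base R. The following is a hand-rolled sketch on an invented outcome vector, not caret's implementation of createDataPartition: it draws 60% of the rows within each class so that both parts keep the class proportions.

```r
set.seed(125)

# Toy outcome vector (made up) with unequal class sizes.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Draw 60% of the row indices within each class.
idxByClass <- split(seq_along(classe), classe)
trainIdx <- sort(unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE))

table(classe[trainIdx])    # A: 30, B: 18, C: 12
table(classe[-trainIdx])   # A: 20, B: 12, C: 8
```

Which rows are drawn depends on the seed, but the per-class counts do not.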

Predicting

In the next step, the random forest model is built with caret's parallel random forest method (parRF). First, the number of cores to use is registered. Second, the model is trained to predict the classe variable from all other predictors. Finally, the model is used to predict the held-out testing set, where it reaches a high accuracy (99.53%). It misclassifies 37 of the 7846 test samples, and 30 of those errors involve class C: 16 class D samples and 7 class B samples are predicted as C, and 7 class C samples are predicted as B. Therefore, the random forest is expected to perform well on the unlabeled data, because that data is expected to come from the same experiment.

##### Random Forest
registerDoMC(4); # Explicitly register four cores
rfModel <- train(classe ~., data = inTrain, method = "parRF"); # call parallel random forest in caret
rfModel;

## Parallel Random Forest 
## 
## 11776 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9899493  0.9872866  0.001510135  0.001909256
##   27    0.9938631  0.9922386  0.001615010  0.002040862
##   53    0.9883917  0.9853191  0.004474800  0.005656878
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

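# Predict using random forest
predRF <- predict(rfModel, inTest);
# Note: confusionMatrix() expects (predictions, reference). The true labels are
# passed first here, so rows labelled "Prediction" below are the true classes
# and columns labelled "Reference" are the model's predictions.
confusionMatrix(inTest$classe, predRF);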

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    0    0    0    0
##          B    5 1506    7    0    0
##          C    0    7 1361    0    0
##          D    0    0   16 1270    0
##          E    0    2    0    0 1440
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9953          
##                  95% CI : (0.9935, 0.9967)
##     No Information Rate : 0.2851          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.994           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9978   0.9941   0.9834   1.0000   1.0000
## Specificity            1.0000   0.9981   0.9989   0.9976   0.9997
## Pos Pred Value         1.0000   0.9921   0.9949   0.9876   0.9986
## Neg Pred Value         0.9991   0.9986   0.9964   1.0000   1.0000
## Prevalence             0.2851   0.1931   0.1764   0.1619   0.1835
## Detection Rate         0.2845   0.1919   0.1735   0.1619   0.1835
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9989   0.9961   0.9911   0.9988   0.9998
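
As a sanity check, the overall accuracy can be recomputed by hand from the counts in the confusion matrix printed above. This sketch only retypes those counts into a matrix; the variable names are illustrative.

```r
# Counts copied from the confusion matrix printed above (rows and columns A to E).
cm <- matrix(c(2232,    0,    0,    0,    0,
                  5, 1506,    7,    0,    0,
                  0,    7, 1361,    0,    0,
                  0,    0,   16, 1270,    0,
                  0,    2,    0,    0, 1440),
             nrow = 5, byrow = TRUE,
             dimnames = list(LETTERS[1:5], LETTERS[1:5]))

accuracy <- sum(diag(cm)) / sum(cm)      # correct predictions / all predictions
errors   <- sum(cm) - sum(diag(cm))      # off-diagonal count

round(accuracy, 4)   # 0.9953
errors               # 37
```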

Predicting on Hidden Test

Finally, the random forest model is used to predict the 20 samples in the unlabeled data set. One important note is that the training data set and the unlabeled data set have the same variable names. To begin, the unlabeled data set is loaded into R; next, predictions are made for all of its samples. The final answers were submitted to the Coursera automatic grading system, which scored 19 of the 20 samples as correct (95% accuracy), a reasonable result.

testing <- read.csv(file = "pml-testing.csv", na.strings = missing_values_flag);
dim(testing);

## [1]  20 160

predHidden <- predict(rfModel, testing);
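
Because the prediction relies on matching variable names, it can be worth checking them before calling predict. A minimal sketch follows, assuming a hypothetical helper `missingPredictors` and invented toy data frames; neither is part of the original analysis.

```r
# Hypothetical helper: names of training predictors that are absent from new data.
missingPredictors <- function(trainData, newData, outcome = "classe") {
  setdiff(setdiff(names(trainData), outcome), names(newData))
}

# Toy frames (made up) standing in for the training and unlabeled sets.
toyTrain <- data.frame(roll_belt = 1, pitch_belt = 2, classe = "A")
toyNew   <- data.frame(roll_belt = 3, pitch_belt = 4, extra_col = 0)

missingPredictors(toyTrain, toyNew)   # character(0): nothing is missing
```

In this report's setting the check would be missingPredictors(training, testing), and an empty result means every predictor kept after pre-processing is available in the unlabeled data.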

Coursera Practical Machine Learning Project

Tulio

May 21, 2015

Synopsis