Title: Practical Machine Learning Project

Author: Farhad.M
Date: Sep. 26, 2015

Summary

Nowadays it is possible to collect a large amount of data about personal activity relatively inexpensively using wearable devices. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The analysis starts with loading and preprocessing the data and continues with model construction and validation. Finally, the predictions on the provided test data are reported.

Loading and Briefly Examining the Training Dataset

The training data has 19622 observations and 160 features.

# Read the training and test sets
data_train <- read.csv('./train.csv')
data_test <- read.csv('./test.csv')
dim(data_train)
## [1] 19622   160
table(data_train$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
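
The classes are reasonably balanced, with class A the most common at about 28% of observations (5580/19622); this proportion reappears later as the No Information Rate in the confusion matrix. The relative frequencies can be computed directly:

# Relative class frequencies (class A is about 28% of the data)
round(prop.table(table(data_train$classe)), 4)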

Preprocessing

The training dataset is rather large. Since some columns are dominated by missing values, we first find the columns with at least 80% NAs and remove them.

First, we count the missing values in each column and flag the columns that are at least 80% NA:

# Count the number of NAs in each column
NA_no <- sapply(data_train, function(x) sum(is.na(x)))
# Collect the indices of columns that are at least 80% NA
idx <- c()
for (i in 1:length(NA_no)) {
    if (NA_no[[i]] / nrow(data_train) >= 0.80) {
        idx <- append(idx, i)
    }
}

training <- data_train[,-idx]   # drop the mostly-NA columns
dim(training)
## [1] 19622    93

So, the number of features is reduced from 160 to 93.
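
As an aside, the same NA filtering can be written more compactly with vectorized base R; this one-liner is an equivalent sketch of the loop above:

# Equivalent: keep only the columns that are less than 80% NA
training <- data_train[, colMeans(is.na(data_train)) < 0.80]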

Model Construction and Training

As some variables have low variability, we use the nearZeroVar function from the caret package to find features of the dataset that have near-zero variance. Then we remove these predictors from the training dataset.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(12345)
# Flag near-zero-variance predictors and drop them
n0v <- nearZeroVar(training, saveMetrics = TRUE)
training <- training[, n0v$nzv == FALSE]
dim(training)
## [1] 19622    59
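
Alternatively, when called without saveMetrics, nearZeroVar returns the offending column indices directly; a minimal equivalent sketch (the guard avoids the empty-index pitfall of negative subsetting in R):

# Equivalent: get near-zero-variance column indices directly
nzv_idx <- nearZeroVar(training)
if (length(nzv_idx) > 0) training <- training[, -nzv_idx]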

It can be seen that the number of features is reduced again. Looking at the feature names, the first five columns (the row index, user name, and timestamps) are not useful for training, so we simply remove them as well.

names(training)
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "num_window"          
##  [7] "roll_belt"            "pitch_belt"           "yaw_belt"            
## [10] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
## [13] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [16] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [19] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [22] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [25] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [28] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [31] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [34] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [37] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [40] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [43] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [46] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [49] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [52] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [55] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [58] "magnet_forearm_z"     "classe"
# Drop the row index, user name, and timestamp columns
training <- training[,-c(1,2,3,4,5)]
dim(training)
## [1] 19622    54

So, the final training set has 54 columns: 53 predictors plus the classe outcome.

Now, the data is divided into training and validation sets with fractions of 70% and 30%, respectively:

# 70/30 split, stratified on the classe outcome
tset <- createDataPartition(training$classe, p = 0.7, list = FALSE)
Trn <- training[tset, ]
Val <- training[-tset, ]

We use a random forest, one of the most popular models for classification tasks like this one.

library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
# Fit a random forest with default settings (500 trees);
# progress can be traced with the do.trace argument if desired
model <- randomForest(classe ~ ., data = Trn)
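
Printing the fitted model is also worth a look: randomForest reports an out-of-bag (OOB) error estimate computed during training, which already approximates the out-of-sample error (the exact value varies with the seed and partition):

# The printed summary includes the OOB estimate of the error rate
print(model)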

Model Validation

We evaluate the model on the training set itself and on the held-out validation set.

The training set accuracy:

ptrn <- predict(model, Trn)
confusionMatrix(ptrn, Trn$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3906    0    0    0    0
##          B    0 2658    0    0    0
##          C    0    0 2396    0    0
##          D    0    0    0 2252    0
##          E    0    0    0    0 2525
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

As expected, the random forest fits the training set perfectly. Since the model has already seen these observations, this is an optimistic measure, so the held-out validation set gives a more realistic estimate of the out-of-sample accuracy.
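
A sketch of the corresponding check on the validation set follows; the exact numbers depend on the seed and partition, but the accuracy should be close to the OOB estimate above, and one minus the accuracy is the expected out-of-sample error:

# Accuracy on the 30% hold-out set approximates out-of-sample performance
pval <- predict(model, Val)
confusionMatrix(pval, Val$classe)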

Prediction on the Given Test Data

Here we use the model above to predict how the exercise was performed for each case in the given test dataset.

# Predict the class for each of the 20 test cases
pt <- predict(model, data_test)
print(pt)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Finally, the submission files are generated using the code below:

answers <- as.vector(pt)
# Write each prediction to its own text file for submission
pml_write_files = function(x) {
    n = length(x)
    for (i in 1:n) {
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, 
            col.names = FALSE)
    }
}
pml_write_files(answers)