The data consists of two parts: 1. a training set with 19622 observations and 160 variables, and 2. a test set with 20 observations and 160 variables.
Wearable devices usually quantify how much of an activity a person does, but rarely how well the person does it. In this dataset, readings from devices such as Fitbit, Jawbone Up and Nike FuelBand were mapped to five different ways of performing a particular activity. Along with the classe label, 159 other variables were recorded for the same activity, such as accelerometer, gyroscope and yaw measurements.
The objective of this project is to accurately identify which classe an observation belongs to based on the 159 other variables. Once a model is finalised, it is applied to the test data to identify the classe of each of the 20 test cases. This report performs an exploratory analysis of the given data, examines the correlations among the variables, and then fits three different models. It uses in-sample and out-of-sample accuracy to identify the best model, which is then used to predict the classe of each observation in the test set.
Loading the libraries that will be required for the analysis.
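The library() calls themselves are not echoed in the knit output; judging from the startup messages below, they were along these lines (a sketch; rpart is implied by the rpart() call later in the report and loads silently):
library(caret)         # also loads lattice and ggplot2
library(rpart)         # classification trees
library(gbm)           # gradient boosting machines
library(randomForest)  # random forests
library(rattle)        # fancyRpartPlot()
library(corrplot)      # correlation plots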
## Loading required package: lattice
## Loading required package: ggplot2
## Loaded gbm 2.1.5
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
## corrplot 0.84 loaded
The training and test data are downloaded from the location provided. The variables in the given data are shown below.
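The files are read below from a local copy. For reference, the download step, assuming the standard course URLs for this dataset, would look roughly like this:
# Sketch of the download step (URLs assumed to be the standard course links)
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(train_url, destfile = "pml-training.csv")
download.file(test_url,  destfile = "pml-testing.csv")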
training <- read.csv("C:/Users/achar/Documents/R/Predictive Machine Learning/pml-training.csv")
testing <- read.csv("C:/Users/achar/Documents/R/Predictive Machine Learning/pml-testing.csv")
names(training)
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt"
## [9] "pitch_belt" "yaw_belt"
## [11] "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt"
## [15] "skewness_roll_belt" "skewness_roll_belt.1"
## [17] "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
## [21] "min_roll_belt" "min_pitch_belt"
## [23] "min_yaw_belt" "amplitude_roll_belt"
## [25] "amplitude_pitch_belt" "amplitude_yaw_belt"
## [27] "var_total_accel_belt" "avg_roll_belt"
## [29] "stddev_roll_belt" "var_roll_belt"
## [31] "avg_pitch_belt" "stddev_pitch_belt"
## [33] "var_pitch_belt" "avg_yaw_belt"
## [35] "stddev_yaw_belt" "var_yaw_belt"
## [37] "gyros_belt_x" "gyros_belt_y"
## [39] "gyros_belt_z" "accel_belt_x"
## [41] "accel_belt_y" "accel_belt_z"
## [43] "magnet_belt_x" "magnet_belt_y"
## [45] "magnet_belt_z" "roll_arm"
## [47] "pitch_arm" "yaw_arm"
## [49] "total_accel_arm" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "gyros_arm_x"
## [61] "gyros_arm_y" "gyros_arm_z"
## [63] "accel_arm_x" "accel_arm_y"
## [65] "accel_arm_z" "magnet_arm_x"
## [67] "magnet_arm_y" "magnet_arm_z"
## [69] "kurtosis_roll_arm" "kurtosis_picth_arm"
## [71] "kurtosis_yaw_arm" "skewness_roll_arm"
## [73] "skewness_pitch_arm" "skewness_yaw_arm"
## [75] "max_roll_arm" "max_picth_arm"
## [77] "max_yaw_arm" "min_roll_arm"
## [79] "min_pitch_arm" "min_yaw_arm"
## [81] "amplitude_roll_arm" "amplitude_pitch_arm"
## [83] "amplitude_yaw_arm" "roll_dumbbell"
## [85] "pitch_dumbbell" "yaw_dumbbell"
## [87] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [89] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [91] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [93] "max_roll_dumbbell" "max_picth_dumbbell"
## [95] "max_yaw_dumbbell" "min_roll_dumbbell"
## [97] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [99] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell" "total_accel_dumbbell"
## [103] "var_accel_dumbbell" "avg_roll_dumbbell"
## [105] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [107] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [109] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [111] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [113] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [115] "gyros_dumbbell_z" "accel_dumbbell_x"
## [117] "accel_dumbbell_y" "accel_dumbbell_z"
## [119] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [121] "magnet_dumbbell_z" "roll_forearm"
## [123] "pitch_forearm" "yaw_forearm"
## [125] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [127] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [129] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [131] "max_roll_forearm" "max_picth_forearm"
## [133] "max_yaw_forearm" "min_roll_forearm"
## [135] "min_pitch_forearm" "min_yaw_forearm"
## [137] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [139] "amplitude_yaw_forearm" "total_accel_forearm"
## [141] "var_accel_forearm" "avg_roll_forearm"
## [143] "stddev_roll_forearm" "var_roll_forearm"
## [145] "avg_pitch_forearm" "stddev_pitch_forearm"
## [147] "var_pitch_forearm" "avg_yaw_forearm"
## [149] "stddev_yaw_forearm" "var_yaw_forearm"
## [151] "gyros_forearm_x" "gyros_forearm_y"
## [153] "gyros_forearm_z" "accel_forearm_x"
## [155] "accel_forearm_y" "accel_forearm_z"
## [157] "magnet_forearm_x" "magnet_forearm_y"
## [159] "magnet_forearm_z" "classe"
The first few lines of the data set are displayed below to get a sense of the data. We can clearly see that many of the variables contain NAs; such variables, if used as predictors, will produce errors and unreliable results.
head(training, 3)
## X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1 1 carlitos 1323084231 788290 05/12/2011 11:23
## 2 2 carlitos 1323084231 808298 05/12/2011 11:23
## 3 3 carlitos 1323084231 820366 05/12/2011 11:23
## new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt
## 1 no 11 1.41 8.07 -94.4 3
## 2 no 11 1.41 8.07 -94.4 3
## 3 no 11 1.42 8.07 -94.4 3
## kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt
## 1
## 2
## 3
## skewness_roll_belt skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## 1 NA
## 2 NA
## 3 NA
## max_picth_belt max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## amplitude_roll_belt amplitude_pitch_belt amplitude_yaw_belt
## 1 NA NA
## 2 NA NA
## 3 NA NA
## var_total_accel_belt avg_roll_belt stddev_roll_belt var_roll_belt
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## avg_pitch_belt stddev_pitch_belt var_pitch_belt avg_yaw_belt
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## stddev_yaw_belt var_yaw_belt gyros_belt_x gyros_belt_y gyros_belt_z
## 1 NA NA 0.00 0 -0.02
## 2 NA NA 0.02 0 -0.02
## 3 NA NA 0.00 0 -0.02
## accel_belt_x accel_belt_y accel_belt_z magnet_belt_x magnet_belt_y
## 1 -21 4 22 -3 599
## 2 -22 4 22 -7 608
## 3 -20 5 23 -2 600
## magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm var_accel_arm
## 1 -313 -128 22.5 -161 34 NA
## 2 -311 -128 22.5 -161 34 NA
## 3 -305 -128 22.5 -161 34 NA
## avg_roll_arm stddev_roll_arm var_roll_arm avg_pitch_arm stddev_pitch_arm
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## var_pitch_arm avg_yaw_arm stddev_yaw_arm var_yaw_arm gyros_arm_x
## 1 NA NA NA NA 0.00
## 2 NA NA NA NA 0.02
## 3 NA NA NA NA 0.02
## gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z magnet_arm_x
## 1 0.00 -0.02 -288 109 -123 -368
## 2 -0.02 -0.02 -290 110 -125 -369
## 3 -0.02 -0.02 -289 110 -126 -368
## magnet_arm_y magnet_arm_z kurtosis_roll_arm kurtosis_picth_arm
## 1 337 516
## 2 337 513
## 3 344 513
## kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm skewness_yaw_arm
## 1
## 2
## 3
## max_roll_arm max_picth_arm max_yaw_arm min_roll_arm min_pitch_arm
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## min_yaw_arm amplitude_roll_arm amplitude_pitch_arm amplitude_yaw_arm
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## roll_dumbbell pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## 1 13.05217 -70.49400 -84.87394
## 2 13.13074 -70.63751 -84.71065
## 3 12.85075 -70.27812 -85.14078
## kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## 1
## 2
## 3
## skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## 1 NA
## 2 NA
## 3 NA
## max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell min_pitch_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## min_yaw_dumbbell amplitude_roll_dumbbell amplitude_pitch_dumbbell
## 1 NA NA
## 2 NA NA
## 3 NA NA
## amplitude_yaw_dumbbell total_accel_dumbbell var_accel_dumbbell
## 1 37 NA
## 2 37 NA
## 3 37 NA
## avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x
## 1 NA NA NA 0
## 2 NA NA NA 0
## 3 NA NA NA 0
## gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 1 -0.02 0 -234 47
## 2 -0.02 0 -233 47
## 3 -0.02 0 -232 46
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z
## 1 -271 -559 293 -65
## 2 -269 -555 296 -64
## 3 -270 -561 298 -63
## roll_forearm pitch_forearm yaw_forearm kurtosis_roll_forearm
## 1 28.4 -63.9 -153
## 2 28.3 -63.9 -153
## 3 28.3 -63.9 -152
## kurtosis_picth_forearm kurtosis_yaw_forearm skewness_roll_forearm
## 1
## 2
## 3
## skewness_pitch_forearm skewness_yaw_forearm max_roll_forearm
## 1 NA
## 2 NA
## 3 NA
## max_picth_forearm max_yaw_forearm min_roll_forearm min_pitch_forearm
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 1 NA NA
## 2 NA NA
## 3 NA NA
## amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## 1 36 NA
## 2 36 NA
## 3 36 NA
## avg_roll_forearm stddev_roll_forearm var_roll_forearm avg_pitch_forearm
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x gyros_forearm_y
## 1 NA NA 0.03 0.00
## 2 NA NA 0.02 0.00
## 3 NA NA 0.03 -0.02
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1 -0.02 192 203 -215
## 2 -0.02 192 203 -216
## 3 0.00 196 204 -213
## magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1 -17 654 476 A
## 2 -18 661 473 A
## 3 -18 658 469 A
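A quick tabulation (not shown in the original output) makes the NA pattern easy to see: columns tend to be either complete or almost entirely NA.
# Count how many columns have each number of NAs
table(colSums(is.na(training)))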
A plot of the classe variable shows its different levels and how the observations are spread across them.
plot(training$classe, col = "green", main = "Training Set - Classe Data", xlab = "Classe Labels", ylab = "Number of Observations")
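Since the chunk above only produces a figure, an equivalent numeric summary of the class distribution can be obtained with:
# Numeric view of how the observations are spread across the five classes
table(training$classe)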
First, we keep only the variables with fewer than 20 NAs (out of 19622) in the training set, and those with fewer than 2 NAs (out of 20) in the test set. We then remove the first 7 columns (row index, user name, timestamps, new_window and num_window), since these cannot act as good predictors, and finally drop any variables with near-zero variance.
The cleaned training data is then split into training and validation sets in a 3:1 ratio. The validation set is used for cross-validation, i.e. to estimate the out-of-sample error of each fitted model.
newtrain <- training[, colSums(is.na(training)) < 20]   # drop columns dominated by NAs
newtrain <- newtrain[, -c(1:7)]                         # drop index, user name, timestamp and window columns
newtest <- testing[, colSums(is.na(testing)) < 2]       # same idea for the test set
newtest <- newtest[, -c(1:7)]
NZV <- nearZeroVar(newtrain)                            # identify near-zero-variance predictors
newtrain <- newtrain[, -NZV]
inTrain <- createDataPartition(newtrain$classe, p = 3/4)[[1]]   # 75/25 train/validation split
final_training <- newtrain[inTrain, ]
final_validate <- newtrain[-inTrain, ]
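A quick dimension check (not part of the original output) confirms the number of remaining variables and the 3:1 split:
# Rows and columns after cleaning and partitioning
dim(final_training); dim(final_validate); dim(newtest)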
After cleaning, only 53 variables (52 predictors plus classe) remain in the training and test data sets.
We now check that the variables in the final training data set are present in the test set as well.
check <- names(newtrain) == names(newtest)
sum(check)
## [1] 52
We see that 52 variable names match across the training, validation and test sets. The only mismatch is the classe variable, which appears as problem_id in the test set. Hence we are good to go.
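The mismatching column can be confirmed directly, for example:
# The one position where the names differ (classe vs problem_id)
names(newtrain)[!check]
names(newtest)[!check]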
We then examine the correlations among the predictors. The diagonal of the correlation matrix is set to zero because it only contains each variable's correlation with itself. The variables involved in correlations stronger than 0.8 are listed below.
M <- abs(cor(final_training[, -53]))            # absolute correlations between the 52 predictors
diag(M) <- 0                                    # ignore each variable's correlation with itself
highly_correlated <- findCorrelation(M, cutoff = 0.8)
names(final_training[highly_correlated])
## [1] "accel_belt_z" "roll_belt" "accel_belt_y"
## [4] "accel_dumbbell_z" "accel_belt_x" "pitch_belt"
## [7] "accel_dumbbell_x" "accel_arm_x" "magnet_arm_y"
## [10] "gyros_forearm_y" "gyros_dumbbell_x" "gyros_dumbbell_z"
## [13] "gyros_arm_x"
The correlation plot below visualises these relationships: the deeper colours indicate correlations close to -1 or 1. The diagonal is intentionally set to 0 because those entries are each variable's correlation with itself and would otherwise all equal 1.
N <- cor(final_training[, -53])                 # signed correlations this time
diag(N) <- 0
corrplot(N, order = "FPC", method = "color", type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
The out-of-sample error is defined as 1 minus the accuracy on the validation set. It estimates the error we expect when the trained model is applied to new data.
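In R this is a one-liner; as a sketch, for any of the confusionMatrix objects built on the validation set below (here called cm as a placeholder):
# Out-of-sample error estimate from a validation-set confusion matrix
oos_error <- function(cm) 1 - cm$overall["Accuracy"]
# e.g. oos_error(conf_mat_rf) after the random forest fit below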
We fit three different models on the training data and check the in-sample and out-of-sample accuracy of each below. I also tried other models, such as linear regression and a GLM, but did not include them because their accuracy was low and they added no value to the report. The three models are a Classification Tree (rpart), a Random Forest and a Gradient Boosting Machine (GBM).
set.seed(504)
mod_ct <- rpart(classe ~ ., data = final_training, method = "class")   # classification tree
fancyRpartPlot(mod_ct)
colnames(newtest)[53] <- "classe"   # rename problem_id so the test set columns match the training set
test_mod_ct <- predict(mod_ct, newdata = final_validate, type = "class")
conf_mat_ct <- confusionMatrix(test_mod_ct, final_validate$classe)
conf_mat_ct$table
## Reference
## Prediction A B C D E
## A 1224 155 27 45 27
## B 59 602 91 87 99
## C 31 83 624 125 96
## D 66 64 64 508 79
## E 15 45 49 39 600
conf_mat_ct$overall[1]
## Accuracy
## 0.7255302
The Classification Tree has an overall accuracy of about 72.6% on the validation set, so the out-of-sample error rate is about 27.4%.
set.seed(504)
fitControl <- trainControl(method = "cv", number = 3, verboseIter = FALSE)   # 3-fold cross-validation
mod_rf <- train(classe ~ ., data = final_training, method = "rf", trControl = fitControl)
confusionMatrix(mod_rf)   # in-sample (cross-validated) confusion matrix
## Cross-Validated (3 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 28.4 0.2 0.0 0.0 0.0
## B 0.0 19.1 0.2 0.0 0.0
## C 0.0 0.1 17.3 0.4 0.0
## D 0.0 0.0 0.0 16.0 0.0
## E 0.0 0.0 0.0 0.0 18.3
##
## Accuracy (average) : 0.9904
test_mod_rf <- predict(mod_rf, newdata = final_validate)
conf_mat_rf <- confusionMatrix(test_mod_rf, final_validate$classe)
conf_mat_rf$table
## Reference
## Prediction A B C D E
## A 1395 8 0 0 0
## B 0 941 4 0 0
## C 0 0 849 16 4
## D 0 0 2 783 0
## E 0 0 0 5 897
conf_mat_rf$overall[1]
## Accuracy
## 0.9920473
Random Forest has an in-sample (cross-validated) accuracy of 99.04% and an out-of-sample accuracy of 99.20%. The expected out-of-sample error is 1 minus the accuracy, or about 0.80%.
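For additional insight (not included in the original analysis), the predictors the Random Forest relies on most can be inspected with caret's variable-importance function:
# Top predictors by importance in the random forest model
varImp(mod_rf)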
set.seed(504)
mod_gbm <- train(classe ~ ., data = final_training, method = "gbm", trControl = fitControl, verbose = FALSE)
confusionMatrix(mod_gbm)
## Cross-Validated (3 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 27.9 0.8 0.0 0.0 0.0
## B 0.3 18.0 0.6 0.1 0.3
## C 0.1 0.5 16.6 0.5 0.1
## D 0.1 0.0 0.2 15.7 0.2
## E 0.0 0.0 0.0 0.1 17.7
##
## Accuracy (average) : 0.9593
test_mod_gbm <- predict(mod_gbm, newdata = final_validate)
conf_mat_gbm <- confusionMatrix(test_mod_gbm, final_validate$classe)
conf_mat_gbm$table
## Reference
## Prediction A B C D E
## A 1375 30 0 2 1
## B 10 898 28 6 9
## C 6 20 813 38 12
## D 3 1 12 752 5
## E 1 0 2 6 874
conf_mat_gbm$overall[1]
## Accuracy
## 0.9608483
GBM has an in-sample (cross-validated) accuracy of 95.93% and an out-of-sample accuracy of 96.08%. The out-of-sample error rate is 3.92%.
I also tried a stacked ensemble of the models, which took much longer to run but did not improve accuracy appreciably, so I dropped that idea and kept the Random Forest model, which already provided accurate results.
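The ensemble code itself is not shown; a sketch of the kind of stacking that could be tried with the three fitted models, using their validation-set predictions, is below (ideally the stacked model would then be evaluated on a separate hold-out set).
# Sketch of a stacked ensemble (assumed approach, not the original code)
stack_df <- data.frame(ct = test_mod_ct, rf = test_mod_rf, gbm = test_mod_gbm,
                       classe = final_validate$classe)
mod_stack  <- train(classe ~ ., data = stack_df, method = "rf", trControl = fitControl)
pred_stack <- predict(mod_stack, newdata = stack_df)
confusionMatrix(pred_stack, stack_df$classe)$overall[1]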
The predictions for the test set were calculated with the Random Forest model, which had the highest in-sample accuracy, the highest out-of-sample accuracy and the lowest out-of-sample error rate. It was therefore the best choice for predicting the classe of the 20 test cases. The predictions are given below; they are expected to be about 99.2% accurate, with an expected out-of-sample error rate of about 0.8%.
newtest$classe <- predict(mod_rf, newdata = newtest)
newtest$classe
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E