Executive Summary

This is the final project for the Practical Machine Learning class on Coursera, one of the nine courses in the Data Science Specialization.

A common pattern in activity research is to quantify how much of a particular activity people do, but rarely how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how well they perform a weight lifting exercise.

Using a random forest, the model predicted the quality of the weight lifting with 99% accuracy (an estimated out-of-sample error of only about 1%) using 52 predictors. The five most important predictors are 'roll_belt', 'yaw_belt', 'pitch_forearm', 'pitch_belt', and 'magnet_dumbbell_z'. The random forest clearly outperforms Linear Discriminant Analysis.

Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The Data

This project uses the Weight Lifting Exercises Dataset from http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. The goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Loading Data and Some Exploratory Analyses

The following code loads the caret package (used to build the models) and ggplot2 and GGally (used for visualization), then downloads the training and test sets.

library(caret)
## Warning: package 'caret' was built under R version 3.4.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.2
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 3.4.2
#download the data
Url1 <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
Url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./data/pml-training.csv"
testFile  <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
  dir.create("./data")
}
if (!file.exists(trainFile)) {
  download.file(Url1, destfile=trainFile)
}
if (!file.exists(testFile)) {
  download.file(Url2, destfile=testFile)
}

#load the data
training <- read.csv("./data/pml-training.csv",header=T,sep=",",na.strings=c(""," ","NA"))
testing <- read.csv("./data/pml-testing.csv",header=T,sep=",",na.strings=c(""," ","NA"))
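
Note that the raw CSV files also contain Excel-style "#DIV/0!" strings, visible below as factor levels in the structure output. A variant of the read that also treats those strings as missing is sketched here; it was not used to produce the results shown in this report.

# Alternative read that also treats "#DIV/0!" as NA
# (a sketch; not the call used for the output below)
training_alt <- read.csv(trainFile, header=T, sep=",",
                         na.strings=c(""," ","NA","#DIV/0!"))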

Exploratory Analyses

dim(training) ; dim(testing)
## [1] 19622   160
## [1]  20 160

We see that there are 19,622 rows and 160 columns in the training set, while the testing set has only 20 rows with the same number of fields. The "classe" variable in the training set is the outcome to predict.
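
Since "classe" is the outcome, a quick look at its distribution gives useful context (a minimal check; the output is not reproduced here):

# Counts and proportions of the outcome classes in the training set
table(training$classe)
round(prop.table(table(training$classe)), 3)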

The first thing we can do is quickly check the structure of the data to see how the variables are formatted and whether there are signs of missing values (NA). We will have to deal with missing values either by imputing them, removing the affected rows if they are few, or removing a variable altogether if most of its values are missing. Second, since the training set is large enough to justify splitting it further into training and validation sets, we will do all exploratory analyses (such as checking the distribution of each variable and the relationships between predictors and the outcome) on the training subset only, so we can avoid bias in model selection.

str(training)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 3 levels "#DIV/0!","0.00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : Factor w/ 329 levels "-0.02438","-0.04190",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : Factor w/ 327 levels "-0.00484","-0.01311",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : Factor w/ 394 levels "-0.01548","-0.01749",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : Factor w/ 330 levels "-0.00051","-0.00696",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : Factor w/ 327 levels "-0.00184","-0.01185",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : Factor w/ 394 levels "-0.00311","-0.00562",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 397 levels "-0.0035","-0.0073",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : Factor w/ 400 levels "-0.0163","-0.0233",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : Factor w/ 400 levels "-0.0082","-0.0096",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : Factor w/ 401 levels "-0.0053","-0.0084",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

We can see that, of the 160 columns, many variables have a large number of missing values (NA). We might also want to check whether any of the variables (other than the outcome) have near-zero variance.
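
A quick count confirms how widespread the missingness is (a minimal sketch; the exact count is not reproduced here):

# Number of columns in which more than 90% of the values are NA
sum(colMeans(is.na(training)) > 0.9)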

Data Preprocessing

We will identify the predictors with near-zero variance and remove them. We will also remove the columns that contain missing values, since they would interfere with the modeling.

nzvariance<-nearZeroVar(training, saveMetrics = T)
toremove<-nzvariance[nzvariance[,"nzv"],]

print(toremove) ## check which variables will be removed
##                        freqRatio percentUnique zeroVar  nzv
## new_window              47.33005    0.01019264   FALSE TRUE
## kurtosis_yaw_belt        0.00000    0.00509632    TRUE TRUE
## skewness_yaw_belt        0.00000    0.00509632    TRUE TRUE
## amplitude_yaw_belt      32.00000    0.01528896   FALSE TRUE
## avg_roll_arm            77.00000    1.68178575   FALSE TRUE
## stddev_roll_arm         77.00000    1.68178575   FALSE TRUE
## var_roll_arm            77.00000    1.68178575   FALSE TRUE
## avg_pitch_arm           77.00000    1.68178575   FALSE TRUE
## stddev_pitch_arm        77.00000    1.68178575   FALSE TRUE
## var_pitch_arm           77.00000    1.68178575   FALSE TRUE
## avg_yaw_arm             77.00000    1.68178575   FALSE TRUE
## stddev_yaw_arm          80.00000    1.66649679   FALSE TRUE
## var_yaw_arm             80.00000    1.66649679   FALSE TRUE
## kurtosis_roll_arm       78.00000    1.67668943   FALSE TRUE
## kurtosis_picth_arm      80.00000    1.66649679   FALSE TRUE
## skewness_roll_arm       77.00000    1.68178575   FALSE TRUE
## skewness_pitch_arm      80.00000    1.66649679   FALSE TRUE
## max_roll_arm            25.66667    1.47793293   FALSE TRUE
## min_roll_arm            19.25000    1.41677709   FALSE TRUE
## min_pitch_arm           19.25000    1.47793293   FALSE TRUE
## amplitude_roll_arm      25.66667    1.55947406   FALSE TRUE
## amplitude_pitch_arm     20.00000    1.49831821   FALSE TRUE
## kurtosis_yaw_dumbbell    0.00000    0.00509632    TRUE TRUE
## skewness_yaw_dumbbell    0.00000    0.00509632    TRUE TRUE
## amplitude_yaw_dumbbell  80.20000    0.01019264   FALSE TRUE
## kurtosis_roll_forearm   42.00000    1.63591887   FALSE TRUE
## kurtosis_picth_forearm  85.00000    1.64101519   FALSE TRUE
## kurtosis_yaw_forearm     0.00000    0.00509632    TRUE TRUE
## skewness_roll_forearm   41.50000    1.64101519   FALSE TRUE
## skewness_pitch_forearm  21.25000    1.62062991   FALSE TRUE
## skewness_yaw_forearm     0.00000    0.00509632    TRUE TRUE
## max_roll_forearm        27.66667    1.38110284   FALSE TRUE
## min_roll_forearm        27.66667    1.37091020   FALSE TRUE
## amplitude_roll_forearm  20.75000    1.49322189   FALSE TRUE
## avg_roll_forearm        27.66667    1.64101519   FALSE TRUE
## stddev_roll_forearm     87.00000    1.63082255   FALSE TRUE
## var_roll_forearm        87.00000    1.63082255   FALSE TRUE
## avg_pitch_forearm       83.00000    1.65120783   FALSE TRUE
## stddev_pitch_forearm    41.50000    1.64611151   FALSE TRUE
## var_pitch_forearm       83.00000    1.65120783   FALSE TRUE
## avg_yaw_forearm         83.00000    1.65120783   FALSE TRUE
## stddev_yaw_forearm      85.00000    1.64101519   FALSE TRUE
## var_yaw_forearm         85.00000    1.64101519   FALSE TRUE
nrow(toremove) ## number of predictors to remove
## [1] 43
training <- training[,!nzvariance$nzv]
testing <- testing[,!nzvariance$nzv]
dim(training) ; dim(testing)
## [1] 19622   117
## [1]  20 117

There are now 116 remaining predictor variables (117 columns, the last of which is our outcome variable). The graph below shows the proportion of missing values among the remaining predictors; those with missing values will be removed.

# visualize variables with missing values
percentmissing<-colSums(is.na(training))/nrow(training)

qplot(y=percentmissing, x=index, data=data.frame(percentmissing=percentmissing, index=1:ncol(training)))

# Remove variables with missing values
nomissing<-colSums(is.na(training)) == 0

training_filter_na <- training[,nomissing]
testing_filter_na <- testing[,nomissing]

# Remove unnecessary columns
colRm_train <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","num_window")
colRm_test <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","num_window","problem_id")
newtraining <- training_filter_na[,!(names(training_filter_na) %in% colRm_train)]
newtesting <- testing_filter_na[,!(names(testing_filter_na) %in% colRm_test)]
dim(newtraining)
## [1] 19622    53
dim(newtesting)
## [1] 20 52
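
Before modeling, it is worth confirming that the predictor columns line up between the two cleaned sets (a quick sanity check, added here as a sketch):

# Sanity check: predictor names should match between the cleaned sets
# (newtraining has the extra outcome column "classe"; newtesting does not)
all(names(newtraining)[names(newtraining) != "classe"] == names(newtesting))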

The code below further splits the training set into two subsets: a training set (70%) and a validation set (30%).

set.seed(526)
inTrain <- createDataPartition(y=newtraining$classe, p=0.7, list=FALSE)
training_clean <- newtraining[inTrain,]
validation_clean <- newtraining[-inTrain,]
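
createDataPartition stratifies on the outcome, which we can verify directly (a minimal check, not part of the original analysis):

# The 70/30 split should preserve the class proportions in both subsets
round(prop.table(table(training_clean$classe)), 3)
round(prop.table(table(validation_clean$classe)), 3)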

We can use ggpairs to explore the training_clean data. We will explore the relationships of the predictors 'roll_belt', 'yaw_belt', 'pitch_forearm', 'pitch_belt', and 'magnet_dumbbell_z' with the outcome variable.

ggpairs(training_clean, columns = c('roll_belt', 'yaw_belt', 'pitch_forearm', 'pitch_belt' , 'magnet_dumbbell_z', 'classe'), aes(color=classe, alpha = 0.2))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Model Building

Since the outcome variable is categorical, we will consider Linear Discriminant Analysis (LDA) and a random forest (RF), and see whether the latter model's accuracy is superior.

Caveat: LDA carries strict assumptions similar to those of a linear regression model, e.g. no outliers and normally distributed predictors. One could instead use logistic regression (multinomial logistic regression in this case, since the outcome has more than 2 categories), but we use LDA here simply for comparison with RF, since in most cases RF tends to outperform linear modeling approaches.
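
For reference, a multinomial logistic baseline could be fit in the same caret framework. The sketch below assumes the nnet package, which caret's method = "multinom" wraps; it was not run for the results reported here.

set.seed(123)
# Sketch: multinomial logistic baseline via caret/nnet (not run here)
mnFit <- train(classe ~ ., method = "multinom", data = training_clean,
               trace = FALSE, trControl = trainControl(method = "cv", number = 4))
mn_validation_pred <- predict(mnFit, newdata = validation_clean)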

set.seed(123)
# Fit LDA model
ldaFit <- train(classe ~ ., method = "lda", data = training_clean, importance = T, trControl = trainControl(method = "cv", number = 4))

lda_validation_pred <- predict(ldaFit, newdata=validation_clean)

The code below fits a model using the random forest algorithm.

set.seed(123)
# Fit rf model
rfFit <- train(classe ~ ., method = "rf", data = training_clean, importance = T, trControl = trainControl(method = "cv", number = 4))
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_validation_pred <- predict(rfFit, newdata=validation_clean)
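
A random forest also carries an internal out-of-bag (OOB) error estimate that requires no held-out data. Printing the final model displays it, giving a quick cross-check on the validation-set error computed below (the exact numbers depend on the seed and are not reproduced here):

# The OOB error estimate gives an independent check on the validation error
print(rfFit$finalModel)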

Model Evaluation

We will now evaluate the models on the validation set to estimate the out-of-sample error rate. We see that the random forest model has far better accuracy: its estimated out-of-sample error is about 1%, compared to an error rate of about 30% for LDA.

# Check model performance
confusionMatrix(lda_validation_pred,validation_clean$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1366  185   89   63   45
##          B   41  706  121   52  183
##          C  148  140  680  115   85
##          D  114   49  108  694   97
##          E    5   59   28   40  672
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6997          
##                  95% CI : (0.6879, 0.7114)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6199          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8160   0.6198   0.6628   0.7199   0.6211
## Specificity            0.9093   0.9164   0.8996   0.9252   0.9725
## Pos Pred Value         0.7815   0.6401   0.5822   0.6535   0.8358
## Neg Pred Value         0.9255   0.9095   0.9266   0.9440   0.9193
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2321   0.1200   0.1155   0.1179   0.1142
## Detection Prevalence   0.2970   0.1874   0.1985   0.1805   0.1366
## Balanced Accuracy      0.8626   0.7681   0.7812   0.8226   0.7968
confusionMatrix(rf_validation_pred,validation_clean$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672   10    0    0    0
##          B    0 1127    7    0    0
##          C    1    2 1015    4    5
##          D    0    0    4  959    5
##          E    1    0    0    1 1072
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9932          
##                  95% CI : (0.9908, 0.9951)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9914          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9895   0.9893   0.9948   0.9908
## Specificity            0.9976   0.9985   0.9975   0.9982   0.9996
## Pos Pred Value         0.9941   0.9938   0.9883   0.9907   0.9981
## Neg Pred Value         0.9995   0.9975   0.9977   0.9990   0.9979
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1915   0.1725   0.1630   0.1822
## Detection Prevalence   0.2858   0.1927   0.1745   0.1645   0.1825
## Balanced Accuracy      0.9982   0.9940   0.9934   0.9965   0.9952
imp <- varImp(rfFit)$importance
varImpPlot(rfFit$finalModel, sort = TRUE, type = 1, pch = 19, col = 1, cex = 1, main = "Importance of the Predictors")

In the figure above, the top 5 most important variables are 'roll_belt', 'yaw_belt', 'pitch_forearm', 'pitch_belt', and 'magnet_dumbbell_z'.
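
The same ranking can be recovered programmatically from the importance scores stored in imp above (a sketch; for a multiclass forest, varImp returns one importance column per class, so we average across the classes):

# Top 5 predictors by mean class-wise importance
head(imp[order(-rowMeans(imp)), ], 5)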

We will now use the random forest model to predict the 20 test cases.

testing_pred <- predict(rfFit, newdata=newtesting)
testing_pred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
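
For the course submission, each prediction is uploaded as a single-character file. A small helper in the style commonly used for this assignment is sketched below (a hypothetical convenience function, not part of the original analysis):

# Hypothetical helper: write one submission file per test case
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(testing_pred))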

Conclusion

Using the random forest model, we can predict how well an individual performs weight lifting activities from the 52 predictors, with an estimated out-of-sample error of only about 1%.