This is the final project for the Practical Machine Learning class on coursera.com, one of the nine courses in the Data Science Specialization.
A common pattern in research is to quantify how much of a particular activity people do, but rarely how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how well they perform a weight lifting exercise.
Using the random forest technique, the model predicted the quality of the weight lifting with 99% accuracy, i.e. an estimated out-of-sample error of only about 1%, using 52 predictors. The top 5 most important predictors are roll_belt, yaw_belt, pitch_forearm, pitch_belt, and magnet_dumbbell_z. The random forest clearly outperforms Linear Discriminant Analysis.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
This project utilized the Weight Lifting Exercises Dataset from http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. The goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The following code loads the caret package, which will be used to build the models, along with ggplot2 and GGally for visualization, and then downloads the training and test sets.
library(caret)
## Warning: package 'caret' was built under R version 3.4.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.2
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 3.4.2
#download the data
Url1 <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
Url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./data/pml-training.csv"
testFile <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
dir.create("./data")
}
if (!file.exists(trainFile)) {
download.file(Url1, destfile=trainFile)
}
if (!file.exists(testFile)) {
download.file(Url2, destfile=testFile)
}
#load the data
training <- read.csv("./data/pml-training.csv",header=T,sep=",",na.strings=c(""," ","NA"))
testing <- read.csv("./data/pml-testing.csv",header=T,sep=",",na.strings=c(""," ","NA"))
dim(training) ; dim(testing)
## [1] 19622 160
## [1] 20 160
We see that there are 19,622 rows and 160 columns in the training set, while the testing set has only 20 rows with the same number of columns. The classe variable in the training set is the outcome to predict (the testing set instead contains a problem_id column).
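As a quick sanity check before any modeling, we can tabulate the distribution of the outcome classes. Output is omitted here; class A is the most frequent, consistent with the No Information Rate of 0.2845 reported in the confusion matrices later.
# Distribution of the outcome classes (output omitted)
table(training$classe)
prop.table(table(training$classe))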
The first thing we can do is quickly check the structure, just to see how the variables are formatted and whether there are signs of missing values (NA), since we will have to deal with them: either impute the values, remove the affected rows if they are few, or remove a variable entirely if it has too many missing values. Second, since the training data set is large enough to justify further splitting it into a training and a validation set, we will do all exploratory analyses (such as checking the distribution of each variable and the relationships between predictors and the outcome) on the training subset only, so we can avoid bias in model selection.
str(training)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 3 levels "#DIV/0!","0.00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 329 levels "-0.02438","-0.04190",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_arm : Factor w/ 327 levels "-0.00484","-0.01311",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_arm : Factor w/ 394 levels "-0.01548","-0.01749",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_arm : Factor w/ 330 levels "-0.00051","-0.00696",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_arm : Factor w/ 327 levels "-0.00184","-0.01185",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_arm : Factor w/ 394 levels "-0.00311","-0.00562",..: NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 397 levels "-0.0035","-0.0073",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : Factor w/ 400 levels "-0.0163","-0.0233",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : Factor w/ 400 levels "-0.0082","-0.0096",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : Factor w/ 401 levels "-0.0053","-0.0084",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
We can see that, of the 160 columns, many variables have a lot of missing values (NA). We also want to check whether any variables (except the outcome variable) have near-zero variance.
We will identify the predictors with near-zero variance and remove them. We will also remove the variables with missing values, since they would interfere with the modeling.
nzvariance<-nearZeroVar(training, saveMetrics = T)
toremove<-nzvariance[nzvariance[,"nzv"],]
print(toremove) ## check which variables will be removed
## freqRatio percentUnique zeroVar nzv
## new_window 47.33005 0.01019264 FALSE TRUE
## kurtosis_yaw_belt 0.00000 0.00509632 TRUE TRUE
## skewness_yaw_belt 0.00000 0.00509632 TRUE TRUE
## amplitude_yaw_belt 32.00000 0.01528896 FALSE TRUE
## avg_roll_arm 77.00000 1.68178575 FALSE TRUE
## stddev_roll_arm 77.00000 1.68178575 FALSE TRUE
## var_roll_arm 77.00000 1.68178575 FALSE TRUE
## avg_pitch_arm 77.00000 1.68178575 FALSE TRUE
## stddev_pitch_arm 77.00000 1.68178575 FALSE TRUE
## var_pitch_arm 77.00000 1.68178575 FALSE TRUE
## avg_yaw_arm 77.00000 1.68178575 FALSE TRUE
## stddev_yaw_arm 80.00000 1.66649679 FALSE TRUE
## var_yaw_arm 80.00000 1.66649679 FALSE TRUE
## kurtosis_roll_arm 78.00000 1.67668943 FALSE TRUE
## kurtosis_picth_arm 80.00000 1.66649679 FALSE TRUE
## skewness_roll_arm 77.00000 1.68178575 FALSE TRUE
## skewness_pitch_arm 80.00000 1.66649679 FALSE TRUE
## max_roll_arm 25.66667 1.47793293 FALSE TRUE
## min_roll_arm 19.25000 1.41677709 FALSE TRUE
## min_pitch_arm 19.25000 1.47793293 FALSE TRUE
## amplitude_roll_arm 25.66667 1.55947406 FALSE TRUE
## amplitude_pitch_arm 20.00000 1.49831821 FALSE TRUE
## kurtosis_yaw_dumbbell 0.00000 0.00509632 TRUE TRUE
## skewness_yaw_dumbbell 0.00000 0.00509632 TRUE TRUE
## amplitude_yaw_dumbbell 80.20000 0.01019264 FALSE TRUE
## kurtosis_roll_forearm 42.00000 1.63591887 FALSE TRUE
## kurtosis_picth_forearm 85.00000 1.64101519 FALSE TRUE
## kurtosis_yaw_forearm 0.00000 0.00509632 TRUE TRUE
## skewness_roll_forearm 41.50000 1.64101519 FALSE TRUE
## skewness_pitch_forearm 21.25000 1.62062991 FALSE TRUE
## skewness_yaw_forearm 0.00000 0.00509632 TRUE TRUE
## max_roll_forearm 27.66667 1.38110284 FALSE TRUE
## min_roll_forearm 27.66667 1.37091020 FALSE TRUE
## amplitude_roll_forearm 20.75000 1.49322189 FALSE TRUE
## avg_roll_forearm 27.66667 1.64101519 FALSE TRUE
## stddev_roll_forearm 87.00000 1.63082255 FALSE TRUE
## var_roll_forearm 87.00000 1.63082255 FALSE TRUE
## avg_pitch_forearm 83.00000 1.65120783 FALSE TRUE
## stddev_pitch_forearm 41.50000 1.64611151 FALSE TRUE
## var_pitch_forearm 83.00000 1.65120783 FALSE TRUE
## avg_yaw_forearm 83.00000 1.65120783 FALSE TRUE
## stddev_yaw_forearm 85.00000 1.64101519 FALSE TRUE
## var_yaw_forearm 85.00000 1.64101519 FALSE TRUE
nrow(toremove) ## number of predictors to remove
## [1] 43
training <- training[,!nzvariance$nzv]
testing <- testing[,!nzvariance$nzv]
dim(training) ; dim(testing)
## [1] 19622 117
## [1] 20 117
There are now 116 remaining predictor variables, excluding the last column, which is our outcome variable. The plot below shows the proportion of missing values in each remaining variable; the predictors with missing values will be removed.
# visualize variables with missing values
percentmissing<-colSums(is.na(training))/nrow(training)
qplot(y=percentmissing, x=index, data=data.frame(percentmissing=percentmissing, index=1:ncol(training)))
# Remove variables with missing values
nomissing<-colSums(is.na(training)) == 0
training_filter_na <- training[,nomissing]
testing_filter_na <- testing[,nomissing]
# Remove unnecessary columns
colRm_train <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","num_window")
colRm_test <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2","cvtd_timestamp","num_window","problem_id")
newtraining <- training_filter_na[,!(names(training_filter_na) %in% colRm_train)]
newtesting <- testing_filter_na[,!(names(testing_filter_na) %in% colRm_test)]
dim(newtraining)
## [1] 19622 53
dim(newtesting)
## [1] 20 52
The code below further splits the training set into two subsets: a training set and a validation set.
set.seed(526)
inTrain <- createDataPartition(y=newtraining$classe, p=0.7, list=FALSE)
training_clean <- newtraining[inTrain,]
validation_clean <- newtraining[-inTrain,]
We can use ggpairs to explore the training_clean data. We will explore the relationships of the predictors roll_belt, yaw_belt, pitch_forearm, pitch_belt, and magnet_dumbbell_z with the outcome variable.
ggpairs(training_clean, columns = c('roll_belt', 'yaw_belt', 'pitch_forearm', 'pitch_belt' , 'magnet_dumbbell_z', 'classe'), aes(color=classe, alpha = 0.2))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Since the outcome variable is categorical, we will consider Linear Discriminant Analysis (LDA) and random forest (RF) and see whether the latter model is more accurate.
Caveat: LDA makes strict assumptions similar to those of a linear regression model, e.g. no outliers and normally distributed data. One could instead use logistic regression (multinomial logistic regression in this case, since the outcome variable has more than 2 categories), but we use LDA simply as a linear baseline for comparison with RF, since in most cases RF tends to outperform linear modeling approaches.
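For completeness, a minimal sketch of that multinomial alternative (hypothetical; not evaluated in this report) could use caret's "multinom" method, which wraps nnet::multinom, with the same 4-fold cross-validation:
# Multinomial logistic baseline (hypothetical sketch; not run in this report)
multinomFit <- train(classe ~ ., method = "multinom", data = training_clean,
                     trControl = trainControl(method = "cv", number = 4),
                     trace = FALSE)
We first fit the LDA model: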
set.seed(123)
# Fit LDA model (the importance option is specific to random forests, so it is omitted here)
ldaFit <- train(classe ~ ., method = "lda", data = training_clean, trControl = trainControl(method = "cv", number = 4))
lda_validation_pred <- predict(ldaFit, newdata=validation_clean)
The code below fits a model based on the random forest algorithm.
set.seed(123)
# Fit rf model
rfFit <- train(classe ~ ., method = "rf", data = training_clean, importance = T, trControl = trainControl(method = "cv", number = 4))
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
rf_validation_pred <- predict(rfFit, newdata=validation_clean)
We will now evaluate the models on the validation set to estimate the out-of-sample error rate. We see that the random forest model has far better accuracy: its estimated out-of-sample error is about 1%, compared to an error rate of about 30% for LDA.
# Check model performance
confusionMatrix(lda_validation_pred,validation_clean$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1366 185 89 63 45
## B 41 706 121 52 183
## C 148 140 680 115 85
## D 114 49 108 694 97
## E 5 59 28 40 672
##
## Overall Statistics
##
## Accuracy : 0.6997
## 95% CI : (0.6879, 0.7114)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6199
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8160 0.6198 0.6628 0.7199 0.6211
## Specificity 0.9093 0.9164 0.8996 0.9252 0.9725
## Pos Pred Value 0.7815 0.6401 0.5822 0.6535 0.8358
## Neg Pred Value 0.9255 0.9095 0.9266 0.9440 0.9193
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2321 0.1200 0.1155 0.1179 0.1142
## Detection Prevalence 0.2970 0.1874 0.1985 0.1805 0.1366
## Balanced Accuracy 0.8626 0.7681 0.7812 0.8226 0.7968
confusionMatrix(rf_validation_pred,validation_clean$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 10 0 0 0
## B 0 1127 7 0 0
## C 1 2 1015 4 5
## D 0 0 4 959 5
## E 1 0 0 1 1072
##
## Overall Statistics
##
## Accuracy : 0.9932
## 95% CI : (0.9908, 0.9951)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9914
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9895 0.9893 0.9948 0.9908
## Specificity 0.9976 0.9985 0.9975 0.9982 0.9996
## Pos Pred Value 0.9941 0.9938 0.9883 0.9907 0.9981
## Neg Pred Value 0.9995 0.9975 0.9977 0.9990 0.9979
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1915 0.1725 0.1630 0.1822
## Detection Prevalence 0.2858 0.1927 0.1745 0.1645 0.1825
## Balanced Accuracy 0.9982 0.9940 0.9934 0.9965 0.9952
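For a quick numeric summary, the estimated out-of-sample error is simply one minus the validation-set accuracy shown above:
# Estimated out-of-sample error = 1 - validation accuracy
1 - confusionMatrix(lda_validation_pred, validation_clean$classe)$overall[["Accuracy"]] # about 0.30 for LDA
1 - confusionMatrix(rf_validation_pred, validation_clean$classe)$overall[["Accuracy"]]  # about 0.007 for RF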
imp <- varImp(rfFit)$importance
varImpPlot(rfFit$finalModel, sort = TRUE, type = 1, pch = 19, col = 1, cex = 1, main = "Importance of the Predictors")
In the figure above, the top 5 most important variables are roll_belt, yaw_belt, pitch_forearm, pitch_belt, and magnet_dumbbell_z.
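As a programmatic cross-check of the plot, the importance table computed above can be sorted directly. This is a small sketch, assuming imp holds one scaled importance column per class, as caret's varImp returns for random forest classification fits (output omitted):
# Top 5 predictors by mean importance across classes (output omitted)
head(imp[order(-rowMeans(imp)), , drop = FALSE], 5)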
We will now use the random forest model to predict classe for the testing set.
testing_pred <- predict(rfFit, newdata=newtesting)
testing_pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Using the random forest model, we can predict how well an individual performs weight lifting activities from the 52 predictors, with an estimated out-of-sample error of only about 1%.