This project predicts the manner in which participants executed an exercise. The data was collected from accelerometers placed on the belt, forearm, arm, and dumbbell of six participants.
For a complete explanation: https://www.coursera.org/learn/practical-machine-learning/supplement/PvInj/course-project-instructions-read-first
This part covers the methods used to explore the data and choose the appropriate predictive algorithm.
Set a specific seed for experiment replication purposes.
set.seed(500)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rpart.plot)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(e1071)
Read both the training and testing data:
data_train <- read.csv("C:/MY FILES/Rdirectory/pml-training.csv")
data_test <- read.csv("C:/MY FILES/Rdirectory/pml-testing.csv")
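Note: assuming the usual encoding of this dataset, several summary columns contain the text markers "#DIV/0!" and empty strings rather than NA. A variant (not used here) would convert those markers to NA at read time, so the NA filter below would catch those columns as well:
# Hypothetical variant: treat error strings and blanks as NA on import
data_train_alt <- read.csv("C:/MY FILES/Rdirectory/pml-training.csv",
                           na.strings = c("NA", "#DIV/0!", ""))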
After inspecting the data structure and reading through the variables, we note that the first seven columns (row index, user name, timestamps, and window indicators) are identification metadata rather than sensor measurements, so they are removed from both sets:
data_train <- data_train[, -c(1:7)]
data_test <- data_test[, -c(1:7)]
dim(data_train)
## [1] 19622 153
dim(data_test)
## [1] 20 153
Both sets now have 153 variables, as expected: the original sets contained 160 variables each, so removing 7 columns leaves 153.
Careful inspection reveals that many columns consist mostly of NA entries.
colSums(is.na(data_train))
## roll_belt pitch_belt yaw_belt
## 0 0 0
## total_accel_belt kurtosis_roll_belt kurtosis_picth_belt
## 0 0 0
## kurtosis_yaw_belt skewness_roll_belt skewness_roll_belt.1
## 0 0 0
## skewness_yaw_belt max_roll_belt max_picth_belt
## 0 19216 19216
## max_yaw_belt min_roll_belt min_pitch_belt
## 0 19216 19216
## min_yaw_belt amplitude_roll_belt amplitude_pitch_belt
## 0 19216 19216
## amplitude_yaw_belt var_total_accel_belt avg_roll_belt
## 0 19216 19216
## stddev_roll_belt var_roll_belt avg_pitch_belt
## 19216 19216 19216
## stddev_pitch_belt var_pitch_belt avg_yaw_belt
## 19216 19216 19216
## stddev_yaw_belt var_yaw_belt gyros_belt_x
## 19216 19216 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z roll_arm
## 0 0 0
## pitch_arm yaw_arm total_accel_arm
## 0 0 0
## var_accel_arm avg_roll_arm stddev_roll_arm
## 19216 19216 19216
## var_roll_arm avg_pitch_arm stddev_pitch_arm
## 19216 19216 19216
## var_pitch_arm avg_yaw_arm stddev_yaw_arm
## 19216 19216 19216
## var_yaw_arm gyros_arm_x gyros_arm_y
## 19216 0 0
## gyros_arm_z accel_arm_x accel_arm_y
## 0 0 0
## accel_arm_z magnet_arm_x magnet_arm_y
## 0 0 0
## magnet_arm_z kurtosis_roll_arm kurtosis_picth_arm
## 0 0 0
## kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## 0 0 0
## skewness_yaw_arm max_roll_arm max_picth_arm
## 0 19216 19216
## max_yaw_arm min_roll_arm min_pitch_arm
## 19216 19216 19216
## min_yaw_arm amplitude_roll_arm amplitude_pitch_arm
## 19216 19216 19216
## amplitude_yaw_arm roll_dumbbell pitch_dumbbell
## 19216 0 0
## yaw_dumbbell kurtosis_roll_dumbbell kurtosis_picth_dumbbell
## 0 0 0
## kurtosis_yaw_dumbbell skewness_roll_dumbbell skewness_pitch_dumbbell
## 0 0 0
## skewness_yaw_dumbbell max_roll_dumbbell max_picth_dumbbell
## 0 19216 19216
## max_yaw_dumbbell min_roll_dumbbell min_pitch_dumbbell
## 0 19216 19216
## min_yaw_dumbbell amplitude_roll_dumbbell amplitude_pitch_dumbbell
## 0 19216 19216
## amplitude_yaw_dumbbell total_accel_dumbbell var_accel_dumbbell
## 0 0 19216
## avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 19216 19216 19216
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell
## 19216 19216 19216
## avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell
## 19216 19216 19216
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## 0 0 0
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 0 0 0
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z
## 0 0 0
## roll_forearm pitch_forearm yaw_forearm
## 0 0 0
## kurtosis_roll_forearm kurtosis_picth_forearm kurtosis_yaw_forearm
## 0 0 0
## skewness_roll_forearm skewness_pitch_forearm skewness_yaw_forearm
## 0 0 0
## max_roll_forearm max_picth_forearm max_yaw_forearm
## 19216 19216 0
## min_roll_forearm min_pitch_forearm min_yaw_forearm
## 19216 19216 0
## amplitude_roll_forearm amplitude_pitch_forearm amplitude_yaw_forearm
## 19216 19216 0
## total_accel_forearm var_accel_forearm avg_roll_forearm
## 0 19216 19216
## stddev_roll_forearm var_roll_forearm avg_pitch_forearm
## 19216 19216 19216
## stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 19216 19216 19216
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x
## 19216 19216 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z classe
## 0 0 0
Each affected column has 19,216 NA entries (out of 19,622 rows), so these columns carry almost no usable information.
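As a quick sanity check (a one-liner sketch, not part of the original output), we can tally the NA counts per column; every column should show either 0 or 19,216:
table(colSums(is.na(data_train)))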
The code below removes all columns containing NAs and checks that no NA entries remain.
data_train<- data_train[, colSums(is.na(data_train)) == 0]
any(colSums(is.na(data_train)) != 0)
## [1] FALSE
The same operations are applied to the testing data.
data_test<- data_test[, colSums(is.na(data_test)) == 0]
any(colSums(is.na(data_test)) != 0)
## [1] FALSE
At this point, the datasets have the following dimensions:
dim(data_train)
## [1] 19622 86
dim(data_test)
## [1] 20 53
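The training set keeps 86 columns while the test set keeps only 53. Assuming the usual encoding of this dataset, the gap arises because many summary columns in the training file store the text marker "#DIV/0!" or empty strings instead of NA, so the NA filter above does not catch them; in the test file the same columns are entirely NA and are dropped. The mismatched columns can be listed with base R (a quick sketch, not part of the original output):
setdiff(names(data_train), names(data_test))  # columns present only in the training set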
Datasets may also contain predictors whose values hardly change across observations. These variables, usually called near-zero variance predictors, are not informative and can therefore be deleted.
data_train <- data_train[, -nearZeroVar(data_train)]
dim(data_train)
## [1] 19622 53
The output above shows that 33 variables (86 − 53) were flagged as near-zero variance and removed, leaving the training set with 53 columns, matching the test set.
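For reference, nearZeroVar can also return its diagnostic metrics, which show why each variable was flagged. This sketch would need to run before the filtering step above, since data_train no longer contains the flagged columns (an optional sketch, not part of the original output):
nzv_metrics <- nearZeroVar(data_train, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])  # variables flagged as near-zero variance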
We now partition the training data into training (70%) and validation (30%) subsets.
data_part <- createDataPartition(data_train$classe, p = 0.7, list = FALSE)
data_train_p <- data_train[data_part, ]
data_valid_p <- data_train[-data_part, ]
nrow(data_train_p)
## [1] 13737
nrow(data_valid_p)
## [1] 5885
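createDataPartition performs a stratified split on classe, so the class proportions should be roughly the same in both subsets; a quick check (a sketch, not part of the original output):
round(prop.table(table(data_train_p$classe)), 3)
round(prop.table(table(data_valid_p$classe)), 3)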
Two models are constructed in this part: a decision tree and a random forest.
model_tree <- rpart(classe ~ ., data = data_train_p, method = "class")
fancyRpartPlot(model_tree)
As shown above, the large number of leaf nodes makes the tree hard to read; increasing the node size would only cause the labels to overlap.
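One workaround (a sketch with an arbitrary illustrative cp value, not a tuned one) is to prune the tree to its strongest splits purely for visualization:
# Coarser tree for display only; cp = 0.02 is illustrative
model_tree_small <- prune(model_tree, cp = 0.02)
fancyRpartPlot(model_tree_small)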
Let us now use the fitted model to predict the variable “classe” on the validation set.
model_tree_pred <- predict(model_tree, data_valid_p, type = "class")
confMat_tree <- confusionMatrix(model_tree_pred, factor(data_valid_p$classe))
confMat_tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1498 227 27 106 37
## B 37 630 44 21 63
## C 36 114 835 148 122
## D 49 92 61 615 57
## E 54 76 59 74 803
##
## Overall Statistics
##
## Accuracy : 0.7444
## 95% CI : (0.7331, 0.7555)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6755
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8949 0.5531 0.8138 0.6380 0.7421
## Specificity 0.9057 0.9652 0.9136 0.9474 0.9452
## Pos Pred Value 0.7905 0.7925 0.6653 0.7037 0.7533
## Neg Pred Value 0.9559 0.9000 0.9587 0.9304 0.9421
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2545 0.1071 0.1419 0.1045 0.1364
## Detection Prevalence 0.3220 0.1351 0.2133 0.1485 0.1811
## Balanced Accuracy 0.9003 0.7592 0.8637 0.7927 0.8437
The confusion matrix shows that the decision tree model predicted classe on the validation set with an accuracy of 0.7444 (about 74.4%), putting the estimated out-of-sample error at roughly 25.6%.
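The same estimate can be pulled directly from the confusionMatrix object (a one-line sketch, not part of the original output):
1 - confMat_tree$overall["Accuracy"]  # estimated out-of-sample error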
This section covers the construction of a random forest model; 3-fold cross-validation is used during training.
control_rf <- trainControl(method = "cv", number = 3, verboseIter = FALSE)
model_rf <- train(classe ~ ., data = data_train_p, method = "rf", trControl = control_rf)
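Training a random forest on roughly 13,700 rows can take several minutes. Once fitted, caret's varImp shows which predictors the forest leans on most (an optional inspection sketch, not part of the original output):
plot(varImp(model_rf), top = 10)  # top 10 predictors by importance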
Let us now use the fitted model to predict the variable “classe” on the validation set.
model_rf_pred <- predict(model_rf, data_valid_p)
confMat_rf <- confusionMatrix(model_rf_pred, factor(data_valid_p$classe))
confMat_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1669 1 0 0 0
## B 5 1136 2 1 1
## C 0 2 1021 15 2
## D 0 0 3 948 5
## E 0 0 0 0 1074
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.9913, 0.9956)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.992
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9974 0.9951 0.9834 0.9926
## Specificity 0.9998 0.9981 0.9961 0.9984 1.0000
## Pos Pred Value 0.9994 0.9921 0.9817 0.9916 1.0000
## Neg Pred Value 0.9988 0.9994 0.9990 0.9968 0.9983
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2836 0.1930 0.1735 0.1611 0.1825
## Detection Prevalence 0.2838 0.1946 0.1767 0.1624 0.1825
## Balanced Accuracy 0.9984 0.9977 0.9956 0.9909 0.9963
From the result above, only a few entries were misclassified: summing the off-diagonal counts gives 37 errors out of 5,885 validation rows. The accuracy is reported as 0.9937 (about 99.4%), putting the estimated out-of-sample error at roughly 0.6%.
The previous sections clearly show that the random forest algorithm produced the model with the highest accuracy (99.4% versus 74.4% for the decision tree). It is therefore logical to use this model to predict classe on the testing data set.
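For a side-by-side comparison, the validation accuracies can be collected from the two confusionMatrix objects (a small sketch, not part of the original output):
data.frame(model = c("decision tree", "random forest"),
           accuracy = c(confMat_tree$overall["Accuracy"],
                        confMat_rf$overall["Accuracy"]))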
pred_test <- predict(model_rf, data_test)
pred_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
table(pred_test)
## pred_test
## A B C D E
## 7 8 1 1 3
The data used in this project was obtained from this source: http://groupware.les.inf.puc-rio.br/har