The data consists of two parts: 1. a training set with 19622 observations and 160 variables, and 2. a test set with 20 observations and 160 variables.
Wearable devices usually quantify how much of an activity a person does, but rarely how well the person does it. In this dataset, readings from devices such as Fitbit, Jawbone Up and Nike FuelBand were mapped to five different ways of performing a particular activity. Along with the classe label, 159 other variables were recorded for the same activity, such as accelerometer, gyroscope and yaw measurements.
The objective of this project is to accurately identify which classe an observation belongs to based on the 159 other variables. Once a model is finalised, it is applied to the test data to identify the classe of each of the 20 test cases. This report performs an exploratory analysis of the given data, examines the correlations among the variables, and then fits three different models. It uses in-sample and out-of-sample accuracy to identify the best model, which is then used to predict the classe of each observation in the test set.
Loading the libraries that will be required for the analysis.
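The library() calls themselves are not echoed in the knit output; judging from the startup messages below, they were along these lines (a sketch; rpart is implied by the rpart() call later in the report and loads silently):
library(caret)         # also loads lattice and ggplot2
library(rpart)         # classification trees
library(gbm)           # gradient boosting machines
library(randomForest)  # random forests
library(rattle)        # fancyRpartPlot()
library(corrplot)      # correlation plots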
## Loading required package: lattice
## Loading required package: ggplot2
## Loaded gbm 2.1.5
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
## corrplot 0.84 loaded
The training and test data are downloaded from the location provided. The variables in the given data are shown below.
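The files are read below from a local copy. For reference, the download step, assuming the standard course URLs for this dataset, would look roughly like this:
# Sketch of the download step (URLs assumed to be the standard course links)
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(train_url, destfile = "pml-training.csv")
download.file(test_url,  destfile = "pml-testing.csv")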
training <- read.csv("C:/Users/achar/Documents/R/Predictive Machine Learning/pml-training.csv")
testing <- read.csv("C:/Users/achar/Documents/R/Predictive Machine Learning/pml-testing.csv")
names(training)
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt"
## [9] "pitch_belt" "yaw_belt"
## [11] "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt"
## [15] "skewness_roll_belt" "skewness_roll_belt.1"
## [17] "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
## [21] "min_roll_belt" "min_pitch_belt"
## [23] "min_yaw_belt" "amplitude_roll_belt"
## [25] "amplitude_pitch_belt" "amplitude_yaw_belt"
## [27] "var_total_accel_belt" "avg_roll_belt"
## [29] "stddev_roll_belt" "var_roll_belt"
## [31] "avg_pitch_belt" "stddev_pitch_belt"
## [33] "var_pitch_belt" "avg_yaw_belt"
## [35] "stddev_yaw_belt" "var_yaw_belt"
## [37] "gyros_belt_x" "gyros_belt_y"
## [39] "gyros_belt_z" "accel_belt_x"
## [41] "accel_belt_y" "accel_belt_z"
## [43] "magnet_belt_x" "magnet_belt_y"
## [45] "magnet_belt_z" "roll_arm"
## [47] "pitch_arm" "yaw_arm"
## [49] "total_accel_arm" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "gyros_arm_x"
## [61] "gyros_arm_y" "gyros_arm_z"
## [63] "accel_arm_x" "accel_arm_y"
## [65] "accel_arm_z" "magnet_arm_x"
## [67] "magnet_arm_y" "magnet_arm_z"
## [69] "kurtosis_roll_arm" "kurtosis_picth_arm"
## [71] "kurtosis_yaw_arm" "skewness_roll_arm"
## [73] "skewness_pitch_arm" "skewness_yaw_arm"
## [75] "max_roll_arm" "max_picth_arm"
## [77] "max_yaw_arm" "min_roll_arm"
## [79] "min_pitch_arm" "min_yaw_arm"
## [81] "amplitude_roll_arm" "amplitude_pitch_arm"
## [83] "amplitude_yaw_arm" "roll_dumbbell"
## [85] "pitch_dumbbell" "yaw_dumbbell"
## [87] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [89] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [91] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [93] "max_roll_dumbbell" "max_picth_dumbbell"
## [95] "max_yaw_dumbbell" "min_roll_dumbbell"
## [97] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [99] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell" "total_accel_dumbbell"
## [103] "var_accel_dumbbell" "avg_roll_dumbbell"
## [105] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [107] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [109] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [111] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [113] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [115] "gyros_dumbbell_z" "accel_dumbbell_x"
## [117] "accel_dumbbell_y" "accel_dumbbell_z"
## [119] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [121] "magnet_dumbbell_z" "roll_forearm"
## [123] "pitch_forearm" "yaw_forearm"
## [125] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [127] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [129] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [131] "max_roll_forearm" "max_picth_forearm"
## [133] "max_yaw_forearm" "min_roll_forearm"
## [135] "min_pitch_forearm" "min_yaw_forearm"
## [137] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [139] "amplitude_yaw_forearm" "total_accel_forearm"
## [141] "var_accel_forearm" "avg_roll_forearm"
## [143] "stddev_roll_forearm" "var_roll_forearm"
## [145] "avg_pitch_forearm" "stddev_pitch_forearm"
## [147] "var_pitch_forearm" "avg_yaw_forearm"
## [149] "stddev_yaw_forearm" "var_yaw_forearm"
## [151] "gyros_forearm_x" "gyros_forearm_y"
## [153] "gyros_forearm_z" "accel_forearm_x"
## [155] "accel_forearm_y" "accel_forearm_z"
## [157] "magnet_forearm_x" "magnet_forearm_y"
## [159] "magnet_forearm_z" "classe"
The first few lines of the data set are displayed below to get a sense of the data. We can clearly see that many of the variables contain NAs; such variables, if used as predictors, will produce errors and unreliable results.
head(training, 3)
## X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1 1 carlitos 1323084231 788290 05/12/2011 11:23
## 2 2 carlitos 1323084231 808298 05/12/2011 11:23
## 3 3 carlitos 1323084231 820366 05/12/2011 11:23
## new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt
## 1 no 11 1.41 8.07 -94.4 3
## 2 no 11 1.41 8.07 -94.4 3
## 3 no 11 1.42 8.07 -94.4 3
## kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt
## 1
## 2
## 3
## skewness_roll_belt skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## 1 NA
## 2 NA
## 3 NA
## max_picth_belt max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## amplitude_roll_belt amplitude_pitch_belt amplitude_yaw_belt
## 1 NA NA
## 2 NA NA
## 3 NA NA
## var_total_accel_belt avg_roll_belt stddev_roll_belt var_roll_belt
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## avg_pitch_belt stddev_pitch_belt var_pitch_belt avg_yaw_belt
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## stddev_yaw_belt var_yaw_belt gyros_belt_x gyros_belt_y gyros_belt_z
## 1 NA NA 0.00 0 -0.02
## 2 NA NA 0.02 0 -0.02
## 3 NA NA 0.00 0 -0.02
## accel_belt_x accel_belt_y accel_belt_z magnet_belt_x magnet_belt_y
## 1 -21 4 22 -3 599
## 2 -22 4 22 -7 608
## 3 -20 5 23 -2 600
## magnet_belt_z roll_arm pitch_arm yaw_arm total_accel_arm var_accel_arm
## 1 -313 -128 22.5 -161 34 NA
## 2 -311 -128 22.5 -161 34 NA
## 3 -305 -128 22.5 -161 34 NA
## avg_roll_arm stddev_roll_arm var_roll_arm avg_pitch_arm stddev_pitch_arm
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## var_pitch_arm avg_yaw_arm stddev_yaw_arm var_yaw_arm gyros_arm_x
## 1 NA NA NA NA 0.00
## 2 NA NA NA NA 0.02
## 3 NA NA NA NA 0.02
## gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y accel_arm_z magnet_arm_x
## 1 0.00 -0.02 -288 109 -123 -368
## 2 -0.02 -0.02 -290 110 -125 -369
## 3 -0.02 -0.02 -289 110 -126 -368
## magnet_arm_y magnet_arm_z kurtosis_roll_arm kurtosis_picth_arm
## 1 337 516
## 2 337 513
## 3 344 513
## kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm skewness_yaw_arm
## 1
## 2
## 3
## max_roll_arm max_picth_arm max_yaw_arm min_roll_arm min_pitch_arm
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## min_yaw_arm amplitude_roll_arm amplitude_pitch_arm amplitude_yaw_arm
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## roll_dumbbell pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## 1 13.05217 -70.49400 -84.87394
## 2 13.13074 -70.63751 -84.71065
## 3 12.85075 -70.27812 -85.14078
## kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## 1
## 2
## 3
## skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## 1 NA
## 2 NA
## 3 NA
## max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell min_pitch_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## min_yaw_dumbbell amplitude_roll_dumbbell amplitude_pitch_dumbbell
## 1 NA NA
## 2 NA NA
## 3 NA NA
## amplitude_yaw_dumbbell total_accel_dumbbell var_accel_dumbbell
## 1 37 NA
## 2 37 NA
## 3 37 NA
## avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x
## 1 NA NA NA 0
## 2 NA NA NA 0
## 3 NA NA NA 0
## gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 1 -0.02 0 -234 47
## 2 -0.02 0 -233 47
## 3 -0.02 0 -232 46
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z
## 1 -271 -559 293 -65
## 2 -269 -555 296 -64
## 3 -270 -561 298 -63
## roll_forearm pitch_forearm yaw_forearm kurtosis_roll_forearm
## 1 28.4 -63.9 -153
## 2 28.3 -63.9 -153
## 3 28.3 -63.9 -152
## kurtosis_picth_forearm kurtosis_yaw_forearm skewness_roll_forearm
## 1
## 2
## 3
## skewness_pitch_forearm skewness_yaw_forearm max_roll_forearm
## 1 NA
## 2 NA
## 3 NA
## max_picth_forearm max_yaw_forearm min_roll_forearm min_pitch_forearm
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 1 NA NA
## 2 NA NA
## 3 NA NA
## amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## 1 36 NA
## 2 36 NA
## 3 36 NA
## avg_roll_forearm stddev_roll_forearm var_roll_forearm avg_pitch_forearm
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x gyros_forearm_y
## 1 NA NA 0.03 0.00
## 2 NA NA 0.02 0.00
## 3 NA NA 0.03 -0.02
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1 -0.02 192 203 -215
## 2 -0.02 192 203 -216
## 3 0.00 196 204 -213
## magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1 -17 654 476 A
## 2 -18 661 473 A
## 3 -18 658 469 A
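A quick tabulation (not shown in the original output) makes the NA pattern easy to see: columns tend to be either complete or almost entirely NA.
# Count how many columns have each number of NAs
table(colSums(is.na(training)))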
A plot of the classe variable shows its different levels and how the observations are spread across them.
plot(training$classe, col = "green", main = "Training Set - Classe Data", xlab = "Classe Labels", ylab = "Number of Observations")
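Since the chunk above only produces a figure, an equivalent numeric summary of the class distribution can be obtained with:
# Numeric view of how the observations are spread across the five classes
table(training$classe)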
First, we keep only the variables with fewer than 20 NAs (out of 19622) in the training set, and those with fewer than 2 NAs (out of 20) in the test set. We then remove the first 7 columns (row index, user name, timestamps, new_window and num_window), since these cannot act as good predictors, and finally drop any variables with near-zero variance.
The cleaned training data is then split into training and validation sets in a 3:1 ratio. The validation set is used for cross-validation, i.e. to estimate the out-of-sample error of each fitted model.
newtrain <- training[, colSums(is.na(training)) < 20]   # drop columns dominated by NAs
newtrain <- newtrain[, -c(1:7)]                         # drop index, user name, timestamp and window columns
newtest <- testing[, colSums(is.na(testing)) < 2]       # same idea for the test set
newtest <- newtest[, -c(1:7)]
NZV <- nearZeroVar(newtrain)                            # identify near-zero-variance predictors
newtrain <- newtrain[, -NZV]
inTrain <- createDataPartition(newtrain$classe, p = 3/4)[[1]]   # 75/25 train/validation split
final_training <- newtrain[inTrain, ]
final_validate <- newtrain[-inTrain, ]
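A quick dimension check (not part of the original output) confirms the number of remaining variables and the 3:1 split:
# Rows and columns after cleaning and partitioning
dim(final_training); dim(final_validate); dim(newtest)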
After cleaning, only 53 variables (52 predictors plus classe) remain in the training and test data sets.
We now check that the variables in the final training data set are present in the test set as well.
check <- names(newtrain) == names(newtest)
sum(check)
## [1] 52
We see that 52 variable names match across the training, validation and test sets. The only mismatch is the classe variable, which appears as problem_id in the test set. Hence we are good to go.
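The mismatching column can be confirmed directly, for example:
# The one position where the names differ (classe vs problem_id)
names(newtrain)[!check]
names(newtest)[!check]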
We then examine the correlations among the predictors. The diagonal of the correlation matrix is set to zero because it only contains each variable's correlation with itself. The variables involved in correlations stronger than 0.8 are listed below.
M <- abs(cor(final_training[, -53]))            # absolute correlations between the 52 predictors
diag(M) <- 0                                    # ignore each variable's correlation with itself
highly_correlated <- findCorrelation(M, cutoff = 0.8)
names(final_training[highly_correlated])
## [1] "accel_belt_z" "roll_belt" "accel_belt_y"
## [4] "accel_dumbbell_z" "accel_belt_x" "pitch_belt"
## [7] "accel_dumbbell_x" "accel_arm_x" "magnet_arm_y"
## [10] "gyros_forearm_y" "gyros_dumbbell_x" "gyros_dumbbell_z"
## [13] "gyros_arm_x"
The correlation plot below visualises these relationships: the deeper colours indicate correlations close to -1 or 1. The diagonal is intentionally set to 0 because those entries are each variable's correlation with itself and would otherwise all equal 1.
N <- cor(final_training[, -53])                 # signed correlations this time
diag(N) <- 0
corrplot(N, order = "FPC", method = "color", type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
The out-of-sample error is defined as 1 minus the accuracy on the validation set. It estimates the error we expect when the trained model is applied to new data.
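In R this is a one-liner; as a sketch, for any of the confusionMatrix objects built on the validation set below (here called cm as a placeholder):
# Out-of-sample error estimate from a validation-set confusion matrix
oos_error <- function(cm) 1 - cm$overall["Accuracy"]
# e.g. oos_error(conf_mat_rf) after the random forest fit below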
We fit three different models on the training data and check the in-sample and out-of-sample accuracy of each below. I also tried other models, such as linear regression and a GLM, but did not include them because their accuracy was low and they added no value to the report. The three models are a Classification Tree (rpart), a Random Forest and a Gradient Boosting Machine (GBM).
set.seed(504)
mod_ct <- rpart(classe ~ ., data = final_training, method = "class")   # classification tree
fancyRpartPlot(mod_ct)
colnames(newtest)[53] <- "classe"   # rename problem_id so the test set columns match the training set
test_mod_ct <- predict(mod_ct, newdata = final_validate, type = "class")
conf_mat_ct <- confusionMatrix(test_mod_ct, final_validate$classe)
conf_mat_ct$table
## Reference
## Prediction A B C D E
## A 1224 155 27 45 27
## B 59 602 91 87 99
## C 31 83 624 125 96
## D 66 64 64 508 79
## E 15 45 49 39 600
conf_mat_ct$overall[1]
## Accuracy
## 0.7255302
The Classification Tree has an overall accuracy of about 72.6% on the validation set, so the out-of-sample error rate is about 27.4%.
set.seed(504)
fitControl <- trainControl(method = "cv", number = 3, verboseIter = FALSE)   # 3-fold cross-validation
mod_rf <- train(classe ~ ., data = final_training, method = "rf", trControl = fitControl)
confusionMatrix(mod_rf)   # in-sample (cross-validated) confusion matrix
## Cross-Validated (3 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 28.4 0.2 0.0 0.0 0.0
## B 0.0 19.1 0.2 0.0 0.0
## C 0.0 0.1 17.3 0.4 0.0
## D 0.0 0.0 0.0 16.0 0.0
## E 0.0 0.0 0.0 0.0 18.3
##
## Accuracy (average) : 0.9904
test_mod_rf <- predict(mod_rf, newdata = final_validate)
conf_mat_rf <- confusionMatrix(test_mod_rf, final_validate$classe)
conf_mat_rf$table
## Reference
## Prediction A B C D E
## A 1395 8 0 0 0
## B 0 941 4 0 0
## C 0 0 849 16 4
## D 0 0 2 783 0
## E 0 0 0 5 897
conf_mat_rf$overall[1]
## Accuracy
## 0.9920473
Random Forest has an in-sample (cross-validated) accuracy of 99.04% and an out-of-sample accuracy of 99.20%. The expected out-of-sample error is 1 minus the accuracy, or about 0.80%.
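For additional insight (not included in the original analysis), the predictors the Random Forest relies on most can be inspected with caret's variable-importance function:
# Top predictors by importance in the random forest model
varImp(mod_rf)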
set.seed(504)
mod_gbm <- train(classe ~ ., data = final_training, method = "gbm", trControl = fitControl, verbose = FALSE)
confusionMatrix(mod_gbm)
## Cross-Validated (3 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 27.9 0.8 0.0 0.0 0.0
## B 0.3 18.0 0.6 0.1 0.3
## C 0.1 0.5 16.6 0.5 0.1
## D 0.1 0.0 0.2 15.7 0.2
## E 0.0 0.0 0.0 0.1 17.7
##
## Accuracy (average) : 0.9593
test_mod_gbm <- predict(mod_gbm, newdata = final_validate)
conf_mat_gbm <- confusionMatrix(test_mod_gbm, final_validate$classe)
conf_mat_gbm$table
## Reference
## Prediction A B C D E
## A 1375 30 0 2 1
## B 10 898 28 6 9
## C 6 20 813 38 12
## D 3 1 12 752 5
## E 1 0 2 6 874
conf_mat_gbm$overall[1]
## Accuracy
## 0.9608483
GBM has an in-sample (cross-validated) accuracy of 95.93% and an out-of-sample accuracy of 96.08%. The out-of-sample error rate is 3.92%.
I also tried a stacked ensemble of the models, which took much longer to run but did not improve accuracy appreciably, so I dropped that idea and kept the Random Forest model, which already provided accurate results.
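The ensemble code itself is not shown; a sketch of the kind of stacking that could be tried with the three fitted models, using their validation-set predictions, is below (ideally the stacked model would then be evaluated on a separate hold-out set).
# Sketch of a stacked ensemble (assumed approach, not the original code)
stack_df <- data.frame(ct = test_mod_ct, rf = test_mod_rf, gbm = test_mod_gbm,
                       classe = final_validate$classe)
mod_stack  <- train(classe ~ ., data = stack_df, method = "rf", trControl = fitControl)
pred_stack <- predict(mod_stack, newdata = stack_df)
confusionMatrix(pred_stack, stack_df$classe)$overall[1]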
The predictions for the test set were calculated with the Random Forest model, which had the highest in-sample accuracy, the highest out-of-sample accuracy and the lowest out-of-sample error rate. It was therefore the best choice for predicting the classe of the 20 test cases. The predictions are given below; they are expected to be about 99.2% accurate, with an expected out-of-sample error rate of about 0.8%.
newtest$classe <- predict(mod_rf, newdata = newtest)
newtest$classe
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E