Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly quantify is how much of a particular activity they do, but they rarely quantify how well they do it.
In this analysis, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Only Class A corresponds to correct performance. The goal of this project is to predict the manner in which the participants did the exercise, i.e., Class A to E. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har.
We first load the R packages needed to perform the data analysis.
setwd("C:/Users/Ajay Manikandan/Desktop/Coursera/R programming/Machine learning")
library(caret); library(rattle); library(rpart); library(rpart.plot); library(randomForest); library(repmis)
## Loading required package: lattice
## Loading required package: ggplot2
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
Next, we obtain the training and testing datasets (the download URLs are shown in the commented code) and load them from the working directory.
#import the data from the URLs
#trainurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
#testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
#training <- source_data(trainurl, na.strings = c("NA", "#DIV/0!", ""), header = TRUE)
#testing <- source_data(testurl, na.strings = c("NA", "#DIV/0!", ""), header = TRUE)
#Load the dataset from the directory
training <- read.csv("pml-training.csv", na.strings = c("NA", ""))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", ""))
str(training)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_belt : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 3 levels "#DIV/0!","0.00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 329 levels "-0.02438","-0.04190",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_arm : Factor w/ 327 levels "-0.00484","-0.01311",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_arm : Factor w/ 394 levels "-0.01548","-0.01749",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_arm : Factor w/ 330 levels "-0.00051","-0.00696",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_arm : Factor w/ 327 levels "-0.00184","-0.01185",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_arm : Factor w/ 394 levels "-0.00311","-0.00562",..: NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 397 levels "-0.0035","-0.0073",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : Factor w/ 400 levels "-0.0163","-0.0233",..: NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : Factor w/ 400 levels "-0.0082","-0.0096",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : Factor w/ 401 levels "-0.0053","-0.0084",..: NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
Using the str function, we can see that the training dataset has 19622 observations of 160 variables, and the testing dataset contains 20 observations of 160 variables. We will use these datasets to predict the outcome variable classe.
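As a quick sanity check (a minimal sketch, not part of the original output), we can look at how the classe outcome is distributed across the training observations:
# distribution of the outcome variable in the training data
table(training$classe)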
We can also see that the dataset contains many columns consisting mostly of NA values, which are not useful for our analysis. Therefore, we clean the dataset by removing every column that contains NA values.
training <- training[, colSums(is.na(training)) == 0]
testing <- testing[, colSums(is.na(testing)) == 0]
We also remove the first 7 variables (row index, user name, timestamps, and window indicators), as they have no relevance to the prediction.
Datatrain <- training[, -c(1:7)]
Datatest <- testing[, -c(1:7)]
After cleaning the data, only 53 variables remain for our analysis (52 predictors plus the classe outcome).
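As a quick check (a minimal sketch, assuming the cleaning steps above have been run), we can confirm the dimensions of the cleaned datasets:
# dimensions after removing NA columns and the first 7 variables
dim(Datatrain); dim(Datatest)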
We split the cleaned training dataset into a training set (70%) and a validation set (30%). For reproducibility, we set a seed.
set.seed(1234)
intrain <- createDataPartition(Datatrain$classe, p = 0.7, list = FALSE)
train_new <- Datatrain[intrain, ]
valid_new <- Datatrain[-intrain, ]
We will fit three models: a decision tree, a random forest, and a generalized boosted model (gbm).
We use 5-fold cross-validation when training each algorithm.
control <- trainControl(method = "cv", number = 5)
fit_dt <- train(classe ~ . , data = train_new, method = "rpart",
trControl = control)
print(fit_dt, digits = 4)
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10988, 10991, 10990, 10989
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03550 0.5214 0.38010
## 0.06093 0.4175 0.21094
## 0.11738 0.3333 0.07467
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0355.
fancyRpartPlot(fit_dt$finalModel)
#predicting outcomes using validation dataset
pred_dt <- predict(fit_dt, valid_new)
#Prediction result
(conf_dt <- confusionMatrix(valid_new$classe, pred_dt))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1530 35 105 0 4
## B 486 379 274 0 0
## C 493 31 502 0 0
## D 452 164 348 0 0
## E 168 145 302 0 467
##
## Overall Statistics
##
## Accuracy : 0.489
## 95% CI : (0.4762, 0.5019)
## No Information Rate : 0.5317
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3311
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4890 0.5027 0.3279 NA 0.99151
## Specificity 0.9478 0.8519 0.8797 0.8362 0.88641
## Pos Pred Value 0.9140 0.3327 0.4893 NA 0.43161
## Neg Pred Value 0.6203 0.9210 0.7882 NA 0.99917
## Prevalence 0.5317 0.1281 0.2602 0.0000 0.08003
## Detection Rate 0.2600 0.0644 0.0853 0.0000 0.07935
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.18386
## Balanced Accuracy 0.7184 0.6773 0.6038 NA 0.93896
(accuracy_dt <- conf_dt$overall[1])
## Accuracy
## 0.4890399
From the confusion matrix above, we can see that the decision tree yields an accuracy of only about 0.49, which is very low.
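The estimated out-of-sample error is simply one minus the validation accuracy; a minimal sketch using the accuracy_dt object computed above:
# estimated out-of-sample error for the decision tree model
(1 - as.numeric(accuracy_dt))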
Next, we implement the random forest algorithm.
fit_rf <- train(classe ~., data = train_new, method = "rf",
trControl = control)
print(fit_rf, digits = 4)
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10991, 10989, 10989, 10990, 10989
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9913 0.9889
## 27 0.9919 0.9898
## 52 0.9876 0.9843
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
# Prediction of outcomes using validation dataset
pred_rf <- predict(fit_rf, valid_new)
(conf_rf <- confusionMatrix(valid_new$classe, pred_rf))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 12 1126 1 0 0
## C 0 3 1019 4 0
## D 0 1 5 957 1
## E 0 1 2 3 1076
##
## Overall Statistics
##
## Accuracy : 0.9944
## 95% CI : (0.9921, 0.9961)
## No Information Rate : 0.2865
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9929
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9929 0.9956 0.9922 0.9927 0.9991
## Specificity 1.0000 0.9973 0.9986 0.9986 0.9988
## Pos Pred Value 1.0000 0.9886 0.9932 0.9927 0.9945
## Neg Pred Value 0.9972 0.9989 0.9984 0.9986 0.9998
## Prevalence 0.2865 0.1922 0.1745 0.1638 0.1830
## Detection Rate 0.2845 0.1913 0.1732 0.1626 0.1828
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9964 0.9964 0.9954 0.9957 0.9989
(accuracy_rf <- conf_rf$overall[1])
## Accuracy
## 0.9943925
We can see that the random forest algorithm yields an accuracy of about 99.4% on the validation set.
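The estimated out-of-sample error is therefore below 1%. As an optional check (a minimal sketch; varImp is caret's variable-importance function and its output is not reproduced here), we can compute that error and inspect the most influential predictors:
# estimated out-of-sample error for the random forest model
(1 - as.numeric(accuracy_rf))
# ranked variable importance for the fitted random forest
varImp(fit_rf)
We now train the third model, a generalized boosted model (gbm), using the same cross-validation control.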
fit_gbm <- train(classe ~., data = train_new, method = "gbm",
trControl = control, verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
## Loading required package: plyr
pred_gbm <- predict(fit_gbm, valid_new)
(conf_gbm <- confusionMatrix(valid_new$classe, pred_gbm))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1652 13 3 5 1
## B 35 1072 27 5 0
## C 0 32 979 12 3
## D 0 6 26 924 8
## E 1 9 15 14 1043
##
## Overall Statistics
##
## Accuracy : 0.9635
## 95% CI : (0.9584, 0.9681)
## No Information Rate : 0.2868
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9538
## Mcnemar's Test P-Value : 6.385e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9787 0.9470 0.9324 0.9625 0.9886
## Specificity 0.9948 0.9859 0.9903 0.9919 0.9919
## Pos Pred Value 0.9869 0.9412 0.9542 0.9585 0.9640
## Neg Pred Value 0.9915 0.9874 0.9854 0.9927 0.9975
## Prevalence 0.2868 0.1924 0.1784 0.1631 0.1793
## Detection Rate 0.2807 0.1822 0.1664 0.1570 0.1772
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9867 0.9665 0.9613 0.9772 0.9903
(accuracy_gbm <- conf_gbm$overall[1])
## Accuracy
## 0.9634664
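Having fit all three models, we can collect their validation accuracies into one table (a minimal sketch using the accuracy objects computed above):
# side-by-side comparison of the three models' validation accuracies
data.frame(model = c("decision tree", "random forest", "gbm"),
           accuracy = as.numeric(c(accuracy_dt, accuracy_rf, accuracy_gbm)))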
Comparing the three machine learning algorithms, we can identify that the random forest is by far the best, with an accuracy of about 99.4%. Therefore, we use the random forest model to predict the classes of the 20 test cases.
(predict(fit_rf, testing))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E