Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
Here we set up the environment.
# packages we will need
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Loading required package: lattice
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
Here we download the data (if necessary) and then assign to data frames.
# declare the links
Link1 <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
Link2 <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# get the data and save
# download.file(url = Link1, destfile = "./pml-training.csv", method = "curl")
# download.file(url = Link2, destfile = "./pml-testing.csv", method = "curl")
# Load data sets
# training_DS <- read.csv("./pml-training.csv", na.strings=c("NA","#DIV/0!",""))
# testing_DS <- read.csv("./pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
# if already local use this
training_DS<- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
testing_DS<- read.csv("pml-testing.csv")
In this section of code, we take an exploratory look at the data to understand it’s names and size better.
dim(training_DS)
## [1] 19622 160
dim(testing_DS)
## [1] 20 160
names(training_DS)
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt"
## [9] "pitch_belt" "yaw_belt"
## [11] "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt"
## [15] "skewness_roll_belt" "skewness_roll_belt.1"
## [17] "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
## [21] "min_roll_belt" "min_pitch_belt"
## [23] "min_yaw_belt" "amplitude_roll_belt"
## [25] "amplitude_pitch_belt" "amplitude_yaw_belt"
## [27] "var_total_accel_belt" "avg_roll_belt"
## [29] "stddev_roll_belt" "var_roll_belt"
## [31] "avg_pitch_belt" "stddev_pitch_belt"
## [33] "var_pitch_belt" "avg_yaw_belt"
## [35] "stddev_yaw_belt" "var_yaw_belt"
## [37] "gyros_belt_x" "gyros_belt_y"
## [39] "gyros_belt_z" "accel_belt_x"
## [41] "accel_belt_y" "accel_belt_z"
## [43] "magnet_belt_x" "magnet_belt_y"
## [45] "magnet_belt_z" "roll_arm"
## [47] "pitch_arm" "yaw_arm"
## [49] "total_accel_arm" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "gyros_arm_x"
## [61] "gyros_arm_y" "gyros_arm_z"
## [63] "accel_arm_x" "accel_arm_y"
## [65] "accel_arm_z" "magnet_arm_x"
## [67] "magnet_arm_y" "magnet_arm_z"
## [69] "kurtosis_roll_arm" "kurtosis_picth_arm"
## [71] "kurtosis_yaw_arm" "skewness_roll_arm"
## [73] "skewness_pitch_arm" "skewness_yaw_arm"
## [75] "max_roll_arm" "max_picth_arm"
## [77] "max_yaw_arm" "min_roll_arm"
## [79] "min_pitch_arm" "min_yaw_arm"
## [81] "amplitude_roll_arm" "amplitude_pitch_arm"
## [83] "amplitude_yaw_arm" "roll_dumbbell"
## [85] "pitch_dumbbell" "yaw_dumbbell"
## [87] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [89] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [91] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [93] "max_roll_dumbbell" "max_picth_dumbbell"
## [95] "max_yaw_dumbbell" "min_roll_dumbbell"
## [97] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [99] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell" "total_accel_dumbbell"
## [103] "var_accel_dumbbell" "avg_roll_dumbbell"
## [105] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [107] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [109] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [111] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [113] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [115] "gyros_dumbbell_z" "accel_dumbbell_x"
## [117] "accel_dumbbell_y" "accel_dumbbell_z"
## [119] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [121] "magnet_dumbbell_z" "roll_forearm"
## [123] "pitch_forearm" "yaw_forearm"
## [125] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [127] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [129] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [131] "max_roll_forearm" "max_picth_forearm"
## [133] "max_yaw_forearm" "min_roll_forearm"
## [135] "min_pitch_forearm" "min_yaw_forearm"
## [137] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [139] "amplitude_yaw_forearm" "total_accel_forearm"
## [141] "var_accel_forearm" "avg_roll_forearm"
## [143] "stddev_roll_forearm" "var_roll_forearm"
## [145] "avg_pitch_forearm" "stddev_pitch_forearm"
## [147] "var_pitch_forearm" "avg_yaw_forearm"
## [149] "stddev_yaw_forearm" "var_yaw_forearm"
## [151] "gyros_forearm_x" "gyros_forearm_y"
## [153] "gyros_forearm_z" "accel_forearm_x"
## [155] "accel_forearm_y" "accel_forearm_z"
## [157] "magnet_forearm_x" "magnet_forearm_y"
## [159] "magnet_forearm_z" "classe"
names(testing_DS)
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt"
## [9] "pitch_belt" "yaw_belt"
## [11] "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt"
## [15] "skewness_roll_belt" "skewness_roll_belt.1"
## [17] "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
## [21] "min_roll_belt" "min_pitch_belt"
## [23] "min_yaw_belt" "amplitude_roll_belt"
## [25] "amplitude_pitch_belt" "amplitude_yaw_belt"
## [27] "var_total_accel_belt" "avg_roll_belt"
## [29] "stddev_roll_belt" "var_roll_belt"
## [31] "avg_pitch_belt" "stddev_pitch_belt"
## [33] "var_pitch_belt" "avg_yaw_belt"
## [35] "stddev_yaw_belt" "var_yaw_belt"
## [37] "gyros_belt_x" "gyros_belt_y"
## [39] "gyros_belt_z" "accel_belt_x"
## [41] "accel_belt_y" "accel_belt_z"
## [43] "magnet_belt_x" "magnet_belt_y"
## [45] "magnet_belt_z" "roll_arm"
## [47] "pitch_arm" "yaw_arm"
## [49] "total_accel_arm" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "gyros_arm_x"
## [61] "gyros_arm_y" "gyros_arm_z"
## [63] "accel_arm_x" "accel_arm_y"
## [65] "accel_arm_z" "magnet_arm_x"
## [67] "magnet_arm_y" "magnet_arm_z"
## [69] "kurtosis_roll_arm" "kurtosis_picth_arm"
## [71] "kurtosis_yaw_arm" "skewness_roll_arm"
## [73] "skewness_pitch_arm" "skewness_yaw_arm"
## [75] "max_roll_arm" "max_picth_arm"
## [77] "max_yaw_arm" "min_roll_arm"
## [79] "min_pitch_arm" "min_yaw_arm"
## [81] "amplitude_roll_arm" "amplitude_pitch_arm"
## [83] "amplitude_yaw_arm" "roll_dumbbell"
## [85] "pitch_dumbbell" "yaw_dumbbell"
## [87] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [89] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [91] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [93] "max_roll_dumbbell" "max_picth_dumbbell"
## [95] "max_yaw_dumbbell" "min_roll_dumbbell"
## [97] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [99] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell" "total_accel_dumbbell"
## [103] "var_accel_dumbbell" "avg_roll_dumbbell"
## [105] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [107] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [109] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [111] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [113] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [115] "gyros_dumbbell_z" "accel_dumbbell_x"
## [117] "accel_dumbbell_y" "accel_dumbbell_z"
## [119] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [121] "magnet_dumbbell_z" "roll_forearm"
## [123] "pitch_forearm" "yaw_forearm"
## [125] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [127] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [129] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [131] "max_roll_forearm" "max_picth_forearm"
## [133] "max_yaw_forearm" "min_roll_forearm"
## [135] "min_pitch_forearm" "min_yaw_forearm"
## [137] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [139] "amplitude_yaw_forearm" "total_accel_forearm"
## [141] "var_accel_forearm" "avg_roll_forearm"
## [143] "stddev_roll_forearm" "var_roll_forearm"
## [145] "avg_pitch_forearm" "stddev_pitch_forearm"
## [147] "var_pitch_forearm" "avg_yaw_forearm"
## [149] "stddev_yaw_forearm" "var_yaw_forearm"
## [151] "gyros_forearm_x" "gyros_forearm_y"
## [153] "gyros_forearm_z" "accel_forearm_x"
## [155] "accel_forearm_y" "accel_forearm_z"
## [157] "magnet_forearm_x" "magnet_forearm_y"
## [159] "magnet_forearm_z" "problem_id"
yikes!
Here we want the data set to be tidy, so we remove NA values, and remove features that are not in the testing data set. Additionally we see that the first 7 cols are not relevant to our report so we only need cols 8 to 59.
# what features are we dealing with?
features <- names(testing_DS[,colSums(is.na(testing_DS)) == 0])[8:59]
# Only use features used in testing cases.
training_DS <- training_DS[,c(features,"classe")]
testing_DS <- testing_DS[,c(features,"problem_id")]
# verify our work so far
dim(training_DS)
## [1] 19622 53
dim(testing_DS)
## [1] 20 53
As discussed in the lectures, it is good practices to partition the data. Here we are using 60% of the data as the initial training set and reserving 40%.
# set seed for reproducibility
set.seed(1)
# create partitions
part_info <- createDataPartition(training_DS$classe, p=0.6, list=FALSE)
training_set <- training_DS[part_info,]
testing_set <- training_DS[-part_info,]
# verify our work again
dim(training_set)
## [1] 11776 53
dim(testing_set)
## [1] 7846 53
nice.
We will build two models for our analysis, a Descision Tree and a Random Forest.
As Jeff discussed in the lecture, a decision tree is a good place to start but may not yield a high level of accuracy. We can expect an accuracy of about 70-80%.
# set seed for reproducibility
set.seed(1)
# build up the tree
model_tree <- rpart(classe ~ ., data = training_set, method="class",
control = rpart.control(method = "cv", number = 100))
# lets see it
fancyRpartPlot(model_tree)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
The confusion matrix will give us an accuracy result and an estimate at the 95% CI.
# set seed for reproducibility
set.seed(1)
# set up classe as a factor
testing_set$classe <- as.factor(testing_set$classe)
# analysis with the confusion matrix
prediction <- predict(model_tree, testing_set, type = "class")
confusionMatrix(prediction, testing_set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1983 223 20 50 21
## B 66 922 67 83 104
## C 60 149 1098 197 173
## D 74 97 92 775 102
## E 49 127 91 181 1042
##
## Overall Statistics
##
## Accuracy : 0.7418
## 95% CI : (0.7319, 0.7514)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6732
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8884 0.6074 0.8026 0.60264 0.7226
## Specificity 0.9441 0.9494 0.9106 0.94436 0.9300
## Pos Pred Value 0.8633 0.7424 0.6547 0.67982 0.6993
## Neg Pred Value 0.9551 0.9098 0.9562 0.92380 0.9371
## Prevalence 0.2845 0.1935 0.1744 0.16391 0.1838
## Detection Rate 0.2527 0.1175 0.1399 0.09878 0.1328
## Detection Prevalence 0.2928 0.1583 0.2137 0.14530 0.1899
## Balanced Accuracy 0.9163 0.7784 0.8566 0.77350 0.8263
The Accuracy is about 74%, not bad but we can do better.
As we learned in class, the Random forest should give a better accuracy.
# we need another library to do this
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# convert to factor so RF works!
training_set$classe <- as.factor(training_set$classe)
# set seed for reproducibility
set.seed(1)
# build up the forest now
model_random_forest <- randomForest(classe ~ ., data = training_set, ntree = 100)
plot(model_random_forest)
This looks promising and follows the pattern that Jeff highlighted in the lecture.
### Test the random forest model using the testing data set. The confusion matrix will give us an accuracy result and an estimate at the 95% CI.
prediction <- predict(model_random_forest, testing_set, type = "class")
confusionMatrix(prediction, testing_set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2229 4 0 0 0
## B 2 1507 20 0 0
## C 1 7 1346 8 1
## D 0 0 2 1277 6
## E 0 0 0 1 1435
##
## Overall Statistics
##
## Accuracy : 0.9934
## 95% CI : (0.9913, 0.995)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9916
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9987 0.9928 0.9839 0.9930 0.9951
## Specificity 0.9993 0.9965 0.9974 0.9988 0.9998
## Pos Pred Value 0.9982 0.9856 0.9875 0.9938 0.9993
## Neg Pred Value 0.9995 0.9983 0.9966 0.9986 0.9989
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1921 0.1716 0.1628 0.1829
## Detection Prevalence 0.2846 0.1949 0.1737 0.1638 0.1830
## Balanced Accuracy 0.9990 0.9946 0.9906 0.9959 0.9975
whoa! 99.34%
Now the real test! Can our models predict the actual results using the test data which consists of 20 obs.
prediction_model_tree <- predict(model_tree, testing_DS, type = "class")
prediction_model_tree
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A E D A C D A A A C E C A D E A B B B
## Levels: A B C D E
Not bad but…
prediction_random_forest <- predict(model_random_forest, testing_DS)
prediction_random_forest
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Nice!
The decision tree model has a moderate degree of accuracy. This can be seen in the confusion matrix where the accuracy is about 74%. This is reflected in actual test.
In contrast, the random tree model has a high degree of accuracy. This can be seen in the confusion matrix where the accuracy is over 99%, furthermore the model exhibits 95% CI: (0.9942, 0.9972). The 20 sample results based on the actual test confirms this.