Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
We’ll check if the data already exists locally and, if not, we download it.
# Get the training data
train_fileurl <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
if (!file.exists('pml-training.csv')){
download.file(train_fileurl,'./pml_training.csv')
}
train_df <- read.csv("pml_training.csv", na.strings=c("NA","#DIV/0!", ""))
# Get the test data
test_fileurl <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
if (!file.exists('pml-testing.csv')){
download.file(train_fileurl,'./pml_testing.csv')
}
test_df <- read.csv("pml_testing.csv", na.strings=c("NA","#DIV/0!", ""))
glimpse(train_df)
## Observations: 19,622
## Variables: 160
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ user_name <fct> carlitos, carlitos, carlitos, carlito...
## $ raw_timestamp_part_1 <int> 1323084231, 1323084231, 1323084231, 1...
## $ raw_timestamp_part_2 <int> 788290, 808298, 820366, 120339, 19632...
## $ cvtd_timestamp <fct> 05/12/2011 11:23, 05/12/2011 11:23, 0...
## $ new_window <fct> no, no, no, no, no, no, no, no, no, n...
## $ num_window <int> 11, 11, 11, 12, 12, 12, 12, 12, 12, 1...
## $ roll_belt <dbl> 1.41, 1.41, 1.42, 1.48, 1.48, 1.45, 1...
## $ pitch_belt <dbl> 8.07, 8.07, 8.07, 8.05, 8.07, 8.06, 8...
## $ yaw_belt <dbl> -94.4, -94.4, -94.4, -94.4, -94.4, -9...
## $ total_accel_belt <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ kurtosis_roll_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kurtosis_picth_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kurtosis_yaw_belt <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_roll_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_roll_belt.1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_yaw_belt <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_roll_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_picth_belt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_yaw_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_roll_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_pitch_belt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_yaw_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_roll_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_pitch_belt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_yaw_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_total_accel_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_roll_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_roll_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_roll_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_pitch_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_pitch_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_pitch_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_yaw_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_yaw_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_yaw_belt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ gyros_belt_x <dbl> 0.00, 0.02, 0.00, 0.02, 0.02, 0.02, 0...
## $ gyros_belt_y <dbl> 0.00, 0.00, 0.00, 0.00, 0.02, 0.00, 0...
## $ gyros_belt_z <dbl> -0.02, -0.02, -0.02, -0.03, -0.02, -0...
## $ accel_belt_x <int> -21, -22, -20, -22, -21, -21, -22, -2...
## $ accel_belt_y <int> 4, 4, 5, 3, 2, 4, 3, 4, 2, 4, 2, 2, 4...
## $ accel_belt_z <int> 22, 22, 23, 21, 24, 21, 21, 21, 24, 2...
## $ magnet_belt_x <int> -3, -7, -2, -6, -6, 0, -4, -2, 1, -3,...
## $ magnet_belt_y <int> 599, 608, 600, 604, 600, 603, 599, 60...
## $ magnet_belt_z <int> -313, -311, -305, -310, -302, -312, -...
## $ roll_arm <dbl> -128, -128, -128, -128, -128, -128, -...
## $ pitch_arm <dbl> 22.5, 22.5, 22.5, 22.1, 22.1, 22.0, 2...
## $ yaw_arm <dbl> -161, -161, -161, -161, -161, -161, -...
## $ total_accel_arm <int> 34, 34, 34, 34, 34, 34, 34, 34, 34, 3...
## $ var_accel_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_roll_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_roll_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_roll_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_pitch_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_pitch_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_pitch_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_yaw_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_yaw_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_yaw_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ gyros_arm_x <dbl> 0.00, 0.02, 0.02, 0.02, 0.00, 0.02, 0...
## $ gyros_arm_y <dbl> 0.00, -0.02, -0.02, -0.03, -0.03, -0....
## $ gyros_arm_z <dbl> -0.02, -0.02, -0.02, 0.02, 0.00, 0.00...
## $ accel_arm_x <int> -288, -290, -289, -289, -289, -289, -...
## $ accel_arm_y <int> 109, 110, 110, 111, 111, 111, 111, 11...
## $ accel_arm_z <int> -123, -125, -126, -123, -123, -122, -...
## $ magnet_arm_x <int> -368, -369, -368, -372, -374, -369, -...
## $ magnet_arm_y <int> 337, 337, 344, 344, 337, 342, 336, 33...
## $ magnet_arm_z <int> 516, 513, 513, 512, 506, 513, 509, 51...
## $ kurtosis_roll_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kurtosis_picth_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kurtosis_yaw_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_roll_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_pitch_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_yaw_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_roll_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_picth_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_yaw_arm <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_roll_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_pitch_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_yaw_arm <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_roll_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_pitch_arm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_yaw_arm <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ roll_dumbbell <dbl> 13.05217, 13.13074, 12.85075, 13.4312...
## $ pitch_dumbbell <dbl> -70.49400, -70.63751, -70.27812, -70....
## $ yaw_dumbbell <dbl> -84.87394, -84.71065, -85.14078, -84....
## $ kurtosis_roll_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kurtosis_picth_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kurtosis_yaw_dumbbell <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_roll_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_pitch_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_yaw_dumbbell <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_roll_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_picth_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_yaw_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_roll_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_pitch_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_yaw_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_roll_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_pitch_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_yaw_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ total_accel_dumbbell <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 3...
## $ var_accel_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_roll_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_roll_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_roll_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_pitch_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_pitch_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_pitch_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_yaw_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_yaw_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_yaw_dumbbell <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ gyros_dumbbell_x <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0...
## $ gyros_dumbbell_y <dbl> -0.02, -0.02, -0.02, -0.02, -0.02, -0...
## $ gyros_dumbbell_z <dbl> 0.00, 0.00, 0.00, -0.02, 0.00, 0.00, ...
## $ accel_dumbbell_x <int> -234, -233, -232, -232, -233, -234, -...
## $ accel_dumbbell_y <int> 47, 47, 46, 48, 48, 48, 47, 46, 47, 4...
## $ accel_dumbbell_z <int> -271, -269, -270, -269, -270, -269, -...
## $ magnet_dumbbell_x <int> -559, -555, -561, -552, -554, -558, -...
## $ magnet_dumbbell_y <int> 293, 296, 298, 303, 292, 294, 295, 30...
## $ magnet_dumbbell_z <dbl> -65, -64, -63, -60, -68, -66, -70, -7...
## $ roll_forearm <dbl> 28.4, 28.3, 28.3, 28.1, 28.0, 27.9, 2...
## $ pitch_forearm <dbl> -63.9, -63.9, -63.9, -63.9, -63.9, -6...
## $ yaw_forearm <dbl> -153, -153, -152, -152, -152, -152, -...
## $ kurtosis_roll_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kurtosis_picth_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ kurtosis_yaw_forearm <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_roll_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_pitch_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ skewness_yaw_forearm <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_roll_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_picth_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ max_yaw_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_roll_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_pitch_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ min_yaw_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_roll_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_pitch_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amplitude_yaw_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ total_accel_forearm <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 3...
## $ var_accel_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_roll_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_roll_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_roll_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_pitch_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_pitch_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_pitch_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ avg_yaw_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ stddev_yaw_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ var_yaw_forearm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ gyros_forearm_x <dbl> 0.03, 0.02, 0.03, 0.02, 0.02, 0.02, 0...
## $ gyros_forearm_y <dbl> 0.00, 0.00, -0.02, -0.02, 0.00, -0.02...
## $ gyros_forearm_z <dbl> -0.02, -0.02, 0.00, 0.00, -0.02, -0.0...
## $ accel_forearm_x <int> 192, 192, 196, 189, 189, 193, 195, 19...
## $ accel_forearm_y <int> 203, 203, 204, 206, 206, 203, 205, 20...
## $ accel_forearm_z <int> -215, -216, -213, -214, -214, -215, -...
## $ magnet_forearm_x <int> -17, -18, -18, -16, -17, -9, -18, -9,...
## $ magnet_forearm_y <dbl> 654, 661, 658, 658, 655, 660, 659, 66...
## $ magnet_forearm_z <dbl> 476, 473, 469, 469, 473, 478, 470, 47...
## $ classe <fct> A, A, A, A, A, A, A, A, A, A, A, A, A...
ggplot(train_df, aes(x = classe)) + geom_density()
There are way too many NAs in the data set. If we want any meaningful results we’ll have to do something about them. In this instance, I am getting rid of the variables, which contain all NAs. Also, 1st to 7th variables are irrelevant for this exersise, so we’ll delete them. There are also quite a few near-zero values, which we are getting rid of.
train_df <- train_df[,colSums(is.na(train_df)) == 0]
train_df <- train_df[,-c(1:7)]
test_df <- test_df[,colSums(is.na(test_df)) == 0]
test_df <- test_df[,-c(1:7)]
near_zero <- nearZeroVar(train_df, saveMetrics = TRUE)
train_df <- train_df[, !near_zero$nzv]
test_df <- test_df[, !near_zero$nzv]
Now we are going to partition the train_df into the training and validation parts.
test_split <- train_df %>%
initial_split(prop = 0.7, strata = "classe")
data_training <- training(test_split)
data_testing <- testing(test_split)
# Decision Tree
fit_decTree <- rpart(classe ~ .,
data = data_training,
method = "class")
rpart.plot(fit_decTree)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
# Random Forest
fit_RF <- train(classe ~ .,
data = data_training,
method = "rf",
trControl = trainControl(method = "cv", 5),
ntree = 250)
Let’s have a look at the results
pred_decTree <- predict(fit_decTree, data_testing, type = "class")
confusionMatrix(pred_decTree, data_testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1498 200 16 61 8
## B 42 736 86 83 92
## C 60 105 836 95 87
## D 66 50 69 647 90
## E 8 48 19 78 805
##
## Overall Statistics
##
## Accuracy : 0.7684
## 95% CI : (0.7574, 0.7791)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7065
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8949 0.6462 0.8148 0.6712 0.7440
## Specificity 0.9323 0.9362 0.9286 0.9441 0.9681
## Pos Pred Value 0.8402 0.7084 0.7067 0.7017 0.8403
## Neg Pred Value 0.9571 0.9168 0.9596 0.9361 0.9438
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2545 0.1251 0.1421 0.1099 0.1368
## Detection Prevalence 0.3030 0.1766 0.2010 0.1567 0.1628
## Balanced Accuracy 0.9136 0.7912 0.8717 0.8076 0.8561
pred_rf <- predict(fit_RF, data_testing, type = "raw")
confusionMatrix(pred_rf, data_testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 5 0 0 0
## B 0 1134 6 0 0
## C 1 0 1017 5 3
## D 0 0 3 957 5
## E 0 0 0 2 1074
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9927, 0.9966)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9956 0.9912 0.9927 0.9926
## Specificity 0.9988 0.9987 0.9981 0.9984 0.9996
## Pos Pred Value 0.9970 0.9947 0.9912 0.9917 0.9981
## Neg Pred Value 0.9998 0.9989 0.9981 0.9986 0.9983
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1927 0.1728 0.1626 0.1825
## Detection Prevalence 0.2851 0.1937 0.1743 0.1640 0.1828
## Balanced Accuracy 0.9991 0.9972 0.9947 0.9956 0.9961
Random Forest offers 0.99 accuracy against 0.74 on Decision Tree. Therefore, we choose Random Forest. I am not aware of a meaningful way to visualise what Random Forest has learned, but we can have a look at it’s accuracy.
fit_RF$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 250, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 250
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.71%
## Confusion matrix:
## A B C D E class.error
## A 3897 6 2 0 1 0.002304147
## B 20 2630 7 1 0 0.010534236
## C 0 10 2380 6 0 0.006677796
## D 0 0 29 2221 2 0.013765542
## E 0 2 6 6 2511 0.005544554
As we can see, the out-of-sample error is extremely small.
plot(fit_RF)