See the web version of this notebook in https://rpubs.com/ancazugo/weightliftingprediction
As a first step, it is necessary to load the packages and set the working directory.
setwd('./Practical_Machine_Learning/')
library(tidyverse) #Loads ggplot2, tidyr and dplyr
library(caret) #Machine learning algorithms
library(doParallel) #Parallel computation
library(knitr) #HTML rendering
library(kableExtra) #Table formatting
Secondly, we download the training and test datasets from the following links:
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', destfile = 'pml-train.csv')
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', destfile = 'pml-test.csv')
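As a side note, a simple guard avoids re-downloading the files on every run; a minimal sketch, assuming the same file names as above:
# Only download the data if the files are not already present
if (!file.exists('pml-train.csv')) {
    download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', destfile = 'pml-train.csv')
}
if (!file.exists('pml-test.csv')) {
    download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', destfile = 'pml-test.csv')
}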
And then, load them into R as data frames.
train <- read.csv('pml-train.csv', na.strings=c('#DIV/0', '#DIV/0!', '', 'NA'))
test <- read.csv('pml-test.csv', na.strings=c('#DIV/0', '#DIV/0!', '', 'NA'))
First, some variables are going to be removed from both the training and test datasets, since they are of no particular value to this analysis (e.g. user_name, X). For the remaining variables, the proportion of NAs in each column will be calculated using dplyr, and those columns with more than 95% of their values missing will be removed. Lastly, the new_window variable will be converted to numeric.
colsRemoved <- c('X', 'user_name', 'cvtd_timestamp', 'problem_id')
colsNA <- train %>%
    summarise_all(~ sum(is.na(.)) / nrow(train))
for(col in 1:ncol(colsNA)) {
    if (colsNA[,col] > 0.95) {
        colsRemoved <- c(colsRemoved, colnames(colsNA)[col])
    }
}
new_windowNum <- c('no' = 0,'yes' = 1)
train$new_window <- new_windowNum[train$new_window]
test$new_window <- new_windowNum[test$new_window]
train <- train[, !(colnames(train) %in% colsRemoved)]
test <- test[, !(colnames(test) %in% colsRemoved)]
From the original dataset of 159 variables, only 56 are left.
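A quick dimension check confirms this (as a side note, the same NA proportions could have been computed in a single vectorized call, colMeans(is.na(train)), instead of the loop above):
dim(train) # 57 columns: the 56 remaining predictors plus the classe outcome
dim(test) # 56 columns: the same predictors, with problem_id already removed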
To preview the data, we are going to build a histogram for each variable using dplyr, tidyr and ggplot2.
train %>%
    select(-new_window, -classe) %>%
    gather() %>%
    ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the plot above it is possible to see several types of distribution, particularly uniform, bimodal, normal and Poisson. While I won't explain each variable, particularly because the names are not visible in the plot, the reader can replicate this process and see for themselves the distribution of the whole dataset.
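Since the facet labels are hard to read at this size, any single predictor can be inspected on its own; for example, roll_belt (chosen purely as an illustration):
# Closer look at the distribution of one predictor
ggplot(train, aes(roll_belt)) +
    geom_histogram(bins = 50) +
    theme_bw()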
Then, I will build a summary table of the mean of each variable for each combination of classe and new_window.
trainSummary <- train %>%
    group_by(classe, new_window) %>%
    summarise_all(~ mean(.))
kable(trainSummary) %>%
    kable_styling() %>%
    scroll_box(width = '100%')
| classe | new_window | raw_timestamp_part_1 | raw_timestamp_part_2 | num_window | roll_belt | pitch_belt | yaw_belt | total_accel_belt | gyros_belt_x | gyros_belt_y | gyros_belt_z | accel_belt_x | accel_belt_y | accel_belt_z | magnet_belt_x | magnet_belt_y | magnet_belt_z | roll_arm | pitch_arm | yaw_arm | total_accel_arm | gyros_arm_x | gyros_arm_y | gyros_arm_z | accel_arm_x | accel_arm_y | accel_arm_z | magnet_arm_x | magnet_arm_y | magnet_arm_z | roll_dumbbell | pitch_dumbbell | yaw_dumbbell | total_accel_dumbbell | gyros_dumbbell_x | gyros_dumbbell_y | gyros_dumbbell_z | accel_dumbbell_x | accel_dumbbell_y | accel_dumbbell_z | magnet_dumbbell_x | magnet_dumbbell_y | magnet_dumbbell_z | roll_forearm | pitch_forearm | yaw_forearm | total_accel_forearm | gyros_forearm_x | gyros_forearm_y | gyros_forearm_z | accel_forearm_x | accel_forearm_y | accel_forearm_z | magnet_forearm_x | magnet_forearm_y | magnet_forearm_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0 | 1322814395 | 489248.3 | 384.7618 | 59.19093 | 0.3512740 | -12.187125 | 10.65856 | -0.0048821 | 0.0404186 | -0.1211040 | -6.2467556 | 28.86310 | -62.42497 | 57.80113 | 602.2471 | -337.8660 | -0.5819795 | 3.3677225 | -11.744707 | 27.38311 | 0.0282051 | -0.2213471 | 0.2635551 | -132.58271 | 47.370499 | -75.62274 | -20.22318 | 235.87498 | 410.6953 | 21.65729 | -18.9735598 | 1.376628 | 14.65564 | 0.1302742 | 0.0336173 | -0.0933723 | -50.536099 | 52.71139 | -56.64029 | -387.9896 | 219.3120 | 10.880387 | 25.72065 | -6.968583 | 24.764078 | 32.13288 | 0.1736483 | 0.1204917 | 0.1609395 | -0.281667 | 170.2508 | -59.25589 | -196.2285 | 475.0866 | 407.4942 |
| A | 1 | 1322813012 | 970452.4 | 381.4771 | 67.32807 | 0.5849541 | -4.320367 | 11.66055 | -0.0069725 | 0.0438532 | -0.1349541 | -6.9908257 | 32.93578 | -75.67890 | 59.26606 | 599.7248 | -341.0826 | 0.2333028 | 4.1759633 | -12.311009 | 27.14679 | -0.0111927 | -0.2082569 | 0.3023853 | -110.83486 | 37.504587 | -62.30275 | 34.73394 | 222.20183 | 407.0917 | 18.06147 | -16.1238532 | 7.525688 | 14.52294 | 0.1621101 | 0.0060550 | -0.1821101 | -47.752294 | 48.09174 | -44.64220 | -361.7064 | 195.6789 | 8.091743 | 19.55899 | -1.805413 | 11.688073 | 31.60550 | 0.1333028 | -0.0977982 | 0.0307339 | -28.211009 | 151.8257 | -47.07339 | -212.3761 | 421.7890 | 380.5138 |
| B | 0 | 1322840157 | 482628.0 | 502.3303 | 65.23866 | -0.0271302 | -13.173844 | 11.14712 | -0.0048897 | 0.0425390 | -0.1349274 | -4.9343733 | 32.00592 | -73.72889 | 49.22808 | 599.3195 | -336.7937 | 31.9581388 | -6.6164228 | 7.825882 | 26.53496 | 0.0210328 | -0.2860570 | 0.2664524 | -41.97687 | 25.150350 | -95.90802 | 235.02663 | 129.50968 | 194.2399 | 35.74026 | 3.2903223 | 14.821075 | 14.36821 | 0.1661323 | 0.0173453 | -0.1430178 | -0.500269 | 69.02717 | -15.66272 | -248.0412 | 265.3722 | 49.009683 | 32.29136 | 14.675519 | 13.036191 | 35.36794 | 0.1436014 | 0.0978268 | 0.1788623 | -76.353685 | 137.4658 | -46.09736 | -325.2636 | 280.6132 | 377.8841 |
| B | 1 | 1322847240 | 978217.9 | 527.1013 | 61.60835 | 0.3565823 | -18.806203 | 10.73418 | -0.0017722 | 0.0421519 | -0.1275949 | -4.9367089 | 30.56962 | -67.55696 | 48.83544 | 602.1772 | -332.2278 | 34.3954430 | -3.9145570 | 11.539114 | 26.34177 | 0.1035443 | -0.2817722 | 0.2413924 | -58.83544 | 38.215190 | -82.45570 | 221.26582 | 161.44304 | 254.0759 | 35.64304 | -0.9797468 | 10.772152 | 15.15190 | 0.2031646 | -0.0550633 | -0.1231646 | -8.316456 | 73.55696 | -24.20253 | -268.3418 | 294.4810 | 38.898734 | 39.48443 | 13.301139 | 8.394557 | 34.24051 | 0.1169620 | 0.2348101 | 0.2218987 | -83.708861 | 108.7468 | -60.89873 | -323.4177 | 236.2785 | 385.8481 |
| C | 0 | 1322828694 | 488631.7 | 483.3490 | 64.73437 | -1.0731862 | -7.381876 | 11.17214 | -0.0148837 | 0.0393347 | -0.1351790 | -4.0229714 | 31.06921 | -70.95794 | 57.24165 | 599.8798 | -337.4379 | 25.2065931 | -1.6805489 | 3.976333 | 24.30758 | 0.1012142 | -0.2637649 | 0.2742721 | -77.61008 | 41.025060 | -53.77655 | 156.84994 | 189.67601 | 362.1483 | -13.36173 | -24.8806405 | -15.910832 | 12.90036 | 0.1921629 | 0.0546778 | -0.1517661 | -40.201969 | 30.43556 | -52.45495 | -370.2079 | 157.6002 | 62.901850 | 59.87649 | 12.336405 | 39.188461 | 34.93974 | 0.2082399 | 0.0636456 | 0.1408681 | -47.589499 | 212.7488 | -61.49702 | -334.9326 | 502.6381 | 460.8798 |
| C | 1 | 1322825623 | 975019.8 | 502.6286 | 68.63857 | 0.2075714 | -7.261143 | 11.74286 | -0.0167143 | 0.0471429 | -0.1312857 | -5.2857143 | 33.42857 | -76.90000 | 55.78571 | 602.6143 | -333.1429 | 14.2122857 | 0.5442857 | 1.983571 | 24.04286 | 0.1182857 | -0.2682857 | 0.2840000 | -99.05714 | 40.142857 | -45.61429 | 108.91429 | 215.35714 | 389.5857 | -23.39000 | -29.2871429 | -18.864286 | 13.84286 | 0.1932857 | 0.0840000 | -0.1875714 | -50.700000 | 27.18571 | -59.02857 | -392.4857 | 182.5000 | 50.371429 | 54.65943 | 13.685571 | 32.746571 | 33.70000 | 0.1105714 | 0.2668571 | 0.1281429 | -63.300000 | 190.3714 | -69.81429 | -346.1286 | 497.6429 | 493.2714 |
| D | 0 | 1322822391 | 495403.1 | 430.5459 | 60.93022 | 1.7649349 | -17.651605 | 11.23387 | -0.0143565 | 0.0362186 | -0.1375183 | -8.2113124 | 30.46298 | -69.38862 | 48.75755 | 594.2371 | -340.6174 | 22.7197204 | -10.4935621 | 5.195103 | 23.42262 | 0.0326819 | -0.2521258 | 0.2649221 | 15.85796 | 24.682873 | -48.24627 | 396.90372 | 96.64983 | 297.0232 | 50.30249 | -2.2790961 | 1.054527 | 11.33143 | 0.2024500 | 0.0137718 | -0.1324150 | -22.462345 | 53.10740 | -33.60439 | -316.6667 | 218.0918 | 56.706705 | 16.17097 | 28.014801 | 4.816276 | 36.08230 | 0.1242008 | -0.0042453 | 0.1142930 | -153.744836 | 152.4630 | -48.34318 | -456.0848 | 318.0601 | 362.2234 |
| D | 1 | 1322808126 | 975628.8 | 471.2319 | 57.40188 | 4.2385507 | -27.555217 | 11.01449 | -0.0318841 | 0.0415942 | -0.0930435 | -11.4057971 | 30.46377 | -62.55072 | 44.86957 | 596.4638 | -340.7971 | 18.9311594 | -8.9391304 | 3.164493 | 22.14493 | 0.1615942 | -0.2814493 | 0.2788406 | 12.52174 | 40.260870 | -29.30435 | 409.89855 | 114.62319 | 337.3478 | 45.51304 | -4.7971014 | 3.681159 | 12.60870 | 0.2923188 | -0.0018841 | -0.1363768 | -31.289855 | 58.52174 | -41.17391 | -307.6377 | 204.7826 | 55.507246 | 13.20609 | 30.639565 | -2.158261 | 35.42029 | 0.0444928 | 0.0507246 | 0.0843478 | -150.246377 | 149.3043 | -59.57971 | -458.5797 | 309.9275 | 355.5507 |
| E | 0 | 1322835719 | 499247.2 | 374.4104 | 73.97826 | 0.5724150 | -5.798892 | 12.64739 | 0.0091412 | 0.0378600 | -0.1302409 | -4.4492630 | 28.84212 | -90.95833 | 63.28713 | 568.0967 | -378.4212 | 20.4761593 | -12.5401559 | -1.865091 | 24.55754 | 0.0428741 | -0.2813407 | 0.2808362 | -18.78373 | 16.853742 | -76.44898 | 324.01531 | 83.58163 | 217.4036 | 26.72861 | -6.9057886 | 5.849583 | 14.42460 | 0.1340760 | 0.1163010 | -0.1418566 | -18.200680 | 55.85317 | -24.17347 | -291.6845 | 239.5663 | 72.770975 | 39.29533 | 16.741984 | 11.659025 | 36.64994 | 0.1332370 | 0.0625765 | 0.1540136 | -70.821145 | 145.6947 | -58.74291 | -328.6772 | 279.4447 | 351.6800 |
| E | 1 | 1322849484 | 960602.5 | 384.4051 | 84.85443 | -2.6078481 | 11.503038 | 13.86076 | 0.0284810 | 0.0483544 | -0.1444304 | 0.9493671 | 32.63291 | -106.05063 | 73.79747 | 564.2911 | -379.9620 | 12.0554430 | -12.5224051 | -0.088481 | 25.05063 | -0.1645570 | -0.2098734 | 0.2550633 | -16.34177 | -5.075949 | -92.79747 | 299.79747 | 79.15190 | 180.0253 | 29.73797 | -10.0126582 | -12.779747 | 14.79747 | 0.1102532 | 0.2065823 | -0.1318987 | -24.379747 | 57.58228 | -40.91139 | -312.2278 | 227.4684 | 57.860760 | 34.79506 | 14.283544 | 15.306962 | 37.51899 | 0.2767089 | 0.0168354 | 0.1432911 | -83.670886 | 164.5823 | -62.02532 | -358.6329 | 298.8101 | 412.4304 |
As evidenced by the summary above, the variable to be predicted is classe, which classifies participants according to their correctness in performing a dumbbell biceps curl 10 times. This variable has five values in regard to their performance:
- A: exactly according to the specification
- B: throwing the elbows to the front
- C: lifting the dumbbell only halfway
- D: lowering the dumbbell only halfway
- E: throwing the hips to the front
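Before reducing the dimensionality, it is worth checking how the observations are distributed across the five classes; a quick sketch:
# Absolute and relative class frequencies in the training set
table(train$classe)
prop.table(table(train$classe)) # class A is the most frequent (~28%)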
However, many of the remaining variables could still be related; think of all those variables that measure the same metric but on different axes. To address this, the best method is a Principal Component Analysis (PCA) to reduce the number of variables, for which I will use the prcomp() function. Then, I will transpose the importance summary (which contains the cumulative proportion of variance explained) and turn it into a data frame that can be plotted with ggplot2.
trainPCA <- prcomp(train[, -57], scale. = T, center = T)
PCAsummary <- summary(trainPCA)$importance
PCAsummary <- data.frame(t(PCAsummary))
PCAsummary$PCA <- seq(1, nrow(PCAsummary))
ggplot(PCAsummary, aes(x = PCA, y = Cumulative.Proportion)) + geom_point() + theme_bw()
From the plot above it is possible to determine that around 40 PCs are necessary to explain 100% of the variance in the dataset; however, that is close to the original number of variables in the clean training dataset. Therefore, I will use the preProcess() function from the caret package to process the data, keeping only the PCs that explain up to 95% of the variance, which, as the summary above shows, is 27 PCs. Then I will predict the new variables that will be used to train the final model.
preProc <- preProcess(train[,-57], method = 'pca', thresh = 0.95)
trainPC <- predict(preProc, train[,-57])
trainPC$classe <- train$classe
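A quick check of the preprocessed data confirms how many components were kept (the preProcess object stores this as numComp):
preProc$numComp # number of PCs needed to reach the 95% threshold (27 here)
dim(trainPC) # the 27 PCs plus the classe column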
After doing the PCA on the dataset and adding the outcome variable, I will build a Random Forest prediction model using the PCs from the previous step. To reduce the running time, I will train the model in parallel using the detectCores(), makePSOCKcluster(), registerDoParallel() and stopCluster() functions from the doParallel package. Note: this step might take some time, depending on your machine.
ncores <- detectCores() - 1
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
start_time <- Sys.time()
modelRF <- train(classe ~ ., method = 'rf', data = trainPC)
end_time <- Sys.time()
stopCluster(cl)
end_time - start_time
## Time difference of 13.71151 mins
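As a side note, an estimate of the out-of-sample error could have been built into the training step itself via k-fold cross-validation; a minimal sketch (the choice of 5 folds is arbitrary, and it increases the training time):
# Hypothetical alternative: estimate out-of-sample accuracy with 5-fold CV
ctrl <- trainControl(method = 'cv', number = 5, allowParallel = TRUE)
modelRFcv <- train(classe ~ ., method = 'rf', data = trainPC, trControl = ctrl)
modelRFcv$results # cross-validated accuracy for each mtry value tried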
Since the test set does not include the outcome, an out-of-sample error cannot be calculated directly; instead, I will calculate the in-sample error on the training dataset using the confusionMatrix() function.
trainPred <- predict(modelRF, trainPC[,-28])
confusionMatrix(table(trainPred, train$classe))
## Confusion Matrix and Statistics
##
##
## trainPred A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
From these results, the model appears highly accurate, sensitive and specific; however, these metrics are based on predictions for the training dataset itself, so they likely overestimate the out-of-sample performance.
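That said, random forests compute an internal out-of-bag (OOB) error while training, which is a more honest proxy for the out-of-sample error; it can be inspected by printing the underlying model:
modelRF$finalModel # the printed summary includes the OOB estimate of error rate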
Finally, I predict the PCs on the test set and use that outcome to predict the class based on the Random Forest model.
testPC <- predict(preProc, test)
classPrediction <- predict(modelRF, testPC)
kable(data.frame(Case = 1:20, classPrediction), align = 'c') %>%
    kable_styling(full_width = F, fixed_thead = T) %>%
    scroll_box(height = '20%')
| Case | classPrediction |
|---|---|
| 1 | B |
| 2 | A |
| 3 | B |
| 4 | A |
| 5 | A |
| 6 | E |
| 7 | D |
| 8 | B |
| 9 | A |
| 10 | A |
| 11 | B |
| 12 | C |
| 13 | B |
| 14 | A |
| 15 | E |
| 16 | E |
| 17 | A |
| 18 | B |
| 19 | B |
| 20 | B |
According to the quiz on Coursera, this is the correct outcome for all 20 cases.