See the web version of this notebook in https://rpubs.com/ancazugo/weightliftingprediction

Data Cleaning

As a first step, it is necessary to load the packages and set the working directory.

setwd('./Practical_Machine_Learning/')
library(tidyverse) #Loads ggplot2, tidyr and dplyr
library(caret) #Machine learning algorithms
library(doParallel) #Parallel computation
library(knitr) #HTML rendering
library(kableExtra) #Table formatting

Next, we download the training and test datasets from the following links.

download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', destfile = 'pml-train.csv')
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', destfile = 'pml-test.csv')

Then we load them into R as data frames, treating empty strings, 'NA' and '#DIV/0!' entries as missing values.

train <- read.csv('pml-train.csv', na.strings=c('#DIV/0', '#DIV/0!', '', 'NA'))
test <- read.csv('pml-test.csv', na.strings=c('#DIV/0', '#DIV/0!', '', 'NA'))

First, some variables are going to be removed from both the training and test datasets, since they are of no particular value to this analysis (e.g. user_name, X). For the remaining variables, the proportion of missing values in each column will be calculated with dplyr, and every column with more than 95% of its values missing will be removed. Lastly, the new_window variable will be converted to numeric.

colsRemoved <- c('X', 'user_name', 'cvtd_timestamp', 'problem_id')

colsNA <- train %>%
    summarise_all(~ sum(is.na(.)) / nrow(train)) # proportion of NAs per column
for(col in 1:ncol(colsNA)) {
    if (colsNA[,col] > 0.95) {
        colsRemoved <- c(colsRemoved, colnames(colsNA)[col])
    }
}
new_windowNum <- c('no' = 0,'yes' = 1)
train$new_window <- new_windowNum[train$new_window]
test$new_window <- new_windowNum[test$new_window]

train <- train[, !(colnames(train) %in% colsRemoved)]
test <- test[, !(colnames(test) %in% colsRemoved)]

From the original 159 variables, only 56 are left (plus the outcome variable, classe).
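
As a quick sanity check (a minimal sketch, assuming the objects created above are still in the workspace), the remaining dimensions can be verified directly:

ncol(train) - 1                        # number of predictors left, excluding classe
any(colnames(train) %in% colsRemoved)  # should be FALSE after the subsetting above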

Exploratory Data Analysis

To preview the data, we are going to build a histogram for each variable using dplyr, tidyr and ggplot2.

train %>%
    select(-new_window, -classe) %>% 
    gather() %>% 
    ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the plot above it is possible to see several kinds of distributions, particularly uniform, bimodal, normal and Poisson-like. While I won't explain each variable, particularly because the names are hard to read in the plot, the reader can replicate this process and examine the distribution of every variable in the dataset.
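
To examine a single variable more closely, the same approach can be applied to one column at a time. The sketch below uses roll_belt purely as an illustrative choice and splits its histogram by classe:

ggplot(train, aes(roll_belt)) +
    geom_histogram(bins = 50) +   # fixed bin count instead of the default 30
    facet_wrap(~ classe)          # one panel per outcome class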

Then, I will build a summary table of the mean of each variable for each classe and new_window.

trainSummary <- train %>%
    group_by(classe, new_window) %>%
    summarise_all(mean)
kable(trainSummary) %>%
    kable_styling() %>%
    scroll_box(width = '100%')
(Summary table, 10 rows × 57 columns: the mean of every remaining predictor for each combination of classe and new_window. It is too wide to reproduce legibly here; see the scrollable table in the web version linked at the top.)

Prediction Model

As evidenced by the summary above, the variable to be predicted is classe, which classifies how correctly participants performed a set of 10 repetitions of a dumbbell biceps curl. This variable has five values describing their performance:

A: exactly according to the specification
B: throwing the elbows to the front
C: lifting the dumbbell only halfway
D: lowering the dumbbell only halfway
E: throwing the hips to the front
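
Before modelling, it is worth checking how the training observations are spread across these five classes; a minimal sketch using base R:

table(train$classe)                       # counts per class
round(prop.table(table(train$classe)), 3) # proportions per class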

However, many of the remaining variables are likely to be correlated; think of all those variables that measure the same metric but on different axes. To address this, a good approach is to perform a Principal Component Analysis (PCA) to reduce the number of variables. For this purpose I will use the prcomp() function. Then I will transpose the importance summary (which contains the cumulative proportion of variance explained) and turn it into a data frame that can be plotted with ggplot2.

trainPCA <- prcomp(train[, -57], scale. = T, center = T)
PCAsummary <- summary(trainPCA)$importance
PCAsummary <- data.frame(t(PCAsummary))
PCAsummary$PCA <- seq(1, nrow(PCAsummary))
ggplot(PCAsummary, aes(x = PCA, y = Cumulative.Proportion)) + geom_point() + theme_bw()

From the plot above it is possible to determine that around 40 PCs are necessary to explain 100% of the variance in the dataset; however, that would be close to the original number of variables in the clean training dataset. Therefore I will use the preProcess() function from the caret package to process the data, keeping only the PCs that explain up to 95% of the variance, which, as the summary above shows, is 27 PCs. Then I will predict the new variables (the principal component scores) that will be used to train the final model.
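
The same cut-off can also be read directly from the cumulative proportions computed earlier (a small sketch, assuming PCAsummary from the previous chunk is still in memory):

# smallest number of PCs whose cumulative proportion of variance reaches 95%
min(which(PCAsummary$Cumulative.Proportion >= 0.95))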

preProc <- preProcess(train[,-57], method = 'pca', thresh = 0.95)
trainPC <- predict(preProc, train[,-57])
trainPC$classe <- train$classe

After performing PCA on the dataset and adding the outcome variable, I will build a Random Forest prediction model using the PCs from the previous step. To reduce computation time, I will train the model in parallel using the functions detectCores(), makePSOCKcluster(), registerDoParallel() and stopCluster() from the doParallel package. Note: this step might take some time, depending on your machine.

ncores <- detectCores() - 1   # leave one core free for the rest of the system
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)        # caret uses this backend for the resampling loop
start_time <- Sys.time()
modelRF <- train(classe ~ ., method = 'rf', data = trainPC, list = F)
end_time <- Sys.time()
stopCluster(cl)
end_time - start_time
## Time difference of 13.71151 mins

Since an out-of-sample error cannot be calculated yet, I will calculate the in-sample error on the training dataset using the confusionMatrix() function.

trainPred <- predict(modelRF, trainPC[,-28])
confusionMatrix(table(trainPred, train$classe))
## Confusion Matrix and Statistics
## 
##          
## trainPred    A    B    C    D    E
##         A 5580    0    0    0    0
##         B    0 3797    0    0    0
##         C    0    0 3422    0    0
##         D    0    0    0 3216    0
##         E    0    0    0    0 3607
## 
## Overall Statistics
##                                                
##                Accuracy : 1                    
##                  95% CI : (0.9998, 1)          
##     No Information Rate : 0.2844               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 1                    
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

These results show that the model is highly accurate, sensitive and specific, although the metrics are based on predictions made on the same training dataset the model was fit to.
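
Because these figures come from predicting on the data the model was trained on, they are optimistic. As a rough cross-check, caret keeps the accuracy estimated on its bootstrap resamples during training; the exact values depend on the run, but they can be inspected with:

modelRF$results   # resampling estimates of Accuracy and Kappa per mtry value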

Finally, I compute the PCs for the test set and use them to predict the class with the Random Forest model.

testPC <- predict(preProc, test)
classPrediction <- predict(modelRF, testPC)
kable(data.frame(Case = 1:20, classPrediction), align = 'c') %>%
  kable_styling(full_width = F, fixed_thead = T) %>%
    scroll_box(height = '20%')
Case classPrediction
1 B
2 A
3 B
4 A
5 A
6 E
7 D
8 B
9 A
10 A
11 B
12 C
13 B
14 A
15 E
16 E
17 A
18 B
19 B
20 B

According to the Coursera quiz, these predictions are correct for all 20 test cases.