See the web version of this notebook in https://rpubs.com/ancazugo/weightliftingprediction

Data Cleaning

As a first step, it is necessary to load the packages and set the working directory.

setwd('./Practical_Machine_Learning/')
library(tidyverse) #Loads ggplot2, tidyr and dplyr
library(caret) #Machine learning algorithms
library(doParallel) #Parallel computation
library(knitr) #HTML rendering
library(kableExtra) #Table formatting

Next, we download the training and test datasets from the following links.

download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', destfile = 'pml-train.csv')
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', destfile = 'pml-test.csv')

Then we load them into R as data frames, treating empty strings, 'NA' and '#DIV/0!' entries as missing values.

train <- read.csv('pml-train.csv', na.strings=c('#DIV/0', '#DIV/0!', '', 'NA'))
test <- read.csv('pml-test.csv', na.strings=c('#DIV/0', '#DIV/0!', '', 'NA'))

First, some variables are going to be removed from both the training and test datasets, since they are of no particular value to this analysis (e.g. user_name, X). For the remaining variables, the proportion of missing values in each column will be calculated with dplyr, and every column with more than 95% of its values missing will be removed. Lastly, the new_window variable will be converted to numeric.

colsRemoved <- c('X', 'user_name', 'cvtd_timestamp', 'problem_id')

colsNA <- train %>%
    summarise_all(~ sum(is.na(.)) / nrow(train)) # proportion of NAs per column
for(col in 1:ncol(colsNA)) {
    if (colsNA[,col] > 0.95) {
        colsRemoved <- c(colsRemoved, colnames(colsNA)[col])
    }
}
new_windowNum <- c('no' = 0,'yes' = 1)
train$new_window <- new_windowNum[train$new_window]
test$new_window <- new_windowNum[test$new_window]

train <- train[, !(colnames(train) %in% colsRemoved)]
test <- test[, !(colnames(test) %in% colsRemoved)]

From the original 159 variables, only 56 are left (plus the outcome variable, classe).
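
As a quick sanity check (a minimal sketch, assuming the objects created above are still in the workspace), the remaining dimensions can be verified directly:

ncol(train) - 1                        # number of predictors left, excluding classe
any(colnames(train) %in% colsRemoved)  # should be FALSE after the subsetting above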

Exploratory Data Analysis

To preview the data, we are going to build a histogram for each variable using dplyr, tidyr and ggplot2.

train %>%
    select(-new_window, -classe) %>% 
    gather() %>% 
    ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the plot above it is possible to see several kinds of distributions, particularly uniform, bimodal, normal and Poisson-like. While I won't explain each variable, particularly because the names are hard to read in the plot, the reader can replicate this process and examine the distribution of every variable in the dataset.
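
To examine a single variable more closely, the same approach can be applied to one column at a time. The sketch below uses roll_belt purely as an illustrative choice and splits its histogram by classe:

ggplot(train, aes(roll_belt)) +
    geom_histogram(bins = 50) +   # fixed bin count instead of the default 30
    facet_wrap(~ classe)          # one panel per outcome class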

Then, I will build a summary table of the mean of each variable for each classe and new_window.

trainSummary <- train %>%
    group_by(classe, new_window) %>%
    summarise_all(mean)
kable(trainSummary) %>%
    kable_styling() %>%
    scroll_box(width = '100%')
(Summary table, 10 rows × 57 columns: the mean of every remaining predictor for each combination of classe and new_window. It is too wide to reproduce legibly here; see the scrollable table in the web version linked at the top.)

Prediction Model

As evidenced by the summary above, the variable to be predicted is classe, which classifies how correctly participants performed a set of 10 repetitions of a dumbbell biceps curl. This variable has five values describing their performance:

A: exactly according to the specification
B: throwing the elbows to the front
C: lifting the dumbbell only halfway
D: lowering the dumbbell only halfway
E: throwing the hips to the front
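
Before modelling, it is worth checking how the training observations are spread across these five classes; a minimal sketch using base R:

table(train$classe)                       # counts per class
round(prop.table(table(train$classe)), 3) # proportions per class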

However, many of the remaining variables are likely to be correlated; think of all those variables that measure the same metric but on different axes. To address this, a good approach is to perform a Principal Component Analysis (PCA) to reduce the number of variables. For this purpose I will use the prcomp() function. Then I will transpose the importance summary (which contains the cumulative proportion of variance explained) and turn it into a data frame that can be plotted with ggplot2.

trainPCA <- prcomp(train[, -57], scale. = T, center = T)
PCAsummary <- summary(trainPCA)$importance
PCAsummary <- data.frame(t(PCAsummary))
PCAsummary$PCA <- seq(1, nrow(PCAsummary))
ggplot(PCAsummary, aes(x = PCA, y = Cumulative.Proportion)) + geom_point() + theme_bw()

From the plot above it is possible to determine that around 40 PCs are necessary to explain 100% of the variance in the dataset; however, that would be close to the original number of variables in the clean training dataset. Therefore I will use the preProcess() function from the caret package to process the data, keeping only the PCs that explain up to 95% of the variance, which, as the summary above shows, is 27 PCs. Then I will predict the new variables (the principal component scores) that will be used to train the final model.
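
The same cut-off can also be read directly from the cumulative proportions computed earlier (a small sketch, assuming PCAsummary from the previous chunk is still in memory):

# smallest number of PCs whose cumulative proportion of variance reaches 95%
min(which(PCAsummary$Cumulative.Proportion >= 0.95))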

preProc <- preProcess(train[,-57], method = 'pca', thresh = 0.95)
trainPC <- predict(preProc, train[,-57])
trainPC$classe <- train$classe

After performing PCA on the dataset and adding the outcome variable, I will build a Random Forest prediction model using the PCs from the previous step. To reduce computation time, I will train the model in parallel using the functions detectCores(), makePSOCKcluster(), registerDoParallel() and stopCluster() from the doParallel package. Note: this step might take some time, depending on your machine.

ncores <- detectCores() - 1   # leave one core free for the rest of the system
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)        # caret uses this backend for the resampling loop
start_time <- Sys.time()
modelRF <- train(classe ~ ., method = 'rf', data = trainPC, list = F)
end_time <- Sys.time()
stopCluster(cl)
end_time - start_time
## Time difference of 13.71151 mins

Since an out-of-sample error cannot be calculated yet, I will calculate the in-sample error on the training dataset using the confusionMatrix() function.

trainPred <- predict(modelRF, trainPC[,-28])
confusionMatrix(table(trainPred, train$classe))
## Confusion Matrix and Statistics
## 
##          
## trainPred    A    B    C    D    E
##         A 5580    0    0    0    0
##         B    0 3797    0    0    0
##         C    0    0 3422    0    0
##         D    0    0    0 3216    0
##         E    0    0    0    0 3607
## 
## Overall Statistics
##                                                
##                Accuracy : 1                    
##                  95% CI : (0.9998, 1)          
##     No Information Rate : 0.2844               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 1                    
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2844   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

These results show that the model is highly accurate, sensitive and specific, although the metrics are based on predictions made on the same training dataset the model was fit to.
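
Because these figures come from predicting on the data the model was trained on, they are optimistic. As a rough cross-check, caret keeps the accuracy estimated on its bootstrap resamples during training; the exact values depend on the run, but they can be inspected with:

modelRF$results   # resampling estimates of Accuracy and Kappa per mtry value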

Finally, I compute the PCs for the test set and use them to predict the class with the Random Forest model.

testPC <- predict(preProc, test)
classPrediction <- predict(modelRF, testPC)
kable(data.frame(Case = 1:20, classPrediction), align = 'c') %>%
  kable_styling(full_width = F, fixed_thead = T) %>%
    scroll_box(height = '20%')
Case classPrediction
1 B
2 A
3 B
4 A
5 A
6 E
7 D
8 B
9 A
10 A
11 B
12 C
13 B
14 A
15 E
16 E
17 A
18 B
19 B
20 B

According to the Coursera quiz, these predictions are correct for all 20 test cases.