Synopsis

The predominant current approach to preventing exercise injuries is to have a professional trainer observe athletes and give real-time feedback while they execute certain exercises. The objective of this work is to determine whether errors during the execution of a movement can be classified from motion traces recorded by on-body sensors. We built predictive models, primarily random forests, on the HAR Weight Lifting Exercises dataset and classified correct and erroneous executions of dumbbell lifts with high accuracy, sensitivity, and specificity.

Introduction

Six male participants, aged between 20 and 28 years, were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl with a 1.25 kg dumbbell in five different fashions: exactly according to the specified execution of the exercise (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Sensors mounted on the users’ glove, armband, lumbar belt, and dumbbell collected the Euler angles (roll, pitch, and yaw) as well as the raw accelerometer, gyroscope, and magnetometer readings. More information is available from the website http://groupware.les.inf.puc-rio.br/har.

Downloading the Data

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

training_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testing_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(training_url, destfile = "training.csv")
download.file(testing_url, destfile = "testing.csv")
dateDownloaded <- date()
dateDownloaded
## [1] "Tue Apr 05 07:09:58 2016"

Reading and Processing the Data

library(caret)
library(ggplot2)
training <- read.csv("training.csv", header = TRUE, na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = FALSE)
testing <- read.csv("testing.csv", header = TRUE, na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = FALSE)
sumVar_index <- grepl("^min|^max|^kurtosis|^skewness|^avg|^var|^stddev|^amplitude", names(training))### identifying summary variables
sumVar <- names(training)[!sumVar_index]
my_df <- training[, sumVar]### removing summary variables
all_na_index <- sapply(my_df, function(x) sum(is.na(x)))### counting missing values per remaining column
my_df2 <- my_df[, -c(1:7)]### removing housekeeping variables

The features of the data may be classified into measurement, summary, and housekeeping variables. The summary variables (beginning with: min, max, kurtosis, skewness, avg, stddev, and amplitude) apply summary statistics to the measurement variables (beginning with: roll, pitch, yaw, total, gyros, magnet, and accel). It would have been preferable to build our model on the summary variables, as they greatly reduce the number of observations and the processing time while still capturing the gist of the measurement variables. However, predictions based on the summary variables are impossible on the testing dataset because those columns contain only missing values there. We also remove the housekeeping variables, which contain the row numbers (x), timestamps (raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp), and measurement intervals (new_window and num_window).
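A quick way to confirm that the summary variables are unusable for prediction is to count their missing values in the downloaded testing set; a small check using the objects defined above:

sum_cols <- names(testing)[grepl("^min|^max|^kurtosis|^skewness|^avg|^var|^stddev|^amplitude", names(testing))]### summary columns in the testing set
all(colSums(is.na(testing[, sum_cols])) == nrow(testing))### TRUE if every summary value is missing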

The downloaded training dataset contains 160 variables and 19622 rows, while the testing dataset contains 160 variables and 20 rows.
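These dimensions can be verified directly:

dim(training)### 19622 rows, 160 columns
dim(testing)### 20 rows, 160 columns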

Setting the variables to their correct class

We need to set each variable to its correct class to avoid errors when fitting models.

### Convert the measurement variables to numeric and the outcome to a factor
measure_vars <- setdiff(names(my_df2), "classe")
my_df2[measure_vars] <- lapply(my_df2[measure_vars], as.numeric)
my_df2$classe <- as.factor(my_df2$classe)
### Checking for variables that contain only zeroes
all_zero_index <- sapply(my_df2[, -53], sum)
all_zero_vars <- which(all_zero_index == 0)

Creating a Training Set and Two Test Sets

We partition the data into a training set, a first test set, and a second test set containing 60%, 30%, and 10% of the observations, respectively. The testing dataset downloaded earlier serves as a final validation of the model’s predictions.

library(caret)
set.seed(107)
intrain <- createDataPartition(y = my_df2$classe, p = 0.6, list = FALSE)
train_set <- my_df2[intrain, ]
validation_set <- my_df2[-intrain, ]
intrain2 <- createDataPartition(y = validation_set$classe, p = 0.75, list = FALSE)
test_set1 <- validation_set[intrain2, ]
test_set2 <- validation_set[-intrain2, ]

The final training dataset contains 53 variables and 11776 rows. The first test set has 53 variables and 5886 rows. The second test set has 53 variables and 1960 rows.

Model Creation

We fit a random forest model on the training set using the caret package with the rf (randomForest) method. The variable classe is our dependent variable: it records whether the movement was performed correctly and, if not, which error was committed, as discussed earlier. We use 5-fold cross-validation, repeated twice, to tune the model.

ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
rfor_fitall = train(classe ~ ., data=train_set, method="rf", trControl=ctrl)

Assessing Model Accuracy

We examine the model’s resampling accuracy and find it to be very high.

library(knitr)
print(kable(rfor_fitall$results))
mtry Accuracy Kappa AccuracySD KappaSD
2 0.9890880 0.9861952 0.0016867 0.0021348
27 0.9891732 0.9863023 0.0034669 0.0043881
52 0.9816579 0.9767923 0.0048224 0.0061066
pred_Vset <- predict(rfor_fitall, newdata = test_set1)
out_of_SampleErr <- table(pred_Vset, test_set1$classe)
Model_accuracy <- confusionMatrix(pred_Vset, test_set1$classe)

The table below shows which training rows the final forest classified correctly and which it did not; the non-diagonal elements are the errors. It misclassified only 91 out of 11,776 rows, for an overall misclassification rate of 0.77% and an overall accuracy of 99.23%. The misclassification rate for each class appears in the last column.

in_SamplErr <- rfor_fitall$finalModel$confusion
print(kable(in_SamplErr))
A B C D E class.error
A 3344 2 2 0 0 0.0011947
B 16 2255 8 0 0 0.0105309
C 0 16 2031 7 0 0.0111977
D 0 1 24 1903 2 0.0139896
E 0 3 3 7 2152 0.0060046
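The 91 misclassifications and the 0.77% rate can be recomputed directly from the confusion matrix above; a short check assuming in_SamplErr as defined above (its last column, class.error, is excluded from the count):

counts <- in_SamplErr[, 1:5]### drop the class.error column
miscl <- sum(counts) - sum(diag(counts))### 91 off-diagonal (misclassified) rows
miscl / sum(counts)### about 0.0077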

The prediction on the first test set was 99.41% accurate: the model misclassified only 35 out of 5,886 rows, for an overall misclassification rate of 0.59%.

print(kable(out_of_SampleErr))
A B C D E
A 1673 5 0 0 0
B 1 1130 7 0 0
C 0 4 1017 9 3
D 0 0 2 955 3
E 0 0 0 1 1076
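As a cross-check on the 99.41% figure, the overall accuracy on the first test set can be read straight from the confusionMatrix object computed earlier:

Model_accuracy$overall["Accuracy"]### about 0.9941
1 - Model_accuracy$overall["Accuracy"]### about 0.0059, the misclassification rate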

The confusion matrix below summarizes the accuracy, sensitivity, specificity, and other per-class measures of our model’s predictions on the first test set.

print(kable(Model_accuracy$byClass))
Sensitivity Specificity Pos Pred Value Neg Pred Value Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9994026 0.9988129 0.9970203 0.9997624 0.2844037 0.2842338 0.2850832 0.9991078
Class: B 0.9920983 0.9983147 0.9929701 0.9981045 0.1935100 0.1919810 0.1933401 0.9952065
Class: C 0.9912281 0.9967078 0.9845111 0.9981455 0.1743119 0.1727829 0.1755012 0.9939679
Class: D 0.9896373 0.9989839 0.9947917 0.9979700 0.1639484 0.1622494 0.1630989 0.9943106
Class: E 0.9944547 0.9997918 0.9990715 0.9987523 0.1838260 0.1828067 0.1829766 0.9971233

The plot below shows the relationship between the number of randomly selected predictors and the accuracy. Accuracy is highest when mtry, the number of variables available for splitting at each tree node, is 27. mtry is the tuning parameter for the rf method in caret.

plot(rfor_fitall)
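The value of mtry that caret finally selected is stored in the fitted object; a one-line check:

rfor_fitall$bestTune### mtry = 27 for this fit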

Reducing the Number of Features

We now check which features are important to the model, so that we can reduce the number of features, shorten processing time, and improve scalability and interpretability.

varImpPlot(rfor_fitall$finalModel, n.var = 27)
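caret offers a model-agnostic view of the same information through varImp; a brief alternative sketch (varImp and its plot method are caret functions):

plot(varImp(rfor_fitall), top = 20)### scaled importance of the 20 most influential predictors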

We compare the more important features with those that are highly correlated and decide which features to keep; a brief sketch that combines the two criteria follows the list of correlated variables below.

Identifying Variables with High Correlation

cor_mat <- cor(train_set[,-53])### correlation matrix of the 52 predictors
Cor_Sum <- summary(cor_mat[upper.tri(cor_mat)])
highcor <- findCorrelation(cor_mat, cutoff = .75)### columns with pairwise correlation above 0.75
highcor_Vars <- as.data.frame(names(train_set)[highcor])
print(kable(highcor_Vars))

names(train_set)[highcor]

accel_belt_z
roll_belt
accel_belt_y
accel_arm_y
total_accel_belt
accel_dumbbell_z
accel_belt_x
pitch_belt
magnet_dumbbell_x
accel_dumbbell_y
magnet_dumbbell_y
accel_dumbbell_x
accel_arm_x
accel_arm_z
magnet_arm_y
magnet_belt_z
accel_forearm_y
gyros_arm_x
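As mentioned above, one way to combine the two criteria is to rank the predictors by random-forest importance and then drop those flagged as highly correlated; a rough sketch using the objects already computed (imp, imp_ranked, and keep are illustrative names, and the reduced feature sets used below were chosen by inspection rather than by this exact rule):

imp <- importance(rfor_fitall$finalModel)### Gini importance from the fitted forest
imp_ranked <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing = TRUE)]
keep <- setdiff(imp_ranked, names(train_set)[highcor])### important predictors that are not highly correlated
head(keep, 20)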

Can we do just as well with 20 features?

ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
rfor_fit20 = train(classe ~ yaw_belt + pitch_forearm + magnet_dumbbell_z + pitch_belt + magnet_belt_y + gyros_belt_z + magnet_belt_x + gyros_arm_y + gyros_dumbbell_y + yaw_arm + accel_belt_z + accel_dumbbell_z + accel_dumbbell_y + gyros_forearm_y + accel_forearm_x + gyros_belt_x + magnet_arm_z + gyros_dumbbell_z + magnet_belt_z + magnet_dumbbell_y, data=train_set, method="rf", trControl=ctrl)
pred_Vset20 <- predict(rfor_fit20, newdata = test_set1)
Model_accuracy20 <- confusionMatrix(pred_Vset20, test_set1$classe)
print(kable(Model_accuracy20$byClass))
Sensitivity Specificity Pos Pred Value Neg Pred Value Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9970131 0.9985755 0.9964179 0.9988126 0.2844037 0.2835542 0.2845736 0.9977943
Class: B 0.9903424 0.9972614 0.9886065 0.9976818 0.1935100 0.1916412 0.1938498 0.9938019
Class: C 0.9863548 0.9948560 0.9758920 0.9971128 0.1743119 0.1719334 0.1761808 0.9906054
Class: D 0.9813472 0.9987807 0.9937041 0.9963511 0.1639484 0.1608902 0.1619096 0.9900639
Class: E 0.9972274 0.9997918 0.9990741 0.9993758 0.1838260 0.1833163 0.1834862 0.9985096

We achieved nearly the same accuracy, sensitivity, and specificity with fewer features. We can probably reduce the feature count further.

Let’s try a model with 8 features

ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
rfor_fit8 = train(classe ~ yaw_belt + pitch_forearm + magnet_dumbbell_z + pitch_belt + magnet_dumbbell_y + gyros_belt_z + magnet_belt_x + yaw_arm, data=train_set, method="rf", trControl=ctrl)
pred_testset8 <- predict(rfor_fit8, newdata = test_set1)
Model_accuracy8 <- confusionMatrix(pred_testset8, test_set1$classe)
print(kable(Model_accuracy8$byClass))
Sensitivity Specificity Pos Pred Value Neg Pred Value Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9940263 0.9964387 0.9910661 0.9976230 0.2844037 0.2827047 0.2852531 0.9952325
Class: B 0.9640035 0.9957868 0.9821109 0.9914010 0.1935100 0.1865443 0.1899422 0.9798952
Class: C 0.9853801 0.9921811 0.9637750 0.9968989 0.1743119 0.1717635 0.1782195 0.9887806
Class: D 0.9865285 0.9979679 0.9896050 0.9973599 0.1639484 0.1617397 0.1634387 0.9922482
Class: E 0.9907579 0.9987510 0.9944341 0.9979201 0.1838260 0.1821271 0.1831464 0.9947544

The results are still impressive. Recall from the plot above that 2 randomly selected predictors achieved an accuracy only slightly lower than 27.

Let’s try a model with 4 features.

rfor_fit4 = train(classe ~ yaw_belt + pitch_forearm + magnet_dumbbell_z + yaw_arm, data=train_set, method="rf", trControl=ctrl)
pred_testset4 <- predict(rfor_fit4, newdata = test_set1)
Model_accuracy4 <- confusionMatrix(pred_testset4, test_set1$classe)
print(kable(Model_accuracy4$byClass))
Sensitivity Specificity Pos Pred Value Neg Pred Value Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9689367 0.9883666 0.9706762 0.9876631 0.2844037 0.2755691 0.2838940 0.9786516
Class: B 0.8858648 0.9772488 0.9033124 0.9727406 0.1935100 0.1714237 0.1897723 0.9315568
Class: C 0.8927875 0.9777778 0.8945312 0.9773756 0.1743119 0.1556235 0.1739721 0.9352827
Class: D 0.9316062 0.9784597 0.8945274 0.9864782 0.1639484 0.1527353 0.1707441 0.9550329
Class: E 0.9574861 0.9931307 0.9691300 0.9904505 0.1838260 0.1760109 0.1816174 0.9753084

Specificity and sensitivity dipped a bit, but a model with 4 features offers better interpretability and scalability and faster processing.

Predicting on the second Test Set

We now compare the model with all 52 predictors and the one with only 4 predictors on the second test set.

pred_testsetall_2 <- predict(rfor_fitall, newdata = test_set2)
Model_accuracyall_2 <- confusionMatrix(pred_testsetall_2, test_set2$classe)
pred_testset4_2 <- predict(rfor_fit4, newdata = test_set2)
Model_accuracy4_2 <- confusionMatrix(pred_testset4_2, test_set2$classe)
print(kable(Model_accuracyall_2$byClass))
Sensitivity Specificity Pos Pred Value Neg Pred Value Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9964158 0.9985735 0.9964158 0.9985735 0.2846939 0.2836735 0.2846939 0.9974946
Class: B 0.9947230 0.9981025 0.9921053 0.9987342 0.1933673 0.1923469 0.1938776 0.9964127
Class: C 0.9853801 0.9950556 0.9768116 0.9969040 0.1744898 0.1719388 0.1760204 0.9902179
Class: D 0.9781931 0.9981696 0.9905363 0.9957395 0.1637755 0.1602041 0.1617347 0.9881814
Class: E 0.9972222 0.9993750 0.9972222 0.9993750 0.1836735 0.1831633 0.1836735 0.9982986
print(kable(Model_accuracy4_2$byClass))
Sensitivity Specificity Pos Pred Value Neg Pred Value Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: A 0.9784946 0.9843081 0.9612676 0.9913793 0.2846939 0.2785714 0.2897959 0.9814014
Class: B 0.8786280 0.9784946 0.9073569 0.9711237 0.1933673 0.1698980 0.1872449 0.9285613
Class: C 0.8801170 0.9796044 0.9011976 0.9747847 0.1744898 0.1535714 0.1704082 0.9298607
Class: D 0.9314642 0.9823063 0.9115854 0.9865196 0.1637755 0.1525510 0.1673469 0.9568852
Class: E 0.9583333 0.9887500 0.9504132 0.9906074 0.1836735 0.1760204 0.1852041 0.9735417

Sensitivity suffered somewhat, particularly in predicting class B and C errors, but specificity remains high. Reducing the number of predictors from 52 to 4 increased bias, which reduced our capacity to predict accurately.

Model_perf <- data.frame(Model_52Pred= c(getTrainPerf(rfor_fitall)[,1], Model_accuracy$overall[1], Model_accuracyall_2$overall[1]), Model_4Pred= c(getTrainPerf(rfor_fit4)[,1], Model_accuracy4$overall[1], Model_accuracy4_2$overall[1]))
rownames(Model_perf) <- c("training_dtset", "testing_set1", "testing_set2")
Model_perf$Model_52Pred <- round(Model_perf$Model_52Pred*100, 2) 
Model_perf$Model_4Pred <- round(Model_perf$Model_4Pred*100, 2)
names(Model_perf) <- c("Model_52Accuracy(%)", "Model_4Accuracy(%)")
print(kable(Model_perf))
Model_52Accuracy(%) Model_4Accuracy(%)
training_dtset 98.92 92.27
testing_set1 99.41 93.14
testing_set2 99.13 93.06

Based on our models’ performance on the second test set, the estimated prediction error for the model with 52 predictors is 0.87%; for the model with 4 predictors, it is 6.94%.
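The same numbers fall straight out of the table above; a one-line check assuming Model_perf as built above:

100 - Model_perf["testing_set2", ]### 0.87% for the 52-predictor model, 6.94% for the 4-predictor model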

Conclusions

Predictive models built on the HAR Weight Lifting Exercises dataset classified correct and erroneous executions of dumbbell lifts with high accuracy, sensitivity, and specificity. It must be pointed out, however, that the errors in movement were performed deliberately; different results may be obtained when errors are committed unintentionally.

Misclassified points by the model with all variables on the 2nd test set

correctall <- pred_testsetall_2 == test_set2$classe
qplot(pitch_forearm,magnet_dumbbell_z,colour=correctall,data=test_set2)

Misclassified points by the model with 20 variables on the 2nd test set

pred_testset20_2 <- predict(rfor_fit20, newdata = test_set2)
Model_accuracy20_2 <- confusionMatrix(pred_testset20_2, test_set2$classe)
correct20 <- pred_testset20_2 == test_set2$classe
qplot(pitch_forearm,magnet_dumbbell_z,colour=correct20,data=test_set2)

Misclassified points by the model with 8 variables on the 2nd test set

pred_testset8_2 <- predict(rfor_fit8, newdata = test_set2)
Model_accuracy8_2 <- confusionMatrix(pred_testset8_2, test_set2$classe)
correct8 <- pred_testset8_2 == test_set2$classe
qplot(pitch_forearm,magnet_dumbbell_z,colour=correct8,data=test_set2)

Misclassified points by the model with 4 variables on the 2nd test set

correct4 <- pred_testset4_2 == test_set2$classe
qplot(pitch_forearm,magnet_dumbbell_z,colour=correct4,data=test_set2)

We now use our different models to predict on the downloaded test set. The results are the same for all models.

testing_proc <- testing[ , which(names(testing) %in% names(train_set))]### keep only the predictor columns used in training
pred_Testset <- predict(rfor_fitall, newdata = testing_proc)
print(pred_Testset)

[1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E

pred_Testset20 <- predict(rfor_fit20, newdata = testing)
print(pred_Testset20)

[1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E

pred_Testset8 <- predict(rfor_fit8, newdata = testing)
print(pred_Testset8)

[1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E

pred_Testset4 <- predict(rfor_fit4, newdata = testing)
print(pred_Testset4)

[1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E

For the second part of our analysis, we compare our random forest model with several other models built using different algorithms.

PLS Model

ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
pls_fitall = train(classe ~ ., data=train_set, method="pls", preProc = c("center", "scale"), trControl=ctrl)

We examine the model’s accuracy on the training set and on the first test set.

library(knitr)
print(kable(pls_fitall$results))
ncomp Accuracy Kappa AccuracySD KappaSD
1 0.3287191 0.1112857 0.0062185 0.0078111
2 0.3419235 0.1318052 0.0074621 0.0093767
3 0.3698202 0.1795564 0.0043800 0.0058132
print(kable(getTrainPerf(pls_fitall)))
TrainAccuracy TrainKappa method
0.3698202 0.1795564 pls
plspred_Vset <- predict(pls_fitall, newdata = test_set1)
plsModel_accuracyall_1 <- confusionMatrix(plspred_Vset, test_set1$classe)
print(kable(plsModel_accuracyall_1$overall))
Accuracy 0.3722392
Kappa 0.1826228
AccuracyLower 0.3598707
AccuracyUpper 0.3847345
AccuracyNull 0.2844037
AccuracyPValue 0.0000000
McnemarPValue 0.0000000
print(kable(table(plspred_Vset, test_set1$classe)))
A B C D E
A 1279 523 584 197 252
B 123 315 63 165 252
C 83 90 210 91 206
D 28 64 0 65 50
E 161 147 169 447 322

The PLS model performed poorly on both the training set and the first test set.

Multinomial Logistic Regression (multinom) Model

ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
multinom_fitall = train(classe ~ ., data=train_set, method="multinom", trControl=ctrl)

We examine the model’s accuracy on the training set and on the first test set.

library(knitr)
print(kable(multinom_fitall$results))
decay Accuracy Kappa AccuracySD KappaSD
0e+00 0.6361228 0.5376890 0.0137046 0.0174132
1e-04 0.6360804 0.5376353 0.0136478 0.0173401
1e-01 0.6360804 0.5376353 0.0136478 0.0173401
print(kable(getTrainPerf(multinom_fitall)))
TrainAccuracy TrainKappa method
0.6361228 0.537689 multinom
multinompred_Vset <- predict(multinom_fitall, newdata = test_set1)
multinomModel_accuracyall_1 <- confusionMatrix(multinompred_Vset, test_set1$classe)
print(kable(multinomModel_accuracyall_1$overall))
Accuracy 0.6629290
Kappa 0.5701510
AccuracyLower 0.6506893
AccuracyUpper 0.6750070
AccuracyNull 0.2844037
AccuracyPValue 0.0000000
McnemarPValue 0.0000000
print(kable(table(multinompred_Vset, test_set1$classe)))
A B C D E
A 1446 211 188 107 118
B 59 648 83 50 126
C 30 84 572 116 101
D 89 106 83 627 128
E 50 90 100 65 609

Naive Bayes Model

ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
nb_fitall = train(classe ~ ., data=train_set, method="nb", trControl=ctrl)

We examine the model’s accuracy on the training set and on the first test set.

library(knitr)
print(kable(nb_fitall$results))
usekernel fL Accuracy Kappa AccuracySD KappaSD
FALSE 0 0.5527360 0.4355056 0.0129700 0.0158605
TRUE 0 0.7366274 0.6687874 0.0117819 0.0146712
print(kable(getTrainPerf(nb_fitall)))
TrainAccuracy TrainKappa method
0.7366274 0.6687874 nb
nbpred_Vset <- predict(nb_fitall, newdata = test_set1)
nbModel_accuracyall_1 <- confusionMatrix(nbpred_Vset, test_set1$classe)
print(kable(nbModel_accuracyall_1$overall))
Accuracy 0.7505946
Kappa 0.6863954
AccuracyLower 0.7393338
AccuracyUpper 0.7616066
AccuracyNull 0.2844037
AccuracyPValue 0.0000000
McnemarPValue 0.0000000
print(kable(table(nbpred_Vset, test_set1$classe)))
A B C D E
A 1214 87 58 69 29
B 61 815 71 5 121
C 186 154 871 169 57
D 206 75 24 680 37
E 7 8 2 42 838

Neural Network Model

ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
nnet_fitall = train(classe ~ ., data=train_set, method="nnet", trControl=ctrl)

We examine the model’s accuracy on the training set and on the first test set.

library(knitr)
print(kable(nnet_fitall$results))
size decay Accuracy Kappa AccuracySD KappaSD
1 0e+00 0.3227755 0.1195081 0.0201697 0.0494054
1 1e-04 0.3321170 0.1318785 0.0229701 0.0422658
1 1e-01 0.3347498 0.1243856 0.0190209 0.0490659
3 0e+00 0.3673161 0.2003524 0.0312514 0.0384186
3 1e-04 0.3648140 0.1806105 0.0415416 0.0754668
3 1e-01 0.3679883 0.1772609 0.0317326 0.0687148
5 0e+00 0.3924926 0.2185783 0.0313944 0.0560151
5 1e-04 0.4060816 0.2403058 0.0363172 0.0517884
5 1e-01 0.4075182 0.2475214 0.0388828 0.0579223
print(kable(getTrainPerf(nnet_fitall)))
TrainAccuracy TrainKappa method
0.4075182 0.2475214 nnet
nnetpred_Vset <- predict(nnet_fitall, newdata = test_set1)
nnetModel_accuracyall_1 <- confusionMatrix(nnetpred_Vset, test_set1$classe)
print(kable(nnetModel_accuracyall_1$overall))
Accuracy 0.4235474
Kappa 0.2735395
AccuracyLower 0.4108797
AccuracyUpper 0.4362909
AccuracyNull 0.2844037
AccuracyPValue 0.0000000
McnemarPValue 0.0000000
print(kable(table(nnetpred_Vset, test_set1$classe)))
A B C D E
A 972 126 274 129 125
B 145 435 36 111 339
C 444 226 657 328 211
D 78 295 57 367 345
E 35 57 2 30 62

Linear Discriminant Analysis Model

ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
lda_fitall = train(classe ~ ., data=train_set, method="lda", trControl=ctrl, preProc = c("center", "scale"))

We examine the model’s accuracy on the training set and on the first test set.

library(knitr)
print(kable(lda_fitall$results))
parameter Accuracy Kappa AccuracySD KappaSD
none 0.6951 0.6139345 0.0113555 0.0143503
print(kable(getTrainPerf(lda_fitall)))
TrainAccuracy TrainKappa method
0.6951 0.6139345 lda
ldapred_Vset <- predict(lda_fitall, newdata = test_set1)
ldaModel_accuracyall_1 <- confusionMatrix(ldapred_Vset, test_set1$classe)
print(kable(ldaModel_accuracyall_1$overall))
Accuracy 0.4235474
Kappa 0.2735395
AccuracyLower 0.4108797
AccuracyUpper 0.4362909
AccuracyNull 0.2844037
AccuracyPValue 0.0000000
McnemarPValue 0.0000000
print(kable(table(ldapred_Vset, test_set1$classe)))
A B C D E
A 972 126 274 129 125
B 145 435 36 111 339
C 444 226 657 328 211
D 78 295 57 367 345
E 35 57 2 30 62

Statistical Significance

We determine whether there is a statistically significant difference between the models’ performance.

resampls = resamples(list(RF = rfor_fit4, PLS = pls_fitall,  MULTINOM = multinom_fitall, NB = nb_fitall, NNET = nnet_fitall, LDA = lda_fitall))
print(kable(summary(resampls)[3]))
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
RF 0.9171 0.9194 0.9221 0.9227 0.9263 0.9291 0
PLS 0.3629 0.3674 0.3707 0.3698 0.3713 0.3782 0
MULTINOM 0.6105 0.6271 0.6398 0.6361 0.6449 0.6528 0
NB 0.7148 0.7332 0.7380 0.7366 0.7459 0.7499 0
NNET 0.3424 0.3779 0.4152 0.4075 0.4255 0.4622 0
LDA 0.6675 0.6932 0.6953 0.6951 0.7011 0.7096 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
RF 0.8952 0.8981 0.9015 0.9022 0.9068 0.9103 0
PLS 0.1705 0.1765 0.1792 0.1796 0.1826 0.1901 0
MULTINOM 0.5060 0.5257 0.5422 0.5377 0.5494 0.5592 0
NB 0.6414 0.6645 0.6706 0.6688 0.6803 0.6853 0
NNET 0.1558 0.2147 0.2577 0.2475 0.2757 0.3286 0
LDA 0.5790 0.6116 0.6140 0.6139 0.6214 0.6324 0
print(kable(summary(diff(resampls))[3], caption = "P values and mean difference between models"))
P values and mean difference between models
Accuracy (upper triangle: difference estimates; lower triangle: p-values)
RF PLS MULTINOM NB NNET LDA
RF 0.55286 0.28656 0.18605 0.51516 0.22758
PLS < 2.2e-16 -0.26630 -0.36681 -0.03770 -0.32528
MULTINOM 1.054e-11 1.846e-11 -0.10050 0.22860 -0.05898
NB 2.478e-11 1.202e-13 5.590e-07 0.32911 0.04153
NNET 2.643e-10 0.1692774 8.270e-07 4.400e-08 -0.28758
LDA 1.405e-12 5.157e-13 0.0001683 0.0004443 3.120e-08
Kappa (upper triangle: difference estimates; lower triangle: p-values)
RF PLS MULTINOM NB NNET LDA
RF 0.72266 0.36453 0.23343 0.65470 0.28828
PLS < 2.2e-16 -0.35813 -0.48923 -0.06796 -0.43438
MULTINOM 1.035e-11 1.253e-11 -0.13110 0.29017 -0.07625
NB 2.252e-11 7.435e-14 4.600e-07 0.42127 0.05485
NNET 8.966e-10 0.0661227 2.744e-06 1.328e-07 -0.36641
LDA 1.344e-12 2.113e-13 0.0001408 0.0002895 7.344e-08
bwplot(resampls,metric="Accuracy")

Conclusions

The random forest model achieved an accuracy of more than 90% on the first test set. The multinomial regression, naive Bayes, and linear discriminant analysis models performed moderately well, with 60% to 80% accuracy. The neural network and partial least squares models performed poorly, at around 40%.

sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 10 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] nnet_7.3-12         pls_2.5-0           randomForest_4.6-12
## [4] knitr_1.12.3        caret_6.0-64        ggplot2_2.1.0      
## [7] lattice_0.20-33    
## 
## loaded via a namespace (and not attached):
##  [1] codetools_0.2-14   digest_0.6.9       htmltools_0.3     
##  [4] minqa_1.2.4        splines_3.2.4      MatrixModels_0.4-1
##  [7] scales_0.3.0       grid_3.2.4         stringr_1.0.0     
## [10] e1071_1.6-7        lme4_1.1-11        munsell_0.4.3     
## [13] highr_0.5.1        labeling_0.3       foreach_1.4.3     
## [16] iterators_1.0.8    mgcv_1.8-12        Matrix_1.2-4      
## [19] MASS_7.3-45        plyr_1.8.3         stats4_3.2.4      
## [22] stringi_1.0-1      pbkrtest_0.4-6     magrittr_1.5      
## [25] car_2.1-1          reshape2_1.4.1     rmarkdown_0.9.5   
## [28] evaluate_0.8.3     gtable_0.2.0       colorspace_1.2-6  
## [31] yaml_2.1.13        tools_3.2.4        parallel_3.2.4    
## [34] nloptr_1.0.4       nlme_3.1-126       quantreg_5.21     
## [37] class_7.3-14       formatR_1.3        Rcpp_0.12.4       
## [40] SparseM_1.7
