The goal of this project is to predict the manner in which the participants performed the exercise. This is the "classe" variable in the training set; any of the other variables may be used as predictors. This report describes how the model was built, how cross validation was used, what the expected out-of-sample error is, and why the choices below were made. The selected model is then used to predict 20 different test cases.
library(caret)
library(randomForest)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
if (!require("rattle")) {
  install.packages("rattle")
  library(rattle)
}
if (!require("e1071")) {
  install.packages("e1071")
  library(e1071)
}
set.seed(1234)
In this analysis, 70% of the training data is used to build the models and the remaining 30% is held out for validation.
# Download the training data and the 20-case testing data
UrlTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
UrlTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training <- read.csv(url(UrlTrain))
testing <- read.csv(url(UrlTest))
# Partition the training data: 70% for model building, 30% for validation
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
myTraining <- training[inTrain, ]
myTesting <- training[-inTrain, ]
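As a quick sanity check, the dimensions of the two partitions can be inspected; the exact row counts depend on the seed, but the roughly 19,622 rows of the training file should split about 70/30.
# Sanity check on the partition sizes (exact counts depend on the seed)
dim(myTraining)
dim(myTesting)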
The following code cleans the data sets. The major cleaning steps are: removing near-zero-variance variables, dropping the ID column, removing variables that are mostly missing, keeping the same columns in all data sets, and coercing column classes so the testing set matches the training set.
# Compute near-zero-variance diagnostics; the hand-typed list below is taken
# from this output
myDataNZV <- nearZeroVar(myTraining, saveMetrics = TRUE)
myNZVvars <- names(myTraining) %in% c("new_window", "kurtosis_roll_belt", "kurtosis_picth_belt",
"kurtosis_yaw_belt", "skewness_roll_belt", "skewness_roll_belt.1", "skewness_yaw_belt",
"max_yaw_belt", "min_yaw_belt", "amplitude_yaw_belt", "avg_roll_arm", "stddev_roll_arm",
"var_roll_arm", "avg_pitch_arm", "stddev_pitch_arm", "var_pitch_arm", "avg_yaw_arm",
"stddev_yaw_arm", "var_yaw_arm", "kurtosis_roll_arm", "kurtosis_picth_arm",
"kurtosis_yaw_arm", "skewness_roll_arm", "skewness_pitch_arm", "skewness_yaw_arm",
"max_roll_arm", "min_roll_arm", "min_pitch_arm", "amplitude_roll_arm", "amplitude_pitch_arm",
"kurtosis_roll_dumbbell", "kurtosis_picth_dumbbell", "kurtosis_yaw_dumbbell", "skewness_roll_dumbbell",
"skewness_pitch_dumbbell", "skewness_yaw_dumbbell", "max_yaw_dumbbell", "min_yaw_dumbbell",
"amplitude_yaw_dumbbell", "kurtosis_roll_forearm", "kurtosis_picth_forearm", "kurtosis_yaw_forearm",
"skewness_roll_forearm", "skewness_pitch_forearm", "skewness_yaw_forearm", "max_roll_forearm",
"max_yaw_forearm", "min_roll_forearm", "min_yaw_forearm", "amplitude_roll_forearm",
"amplitude_yaw_forearm", "avg_roll_forearm", "stddev_roll_forearm", "var_roll_forearm",
"avg_pitch_forearm", "stddev_pitch_forearm", "var_pitch_forearm", "avg_yaw_forearm",
"stddev_yaw_forearm", "var_yaw_forearm")
myTraining <- myTraining[!myNZVvars]   # drop the near-zero-variance columns
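# A less brittle alternative (a sketch, intentionally not run here so the
# pipeline above is unchanged) would let nearZeroVar() return the flagged
# column indices directly:
#   nzv_idx <- nearZeroVar(myTraining)
#   if (length(nzv_idx) > 0) myTraining <- myTraining[, -nzv_idx]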
myTraining <- myTraining[c(-1)]        # drop the first column (row ID), not a predictor
# Remove every column that is at least 60% NA
temp <- myTraining
for (i in 1:length(myTraining)) {
  if (sum(is.na(myTraining[, i])) / nrow(myTraining) >= 0.6) {
    for (j in 1:length(temp)) {
      if (length(grep(names(myTraining[i]), names(temp)[j])) == 1) {
        temp <- temp[, -j]
      }
    }
  }
}
myTraining <- temp
rm(temp)
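# The nested loops above can be replaced by a vectorized equivalent; running
# it here is harmless (idempotent), since the >= 60%-NA columns are already gone
keep <- colMeans(is.na(myTraining)) < 0.6
myTraining <- myTraining[, keep]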
# Keep the same columns in the validation and testing sets
clean1 <- colnames(myTraining)          # all remaining columns, including classe
clean2 <- colnames(myTraining[, -58])   # the same columns without classe (column 58)
myTesting <- myTesting[clean1]
testing <- testing[clean2]              # the 20-case testing set has no classe column
# Coerce the testing columns to the same classes as the training columns
for (i in 1:length(testing)) {
  for (j in 1:length(myTraining)) {
    if (length(grep(names(myTraining[i]), names(testing)[j])) == 1) {
      class(testing[j]) <- class(myTraining[i])
    }
  }
}
# Row-bind one training row onto testing, then remove it again; this forces
# the factor levels in testing to match those in myTraining
testing <- rbind(myTraining[2, -58], testing)
testing <- testing[-1, ]
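It is worth confirming that the coercion and the rbind trick left every testing column with the same class as its training counterpart; the following check should return TRUE.
# Should be TRUE: every testing column now has the training column's class
identical(sapply(testing, class), sapply(myTraining[, -58], class))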
In this step, the cleaned data sets are used to train the models and generate predictions.
The following plot shows the decision tree fitted to the training data set.
mod.dt <- rpart(classe ~ ., data = myTraining, method = "class")
fancyRpartPlot(mod.dt)
The accuracy of the decision tree model on the held-out validation set is shown below.
predictions.dt <- predict(mod.dt, myTesting, type = "class")
confusionMatrix(predictions.dt, myTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1606 53 3 0 0
## B 40 947 71 50 0
## C 28 130 930 108 49
## D 0 9 22 752 133
## E 0 0 0 54 900
##
## Overall Statistics
##
## Accuracy : 0.8726
## 95% CI : (0.8638, 0.881)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8389
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9594 0.8314 0.9064 0.7801 0.8318
## Specificity 0.9867 0.9661 0.9352 0.9667 0.9888
## Pos Pred Value 0.9663 0.8547 0.7470 0.8210 0.9434
## Neg Pred Value 0.9839 0.9598 0.9793 0.9573 0.9631
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2729 0.1609 0.1580 0.1278 0.1529
## Detection Prevalence 0.2824 0.1883 0.2116 0.1556 0.1621
## Balanced Accuracy 0.9730 0.8988 0.9208 0.8734 0.9103
Next, a random forest model is fitted to the training data set and evaluated on the held-out validation set; the results are shown below.
mod.rf <- randomForest(classe ~ ., data = myTraining)
predictions.rf <- predict(mod.rf, myTesting, type = "class")
confusionMatrix(predictions.rf, myTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 2 0 0 0
## B 0 1137 1 0 0
## C 0 0 1023 2 0
## D 0 0 2 962 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.9988
## 95% CI : (0.9976, 0.9995)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9985
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9982 0.9971 0.9979 1.0000
## Specificity 0.9995 0.9998 0.9996 0.9996 1.0000
## Pos Pred Value 0.9988 0.9991 0.9980 0.9979 1.0000
## Neg Pred Value 1.0000 0.9996 0.9994 0.9996 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1932 0.1738 0.1635 0.1839
## Detection Prevalence 0.2848 0.1934 0.1742 0.1638 0.1839
## Balanced Accuracy 0.9998 0.9990 0.9983 0.9988 1.0000
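Random forests also carry a built-in out-of-bag (OOB) error estimate computed during training, which offers an independent check on the validation accuracy above (output not shown):
print(mod.rf)  # the OOB error rate should roughly agree with the validation results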
Comparing the two models, the random forest (validation accuracy 0.9988) clearly outperforms the decision tree (validation accuracy 0.8726). The expected out-of-sample error is therefore estimated at about 0.12% (1 - 0.9988). The random forest model is selected to predict the 20 test cases.
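The held-out 30% split serves as the validation here; an explicit k-fold cross-validation could alternatively be run through caret's train interface. A minimal sketch follows; the fold count and tuning defaults are illustrative assumptions, not the settings used above, and this approach is considerably slower than calling randomForest() directly.
# Illustrative 5-fold cross-validation via caret (settings are assumptions)
ctrl <- trainControl(method = "cv", number = 5)
mod.rf.cv <- train(classe ~ ., data = myTraining, method = "rf", trControl = ctrl)
mod.rf.cv$results  # cross-validated accuracy for each tuned mtry value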
The results of applying the random forest model to the 20 test cases are shown below (the irregular row labels are an artifact of the earlier rbind step used during cleaning).
predictions.20 <- predict(mod.rf, testing, type = "class")
predictions.20
## 2 31 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
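For the course submission, each of the 20 predictions is typically written to its own text file. A minimal sketch follows; the problem_id_N.txt naming scheme is an assumed convention matching the submission format.
# Write one prediction per file (the file-name pattern is an assumption)
write_files <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_files(predictions.20)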