pml_training<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",na.strings=c("NA",""))
dim(pml_training) # 19622 rows, 160 cols
## [1] 19622 160
pml_testing<-read.csv("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",na.strings=c("NA",""))
dim(pml_testing) # 20 rows, 160 cols
## [1] 20 160
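Before the clean-up steps below, the per-column NA counts can be tabulated (a quick sketch) to see how many of the 160 columns are largely empty:
# tabulate the number of NA values in each column of the training data
table(colSums(is.na(pml_training)))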
Step 1. Delete the 100 columns that contain unavailable (NA) or missing ("") data.
non_missing <- apply(!is.na(pml_training),2,sum)>19621
pml_training1 <- pml_training[,non_missing]
dim(pml_training1) # 19,622 rows, 60 cols
## [1] 19622 60
non_missing <- apply(!is.na(pml_testing),2,sum)>19
pml_testing2 <- pml_testing[,non_missing]
dim(pml_testing2) # 20 rows, 60 cols
## [1] 20 60
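As a sanity check (a sketch; it assumes the two files differ only in their last column, classe in the training set versus problem_id in the test set), the retained column names can be compared:
# the first 59 retained column names should be identical in both frames
all(names(pml_training1)[-60] == names(pml_testing2)[-60])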
Step 2. Delete the 406 rows in which the new_window variable equals "yes".
pml_training2 <- subset(pml_training1,new_window=="no")
dim(pml_training2) # 19,216 rows, 60 cols
## [1] 19216 60
Step 3. Delete the index, timestamp and new_window variables (columns 1 and 3-6). The remaining 55 columns comprise 54 predictors (including user_name) and the classe variable.
pml_training2_user_name <- pml_training2[,2]
pml_training2_predictors <- pml_training2[,7:60]
# name the first column "user_name" so it matches the test set built below
pml_training3 <- cbind(user_name=pml_training2_user_name,pml_training2_predictors)
dim(pml_training3) # 19,216 rows, 55 cols
## [1] 19216 55
pml_testing2_user_name <- pml_testing2[,2]
pml_testing2_predictors <- pml_testing2[,7:60]
# use the same "user_name" column name as the training set
pml_testing3 <- cbind(user_name=pml_testing2_user_name,pml_testing2_predictors)
dim(pml_testing3) # 20 rows, 55 cols
## [1] 20 55
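An equivalent one-step subset (a sketch, assuming columns 1 and 3-6 hold the index, timestamps and new_window as described in Step 3, and that column 2 is user_name) is:
# drop the index, timestamp and new_window columns in a single step
pml_training3 <- pml_training2[,-c(1,3:6)]
pml_testing3 <- pml_testing2[,-c(1,3:6)]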
library(caret)
## Warning: package 'caret' was built under R version 3.1.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.1.2
InTrain <- createDataPartition(pml_training3$classe,p=0.4)[[1]]
training <- pml_training3[InTrain,] # 7689 rows, 55 cols
dim(training)
## [1] 7689 55
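The remaining 60% of the rows can be held out as an independent validation set (a sketch; the expected row count is simply 19216 - 7689):
validation <- pml_training3[-InTrain,] # should be 11527 rows, 55 cols
dim(validation)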
A generalized boosted regression model (gbm) was fitted to the training data using caret's train function.
set.seed(125)
gbmModel <- train(classe~.,method="gbm",data=training,verbose=FALSE)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.1.2
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.1.2
print(gbmModel$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 58 predictors of which 45 had non-zero influence.
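By default, train() tunes gbm over 25 bootstrap resamples; to mirror the random forest fit below, the same five-fold cross-validation could be requested instead (a sketch of the alternative call):
# boosted model tuned with 5-fold CV rather than the default bootstrap resampling
gbmModelCV <- train(classe~.,method="gbm",data=training, trControl=trainControl(method="cv",number=5),verbose=FALSE)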
gbmPredictions <- predict(gbmModel,training)
confusionMatrix(training$classe,gbmPredictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2189 0 0 0 0
## B 9 1470 8 1 0
## C 0 6 1333 1 1
## D 0 2 8 1249 0
## E 0 2 1 8 1401
##
## Overall Statistics
##
## Accuracy : 0.9939
## 95% CI : (0.9919, 0.9955)
## No Information Rate : 0.2859
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9923
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9959 0.9932 0.9874 0.9921 0.9993
## Specificity 1.0000 0.9971 0.9987 0.9984 0.9983
## Pos Pred Value 1.0000 0.9879 0.9940 0.9921 0.9922
## Neg Pred Value 0.9984 0.9984 0.9973 0.9984 0.9998
## Prevalence 0.2859 0.1925 0.1756 0.1637 0.1823
## Detection Rate 0.2847 0.1912 0.1734 0.1624 0.1822
## Detection Prevalence 0.2847 0.1935 0.1744 0.1637 0.1836
## Balanced Accuracy 0.9980 0.9952 0.9931 0.9953 0.9988
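This confusion matrix is computed on the same data used to fit the model, so its accuracy is optimistic. A less biased estimate could be obtained on rows the model has never seen (a sketch, assuming the validation hold-out defined after the data partition above):
# evaluate the boosted model on the held-out 60% of the rows
gbmValPredictions <- predict(gbmModel,validation)
confusionMatrix(validation$classe,gbmValPredictions)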
A random forest (rf) model was fitted to the training data using the train function with five-fold cross-validation (cv).
set.seed(125)
rfModel <- train(classe~.,data=training,method="rf", trControl=trainControl(method="cv",number=5), prox=TRUE,allowParallel=TRUE)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.1.2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
print(rfModel$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE, allowParallel = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 30
##
## OOB estimate of error rate: 0.69%
## Confusion matrix:
## A B C D E class.error
## A 2188 1 0 0 0 0.0004568296
## B 12 1470 5 1 0 0.0120967742
## C 0 12 1326 3 0 0.0111856823
## D 0 0 12 1247 0 0.0095313741
## E 0 0 0 7 1405 0.0049575071
rfPredictions <- predict(rfModel,training)
confusionMatrix(training$classe,rfPredictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2189 0 0 0 0
## B 0 1488 0 0 0
## C 0 0 1341 0 0
## D 0 0 0 1259 0
## E 0 0 0 0 1412
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9995, 1)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2847 0.1935 0.1744 0.1637 0.1836
## Detection Rate 0.2847 0.1935 0.1744 0.1637 0.1836
## Detection Prevalence 0.2847 0.1935 0.1744 0.1637 0.1836
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
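The perfect in-sample confusion matrix again reflects predictions on the training rows themselves; the OOB estimate above (0.69%) is the more realistic error figure. To see which predictors drive the model, caret's varImp can be inspected (a short sketch):
rfImportance <- varImp(rfModel) # ranked variable importance for the final forest
print(rfImportance)
plot(rfImportance,top=20) # plot the 20 most influential predictors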
The random forest model should be highly accurate on new data, given the size of the training set and the five-fold cross-validation: the out-of-bag (OOB) error estimate reported above is 0.69%, so the expected out-of-sample error is below 1%.
The predicted values should match the actual values for all 20 test cases (20 out of 20), because user_name and the window number (num_window) are common to both the training and test data sets.
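Finally, the 20 test cases would be scored with the random forest model as follows (a sketch; it assumes pml_testing3 carries the same predictor names as the training frame, as arranged in Step 3):
# predict the classe of each of the 20 test cases
answers <- predict(rfModel,newdata=pml_testing3)
answers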