By Sue Lynn
Background Introduction
The goal of this project is to predict the manner in which the participants performed the exercise (the classe variable in the training set).
In order to reproduce the same results, you need a certain set of packages, as well as a pseudo-random seed set equal to the one I used. Note: to install, for instance, the caret package in R, run this command: install.packages("caret")
The following libraries were used for this project; install them if you have not done so yet, and load them into your working environment.
library(caret)
## Warning: package 'caret' was built under R version 3.2.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.2.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.2.3
library(ggplot2)
library(grid)
Getting The Data
Download the training and testing data sets to the hard drive, then read them into R and keep only the relevant columns.
train = read.csv("C:\\Users\\sue\\Documents\\R\\pml-training.csv",header=TRUE)
# keep only the movement-sensor measurement columns plus classe
train_used = train[,c(8:11,37:49,60:68,84:86,102,113:124,140,151:160)]
testing = read.csv("C:\\Users\\sue\\Documents\\R\\pml-testing.csv",header=TRUE)
# apply the same column selection to the 20-case testing set
test_used = testing[,c(8:11,37:49,60:68,84:86,102,113:124,140,151:160)]
The raw training dataset contained 19622 rows and 160 variables. Cleaning consisted of removing the many variables that were almost entirely missing (typically populated in only one row per time window) and the variables unrelated to the movement sensors. This left a dataset of 53 variables; a programmatic alternative to the hard-coded column indices is sketched after the dimension check below.
dim(train)
## [1] 19622 160
dim(train_used)
## [1] 19622 53
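For reference, the same 53 columns can be selected programmatically rather than by hard-coded indices. The sketch below is not from the original analysis: it assumes the mostly-empty cells are read in as NA via na.strings, and it uses new variable names so the objects created above are left untouched.
train_raw = read.csv("C:\\Users\\sue\\Documents\\R\\pml-training.csv", header=TRUE,
                     na.strings=c("NA", "", "#DIV/0!"))
complete_cols = colSums(is.na(train_raw)) == 0   # keep only columns with no missing values
train_used_alt = train_raw[, complete_cols]
train_used_alt = train_used_alt[, -(1:7)]        # drop the non-sensor bookkeeping columns
dim(train_used_alt)                              # 19622 x 53, matching train_used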
Partitioning The Training Set Into Two
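The partitioning code does not appear in the original output, but the myTraining and myTesting objects used below imply a split of the cleaned training data. A minimal sketch using caret's createDataPartition is shown here; the 60/40 proportion and the seed value are assumptions (a 60/40 split is consistent with the 7846 observations summarised in the confusion matrix further down).
set.seed(12345)                        # seed value assumed, not taken from the original
inTrain = createDataPartition(train_used$classe, p=0.6, list=FALSE)
myTraining = train_used[inTrain, ]     # 60% used to fit the model
myTesting = train_used[-inTrain, ]     # 40% held out for validation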
The Model
Many methods of classification were attempted, including naive Bayes, multinomial logistic regression, and decision trees. It was determined that the Random Forest method produced the best results. In addition, principal component analysis was attempted; however, it greatly reduced the prediction accuracy.
Cross-validation was not used because, according to the creators of the Random Forest algorithm: “In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error.” - Leo Breiman and Adele Cutler
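The fitted model object referenced in the next section is likewise not shown in the original output. A sketch of how it could be produced with the randomForest package follows; the default settings are an assumption, since the parameters actually used are not recorded.
random_forest = randomForest(classe ~ ., data=myTraining, ntree=500)   # ntree=500 is the package default
random_forest                          # printing the model shows the OOB error estimate
The out-of-bag (OOB) error reported when the model is printed is the built-in estimate Breiman and Cutler refer to in the quote above.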
Model Applied to Testing Dataset
predictionTesting = predict(random_forest, newdata= myTesting)
confusionMatrix(predictionTesting, myTesting$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 10 0 0 0
## B 0 1506 8 0 0
## C 0 2 1360 14 1
## D 0 0 0 1270 3
## E 0 0 0 2 1438
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9931, 0.9964)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9921 0.9942 0.9876 0.9972
## Specificity 0.9982 0.9987 0.9974 0.9995 0.9997
## Pos Pred Value 0.9955 0.9947 0.9877 0.9976 0.9986
## Neg Pred Value 1.0000 0.9981 0.9988 0.9976 0.9994
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1919 0.1733 0.1619 0.1833
## Detection Prevalence 0.2858 0.1930 0.1755 0.1622 0.1835
## Balanced Accuracy 0.9991 0.9954 0.9958 0.9936 0.9985
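Finally, the 20-case testing set loaded earlier as test_used does not appear in the output above. Applying the fitted model to it would look like the sketch below (the resulting predictions are not reproduced here).
predictionFinal = predict(random_forest, newdata=test_used)
predictionFinal                        # one predicted classe per problem_id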