# Prediction Assignment Writeup
The goal of this project is to predict the manner in which people performed an exercise (the `classe` variable), using data from http://groupware.les.inf.puc-rio.br/har.
The training dataset is from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv. It contains 19622 rows (observations) and 160 variables (features).
The testing dataset is from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv. It contains 20 rows (the test cases to be predicted), also with 160 variables.
To build the machine learning model, the training dataset is partitioned into 70% for training and 30% for validation.
Not all 160 features are needed, so the dataset is cleaned by removing some columns: columns containing NA or #DIV/0! values, empty columns, and identifying columns such as the user name, which are not related to the classes being predicted.
## Data loading
setwd("..")
pmlTrain<-read.csv("pml-training.csv", header=TRUE, na.strings=c("NA", "#DIV/0!"))
pmlTest<-read.csv("pml-testing.csv", header=TRUE, na.strings=c("NA", "#DIV/0!"))
## NA exclusion for all available variables
noNApmlTrain<-pmlTrain[, apply(pmlTrain, 2, function(x) !any(is.na(x)))]
dim(noNApmlTrain)
## [1] 19622 60
## Removal of variables with user information, timestamps and undefined content
cleanpmlTrain<-noNApmlTrain[,-c(1:8)]
dim(cleanpmlTrain)
## [1] 19622 52
## Same column selection applied to the 20 provided test cases (outcome column 52 excluded)
cleanpmltest<-pmlTest[,names(cleanpmlTrain[,-52])]
dim(cleanpmltest)
## [1] 20 51
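The cleaning steps above can be illustrated on a tiny, made-up dataset. The sketch below is only an illustration under stated assumptions: the CSV text, the column names (`id`, `x`, `y`, `problem_id`) and the object names are hypothetical and are not taken from the project files. It shows how `na.strings` turns `#DIV/0!` into `NA`, how columns containing any `NA` are dropped, and how the surviving feature names are reused to subset a test set.

```r
# Hypothetical toy data: "NA" and "#DIV/0!" both mark missing values
csv_text <- "id,x,y,classe
1,0.5,#DIV/0!,A
2,1.5,NA,B
3,2.5,3.0,A"

toy <- read.csv(text = csv_text, header = TRUE,
                na.strings = c("NA", "#DIV/0!"))

# Keep only the columns that contain no missing values
keep <- colSums(is.na(toy)) == 0
toy_clean <- toy[, keep]
names(toy_clean)

# Hypothetical test set: same features, but no outcome column
toy_test <- data.frame(id = 4, x = 3.5, y = 1, problem_id = 1)

# Reuse the training feature names (everything except the outcome)
features <- setdiff(names(toy_clean), "classe")
toy_test_clean <- toy_test[, features]
names(toy_test_clean)
```

Selecting the test columns by name, rather than by position, keeps the test set aligned with the training features even if the column order differs between the two files.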
The cleaned training dataset is partitioned into 70% for training and 30% for validation, using the following code:
#Create Partition
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain<-createDataPartition(y=cleanpmlTrain$classe, p=0.70,list=F)
training<-cleanpmlTrain[inTrain,]
validate<-cleanpmlTrain[-inTrain,]
#Training and validate set dimensions
dim(training)
## [1] 13737 52
dim(validate)
## [1] 5885 52
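For a factor outcome, `createDataPartition` samples within each class, so both partitions keep roughly the same class proportions. A minimal base-R sketch of that idea follows, using the built-in `iris` data as a stand-in. This is not caret's implementation; the seed and the object names are arbitrary choices for the illustration.

```r
set.seed(42)  # arbitrary seed, only to make this sketch reproducible

y <- iris$Species

# Sample 70% of the row indices within each class (stratified sampling)
in_train <- unlist(
  lapply(split(seq_along(y), y),
         function(rows) sample(rows, size = round(0.7 * length(rows)))),
  use.names = FALSE
)

train_part <- iris[in_train, ]
valid_part <- iris[-in_train, ]

# Class counts in each partition
table(train_part$Species)
table(valid_part$Species)
```

Because the sampling is done per class, no class can end up under-represented in the validation set by chance, which matters when per-class sensitivity is reported later.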
The model was trained on the training partition using a random forest. 3-fold cross-validation was used to control the training and select the tuning parameter. The resulting model was then evaluated on the held-out validation partition to estimate its accuracy and prediction error.
Using 51 features to predict five classes, with 3-fold cross-validation for tuning, the model achieved an accuracy of 99.3% on the validation set, with a 95% CI of [0.9906, 0.995] and a Kappa value of 0.9912.
library(caret)
set.seed(786541)
#3-fold cross-validation; verboseIter prints the per-fold training log
pmlTrainControl<-trainControl(method="cv", number=3, allowParallel=TRUE, verboseIter=TRUE)
pmlTrainedModel<-train(classe~., data=training, method="rf", trControl=pmlTrainControl, verbose=FALSE)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## + Fold1: mtry= 2
## - Fold1: mtry= 2
## + Fold1: mtry=26
## - Fold1: mtry=26
## + Fold1: mtry=51
## - Fold1: mtry=51
## + Fold2: mtry= 2
## - Fold2: mtry= 2
## + Fold2: mtry=26
## - Fold2: mtry=26
## + Fold2: mtry=51
## - Fold2: mtry=51
## + Fold3: mtry= 2
## - Fold3: mtry= 2
## + Fold3: mtry=26
## - Fold3: mtry=26
## + Fold3: mtry=51
## - Fold3: mtry=51
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 26 on full training set
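The three `mtry` values in the log (2, 26 and 51) are consistent with an evenly spaced grid between 2 and the number of predictors. The sketch below reproduces that arithmetic in base R and counts the model fits implied by the log. It assumes that caret's default random forest grid behaves this way for three candidate values; the variable names are illustrative only.

```r
n_predictors <- 51  # features remaining after cleaning
n_candidates <- 3   # assumption: three mtry values are tried, as seen in the log
n_folds <- 3        # 3-fold cross-validation

# Evenly spaced candidates between 2 and the number of predictors
mtry_grid <- floor(seq(2, n_predictors, length.out = n_candidates))
mtry_grid  # 2 26 51

# One fit per fold and candidate, plus the final fit on the full training set
fits_during_cv <- n_folds * length(mtry_grid)
total_fits <- fits_during_cv + 1
total_fits
```

This matches the log: nine `+ Fold` entries during resampling, followed by a single final fit with the selected `mtry = 26`.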
The trained model is then tested on the validation dataset (the 30% of the data partitioned from the original training dataset).
predrf<-predict(pmlTrainedModel, newdata=validate)
confusionMatrix(predrf, validate$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1672   10    0    0    0
##          B    1 1126    7    0    0
##          C    0    2 1014   12    2
##          D    0    0    5  952    0
##          E    1    1    0    0 1080
##
## Overall Statistics
##
##                Accuracy : 0.993
##                  95% CI : (0.9906, 0.995)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9912
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9886   0.9883   0.9876   0.9982
## Specificity            0.9976   0.9983   0.9967   0.9990   0.9996
## Pos Pred Value         0.9941   0.9929   0.9845   0.9948   0.9982
## Neg Pred Value         0.9995   0.9973   0.9975   0.9976   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1913   0.1723   0.1618   0.1835
## Detection Prevalence   0.2858   0.1927   0.1750   0.1626   0.1839
## Balanced Accuracy      0.9982   0.9935   0.9925   0.9933   0.9989
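The overall statistics in this output can be reproduced by hand from the confusion matrix counts. The base-R sketch below re-enters the counts printed above and recomputes accuracy, the estimated out-of-sample error, Kappa and the per-class sensitivity. The object names (`cm`, `kappa_hat` and so on) are illustrative and do not come from the original analysis.

```r
# Counts copied from the confusion matrix output
# (rows = predicted class, columns = reference class)
cm <- matrix(c(1672,   10,    0,   0,    0,
                  1, 1126,    7,   0,    0,
                  0,    2, 1014,  12,    2,
                  0,    0,    5, 952,    0,
                  1,    1,    0,   0, 1080),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference = LETTERS[1:5]))

n <- sum(cm)                       # number of validation cases
accuracy <- sum(diag(cm)) / n      # share of correct predictions
oos_error <- 1 - accuracy          # estimated out-of-sample error

# Cohen's Kappa: agreement beyond what the marginal totals alone would give
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

# Sensitivity: correct predictions divided by the reference total per class
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, error = oos_error, kappa = kappa_hat), 4)
round(sensitivity, 4)
```

With 5844 of 5885 validation cases classified correctly, the estimated out-of-sample error is about 0.7%.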
The trained model is then used to predict the class of the 20 test cases provided at the beginning of the project.
pred20<-predict(pmlTrainedModel, newdata=cleanpmltest)