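Load the data and split it 70/30 into training and validation sets. In each set, keep only the numeric features that have no missing values, and drop the row index (X) and num_window columns, which carry no sensor information.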
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
setwd("/Users/theofpa/Development/datasciencecoursera/ml-project/")
traincsv<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testcsv<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(traincsv, destfile = "pml-training.csv", method = "curl")
download.file(testcsv, destfile = "pml-testing.csv", method = "curl")
trainingraw <- read.table("./pml-training.csv",sep=",",na.strings = c("NA",""),header=TRUE)
testing <- read.table("./pml-testing.csv",sep=",",na.strings = c("NA",""),header=TRUE)
inTrain <- createDataPartition(trainingraw$classe, p=0.70, list=FALSE)
training <- trainingraw[inTrain,]
validation <- trainingraw[-inTrain,]
training<-training[,colSums(is.na(training)) == 0]
classe<-training$classe
nums <- sapply(training, is.numeric)
training<-cbind(classe,training[,nums])
training$X<-training$num_window<-NULL
validation<-validation[,colSums(is.na(validation)) == 0]
vclasse<-validation$classe
vnums <- sapply(validation, is.numeric)
validation<-cbind(vclasse,validation[,vnums])
colnames(validation)[1]<-"classe"
validation$X<-validation$num_window<-NULL
testing<-testing[,colSums(is.na(testing)) == 0]
tnums <- sapply(testing, is.numeric)
testing<-testing[,tnums]
testing$X<-testing$num_window<-NULL
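The three cleaning passes above repeat the same steps, and each dataset decides on its own which columns to drop. A minimal sketch of one way to factor this out follows. The helper name `clean_features` and the toy data frame are hypothetical, introduced only for illustration, and are not part of the original analysis. The idea is to choose the columns once, on a reference data frame, and reuse the resulting names for the other sets.

```r
# Hypothetical helper: names of complete, numeric predictor columns.
clean_features <- function(df, drop = c("X", "num_window")) {
  complete <- colSums(is.na(df)) == 0              # no missing values
  numeric  <- vapply(df, is.numeric, logical(1))   # numeric columns only
  setdiff(names(df)[complete & numeric], drop)     # remove bookkeeping columns
}

# Toy data standing in for the real CSV (an assumption for illustration).
toy <- data.frame(
  X          = 1:4,
  num_window = c(11L, 11L, 12L, 12L),
  roll_belt  = c(1.41, 1.42, 1.45, 1.48),
  kurtosis   = c(NA, NA, 0.3, NA),
  user_name  = c("a", "b", "a", "b"),
  classe     = factor(c("A", "A", "B", "B"))
)

keep <- clean_features(toy)
toy_clean <- toy[, c("classe", keep), drop = FALSE]
print(names(toy_clean))
```

With the real data, the same `keep` vector computed on the training set could be used to subset the validation and testing sets, so that all three are guaranteed to share one feature set.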
Fit a random forest model. Running in parallel with 8 processes on an i7 CPU, training took about 22 minutes.
library(doMC)
registerDoMC(cores = 8)
# Name the outcome by column (classe ~ .) so that `.` expands only to the other columns of `data`
fit <- train(classe~.,data=training, method="rf")
save(fit,file="fit.RData")
load(file = "./fit.RData")
fit$results
##   mtry Accuracy  Kappa AccuracySD  KappaSD
## 1    2   0.9920 0.9899  0.0016031 0.002028
## 2   28   0.9962 0.9952  0.0008519 0.001077
## 3   54   0.9909 0.9885  0.0034958 0.004420
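To read the best tuning value off this table, one can pick the row with the highest resampled accuracy. The sketch below rebuilds the table by hand from the printed output, so the `results` data frame is an illustrative copy rather than the real `fit$results` object.

```r
# Resampling results copied from the printed table (illustrative copy).
results <- data.frame(
  mtry     = c(2, 28, 54),
  Accuracy = c(0.9920, 0.9962, 0.9909),
  Kappa    = c(0.9899, 0.9952, 0.9885)
)

# Row with the highest resampled accuracy.
best <- results[which.max(results$Accuracy), ]
print(best$mtry)

# Corresponding estimated error rate, in percent.
error_pct <- round(100 * (1 - best$Accuracy), 2)
print(error_pct)
```

On the fitted object itself, `fit$bestTune` should report the same selected `mtry`.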
Next, we run a 5-fold cross-validation on the rest of the data, the 30% of the dataset reserved for this purpose. The out-of-sample error rate is expected to be less than 1%, as the best accuracy of the model observed above is 99.62% (mtry = 28).
traincontrol <- trainControl(method = "cv", number = 5)
fit_crossvalidation <- train(classe~.,data=validation, method="rf",trControl=traincontrol)
save(fit_crossvalidation,file="fit_crossvalidation.RData")
load(file="./fit_crossvalidation.RData")
fit_crossvalidation$resample
##   Accuracy  Kappa Resample
## 1   0.9966 0.9957    Fold1
## 2   0.9889 0.9860    Fold3
## 3   0.9889 0.9860    Fold2
## 4   0.9907 0.9882    Fold5
## 5   0.9907 0.9882    Fold4
fit_crossvalidation$results
##   mtry Accuracy  Kappa AccuracySD  KappaSD
## 1    2   0.9832 0.9787   0.002282 0.002890
## 2   28   0.9912 0.9888   0.003160 0.003999
## 3   54   0.9866 0.9830   0.004944 0.006261
confusionMatrix(predict(fit_crossvalidation, newdata=validation), validation$classe)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    4    0    0    0
##          B    3 1128    4    0    0
##          C    0    7 1019    7    0
##          D    0    0    3  956    7
##          E    0    0    0    1 1075
##
## Overall Statistics
##
##                Accuracy : 0.994
##                  95% CI : (0.992, 0.996)
##     No Information Rate : 0.284
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.992
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.998    0.990    0.993    0.992    0.994
## Specificity             0.999    0.999    0.997    0.998    1.000
## Pos Pred Value          0.998    0.994    0.986    0.990    0.999
## Neg Pred Value          0.999    0.998    0.999    0.998    0.999
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.192    0.173    0.162    0.183
## Detection Prevalence    0.285    0.193    0.176    0.164    0.183
## Balanced Accuracy       0.999    0.994    0.995    0.995    0.997
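As a sanity check, the overall accuracy and the per-class sensitivities can be recomputed by hand from the confusion matrix. The sketch below retypes the matrix from the printed output (rows are predictions, columns are reference classes), so `cm` is an illustrative copy and not the object returned by `confusionMatrix()`.

```r
# Confusion matrix copied from the printed output (illustrative copy).
cm <- matrix(
  c(1671,    4,    0,   0,    0,
       3, 1128,    4,   0,    0,
       0,    7, 1019,   7,    0,
       0,    0,    3, 956,    7,
       0,    0,    0,   1, 1075),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over reference totals (column sums).
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 3))
print(round(sensitivity, 3))
```

The diagonal holds 5849 of the 5885 validation cases, which rounds to the reported accuracy of 0.994.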
Indeed, the out-of-bag (OOB) error reported by the final model, which serves as an estimate of the out-of-sample error, is 0.54%:
fit_crossvalidation$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 28
##
## OOB estimate of error rate: 0.54%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1671    3    0   0    0    0.001792
## B    3 1130    6   0    0    0.007902
## C    0    2 1021   3    0    0.004873
## D    0    0   10 954    0    0.010373
## E    0    0    2   3 1077    0.004621
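The 0.54% figure follows directly from this matrix: it is the share of off-diagonal (misclassified) cases. The sketch below retypes the OOB confusion matrix from the printed output, so `oob` is an illustrative copy and not the `fit_crossvalidation$finalModel$confusion` object. Here the rows are the true classes, which is why `class.error` is computed over row sums.

```r
# OOB confusion matrix copied from the printed output (illustrative copy).
oob <- matrix(
  c(1671,    3,    0,   0,    0,
       3, 1130,    6,   0,    0,
       0,    2, 1021,   3,    0,
       0,    0,   10, 954,    0,
       0,    0,    2,   3, 1077),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

# Overall OOB error: misclassified cases over all cases.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: misclassified cases in each row over the row total.
class_error <- 1 - diag(oob) / rowSums(oob)

print(round(100 * oob_error, 2))
print(round(class_error, 6))
```

There are 32 misclassified cases out of 5885, which gives the reported 0.54%.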
Finally, to predict the classe of the testing dataset, we apply the model we've trained and write each prediction to its own file, as advised by the instructor:
test_prediction<-predict(fit, newdata=testing)
test_prediction
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
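pml_write_files <- function(x){
  # Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
  # seq_along() also behaves correctly when x is empty, unlike 1:length(x)
  for(i in seq_along(x)){
    filename <- paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}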
pml_write_files(test_prediction)