Get and clean the data

Load the data, split the labelled set into training (70%) and validation (30%) partitions, then keep only the numeric features that have no missing values and drop the bookkeeping columns X and num_window.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
setwd("/Users/theofpa/Development/datasciencecoursera/ml-project/")
traincsv<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testcsv<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(traincsv, destfile = "pml-training.csv", method = "curl")
download.file(testcsv, destfile = "pml-testing.csv", method = "curl")
trainingraw <- read.table("./pml-training.csv",sep=",",na.strings = c("NA",""),header=TRUE)
testing <- read.table("./pml-testing.csv",sep=",",na.strings = c("NA",""),header=TRUE)

inTrain <- createDataPartition(trainingraw$classe, p=0.70, list=FALSE)
training <- trainingraw[inTrain,]
validation <- trainingraw[-inTrain,]
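
For a factor outcome such as classe, createDataPartition samples within each class, so both partitions keep roughly the same class proportions. A minimal base R sketch of that idea, on made-up labels (the helper stratified_indices and the toy data are hypothetical, not part of caret or of this project):

```r
# Toy stand-in for trainingraw$classe (hypothetical data, not the real file).
set.seed(42)
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_indices <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

in_train <- stratified_indices(classe, p = 0.70)
train_y <- classe[in_train]
valid_y <- classe[-in_train]

# Class proportions should be near-identical in the two partitions.
print(prop.table(table(train_y)))
print(prop.table(table(valid_y)))
```

Sampling per class rather than over all rows avoids a validation set that under-represents a rare class.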

training<-training[,colSums(is.na(training)) == 0]
classe<-training$classe
nums <- sapply(training, is.numeric)
training<-cbind(classe,training[,nums])
training$X<-training$num_window<-NULL

validation<-validation[,colSums(is.na(validation)) == 0]
vclasse<-validation$classe
vnums <- sapply(validation, is.numeric)
validation<-cbind(vclasse,validation[,vnums])
colnames(validation)[1]<-"classe"
validation$X<-validation$num_window<-NULL

testing<-testing[,colSums(is.na(testing)) == 0]
tnums <- sapply(testing, is.numeric)
testing<-testing[,tnums]
testing$X<-testing$num_window<-NULL
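
The three blocks above filter columns separately for each dataset. An alternative, sketched here on a toy data frame, is to decide the predictor set once on the training data and reuse it everywhere, so the partitions cannot end up with different columns. The data frame and its column names (user, roll, pitch, sparse) are invented for illustration:

```r
# Toy data frames standing in for training / validation (hypothetical values).
train_df <- data.frame(
  classe = factor(c("A", "B", "A", "B")),
  user   = c("u1", "u2", "u1", "u2"),
  roll   = c(1.2, 3.4, 5.6, 7.8),
  pitch  = c(0.1, 0.2, 0.3, 0.4),
  sparse = c(NA, NA, 2.0, NA),
  stringsAsFactors = FALSE
)
valid_df <- train_df[c(2, 3), ]

# Decide the predictor set once, on the training data only.
complete_cols <- colSums(is.na(train_df)) == 0
numeric_cols  <- sapply(train_df, is.numeric)
keep <- names(train_df)[complete_cols & numeric_cols]

# Apply the same columns everywhere so the model sees identical predictors.
train_clean <- train_df[, c("classe", keep)]
valid_clean <- valid_df[, c("classe", keep)]

print(names(train_clean))
```

For a dataset without the outcome column, such as testing, the same idea would be testing[, keep].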

Model building

Fit a random forest model. Running in parallel with 8 processes on an i7, training took about 22 minutes.

library(doMC)
registerDoMC(cores = 8)
fit <- train(classe ~ ., data = training, method = "rf")
save(fit,file="fit.RData")
load(file = "./fit.RData")
fit$results
##   mtry Accuracy  Kappa AccuracySD  KappaSD
## 1    2   0.9920 0.9899  0.0016031 0.002028
## 2   28   0.9962 0.9952  0.0008519 0.001077
## 3   54   0.9909 0.9885  0.0034958 0.004420
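
To read the table above, pick the row with the highest resampled accuracy and turn it into an error rate. The sketch below rebuilds only the mtry and Accuracy columns by hand from the printed output; it does not require the fitted object:

```r
# Values copied from the fit$results output above.
results <- data.frame(
  mtry     = c(2, 28, 54),
  Accuracy = c(0.9920, 0.9962, 0.9909)
)

# Row with the best resampled accuracy, and the matching error rate in percent.
best <- results[which.max(results$Accuracy), ]
error_pct <- 100 * (1 - best$Accuracy)

cat("best mtry:", best$mtry, " resampled error (%):", round(error_pct, 2), "\n")
```

Because no trainControl was passed to this first train call, these figures come from caret's default resampling scheme rather than from the held-out validation set.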

Error estimation with cross validation

Using the 30% of the data reserved for validation, we run a 5-fold cross-validation: a second random forest is fitted on the validation set and scored on each held-out fold. The out-of-sample error rate is expected to be less than 1%, as the best resampled accuracy of the model above is 99.62% (mtry = 28).

traincontrol <- trainControl(method = "cv", number = 5)
fit_crossvalidation <- train(classe ~ ., data = validation, method = "rf", trControl = traincontrol)
save(fit_crossvalidation,file="fit_crossvalidation.RData")
load(file="./fit_crossvalidation.RData")
fit_crossvalidation$resample
##   Accuracy  Kappa Resample
## 1   0.9966 0.9957    Fold1
## 2   0.9889 0.9860    Fold3
## 3   0.9889 0.9860    Fold2
## 4   0.9907 0.9882    Fold5
## 5   0.9907 0.9882    Fold4
fit_crossvalidation$results
##   mtry Accuracy  Kappa AccuracySD  KappaSD
## 1    2   0.9832 0.9787   0.002282 0.002890
## 2   28   0.9912 0.9888   0.003160 0.003999
## 3   54   0.9866 0.9830   0.004944 0.006261
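
The two tables above are related: the per-fold accuracies appear to be those of the selected model (mtry = 28), and the Accuracy and AccuracySD columns in that row are their mean and standard deviation. A quick check with the printed values, typed in by hand:

```r
# Per-fold accuracies copied from fit_crossvalidation$resample above.
fold_accuracy <- c(Fold1 = 0.9966, Fold3 = 0.9889, Fold2 = 0.9889,
                   Fold5 = 0.9907, Fold4 = 0.9907)

mean_acc <- mean(fold_accuracy)  # compare with Accuracy for mtry = 28
sd_acc   <- sd(fold_accuracy)    # compare with AccuracySD for mtry = 28

cat("mean accuracy:", round(mean_acc, 4), "\n")
cat("sd of accuracy:", signif(sd_acc, 3), "\n")
```

The standard deviation differs slightly from the printed AccuracySD because the inputs here are already rounded to four decimals.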
confusionMatrix(predict(fit_crossvalidation, newdata=validation), validation$classe)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    4    0    0    0
##          B    3 1128    4    0    0
##          C    0    7 1019    7    0
##          D    0    0    3  956    7
##          E    0    0    0    1 1075
## 
## Overall Statistics
##                                         
##                Accuracy : 0.994         
##                  95% CI : (0.992, 0.996)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.992         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.998    0.990    0.993    0.992    0.994
## Specificity             0.999    0.999    0.997    0.998    1.000
## Pos Pred Value          0.998    0.994    0.986    0.990    0.999
## Neg Pred Value          0.999    0.998    0.999    0.998    0.999
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.192    0.173    0.162    0.183
## Detection Prevalence    0.285    0.193    0.176    0.164    0.183
## Balanced Accuracy       0.999    0.994    0.995    0.995    0.997
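
The summary statistics above follow directly from the confusion matrix. As a sketch, the matrix can be re-entered by hand (rows are predictions, columns are reference classes) and the accuracy, sensitivity and positive predictive value recomputed in base R:

```r
lv <- c("A", "B", "C", "D", "E")

# Counts copied from the confusion matrix above (Prediction x Reference).
cm <- matrix(c(1671,    4,    0,   0,    0,
                  3, 1128,    4,   0,    0,
                  0,    7, 1019,   7,    0,
                  0,    0,    3, 956,    7,
                  0,    0,    0,   1, 1075),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = lv, Reference = lv))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all cases
sensitivity <- diag(cm) / colSums(cm)    # per class: correct / actual members
ppv         <- diag(cm) / rowSums(cm)    # per class: correct / predicted members

print(round(accuracy, 3))
print(round(sensitivity, 3))
print(round(ppv, 3))
```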

Indeed, the out-of-bag (OOB) error estimate of the final model, which like the cross-validation estimate is an out-of-sample estimate, is 0.54%:

fit_crossvalidation$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 28
## 
##         OOB estimate of  error rate: 0.54%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1671    3    0   0    0    0.001792
## B    3 1130    6   0    0    0.007902
## C    0    2 1021   3    0    0.004873
## D    0    0   10 954    0    0.010373
## E    0    0    2   3 1077    0.004621
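
The 0.54% figure and the class.error column can be recovered from the OOB confusion matrix itself. In this matrix the rows are the true classes, so each class error is the share of a row that falls off the diagonal. A sketch with the counts typed in by hand:

```r
lv <- c("A", "B", "C", "D", "E")

# Counts copied from the OOB confusion matrix above (rows = true class).
oob <- matrix(c(1671,    3,    0,   0,    0,
                   3, 1130,    6,   0,    0,
                   0,    2, 1021,   3,    0,
                   0,    0,   10, 954,    0,
                   0,    0,    2,   3, 1077),
              nrow = 5, byrow = TRUE, dimnames = list(lv, lv))

oob_error   <- 1 - sum(diag(oob)) / sum(oob)  # overall OOB error rate
class_error <- 1 - diag(oob) / rowSums(oob)   # per-class error rate

cat("OOB error (%):", round(100 * oob_error, 2), "\n")
print(round(class_error, 6))
```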

Predict the 20 test cases

Finally, to predict the classe of the 20 test cases, we apply the first model (fit) to the testing dataset and write each prediction to its own file, as advised by the instructor:

test_prediction<-predict(fit, newdata=testing)
test_prediction
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
pml_write_files <- function(x) {
  # Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(test_prediction)
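
Before submitting, it can be worth reading the files back to confirm they hold the expected labels. The sketch below uses a hypothetical variant of pml_write_files, named write_predictions, that takes an output directory and writes with writeLines instead of write.table. It writes to a temporary folder so the working directory is left untouched:

```r
# Hypothetical variant of pml_write_files that takes an output directory.
write_predictions <- function(x, dir) {
  for (i in seq_along(x)) {
    filename <- file.path(dir, paste0("problem_id_", i, ".txt"))
    writeLines(as.character(x[i]), filename)
  }
  invisible(length(x))
}

# Predictions copied from the test_prediction output above.
preds <- factor(
  strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]],
  levels = c("A", "B", "C", "D", "E")
)

out_dir <- file.path(tempdir(), "pml_answers")
dir.create(out_dir, showWarnings = FALSE)
write_predictions(preds, out_dir)

# Read every file back, in problem order, for comparison with preds.
read_back <- vapply(seq_along(preds), function(i) {
  readLines(file.path(out_dir, paste0("problem_id_", i, ".txt")))
}, character(1))

print(read_back)
```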