First, we load the testing and training data.
train <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
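A small optional variation: the raw CSV files encode missing values both as NA and as empty strings (and some derived columns contain "#DIV/0!" entries), so re-reading them with an explicit na.strings argument would mark all of these as NA up front and simplify the missing-value counting later. A sketch, not required for the rest of the analysis:
# optional: treat empty strings and spreadsheet error codes as NA on import
train <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
test <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))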
Next, we’ll split the training data into two parts: one to build the model and one to validate it, before applying the model to the test set.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123456)
trainset <- createDataPartition(train$classe, p = 0.8, list = FALSE)
Training <- train[trainset, ]
Validation <- train[-trainset, ]
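As a quick optional sanity check, we can confirm that the split is roughly 80/20 and that createDataPartition preserved the class proportions:
# proportion of rows assigned to the model-building set
nrow(Training) / nrow(train)
# class distribution in the model-building set
prop.table(table(Training$classe))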
Next, the data needs to be cleaned up a bit: we remove features with near-zero variance and features that are mostly missing.
# exclude near zero variance features
nzvcol <- nearZeroVar(Training)
Training <- Training[, -nzvcol]
# exclude columns that are mostly missing, and descriptive columns
# (row index, name, timestamps, window indicators) that should not be predictors
cntlength <- sapply(Training, function(x) {
    sum(!(is.na(x) | x == ""))
})
nullcol <- names(cntlength[cntlength < 0.6 * length(Training$classe)])
descriptcol <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
"cvtd_timestamp", "new_window", "num_window")
excludecols <- c(descriptcol, nullcol)
Training <- Training[, !names(Training) %in% excludecols]
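Optionally, we can verify how many predictors survived the cleanup and that the remaining columns have essentially no missing values:
# columns remaining after removing near-zero-variance, sparse and descriptive features
dim(Training)
# total count of missing values left in the kept columns
sum(is.na(Training))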
Next, we use a random forest to build a model, then check its accuracy on the training data.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
# Build the model (using randomForest's default number of trees)
rfModel <- randomForest(classe ~ ., data = Training, importance = TRUE)
# Test the model against the training set
ptraining <- predict(rfModel, Training)
print(confusionMatrix(ptraining, Training$classe))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 4464    0    0    0    0
##          B    0 3038    0    0    0
##          C    0    0 2738    0    0
##          D    0    0    0 2573    0
##          E    0    0    0    0 2886
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    1.000    1.000    1.000    1.000
## Specificity             1.000    1.000    1.000    1.000    1.000
## Pos Pred Value          1.000    1.000    1.000    1.000    1.000
## Neg Pred Value          1.000    1.000    1.000    1.000    1.000
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.194    0.174    0.164    0.184
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       1.000    1.000    1.000    1.000    1.000
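Since the model was built with importance = TRUE, we can also look at which sensor measurements contribute most to the predictions; an optional sketch using randomForest's varImpPlot:
# plot the ten most important predictors (mean decrease in accuracy)
varImpPlot(rfModel, type = 1, n.var = 10, main = "Top 10 predictors")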
The model classifies the training data perfectly, but that alone tells us little about generalization; to make sure we haven’t ended up with an overfitted model, let’s test it against the validation set.
pvalidation <- predict(rfModel, Validation)
print(confusionMatrix(pvalidation, Validation$classe))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    7    0    0    0
##          B    0  751    4    0    0
##          C    0    1  680    4    0
##          D    0    0    0  639    4
##          E    0    0    0    0  717
##
## Overall Statistics
##
## Accuracy : 0.995
## 95% CI : (0.992, 0.997)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.994
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    0.989    0.994    0.994    0.994
## Specificity             0.998    0.999    0.998    0.999    1.000
## Pos Pred Value          0.994    0.995    0.993    0.994    1.000
## Neg Pred Value          1.000    0.997    0.999    0.999    0.999
## Prevalence               0.284    0.193    0.174    0.164    0.184
## Detection Rate           0.284    0.191    0.173    0.163    0.183
## Detection Prevalence     0.286    0.192    0.175    0.164    0.183
## Balanced Accuracy        0.999    0.994    0.996    0.996    0.997
With 99.5% accuracy (Kappa 0.994) on the validation set, the estimated out-of-sample error is about 0.5%, so the model generalizes well.
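The out-of-sample error can be computed directly from the validation results; one way to do it, sketched here with caret's postResample (it is simply 1 minus the accuracy reported above):
# estimate the out-of-sample error from the held-out validation set
accuracy <- postResample(pvalidation, Validation$classe)
oos_error <- 1 - as.numeric(accuracy["Accuracy"])
oos_error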
Finally, we run the model against the test set and create the files for the submission part of this assignment.
ptest <- predict(rfModel, test)
ptest
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
## Levels: A B C D E
answers <- as.vector(ptest)
# write each prediction to its own problem_id_<i>.txt file for submission
pml_write_files <- function(x) {
    n <- length(x)
    for (i in 1:n) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE,
                    col.names = FALSE)
    }
}
pml_write_files(answers)