Mahmoud Shaaban
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. Here we apply a random forest (RF) algorithm to predict the kind of activity the subject is doing based on the measurements from these devices. We first load the data, split it into training and testing sets, and remove the metadata, near-zero-variance, and mostly-NA variables. Then we apply the RF algorithm and validate the prediction results against the testing set.
Using the provided URLs, we download the datasets and read them into the R objects training and testing.
url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if(!file.exists("pml-training.csv")) {
download.file(url1, "pml-training.csv", method = "curl")
}
training <- read.csv("pml-training.csv") # read training set
if(!file.exists("pml-testing.csv")){
download.file(url2, "pml-testing.csv", method = "curl")
}
testing <- read.csv("pml-testing.csv") # read testing set
We start by splitting the training set into train and test sets for cross-validating the prediction models.
# cross validation
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123)
indTrain <- createDataPartition(y=training$classe, p = 0.70, list=FALSE)
train <- training[indTrain,] ; test <- training[-indTrain,]
dim(train); dim(test)
## [1] 13737 160
## [1] 5885 160
Second, we remove variables that would result in poor prediction: the first seven variables of the data set (metadata), variables with near-zero variance, and variables with more than 70% missing values (NA).
train <- train[,-c(1:7)] # remove metadata
## removing near zero variables
nzv <- nearZeroVar(train, saveMetrics=TRUE)
train <- train[, nzv$nzv==FALSE]
# remove variables with missing values more than 70%
na <- colSums( is.na(train) )
naind <- na/nrow(train) > .7
train <- train[,!naind]
dim(train)
## [1] 13737 53
The number of variables is reduced to 53.
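As an optional sanity check (a sketch, not part of the original workflow; test_reduced is a hypothetical name), the same column selection could be applied to the held-out test split, although predict() will also work on the full data frame because it only looks up the columns used in training.
# hypothetical: subset the test split to the 53 columns kept in train
test_reduced <- test[, names(train)]
dim(test_reduced)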
Here we choose to apply a random forest (RF) algorithm for prediction, because in this case we have many predictor variables and we expect an RF model to handle them well.
## random forest
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(123)
mod <- randomForest(classe ~ ., data=train)
plot(mod, main = "Error Rates for the Random Forest Trees")
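Since the forest is built on many predictors, it can be useful to see which of them it relies on most. A minimal sketch using the variable-importance plot from the randomForest package (not part of the original analysis):
# plot the 20 most influential predictors by mean decrease in Gini
varImpPlot(mod, n.var = 20, main = "Top 20 Variables by Importance")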
Here we cross-validate the results by using the model we built (mod) to predict on the test set, and we show the confusion matrix.
pred <- predict(mod, test)
cm <- confusionMatrix(pred, test$classe)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 6 0 0 0
## B 1 1133 12 0 0
## C 0 0 1014 13 0
## D 0 0 0 950 0
## E 0 0 0 1 1082
##
## Overall Statistics
##
## Accuracy : 0.9944
## 95% CI : (0.9921, 0.9961)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9929
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9947 0.9883 0.9855 1.0000
## Specificity 0.9986 0.9973 0.9973 1.0000 0.9998
## Pos Pred Value 0.9964 0.9887 0.9873 1.0000 0.9991
## Neg Pred Value 0.9998 0.9987 0.9975 0.9972 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1925 0.1723 0.1614 0.1839
## Detection Prevalence 0.2853 0.1947 0.1745 0.1614 0.1840
## Balanced Accuracy 0.9990 0.9960 0.9928 0.9927 0.9999
round(cm$overall['Accuracy'],2) # accuracy
## Accuracy
## 0.99
1-round(cm$overall['Accuracy'],2) # out of sample error rate
## Accuracy
## 0.01
The overall accuracy of the RF model is 0.99 when applied to the test data set for cross-validation, and the out-of-sample error rate equals 0.01. The model appears very accurate, likely due to the abundance of variables we predict with.
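As a complementary check (a sketch, not in the original analysis), the forest's own out-of-bag (OOB) error estimate can be read from the fitted model; it should be close to the hold-out error computed above.
# OOB error rate after the final tree (column "OOB" of the error-rate matrix)
mod$err.rate[mod$ntree, "OOB"]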
Here we predict on the 20 samples in the testing dataset using the RF model.
prediction <- predict(mod, testing)
print(prediction)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E