Practical Machine Learning

Overview

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, to predict which way a lift was performed (the classe variable). More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

Data Source

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source:

http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

Preparations

# Don't forget to set the working directory
# 0. Load all the necessary packages
library(dplyr)
library(lubridate)
library(ggplot2)
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)

In the following section, the data are loaded into the workspace and cleaned. The main purpose of this step is to remove all columns that contain NA values. Additionally, because this project does not involve any time series analysis, the time stamp and identification columns are removed as well.
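
The column filter described above can be tried on a small example first. The sketch below uses a made-up data frame (the name toy and its values are hypothetical, not taken from the project data) and a logical mask instead of a negative index. A logical mask still behaves sensibly when no column contains NA, whereas a negative index of length zero would drop every column.

```r
# Toy data frame standing in for the real data (hypothetical values)
toy <- data.frame(
  id = 1:4,
  roll_belt = c(1.4, 1.5, 1.6, 1.7),
  max_roll_belt = c(NA, NA, 2.1, NA),
  classe = factor(c("A", "B", "A", "C"))
)

# Logical mask: TRUE for every column that contains at least one NA
has.na <- colSums(is.na(toy)) > 0

# Keep only the complete columns
toy.clean <- toy[, !has.na, drop = FALSE]
names(toy.clean)
```

Here only max_roll_belt contains NA, so it is the only column removed.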

# 1. Load the data
test <- read.csv("./pml-testing.csv")
train <- read.csv("./pml-training.csv")

# 2. Data cleaning
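# Create a partition with the training data
# Note: as written, test.set receives the 70% share selected by inTrain and
# train.set the remaining 30%, so the models below are fitted on the smaller
# subset. The outputs reported in this document were produced with this split.
inTrain <- createDataPartition(train$classe, p = 0.7, list = FALSE)
test.set <- train[inTrain, ]
train.set <- train[-inTrain, ]
# Remove the columns that contain NAs (identified from the 20-case test file)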
na.list <- which(colSums(is.na(test)) > 0)
test.set <- test.set[, -na.list]
train.set <- train.set[, -na.list]
test <- test[, -na.list]
# Remove the identification and time stamp columns (the first six columns)
test.set <- test.set[, -c(1:6)]
train.set <- train.set[, -c(1:6)]
test <- test[, -c(1:6)]
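
A common convention is to fit a model on the larger share of a partition and hold out the smaller share for validation. The following is a minimal base-R sketch of a stratified 70/30 split under that convention. It uses the built-in iris data as a stand-in, and the names in.train, training and validation are hypothetical. It is not the createDataPartition call used in this report, only an illustration of sampling within each class.

```r
set.seed(2018)

# Sample 70% of the row indices within each class so that the class
# proportions are preserved in both subsets
in.train <- unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),
  function(idx) sample(idx, size = round(0.7 * length(idx)))
))

training <- iris[in.train, ]     # larger share, used for model fitting
validation <- iris[-in.train, ]  # smaller share, held out for evaluation

table(training$Species)
table(validation$Species)
```

Each of the three species has 50 rows in iris, so every class contributes 35 rows to training and 15 to validation.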

Exploratory Analysis

A heat map of the correlation matrix gives a quick overview of the relationships among the predictor variables.

# 3. Exploratory Analysis
# Heat map
map.Matrix <- cor(train.set[, -c(54)])
heatmap(map.Matrix, scale = 'column')
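
A heat map shows the overall pattern, but it can be hard to read off individual pairs. As a complement, the sketch below lists the variable pairs whose absolute correlation exceeds a threshold. It uses the built-in mtcars data as a stand-in, and the 0.85 cutoff and the names cor.mat, high and pairs are illustrative assumptions.

```r
# Correlation matrix of a numeric data set (mtcars as a stand-in)
cor.mat <- cor(mtcars)

# Blank the diagonal and lower triangle so each pair is listed only once
cor.mat[lower.tri(cor.mat, diag = TRUE)] <- NA

# Row/column positions of the strongly correlated pairs
high <- which(abs(cor.mat) > 0.85, arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cor.mat)[high[, "row"]],
  var2 = colnames(cor.mat)[high[, "col"]],
  r = cor.mat[high]
)

# Strongest correlations first
pairs[order(-abs(pairs$r)), ]
```

The same steps can be applied to map.Matrix to see which accelerometer variables carry largely redundant information.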

Prediction Model Building

Random Forest

# 4. Prediction Model
# 4.1 Random forest
# Set for reproducibility
set.seed(2018)
# Random forest model fit
rf.mod <- randomForest(classe ~., data = train.set, ntree = 1000)
# Prediction
rf.pred <- predict(rf.mod, test.set, type = "class")
# Result evaluation
rf.confmat <- confusionMatrix(rf.pred, test.set$classe)
rf.confmat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3906   22    0    0    0
##          B    0 2630   53    0    0
##          C    0    6 2340   27    0
##          D    0    0    3 2221    2
##          E    0    0    0    4 2523
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9915         
##                  95% CI : (0.9898, 0.993)
##     No Information Rate : 0.2843         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9892         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9895   0.9766   0.9862   0.9992
## Specificity            0.9978   0.9952   0.9971   0.9996   0.9996
## Pos Pred Value         0.9944   0.9802   0.9861   0.9978   0.9984
## Neg Pred Value         1.0000   0.9975   0.9951   0.9973   0.9998
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1915   0.1703   0.1617   0.1837
## Detection Prevalence   0.2859   0.1953   0.1727   0.1620   0.1840
## Balanced Accuracy      0.9989   0.9923   0.9869   0.9929   0.9994
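
The headline statistics can be recomputed by hand from the confusion matrix printed above. The sketch below re-enters the counts (rows are predictions, columns are reference classes) and derives the overall accuracy and the per-class sensitivity. The name rf.table is introduced here for illustration.

```r
# Counts copied from the random forest confusion matrix above
rf.table <- matrix(
  c(3906,   22,    0,    0,    0,
       0, 2630,   53,    0,    0,
       0,    6, 2340,   27,    0,
       0,    0,    3, 2221,    2,
       0,    0,    0,    4, 2523),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (diagonal) over all cases
accuracy <- sum(diag(rf.table)) / sum(rf.table)

# Sensitivity per class: correct predictions over the reference total
sensitivity <- diag(rf.table) / colSums(rf.table)

round(accuracy, 4)     # 0.9915
round(sensitivity, 4)
```

The diagonal holds 13620 of the 13737 cases, which gives the reported accuracy of 0.9915.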

Decision Tree

# 4.2 Decision tree
# Set for reproducibility
set.seed(2018)
# Decision tree model fit
dt.mod <- rpart(classe ~., data = train.set, method = "class")
fancyRpartPlot(dt.mod)

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

# Decision tree prediction
dt.pred <- predict(dt.mod, newdata = test.set, type = "class")
dt.confmat <- confusionMatrix(dt.pred, test.set$classe)
dt.confmat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3374  209    3   24    7
##          B  392 2098  264  442  132
##          C    1  123 1951   63   14
##          D  107  105  142 1555  219
##          E   32  123   36  168 2153
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8103          
##                  95% CI : (0.8036, 0.8168)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7604          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8638   0.7893   0.8143   0.6905   0.8527
## Specificity            0.9753   0.8890   0.9823   0.9501   0.9680
## Pos Pred Value         0.9328   0.6304   0.9066   0.7307   0.8571
## Neg Pred Value         0.9474   0.9462   0.9616   0.9400   0.9669
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2456   0.1527   0.1420   0.1132   0.1567
## Detection Prevalence   0.2633   0.2423   0.1567   0.1549   0.1829
## Balanced Accuracy      0.9195   0.8391   0.8983   0.8203   0.9103
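
The Kappa statistic corrects the observed accuracy for the agreement expected by chance. As a rough check, it can be recomputed from the decision tree confusion matrix printed above. The names dt.table, p.obs and p.exp are introduced here for illustration.

```r
# Counts copied from the decision tree confusion matrix above
dt.table <- matrix(
  c(3374,  209,    3,   24,    7,
     392, 2098,  264,  442,  132,
       1,  123, 1951,   63,   14,
     107,  105,  142, 1555,  219,
      32,  123,   36,  168, 2153),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(dt.table)

# Observed agreement: share of cases on the diagonal
p.obs <- sum(diag(dt.table)) / n

# Chance agreement: product of the row and column marginals, summed over classes
p.exp <- sum(rowSums(dt.table) * colSums(dt.table)) / n^2

kappa <- (p.obs - p.exp) / (1 - p.exp)

round(p.obs, 4)
round(kappa, 4)
```

The observed agreement reproduces the reported accuracy of 0.8103, and the result agrees with the reported Kappa of 0.7604.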

The Random Forest is clearly more accurate: its accuracy is 0.9915, whereas the accuracy of the Decision Tree is only 0.8103.

Model Application

As shown above, the Random Forest is considerably more accurate than the Decision Tree. Hence the Random Forest model is applied to the actual test cases; the Decision Tree predictions are shown only for comparison.

# 5. Final Test
# Random Forest
rf.prediction.test <- predict(rf.mod, newdata = test)
rf.prediction.test

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

# Decision Tree
dt.prediction.test <- predict(dt.mod, newdata = test, type = "class")
dt.prediction.test

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  A  A  A  D  D  B  A  A  D  C  E  A  E  E  A  A  B  E 
## Levels: A B C D E
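
To see where the two models differ on the 20 test cases, the predictions printed above can be compared directly. The sketch below re-enters both output rows as character vectors. The names rf.out, dt.out and agree are introduced here for illustration.

```r
# Predictions copied from the two outputs above
rf.out <- strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]]
dt.out <- strsplit("B A A A A D D B A A D C E A E E A A B E", " ")[[1]]

# TRUE where both models predict the same class
agree <- rf.out == dt.out

sum(agree)      # 14
which(!agree)   # 3 6 11 13 18 20
```

The models agree on 14 of the 20 cases and disagree on cases 3, 6, 11, 13, 18 and 20.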
