When using devices such as Jawbone Up, Nike FuelBand, and Fitbit, people regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project builds a model to predict the manner in which people performed a weight-lifting exercise, and applies it to 20 test cases. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
library(ggplot2)       # plotting
library(caret)         # data partitioning, training, confusion matrices
library(rpart)         # decision trees
library(rattle)        # fancyRpartPlot
library(rpart.plot)    # prp
library(randomForest)  # random forest engine used by caret
library(readr)         # read_csv
library(dplyr)         # data manipulation
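If the CSV files are not already on disk, they can be downloaded first; a minimal sketch, assuming the commonly used mirror URLs for this dataset (verify them before relying on them):
# Assumed mirror URLs; verify before use
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(train_url, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(test_url, "pml-testing.csv")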
# Read local copies of the training and testing data
pml_training <- read_csv("C:/Users/tuuye/Desktop/Data Science/John Hopkins/Machine Learning/pml-training.csv")
pml_testing <- read_csv("C:/Users/tuuye/Desktop/Data Science/John Hopkins/Machine Learning/pml-testing.csv")
set.seed(190594)  # for reproducibility
Check the dimensions of the training data:
dim(pml_training)
## [1] 19622 160
The data frame has 19622 rows (observations) and 160 columns (variables). We will look at the first six rows of the dataset:
head(pml_training)
## # A tibble: 6 x 160
## X1 user_name raw_timestamp_p~ raw_timestamp_p~ cvtd_timestamp
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 carlitos 1323084231 788290 05/12/2011 11~
## 2 2 carlitos 1323084231 808298 05/12/2011 11~
## 3 3 carlitos 1323084231 820366 05/12/2011 11~
## 4 4 carlitos 1323084232 120339 05/12/2011 11~
## 5 5 carlitos 1323084232 196328 05/12/2011 11~
## 6 6 carlitos 1323084232 304277 05/12/2011 11~
## # ... with 155 more variables: new_window <chr>, num_window <dbl>,
## # roll_belt <dbl>, pitch_belt <dbl>, yaw_belt <dbl>,
## # total_accel_belt <dbl>, kurtosis_roll_belt <chr>,
## # kurtosis_picth_belt <chr>, kurtosis_yaw_belt <chr>,
## # skewness_roll_belt <chr>, skewness_roll_belt.1 <chr>,
## # skewness_yaw_belt <chr>, max_roll_belt <dbl>, max_picth_belt <dbl>,
## # max_yaw_belt <chr>, min_roll_belt <dbl>, min_pitch_belt <dbl>,
## # min_yaw_belt <chr>, amplitude_roll_belt <dbl>,
## # amplitude_pitch_belt <dbl>, amplitude_yaw_belt <chr>,
## # var_total_accel_belt <dbl>, avg_roll_belt <dbl>,
## # stddev_roll_belt <dbl>, var_roll_belt <dbl>, avg_pitch_belt <dbl>,
## # stddev_pitch_belt <dbl>, var_pitch_belt <dbl>, avg_yaw_belt <dbl>,
## # stddev_yaw_belt <dbl>, var_yaw_belt <dbl>, gyros_belt_x <dbl>,
## # gyros_belt_y <dbl>, gyros_belt_z <dbl>, accel_belt_x <dbl>,
## # accel_belt_y <dbl>, accel_belt_z <dbl>, magnet_belt_x <dbl>,
## # magnet_belt_y <dbl>, magnet_belt_z <dbl>, roll_arm <dbl>,
## # pitch_arm <dbl>, yaw_arm <dbl>, total_accel_arm <dbl>,
## # var_accel_arm <dbl>, avg_roll_arm <dbl>, stddev_roll_arm <dbl>,
## # var_roll_arm <dbl>, avg_pitch_arm <dbl>, stddev_pitch_arm <dbl>,
## # var_pitch_arm <dbl>, avg_yaw_arm <dbl>, stddev_yaw_arm <dbl>,
## # var_yaw_arm <dbl>, gyros_arm_x <dbl>, gyros_arm_y <dbl>,
## # gyros_arm_z <dbl>, accel_arm_x <dbl>, accel_arm_y <dbl>,
## # accel_arm_z <dbl>, magnet_arm_x <dbl>, magnet_arm_y <dbl>,
## # magnet_arm_z <dbl>, kurtosis_roll_arm <dbl>, kurtosis_picth_arm <chr>,
## # kurtosis_yaw_arm <chr>, skewness_roll_arm <dbl>,
## # skewness_pitch_arm <chr>, skewness_yaw_arm <chr>, max_roll_arm <dbl>,
## # max_picth_arm <dbl>, max_yaw_arm <dbl>, min_roll_arm <dbl>,
## # min_pitch_arm <dbl>, min_yaw_arm <dbl>, amplitude_roll_arm <dbl>,
## # amplitude_pitch_arm <dbl>, amplitude_yaw_arm <dbl>,
## # roll_dumbbell <dbl>, pitch_dumbbell <dbl>, yaw_dumbbell <dbl>,
## # kurtosis_roll_dumbbell <dbl>, kurtosis_picth_dumbbell <dbl>,
## # kurtosis_yaw_dumbbell <chr>, skewness_roll_dumbbell <dbl>,
## # skewness_pitch_dumbbell <dbl>, skewness_yaw_dumbbell <chr>,
## # max_roll_dumbbell <dbl>, max_picth_dumbbell <dbl>,
## # max_yaw_dumbbell <dbl>, min_roll_dumbbell <dbl>,
## # min_pitch_dumbbell <dbl>, min_yaw_dumbbell <dbl>,
## # amplitude_roll_dumbbell <dbl>, amplitude_pitch_dumbbell <dbl>,
## # amplitude_yaw_dumbbell <dbl>, total_accel_dumbbell <dbl>,
## # var_accel_dumbbell <dbl>, avg_roll_dumbbell <dbl>,
## # stddev_roll_dumbbell <dbl>, ...
In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Those variables occupy columns #8 to #159, while column #160, `classe`, is the outcome on which we build the prediction model. The first seven columns are identifiers and timestamps unrelated to our prediction, so we will exclude them from the data.
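As a quick check, the first seven columns are the row index, user name, timestamps, and window indicators visible in the output above; a one-line sketch to list them:
names(pml_training)[1:7]
We select the accelerometer-related columns by name pattern: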
accelerometers <- grep(pattern = "_belt|_arm|_dumbbell|_forearm", names(pml_training))
length(accelerometers)
## [1] 152
We then add column #160 (`classe`) to the new data frame:
data <- pml_training[, c(accelerometers, 160)]
dim(data)
## [1] 19622 153
Next, we check whether the dataset has missing values:
na_count <- colSums(is.na(data))
head(sort(na_count, decreasing = TRUE), n = 6)
## kurtosis_roll_arm skewness_roll_arm kurtosis_roll_dumbbell
## 19294 19293 19221
## max_yaw_dumbbell min_yaw_dumbbell amplitude_yaw_dumbbell
## 19221 19221 19221
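These top counts show that some columns are missing in more than 19000 of the 19622 rows, so they carry almost no information. A complementary filter is caret's nearZeroVar, which flags near-constant predictors; a minimal sketch (not applied in this analysis):
nzv <- nearZeroVar(data)  # indices of near-zero-variance columns, if any
Here we simply drop every column with more than 19000 missing values: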
omitColumns <- which(na_count > 19000)
data <- data[, -omitColumns]
dim(data)
## [1] 19622 53
So the data is reduced from 153 columns to 53. Next, we check the type of each column:
table(sapply(data[1,], class))
##
## character numeric
## 1 52
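The single character column is the outcome `classe`, which read_csv imports as character rather than factor. Converting it up front can be convenient for caret; a minimal sketch (the analysis below instead wraps predictions in table() when building confusion matrices):
data$classe <- factor(data$classe)  # optional: make the outcome a factor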
We have training and testing data but no validation data, so we split the cleaned training set into a pure training set (80%) and a validation set (20%). We will use the validation set to estimate out-of-sample performance in later steps.
inTrain <- createDataPartition(y = data$classe, p = 0.8, list = FALSE)
training <- data[inTrain,]
validation <- data[-inTrain,]
dim(training)
## [1] 15699 53
dim(validation)
## [1] 3923 53
dim(pml_testing)
## [1] 20 160
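Because createDataPartition performs stratified sampling on `classe`, the class proportions should be nearly identical across the splits; a quick sketch to verify (output not shown):
round(prop.table(table(training$classe)), 3)
round(prop.table(table(validation$classe)), 3)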
So we have all the data we need to build the prediction models.
We first fit a predictive model for activity recognition using the decision tree algorithm.
modelTree <- rpart(classe ~., data = training, method = 'class')
prp(modelTree)
# We can draw a prettier tree with the `rattle` package
fancyRpartPlot(modelTree)
Now, we estimate the performance of the model on the validation data set.
predictTree <- predict(modelTree, newdata = validation, type = 'class')
accuracy.tree <- confusionMatrix(table(predictTree, validation$classe))
accuracy.tree
## Confusion Matrix and Statistics
##
##
## predictTree A B C D E
## A 899 107 15 27 8
## B 31 428 60 54 56
## C 27 67 503 102 95
## D 140 137 92 426 100
## E 19 20 14 34 462
##
## Overall Statistics
##
## Accuracy : 0.6928
## 95% CI : (0.6781, 0.7073)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6131
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8056 0.5639 0.7354 0.6625 0.6408
## Specificity 0.9441 0.9365 0.9102 0.8570 0.9728
## Pos Pred Value 0.8513 0.6804 0.6335 0.4760 0.8415
## Neg Pred Value 0.9243 0.8995 0.9422 0.9283 0.9232
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2292 0.1091 0.1282 0.1086 0.1178
## Detection Prevalence 0.2692 0.1603 0.2024 0.2281 0.1399
## Balanced Accuracy 0.8748 0.7502 0.8228 0.7598 0.8068
We see that the accuracy is low (about 69%). Next we will try a random forest: this algorithm automatically selects important variables and is generally robust to correlated covariates and outliers.
# Plot the decision tree confusion matrix
plot(accuracy.tree$table, col = accuracy.tree$byClass, main = paste('Decision Tree - Accuracy = ', round(accuracy.tree$overall['Accuracy'],4)))
# Random forest with 5-fold cross-validation and 250 trees
modelRf <- train(classe ~., data = training, method = 'rf', trControl = trainControl(method = 'cv', 5), ntree = 250)
modelRf$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 250, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 250
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.6%
## Confusion matrix:
## A B C D E class.error
## A 4461 3 0 0 0 0.000672043
## B 15 3016 6 0 1 0.007241606
## C 0 22 2712 4 0 0.009495982
## D 0 0 37 2534 2 0.015157404
## E 0 0 0 4 2882 0.001386001
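The out-of-bag error estimate is only 0.6%, consistent with the cross-validated fit. Random forests also rank the predictors they rely on; a minimal sketch to inspect variable importance with caret (output not shown):
varImp(modelRf)  # top accelerometer variables by importance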
# Apply the model to the validation data
predict.rf <- predict(modelRf, newdata = validation)
rf <- confusionMatrix(table(predict.rf, validation$classe))
rf
## Confusion Matrix and Statistics
##
##
## predict.rf A B C D E
## A 1116 2 0 0 0
## B 0 757 5 0 0
## C 0 0 679 7 0
## D 0 0 0 636 1
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9962
## 95% CI : (0.9937, 0.9979)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9952
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9974 0.9927 0.9891 0.9986
## Specificity 0.9993 0.9984 0.9978 0.9997 1.0000
## Pos Pred Value 0.9982 0.9934 0.9898 0.9984 1.0000
## Neg Pred Value 1.0000 0.9994 0.9985 0.9979 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1930 0.1731 0.1621 0.1835
## Detection Prevalence 0.2850 0.1942 0.1749 0.1624 0.1835
## Balanced Accuracy 0.9996 0.9979 0.9953 0.9944 0.9993
We see that the accuracy is much higher than that of the decision tree model, so we will use this model for prediction.
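The expected out-of-sample error is therefore about 1 - 0.9962 = 0.0038; a quick sketch to compute it from the confusion matrix object:
unname(1 - rf$overall['Accuracy'])  # expected out-of-sample error, ~0.0038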
# Plot the matrix
plot(rf$table, col = rf$byClass, main = paste("Random Forest - Accuracy =", round(rf$overall['Accuracy'], 4)))
We apply `modelRf` to the testing data to predict the 20 test cases:
predict.test <- predict(modelRf, newdata = pml_testing)
predict.test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
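To save each of the 20 predictions to its own file for submission, a minimal sketch; the one-file-per-problem naming scheme is an assumption, not part of the original analysis:
# Write one text file per test case (hypothetical naming scheme)
for (i in seq_along(predict.test)) {
  writeLines(as.character(predict.test[i]), paste0("problem_id_", i, ".txt"))
}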