Summary

This report is the final project for Practical Machine Learning, a course in the Data Science Specialization offered by Johns Hopkins University on Coursera.

Devices such as Jawbone Up, Nike FuelBand, and Fitbit make it possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, my goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how well an exercise was performed.

They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Using the collected data, I will build a machine learning model that assigns each observation to one of five classes (A, B, C, D, or E), which indicate how well the exercise was performed.

Getting and Cleaning Data

Source: Human Activity Recognition
Training Data: pml_training
Test Data: pml_testing

Loading Libraries:

library(tidyverse)
library(caret)
library(randomForest)
library(corrplot)

Downloading Data:

if(!file.exists("pml-training.csv")){
      download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", 
                    destfile = "pml-training.csv")
}
if(!file.exists("pml-testing.csv")){
      download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", 
                    destfile = "pml-testing.csv")
}
pml_training <- read_csv("pml-training.csv", na = c("#DIV/0!", "NA", ""))
pml_testing <- read_csv("pml-testing.csv", na = c("#DIV/0!", "NA", ""))
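
To illustrate what the na argument does, here is a minimal sketch on a tiny inline CSV. It uses base R's read.csv() with na.strings instead of readr's read_csv(), and the two column names and their values are made up for the example; only the three missing-value markers come from the code above.

```r
# Tiny inline CSV (hypothetical values) containing all three markers
# that the report treats as missing: "#DIV/0!", "NA" and the empty string.
csv_text <- "roll_belt,kurtosis_roll_belt\n1.41,#DIV/0!\n,NA\n1.45,0.25"

toy <- read.csv(text = csv_text, na.strings = c("#DIV/0!", "NA", ""))

# Both columns are read as numeric because the markers became NA.
str(toy)

# Count the missing cells and the rows without any missing value.
n_missing <- sum(is.na(toy))
n_complete <- sum(complete.cases(toy))
print(c(missing_cells = n_missing, complete_rows = n_complete))
```

Without "#DIV/0!" in the list of markers, a column containing it would be read as text rather than as numbers.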

Analysing Data:

dim(pml_training)
## [1] 19622   160
table(complete.cases(pml_training))
## 
## FALSE 
## 19622

The table above shows that every row contains at least one missing value (NA). Columns that are mostly missing carry little information for the model and can distort it, so it is better to filter them out.
I will keep a column only if at least 80% of its values are not missing.

# Keep columns in which at most 20% of the values are NA (at least 80% present).
relevent_col <- colnames(pml_training)[colMeans(is.na(pml_training)) <= 0.2]
length(relevent_col)
## [1] 60
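
The same filter can be seen at a smaller scale. The sketch below applies it to a made-up three-column data frame (the values and the name toy are assumptions for illustration), keeping a column only when at least 80% of its values are present.

```r
# Toy data frame (hypothetical values): a complete column, a column that is
# 90% missing, and a column that is 10% missing.
toy <- data.frame(
  roll_belt     = 1:10,
  kurtosis_belt = c(rep(NA, 9), 0.3),
  pitch_belt    = c(NA, 2:10)
)

# Fraction of missing values per column.
na_fraction <- colMeans(is.na(toy))
print(na_fraction)

# Keep columns with at most 20% missing values.
keep <- names(toy)[na_fraction <= 0.2]
print(keep)
```

Here kurtosis_belt is dropped while pitch_belt survives despite its single NA, so a threshold-based filter alone does not guarantee complete rows; that still has to be checked separately.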

I have reduced the number of relevant columns from 160 to 60. Let us look at the names of the columns that were kept.

print(relevent_col)
##  [1] "X1"                   "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
## [13] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [16] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [19] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [22] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [25] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [28] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [31] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [34] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [37] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [40] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [43] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [46] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [49] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [52] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [55] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [58] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

Several of the remaining columns do not need to be included in the model because they add nothing to the prediction: for example, the time at which each observation was taken or the name of the participant. I will keep only the columns that contain data from the devices, plus their classification, ‘classe’.

relevent_col <- relevent_col[grep(pattern = "_belt|_arm|_dumbbell|_forearm", x = relevent_col)]
pml_training <- pml_training[,c(relevent_col, "classe")]
pml_training$classe <- factor(pml_training$classe)
pml_testing <- pml_testing[,relevent_col]
table(complete.cases(pml_training))
## 
##  TRUE 
## 19622
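
As a quick check of what the pattern selects, the sketch below runs it against a handful of the column names printed earlier. The vector cols is a hand-picked subset for illustration, and value = TRUE makes grep() return the matching names instead of their positions.

```r
# A hand-picked subset of the 60 column names listed above.
cols <- c("X1", "user_name", "cvtd_timestamp", "num_window",
          "roll_belt", "gyros_arm_x", "accel_dumbbell_y",
          "magnet_forearm_z", "classe")

# Keep names containing one of the four sensor locations.
sensor_cols <- grep("_belt|_arm|_dumbbell|_forearm", cols, value = TRUE)
print(sensor_cols)

# "classe" does not match the pattern, so it has to be added back explicitly.
model_cols <- c(sensor_cols, "classe")
print(model_cols)
```

Note that "user_name" is not selected: it contains "_name", not "_arm".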

At this point the dataset contains only complete sensor columns and the class label, and I am ready to move on to the next section.

Exploratory Data Analysis

Before starting any analysis, I will split the training set into a training set and a validation set, so that I can try different models without touching the test data and still get an estimate of their accuracy.

set.seed(666)
inTrain <- createDataPartition(pml_training$classe, p = 0.8, list = FALSE)
training <- pml_training[inTrain,]
validation <- pml_training[-inTrain,]
dim(training)
## [1] 15699    53
dim(validation)
## [1] 3923   53
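
createDataPartition() samples within each level of a factor outcome, so both parts keep roughly the same class proportions. The base-R sketch below mimics that idea on a made-up label vector; it is an illustration of stratified sampling under my own assumptions (class sizes, rounding rule), not caret's actual implementation.

```r
set.seed(666)

# Hypothetical class labels with unequal class sizes.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Row indices grouped by class.
rows_by_class <- split(seq_along(classe), classe)

# Sample 80% of the rows inside each class, then pool the indices.
in_train <- sort(unlist(
  lapply(rows_by_class, function(rows) sample(rows, size = round(0.8 * length(rows)))),
  use.names = FALSE
))

train_classe <- classe[in_train]
valid_classe <- classe[-in_train]

# Class counts in each part.
print(table(train_classe))
print(table(valid_classe))
```

Each class contributes 80% of its rows to the training part, which is why the class mix in the validation set matches the one in the full data.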

Now I will check the correlation between every pair of numeric variables in the training set.

cor_matrix <- cor(training[sapply(training, is.numeric)])
corrplot(corr = cor_matrix, order = "FPC", method = "square", tl.cex = 0.45, tl.col = "black", number.cex = 0.25)

From the plot, I can infer that a few variables are highly correlated. Dimension reduction techniques such as PCA could be used to trade some accuracy for speed, but as my model takes less than 2 minutes to fit on my system without preprocessing, I will build it without any preprocessing.
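
Highly correlated pairs can also be listed directly from a correlation matrix instead of being read off the plot. The sketch below does this on simulated data; the variables a, b and c and the 0.9 cutoff are assumptions chosen for the example.

```r
set.seed(42)

# Simulated data: b is a noisy copy of a, c is independent of both.
n <- 200
x <- rnorm(n)
toy <- data.frame(a = x, b = x + rnorm(n, sd = 0.05), c = rnorm(n))

cor_toy <- cor(toy)

# Positions in the upper triangle whose absolute correlation exceeds 0.9.
high <- which(abs(cor_toy) > 0.9 & upper.tri(cor_toy), arr.ind = TRUE)

# One row per highly correlated pair.
high_pairs <- data.frame(
  var1 = rownames(cor_toy)[high[, "row"]],
  var2 = colnames(cor_toy)[high[, "col"]],
  r    = cor_toy[high]
)
print(high_pairs)
```

Restricting the search to the upper triangle avoids reporting each pair twice and skips the diagonal, where every correlation is 1.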

Model Selection and Creation

For a classification problem like this one, with a large dataset and accuracy as the main goal, a random forest is a strong default choice.

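# Record the process time before and after fitting the model.
time1 <- proc.time()
model_rf <- randomForest(classe ~ ., data = training)
time2 <- proc.time()
# time2 is the cumulative time of the R process, so it is an upper bound on
# the fitting time; time2 - time1 would give the fitting time itself.
time2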
##    user  system elapsed 
##   67.06    3.37   72.04
# Check which variables are most important.
varImpPlot(model_rf)

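# Test the model on the validation set to estimate its accuracy.
pred_rf <- predict(model_rf, validation)
# Note: confusionMatrix() expects (data = predictions, reference = truth).
# With the arguments in this order, the rows labelled "Prediction" hold the
# true classes and the columns hold the predictions; accuracy is unaffected.
confusionMatrix(validation$classe, pred_rf)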
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    0    0    0    0
##          B    2  756    1    0    0
##          C    0    4  679    1    0
##          D    0    0    3  640    0
##          E    0    0    0    0  721
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9972         
##                  95% CI : (0.995, 0.9986)
##     No Information Rate : 0.285          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9965         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9947   0.9941   0.9984   1.0000
## Specificity            1.0000   0.9991   0.9985   0.9991   1.0000
## Pos Pred Value         1.0000   0.9960   0.9927   0.9953   1.0000
## Neg Pred Value         0.9993   0.9987   0.9988   0.9997   1.0000
## Prevalence             0.2850   0.1937   0.1741   0.1634   0.1838
## Detection Rate         0.2845   0.1927   0.1731   0.1631   0.1838
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9991   0.9969   0.9963   0.9988   1.0000
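
The headline figures can be reproduced by hand from the 5x5 table. The sketch below re-enters the counts printed above (the object names are mine) and computes overall accuracy as the diagonal sum divided by the total, and per-class sensitivity as each diagonal cell divided by its column total.

```r
# Counts copied from the confusion matrix above, row by row.
cm <- matrix(
  c(1116,   0,   0,   0,   0,
       2, 756,   1,   0,   0,
       0,   4, 679,   1,   0,
       0,   0,   3, 640,   0,
       0,   0,   0,   0, 721),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: cells on the diagonal over all cells.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: diagonal cell over the total of its Reference column.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

Only 11 of the 3923 validation rows fall off the diagonal, and they are all confusions between neighbouring classes.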

Course Project Prediction Quiz

Here I apply the model to the pml_testing dataset:

pred_test <- predict(model_rf, pml_testing)
pred_test
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Result

The validation set was separated from the training data before modeling, so its accuracy is an estimate of out-of-sample performance: 99.72% (95% CI: 99.50% to 99.86%), which corresponds to an estimated out-of-sample error of about 0.28%.
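
The interval reported in the confusion matrix output can be cross-checked with base R. The sketch below assumes it is an exact binomial (Clopper-Pearson) interval, which is what binom.test() computes; the counts are the 3923 validation rows and the 11 misclassified ones read from the table.

```r
# Counts taken from the validation results above.
total <- 3923
misclassified <- 11          # 2 + 1 + 4 + 1 + 3 off-diagonal cells
correct <- total - misclassified

accuracy <- correct / total
oos_error <- 1 - accuracy

# Exact binomial 95% confidence interval for the accuracy.
ci <- as.numeric(binom.test(correct, total)$conf.int)

print(round(c(accuracy = accuracy, error = oos_error), 4))
print(round(ci, 4))
```

An interval this narrow reflects the size of the validation set; it says nothing about how the model would behave on participants or devices that were not part of the original study.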