Executive Summary

This detailed analysis has been performed to fulfill the requirements of the course project for the course Practical Machine Learning offered by the Johns Hopkins University on Coursera. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

Read more about the data set details in the Groupware@LES website.

The main objectives of this project are as follows

Data retrieval, processing and transformation

This section consists of the steps followed to setup our R environment, get the required data, clean and process it.

Setting up required environment in R

In the following code segment, we set the required global options and load the required packages in R.

library(knitr)
opts_chunk$set(cache=TRUE,echo=TRUE)
options(width=120)
library(caret)
library(randomForest)
library(pander)


Setting the working directory

Here we set the working directory as required, it can be changed according to your preference in your personal computer.

setwd("E:/MOOCs/Coursera/Data Science - Specialization/8.   Practical Machine Learning/Course Project")


Getting the required data

The training data is available here: Training Data

The test data is available here: Testing Data

The data can be downloaded using the following code segment.

downloadDataset <- function(URL="", destFile="data.csv"){
  if(!file.exists(destFile)){
    download.file(URL, destFile, method="curl")
  }else{
    message("Dataset already downloaded.")
  }
}

trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
downloadDataset(trainURL, "pml-training.csv")
## Dataset already downloaded.
downloadDataset(testURL, "pml-testing.csv")
## Dataset already downloaded.


Loading the data

Next, we read the file using appropriate functions and load in the data using the following commands. We just check if the data is loaded correctly by viewing a few rows and columns of the data frame.

training <- read.csv("pml-training.csv",na.strings=c("NA",""))
testing <-read.csv("pml-testing.csv",na.strings=c("NA",""))
dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160
training[1:5,c('user_name','classe','num_window','roll_belt','pitch_belt')]
##   user_name classe num_window roll_belt pitch_belt
## 1  carlitos      A         11      1.41       8.07
## 2  carlitos      A         11      1.41       8.07
## 3  carlitos      A         11      1.42       8.07
## 4  carlitos      A         12      1.48       8.05
## 5  carlitos      A         12      1.48       8.07


Processing the data

First, we check how many columns have NA values in the training and testing data and what is the quantity of NA values present.

sum(is.na(training))  # Total NA values
[1] 1921600
t1 <- table(colSums(is.na(training)))
t2 <- table(colSums(is.na(testing)))
pandoc.table(t1, style = "grid", justify = 'left', caption = 'Training data column NA frequencies')


+-----+---------+
| 0   | 19216   |
+=====+=========+
| 60  | 100     |
+-----+---------+

Table: Training data column NA frequencies
pandoc.table(t2, style = "grid", justify = 'left', caption = 'Testing data column NA frequencies')


+-----+------+
| 0   | 20   |
+=====+======+
| 60  | 100  |
+-----+------+

Table: Testing data column NA frequencies

Looking at the above values it is clear that 60 variables have 0 NA values while the rest have NA values for almost all the rows of the dataset, so we are going to ignore them using the following code segment.

# for training dataset
columnNACounts <- colSums(is.na(training))        # getting NA counts for all columns
badColumns <- columnNACounts >= 19000             # ignoring columns with majority NA values
cleanTrainingdata <- training[!badColumns]        # getting clean data
sum(is.na(cleanTrainingdata))                     # checking for NA values
## [1] 0
cleanTrainingdata <- cleanTrainingdata[, c(7:60)] # removing unnecessary columns

# for testing dataset
columnNACounts <- colSums(is.na(testing))         # getting NA counts for all columns
badColumns <- columnNACounts >= 20                # ignoring columns with majority NA values
cleanTestingdata <- testing[!badColumns]        # getting clean data
sum(is.na(cleanTestingdata))                     # checking for NA values
## [1] 0
cleanTestingdata <- cleanTestingdata[, c(7:60)] # removing unnecessary columns

Now since we don’t have any NA values we are ready to do some exploratory analysis and build our prediction model.

Exploratory Data Analysis

We look at some summary statistics and frequency plot for the classe variable.

s <- summary(cleanTrainingdata$classe)
pandoc.table(s, style = "grid", justify = 'left', caption = '`classe` frequencies')


+------+------+------+------+------+
| A    | B    | C    | D    | E    |
+======+======+======+======+======+
| 5580 | 3797 | 3422 | 3216 | 3607 |
+------+------+------+------+------+

Table: `classe` frequencies
plot(cleanTrainingdata$classe,col=rainbow(5),main = "`classe` frequency plot")

plot of chunk unnamed-chunk-7

Model building

In this section, we will build a machine learning model for predicting the classe value based on the other features of the dataset.

Data partitioning

First we partition the cleanTrainingdata dataset into training and testing data sets for building our model using the following code segment.

partition <- createDataPartition(y = cleanTrainingdata$classe, p = 0.6, list = FALSE)
trainingdata <- cleanTrainingdata[partition, ]
testdata <- cleanTrainingdata[-partition, ]


Model building

Now, using the features in the trainingdata dataset, we will build our model using the Random Forest machine learning technique.

#trainInds <- sample(nrow(cleanTrainingdata), 3000)
#trainingdata<-cleanTrainingdata[trainInds,]
model <- train(classe ~ ., data = trainingdata, method = "rf", prox = TRUE, 
               trControl = trainControl(method = "cv", number = 4, allowParallel = TRUE))
model
## Random Forest 
## 
## 11776 samples
##    53 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold) 
## 
## Summary of sample sizes: 8831, 8834, 8832, 8831 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     1         1      0.002        0.002   
##   30    1         1      0.001        0.002   
##   50    1         1      0.004        0.006   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

We build the model using 4-fold cross validation.

In sample accuracy

Here, we calculate the in sample accuracy which is the prediction accuracy of our model on the training data set.

training_pred <- predict(model, trainingdata)
confusionMatrix(training_pred, trainingdata$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3348    0    0    0    0
##          B    0 2279    0    0    0
##          C    0    0 2054    0    0
##          D    0    0    0 1930    0
##          E    0    0    0    0 2165
## 
## Overall Statistics
##                                 
##                Accuracy : 1     
##                  95% CI : (1, 1)
##     No Information Rate : 0.284 
##     P-Value [Acc > NIR] : <2e-16
##                                 
##                   Kappa : 1     
##  Mcnemar's Test P-Value : NA    
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    1.000    1.000    1.000    1.000
## Specificity             1.000    1.000    1.000    1.000    1.000
## Pos Pred Value          1.000    1.000    1.000    1.000    1.000
## Neg Pred Value          1.000    1.000    1.000    1.000    1.000
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.194    0.174    0.164    0.184
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       1.000    1.000    1.000    1.000    1.000

Thus from the above statistics we see that the in sample accuracy value is 1 which is 100%.

Out of sample accuracy

Here, we calculate the out of sample accuracy which is the prediction accuracy of our model on the testing data set.

testing_pred <- predict(model, testdata)
confusionMatrix(testing_pred, testdata$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2228    0    0    0    0
##          B    3 1515    1    0    0
##          C    0    3 1364    1    0
##          D    0    0    3 1284    1
##          E    1    0    0    1 1441
## 
## Overall Statistics
##                                         
##                Accuracy : 0.998         
##                  95% CI : (0.997, 0.999)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.998         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.998    0.998    0.997    0.998    0.999
## Specificity             1.000    0.999    0.999    0.999    1.000
## Pos Pred Value          1.000    0.997    0.997    0.997    0.999
## Neg Pred Value          0.999    1.000    0.999    1.000    1.000
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.284    0.193    0.174    0.164    0.184
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       0.999    0.999    0.998    0.999    0.999

Thus from the above statistics we see that the out of sample accuracy value is 0.998 which is 99.8%.


Prediction Assignment

Here, we apply the machine learning algorithm we built above, to each of the 20 test cases in the testing data set provided.

answers <- predict(model, cleanTestingdata)
answers <- as.character(answers)
answers
##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A" "B" "B" "B"

Finally, we write the answers to files as specified by the course instructor using the following code segment.

pml_write_files = function(x) {
    n = length(x)
    for (i in 1:n) {
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, 
            col.names = FALSE)
    }
}

pml_write_files(answers)

On submission, we would see that all the predicted values are correct and a score of 20\20 is obtained.

Conclusion

We chose Random Forest as our machine learning algorithm for building our model because,

We also obtained a really good accuracy based on the statistics we obtained above.