GitHub Repo

https://github.com/dhivyar/MachineLearningProject

Writeup

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Read Me

  1. Download MachineLearningCoursera.Rmd, pml-training.csv, and pml-testing.csv to a local directory.
  2. Open the .Rmd file in RStudio.
  3. Set the R session’s working directory to that local directory.
  4. Click the “Knit HTML” button to view the HTML file with R results and plots (or knit from the console, as shown below).
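
Equivalently, the file can be knit from the R console instead of the button (a sketch; assumes the rmarkdown package is installed and the working directory is already set):

# Hypothetical console equivalent of the "Knit HTML" button
rmarkdown::render("MachineLearningCoursera.Rmd")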

OR

  1. View the HTML file with the algorithm and results published at: https://github.com/dhivyar/MachineLearningProject/blob/master/CourseProject/ProjectPage.md

Machine Learning Algorithm

Loading the required packages

suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(RANN))
suppressPackageStartupMessages(library(corrplot))
suppressPackageStartupMessages(library(kernlab))
suppressPackageStartupMessages(library(e1071))
suppressPackageStartupMessages(library(randomForest))

Reading the test and train data sets

# Reading the data
train <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")

Analyzing the classes of the variables in the train data set

table(sapply(train,class))
## 
##  factor integer numeric 
##      37      35      88
class(train$classe)
## [1] "factor"

Since the variable to be predicted is of class “factor”, we examine its levels and how the observations are distributed across them.

table(train$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

This factor variable classe has 5 levels: A, B, C, D and E. Linear support vector machines or random forests should work well for building a machine learning algorithm to predict a categorical variable with labeled data and fewer than 100k samples.

Next, it is important to convert all factor variables into numeric ones so that most modeling algorithms can train over them. Note that as.numeric() maps each factor to its underlying integer codes rather than creating one dummy column per level; after this conversion, all 160 variables are of class numeric.

after_dummy <- lapply(train, as.numeric)
after_dummy_test <- lapply(test,as.numeric)
after_dummy <- as.data.frame(after_dummy)
after_dummy_test <- as.data.frame(after_dummy_test)
table(sapply(after_dummy,class))
## 
## numeric 
##     160
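
If true indicator (dummy) columns were preferred over integer codes, caret's dummyVars could be used instead. A minimal sketch, assuming the outcome classe is excluded before encoding:

# Sketch: expand factor predictors into one indicator column per level
factor_cols <- train[, sapply(train, is.factor) & names(train) != "classe"]
dv <- dummyVars(~ ., data = factor_cols)
one_hot <- as.data.frame(predict(dv, newdata = factor_cols))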

Now we see that there are 160 predictors; let's check whether any of them have near-zero variance.

newtrain <- nearZeroVar(after_dummy,saveMetrics=T)
table(newtrain$nzv)
## 
## FALSE  TRUE 
##   100    60
after_nzv <- after_dummy[,newtrain$nzv==FALSE]
after_nzv_test <- after_dummy_test[,newtrain$nzv==FALSE]
rm(train)
rm(test)
after_nzv <- lapply(after_nzv, as.numeric)
after_nzv <- as.data.frame(after_nzv)
after_nzv_test <- lapply(after_nzv_test, as.numeric)
after_nzv_test <- as.data.frame(after_nzv_test)

We see that 60 predictors have near-zero variance (nzv == TRUE), meaning they carry very little predictive information, so it is safe to remove them from the data set. We are now left with 100 predictors.
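
The metrics behind this decision can be inspected directly, since nearZeroVar with saveMetrics=TRUE also reports the frequency ratio and the percentage of unique values for each predictor:

# Inspect why the flagged predictors were dropped
head(newtrain[newtrain$nzv, c("freqRatio", "percentUnique")])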

Next, we treat missing values in the dataset using k-nearest neighbour imputation.

# k nearest neighbour
obj <- preProcess(after_nzv[,-100],method="knnImpute")
table(is.na(after_nzv))
## 
##   FALSE    TRUE 
## 1174344  787856
table(is.na(after_nzv_test))
## 
## FALSE  TRUE 
##  1180   820
summary <- as.data.frame(summary(after_nzv))
missing <- predict(obj,after_nzv[,-100])
missing.1 <- predict(obj,after_nzv_test[,-100])
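
As a quick sanity check (not in the original output), we can confirm that no missing values remain after imputation:

# No NAs should remain after knnImpute
table(is.na(missing))
table(is.na(missing.1))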

Next we find the correlation between features and draw the correlation plot

# Find correlation excluding the variable to be predicted
M <- abs(cor(missing))
M1 <- abs(cor(missing.1))
# Correlation with itself is 1, so resetting that 
diag(M) <- 0
diag(M1) <- 0

Drawing the correlation plot for features with correlation > 80%

corrplot(M>0.8)
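
To list the exact predictor pairs behind the plot, the thresholded matrix can be queried directly (a sketch):

# Index pairs of predictors with absolute correlation > 0.8
high_pairs <- which(M > 0.8, arr.ind = TRUE)
# Keep each unordered pair once
high_pairs <- high_pairs[high_pairs[, 1] < high_pairs[, 2], ]
head(high_pairs)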

Flagging highly correlated predictors and removing them. We are now left with 61 predictors.

# Flagging high correlation
highlyCorr <- findCorrelation(M, cutoff = 0.8)
filteredCorr <- missing[,-highlyCorr]
filteredCorr.t <- missing.1[,-highlyCorr]

Next, one of the most important pre-processing steps would be correcting for skewness by centering and scaling the predictors. However, the knnImpute step has already standardized the predictors to mean 0 and standard deviation 1, so we bypass this step.

head(lapply(filteredCorr, function(x) mean(x)))
## $X
## [1] -2.48146e-17
## 
## $user_name
## [1] -6.284771e-17
## 
## $raw_timestamp_part_1
## [1] -8.60877e-14
## 
## $raw_timestamp_part_2
## [1] 4.582767e-17
## 
## $num_window
## [1] 6.066985e-17
## 
## $pitch_belt
## [1] 1.846961e-17
head(lapply(filteredCorr, function(x) sd(x)))
## $X
## [1] 1
## 
## $user_name
## [1] 1
## 
## $raw_timestamp_part_1
## [1] 1
## 
## $raw_timestamp_part_2
## [1] 1
## 
## $num_window
## [1] 1
## 
## $pitch_belt
## [1] 1
filteredCorr$classe <- after_nzv$classe
filteredCorr.t$classe <- 0
finalTrain <- filteredCorr
finalTest <- filteredCorr.t
# Good practice to keep environment free of unnecessary clutter
rm(after_dummy)
rm(after_nzv)
rm(missing)
rm(M)
rm(highlyCorr)
rm(obj)
rm(after_dummy_test)
rm(after_nzv_test)
rm(missing.1)
rm(M1)
set.seed(1500)
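
Had the predictors not already been standardized, the bypassed centering and scaling step would look like this (a sketch with caret; column 62 is classe):

# Explicit center/scale pre-processing (redundant here, since
# knnImpute already standardized the predictors)
cs <- preProcess(finalTrain[, -62], method = c("center", "scale"))
finalTrain_std <- predict(cs, finalTrain[, -62])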

Cross Validation

Creating training and validation samples from the train data set. We expect the out-of-sample error to be around 1%.

inTrain <- createDataPartition(y=finalTrain$classe,p=0.75,list=F)
train_sample <- finalTrain[inTrain,]
test_sample <- finalTrain[-inTrain,]
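
Here a single 75/25 hold-out split is used. A fuller k-fold cross-validation could be run through caret's train interface; a sketch assuming 5 folds (not part of the original analysis, and considerably slower):

# 5-fold cross-validation via caret
cv_data <- train_sample
cv_data$classe <- as.factor(cv_data$classe)
ctrl <- trainControl(method = "cv", number = 5)
model_cv <- train(classe ~ ., data = cv_data, method = "rf", trControl = ctrl)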

Using a Support Vector Machine classifier, since we are predicting a category from labeled data with fewer than 100k samples. The summary of the model and the confusion matrix are given below. Our accuracy here is:

Support Vector Machine Classifier Accuracy: 0.9984

model_svm <- svm(as.factor(classe)~., data=train_sample)
summary(model_svm)
## 
## Call:
## svm(formula = as.factor(classe) ~ ., data = train_sample)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.01639344 
## 
## Number of Support Vectors:  4678
## 
##  ( 737 1187 1024 896 834 )
## 
## 
## Number of Classes:  5 
## 
## Levels: 
##  1 2 3 4 5
prediction <- predict(model_svm,test_sample[,-62])
confusionMatrix(prediction,test_sample$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5
##          1 1394    0    0    0    0
##          2    0  938    1    0    0
##          3    0    1  864    2    0
##          4    0    0    0  801    2
##          5    0    1    0    1  899
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9984          
##                  95% CI : (0.9968, 0.9993)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9979          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   0.9979   0.9988   0.9963   0.9978
## Specificity            1.0000   0.9997   0.9993   0.9995   0.9995
## Pos Pred Value         1.0000   0.9989   0.9965   0.9975   0.9978
## Neg Pred Value         1.0000   0.9995   0.9998   0.9993   0.9995
## Prevalence             0.2843   0.1917   0.1764   0.1639   0.1837
## Detection Rate         0.2843   0.1913   0.1762   0.1633   0.1833
## Detection Prevalence   0.2843   0.1915   0.1768   0.1637   0.1837
## Balanced Accuracy      1.0000   0.9988   0.9991   0.9979   0.9986
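
The cost and gamma reported above are e1071 defaults. A hedged sketch of tuning them over a small grid with tune.svm (the grid values are assumptions, not from the original analysis):

# Grid search over cost and gamma for the radial kernel
tuned <- tune.svm(as.factor(classe) ~ ., data = train_sample,
                  cost = c(0.1, 1, 10), gamma = c(0.005, 0.0164, 0.05))
summary(tuned)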

Using a Random Forest classifier, since it can predict categorical variables and can train over categorical predictors. The summary of the model and the confusion matrix are given below. Our accuracy here is:

Random Forest Algorithm Accuracy : 0.9998

model_rf <- randomForest(as.factor(classe)~ ., data=train_sample,importance=TRUE,proximity=TRUE)
print(model_rf)
## 
## Call:
##  randomForest(formula = as.factor(classe) ~ ., data = train_sample,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.01%
## Confusion matrix:
##      1    2    3    4    5  class.error
## 1 4186    0    0    0    0 0.0000000000
## 2    0 2857    0    0    0 0.0000000000
## 3    0    1 2556    0    0 0.0003910833
## 4    0    0    0 2412    0 0.0000000000
## 5    0    0    0    0 2706 0.0000000000
prediction_rf <- predict(model_rf,test_sample[,-62])
confusionMatrix(prediction_rf,test_sample$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5
##          1 1394    0    0    0    0
##          2    0  939    0    0    0
##          3    0    0  865    0    0
##          4    0    1    0  804    0
##          5    0    0    0    0  901
## 
## Overall Statistics
##                                      
##                Accuracy : 0.9998     
##                  95% CI : (0.9989, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.9997     
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   0.9989   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   0.9998   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   0.9988   1.0000
## Neg Pred Value         1.0000   0.9997   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1917   0.1764   0.1639   0.1837
## Detection Rate         0.2843   0.1915   0.1764   0.1639   0.1837
## Detection Prevalence   0.2843   0.1915   0.1764   0.1642   0.1837
## Balanced Accuracy      1.0000   0.9995   1.0000   0.9999   1.0000
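
The mtry of 7 reported above is the randomForest default (roughly the square root of the number of predictors). A sketch of searching for a better value with randomForest's tuneRF (column 62 is classe; the parameters shown are assumptions):

# Search mtry around the default, doubling/halving via stepFactor
tuneRF(x = train_sample[, -62], y = as.factor(train_sample$classe),
       ntreeTry = 200, stepFactor = 2, improve = 0.01)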

Final predictions over the test set

prediction_final_svm <- predict(model_svm, finalTest[,-62])
prediction_final <- predict(model_rf, finalTest[,-62])

Variable importance plot showing the predictors the model depends on most; removing them would severely degrade the model.

varImpPlot(model_rf)
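
The underlying importance scores can also be inspected numerically:

# Numeric importance measures (mean decrease in accuracy and Gini)
head(importance(model_rf))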

Output of the prediction analysis

answers <- chartr("12345", "ABCDE", prediction_final)
answers
##  [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
## [18] "B" "A" "B"
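
For the course submission format (one text file per prediction), a sketch of writing the answers out; the helper name and file-name convention are assumptions:

# Hypothetical helper: one problem_id_N.txt file per prediction
write_answer_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_answer_files(answers)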