The data for this project come from http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite the authors, as they have been very generous in allowing their data to be used for this kind of assignment. The data set originated from: Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
For this project, we are given data from accelerometers on the belt, forearm, arm, and dumbbell of 6 research study participants. Our training data consist of accelerometer data and a label identifying the quality of the activity the participant was doing. Our testing data consist of accelerometer data without the identifying label. Our goal is to predict the labels for the test set observations.
Note: the implementation in this project draws on this excellent caret tutorial: https://www.r-project.org/nosvn/conferences/useR-2013/Tutorials/kuhn/user_caret_2up.pdf
The first step is to get the data and check that all the fields in the training and test data are exactly the same.
# Load all the libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#library(rattle)
url_raw_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
file_train <- "pml-training.csv"
# download.file(url = url_raw_train, destfile = file_train, method = "curl")
training <- read.csv(file_train, na.strings = c("NA", "", "#DIV/0!"), header = TRUE)
url_raw_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
file_test <- "pml-testing.csv"
# download.file(url = url_raw_test, destfile = file_test, method = "curl")
testing <- read.csv(file_test, na.strings = c("NA", "", "#DIV/0!"), header = TRUE)
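Since the download.file() calls above are commented out, a small guard makes the script reproducible end to end. A minimal sketch, reusing the variables from the chunk above:
# Download the raw CSVs only if they are not already present locally
if (!file.exists(file_train)) {
  download.file(url = url_raw_train, destfile = file_train, method = "curl")
}
if (!file.exists(file_test)) {
  download.file(url = url_raw_test, destfile = file_test, method = "curl")
}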
# Get column names
colnames_train <- colnames(training)
colnames_test <- colnames(testing)
# Verify that the column names (excluding the last column: classe in
# training, problem_id in testing) are identical in the two data sets.
all.equal(colnames_train[1:(length(colnames_train) - 1)],
          colnames_test[1:(length(colnames_test) - 1)])
## [1] TRUE
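For completeness, the differing columns can be listed explicitly; each setdiff() should return only the final column (output not shown here):
# classe exists only in the training set; problem_id only in the test set
setdiff(colnames_train, colnames_test)
setdiff(colnames_test, colnames_train)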
Next, clean the data: remove near-zero-variance variables, mostly-NA variables, and identifier columns. The cleaned training data will then be split into training and validation sets.
# remove variables with nearly zero variance
nzv <- nearZeroVar(training)
training <- training[, -nzv]
testing <- testing[, -nzv]
# remove variables that are almost always NA
mostlyNA <- sapply(training, function(x) mean(is.na(x))) > 0.90
training <- training[, mostlyNA==F]
testing <- testing[, mostlyNA==F]
# remove variables that don't make much sense for prediction (X, user_name,
# raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, ...), which
# happen to be the first seven variables
training <- training[, -(1:7)]
testing <- testing[, -(1:7)]
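As a quick sanity check (output not shown here), the same columns should have been dropped from both data sets:
# Both frames should now have the same number of columns
dim(training); dim(testing)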
# Predictor information after data cleaning
nearZeroVar(training,saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## pitch_belt 1.036082 9.3772296 FALSE FALSE
## yaw_belt 1.058480 9.9734991 FALSE FALSE
## total_accel_belt 1.063160 0.1477933 FALSE FALSE
## gyros_belt_x 1.058651 0.7134849 FALSE FALSE
## gyros_belt_y 1.144000 0.3516461 FALSE FALSE
## gyros_belt_z 1.066214 0.8612782 FALSE FALSE
## accel_belt_x 1.055412 0.8357966 FALSE FALSE
## accel_belt_y 1.113725 0.7287738 FALSE FALSE
## accel_belt_z 1.078767 1.5237998 FALSE FALSE
## magnet_belt_x 1.090141 1.6664968 FALSE FALSE
## magnet_belt_y 1.099688 1.5187035 FALSE FALSE
## magnet_belt_z 1.006369 2.3290184 FALSE FALSE
## roll_arm 52.338462 13.5256345 FALSE FALSE
## pitch_arm 87.256410 15.7323412 FALSE FALSE
## yaw_arm 33.029126 14.6570176 FALSE FALSE
## total_accel_arm 1.024526 0.3363572 FALSE FALSE
## gyros_arm_x 1.015504 3.2769341 FALSE FALSE
## gyros_arm_y 1.454369 1.9162165 FALSE FALSE
## gyros_arm_z 1.110687 1.2638875 FALSE FALSE
## accel_arm_x 1.017341 3.9598410 FALSE FALSE
## accel_arm_y 1.140187 2.7367241 FALSE FALSE
## accel_arm_z 1.128000 4.0362858 FALSE FALSE
## magnet_arm_x 1.000000 6.8239731 FALSE FALSE
## magnet_arm_y 1.056818 4.4439914 FALSE FALSE
## magnet_arm_z 1.036364 6.4468454 FALSE FALSE
## roll_dumbbell 1.022388 84.2065029 FALSE FALSE
## pitch_dumbbell 2.277372 81.7449801 FALSE FALSE
## yaw_dumbbell 1.132231 83.4828254 FALSE FALSE
## total_accel_dumbbell 1.072634 0.2191418 FALSE FALSE
## gyros_dumbbell_x 1.003268 1.2282132 FALSE FALSE
## gyros_dumbbell_y 1.264957 1.4167771 FALSE FALSE
## gyros_dumbbell_z 1.060100 1.0498420 FALSE FALSE
## accel_dumbbell_x 1.018018 2.1659362 FALSE FALSE
## accel_dumbbell_y 1.053061 2.3748853 FALSE FALSE
## accel_dumbbell_z 1.133333 2.0894914 FALSE FALSE
## magnet_dumbbell_x 1.098266 5.7486495 FALSE FALSE
## magnet_dumbbell_y 1.197740 4.3012945 FALSE FALSE
## magnet_dumbbell_z 1.020833 3.4451126 FALSE FALSE
## roll_forearm 11.589286 11.0895933 FALSE FALSE
## pitch_forearm 65.983051 14.8557741 FALSE FALSE
## yaw_forearm 15.322835 10.1467740 FALSE FALSE
## total_accel_forearm 1.128928 0.3567424 FALSE FALSE
## gyros_forearm_x 1.059273 1.5187035 FALSE FALSE
## gyros_forearm_y 1.036554 3.7763735 FALSE FALSE
## gyros_forearm_z 1.122917 1.5645704 FALSE FALSE
## accel_forearm_x 1.126437 4.0464784 FALSE FALSE
## accel_forearm_y 1.059406 5.1116094 FALSE FALSE
## accel_forearm_z 1.006250 2.9558659 FALSE FALSE
## magnet_forearm_x 1.012346 7.7667924 FALSE FALSE
## magnet_forearm_y 1.246914 9.5403119 FALSE FALSE
## magnet_forearm_z 1.000000 8.5771073 FALSE FALSE
## classe 1.469581 0.0254816 FALSE FALSE
Create a training set (60%) and a validation set (40%) from the given training data.
set.seed(007)
intrain <- createDataPartition(y=training$classe, p=0.6, list=FALSE)
my_training <- training[intrain, ]
my_testing <- training[-intrain, ]
dim(my_training); dim(my_testing)
## [1] 11776 52
## [1] 7846 52
Let's see how some of the remaining predictors with the lowest unique-value percentages relate to one another.
featurePlot(x = training[, c("total_accel_forearm", "total_accel_dumbbell",
                             "total_accel_arm", "gyros_belt_x",
                             "gyros_belt_y", "gyros_belt_z")],
            y = training$classe)
The data cleaning process is now complete. We are left with 51 predictor variables plus the classe outcome (52 columns in total); these predictors will be used to fit different ML models.
As hinted by the professor during the video lectures, decision trees combined with boosting, and random forests, tend to perform best in terms of accuracy, though they can be slow to train. Since our training split has 11,776 records, we will train models based on those algorithms.
I will start with a decision tree.
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
Note: the tree plotting technique is derived from http://blog.revolutionanalytics.com/2013/06/plotting-classification-and-regression-trees-with-plotrpart.html
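The chunk that fits and plots the tree is not echoed in this report. A minimal sketch of an equivalent fit, using rpart directly (which is consistent with the type = "class" prediction below) and rattle's fancyRpartPlot from the link above:
library(rpart)
# Fit a CART model on the 60% training split; this is a reconstruction of
# the un-echoed chunk, not necessarily the exact original call
model_dectree <- rpart(classe ~ ., data = my_training, method = "class")
# Draw the fitted tree
fancyRpartPlot(model_dectree)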
predictions_dectree <- predict(model_dectree, my_testing, type = "class")
confusionMatrix(predictions_dectree, my_testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1975 296 21 105 70
## B 50 743 119 45 163
## C 42 182 954 116 193
## D 143 240 222 944 206
## E 22 57 52 76 810
##
## Overall Statistics
##
## Accuracy : 0.6916
## 95% CI : (0.6812, 0.7018)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6093
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8849 0.4895 0.6974 0.7341 0.5617
## Specificity 0.9124 0.9404 0.9177 0.8764 0.9677
## Pos Pred Value 0.8006 0.6634 0.6416 0.5379 0.7965
## Neg Pred Value 0.9522 0.8848 0.9349 0.9439 0.9075
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2517 0.0947 0.1216 0.1203 0.1032
## Detection Prevalence 0.3144 0.1427 0.1895 0.2237 0.1296
## Balanced Accuracy 0.8986 0.7149 0.8075 0.8052 0.7647
The evaluation matrix shows that our decision tree model achieved an accuracy of just 69.16%, which is not satisfactory in our case. Because the tree achieved far less accuracy than we expected, we will diverge from our initial plan of combining it with boosting and instead apply the random forest algorithm directly to see how it performs.
Let's quickly look at the results of the random forest.
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
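The chunk that fits the model is not echoed in this report. Based on the printed Call below (caret's method = "rf", which tuned mtry to 26), an equivalent fit might look like the following sketch; the 3-fold cross-validation is an assumption made to keep run time reasonable:
# Reconstruction of the un-echoed fitting chunk; the trControl settings
# are an assumption, not taken from the original analysis
model_randfor <- train(classe ~ ., data = my_training, method = "rf",
                       trControl = trainControl(method = "cv", number = 3))
model_randfor$finalModel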
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 26
##
## OOB estimate of error rate: 0.87%
## Confusion matrix:
## A B C D E class.error
## A 3341 5 1 0 1 0.002090800
## B 17 2251 10 0 1 0.012286090
## C 0 20 2027 7 0 0.013145083
## D 0 0 26 1901 3 0.015025907
## E 0 1 4 7 2153 0.005542725
It took my machine around 5 minutes to generate the above model.
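If the five-minute fit becomes a bottleneck, caret can run the resampling in parallel. A minimal sketch, assuming the doParallel package is installed:
library(doParallel)
# Register a parallel backend; caret's train() uses it automatically
# because allowParallel defaults to TRUE in trainControl()
cl <- makePSOCKcluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)
# ... refit the random forest here, as in the sketch above ...
stopCluster(cl)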
predictions_randfor <- predict(model_randfor, my_testing)
confusionMatrix(predictions_randfor, my_testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2229 16 0 0 0
## B 2 1500 4 0 0
## C 0 2 1359 9 1
## D 1 0 5 1274 6
## E 0 0 0 3 1435
##
## Overall Statistics
##
## Accuracy : 0.9938
## 95% CI : (0.9918, 0.9954)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9987 0.9881 0.9934 0.9907 0.9951
## Specificity 0.9971 0.9991 0.9981 0.9982 0.9995
## Pos Pred Value 0.9929 0.9960 0.9912 0.9907 0.9979
## Neg Pred Value 0.9995 0.9972 0.9986 0.9982 0.9989
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1912 0.1732 0.1624 0.1829
## Detection Prevalence 0.2861 0.1919 0.1747 0.1639 0.1833
## Balanced Accuracy 0.9979 0.9936 0.9958 0.9944 0.9973
As this matrix shows, our random forest model achieved 99.38% accuracy on the validation set, for an estimated out-of-sample error of 0.62%, which is far better than the decision tree model we evaluated previously. With accuracy this high, I will use this model on the test set rather than the decision tree.
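The out-of-sample error estimate can also be pulled straight from the confusion matrix object rather than read off the printout; a small sketch (cm_rf is a name of our choosing):
# Estimated out-of-sample error derived from the validation accuracy
cm_rf <- confusionMatrix(predictions_randfor, my_testing$classe)
1 - as.numeric(cm_rf$overall["Accuracy"])  # roughly 0.0062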
Let's use the provided test data set and generate predictions from our best model.
predictions_test <- predict(model_randfor, testing)
predictions_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
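For the course submission, each prediction can be written to its own text file. A minimal sketch; the helper name and file naming scheme are our own choices, not prescribed by the assignment:
# Write one text file per test case (hypothetical helper)
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(predictions_test))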