Coursera Machine Learning

Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behaviour, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The goal of this project is to use machine learning techniques to predict whether users are performing exercises correctly, based on data collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har

Registering Libraries

A few R libraries are necessary to perform the analysis.

## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
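
The startup messages above are consistent with attaching caret (which loads lattice and ggplot2) and randomForest. A minimal sketch of the loading step follows. That count() comes from plyr is an assumption based on the string-style call used later in this report, and the availability check is an illustrative addition, not part of the original code.

```r
# Packages the analysis appears to rely on (plyr is an assumption, see above)
pkgs <- c("caret", "randomForest", "plyr")

# Check which packages are installed before trying to attach them
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Any package reported as not installed can be added with install.packages() before rerunning the report.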

Loading data

We load the training data (available at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv) and split it into training and testing datasets by running the createDataPartition function:

full <-  read.csv("C:\\git\\machinelearning\\pml-training.csv")

# createDataPartition draws a stratified random sample on classe: 75% of the rows go to training
inTrain <- createDataPartition(y = full$classe, p = 0.75, list = FALSE)
train <- full[inTrain,]
test <- full[-inTrain,]
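
To illustrate what a stratified 75/25 split does, here is a base R sketch on a made-up toy data frame. It is an approximation for illustration only, not caret's implementation. The seed value, the toy data, and the names toy, rows_by_class, and in_train are all hypothetical.

```r
set.seed(1234)  # hypothetical seed; the original code does not set one

# Toy stand-in for the real data: 144 rows, five classes of unequal size
toy <- data.frame(
  classe = factor(rep(c("A", "B", "C", "D", "E"), times = c(40, 28, 24, 24, 28))),
  x = rnorm(144)
)

# Draw 75% of the row indices within each class (stratified sampling)
rows_by_class <- split(seq_len(nrow(toy)), toy$classe)
in_train <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), ceiling(0.75 * length(rows)))]
}), use.names = FALSE)

toy_train <- toy[in_train, ]
toy_test  <- toy[-in_train, ]

# Class counts in each piece; the class proportions match the full toy data
table(toy_train$classe)
table(toy_test$classe)
```

Because the sampling happens inside each class, both pieces keep the class balance of the full data. Setting a seed before the real createDataPartition call would likewise make the split repeatable between runs.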

Variable Selection

The full data set has 160 variables, most of which are not useful for prediction because of the huge number of missing (blank or NA) values. So an analysis was run on each of the variables to identify the ones that are suitable for use. The whole analysis code is not included here because it is huge, but as an example, the variable “kurtosis_picth_belt” has 19216 blank values out of 19622 and is therefore not meaningful for the analysis.

head(count(full, "kurtosis_picth_belt"))

##   kurtosis_picth_belt  freq
## 1                     19216
## 2           -0.021887     1
## 3           -0.060755     1
## 4           -0.099173     1
## 5           -0.108371     1
## 6           -0.109078     1
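
The per-variable check described above could be automated roughly as follows. This is only a sketch on a tiny made-up data frame, not the analysis code that was actually run. The toy values, the 0.5 cutoff, and the names toy, missing_share, and keep are assumptions for illustration.

```r
# Toy stand-in for the full data: one complete column, one mostly blank,
# one mostly NA (the values are made up)
toy <- data.frame(
  roll_belt           = c(1.41, 1.42, 1.48, 1.45),
  kurtosis_picth_belt = c("", "", "", "-0.021887"),
  max_roll_belt       = c(NA, NA, NA, -94.3),
  stringsAsFactors    = FALSE
)

# Share of values in each column that are NA or an empty string
missing_share <- vapply(
  toy,
  function(col) mean(is.na(col) | col == ""),
  numeric(1)
)

# Keep only the columns below a (hypothetical) 50% missing threshold
keep <- names(missing_share)[missing_share < 0.5]
keep
```

On the real data, the same share for “kurtosis_picth_belt” would be 19216 / 19622, about 98%, so any reasonable cutoff would drop it.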

So, after the analysis, the unnecessary variables are removed from the training data.

# Keep only the outcome (classe) and the variables found suitable for prediction
train <- train[, c(
  "classe", "num_window",
  "pitch_arm", "pitch_belt", "pitch_dumbbell", "pitch_forearm",
  "roll_arm", "roll_belt", "roll_dumbbell", "roll_forearm",
  "total_accel_arm", "total_accel_belt", "total_accel_dumbbell", "total_accel_forearm",
  "yaw_arm", "yaw_belt", "yaw_dumbbell", "yaw_forearm",
  "accel_arm_x", "accel_arm_y", "accel_arm_z",
  "accel_belt_x", "accel_belt_y", "accel_belt_z",
  "accel_dumbbell_x", "accel_dumbbell_y", "accel_dumbbell_z",
  "accel_forearm_x", "accel_forearm_y", "accel_forearm_z",
  "gyros_arm_x", "gyros_arm_y", "gyros_arm_z",
  "gyros_belt_x", "gyros_belt_y", "gyros_belt_z",
  "gyros_dumbbell_x", "gyros_dumbbell_y", "gyros_dumbbell_z",
  "gyros_forearm_x", "gyros_forearm_y", "gyros_forearm_z",
  "magnet_arm_x", "magnet_arm_y", "magnet_arm_z",
  "magnet_belt_x", "magnet_belt_y", "magnet_belt_z",
  "magnet_dumbbell_x", "magnet_dumbbell_y", "magnet_dumbbell_z",
  "magnet_forearm_x", "magnet_forearm_y", "magnet_forearm_z"
)]

Run Analysis

A model is created using a random forest with 50 trees, and the test data is predicted using the model:

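# Fit a random forest with classe as the outcome and all remaining columns as predictors
fitforest <- randomForest(classe ~ ., data = train, ntree = 50)
# Predict the class of each held-out row
predForest <- predict(fitforest, test)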

Then a confusion matrix is built on the held-out test set to estimate the out-of-sample error (a single hold-out split, not k-fold cross-validation). We can also analyse sensitivity and specificity by class:

cm <- confusionMatrix(predForest, test$classe)
cm$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1395    1    0    0    0
##          B    0  946    1    0    0
##          C    0    2  853    2    0
##          D    0    0    1  801    2
##          E    0    0    0    1  899

cm$byClass

##          Sensitivity Specificity Pos Pred Value Neg Pred Value Prevalence
## Class: A   1.0000000   0.9997150      0.9992837      1.0000000  0.2844617
## Class: B   0.9968388   0.9997472      0.9989440      0.9992418  0.1935155
## Class: C   0.9976608   0.9990121      0.9953326      0.9995058  0.1743475
## Class: D   0.9962687   0.9992683      0.9962687      0.9992683  0.1639478
## Class: E   0.9977802   0.9997502      0.9988889      0.9995005  0.1837276
##          Detection Rate Detection Prevalence Balanced Accuracy
## Class: A      0.2844617            0.2846656         0.9998575
## Class: B      0.1929038            0.1931077         0.9982930
## Class: C      0.1739396            0.1747553         0.9983365
## Class: D      0.1633361            0.1639478         0.9977685
## Class: E      0.1833197            0.1835237         0.9987652
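
The confusion matrix printed earlier can be reduced to a single out-of-sample error estimate. The sketch below re-enters the counts from the cm$table output by hand, so it runs in base R without the fitted model. The names tab, accuracy, oos_error, sensitivity, and specificity are illustrative.

```r
# Counts copied from the cm$table output (rows = prediction, columns = reference)
tab <- matrix(
  c(1395,   1,   0,   0,   0,
       0, 946,   1,   0,   0,
       0,   2, 853,   2,   0,
       0,   0,   1, 801,   2,
       0,   0,   0,   1, 899),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correctly classified rows over all test rows
accuracy  <- sum(diag(tab)) / sum(tab)
oos_error <- 1 - accuracy

# Per-class sensitivity: true positives over all rows that truly belong to the class
sensitivity <- diag(tab) / colSums(tab)

# Per-class specificity: true negatives over all rows that do not belong to the class
false_pos   <- rowSums(tab) - diag(tab)
negatives   <- sum(tab) - colSums(tab)
specificity <- (negatives - false_pos) / negatives

round(c(accuracy = accuracy, oos_error = oos_error), 4)
round(cbind(sensitivity, specificity), 7)
```

With these counts, 4894 of the 4904 test rows are classified correctly, giving an estimated out-of-sample error of roughly 0.2%. When the cm object is available, the same accuracy can be read from cm$overall["Accuracy"].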