Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

Goal

The goal of the project is to predict how the exercise was performed, that is, to predict the “classe” variable in the training set.

Reading the data

Setting the working directory and importing the data

setwd("D:/git/PML")
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
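
A minimal sketch of a more defensive import, assuming the raw files mark missing readings with empty fields or spreadsheet error strings such as "#DIV/0!" in addition to NA. That is an assumption about the file contents, not something shown in this report, and the inline CSV is made-up toy data so the snippet runs without the real files.

```r
# Toy CSV standing in for pml-training.csv (hypothetical values).
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.45,NA,B"

# na.strings lists every string that read.csv should treat as missing.
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# Number of missing values per column.
colSums(is.na(toy))
```

With the real files the same `na.strings` argument would be passed to the two `read.csv()` calls above.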

Importing the libraries

library(lubridate); library(caret); library(randomForest); library(dplyr); library(rpart);library(e1071);

## Warning: package 'lubridate' was built under R version 3.4.2

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

## Warning: package 'caret' was built under R version 3.4.2

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.4.2

## Warning: package 'randomForest' was built under R version 3.4.2

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## Warning: package 'dplyr' was built under R version 3.4.2

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Warning: package 'e1071' was built under R version 3.4.2

Cleaning the data

Many columns contain only 0 or NA values. I remove them based on the test set: any column with missing values in the test data is dropped from both data sets, because only the test-set columns will be available for prediction. The model is therefore built only with variables that are present in the test set.

datacleantraining <- training[,colSums(is.na(testing)) == 0]
datacleantesting <- testing[,colSums(is.na(testing)) ==0]

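# changing the time and date format
# (assuming the timestamps are stored as day/month/year; with "%m/%d/%Y" any day above 12 parses to NA)
# note that as.Date keeps only the date and drops the time of day
datacleantraining$cvtd_timestamp <- as.Date(datacleantraining$cvtd_timestamp, format = "%d/%m/%Y %H:%M")

datacleantesting$cvtd_timestamp <- as.Date(datacleantesting$cvtd_timestamp, format = "%d/%m/%Y %H:%M")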
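
The column filter above can be illustrated on toy data. This is a sketch with hypothetical values; the column names are only stand-ins, and it assumes (as the filter itself does) that both data frames have their columns in the same order.

```r
# Toy stand-ins for the training and test sets (hypothetical values).
toy_train <- data.frame(roll_belt     = c(1.4, 1.5, 1.6),
                        max_roll_belt = c(NA, 2.1, NA),
                        classe        = c("A", "B", "A"))
toy_test  <- data.frame(roll_belt     = c(1.7, 1.8),
                        max_roll_belt = c(NA, NA),
                        problem_id    = 1:2)

# TRUE for every column that has no missing value in the test set.
keep <- colSums(is.na(toy_test)) == 0

# The same logical mask is applied by position to both data frames.
names(toy_train)[keep]
names(toy_test)[keep]
```

Because the mask is positional, a column that is complete in the test set but partly missing in the training set would still be kept; checking `colSums(is.na(datacleantraining))` afterwards would confirm whether that happens.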

Exploratory Analysis

table(datacleantraining$classe, datacleantraining$user_name)

##    
##     adelmo carlitos charles eurico jeremy pedro
##   A   1165      834     899    865   1177   640
##   B    776      690     745    592    489   505
##   C    750      493     539    489    652   499
##   D    515      486     642    582    522   469
##   E    686      609     711    542    562   497

qplot(datacleantraining$classe, xlab = "Activity Class", fill = datacleantraining$user_name)

qplot(datacleantraining$user_name, xlab = "User", fill = datacleantraining$classe)

Class A is the most frequent activity class, followed by B and E. Among the users, “Adelmo” has the most observations, followed by “Charles” and “Jeremy”.
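
The ranking can be checked by summing the rows and columns of the table printed above. The sketch below re-enters those counts by hand; on the real data, `rowSums()` and `colSums()` could be applied directly to the result of `table()`.

```r
# Counts copied from the classe-by-user table above.
counts <- matrix(
  c(1165, 834, 899, 865, 1177, 640,
     776, 690, 745, 592,  489, 505,
     750, 493, 539, 489,  652, 499,
     515, 486, 642, 582,  522, 469,
     686, 609, 711, 542,  562, 497),
  nrow = 5, byrow = TRUE,
  dimnames = list(classe = c("A", "B", "C", "D", "E"),
                  user   = c("adelmo", "carlitos", "charles",
                             "eurico", "jeremy", "pedro"))
)

# Observations per activity class, largest first.
sort(rowSums(counts), decreasing = TRUE)

# Observations per user, largest first.
sort(colSums(counts), decreasing = TRUE)
```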

Of the available variables (regressors), I want to build my model only with regressors that depend on the sensor readings and not on the date or the user.

For the model to be effective, I believe it should be user agnostic and should be able to predict the class of activity irrespective of the time.

Further cleaning of the variables

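# NOTE: "dumbell" is spelled with a single "b" here, so columns named with "dumbbell" are not matched
filter1 <- grepl("belt|arm|dumbell|classe", names(datacleantraining))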
datacleantraining <- datacleantraining[,filter1]
datacleantesting <- datacleantesting[,filter1]
dim(datacleantraining)

## [1] 19622    40

Hence, from the 160 initial columns we are now down to 40: 39 predictor variables plus the outcome variable, which we will use to construct our model.
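
Since `grepl()` does substring matching, the exact spelling of each alternative in the pattern decides which columns survive. The sketch below uses hypothetical column names (the real names should be checked with `names(datacleantraining)`) to show two effects: "arm" also matches "forearm", and a single-"b" "dumbell" does not match a column spelled "dumbbell".

```r
# Hypothetical column names, for illustration only.
cols <- c("user_name", "roll_belt", "pitch_arm", "yaw_forearm",
          "roll_dumbbell", "classe")

# Pattern with "dumbell" (single "b").
single_b <- grepl("belt|arm|dumbell|classe", cols)
cols[single_b]

# Pattern with "dumbbell" (double "b").
double_b <- grepl("belt|arm|dumbbell|classe", cols)
cols[double_b]
```

Comparing `sum(filter1)` against the number of sensor columns expected is a quick way to confirm that a pattern selects what was intended.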

Before constructing the model, I want to check for zero-variance and near-zero-variance regressors among the set of variables.

zerovariance <- nearZeroVar(datacleantraining, saveMetrics = TRUE)
summary(zerovariance$zeroVar); summary(zerovariance$nzv)

##    Mode   FALSE 
## logical      40

##    Mode   FALSE 
## logical      40

None of the variables has zero or near-zero variance. Hence we will use all 39 variables (40 columns minus the outcome variable) to construct our predictive model.
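
To make the check less of a black box, here is a base R sketch of the two metrics a near-zero-variance screen relies on: the frequency ratio of the two most common values and the percentage of unique values. The cutoffs (95/5 and 10) are assumed to mirror caret's defaults, and `nzv_metrics` is a hypothetical helper written for illustration, not caret's actual code.

```r
# Hypothetical helper: near-zero-variance metrics for one vector.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Count of the most common value divided by the count of the runner-up.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # Distinct values as a percentage of the number of observations.
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) == 1
  data.frame(freqRatio     = freq_ratio,
             percentUnique = percent_unique,
             zeroVar       = zero_var,
             nzv = zero_var || (freq_ratio > freq_cut && percent_unique < unique_cut))
}

constant   <- rep(1, 100)          # a single value: zero variance
mostly_one <- c(rep(1, 99), 2)     # dominated by one value: near-zero variance
varied     <- seq_len(100)         # all values distinct: kept

rbind(constant   = nzv_metrics(constant),
      mostly_one = nzv_metrics(mostly_one),
      varied     = nzv_metrics(varied))
```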

Building the Model

I will split the training set further into a training subset and a held-out test subset, which I will use for validation and testing.

inTrain <-  createDataPartition(datacleantraining$classe , p = 0.5, list = FALSE)
train1 <- datacleantraining[inTrain,]; test1 <- datacleantraining[-inTrain,]

Ideally I would have put 60% of the data in the training subset, but my machine crashed three times because of resource and RAM constraints, so I used 50% instead.
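
The idea behind a partition on the outcome is that sampling happens within each class, so the class proportions are roughly preserved in both halves. The following base R sketch illustrates that idea on toy data; it is not caret's implementation, and the seed and class sizes are arbitrary.

```r
set.seed(123)  # arbitrary seed, only so the toy example is repeatable

# Toy outcome with unequal class sizes (hypothetical data).
classe <- factor(rep(c("A", "B", "C"), times = c(60, 30, 10)))

# Sample half of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.5 * length(idx)))]
}), use.names = FALSE))

# Class counts in the two halves.
table(classe[in_train])
table(classe[-in_train])
```

Calling `set.seed()` before the real `createDataPartition()` call would likewise make the split, and therefore the reported results, repeatable.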

Using a random forest model

fit1 <- train(classe ~ . , data = train1, method = "rf")

Checking the model

fit1$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 20
## 
##         OOB estimate of  error rate: 1.4%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 2782    4    1    2    1 0.002867384
## B   11 1867   18    1    2 0.016850974
## C    0   35 1655   19    2 0.032729398
## D    0    4   23 1577    4 0.019278607
## E    0    3    2    5 1794 0.005543237

We see that the OOB estimate of the error rate is 1.4%.
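
The OOB error rate and the per-class errors can be recomputed from the confusion matrix printed above. The sketch re-enters the counts by hand and assumes, as in the printed output, that rows are the actual classes and columns the predicted ones.

```r
# OOB confusion matrix copied from the model output above.
oob <- matrix(
  c(2782,    4,    1,    2,    1,
      11, 1867,   18,    1,    2,
       0,   35, 1655,   19,    2,
       0,    4,   23, 1577,    4,
       0,    3,    2,    5, 1794),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall error: share of observations off the diagonal.
oob_error <- 1 - sum(diag(oob)) / sum(oob)
round(100 * oob_error, 2)

# Per-class error: misclassified share within each actual class.
class_error <- 1 - diag(oob) / rowSums(oob)
round(class_error, 4)
```

Here 137 of the 9812 out-of-bag predictions are wrong, which rounds to 1.4%.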

Predicting on the Training set that we created

Now to check the accuracy of the predictions

pred1 <- predict(fit1, train1)
confusionMatrix(pred1, train1$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2790    0    0    0    0
##          B    0 1899    0    0    0
##          C    0    0 1711    0    0
##          D    0    0    0 1608    0
##          E    0    0    0    0 1804
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9996, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1839
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

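Here we get 100% accuracy, which is expected when a random forest predicts the same data it was trained on; it says little about out-of-sample performance.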

Predicting on the test set that we created

We now apply the model to the test set that we created with the partition.

pred2 <- predict(fit1, newdata = test1)
confusionMatrix(pred2, test1$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2773   14    0    0    0
##          B   11 1870   22    3    1
##          C    3   10 1670   27    0
##          D    1    1   18 1575    2
##          E    2    3    1    3 1800
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9876          
##                  95% CI : (0.9852, 0.9897)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9843          
##  Mcnemar's Test P-Value : 0.1037          
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9939   0.9852   0.9760   0.9795   0.9983
## Specificity            0.9980   0.9953   0.9951   0.9973   0.9989
## Pos Pred Value         0.9950   0.9806   0.9766   0.9862   0.9950
## Neg Pred Value         0.9976   0.9965   0.9949   0.9960   0.9996
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2827   0.1906   0.1702   0.1606   0.1835
## Detection Prevalence   0.2841   0.1944   0.1743   0.1628   0.1844
## Balanced Accuracy      0.9960   0.9903   0.9855   0.9884   0.9986

On the test set the model gives an accuracy of 98.76% (an estimated out-of-sample error of about 1.2%), which is a good fit.
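
The headline figures can be reproduced from the confusion matrix above with base R. This sketch re-enters the counts by hand (rows are predictions, columns are the reference classes) and uses an exact binomial interval from `binom.test()`, which is assumed here to be how the reported 95% CI was obtained.

```r
# Held-out confusion matrix copied from the output above.
cm <- matrix(
  c(2773,   14,    0,    0,    0,
      11, 1870,   22,    3,    1,
       3,   10, 1670,   27,    0,
       1,    1,   18, 1575,    2,
       2,    3,    1,    3, 1800),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

correct  <- sum(diag(cm))   # correctly classified observations
total    <- sum(cm)         # all held-out observations
accuracy <- correct / total

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int

# Sensitivity per class: correct predictions over reference totals.
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, error = 1 - accuracy), 4)
round(ci, 4)
round(sensitivity, 4)
```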

Charts

plot(fit1$finalModel, main = "Final Model")

plot(fit1)

Deriving the Quiz answers

pred3 <- predict(fit1, newdata = datacleantesting)
pred3

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

These were the final answers I used.

Practical Machine Learning

Sourav

23 November 2017