Abstract

The main aim of this project is to predict a behaviour pattern, labelled by the “classe” variable, from exercise activities. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. Data collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants will be used for the machine learning task. The training dataset is available at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and the testing dataset at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv.

Download required packages and data

We shall load the required packages and download the data.

# Required packages
library(caret)
library(ggplot2)
library(lattice)
library(rattle)
library(rpart.plot)
library(kernlab)
library(randomForest)
library(MASS)
set.seed(234)
PmlTraining <- read.table("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", header = TRUE, sep = ",", dec = ".", na.strings=c("NA","#DIV/0!",""))
pmlTesting <-  read.table("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", header = TRUE, sep = ",", dec = ".", na.strings=c("NA","#DIV/0!",""))
# str(PmlTraining)  # uncomment to inspect the dataset structure
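
Reading straight from the URLs requires an internet connection each time the analysis is run; if that is a concern, one alternative (a sketch, with illustrative local file names) is to cache copies on disk first:

# Optional: download local copies once, then read from disk
if (!file.exists("pml-training.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile = "pml-training.csv")
}
if (!file.exists("pml-testing.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile = "pml-testing.csv")
}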

Data cleaning

We process the data for analysis by removing near-zero-variance variables, variables with missing values, and the identifier columns that are not sensor measurements.

nzv <- nearZeroVar(PmlTraining, saveMetrics = TRUE)
PmlTraining <- PmlTraining[, nzv$nzv==FALSE]
Good <- names(which(colSums(is.na(PmlTraining)) ==0))
PmlTraining1 <- subset(PmlTraining, select = Good)
# Remove the first seven variables (identifiers and timestamps) to avoid interference.
Training1 <- PmlTraining1[,-c(1:7)]

# Convert all variables to numeric, except the classe variable
Training1[, 1:51] <- lapply(Training1[, 1:51], as.numeric)

dim(Training1)
## [1] 19622    52
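
A quick sanity check confirms that the cleaning left no missing values:

sum(is.na(Training1))  # expected: 0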

Splitting the training dataset

set.seed(234)
inTrain <- createDataPartition(y=Training1$classe, p=0.75, list=FALSE)
training <- Training1[inTrain,]; validation <- Training1[-inTrain,]
dim(training);dim(validation)
## [1] 14718    52
## [1] 4904   52
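
createDataPartition samples within each level of classe, so the class proportions should be approximately preserved in both splits:

round(prop.table(table(training$classe)), 3)
round(prop.table(table(validation$classe)), 3)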

Exploratory analysis

# Due to limited space, the plots are not shown.
# Check covariance and correlation with: cov(training[, 1:51]); cor(training[, 1:51])
# featurePlot(x = Training1[, 1:51], y = Training1$classe, plot = "pairs")

Cross validation

We create several models on the training dataset and estimate their accuracy on the validation set:

1. Set up a test harness using 10-fold cross-validation.
2. Build four different models to predict “classe” from the training set.
3. Select the best model to run on the testing dataset.

The algorithms are run with 10-fold cross-validation: the dataset is split into 10 parts, the model is trained on 9 and tested on the remaining 1, and the process repeats until each part has been held out once.
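
For intuition, here is a minimal sketch of what the 10-fold harness does under the hood (train() performs this, plus parameter tuning, for us):

folds <- createFolds(training$classe, k = 10)
cvAcc <- sapply(folds, function(idx) {
  fit <- rpart(classe ~ ., data = training[-idx, ])       # train on 9 folds
  pred <- predict(fit, training[idx, ], type = "class")   # predict the held-out fold
  mean(pred == training[idx, ]$classe)                    # hold-out accuracy
})
mean(cvAcc)  # average accuracy across the 10 folds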

Fitting four different models.

Model 1 : Decision tree

set.seed(234)
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
fittree <- train(classe~., method = "rpart", data = training, metric = metric, trControl = control)
fancyRpartPlot(fittree$finalModel)
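
The cross-validated accuracy for each candidate complexity parameter is stored in the fitted object:

fittree$results  # cp, Accuracy and Kappa for each candidate tree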

Model 2 : Random Forest

set.seed(234)
fit2Rf <- randomForest(classe~.,data=training, ntree=200, importance=TRUE)
plot(fit2Rf)
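
Since the forest was grown with importance = TRUE, we can also inspect which sensor variables drive the classification:

varImpPlot(fit2Rf, n.var = 10, main = "Top 10 predictors (random forest)")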

Model 3 : Linear Discriminant Analysis

set.seed(234)
fitlda <- train(classe~., data = training, method = "lda", metric = metric, trControl = control)

Model 4 : k-Nearest Neighbors (kNN)

set.seed(234)
fitknn <- train(classe~., data = training, method = "knn", metric = metric, trControl = control)

Predict on the validation set

pred1 <- predict(fittree, validation)
pred2 <- predict(fit2Rf, validation)
pred3 <- predict(fitlda, validation)
pred4 <- predict(fitknn, validation)
predDf <- data.frame(pred1, pred2, pred3, pred4, classe = validation$classe)
CombMod <- train(classe~., method = "rf", data = predDf)
pred5 <- predict(CombMod, predDf)
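
Note that the stacked model is both trained and scored on the same validation predictions, so its accuracy estimate below is likely optimistic. A quick cross-tabulation shows how often stacking actually changes the random-forest answer:

table(RF = pred2, Stacked = pred5)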


# Rows: 1 = decision tree, 2 = random forest, 3 = LDA, 4 = kNN, 5 = stacked model
rbind(postResample(pred1, obs = validation$classe), postResample(pred2, obs = validation$classe), postResample(pred3, obs = validation$classe), postResample(pred4, obs = validation$classe), postResample(pred5, obs = validation$classe))
##       Accuracy     Kappa
## [1,] 0.6088907 0.5004650
## [2,] 0.9946982 0.9932931
## [3,] 0.6929038 0.6110244
## [4,] 0.9757341 0.9692996
## [5,] 0.9951060 0.9938092
AccuTest <- confusionMatrix(pred2, validation$classe)
AccuTest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    5    0    0    0
##          B    1  942    4    0    0
##          C    0    2  851   13    0
##          D    0    0    0  791    1
##          E    0    0    0    0  900
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9922, 0.9965)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9933          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9926   0.9953   0.9838   0.9989
## Specificity            0.9986   0.9987   0.9963   0.9998   1.0000
## Pos Pred Value         0.9964   0.9947   0.9827   0.9987   1.0000
## Neg Pred Value         0.9997   0.9982   0.9990   0.9968   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1921   0.1735   0.1613   0.1835
## Detection Prevalence   0.2853   0.1931   0.1766   0.1615   0.1835
## Balanced Accuracy      0.9989   0.9957   0.9958   0.9918   0.9994
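
The estimated out-of-sample error follows directly from the validation accuracy:

OutOfSampleError <- 1 - as.numeric(AccuTest$overall["Accuracy"])
round(100 * OutOfSampleError, 2)  # estimated out-of-sample error, in percent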

From the comparison above, the Random Forest model gave the best result, with 99.47% accuracy on the validation set and hence an estimated out-of-sample error of 0.53%, essentially matching the stacked (combined) model. For this project we shall therefore use the Random Forest model to predict on the testing dataset.

Predicting on the testing dataset

pred5T<- predict(fit2Rf, newdata = pmlTesting)
pred5T
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E