The main aim of this project is to predict a behavior pattern, labelled by the "classe" variable, from exercise activity data. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. Data collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants will be used for this machine learning project. The datasets are available at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv (training set) and https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv (testing set).

## Download required packages and data

We load the required packages and download the data.
# Required packages
library(caret)
library(ggplot2)
library(lattice)
library(rattle)
library(rpart.plot)
library(kernlab)
library(randomForest)
library(MASS)
set.seed(234)
PmlTraining <- read.table("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", header = TRUE, sep = ",", dec = ".", na.strings=c("NA","#DIV/0!",""))
pmlTesting <- read.table("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", header = TRUE, sep = ",", dec = ".", na.strings=c("NA","#DIV/0!",""))
# str(PmlTraining)  # uncomment to inspect the dataset structure
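Reading straight from the URLs requires a network connection on every run. A minimal alternative sketch caches the files locally first; the local file names here are our own choice:

# Download once, then read from disk on later runs (file names are arbitrary)
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainUrl, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(testUrl, "pml-testing.csv")
PmlTraining <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
pmlTesting <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))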
## Processing the data

We prepare the data for analysis by removing near-zero-variance variables, variables with missing values, and identifier columns.
# Drop near-zero-variance predictors
nzv <- nearZeroVar(PmlTraining, saveMetrics = TRUE)
PmlTraining <- PmlTraining[, nzv$nzv == FALSE]
# Keep only the columns with no missing values
Good <- names(which(colSums(is.na(PmlTraining)) == 0))
PmlTraining1 <- subset(PmlTraining, select = Good)
# Remove the first seven variables (identifiers and timestamps) to avoid interference.
Training1 <- PmlTraining1[,-c(1:7)]
# Coerce all predictor columns to numeric; the classe outcome stays a factor
Training1[, 1:51] <- lapply(Training1[, 1:51], as.numeric)
dim(Training1)
## [1] 19622 52
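As a quick sanity check (our own addition, not in the original analysis), we can confirm that no missing values remain and that the outcome is still a factor:

sum(is.na(Training1))    # should be 0
class(Training1$classe)  # should be "factor"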
set.seed(234)
inTrain <- createDataPartition(y=Training1$classe, p=0.75, list=FALSE)
training <- Training1[inTrain,]; validation <- Training1[-inTrain,]
dim(training);dim(validation)
## [1] 14718 52
## [1] 4904 52
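It is also worth confirming that the partition preserved the class balance; a small check of our own:

round(prop.table(table(training$classe)), 3)    # proportion of each classe in training
round(prop.table(table(validation$classe)), 3)  # and in validation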
# Plots are omitted here to save space.
# Covariance and correlation can be checked with cov(training[, 1:51]) and cor(training[, 1:51])
# featurePlot(x = Training1[, 1:51], y = Training1$classe, plot = "pairs")
We will create several models on the training dataset and estimate their accuracy on the validation set:

1. Set up a test harness using 10-fold cross-validation.
2. Build four different models to predict "classe" from the training set.
3. Select the best model to run on the testing dataset.
We run the algorithms with 10-fold cross-validation: the dataset is split into 10 parts, the model is trained on 9 and tested on the remaining 1, and this repeats across all folds.

### Model 1: Classification Tree (rpart)
set.seed(234)
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
fittree <- train(classe~., method = "rpart", data = training, metric = metric, trControl = control)
fancyRpartPlot(fittree$finalModel)
### Model 2: Random Forest

set.seed(234)
fit2Rf <- randomForest(classe ~ ., data = training, ntree = 200, importance = TRUE)
plot(fit2Rf)
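Because the forest was fitted with importance = TRUE, we can also see which predictors drive it. A short sketch (our own addition):

# Plot the 10 most important predictors (mean decrease in accuracy and in Gini)
varImpPlot(fit2Rf, n.var = 10, main = "Random forest variable importance")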
### Model 3: Linear Discriminant Analysis
set.seed(234)
# Fit on the training partition so the validation set remains held out
fitlda <- train(classe ~ ., data = training, method = "lda", metric = metric, trControl = control)
### Model 4: k-Nearest Neighbors

set.seed(234)
fitknn <- train(classe ~ ., data = training, method = "knn", metric = metric, trControl = control)
### Comparing the models on the validation set

pred1 <- predict(fittree, validation)
pred2 <- predict(fit2Rf, validation)
pred3 <- predict(fitlda, validation)
pred4 <- predict(fitknn, validation)
predDf <- data.frame(pred1, pred2, pred3, pred4, classe = validation$classe)
# Stack the four sets of predictions with a random forest meta-model; note that
# this combined model is trained and assessed on the same validation predictions,
# so its accuracy estimate is somewhat optimistic
CombMod <- train(classe ~ ., method = "rf", data = predDf)
pred5 <- predict(CombMod, predDf)
# Rows 1-5: tree, random forest, lda, knn, combined
rbind(postResample(pred1, obs = validation$classe),
      postResample(pred2, obs = validation$classe),
      postResample(pred3, obs = validation$classe),
      postResample(pred4, obs = validation$classe),
      postResample(pred5, obs = validation$classe))
## Accuracy Kappa
## [1,] 0.6088907 0.5004650
## [2,] 0.9946982 0.9932931
## [3,] 0.6929038 0.6110244
## [4,] 0.9757341 0.9692996
## [5,] 0.9951060 0.9938092
AccuTest <- confusionMatrix(pred2, validation$classe)
AccuTest
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 5 0 0 0
## B 1 942 4 0 0
## C 0 2 851 13 0
## D 0 0 0 791 1
## E 0 0 0 0 900
##
## Overall Statistics
##
## Accuracy : 0.9947
## 95% CI : (0.9922, 0.9965)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9933
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9926 0.9953 0.9838 0.9989
## Specificity 0.9986 0.9987 0.9963 0.9998 1.0000
## Pos Pred Value 0.9964 0.9947 0.9827 0.9987 1.0000
## Neg Pred Value 0.9997 0.9982 0.9990 0.9968 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2843 0.1921 0.1735 0.1613 0.1835
## Detection Prevalence 0.2853 0.1931 0.1766 0.1615 0.1835
## Balanced Accuracy 0.9989 0.9957 0.9958 0.9918 0.9994
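The estimated out-of-sample error follows directly from the validation accuracy; a one-line check of our own:

# Estimated out-of-sample error = 1 - validation accuracy
1 - AccuTest$overall["Accuracy"]  # approximately 0.0053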
From the comparison above, the Random Forest model gave the best result, with an accuracy of 99.47%, which corresponds to an estimated out-of-sample error of 0.53%; the combined model performs essentially the same, so stacking adds little here. For this project we shall use the Random Forest model to predict on the testing dataset.

## Predicting on the testing dataset
pred5T <- predict(fit2Rf, newdata = pmlTesting)
pred5T
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
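If the individual predictions need to be saved, for example one text file per test case, a small helper sketch can be used; the function name pml_write_files and the file-name pattern are our own choices:

pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    # One file per prediction: problem_id_1.txt, problem_id_2.txt, ...
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(pred5T))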