Practical Machine Learning Course Project

by Davin Kaing

Description

As part of Practical Machine Learning, a course in the Data Science Specialization, the purpose of this course project is to apply machine learning to the data from: http://groupware.les.inf.puc-rio.br/har. This data comes from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. A random forest model was built in R to predict the way in which each lift was performed.

Data Processing

The training and testing data were downloaded from the following URLs and read into the data frames Training and Testing.

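# Set the working directory (adjust this path for your own machine)
setwd("/Users/davinkaing")
# Create the data folder if it does not exist yet, so the downloads do not fail
if (!dir.exists("./data")) dir.create("./data")
TrainingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
TestingUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Download each file and read it into a data frame
download.file(TrainingUrl, destfile = "./data/Training.csv", method = "curl")
Training <- read.csv("./data/Training.csv")
download.file(TestingUrl, destfile = "./data/Testing.csv", method = "curl")
Testing <- read.csv("./data/Testing.csv")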
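Blank fields are only counted as missing if read.csv is told to treat them that way. The following is a minimal sketch of that idea on a tiny temporary file with made-up values, not the real data. The choice of na.strings, including the spreadsheet error string "#DIV/0!", is an assumption about what the raw file may contain, and the names tmp and toy are hypothetical.

```r
# Write a tiny CSV to a temporary file (hypothetical values, not the real data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("roll_belt,kurtosis_roll_belt,classe",
             "1.41,,A",
             "1.42,#DIV/0!,B",
             "1.48,0.25,A"), tmp)

# Treat literal NA, empty fields, and spreadsheet error strings as missing.
toy <- read.csv(tmp, na.strings = c("NA", "", "#DIV/0!"))

# Count the missing values in each column.
colSums(is.na(toy))
```

Read this way, a column of mostly blank entries shows up in an NA count in the same way as a column of explicit NA values.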

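Afterwards, the training and testing data were cleaned by removing columns that are mostly NA or blank. The timestamp and window columns were also dropped from the training data, and the outcome column was renamed to class.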

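# Keep only the columns in which fewer than half of the values are NA
Training <- Training[, colSums(is.na(Training))<(0.5*nrow(Training))]
Testing <- Testing[, colSums(is.na(Testing))<(0.5*nrow(Testing))]
# Keep the first 59 columns that survive in Testing (the 60th is not a predictor)
NewTraining <- Training[, names(Testing)[1:59]]
# Append the outcome, which is column 93 of the filtered training data
Training <- cbind(NewTraining, Training[,93])
# Drop columns 3 to 7 (timestamps and window markers)
Training <- Training[,-c(3:7)]
colnames(Training)[55] <- "class"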
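The cleaning step above selects columns by position (59, 93, 55), which only works for this exact file layout. As a hedged alternative, the sketch below performs the same kind of filtering and alignment by column name on two toy data frames. The data values and the names train, test, test_keep, predictors, and train_clean are hypothetical.

```r
# Toy stand-ins for the training and testing sets (hypothetical values).
train <- data.frame(
  roll_belt = c(1.4, 1.5, 1.6, 1.7),
  max_roll  = c(NA, NA, NA, 0.2),
  classe    = c("A", "B", "A", "C")
)
test <- data.frame(
  roll_belt  = c(1.3, 1.8),
  max_roll   = c(NA, NA),
  problem_id = 1:2
)

# Keep the test columns in which fewer than half of the values are NA.
test_keep <- test[, colSums(is.na(test)) < 0.5 * nrow(test), drop = FALSE]

# Predictors present in both sets (this drops problem_id, which train lacks).
predictors <- intersect(names(test_keep), names(train))

# Select the predictors and the outcome by name rather than by position.
train_clean <- train[, c(predictors, "classe"), drop = FALSE]
names(train_clean)
```

Selecting by name keeps the code valid if columns are added or reordered in the source file.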

In order to find the best predictors, a random forest was fit to a 30% sample of the training data and the variable importance of each predictor was determined. The 30 most important predictors were then kept for the final model.

library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
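# Draw a 30% sample of the training data, stratified by class
Partition <- createDataPartition(y = Training$class, p = 0.3, list = FALSE)
Part_Train <- Training[Partition,]
# Drop the first column (the row index) so it is not used as a predictor
Part_Train <- Part_Train[,-1]
# Fit a random forest on the sample and rank the predictors by importance
modanal <- randomForest(class~., data = Part_Train)
VarImp <- data.frame(rownames(varImp(modanal)),varImp(modanal))
colnames(VarImp) <- c("variables", "importance")
OrderedVarImp <- VarImp[order(VarImp$importance, decreasing = TRUE),]

# Keep the 30 most important predictors and re-attach the outcome
ProcessedTraining <- Training[, as.character(OrderedVarImp$variables[1:30])]
ProcessedTraining <- cbind(ProcessedTraining, Training$class)
colnames(ProcessedTraining)[31] <- "class"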
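The ranking and selection step can be illustrated in isolation. The following sketch uses made-up importance scores in place of the output of varImp(), so the numbers, the cutoff k, and the names imp, ordered, and top_vars are all hypothetical.

```r
# Hypothetical importance scores; in the report these come from varImp().
imp <- data.frame(
  variables  = c("roll_belt", "yaw_belt", "pitch_forearm", "gyros_arm_x"),
  importance = c(310.2, 198.7, 240.5, 12.3),
  stringsAsFactors = FALSE
)

# Sort from most to least important.
ordered <- imp[order(imp$importance, decreasing = TRUE), ]

# Keep the names of the k top-ranked predictors.
# head() never returns more rows than exist, so a large k is safe.
k <- 2
top_vars <- head(ordered$variables, k)
top_vars
```

Using head() instead of indexing with 1:k avoids NA entries when fewer than k predictors are available.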

Once the data is processed, it is split into a training subset (75%) and a testing subset (25%). A random forest model is fit to the training subset, and this model fit is then used to predict the held-out testing subset. The confusion matrix details the accuracy and error rate of the model.

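# Split the processed data: 75% for training, 25% held out for testing
inTrain <- createDataPartition(y = ProcessedTraining$class, p = 0.75, list = FALSE)
training <- ProcessedTraining[inTrain,]
testing <- ProcessedTraining[-inTrain,]
library(parallel, quietly=TRUE)
library(doParallel, quietly=TRUE)
# Run the model fit in parallel, leaving one core free
cluster<- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
modfit <- train(class~., method = "rf", data = training)
stopCluster(cluster)
# Predict the held-out subset and compare with the true classes
pred <- predict(modfit, testing)
# The true classes are passed first here, so the rows labelled "Prediction"
# in the output are the true classes and the "Reference" columns are the predictions
confusionMatrix(testing$class, pred)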
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    8  935    6    0    0
##          C    0   12  842    1    0
##          D    0    0   10  793    1
##          E    0    0    1    2  898
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9916         
##                  95% CI : (0.9887, 0.994)
##     No Information Rate : 0.2861         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9894         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9943   0.9873   0.9802   0.9962   0.9989
## Specificity            1.0000   0.9965   0.9968   0.9973   0.9993
## Pos Pred Value         1.0000   0.9852   0.9848   0.9863   0.9967
## Neg Pred Value         0.9977   0.9970   0.9958   0.9993   0.9998
## Prevalence             0.2861   0.1931   0.1752   0.1623   0.1833
## Detection Rate         0.2845   0.1907   0.1717   0.1617   0.1831
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9971   0.9919   0.9885   0.9968   0.9991
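The overall accuracy and the error rate on the held-out subset can be recomputed directly from the counts in the confusion matrix. The sketch below copies those counts into a base R matrix. The names classes, cm, accuracy, error_rate, and sensitivity are hypothetical and not part of the original analysis.

```r
# Counts copied from the confusion matrix reported above.
# Rows are the first argument given to confusionMatrix(), columns the second.
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(1395,   0,   0,   0,   0,
       8, 935,   6,   0,   0,
       0,  12, 842,   1,   0,
       0,   0,  10, 793,   1,
       0,   0,   1,   2, 898),
  nrow = 5, byrow = TRUE,
  dimnames = list(row = classes, column = classes)
)

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Error rate on the held-out subset.
error_rate <- 1 - accuracy

# Per-class sensitivity as reported: diagonal divided by the column totals.
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)
round(error_rate, 4)
round(sensitivity, 4)
```

Of the 4904 held-out cases, 4863 lie on the diagonal, which gives the reported accuracy of 0.9916 and an error rate of about 0.84% on this subset.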