This report presents the results of the course project for the Practical Machine Learning course, part of the Johns Hopkins Data Science Specialization on Coursera.
Devices such as Jawbone Up, Nike FuelBand, and Fitbit make it easy to collect a large amount of personal activity data. People use these devices to quantify how much of a particular activity they do, but they rarely quantify how well they do it. In the context of [1], six participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. This report shows how the data were used to predict the manner in which they did the exercise (the classe variable in the training set).
The data for this experiment come from the Human Activity Recognition project [1].
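According to [1], participants performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).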
The training and test sets are loaded into the pmlTraining and pmlTesting data frames, respectively. In this step, extra spaces are stripped and problem values (“NA”, “”, “NULL” and “#DIV/0!”) are read as NA.
# Loading the data, stripping extra spaces and treating "NA", "", "NULL" and "#DIV/0!" as NA.
# stringsAsFactors=TRUE keeps classe a factor on R >= 4.0.0, where the default changed to FALSE.
pmlTraining <- read.csv("pml-training.csv", na.strings=c("NA","","NULL","#DIV/0!"), strip.white=TRUE, stringsAsFactors=TRUE)
pmlTesting <- read.csv("pml-testing.csv", na.strings=c("NA","","NULL","#DIV/0!"), strip.white=TRUE, stringsAsFactors=TRUE)
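To illustrate what the na.strings and strip.white arguments do, here is a minimal sketch that reads a small inline CSV with the same settings. The three rows and their values are made up for illustration and are not taken from the project files.

```r
# Toy CSV text standing in for a few rows of the real file (values are made up).
csvText <- 'roll_belt,kurtosis_roll_belt,classe
1.41,#DIV/0!,A
 1.42 ,,A
NA,5.58,B'

toy <- read.csv(text = csvText,
                na.strings = c("NA", "", "NULL", "#DIV/0!"),
                strip.white = TRUE)

# Every problem token is read as NA, so both measurement columns stay numeric.
str(toy)
colSums(is.na(toy))
```

Without the na.strings setting, the “#DIV/0!” entry would force the whole column to be read as text.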
The data are first explored to decide how to clean them:
a) The dimension of the data (rows and columns):
# Showing the dimension of the Training data and Testing data.
dim(pmlTraining);dim(pmlTesting)
## [1] 19622 160
## [1] 20 160
b) The column names:
# Checking whether the column names are the same in both datasets (testing and training)
names(pmlTraining)[names(pmlTesting) != names(pmlTraining)]; names(pmlTesting)[names(pmlTesting) != names(pmlTraining)]
## [1] "classe"
## [1] "problem_id"
Both datasets have the same variable names, except for the last column: the outcome classe in the training dataset and problem_id in the testing dataset. The problem_id variable identifies the 20 test cases for the submission of the prediction results.
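The element-wise comparison with != used here relies on both data frames having the same number of columns in the same order. As a hedged alternative, the sketch below uses setdiff(), which does not depend on column order. The two miniature data frames are hypothetical stand-ins for the real files.

```r
# Hypothetical miniature data frames mimicking the layout of the two files.
trainToy <- data.frame(roll_belt = 1.4, pitch_belt = 8.1, classe = "A")
testToy  <- data.frame(roll_belt = 1.2, pitch_belt = 7.9, problem_id = 1)

# Columns present in one data set but not in the other, regardless of order.
onlyInTrain <- setdiff(names(trainToy), names(testToy))
onlyInTest  <- setdiff(names(testToy), names(trainToy))
onlyInTrain
onlyInTest
```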
c) NA values:
# Showing the number of NA values in the training and testing datasets
sum(is.na(pmlTraining));sum(is.na(pmlTesting));
## [1] 1925102
## [1] 2000
# Showing how many columns have each number of NA values
naValuesTraining = sapply(pmlTraining, function(x) {sum(is.na(x))})
table(naValuesTraining)
## naValuesTraining
## 0 19216 19217 19218 19220 19221 19225 19226 19227 19248 19293 19294
## 60 67 1 1 1 4 1 4 2 2 1 1
## 19296 19299 19300 19301 19622
## 2 1 4 2 6
naValuesTesting = sapply(pmlTesting, function(x) {sum(is.na(x))})
table(naValuesTesting)
## naValuesTesting
## 0 20
## 60 100
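Both datasets contain a large number of missing values (“NA”): only 60 of the 160 columns are complete. In the training set, every other column has at least 19216 of its 19622 values missing (about 97.9%), and in the testing set the other 100 columns are entirely NA.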
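The counts in the tables above are easier to interpret as fractions. The following sketch, on a made-up two-column data frame, shows one way to compute the share of missing values per column, and repeats the arithmetic for the least sparse of the incomplete training columns.

```r
# Toy data frame (made-up values): one complete column, one mostly-NA column.
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.45, 1.48),
  max_roll_belt = c(NA, NA, NA, -94.3)
)

# colMeans() on the logical NA mask gives the fraction missing per column.
naFraction <- colMeans(is.na(toy))
naFraction

# The same ratio for the smallest non-zero NA count in the training table.
round(19216 / 19622, 3)
```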
The data analysis shows it is necessary to perform two cleaning operations:
a) Remove the first seven variables, which are not related to the movement data: “X”, “user_name”, “raw_timestamp_part_1”, “raw_timestamp_part_2”, “cvtd_timestamp”, “new_window” and “num_window”.
# Cleaning the data to reduce the number of predictors
# Removing the first seven variables, which are not related to the movement data:
# (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window)
newPmlTraining<-pmlTraining[, -c(1:7)]
newPmlTesting<-pmlTesting[, -c(1:7)]
b) Remove the columns that consist almost entirely of NA values:
# Removing every column that contains NA values in the training data (all of them are almost entirely NA)
newPmlTraining = newPmlTraining[, !names(newPmlTraining) %in% names(naValuesTraining[naValuesTraining>0])]
newPmlTesting = newPmlTesting[, !names(newPmlTesting) %in% names(naValuesTraining[naValuesTraining>0])]
These two steps reduce the number of columns from 160 to 53:
# Showing the new dimension of the new Training data and new Testing data.
dim(newPmlTraining);dim(newPmlTesting)
## [1] 19622 53
## [1] 20 53
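The column count follows from 160 - 7 - 100 = 53. The filter above drops any column with at least one NA; a slightly more general variant, sketched below on made-up data, drops columns whose NA fraction exceeds a cut-off. The cut-off value and the toy data frames are assumptions for illustration only. As in the report, the decision is made on the training data and then applied to both sets.

```r
# Toy training and test frames (made-up values) sharing the same predictors.
trainToy <- data.frame(a = c(1, 2, 3, 4),
                       b = c(NA, NA, NA, 9),
                       classe = c("A", "A", "B", "B"))
testToy  <- data.frame(a = c(5, 6),
                       b = c(NA, NA),
                       problem_id = 1:2)

# Hypothetical cut-off: drop columns with more than half of their values missing.
naThreshold <- 0.5
dropCols <- names(which(colMeans(is.na(trainToy)) > naThreshold))

# Decide on the training set only, then apply the same decision to both sets.
trainClean <- trainToy[, !names(trainToy) %in% dropCols, drop = FALSE]
testClean  <- testToy[,  !names(testToy)  %in% dropCols, drop = FALSE]
names(trainClean)
names(testClean)
```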
In this step, the cleaned training data is split into a training set and a hold-out validation set in a 70:30 ratio, so that the model can be trained on one part and evaluated on the other.
# Splitting the new training dataset for a later hold-out validation (70% training and 30% validation).
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(3755)
temp <- createDataPartition(y = newPmlTraining$classe, p = 0.7, list = FALSE)
dataTrainingPml <- newPmlTraining[temp, ]
validTestingPml <- newPmlTraining[-temp, ]
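createDataPartition samples within each level of the outcome, so both parts keep roughly the same class mix. The base-R sketch below illustrates that idea on a made-up outcome vector. It is not caret's implementation, and the class counts are hypothetical.

```r
set.seed(3755)

# Toy outcome vector with unbalanced classes (made-up counts).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then pool them.
idxByClass <- split(seq_along(classe), classe)
trainIdx <- unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

trainSet <- classe[trainIdx]
validSet <- classe[-trainIdx]

# Class proportions are the same in both parts.
prop.table(table(trainSet))
prop.table(table(validSet))
```

A plain random 70:30 split would give similar proportions on average, but stratifying removes the chance of a rare class being under-represented in one part.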
The initial plan was to use the random forest method [2] and examine the OOB (out-of-bag) estimate of the error rate. An error rate below 1% was considered satisfactory; otherwise, different methods would be tried.
# Building the Model
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
model1 = randomForest(classe ~., data=dataTrainingPml)
model1
##
## Call:
## randomForest(formula = classe ~ ., data = dataTrainingPml)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.52%
## Confusion matrix:
## A B C D E class.error
## A 3903 2 0 1 0 0.000768
## B 10 2639 9 0 0 0.007148
## C 0 12 2379 5 0 0.007095
## D 0 0 21 2229 2 0.010213
## E 0 0 1 9 2515 0.003960
The OOB estimate of the error rate was 0.52%, which is below the 1% threshold and therefore satisfactory.
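The reported figures can be reproduced by hand from the confusion matrix in the printout. The sketch below copies those counts into a matrix and recomputes the overall OOB error and the per-class errors. The variable names are chosen here for illustration.

```r
# OOB confusion matrix copied from the randomForest printout above
# (rows = true class, columns = OOB prediction).
oobConf <- matrix(c(3903,    2,    0,    1,    0,
                      10, 2639,    9,    0,    0,
                       0,   12, 2379,    5,    0,
                       0,    0,   21, 2229,    2,
                       0,    0,    1,    9, 2515),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(actual = LETTERS[1:5],
                                  predicted = LETTERS[1:5]))

# Overall OOB error: share of off-diagonal (misclassified) observations.
oobError <- 1 - sum(diag(oobConf)) / sum(oobConf)
round(100 * oobError, 2)   # in percent

# Per-class error: misclassified share within each row.
classError <- 1 - diag(oobConf) / rowSums(oobConf)
round(classError, 6)
```

The matrix sums to 13737 observations, of which 72 are off the diagonal, giving 72 / 13737, or about 0.52%.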
Next, the model is plotted to show how the overall and per-class error rates change with the number of trees.
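# Left panel: error curves; right panel: legend.
layout(matrix(c(1,2),nrow=1),
widths=c(4,1))
par(mar=c(5,4,4,0))
plot(model1, log="y")
par(mar=c(5,0,4,2))
plot(c(0,1),type="n", axes=FALSE, xlab="", ylab="")
# err.rate has six columns (OOB plus classes A to E), so the legend needs six colours.
nCurves <- ncol(model1$err.rate)
legend("top", colnames(model1$err.rate),col=1:nCurves,cex=0.8,fill=1:nCurves)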
In this step, the model is used to classify the hold-out validation set (30% of the training data). A confusion matrix is used to assess the model’s accuracy.
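# Validating on the hold-out set
validPredict <- predict(model1, validTestingPml)
# Note: confusionMatrix() expects (predictions, reference). The arguments here are swapped,
# so the rows labelled "Prediction" hold the true classes. Accuracy and kappa are unaffected.
confusionMatrix(validTestingPml$classe, validPredict)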
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 5 1131 3 0 0
## C 0 2 1024 0 0
## D 0 0 8 955 1
## E 0 0 5 7 1070
##
## Overall Statistics
##
## Accuracy : 0.995
## 95% CI : (0.993, 0.996)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.993
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.997 0.998 0.985 0.993 0.999
## Specificity 1.000 0.998 1.000 0.998 0.998
## Pos Pred Value 1.000 0.993 0.998 0.991 0.989
## Neg Pred Value 0.999 1.000 0.997 0.999 1.000
## Prevalence 0.285 0.193 0.177 0.163 0.182
## Detection Rate 0.284 0.192 0.174 0.162 0.182
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 0.999 0.998 0.992 0.995 0.998
The accuracy obtained was 99.5% (95% CI: 99.3% to 99.6%), which corresponds to an estimated out-of-sample error of about 0.5% and indicates that the model generalizes well to data it was not trained on.
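As a check on the summary statistics, the accuracy and kappa values can be recomputed from the confusion matrix counts alone. The sketch below does this with base R. The variable names are illustrative, and kappa is computed with the standard Cohen's kappa formula.

```r
# Validation confusion matrix copied from the confusionMatrix() output above.
validConf <- matrix(c(1674,    0,    0,   0,    0,
                         5, 1131,    3,   0,    0,
                         0,    2, 1024,   0,    0,
                         0,    0,    8, 955,    1,
                         0,    0,    5,   7, 1070),
                    nrow = 5, byrow = TRUE)

n <- sum(validConf)

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(validConf)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(validConf) * colSums(validConf)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled to at most 1.
kappaStat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, error = 1 - accuracy, kappa = kappaStat), 3)
```

Of the 5885 validation observations, 5854 lie on the diagonal, so the estimated out-of-sample error is 31 / 5885, or about 0.5%.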
Finally, the model is applied to the cleaned original test data to predict the 20 test cases, as required by the project:
# predict
predictTest <- predict(model1, newPmlTesting)
predictTest
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The results show that the random forest method performed very well on this problem: both the OOB error estimate (0.52%) and the validation error (about 0.5%) were below the 1% target. As a result, further exploration of alternative models was not necessary.
[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. [ONLINE] Available at http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf and http://groupware.les.inf.puc-rio.br/har#ixzz3RxdqVyEe [Accessed 17 February 2015].
[2] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.