This detailed analysis has been performed to fulfill the requirements of the course project for the course Practical Machine Learning offered by the Johns Hopkins University on Coursera. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.
Read more about the data set on the Groupware@LES website.
The main objectives of this project are to predict the classe variable, which encodes how the exercise was performed, and to apply the resulting model to the 20 different test cases provided.
This section consists of the steps followed to set up our R environment, get the required data, and clean and process it.
In the following code segment, we set the required global options and load the required packages in R.
library(knitr)
opts_chunk$set(cache=TRUE,echo=TRUE)
options(width=120)
library(caret)
library(randomForest)
library(pander)
Here we set the working directory. Change the path below to match the location of the project on your own computer.
setwd("E:/MOOCs/Coursera/Data Science - Specialization/8. Practical Machine Learning/Course Project")
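The training data is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv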
The data can be downloaded using the following code segment.
# Download URL to destFile unless the file already exists locally
downloadDataset <- function(URL = "", destFile = "data.csv") {
  if (!file.exists(destFile)) {
    download.file(URL, destFile, method = "curl")
  } else {
    message("Dataset already downloaded.")
  }
}
trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
downloadDataset(trainURL, "pml-training.csv")
## Dataset already downloaded.
downloadDataset(testURL, "pml-testing.csv")
## Dataset already downloaded.
Next, we read both CSV files with read.csv, treating "NA" and empty strings as missing values. We then check that the data loaded correctly by viewing its dimensions and a few rows and columns of the data frame.
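# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the default is FALSE
training <- read.csv("pml-training.csv", na.strings = c("NA", ""), stringsAsFactors = TRUE)
testing <- read.csv("pml-testing.csv", na.strings = c("NA", ""), stringsAsFactors = TRUE)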
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
training[1:5,c('user_name','classe','num_window','roll_belt','pitch_belt')]
## user_name classe num_window roll_belt pitch_belt
## 1 carlitos A 11 1.41 8.07
## 2 carlitos A 11 1.41 8.07
## 3 carlitos A 11 1.42 8.07
## 4 carlitos A 12 1.48 8.05
## 5 carlitos A 12 1.48 8.07
First, we check how many columns in the training and testing data contain NA values, and how many NA values they contain.
sum(is.na(training)) # Total NA values
## [1] 1921600
t1 <- table(colSums(is.na(training)))
t2 <- table(colSums(is.na(testing)))
pandoc.table(t1, style = "grid", justify = 'left', caption = 'Training data column NA frequencies')
+-----+---------+
| 0 | 19216 |
+=====+=========+
| 60 | 100 |
+-----+---------+
Table: Training data column NA frequencies
pandoc.table(t2, style = "grid", justify = 'left', caption = 'Testing data column NA frequencies')
+-----+------+
| 0 | 20 |
+=====+======+
| 60 | 100 |
+-----+------+
Table: Testing data column NA frequencies
The tables above show that 60 variables have no NA values, while the remaining 100 are NA in almost every row (19216 of the 19622 training rows, and all 20 testing rows). We therefore drop those 100 columns using the following code segment.
# for training dataset
columnNACounts <- colSums(is.na(training)) # getting NA counts for all columns
badColumns <- columnNACounts >= 19000 # ignoring columns with majority NA values
cleanTrainingdata <- training[!badColumns] # getting clean data
sum(is.na(cleanTrainingdata)) # checking for NA values
## [1] 0
cleanTrainingdata <- cleanTrainingdata[, c(7:60)] # removing unnecessary columns
# for testing dataset
columnNACounts <- colSums(is.na(testing)) # getting NA counts for all columns
badColumns <- columnNACounts >= 20 # ignoring columns with majority NA values
cleanTestingdata <- testing[!badColumns] # getting clean data
sum(is.na(cleanTestingdata)) # checking for NA values
## [1] 0
cleanTestingdata <- cleanTestingdata[, c(7:60)] # removing unnecessary columns
Now that no NA values remain, we are ready to do some exploratory analysis and build our prediction model.
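The cleaning step above computes the columns to drop separately for the training and testing data. As a minimal sketch of an alternative, assuming we prefer to derive the column list once from the training data and reuse it, the following uses a small made-up data frame. The names toy_train and toy_test, the values in them, and the 0.5 cutoff are all hypothetical and only serve the illustration.

```r
# Toy data standing in for the real training/testing frames (hypothetical values).
toy_train <- data.frame(
  roll_belt  = c(1.41, 1.42, 1.48, 1.45),
  max_roll   = c(NA, NA, NA, 2.1),    # mostly NA: should be dropped
  pitch_belt = c(8.07, 8.07, 8.05, 8.06),
  classe     = factor(c("A", "A", "B", "B"))
)
toy_test <- data.frame(
  roll_belt  = c(1.50, 1.51),
  max_roll   = c(NA, NA),
  pitch_belt = c(8.10, 8.11),
  problem_id = 1:2
)

# Fraction of NA values per column, computed on the training data only.
na_fraction <- colMeans(is.na(toy_train))

# Keep columns that are less than half NA (the 0.5 cutoff is an assumption).
keep <- names(na_fraction)[na_fraction < 0.5]

clean_train <- toy_train[, keep]
# Reuse the same column names for the test set, which has no classe column.
clean_test <- toy_test[, intersect(keep, names(toy_test))]

print(names(clean_train))
print(names(clean_test))
```

Selecting by name rather than by position makes it harder for the two data frames to drift apart if their column order ever differs.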
We look at some summary statistics and a frequency plot for the classe variable.
s <- summary(cleanTrainingdata$classe)
pandoc.table(s, style = "grid", justify = 'left', caption = '`classe` frequencies')
+------+------+------+------+------+
| A | B | C | D | E |
+======+======+======+======+======+
| 5580 | 3797 | 3422 | 3216 | 3607 |
+------+------+------+------+------+
Table: `classe` frequencies
plot(cleanTrainingdata$classe,col=rainbow(5),main = "`classe` frequency plot")
In this section, we will build a machine learning model for predicting the classe value based on the other features of the dataset.
First, we partition the cleanTrainingdata dataset into a training set (60% of the rows) and a held-out test set (the remaining 40%) for building and evaluating our model, using the following code segment.
partition <- createDataPartition(y = cleanTrainingdata$classe, p = 0.6, list = FALSE)
trainingdata <- cleanTrainingdata[partition, ]
testdata <- cleanTrainingdata[-partition, ]
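The partition above is random, so the exact rows in each set change from run to run unless a seed is fixed first. As a rough base R sketch of the idea behind a class-stratified 60/40 split, assuming a made-up outcome vector (the seed, the class sizes, and the variable names are hypothetical, and this is not caret's own implementation):

```r
set.seed(1234)  # arbitrary seed, chosen only to make this illustration repeatable

# Hypothetical outcome vector with unbalanced classes.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 60% of the row indices within each class, so class proportions are kept.
train_idx <- unlist(lapply(split(seq_along(classe), classe), function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}))

train_classes <- classe[train_idx]
test_classes  <- classe[-train_idx]

print(table(train_classes))
print(table(test_classes))
```

Calling set.seed before createDataPartition in the same way should make the partition, and therefore the reported accuracies, reproducible.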
Now, using the features in the trainingdata dataset, we will build our model using the Random Forest machine learning technique.
#trainInds <- sample(nrow(cleanTrainingdata), 3000)
#trainingdata<-cleanTrainingdata[trainInds,]
model <- train(classe ~ ., data = trainingdata, method = "rf", prox = TRUE,
trControl = trainControl(method = "cv", number = 4, allowParallel = TRUE))
model
## Random Forest
##
## 11776 samples
## 53 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
##
## Summary of sample sizes: 8831, 8834, 8832, 8831
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 1 0.002 0.002
## 30 1 1 0.001 0.002
## 50 1 1 0.004 0.006
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
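We build the model using 4-fold cross validation, and caret selects the mtry value with the highest cross-validated accuracy. The mtry column in the printed table appears to be rounded to one significant digit, which would explain why the selected value of 27 is displayed as 30.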
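To make the resampling scheme concrete, here is a minimal base R sketch of how 4-fold cross validation divides the 11776 training rows. It assigns folds purely at random, which is a simplification: the slightly uneven sample sizes in the output above (8831, 8834, 8832, 8831) suggest that caret builds its folds differently, presumably by stratifying on the outcome. The seed and variable names are hypothetical.

```r
set.seed(42)  # arbitrary seed for a repeatable illustration

n <- 11776  # number of training rows reported in the model summary
k <- 4      # number of folds

# Assign every row to one of k folds, as evenly as possible, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, the model is fit on the other k - 1 folds and scored on this one.
fit_sizes     <- sapply(seq_len(k), function(f) sum(fold_id != f))
holdout_sizes <- sapply(seq_len(k), function(f) sum(fold_id == f))

print(fit_sizes)
print(holdout_sizes)
```

Every row is held out exactly once, so the averaged accuracy is always measured on rows the corresponding fit did not see.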
Here, we calculate the in-sample accuracy, which is the prediction accuracy of our model on the training data set.
training_pred <- predict(model, trainingdata)
confusionMatrix(training_pred, trainingdata$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3348 0 0 0 0
## B 0 2279 0 0 0
## C 0 0 2054 0 0
## D 0 0 0 1930 0
## E 0 0 0 0 2165
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 1.000 1.000 1.000 1.000
## Specificity 1.000 1.000 1.000 1.000 1.000
## Pos Pred Value 1.000 1.000 1.000 1.000 1.000
## Neg Pred Value 1.000 1.000 1.000 1.000 1.000
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.194 0.174 0.164 0.184
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 1.000 1.000 1.000 1.000 1.000
Thus, from the above statistics, we see that the in-sample accuracy is 1, that is, 100%. This figure is optimistic, because the model is evaluated on the same rows it was fitted to.
Here, we calculate the out-of-sample accuracy, which is the prediction accuracy of our model on the held-out testdata partition that was not used for fitting.
testing_pred <- predict(model, testdata)
confusionMatrix(testing_pred, testdata$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2228 0 0 0 0
## B 3 1515 1 0 0
## C 0 3 1364 1 0
## D 0 0 3 1284 1
## E 1 0 0 1 1441
##
## Overall Statistics
##
## Accuracy : 0.998
## 95% CI : (0.997, 0.999)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.998
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.998 0.998 0.997 0.998 0.999
## Specificity 1.000 0.999 0.999 0.999 1.000
## Pos Pred Value 1.000 0.997 0.997 0.997 0.999
## Neg Pred Value 0.999 1.000 0.999 1.000 1.000
## Prevalence 0.284 0.193 0.174 0.164 0.184
## Detection Rate 0.284 0.193 0.174 0.164 0.184
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 0.999 0.999 0.998 0.999 0.999
Thus, from the above statistics, we see that the out-of-sample accuracy is 0.998, that is, 99.8%, which corresponds to an estimated out-of-sample error of about 0.2%.
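The headline figures can be checked by hand from the held-out confusion matrix. The following sketch re-enters the counts from the output above and recomputes the overall accuracy, the out-of-sample error, and the per-class sensitivity in base R. The variable names are our own and are used only for this illustration.

```r
# Held-out confusion matrix copied from the output above
# (rows are predictions, columns are reference classes).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(2228,    0,    0,    0,    0,
       3, 1515,    1,    0,    0,
       0,    3, 1364,    1,    0,
       0,    0,    3, 1284,    1,
       1,    0,    0,    1, 1441),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions over all cases
oos_error   <- 1 - accuracy              # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)    # per-class share of true cases recovered

print(round(accuracy, 3))
print(round(oos_error, 3))
print(round(sensitivity, 3))
```

Of the 7846 held-out rows, 7832 lie on the diagonal, which gives the 0.998 accuracy reported by confusionMatrix.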
Here, we apply the model we built above to each of the 20 test cases in the testing data set provided.
answers <- predict(model, cleanTestingdata)
answers <- as.character(answers)
answers
## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A" "B" "B" "B"
Finally, we write the answers to files as specified by the course instructor using the following code segment.
# Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE,
                col.names = FALSE)
  }
}
pml_write_files(answers)
On submission, all the predicted values turn out to be correct, giving a score of 20/20.
We chose Random Forest as our machine learning algorithm because, combined with k-fold cross validation, it builds a robust model. We also obtained very good accuracy, based on the statistics reported above.