Summary: The data come from the Human Activity Recognition (HAR) project. In this project I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The goal of this project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set; the remaining sensor measurements are used as predictors.
training_set <- "c:\\DIANA\\Coursera\\Practical Machine Learning\\pml-training.csv"
testing_set <- "c:\\DIANA\\Coursera\\Practical Machine Learning\\pml-testing.csv"
training <- read.csv(training_set, header=TRUE, sep = ',')
testing <- read.csv(testing_set, header=TRUE, sep = ',')
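The filtering loop below treats only literal NA values as missing. As an aside (a sketch, not used for the results in this report), read.csv's na.strings argument could map other missing-value markers to NA at load time, assuming the raw CSV also uses empty strings and "#DIV/0!" to mark missing values.
# Sketch (assumption: the CSV also marks missing values as "" and "#DIV/0!").
# training_alt is a hypothetical name; the analysis below keeps the original training object.
training_alt <- read.csv(training_set, header = TRUE, sep = ',',
                         na.strings = c("NA", "", "#DIV/0!"))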
I remove columns from the training and testing data sets where more than 90 percent of the values are NA. I also found that the first five columns (row index, user name, and timestamps) interfere with the prediction, so I remove those columns as well.
# Ignore columns of the training data where more than 90 percent of values are NA.
l <- dim(training)[2]
not_na_col1 <- c()
for (i in 1:l){
    na_num1 <- length(which(is.na(training[,i])))
    if (na_num1 < dim(training)[1]*0.9)
        not_na_col1 <- c(not_na_col1, i)
}
new_training <- subset(training, select = not_na_col1)
# Unnecessary columns
new_training <- subset(new_training, select = -c(X, user_name, raw_timestamp_part_1,
raw_timestamp_part_2, cvtd_timestamp))
# Ignore columns of the testing data where more than 90 percent of values are NA.
l <- dim(testing)[2]
not_na_col2 <- c()
for (i in 1:l){
    na_num2 <- length(which(is.na(testing[,i])))
    if (na_num2 < dim(testing)[1]*0.9)
        not_na_col2 <- c(not_na_col2, i)
}
new_testing <- subset(testing, select = not_na_col2)
# Unnecessary columns
new_testing <- subset(new_testing, select = -c(X, user_name, raw_timestamp_part_1,
raw_timestamp_part_2, cvtd_timestamp))
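The two loops above can be written more compactly with vectorized column means; this is just an equivalent sketch (the helper name drop_sparse_cols is mine, not part of the original analysis).
# Sketch: keep columns whose fraction of NA values is below the threshold.
drop_sparse_cols <- function(df, threshold = 0.9) {
    df[, colMeans(is.na(df)) < threshold]
}
# e.g. drop_sparse_cols(training) reproduces the NA filtering step for the training set.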
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(plyr)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.3.0 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
MY EXPECTATION: After fitting an appropriate model, I expect an accuracy rate above 0.85 on the held-out validation set.
inTrain <- createDataPartition(y = new_training$classe, p = 0.80, list = FALSE)
My_training <- new_training[inTrain,]
My_testing <- new_training[-inTrain,]
dim(My_training); dim(My_testing)
## [1] 15699 88
## [1] 3923 88
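createDataPartition samples within each level of classe, so both splits should keep roughly the same class proportions as the full data set. A quick check (sketch, not part of the original output):
# Class proportions in the training split (should be close to those of new_training).
round(prop.table(table(My_training$classe)), 3)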
There are still many uninformative variables in the data set, so I remove covariates with near zero variance.
# Removing zero covariates
nzv <- nearZeroVar(My_training)
TRAINING_ <- My_training[-nzv]
TESTING_ <- My_testing[-nzv]
dim(TRAINING_); dim(TESTING_)
## [1] 15699 54
## [1] 3923 54
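To see why these columns were dropped, nearZeroVar can also return its diagnostics instead of just the column indices; a short optional check (sketch; nzv_metrics is a hypothetical name):
# Frequency-ratio and percent-unique diagnostics for the flagged variables.
nzv_metrics <- nearZeroVar(My_training, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])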
I fit an rpart classification tree on the training set, draw a fancy plot of the tree, and then examine the accuracy rate on the validation set.
# Classification Tree with rpart
set.seed(221)
model1 <- train(classe ~., data = TRAINING_, method = "rpart")
fancyRpartPlot(model1$finalModel, sub = "Classification Tree")
predictions_1 <- predict(model1, newdata = TESTING_)
confusionMatrix(predictions_1, TESTING_$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1008 308 113 206 38
## B 20 234 22 85 113
## C 86 217 549 324 117
## D 0 0 0 0 0
## E 2 0 0 28 453
##
## Overall Statistics
##
## Accuracy : 0.572
## 95% CI : (0.5564, 0.5876)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4479
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9032 0.30830 0.8026 0.0000 0.6283
## Specificity 0.7631 0.92415 0.7703 1.0000 0.9906
## Pos Pred Value 0.6025 0.49367 0.4246 NaN 0.9379
## Neg Pred Value 0.9520 0.84778 0.9487 0.8361 0.9221
## Prevalence 0.2845 0.19347 0.1744 0.1639 0.1838
## Detection Rate 0.2569 0.05965 0.1399 0.0000 0.1155
## Detection Prevalence 0.4265 0.12083 0.3296 0.0000 0.1231
## Balanced Accuracy 0.8332 0.61622 0.7865 0.5000 0.8095
predictions1 <- predict(model1, new_testing)
predictions1
## [1] A A C A A C C C A A C C B A C B A A A B
## Levels: A B C D E
The above model does not meet my expectation. The accuracy rate (0.572) is far below my target, and the predictions on the 20 test cases are too monotonous (they contain only the “A”, “B”, and “C” classes).
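One way the tree's accuracy might improve (a sketch only, not used for the results in this report) is to let caret cross-validate the rpart complexity parameter over a larger grid instead of the default resampling:
# Sketch: 5-fold cross-validation over 10 candidate cp values.
ctrl <- trainControl(method = "cv", number = 5)
model1_cv <- train(classe ~ ., data = TRAINING_, method = "rpart",
                   trControl = ctrl, tuneLength = 10)
model1_cv$bestTune  # selected complexity parameter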
Next I fit a random forest on the training set and then examine the accuracy of this model on the validation set.
# Random forest
set.seed(222)
model2 <- randomForest(classe ~., data = TRAINING_)
predictions_2 <- predict(model2, newdata = TESTING_)
confusionMatrix(predictions_2, TESTING_$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 2 0 0 0
## B 0 757 3 0 0
## C 0 0 678 3 0
## D 0 0 3 640 1
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.9947, 0.9984)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9974 0.9912 0.9953 0.9986
## Specificity 0.9993 0.9991 0.9991 0.9988 1.0000
## Pos Pred Value 0.9982 0.9961 0.9956 0.9938 1.0000
## Neg Pred Value 1.0000 0.9994 0.9981 0.9991 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1930 0.1728 0.1631 0.1835
## Detection Prevalence 0.2850 0.1937 0.1736 0.1642 0.1835
## Balanced Accuracy 0.9996 0.9982 0.9952 0.9971 0.9993
predictions2 <- predict(model2, new_testing)
predictions2
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The accuracy rate meets my expectation (above 0.99), and the predictions for the 20 test cases look much more plausible than those of the previous model.
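For completeness, the expected out-of-sample error can be estimated from the hold-out (validation) confusion matrix above, roughly 1 - 0.9969 = 0.0031, and compared with the forest's internal out-of-bag estimate (a short sketch):
# Estimated out-of-sample error from the validation set.
1 - confusionMatrix(predictions_2, TESTING_$classe)$overall["Accuracy"]
# Printing the model shows the out-of-bag (OOB) error estimate, which should be in the same range.
model2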