Summary: The data come from the Human Activity Recognition (HAR) project. In this project I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The goal of this project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set; the remaining sensor measurements are used as predictors.

Load the data

training_set <- "c:\\DIANA\\Coursera\\Practical Machine Learning\\pml-training.csv"
testing_set <- "c:\\DIANA\\Coursera\\Practical Machine Learning\\pml-testing.csv"
training <- read.csv(training_set, header=TRUE, sep = ',')
testing <- read.csv(testing_set, header=TRUE, sep = ',')
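
If the raw CSV files also encode missing values as empty strings or “#DIV/0!” (an assumption about this particular export, not something verified here), a variant of the load step can declare those codes as NA so the filtering below also catches them:

# Hedged variant: treat "", "#DIV/0!" and "NA" as missing at load time.
training <- read.csv(training_set, header = TRUE, sep = ",",
                     na.strings = c("NA", "", "#DIV/0!"))
testing <- read.csv(testing_set, header = TRUE, sep = ",",
                    na.strings = c("NA", "", "#DIV/0!"))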

Cleaning the data: removing NA-heavy and unnecessary columns

I remove columns from the training and testing data sets in which more than 90 percent of the values are NA. During my exploration I also found that the first five columns (row index, user name, and timestamps) interfere with the prediction, so I remove them as well.

# Ignore columns of the training data where more than 90 percent of values are NA.
l <- dim(training)[2]
not_na_col1 <- c()
for (i in 1:l){
  na_num1 <- length(which(is.na(training[,i])))
  if (na_num1 < dim(training)[1]*0.9)
    not_na_col1 <- c(not_na_col1, i)
}
new_training <- subset(training, select = not_na_col1)
# Remove the row index, user name and timestamp columns (not useful for prediction).
new_training <- subset(new_training, select = -c(X, user_name, raw_timestamp_part_1, 
                                                 raw_timestamp_part_2, cvtd_timestamp))


# Ignore columns of the testing data where more than 90 percent of values are NA.
l <- dim(testing)[2]
not_na_col2 <- c()
for (i in 1:l){
  na_num2 <- length(which(is.na(testing[,i])))
  if (na_num2 < dim(testing)[1]*0.9)
    not_na_col2 <- c(not_na_col2, i)
}
new_testing <- subset(testing, select = not_na_col2)
# Remove the row index, user name and timestamp columns (not useful for prediction).
new_testing <- subset(new_testing, select = -c(X, user_name, raw_timestamp_part_1,
                                               raw_timestamp_part_2, cvtd_timestamp))
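
For reference, the same cleaning step can be written more compactly by computing the NA fraction of each column once and applying the resulting selection to both data sets, which keeps their columns aligned. This is a sketch of an equivalent approach, not the code used to produce the results below:

# Keep training columns whose NA fraction is below 90 percent, then drop the
# row index, user name and timestamp columns.
keep_cols <- names(training)[colMeans(is.na(training)) < 0.9]
keep_cols <- setdiff(keep_cols, c("X", "user_name", "raw_timestamp_part_1",
                                  "raw_timestamp_part_2", "cvtd_timestamp"))
new_training <- training[, keep_cols]
# The testing file's outcome column differs, so keep only the shared columns.
new_testing <- testing[, intersect(keep_cols, names(testing))]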

Load necessary packages

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(plyr)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.3.0 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)

Fitting models

MY EXPECTATION: After fitting a suitable model, I expect an accuracy rate above 0.85 on the held-out validation set.

inTrain <- createDataPartition(y = new_training$classe, p = 0.80, list = FALSE)
My_training <- new_training[inTrain,]
My_testing <- new_training[-inTrain,]
dim(My_training); dim(My_testing)
## [1] 15699    88
## [1] 3923   88
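
As a quick sanity check (not part of the original output), one can verify that createDataPartition preserved the class proportions in both subsets:

# Class proportions should be nearly identical in My_training and My_testing.
round(prop.table(table(My_training$classe)), 3)
round(prop.table(table(My_testing$classe)), 3)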

Because I realized there are still many uninformative variables in my data set, I decided to remove predictors with near-zero variance.

# Removing near-zero variance predictors
nzv <- nearZeroVar(My_training)
TRAINING_ <- My_training[, -nzv]
TESTING_ <- My_testing[, -nzv]
dim(TRAINING_); dim(TESTING_)
## [1] 15699    54
## [1] 3923   54
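
To see why particular columns are flagged, nearZeroVar can also return its per-column diagnostics; this is an optional inspection step, not part of the original analysis:

# saveMetrics = TRUE returns freqRatio, percentUnique, zeroVar and nzv per column.
nzv_metrics <- nearZeroVar(My_training, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])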

Classification tree and Cross validation

I fit an rpart classification tree to my training set, draw a fancy plot of the tree, and then examine the accuracy on the validation set.

# Classification Tree with rpart
set.seed(221)
model1 <- train(classe ~., data = TRAINING_, method = "rpart")

fancyRpartPlot(model1$finalModel, sub = "Classification Tree")

predictions_1 <- predict(model1, newdata = TESTING_)
confusionMatrix(predictions_1, TESTING_$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1008  308  113  206   38
##          B   20  234   22   85  113
##          C   86  217  549  324  117
##          D    0    0    0    0    0
##          E    2    0    0   28  453
## 
## Overall Statistics
##                                           
##                Accuracy : 0.572           
##                  95% CI : (0.5564, 0.5876)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4479          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9032  0.30830   0.8026   0.0000   0.6283
## Specificity            0.7631  0.92415   0.7703   1.0000   0.9906
## Pos Pred Value         0.6025  0.49367   0.4246      NaN   0.9379
## Neg Pred Value         0.9520  0.84778   0.9487   0.8361   0.9221
## Prevalence             0.2845  0.19347   0.1744   0.1639   0.1838
## Detection Rate         0.2569  0.05965   0.1399   0.0000   0.1155
## Detection Prevalence   0.4265  0.12083   0.3296   0.0000   0.1231
## Balanced Accuracy      0.8332  0.61622   0.7865   0.5000   0.8095
predictions1 <- predict(model1, new_testing)
predictions1
##  [1] A A C A A C C C A A C C B A C B A A A B
## Levels: A B C D E

I found that the above model does not meet my expectation. The accuracy rate (about 0.57) is far below my target, and the predictions on the testing set are limited to classes A, B and C; classes D and E never appear.
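
Since the section title mentions cross-validation, a natural extension (sketched here; it was not run for the results above, and the fold count is my own choice) is to let caret perform explicit k-fold cross-validation when tuning the tree:

# 5-fold cross-validation for the rpart model via caret's trainControl.
set.seed(221)
ctrl <- trainControl(method = "cv", number = 5)
model1_cv <- train(classe ~ ., data = TRAINING_, method = "rpart", trControl = ctrl)
model1_cv$results   # resampled accuracy for each candidate complexity parameter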

Random forest and Cross validation

I fit a random forest to my training set and then examine the accuracy of this model on the validation set.

# Random forest
set.seed(222)
model2 <- randomForest(classe ~., data = TRAINING_)

predictions_2 <- predict(model2, newdata = TESTING_)
confusionMatrix(predictions_2, TESTING_$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    2    0    0    0
##          B    0  757    3    0    0
##          C    0    0  678    3    0
##          D    0    0    3  640    1
##          E    0    0    0    0  720
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9969          
##                  95% CI : (0.9947, 0.9984)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9961          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9974   0.9912   0.9953   0.9986
## Specificity            0.9993   0.9991   0.9991   0.9988   1.0000
## Pos Pred Value         0.9982   0.9961   0.9956   0.9938   1.0000
## Neg Pred Value         1.0000   0.9994   0.9981   0.9991   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1930   0.1728   0.1631   0.1835
## Detection Prevalence   0.2850   0.1937   0.1736   0.1642   0.1835
## Balanced Accuracy      0.9996   0.9982   0.9952   0.9971   0.9993
predictions2 <- predict(model2, new_testing)
predictions2
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
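
It can also be informative to see which sensor measurements drive the random forest; varImpPlot from the randomForest package shows this directly (an optional step, not part of the original output):

# Plot the ten most important predictors, ranked by mean decrease in Gini.
varImpPlot(model2, n.var = 10, main = "Random forest: top 10 predictors")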

I found that the accuracy rate meets my expectation (above 0.99) and the predictions on the testing set look much more plausible than those of the previous model. The estimated out-of-sample error from the validation set is about 0.3 percent (1 minus the validation accuracy).
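
For completeness, the expected out-of-sample error can be read directly off the validation results; a minimal sketch:

# Estimated out-of-sample error = 1 - validation accuracy (roughly 0.003 here).
1 - confusionMatrix(predictions_2, TESTING_$classe)$overall["Accuracy"]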