Mahmoud Shaaban
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. Here we apply a random forest (RF) algorithm to predict the kind of activity the subject is doing based on the measurements from these devices. We first load the data, split it into training and testing sets, and remove the metadata, near-zero-variance, and mostly-NA variables. Then we apply the RF algorithm and validate the prediction results against the testing set.
Using the provided URLs, we download the datasets and read them into the R objects training and testing.
url1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if(!file.exists("pml-training.csv")) {
download.file(url1, "pml-training.csv", method = "curl")
}
training <- read.csv("pml-training.csv") # read training set
if(!file.exists("pml-testing.csv")){
download.file(url2, "pml-testing.csv", method = "curl")
}
testing <- read.csv("pml-testing.csv") # read testing set
We start by splitting the training set into train and test sets for cross-validating the prediction models.
# cross validation
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123)
indTrain <- createDataPartition(y=training$classe, p = 0.70, list=FALSE)
train <- training[indTrain,] ; test <- training[-indTrain,]
dim(train); dim(test)
## [1] 13737 160
## [1] 5885 160
Second, we remove variables that would result in poor prediction: the first seven variables of the data set (metadata), variables with near-zero variance, and variables with more than 70% missing values (NA).
train <- train[,-c(1:7)] # remove metadata
## removing near zero variables
nzv <- nearZeroVar(train, saveMetrics=TRUE)
train <- train[, nzv$nzv==FALSE]
# remove variables with missing values more than 70%
na <- colSums( is.na(train) )
naind <- na/nrow(train) > .7
train <- train[,!naind]
dim(train)
## [1] 13737 53
The number of variables is reduced to 53.
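As an optional sanity check (a sketch, not part of the original workflow; test_reduced is a hypothetical name), the same column selection could be applied to the held-out test split, although predict() will also work on the full data frame because it only looks up the columns used in training.
# hypothetical: subset the test split to the 53 columns kept in train
test_reduced <- test[, names(train)]
dim(test_reduced)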
Here we choose to apply a random forest (RF) algorithm for prediction, because in this case we have many predictor variables and we expect an RF model to handle them well.
## random forest
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(123)
mod <- randomForest(classe ~ ., data=train)
plot(mod, main = "Error Rates for the Random Forest Trees")
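Since the forest is built on many predictors, it can be useful to see which of them it relies on most. A minimal sketch using the variable-importance plot from the randomForest package (not part of the original analysis):
# plot the 20 most influential predictors by mean decrease in Gini
varImpPlot(mod, n.var = 20, main = "Top 20 Variables by Importance")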
Here we cross-validate the results by using the model we built (mod) to predict on the test set, and we show the confusion matrix.
pred <- predict(mod, test)
cm <- confusionMatrix(pred, test$classe)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 6 0 0 0
## B 1 1133 12 0 0
## C 0 0 1014 13 0
## D 0 0 0 950 0
## E 0 0 0 1 1082
##
## Overall Statistics
##
## Accuracy : 0.9944
## 95% CI : (0.9921, 0.9961)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9929
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9947 0.9883 0.9855 1.0000
## Specificity 0.9986 0.9973 0.9973 1.0000 0.9998
## Pos Pred Value 0.9964 0.9887 0.9873 1.0000 0.9991
## Neg Pred Value 0.9998 0.9987 0.9975 0.9972 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1925 0.1723 0.1614 0.1839
## Detection Prevalence 0.2853 0.1947 0.1745 0.1614 0.1840
## Balanced Accuracy 0.9990 0.9960 0.9928 0.9927 0.9999
round(cm$overall['Accuracy'],2) # accuracy
## Accuracy
## 0.99
1-round(cm$overall['Accuracy'],2) # out of sample error rate
## Accuracy
## 0.01
The overall accuracy of the RF model is 0.99 when applied to the test data set for cross-validation, and the out-of-sample error rate equals 0.01. The model appears very accurate, likely due to the abundance of variables we predict with.
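As a complementary check (a sketch, not in the original analysis), the forest's own out-of-bag (OOB) error estimate can be read from the fitted model; it should be close to the hold-out error computed above.
# OOB error rate after the final tree (column "OOB" of the error-rate matrix)
mod$err.rate[mod$ntree, "OOB"]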
Here we predict on the 20 samples in the testing dataset using the RF model.
prediction <- predict(mod, testing)
print(prediction)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E