The Weight Lifting Exercises dataset from the Human Activity Recognition project contains sensor readings from six healthy subjects who performed dumbbell biceps curls in five different ways, labelled A to E in the “classe” variable (A is the correct execution; B to E are common mistakes). The goal of this project is to predict the manner in which they did the exercise. The training set will be used to build machine learning models.
Load all required packages.
library(caret)
library(randomForest)
library(dplyr)
library(Amelia)
library(doSNOW)
Load up the .CSV data
training <- read.csv(file = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                     header = TRUE, na.strings = c("", "NA"))
testing <- read.csv(file = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                    header = TRUE, na.strings = c("", "NA"))
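The na.strings argument decides which raw fields become NA on import. A minimal sketch on an inline toy CSV (the text and column names are made up for illustration, not taken from the dataset) shows that both empty fields and the literal string NA are converted:

```r
# Toy CSV (hypothetical data): column b has one empty field and one literal "NA".
csv_text <- "a,b\n1,\n2,NA\n3,x"

# Same na.strings setting as in the read.csv calls above.
toy <- read.csv(text = csv_text, header = TRUE, na.strings = c("", "NA"))

# Number of missing values per column after import.
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```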
Show missing data
missmap(obj = training, y.cex=0.5, x.cex=0.7)
I used the Amelia package to visualize the missing data. Its missmap function draws a map of the missingness in a dataset using the image function. As the map shows, more than half of the variables contain missing values, so I removed every column that has at least one missing value.
cols <- sapply(X = training, FUN = function(X) sum(is.na(X)) == 0)
training <- training[, cols]
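The same filter can be checked on a small example. The sketch below uses a made-up data frame rather than the real training set: it first computes the share of missing values per column and then keeps only the complete columns, which mirrors the sapply call above.

```r
# Hypothetical toy data: only the "kurt" column contains missing values.
toy <- data.frame(
  roll  = c(1.41, 1.42, 1.48),
  kurt  = c(NA, NA, 0.5),
  label = c("A", "B", "A")
)

# Share of missing values in each column (0 means the column is complete).
missing_share <- colMeans(is.na(toy))

# Keep only the columns without any missing value.
keep <- missing_share == 0
toy_complete <- toy[, keep, drop = FALSE]

print(round(missing_share, 2))
print(names(toy_complete))
```

Looking at missing_share before filtering also makes it easy to see whether a column is almost entirely empty or only has a few gaps.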
Several variables are not directly related to the target variable and may even lead to misleading results, so I removed the first seven variables: “X”, “user_name”, “raw_timestamp_part_1”, “raw_timestamp_part_2”, “cvtd_timestamp”, “new_window” and “num_window”.
glimpse(training[,1:10])
## Observations: 19,622
## Variables: 10
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ user_name <fctr> carlitos, carlitos, carlitos, carlitos, ...
## $ raw_timestamp_part_1 <int> 1323084231, 1323084231, 1323084231, 13230...
## $ raw_timestamp_part_2 <int> 788290, 808298, 820366, 120339, 196328, 3...
## $ cvtd_timestamp <fctr> 05/12/2011 11:23, 05/12/2011 11:23, 05/1...
## $ new_window <fctr> no, no, no, no, no, no, no, no, no, no, ...
## $ num_window <int> 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 1...
## $ roll_belt <dbl> 1.41, 1.41, 1.42, 1.48, 1.48, 1.45, 1.42,...
## $ pitch_belt <dbl> 8.07, 8.07, 8.07, 8.05, 8.07, 8.06, 8.09,...
## $ yaw_belt <dbl> -94.4, -94.4, -94.4, -94.4, -94.4, -94.4,...
training <- training[,-c(1:7)]
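Dropping columns by position works as long as the column order never changes. As an alternative sketch (not what the code above does), the same seven variables can be dropped by name. The toy data frame here is invented and only reuses the column names from the glimpse output.

```r
# Names of the seven bookkeeping columns listed in the glimpse output above.
id_cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
             "cvtd_timestamp", "new_window", "num_window")

# Hypothetical stand-in for the training set: the seven columns plus two others.
toy <- as.data.frame(matrix(0, nrow = 2, ncol = length(id_cols) + 2))
names(toy) <- c(id_cols, "roll_belt", "classe")

# Drop by name, so the result does not depend on column positions.
toy_reduced <- toy[, setdiff(names(toy), id_cols), drop = FALSE]
print(names(toy_reduced))
```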
I used the makeCluster function to set up parallel computing. By making the computer perform many calculations simultaneously, we spend less time waiting for the model to build. The following code is configured to run on a workstation with 3 logical cores.
cl <- makeCluster(3, type="SOCK")
registerDoSNOW(cl)
set.seed(123)
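The number of workers is hard-coded to 3 above. The sketch below shows one way it could be derived from the machine instead, using detectCores from the parallel package that ships with R. Leaving one core free is my assumption about a sensible default, not part of the original analysis.

```r
# Ask the operating system how many logical cores are available.
cores <- parallel::detectCores()

# detectCores() can return NA on some platforms; fall back to a single core.
if (is.na(cores)) cores <- 1L

# Leave one core free for the rest of the system, but use at least one worker.
n_workers <- max(1L, cores - 1L)
print(n_workers)

# n_workers could then replace the literal 3 in makeCluster(3, type = "SOCK").
```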
Then I created the training control object. This assignment uses 3-fold cross-validation to estimate out-of-sample accuracy and to guard against overfitting when the tuning parameter is selected.
train_control <- trainControl(method = "cv", number = 3,
                              allowParallel = TRUE, verboseIter = TRUE)
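To make the idea of 3-fold cross-validation concrete, here is a simplified base R sketch of how 19,622 rows can be split into three folds. Each fold is held out once while the model is trained on the other two. This version assigns folds purely at random; caret balances the folds by class, so its exact assignment differs.

```r
n <- 19622   # number of rows in the training set
k <- 3       # number of folds

set.seed(123)
# Give every row a fold label 1..k, then shuffle the labels.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Rows held out in each fold, and rows left for training in that iteration.
held_out <- as.integer(table(fold_id))
train_sizes <- n - held_out

print(held_out)
print(train_sizes)
```

Each training portion holds about two thirds of the data, which is consistent with the sample sizes that caret reports in its resampling summary.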
In this assignment I used a random forest model. Random forests are a popular ensemble method that can be used to build predictive models for both classification and regression problems. In this case the target variable is “classe” (A - exactly according to the specification, B - throwing the elbows to the front, C - lifting the dumbbell only halfway, D - lowering the dumbbell only halfway, and E - throwing the hips to the front).
model.rf <- train(classe ~ ., data = training,
                  method = "rf",
                  importance = TRUE,
                  trControl = train_control)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27 on full training set
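The log shows that mtry = 27 was selected. mtry is the number of predictors sampled at random as split candidates at each node of a tree. With 52 predictors, the three values that were tried (2, 27 and 52, as listed in the resampling results) match an evenly spaced sequence from 2 to the number of predictors. The sketch below reproduces them in base R and builds a data frame that could be passed to train through its tuneGrid argument. That caret derives its default grid exactly this way is my assumption, not something the output states.

```r
n_predictors <- 52   # predictors left after the cleaning steps
n_candidates <- 3    # number of mtry values to try

# Evenly spaced candidates between 2 and the number of predictors.
mtry_values <- floor(seq(2, n_predictors, length.out = n_candidates))

# One-column data frame in the shape expected by train(..., tuneGrid = rf_grid).
rf_grid <- expand.grid(mtry = mtry_values)
print(rf_grid)
```

Passing an explicit grid makes the search reproducible and lets you try other values, for example a finer grid around 27.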
stopCluster is called to properly shut down the cluster before exiting R. If it is not called, it may be necessary to use external means to ensure that all worker processes are shut down.
stopCluster(cl)
Summary of the training:
The random forest model performs well: its cross-validated accuracy is approximately 99% (0.9932 for the selected mtry = 27).
model.rf
## Random Forest
##
## 19622 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 13081, 13082, 13081
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9929163 0.9910390
## 27 0.9932220 0.9914255
## 52 0.9869027 0.9834305
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
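The expected out-of-sample error follows directly from this table as one minus the best cross-validated accuracy. The sketch below retypes the three rows by hand so that it runs on its own. On a fitted model, the same table is available as model.rf$results.

```r
# Resampling results copied from the printed output above.
cv_results <- data.frame(
  mtry     = c(2, 27, 52),
  Accuracy = c(0.9929163, 0.9932220, 0.9869027),
  Kappa    = c(0.9910390, 0.9914255, 0.9834305)
)

# Row with the highest accuracy, i.e. the model that caret selected.
best <- cv_results[which.max(cv_results$Accuracy), ]

# Estimated out-of-sample error rate.
estimated_error <- 1 - best$Accuracy

print(best$mtry)
print(round(100 * estimated_error, 2))  # error rate in percent
```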
Variable Importance Plot
varImpPlot(model.rf$finalModel)
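The numbers behind such a plot can also be inspected as a table. Because the fitted model is not available outside this report, the sketch below trains a small forest on the built-in iris data instead (my substitution, purely for illustration) and ranks the predictors by mean decrease in accuracy using the importance function from randomForest.

```r
library(randomForest)

set.seed(123)
# Small stand-in model on iris; importance = TRUE stores permutation importance.
rf_iris <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)

# Matrix with one row per predictor and importance measures as columns.
imp <- importance(rf_iris)

# Sort predictors from most to least important by mean decrease in accuracy.
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
print(round(ranked, 2))
```

For the model in this report, the equivalent call would be importance(model.rf$finalModel).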
The main goal of this project was to predict the manner in which the subjects did the exercise, so the predictions for the testing set are shown below:
predict(object = model.rf, newdata = testing)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
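For reporting, the predictions can be paired with a case number and summarized per class. The sketch below retypes the 20 predicted labels from the output above so that it runs without the fitted model. The case column is a simple running index that I introduce for illustration.

```r
classes <- c("A", "B", "C", "D", "E")

# The 20 predictions printed above, in order.
predicted <- factor(
  c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
    "B", "C", "B", "A", "E", "E", "A", "B", "B", "B"),
  levels = classes
)

# One row per test case (case is a hypothetical running index).
results <- data.frame(case = seq_along(predicted), predicted = predicted)

# How often each class was predicted.
class_counts <- table(results$predicted)
print(class_counts)
```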