This project uses the HAR data set to predict the quality of the activity performed by the participants, based on body-sensor data. First we download the data set from its source and start cleaning it.
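Before that, several add-on packages used throughout the analysis are assumed to be installed and attached; a minimal setup chunk along these lines is assumed to have been run first (randomForest and gbm are pulled in automatically by caret::train).
# Packages assumed to be attached before the analysis
library(caret)       # createDataPartition, nearZeroVar, train, confusionMatrix
library(corrplot)    # correlation plot
library(dplyr)       # tbl_df
library(rpart)       # classification tree
library(parallel)    # makeCluster, detectCores
library(doParallel)  # registerDoParallel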
if(!file.exists('pml-training.csv')) {
download.file(url = 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',
destfile = 'pml-training.csv')
}
if(!file.exists('pml-testing.csv')) {
download.file(url = 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv',
destfile = 'pml-testing.csv')
}
training <- read.csv(file = 'pml-training.csv')
testing <- read.csv(file = 'pml-testing.csv')
As there is a large number of features, we investigate whether all of them are useful: we check various properties of the features and eliminate the extraneous ones. First we check for NAs.
# Percentage of NA values in each column
nas <- sapply(training, function(x) mean(is.na(x)) * 100)
print(table(nas))
## nas
## 0 97.9308938946081
## 93 67
From the table above we see a clear pattern in the occurrence of missing values: 67 variables are almost entirely (about 98%) NA. We drop them straight away.
# Collect the names of the 67 high-NA variables and remove them from both sets
nas <- nas[nas > 0]
na_attr <- names(nas)
# Also drop the row index 'X' and 'user_name', which carry no sensor information
na_attr <- append(na_attr, c('X', 'user_name'))
training <- training[, !(names(training) %in% na_attr)]
testing <- testing[, !(names(testing) %in% na_attr)]
# Delete near-zero-variance predictors (excluding the outcome 'classe')
nzv <- nearZeroVar(training[, names(training) != 'classe'], saveMetrics = FALSE, names = TRUE)
training <- training[, !(names(training) %in% nzv)]
testing <- testing[, !(names(testing) %in% nzv)]
# Convert to dplyr tbl_df objects for convenience
train1 <- tbl_df(training)
test1 <- tbl_df(testing)
# Drop the three timestamp columns for now
train1 <- train1[, -c(1:3)]
test1 <- test1[, -c(1:3)]
The near-zero-variance predictors removed above have very low variance and therefore carry very little information, which is why we eliminate them directly. Next, we compute the correlation between all pairs of the remaining variables. This gives us insight into which variables are tightly correlated; if a large number of variables were strongly correlated, we would eliminate one variable from each correlated pair.
# Correlation plot of the remaining predictors (excluding 'classe')
cor_mat <- cor(train1[, -54])
corrplot(cor_mat, method = "color", type = "upper")
# Comments on correlation
Only a couple of variables are strongly correlated, so I do not intend to drop any of them; the rest of the data set is fine.
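Had a large number of predictors been strongly correlated, one variable of each correlated pair could have been dropped. The sketch below is not run in this analysis; it uses caret's findCorrelation() on the correlation matrix computed above, with an assumed cutoff of 0.90.
# Hypothetical pruning of correlated predictors (not applied here); 0.90 is an assumed cutoff
high_cor <- findCorrelation(cor_mat, cutoff = 0.90)
# train1 <- train1[, -high_cor]   # would drop one column from each highly correlated pair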
# Create a 90/10 partition of the training set; the 10% hold-out is used for validation later
inTrain <- createDataPartition(train1$classe, p = .90, list = FALSE)
train <- training[inTrain, ]
test <- training[-inTrain, ]
Next we perform PCA to understand the nature of the data and plot the principal components.
# Separate the outcome ('classe') and drop the timestamp columns before PCA
y_train <- train[,57]
train <- train[,-c(1,2,3,57)]
# find the principal components
pr <- prcomp(train, scale. = T)
# Biplot of the first two principal components
biplot(pr, scale = 0, main="Plotting the Principal Components")
In the biplot the red arrows show the loadings of the original variables on the first two principal components, and the observations appear to form roughly five clusters.
Now we plot the variance explained against the number of components, to get an idea of how many components are useful for prediction.
par(mfrow=c(1,2))
# Extract the standard deviations and compute the variances
std_dev <- pr$sdev
pr_var <- std_dev^2
print(pr_var[1:10])
## [1] 8.357598 8.138721 4.662655 4.185845 3.667275 3.031224 2.256040
## [8] 2.068248 1.723622 1.512150
# compute the variance explained
pr_varex <- pr_var/sum(pr_var)
print(pr_varex[1:10])
## [1] 0.15769052 0.15356077 0.08797463 0.07897820 0.06919386 0.05719291
## [7] 0.04256679 0.03902354 0.03252117 0.02853114
# Plot Principal components vs Variance Explained
plot(pr_varex, xlab = "Principal Component",
ylab = "Variance Explained",
type = "b")
# Plot Cumulative proportion of var explained vs. Principal components
plot(cumsum(pr_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
# Reference lines: around 42 components capture essentially all of the variance
abline(h = 1)
abline(v = 42)
# Comments on the principal components
From the figure above we can conclude that around 40 components explain almost all of the variance in the data set; the remaining components do not add significant information for prediction.
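Although the final models below are trained on the raw features, caret can also project the predictors onto the leading principal components. A minimal sketch, assuming a 95% variance threshold and the hypothetical object names pca_pre and train_pca:
# Sketch: project the predictors onto enough principal components to retain ~95% of the variance
pca_pre <- preProcess(train, method = "pca", thresh = 0.95)
train_pca <- predict(pca_pre, train)
dim(train_pca)   # the number of retained components depends on the threshold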
To speed up model training we set up a parallel processing environment.
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
dim(train1)
## [1] 19622 54
dim(train)
## [1] 17662 53
dim(test1)
## [1] 20 54
dim(test)
## [1] 1960 57
# Re-attach the outcome to the predictors and drop the timestamps from the hold-out set
train <- cbind(train, classe = y_train)
test <- test[, -c(1:3)]
# Fit a classification tree as a baseline model
rpart.mod <- rpart(classe ~ ., data = train, method = "class")
# Predict class labels on the hold-out set
rpart.pred <- predict(rpart.mod, newdata = test, type = "class")
print(confusionMatrix(rpart.pred, test$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 497 84 10 37 26
## B 13 185 10 6 22
## C 6 28 292 44 28
## D 30 46 12 217 37
## E 12 36 18 17 247
##
## Overall Statistics
##
## Accuracy : 0.7337
## 95% CI : (0.7135, 0.7531)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6615
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8907 0.48813 0.8538 0.6760 0.6861
## Specificity 0.8880 0.96774 0.9345 0.9237 0.9481
## Pos Pred Value 0.7599 0.78390 0.7337 0.6345 0.7485
## Neg Pred Value 0.9533 0.88747 0.9680 0.9357 0.9307
## Prevalence 0.2847 0.19337 0.1745 0.1638 0.1837
## Detection Rate 0.2536 0.09439 0.1490 0.1107 0.1260
## Detection Prevalence 0.3337 0.12041 0.2031 0.1745 0.1684
## Balanced Accuracy 0.8893 0.72793 0.8941 0.7999 0.8171
The performance of the decision tree on the hold-out set (about 73% accuracy) is not impressive, which is why we move on to other models. We do not use this model to predict on the testing set, since it performs poorly on the hold-out set.
Next we build a random forest and evaluate it on the hold-out set. If its performance is satisfactory, we proceed to train on the full data set and predict on the testing set.
# Set up 2-fold cross-validation control, allowing parallel execution
fitControl <- trainControl(method = "cv",
number = 2,
allowParallel = TRUE)
# Train the random forest on the training partition
mod.fit <- caret::train(classe~., data = train, method = 'rf', trControl = fitControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
print(mod.fit$results)
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.9895255 0.9867491 0.0016814905 0.002126778
## 2 27 0.9951308 0.9938406 0.0009608517 0.001215690
## 3 53 0.9916204 0.9894004 0.0022419873 0.002835017
# Evaluate the random forest on the hold-out set
test.pred <- predict(mod.fit, newdata = test)
print(confusionMatrix(test.pred, reference = test$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 558 0 0 0 0
## B 0 379 0 0 0
## C 0 0 342 2 0
## D 0 0 0 318 1
## E 0 0 0 1 359
##
## Overall Statistics
##
## Accuracy : 0.998
## 95% CI : (0.9948, 0.9994)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9974
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 0.9907 0.9972
## Specificity 1.0000 1.0000 0.9988 0.9994 0.9994
## Pos Pred Value 1.0000 1.0000 0.9942 0.9969 0.9972
## Neg Pred Value 1.0000 1.0000 1.0000 0.9982 0.9994
## Prevalence 0.2847 0.1934 0.1745 0.1638 0.1837
## Detection Rate 0.2847 0.1934 0.1745 0.1622 0.1832
## Detection Prevalence 0.2847 0.1934 0.1755 0.1628 0.1837
## Balanced Accuracy 1.0000 1.0000 0.9994 0.9950 0.9983
The above results are impressive, so we build an RF model on the whole training data set again and use it to predict on the testing set.
## Building Models: Random Forest
model.fit <- caret::train(classe~., data = train1, method = "rf", trControl = fitControl)
print(model.fit)
## Random Forest
##
## 19622 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 9811, 9811
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9926613 0.9907167
## 27 0.9965345 0.9956164
## 53 0.9932219 0.9914259
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
# The testing set has 'problem_id' where the training set has 'classe'; rename it so the column layout matches
names(test1)[54] <- "classe"
mod.pred <- predict(model.fit, newdata = test1)
print(mod.pred)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These are the random forest predictions on the 20 cases of the testing set.
Next, we fit a gradient boosting machine (GBM) on the train partition and evaluate it on the hold-out test set.
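The chunk that produced the fit summary below is not echoed in the original; judging from that summary (17,662 samples, i.e. the train partition, resampled with caret's default 25 bootstrap repetitions), it was presumably equivalent to the following sketch, where the object name gbm.fit and the verbose = FALSE flag are assumptions.
# Fit a gradient boosting machine on the train partition (caret's default bootstrap resampling)
gbm.fit <- caret::train(classe ~ ., data = train, method = "gbm", verbose = FALSE)
print(gbm.fit)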
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loaded gbm 2.1.3
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Stochastic Gradient Boosting
##
## 17662 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 17662, 17662, 17662, 17662, 17662, 17662, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7582887 0.6932199
## 1 100 0.8300027 0.7846301
## 1 150 0.8704952 0.8360080
## 2 50 0.8840958 0.8531832
## 2 100 0.9407723 0.9250473
## 2 150 0.9629516 0.9531192
## 3 50 0.9319536 0.9138408
## 3 100 0.9703342 0.9624607
## 3 150 0.9854395 0.9815786
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
# Evaluate the GBM on the hold-out set
gbm.test.pred <- predict(gbm.fit, newdata = test)
print(confusionMatrix(gbm.test.pred, reference = test$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 558 2 0 0 0
## B 0 373 1 0 0
## C 0 3 341 3 0
## D 0 0 0 317 4
## E 0 1 0 1 356
##
## Overall Statistics
##
## Accuracy : 0.9923
## 95% CI : (0.9874, 0.9957)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9903
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9842 0.9971 0.9875 0.9889
## Specificity 0.9986 0.9994 0.9963 0.9976 0.9988
## Pos Pred Value 0.9964 0.9973 0.9827 0.9875 0.9944
## Neg Pred Value 1.0000 0.9962 0.9994 0.9976 0.9975
## Prevalence 0.2847 0.1934 0.1745 0.1638 0.1837
## Detection Rate 0.2847 0.1903 0.1740 0.1617 0.1816
## Detection Prevalence 0.2857 0.1908 0.1770 0.1638 0.1827
## Balanced Accuracy 0.9993 0.9918 0.9967 0.9925 0.9938
# GBM predictions on the 20 testing cases
gbm.pred <- predict(gbm.fit, newdata = test1)
print(gbm.pred)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
print(mod.pred)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
From the above results we find that the GBM and RF models perform similarly and have successfully predicted all 20 samples in the testing data set.
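As a quick consistency check (a sketch, relying only on the two prediction vectors sharing the factor levels A to E), the agreement between the two models can be verified directly:
# Cross-tabulate the random forest and GBM predictions on the 20 testing cases
table(rf = mod.pred, gbm = gbm.pred)
# TRUE when both models give identical predictions
all(mod.pred == gbm.pred)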
# Stop the cluster
stopCluster(cluster)
registerDoSEQ()