Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
Loading the required packages
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(RANN))
suppressPackageStartupMessages(library(corrplot))
suppressPackageStartupMessages(library(kernlab))
suppressPackageStartupMessages(library(e1071))
suppressPackageStartupMessages(library(randomForest))
Reading the training and test data sets
# Reading the data
train <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
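The raw CSV files also contain "#DIV/0!" and empty-string entries; optionally, these could be mapped to NA at read time. A sketch, not applied here, so that the class counts below reflect the raw read:
# Optional: treat "#DIV/0!" and empty strings as NA while reading (sketch)
# train <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))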
Analyzing the classes of the variables in the training data set
table(sapply(train,class))
##
## factor integer numeric
## 37 35 88
class(train$classe)
## [1] "factor"
Since the variable to be predicted is of class “factor”, we check its levels and how the observations are distributed across them.
table(train$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
The factor variable classe has 5 levels: A, B, C, D and E. Support vector machines or random forests should work well for building a machine learning algorithm that predicts a categorical variable from labeled data with fewer than 100k samples.
Next, we convert all factor variables to numeric so that the modeling algorithms below can train on them. Note that as.numeric replaces each factor with its integer level code rather than true dummy variables; after the conversion, all 160 variables are of class numeric.
# Convert every column, including factors, to numeric level codes
after_dummy <- lapply(train, as.numeric)
after_dummy_test <- lapply(test, as.numeric)
after_dummy <- as.data.frame(after_dummy)
after_dummy_test <- as.data.frame(after_dummy_test)
table(sapply(after_dummy,class))
##
## numeric
## 160
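If true one-hot dummy variables were preferred over integer level codes, caret's dummyVars could be used instead. A minimal sketch on one factor column (illustrative only; it must run before train is removed below):
# Sketch: one-hot encode the user_name factor with caret::dummyVars
dv <- dummyVars(~ user_name, data = train)
head(predict(dv, newdata = train))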
We now have 160 numeric variables; let's see whether any of them have near-zero variance.
# Compute near-zero-variance diagnostics for every variable
newtrain <- nearZeroVar(after_dummy, saveMetrics = TRUE)
table(newtrain$nzv)
##
## FALSE TRUE
## 100 60
# Drop the near-zero-variance columns from both sets
after_nzv <- after_dummy[, newtrain$nzv == FALSE]
after_nzv_test <- after_dummy_test[, newtrain$nzv == FALSE]
# Free memory: the raw data frames are no longer needed
rm(train)
rm(test)
after_nzv <- lapply(after_nzv, as.numeric)
after_nzv <- as.data.frame(after_nzv)
after_nzv_test <- lapply(after_nzv_test, as.numeric)
after_nzv_test <- as.data.frame(after_nzv_test)
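For reference, the names of the flagged columns can be read straight off the nearZeroVar metrics (a small added illustration):
# First few variable names flagged as near-zero variance
head(rownames(newtrain)[newtrain$nzv])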
We see that 60 variables have near-zero variance (nzv == TRUE), meaning they carry very little predictive information, so it is safe to remove them from the data set. We are now left with 100 variables (99 predictors plus the outcome classe).
Next, we treat the missing values in the data set using k-nearest-neighbour imputation.
# k-nearest-neighbour imputation; note that caret's knnImpute also
# centers and scales the predictors as part of the procedure
obj <- preProcess(after_nzv[, -100], method = "knnImpute")
table(is.na(after_nzv))
##
## FALSE TRUE
## 1174344 787856
table(is.na(after_nzv_test))
##
## FALSE TRUE
## 1180 820
# Quick look at the distributions (not used further below)
summary <- as.data.frame(summary(after_nzv))
# Apply the imputation (and centering/scaling) to both predictor sets
missing <- predict(obj, after_nzv[, -100])
missing.1 <- predict(obj, after_nzv_test[, -100])
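As a quick added sanity check, we can confirm that the imputation left no missing values:
# Stop if any NA survived imputation in either set
stopifnot(!anyNA(missing), !anyNA(missing.1))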
Next, we compute the correlations between features and draw the correlation plot.
# Find correlation excluding the variable to be predicted
M <- abs(cor(missing))
M1 <- abs(cor(missing.1))
# Correlation with itself is 1, so resetting that
diag(M) <- 0
diag(M1) <- 0
Drawing the correlation plot, highlighting feature pairs with absolute correlation > 0.8
# corrplot expects a numeric matrix, so coerce the logical comparison to 0/1
corrplot((M > 0.8) * 1)
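For a text view of the same information, the pairs above the cutoff can be listed directly (an added illustration):
# List predictor pairs with absolute correlation above 0.8, each pair once
high_pairs <- which(M > 0.8, arr.ind = TRUE)
high_pairs <- high_pairs[high_pairs[, 1] < high_pairs[, 2], , drop = FALSE]
head(data.frame(var1 = rownames(M)[high_pairs[, 1]],
                var2 = colnames(M)[high_pairs[, 2]],
                corr = M[high_pairs]))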
Flagging highly correlated predictors and removing them. We are now left with 61 predictors.
# Flagging high correlation
highlyCorr <- findCorrelation(M, cutoff = 0.8)
filteredCorr <- missing[,-highlyCorr]
filteredCorr.t <- missing.1[,-highlyCorr]
Next, an important preprocessing step would normally be to correct for skewness by centering and scaling the predictors. However, the predictors are already standardized with mean 0 and standard deviation 1, because caret's knnImpute centers and scales the data as part of imputation, so we can bypass this step.
head(lapply(filteredCorr, function(x) mean(x)))
## $X
## [1] -2.48146e-17
##
## $user_name
## [1] -6.284771e-17
##
## $raw_timestamp_part_1
## [1] -8.60877e-14
##
## $raw_timestamp_part_2
## [1] 4.582767e-17
##
## $num_window
## [1] 6.066985e-17
##
## $pitch_belt
## [1] 1.846961e-17
head(lapply(filteredCorr, function(x) sd(x)))
## $X
## [1] 1
##
## $user_name
## [1] 1
##
## $raw_timestamp_part_1
## [1] 1
##
## $raw_timestamp_part_2
## [1] 1
##
## $num_window
## [1] 1
##
## $pitch_belt
## [1] 1
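A more compact check across all predictors (added, not part of the original output):
# All means should be ~0 and all standard deviations ~1
range(sapply(filteredCorr, mean))
range(sapply(filteredCorr, sd))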
# Re-attach the outcome to the training predictors; the test set has no
# classe column, so use a placeholder
filteredCorr$classe <- after_nzv$classe
filteredCorr.t$classe <- 0
finalTrain <- filteredCorr
finalTest <- filteredCorr.t
# Good practice to keep environment free of unnecessary clutter
rm(after_dummy)
rm(after_nzv)
rm(missing)
rm(M)
rm(highlyCorr)
rm(obj)
rm(after_dummy_test)
rm(after_nzv_test)
rm(missing.1)
rm(M1)
set.seed(1500)
Creating training and validation samples from the final train data set. We expect the out-of-sample error to be around 1%.
inTrain <- createDataPartition(y = finalTrain$classe, p = 0.75, list = FALSE)
train_sample <- finalTrain[inTrain,]
test_sample <- finalTrain[-inTrain,]
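To back the ~1% expected error with a resampling estimate, one could also cross-validate within the training partition; a minimal caret sketch (not run in the original analysis, and slow on data of this size):
# Optional: 5-fold cross-validation estimate of out-of-sample accuracy
cv_ctrl <- trainControl(method = "cv", number = 5)
cv_rf <- train(as.factor(classe) ~ ., data = train_sample,
               method = "rf", trControl = cv_ctrl)
cv_rf$results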
Using a Support Vector Machine classifier, since we are predicting a category from labeled data with fewer than 100k samples. The summary of the model is given below, and so is the confusion matrix. The accuracy on the held-out sample is 0.9984:
model_svm <- svm(as.factor(classe)~., data=train_sample)
summary(model_svm)
##
## Call:
## svm(formula = as.factor(classe) ~ ., data = train_sample)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.01639344
##
## Number of Support Vectors: 4678
##
## ( 737 1187 1024 896 834 )
##
##
## Number of Classes: 5
##
## Levels:
## 1 2 3 4 5
# Predict on the held-out sample (column 62 is the outcome classe)
prediction <- predict(model_svm, test_sample[, -62])
confusionMatrix(prediction,test_sample$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 1394 0 0 0 0
## 2 0 938 1 0 0
## 3 0 1 864 2 0
## 4 0 0 0 801 2
## 5 0 1 0 1 899
##
## Overall Statistics
##
## Accuracy : 0.9984
## 95% CI : (0.9968, 0.9993)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9979
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 1.0000 0.9979 0.9988 0.9963 0.9978
## Specificity 1.0000 0.9997 0.9993 0.9995 0.9995
## Pos Pred Value 1.0000 0.9989 0.9965 0.9975 0.9978
## Neg Pred Value 1.0000 0.9995 0.9998 0.9993 0.9995
## Prevalence 0.2843 0.1917 0.1764 0.1639 0.1837
## Detection Rate 0.2843 0.1913 0.1762 0.1633 0.1833
## Detection Prevalence 0.2843 0.1915 0.1768 0.1637 0.1837
## Balanced Accuracy 1.0000 0.9988 0.9991 0.9979 0.9986
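The radial-kernel SVM above uses e1071's default cost and gamma; if desired, these could be tuned by grid search, e.g. with e1071::tune.svm (a sketch, not part of the original workflow, and slow):
# Optional: tune cost and gamma by cross-validated grid search
svm_tuned <- tune.svm(as.factor(classe) ~ ., data = train_sample,
                      cost = c(0.1, 1, 10),
                      gamma = c(0.005, 0.016, 0.05))
svm_tuned$best.parameters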
Using a Random Forest classifier, since it predicts categorical outcomes well and handles a large number of predictors. The summary of the model is given below, and so is the confusion matrix. The accuracy on the held-out sample is 0.9998:
model_rf <- randomForest(as.factor(classe)~ ., data=train_sample,importance=TRUE,proximity=TRUE)
print(model_rf)
##
## Call:
## randomForest(formula = as.factor(classe) ~ ., data = train_sample, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.01%
## Confusion matrix:
## 1 2 3 4 5 class.error
## 1 4186 0 0 0 0 0.0000000000
## 2 0 2857 0 0 0 0.0000000000
## 3 0 1 2556 0 0 0.0003910833
## 4 0 0 0 2412 0 0.0000000000
## 5 0 0 0 0 2706 0.0000000000
prediction_rf <- predict(model_rf,test_sample[,-62])
confusionMatrix(prediction_rf,test_sample$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 1394 0 0 0 0
## 2 0 939 0 0 0
## 3 0 0 865 0 0
## 4 0 1 0 804 0
## 5 0 0 0 0 901
##
## Overall Statistics
##
## Accuracy : 0.9998
## 95% CI : (0.9989, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9997
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 1.0000 0.9989 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 0.9998 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 0.9988 1.0000
## Neg Pred Value 1.0000 0.9997 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1917 0.1764 0.1639 0.1837
## Detection Rate 0.2843 0.1915 0.1764 0.1639 0.1837
## Detection Prevalence 0.2843 0.1915 0.1764 0.1642 0.1837
## Balanced Accuracy 1.0000 0.9995 1.0000 0.9999 1.0000
Final predictions on the test set
prediction_final_svm <- predict(model_svm, finalTest[,-62])
prediction_final <- predict(model_rf, finalTest[,-62])
The variable importance plot shows the predictors the model relies on most; removing them would severely degrade its performance.
varImpPlot(model_rf)
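The same information can be extracted numerically, for example the top predictors by mean decrease in accuracy (an added illustration):
# Top predictors ranked by mean decrease in accuracy (type = 1)
imp <- importance(model_rf, type = 1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])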
Output of the prediction analysis. chartr maps the numeric level codes 1-5 back to the class letters A-E.
answers <- chartr("12345", "ABCDE", prediction_final)
answers
## [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
## [18] "B" "A" "B"
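To save each prediction to its own file for submission, a small helper along these lines could be used (a sketch; the problem_id_N.txt filename pattern is an assumption):
# Write each predicted letter to its own text file
# (the file naming pattern here is assumed, not prescribed)
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    writeLines(x[i], paste0("problem_id_", i, ".txt"))
  }
}
pml_write_files(answers)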