Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
# Load the testing and training data (assumed to be in the working directory)
pml.testing <- read.csv("pml-testing.csv")
pml.training <- read.csv("pml-training.csv")
Let's look at the dimensions of our training data.
dim(pml.training)
## [1] 19622 160
As we can see, there are 19622 observations of 160 variables, so it is not practical to show a full summary of the dataset in this report. Since we have abundant data to train the model, we will partition the training set and keep a portion aside for cross-validation.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Split the training data 70/30; the 30% hold-out set is used for cross-validation
# (a set.seed() call before this line would make the partition reproducible)
inTrain <- createDataPartition(y = pml.training$classe, p = 0.70, list = FALSE)
pml.train <- pml.training[inTrain,]
pml.crossValidate <- pml.training[-inTrain,]
dim(pml.train)
## [1] 13737 160
dim(pml.crossValidate)
## [1] 5885 160
We have now split the data in a 70/30 ratio; the 30% hold-out set will be used for cross-validation.
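As a quick sanity check (a minimal sketch; this output was not part of the original report), we can confirm that createDataPartition preserved the class proportions across the split:
# Class proportions should be nearly identical in the two sets
round(prop.table(table(pml.train$classe)), 3)
round(prop.table(table(pml.crossValidate$classe)), 3)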
Next we preprocess the data before training the model. From the data summary we observed that there are many variables, and a number of them show very little variability, so it is beneficial to remove those variables from our data set.
# Flag near-zero-variance variables on the training set and drop them
# from all three data sets using the same mask
nzv <- nearZeroVar(pml.train, saveMetrics = TRUE)
pml.train <- pml.train[, nzv$nzv == FALSE]
pml.testing <- pml.testing[, nzv$nzv == FALSE]
pml.crossValidate <- pml.crossValidate[, nzv$nzv == FALSE]
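As a quick check (a sketch; this count was not printed in the original report), we can see how many columns the near-zero-variance filter removed:
# Number of columns flagged as near zero variance
sum(nzv$nzv)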
Now that the near-zero-variance variables are removed, we check for NAs in our variables, which could distort our results.
# Count NAs per column and list the columns with more than 1000 NAs
columnsum <- colSums(is.na(pml.train))
columnsum[columnsum > 1000]
## max_roll_belt max_picth_belt min_roll_belt
## 13461 13461 13461
## min_pitch_belt amplitude_roll_belt amplitude_pitch_belt
## 13461 13461 13461
## var_total_accel_belt avg_roll_belt stddev_roll_belt
## 13461 13461 13461
## var_roll_belt avg_pitch_belt stddev_pitch_belt
## 13461 13461 13461
## var_pitch_belt avg_yaw_belt stddev_yaw_belt
## 13461 13461 13461
## var_yaw_belt var_accel_arm max_picth_arm
## 13461 13461 13461
## max_yaw_arm min_roll_arm min_pitch_arm
## 13461 13461 13461
## min_yaw_arm amplitude_yaw_arm max_roll_dumbbell
## 13461 13461 13461
## max_picth_dumbbell min_roll_dumbbell min_pitch_dumbbell
## 13461 13461 13461
## amplitude_roll_dumbbell amplitude_pitch_dumbbell var_accel_dumbbell
## 13461 13461 13461
## avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 13461 13461 13461
## avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell
## 13461 13461 13461
## avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell
## 13461 13461 13461
## max_roll_forearm max_picth_forearm min_roll_forearm
## 13461 13461 13461
## min_pitch_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 13461 13461 13461
## var_accel_forearm avg_roll_forearm
## 13461 13461
Here we can see that a lot of variables have more than 1000 NAs; in fact about 98% of their values (13461 out of 13737) are NA. It is therefore a good idea to remove those variables from our dataset.
# Keep only the columns in which at most 90% of the values are NA
na.fraction <- colSums(is.na(pml.train)) / nrow(pml.train)
train.withoutNa <- pml.train[, na.fraction <= 0.90]
sum(is.na(train.withoutNa))
## [1] 0
testing.withoutNA <- pml.testing[, na.fraction <= 0.90]
crossvalidate.withoutNA <- pml.crossValidate[, na.fraction <= 0.90]
We have now removed the predictors that are more than 90 percent NA, applying the same column filter to the testing and cross-validation data.
dim(train.withoutNa)
## [1] 13737 59
We can see that the number of variables has been reduced to 59.
# Remove the identification-only variables (row index, user name, and timestamps)
train.withoutNa <- train.withoutNa[,-(1:5)]
testing.withoutNA <- testing.withoutNA[,-(1:5)]
crossvalidate.withoutNA <- crossvalidate.withoutNA[,-(1:5)]
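After dropping the five identification columns, 54 columns should remain (53 predictors plus classe), which is the shape the correlation plot below assumes; a quick check (a sketch, not shown in the original output):
dim(train.withoutNa)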
library(corrplot)
## corrplot 0.84 loaded
# Correlation matrix of the 53 predictors (column 54 is the classe outcome)
corr <- cor(train.withoutNa[,-54])
corrplot(corr, order = "FPC", method = "color", type = "lower", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
In the plot above, highly correlated pairs of variables are shaded in dark colors, and there are relatively few of them. Since only a small number of predictors are strongly correlated, we will not perform Principal Component Analysis in this assignment.
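To back this visual impression with a number, caret's findCorrelation can count the predictors involved in strong pairwise correlations (a minimal sketch; the 0.80 cutoff is an assumption, and this check was not part of the original analysis):
# Indices of predictors with an absolute pairwise correlation above 0.80
highCorr <- findCorrelation(corr, cutoff = 0.80)
length(highCorr)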
We will now try to choose the best model for this training data.
First, we train our model using a Random Forest.
# Train a Random Forest (caret's default here is 25 bootstrap resamples, which is slow)
fit.rf <- train(classe ~ ., method = "rf", data = train.withoutNa)
predict.rf <- predict(fit.rf, crossvalidate.withoutNA)
c1 <- confusionMatrix(predict.rf, crossvalidate.withoutNA$classe)
c1$overall["Accuracy"]
## Accuracy
## 0.9984707
plot(c1$table, col = c1$byClass, main = paste("Random Forest - Accuracy =",
round(c1$overall['Accuracy'], 4)))
We trained a Random Forest model and evaluated it on the data we kept aside for cross-validation. The accuracy on the hold-out set is around 99.8%, which is highly accurate, but the time taken to train the model is high.
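If the training time is a concern, a common remedy is to replace caret's default bootstrap resampling with k-fold cross-validation via trainControl. The sketch below assumes 5-fold cross-validation is acceptable; it is not the configuration used above, and the resulting accuracy could differ slightly.
# Sketch: 5-fold cross-validation instead of the default 25 bootstrap resamples
ctrl <- trainControl(method = "cv", number = 5)
fit.rf.cv <- train(classe ~ ., method = "rf", data = train.withoutNa, trControl = ctrl)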
# Train a gradient boosted model (GBM) on the same data
fit.gbm <- train(classe ~ ., method = "gbm", data = train.withoutNa, verbose = FALSE)
predict.gbm <- predict(fit.gbm, crossvalidate.withoutNA)
c2 <- confusionMatrix(predict.gbm, crossvalidate.withoutNA$classe)
c2$overall["Accuracy"]
## Accuracy
## 0.9858963
plot(c2$table, col = c2$byClass, main = paste("GBM - Accuracy =",
round(c2$overall['Accuracy'], 4)))
Next, we trained a GBM model and evaluated it on the cross-validation data set. We observed that it also takes a long time to train, and its accuracy is lower than that of the Random Forest model. Therefore we reject this model and proceed further.
library(e1071)
# Fit a support vector machine with the default radial kernel
# (svm() ignores the method and verbose arguments, so they are dropped here)
fit.svm <- svm(classe ~ ., data = train.withoutNa)
predict.svm <- predict(fit.svm, crossvalidate.withoutNA)
c3 <- confusionMatrix(predict.svm, crossvalidate.withoutNA$classe)
c3$overall["Accuracy"]
## Accuracy
## 0.9559898
plot(c3$table, col = c3$byClass, main = paste("SVM - Accuracy =",
round(c3$overall['Accuracy'], 4)))
We then trained an SVM, which uses a radial kernel by default, and cross-validated it to get the accuracy. It takes considerably less time to train than the other two models, but its out-of-sample error is higher.
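The SVM's accuracy might improve with kernel-parameter tuning; e1071 provides tune.svm for a grid search (a sketch; the gamma and cost grids below are assumptions, and no tuning was performed in this analysis):
# Sketch: grid search over gamma and cost for the radial kernel
tuned <- tune.svm(classe ~ ., data = train.withoutNa,
                  gamma = c(0.001, 0.01, 0.1), cost = c(1, 10, 100))
tuned$best.parameters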
data.frame(RandomForest = c1$overall["Accuracy"],
GBM = c2$overall["Accuracy"],SVM = c3$overall["Accuracy"])
## RandomForest GBM SVM
## Accuracy 0.9984707 0.9858963 0.9559898
From the above results we can infer that the Random Forest and GBM models take a long time to train, whereas the SVM trains much faster but is less accurate, so there is a time/accuracy trade-off. We choose the Random Forest model to classify the given test set, because its training time is not so high that we need to reject it.
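Since the hold-out set was never used during training, the expected out-of-sample error of the chosen model is simply one minus its hold-out accuracy, computed from the figure already reported above:
# Estimated out-of-sample error of the Random Forest model
1 - c1$overall["Accuracy"]
##  Accuracy
## 0.0015293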
# Predict the 20 test cases with the chosen Random Forest model
predict.rf.test <- predict(fit.rf, testing.withoutNA)
predict.rf.test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Above are the predictions for the 20 test cases from the chosen Random Forest model.
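For the course submission, each prediction is typically written to its own text file. The helper below is hypothetical (the problem_id_ file-naming convention is an assumption, not part of the original report):
# Hypothetical helper: write each of the 20 predictions to its own file
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(predict.rf.test)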