When using devices such as Jawbone Up, Nike FuelBand, and Fitbit, people regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project builds a model to predict the manner in which people performed a weight-lifting exercise, and applies it to 20 test cases. The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
library(ggplot2)       # plotting
library(caret)         # data partitioning, training, confusion matrices
library(rpart)         # decision trees
library(rattle)        # fancyRpartPlot
library(rpart.plot)    # prp
library(randomForest)  # random forest engine used by caret
library(readr)         # read_csv
library(dplyr)         # data manipulation
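If the CSV files are not already on disk, they can be downloaded first; a minimal sketch, assuming the commonly used mirror URLs for this dataset (verify them before relying on them):
# Assumed mirror URLs; verify before use
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(train_url, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(test_url, "pml-testing.csv")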
# Read local copies of the training and testing data
pml_training <- read_csv("C:/Users/tuuye/Desktop/Data Science/John Hopkins/Machine Learning/pml-training.csv")
pml_testing <- read_csv("C:/Users/tuuye/Desktop/Data Science/John Hopkins/Machine Learning/pml-testing.csv")
set.seed(190594)  # for reproducibility
Check the dimensions of the training data:
dim(pml_training)
## [1] 19622 160
The data frame has 19622 rows (observations) and 160 columns (variables). We will look at the first six rows of the dataset:
head(pml_training)
## # A tibble: 6 x 160
## X1 user_name raw_timestamp_p~ raw_timestamp_p~ cvtd_timestamp
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 carlitos 1323084231 788290 05/12/2011 11~
## 2 2 carlitos 1323084231 808298 05/12/2011 11~
## 3 3 carlitos 1323084231 820366 05/12/2011 11~
## 4 4 carlitos 1323084232 120339 05/12/2011 11~
## 5 5 carlitos 1323084232 196328 05/12/2011 11~
## 6 6 carlitos 1323084232 304277 05/12/2011 11~
## # ... with 155 more variables: new_window <chr>, num_window <dbl>,
## # roll_belt <dbl>, pitch_belt <dbl>, yaw_belt <dbl>,
## # total_accel_belt <dbl>, kurtosis_roll_belt <chr>,
## # kurtosis_picth_belt <chr>, kurtosis_yaw_belt <chr>,
## # skewness_roll_belt <chr>, skewness_roll_belt.1 <chr>,
## # skewness_yaw_belt <chr>, max_roll_belt <dbl>, max_picth_belt <dbl>,
## # max_yaw_belt <chr>, min_roll_belt <dbl>, min_pitch_belt <dbl>,
## # min_yaw_belt <chr>, amplitude_roll_belt <dbl>,
## # amplitude_pitch_belt <dbl>, amplitude_yaw_belt <chr>,
## # var_total_accel_belt <dbl>, avg_roll_belt <dbl>,
## # stddev_roll_belt <dbl>, var_roll_belt <dbl>, avg_pitch_belt <dbl>,
## # stddev_pitch_belt <dbl>, var_pitch_belt <dbl>, avg_yaw_belt <dbl>,
## # stddev_yaw_belt <dbl>, var_yaw_belt <dbl>, gyros_belt_x <dbl>,
## # gyros_belt_y <dbl>, gyros_belt_z <dbl>, accel_belt_x <dbl>,
## # accel_belt_y <dbl>, accel_belt_z <dbl>, magnet_belt_x <dbl>,
## # magnet_belt_y <dbl>, magnet_belt_z <dbl>, roll_arm <dbl>,
## # pitch_arm <dbl>, yaw_arm <dbl>, total_accel_arm <dbl>,
## # var_accel_arm <dbl>, avg_roll_arm <dbl>, stddev_roll_arm <dbl>,
## # var_roll_arm <dbl>, avg_pitch_arm <dbl>, stddev_pitch_arm <dbl>,
## # var_pitch_arm <dbl>, avg_yaw_arm <dbl>, stddev_yaw_arm <dbl>,
## # var_yaw_arm <dbl>, gyros_arm_x <dbl>, gyros_arm_y <dbl>,
## # gyros_arm_z <dbl>, accel_arm_x <dbl>, accel_arm_y <dbl>,
## # accel_arm_z <dbl>, magnet_arm_x <dbl>, magnet_arm_y <dbl>,
## # magnet_arm_z <dbl>, kurtosis_roll_arm <dbl>, kurtosis_picth_arm <chr>,
## # kurtosis_yaw_arm <chr>, skewness_roll_arm <dbl>,
## # skewness_pitch_arm <chr>, skewness_yaw_arm <chr>, max_roll_arm <dbl>,
## # max_picth_arm <dbl>, max_yaw_arm <dbl>, min_roll_arm <dbl>,
## # min_pitch_arm <dbl>, min_yaw_arm <dbl>, amplitude_roll_arm <dbl>,
## # amplitude_pitch_arm <dbl>, amplitude_yaw_arm <dbl>,
## # roll_dumbbell <dbl>, pitch_dumbbell <dbl>, yaw_dumbbell <dbl>,
## # kurtosis_roll_dumbbell <dbl>, kurtosis_picth_dumbbell <dbl>,
## # kurtosis_yaw_dumbbell <chr>, skewness_roll_dumbbell <dbl>,
## # skewness_pitch_dumbbell <dbl>, skewness_yaw_dumbbell <chr>,
## # max_roll_dumbbell <dbl>, max_picth_dumbbell <dbl>,
## # max_yaw_dumbbell <dbl>, min_roll_dumbbell <dbl>,
## # min_pitch_dumbbell <dbl>, min_yaw_dumbbell <dbl>,
## # amplitude_roll_dumbbell <dbl>, amplitude_pitch_dumbbell <dbl>,
## # amplitude_yaw_dumbbell <dbl>, total_accel_dumbbell <dbl>,
## # var_accel_dumbbell <dbl>, avg_roll_dumbbell <dbl>,
## # stddev_roll_dumbbell <dbl>, ...
In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Those variables occupy columns #8 to #159, while column #160, `classe`, is the outcome on which we build the prediction model. The first seven columns are identifiers and timestamps unrelated to our prediction, so we will exclude them from the data.
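As a quick check, the first seven columns are the row index, user name, timestamps, and window indicators visible in the output above; a one-line sketch to list them:
names(pml_training)[1:7]
We select the accelerometer-related columns by name pattern: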
accelerometers <- grep(pattern = "_belt|_arm|_dumbbell|_forearm", names(pml_training))
length(accelerometers)
## [1] 152
We then add column #160 (`classe`) to the new data frame:
data <- pml_training[, c(accelerometers, 160)]
dim(data)
## [1] 19622 153
Next, we check whether the dataset has missing values:
na_count <- colSums(is.na(data))
head(sort(na_count, decreasing = TRUE), n = 6)
## kurtosis_roll_arm skewness_roll_arm kurtosis_roll_dumbbell
## 19294 19293 19221
## max_yaw_dumbbell min_yaw_dumbbell amplitude_yaw_dumbbell
## 19221 19221 19221
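These top counts show that some columns are missing in more than 19000 of the 19622 rows, so they carry almost no information. A complementary filter is caret's nearZeroVar, which flags near-constant predictors; a minimal sketch (not applied in this analysis):
nzv <- nearZeroVar(data)  # indices of near-zero-variance columns, if any
Here we simply drop every column with more than 19000 missing values: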
omitColumns <- which(na_count > 19000)
data <- data[, -omitColumns]
dim(data)
## [1] 19622 53
So the data is reduced from 153 columns to 53. Next, we check the type of each column:
table(sapply(data[1,], class))
##
## character numeric
## 1 52
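The single character column is the outcome `classe`, which read_csv imports as character rather than factor. Converting it up front can be convenient for caret; a minimal sketch (the analysis below instead wraps predictions in table() when building confusion matrices):
data$classe <- factor(data$classe)  # optional: make the outcome a factor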
We have training and testing data but no validation data, so we split the cleaned training set into a pure training set (80%) and a validation set (20%). We will use the validation set to estimate out-of-sample performance in later steps.
inTrain <- createDataPartition(y = data$classe, p = 0.8, list = FALSE)
training <- data[inTrain,]
validation <- data[-inTrain,]
dim(training)
## [1] 15699 53
dim(validation)
## [1] 3923 53
dim(pml_testing)
## [1] 20 160
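Because createDataPartition performs stratified sampling on `classe`, the class proportions should be nearly identical across the splits; a quick sketch to verify (output not shown):
round(prop.table(table(training$classe)), 3)
round(prop.table(table(validation$classe)), 3)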
So we have all the data we need to build the prediction models.
We first fit a predictive model for activity recognition using the decision tree algorithm.
modelTree <- rpart(classe ~., data = training, method = 'class')
prp(modelTree)
# We can draw a prettier tree with the `rattle` package
fancyRpartPlot(modelTree)
Now, we estimate the performance of the model on the validation data set.
predictTree <- predict(modelTree, newdata = validation, type = 'class')
accuracy.tree <- confusionMatrix(table(predictTree, validation$classe))
accuracy.tree
## Confusion Matrix and Statistics
##
##
## predictTree A B C D E
## A 899 107 15 27 8
## B 31 428 60 54 56
## C 27 67 503 102 95
## D 140 137 92 426 100
## E 19 20 14 34 462
##
## Overall Statistics
##
## Accuracy : 0.6928
## 95% CI : (0.6781, 0.7073)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6131
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8056 0.5639 0.7354 0.6625 0.6408
## Specificity 0.9441 0.9365 0.9102 0.8570 0.9728
## Pos Pred Value 0.8513 0.6804 0.6335 0.4760 0.8415
## Neg Pred Value 0.9243 0.8995 0.9422 0.9283 0.9232
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2292 0.1091 0.1282 0.1086 0.1178
## Detection Prevalence 0.2692 0.1603 0.2024 0.2281 0.1399
## Balanced Accuracy 0.8748 0.7502 0.8228 0.7598 0.8068
We see that the accuracy is low (about 69%). Next we will try a random forest: this algorithm automatically selects important variables and is generally robust to correlated covariates and outliers.
# Plot the decision tree confusion matrix
plot(accuracy.tree$table, col = accuracy.tree$byClass, main = paste('Decision Tree - Accuracy = ', round(accuracy.tree$overall['Accuracy'],4)))
# Random forest with 5-fold cross-validation and 250 trees
modelRf <- train(classe ~., data = training, method = 'rf', trControl = trainControl(method = 'cv', 5), ntree = 250)
modelRf$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 250, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 250
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.6%
## Confusion matrix:
## A B C D E class.error
## A 4461 3 0 0 0 0.000672043
## B 15 3016 6 0 1 0.007241606
## C 0 22 2712 4 0 0.009495982
## D 0 0 37 2534 2 0.015157404
## E 0 0 0 4 2882 0.001386001
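The out-of-bag error estimate is only 0.6%, consistent with the cross-validated fit. Random forests also rank the predictors they rely on; a minimal sketch to inspect variable importance with caret (output not shown):
varImp(modelRf)  # top accelerometer variables by importance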
# Apply the model to the validation data
predict.rf <- predict(modelRf, newdata = validation)
rf <- confusionMatrix(table(predict.rf, validation$classe))
rf
## Confusion Matrix and Statistics
##
##
## predict.rf A B C D E
## A 1116 2 0 0 0
## B 0 757 5 0 0
## C 0 0 679 7 0
## D 0 0 0 636 1
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9962
## 95% CI : (0.9937, 0.9979)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9952
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9974 0.9927 0.9891 0.9986
## Specificity 0.9993 0.9984 0.9978 0.9997 1.0000
## Pos Pred Value 0.9982 0.9934 0.9898 0.9984 1.0000
## Neg Pred Value 1.0000 0.9994 0.9985 0.9979 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1930 0.1731 0.1621 0.1835
## Detection Prevalence 0.2850 0.1942 0.1749 0.1624 0.1835
## Balanced Accuracy 0.9996 0.9979 0.9953 0.9944 0.9993
We see that the accuracy is much higher than that of the decision tree model, so we will use this model for prediction.
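The expected out-of-sample error is therefore about 1 - 0.9962 = 0.0038; a quick sketch to compute it from the confusion matrix object:
unname(1 - rf$overall['Accuracy'])  # expected out-of-sample error, ~0.0038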
# Plot the matrix
plot(rf$table, col = rf$byClass, main = paste("Random Forest - Accuracy =", round(rf$overall['Accuracy'], 4)))
We apply `modelRf` to the testing data to predict the 20 test cases:
predict.test <- predict(modelRf, newdata = pml_testing)
predict.test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
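To save each of the 20 predictions to its own file for submission, a minimal sketch; the one-file-per-problem naming scheme is an assumption, not part of the original analysis:
# Write one text file per test case (hypothetical naming scheme)
for (i in seq_along(predict.test)) {
  writeLines(as.character(predict.test[i]), paste0("problem_id_", i, ".txt"))
}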