For this project, we are going to look at data from accelerometers worn by six participants while they performed exercises. We will use the training data to build a model that classifies the manner in which they exercised, which is recorded in the “classe” column of the training set.
There are a few libraries that we are going to use in this project, so let’s load them.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
Now let’s load the training and testing datasets and take a look at some of the data. There are 160 columns in the dataset, so we will limit the number of columns displayed.
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
head(training[, 1:8], 5)
## X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## 1 1 carlitos 1323084231 788290 05/12/2011 11:23
## 2 2 carlitos 1323084231 808298 05/12/2011 11:23
## 3 3 carlitos 1323084231 820366 05/12/2011 11:23
## 4 4 carlitos 1323084232 120339 05/12/2011 11:23
## 5 5 carlitos 1323084232 196328 05/12/2011 11:23
## new_window num_window roll_belt
## 1 no 11 1.41
## 2 no 11 1.41
## 3 no 11 1.42
## 4 no 12 1.48
## 5 no 12 1.48
There are a few things about the data that we need to clean up before we can start applying machine learning algorithms. First, we can remove the first column, which is just a row number added when the CSV files were written. Second, the last column of the testing set is named “problem_id” rather than “classe”, so we will rename it to match the training set. We also need to deal with missing and NA values: several columns of the testing set are entirely NA, so we will remove those columns from the testing set and drop the same columns from the training set. Finally, some columns hold date, time, and window information that won’t help the prediction model, so let’s remove those as well.
# Remove the first column, which is just a row index and not a real feature
training <- training[, -1]
testing <- testing[, -1]
# Rename the last column of the testing set to match the training set
testing <- rename(testing, classe = "problem_id")
# Drop columns that are entirely NA in the testing set
testing <- testing[colSums(!is.na(testing)) > 0]
# Keep only those same columns in the training set
training <- select(training, names(testing))
# Remove the timestamp and window columns
training <- select(training, -c(raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window))
testing <- select(testing, -c(raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window))
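Before partitioning, two quick housekeeping steps are worth adding. This is a minimal sketch, assuming the standard pml CSV files and R >= 4.0, where read.csv() no longer converts strings to factors:
# Sanity check: both datasets should now share exactly the same column names
stopifnot(identical(names(training), names(testing)))
# Make the outcome an explicit factor so caret treats this as classification
# (on R >= 4.0, read.csv() reads classe as character)
training$classe <- factor(training$classe)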
# Partition the data for training and testing.
trPart <- createDataPartition(training$classe, p = 0.75, list = FALSE)
trainPart <- training[trPart,]
testPart <- training[-trPart,]
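Note that createDataPartition() draws a random sample, so the exact split will vary between runs; calling set.seed() with a value of your choice before the partitioning step would make it reproducible (the results in this report were generated without a fixed seed). As a quick check, the class proportions in the two partitions should roughly match (output not shown):
# Compare class proportions across the two partitions
round(prop.table(table(trainPart$classe)), 3)
round(prop.table(table(testPart$classe)), 3)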
The first model we will build is a recursive partitioning (rpart) model with 10-fold cross-validation. We fit it on the training partition only, so the held-out partition stays unseen, and then check its accuracy there with a confusion matrix.
# Set the training control for 10-fold cross validation.
train_control <- trainControl(method = "cv", number = 10)
# Train the model
modelrpart <- train(classe ~ ., data = trainPart, method = "rpart", trControl = train_control)
# Check the confusion matrix of the model
confusionMatrix(predict(modelrpart, newdata = testPart), testPart$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1292 17 85 0 1
## B 383 331 235 0 0
## C 399 26 430 0 0
## D 373 143 288 0 0
## E 120 123 232 0 426
##
## Overall Statistics
##
## Accuracy : 0.5055
## 95% CI : (0.4914, 0.5196)
## No Information Rate : 0.5235
## P-Value [Acc > NIR] : 0.9943
##
## Kappa : 0.3533
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.5033 0.5172 0.33858 NA 0.99766
## Specificity 0.9559 0.8551 0.88305 0.8361 0.89390
## Pos Pred Value 0.9262 0.3488 0.50292 NA 0.47281
## Neg Pred Value 0.6366 0.9219 0.79254 NA 0.99975
## Prevalence 0.5235 0.1305 0.25897 0.0000 0.08707
## Detection Rate 0.2635 0.0675 0.08768 0.0000 0.08687
## Detection Prevalence 0.2845 0.1935 0.17435 0.1639 0.18373
## Balanced Accuracy 0.7296 0.6861 0.61082 NA 0.94578
The confusion matrix for this model shows a very poor result: only about 50% accuracy. That is not good enough to use for prediction, so let’s try some other models.
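One way to see why the tree struggles is that, per the matrix above, class D is never predicted at all. Printing the fitted tree shows which classes its splits can actually reach (output not shown):
# Inspect the final rpart tree stored inside the caret model object
print(modelrpart$finalModel)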
The next model we are going to try is a linear discriminant analysis (LDA) model, again with 10-fold cross-validation.
# Set the training control for 10-fold cross validation.
train_control <- trainControl(method = "cv", number = 10)
# Train the model
modellda <- train(classe ~ ., data = trainPart, method = "lda", trControl = train_control)
# Check the confusion matrix of the model
confusionMatrix(predict(modellda, newdata = testPart), testPart$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1169 35 99 91 1
## B 155 601 136 26 31
## C 89 90 554 109 13
## D 52 31 78 629 14
## E 28 135 62 63 613
##
## Overall Statistics
##
## Accuracy : 0.7272
## 95% CI : (0.7145, 0.7396)
## No Information Rate : 0.3044
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6543
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7830 0.6738 0.5963 0.6852 0.9122
## Specificity 0.9337 0.9133 0.9243 0.9561 0.9319
## Pos Pred Value 0.8380 0.6333 0.6480 0.7823 0.6804
## Neg Pred Value 0.9077 0.9264 0.9074 0.9295 0.9853
## Prevalence 0.3044 0.1819 0.1894 0.1872 0.1370
## Detection Rate 0.2384 0.1226 0.1130 0.1283 0.1250
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.8584 0.7935 0.7603 0.8206 0.9221
The accuracy of this model is about 73%: better than the rpart model, but still not great. Let’s see if we can do better with a third model.
The third model we will build is a random forest, this time with 4-fold cross-validation since random forests are considerably slower to train.
# Set the training control for 4-fold cross validation.
train_control <- trainControl(method = "cv", number = 4)
# Train the model
modelrf <- train(classe ~ ., data = trainPart, method = "rf", trControl = train_control)
# Check the confusion matrix of the model
confusionMatrix(predict(modelrf, newdata = testPart), testPart$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 0 949 0 0 0
## C 0 0 855 0 0
## D 0 0 0 804 0
## E 0 0 0 0 901
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9992, 1)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
The results of this model are excellent: it classifies the held-out partition essentially perfectly, so this is the model we will use on the testing dataset. We should not expect literally 100% accuracy on genuinely new data, but the out-of-sample error should still be very small.
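As a check on that expectation, caret stores the resampling results from training inside the fitted model object, and the cross-validated accuracy there is a fairer estimate of out-of-sample performance than the held-out partition alone (output not shown):
# Cross-validated accuracy for the tuning parameters, plus per-fold results
modelrf$results
modelrf$resample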
Now we’ll predict classe for the 20 cases in the testing set and add the predictions to that dataset.
testing$classe <- predict(modelrf, newdata = testing)
select(testing, c(user_name, classe))
## user_name classe
## 1 pedro B
## 2 jeremy A
## 3 jeremy B
## 4 adelmo A
## 5 eurico A
## 6 jeremy E
## 7 jeremy D
## 8 jeremy B
## 9 carlitos A
## 10 charles A
## 11 carlitos B
## 12 jeremy C
## 13 eurico B
## 14 jeremy A
## 15 jeremy E
## 16 eurico E
## 17 pedro A
## 18 carlitos B
## 19 pedro B
## 20 eurico B