For this analysis of exercise data, we first load the required libraries and read in the data. We also set a seed so that the analysis is reproducible. Next, we clean the data.
## Import libraries
library(knitr)
library(readr)
library(caret)
library(rpart)
library(randomForest)
library(ggplot2)
library(cowplot)
## Import data
train.df <- read_csv("~/Desktop/pml-training.csv", na = c("", "NA", "#DIV/0!"))
## Warning: Missing column names filled in: 'X1' [1]
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 3 parsing failures.
## # A tibble: 3 x 5
##     row col               expected               actual file
##   <int> <chr>             <chr>                  <chr>  <chr>
## 1  5373 magnet_dumbbell_z no trailing characters .6     '~/Desktop/pml-tr…
## 2  5373 magnet_forearm_y  no trailing characters .123   '~/Desktop/pml-tr…
## 3  5373 magnet_forearm_z  no trailing characters .0917  '~/Desktop/pml-tr…
test.df <- read_csv("~/Desktop/pml-testing.csv", na = c("", "NA", "#DIV/0!"))
## Warning: Missing column names filled in: 'X1' [1]
## Set seed
set.seed(639)
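To illustrate why the extra missing-value marker matters when reading the data, here is a minimal sketch on a toy CSV string. It uses base R's `read.csv` with `na.strings` as a stand-in for the `na` argument of `read_csv`; the column names and values are hypothetical.

```r
# Toy CSV text standing in for the real file (hypothetical values).
csv_text <- "a,b\n1,#DIV/0!\nNA,2\n,3\n"

# Treat empty fields, "NA", and "#DIV/0!" as missing, mirroring the call above.
toy <- read.csv(text = csv_text, na.strings = c("", "NA", "#DIV/0!"))

# Without the extra marker, column b is no longer read as numeric.
toy_raw <- read.csv(text = csv_text)

str(toy)
str(toy_raw)
```

Declaring the spreadsheet error string as missing keeps the affected sensor columns numeric, so they can be filtered by their share of NA values later.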
First, we remove variables that consist mostly of NA values: a column is dropped if 60% or more of its values are NA. We also remove the index column and an uninformative timestamp column from both the training and testing datasets. Lastly, we split the training dataset into training (70%) and validation (30%) subsets so that each model can be evaluated on held-out data.
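The column filter can be sketched on a toy data frame. The data and the names `toy`, `na_frac`, `keep`, and `toy_clean` are hypothetical; only the 60% threshold comes from the analysis.

```r
# Hypothetical toy data: column y is 75% NA, column z is 25% NA.
toy <- data.frame(
  x = 1:4,
  y = c(NA, NA, NA, 4),
  z = c(1, NA, 3, 4)
)

# Fraction of missing values in each column.
na_frac <- colMeans(is.na(toy))

# Keep only the columns whose NA fraction is below the 60% threshold.
keep <- na_frac < 0.6
toy_clean <- toy[, keep, drop = FALSE]

names(toy_clean)
```

Computing the mask once on the training data and reusing it for the testing data keeps the two datasets aligned column for column.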
## Reformat classe variable as factor variable
train.df$classe <- as.factor(train.df$classe)
## Reformat predictor variables to factor for Random Forest Classification
train.df$user_name <- as.factor(train.df$user_name)
train.df$cvtd_timestamp <- as.factor(train.df$cvtd_timestamp)
train.df$new_window <- as.factor(train.df$new_window)
test.df$user_name <- as.factor(test.df$user_name)
test.df$cvtd_timestamp <- as.factor(test.df$cvtd_timestamp)
test.df$new_window <- as.factor(test.df$new_window)
## Remove columns that are mostly NA values
na.cols <- colSums(is.na(train.df)) < (nrow(train.df) * 0.6) # TRUE for columns to keep (fewer than 60% NA)
train.df <- train.df[,na.cols]
test.df <- test.df[,na.cols] # remove the same columns from test that were removed from train
## Remove columns of indices and useless timestamps
train.df <- train.df[,-c(1,5)]
test.df <- test.df[,-c(1,5,60)]
## Ensure the same set of levels for factor variables
test.df$new_window <- factor(test.df$new_window, levels = levels(train.df$new_window)) # matches values by label instead of renaming levels by position
## Ensure data frame format
train.df <- data.frame(train.df)
test.df <- data.frame(test.df)
## Split the training dataset
inTrain <- createDataPartition(train.df$classe, p = 0.7, list = FALSE)
trainCV <- train.df[inTrain,]
validateCV <- train.df[-inTrain,]
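For a factor outcome, `createDataPartition` samples within each class so that the split preserves the class proportions. The base R sketch below imitates that idea on hypothetical labels; it is an approximation for illustration, not caret's implementation.

```r
set.seed(639)  # same seed value as the analysis; any fixed value works

# Hypothetical labels with unequal class sizes.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class.
idx_by_class <- split(seq_along(classe), classe)
sampled <- lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
})
in_train <- sort(unlist(sampled, use.names = FALSE))

train_y <- classe[in_train]
valid_y <- classe[-in_train]

# Class proportions are the same in both subsets.
prop.table(table(train_y))
prop.table(table(valid_y))
```

Stratifying the split this way means that a rare class cannot end up missing from the validation subset by chance.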
In the plots below, the total acceleration measured at the arm, belt, forearm, and dumbbell overlaps heavily across classes, and the axis-level measurements (x, y, and z) overlap heavily as well. The dataset also contains a very large number of variables, which makes examining each variable's relationship with the “classe” variable time-consuming and difficult. For prediction purposes, we can run a machine learning algorithm at this stage of the analysis and return to examining individual predictors later.
## Plot arm acceleration
plot1 <- qplot(accel_arm_x, accel_arm_y, col = classe, data = trainCV)
## Plot belt acceleration
plot2 <- qplot(accel_belt_x, accel_belt_y, col = classe, data = trainCV)
## Plot forearm acceleration
plot3 <- qplot(accel_forearm_x, accel_forearm_y, col = classe, data = trainCV)
## Plot dumbbell acceleration
plot4 <- qplot(accel_dumbbell_x, accel_dumbbell_y, col = classe, data = trainCV)
## Plot total acceleration (dumbbell vs. arm)
plot5 <- qplot(total_accel_dumbbell, total_accel_arm, col = classe, data = trainCV)
## Plot total acceleration (belt vs. forearm)
plot6 <- qplot(total_accel_belt, total_accel_forearm, col = classe, data = trainCV)
## Arrange plots
plot_grid(plot1, plot2, plot3, plot4, plot5, plot6, ncol = 2)
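Recent ggplot2 releases mark `qplot()` as deprecated, so the same kind of scatter plot can also be built with `ggplot()`. The sketch below uses randomly generated stand-in data (not the real `trainCV`) and adds transparency, which is our own suggestion for making overlapping classes easier to see.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for trainCV: two acceleration columns and a class label.
toy <- data.frame(
  accel_arm_x = rnorm(200),
  accel_arm_y = rnorm(200),
  classe = factor(sample(c("A", "B", "C", "D", "E"), 200, replace = TRUE))
)

# Semi-transparent points show where the classes overlap.
p <- ggplot(toy, aes(x = accel_arm_x, y = accel_arm_y, colour = classe)) +
  geom_point(alpha = 0.4) +
  labs(x = "accel_arm_x", y = "accel_arm_y", colour = "classe")
```

Calling `print(p)` draws the plot, and the resulting object can be passed to `plot_grid()` in the same way as the `qplot()` objects.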
First, we use a classification tree to predict the “classe” variable. After fitting the tree to the training subset, we assess its performance on the validation subset with a confusion matrix. The tree classifies about 85% of the validation observations correctly, but we may be able to do better.
## Fit a classification tree and predict classe for the validation dataset
t.model <- rpart(classe ~ ., data = trainCV, method = "class")
pred <- predict(t.model, validateCV, type = "class")
## Produce a confusion matrix
confusionMatrix(pred, validateCV$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1548   27    2    0   14
##          B   93  946   97   83   85
##          C    1   66  834   32   11
##          D   12   46   64  774   85
##          E   20   54   29   75  887
##
## Overall Statistics
##
##                Accuracy : 0.8477
##                  95% CI : (0.8383, 0.8568)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.8077
##  Mcnemar's Test P-Value : 1.704e-15
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9247   0.8306   0.8129   0.8029   0.8198
## Specificity            0.9898   0.9246   0.9774   0.9579   0.9629
## Pos Pred Value         0.9730   0.7255   0.8835   0.7890   0.8329
## Neg Pred Value         0.9707   0.9579   0.9611   0.9613   0.9595
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2630   0.1607   0.1417   0.1315   0.1507
## Detection Prevalence   0.2703   0.2216   0.1604   0.1667   0.1810
## Balanced Accuracy      0.9573   0.8776   0.8951   0.8804   0.8914
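The headline statistics can be recomputed by hand from the confusion matrix of the classification tree, which helps show what they measure. This is a minimal base R sketch; the counts are copied from the output above, and the variable names are our own.

```r
# Confusion matrix of the classification tree (rows = predictions, columns = reference).
cm <- matrix(
  c(1548,  27,   2,   0,  14,
      93, 946,  97,  83,  85,
       1,  66, 834,  32,  11,
      12,  46,  64, 774,  85,
      20,  54,  29,  75, 887),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(cm)) / n

# Sensitivity: correct predictions divided by the reference total of each class.
sensitivity <- diag(cm) / colSums(cm)

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)     # 0.8477
round(kappa_hat, 4)    # 0.8077
round(sensitivity, 4)
```

Class B has the lowest positive predictive value because many observations from other classes are predicted as B (the second row of the matrix).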
## Impute missing values, fit a random forest, and predict classe for the validation dataset
rf.data <- rfImpute(classe ~ ., data = trainCV)
## ntree OOB 1 2 3 4 5
## 300: 0.12% 0.00% 0.08% 0.21% 0.18% 0.24%
## ntree OOB 1 2 3 4 5
## 300: 0.13% 0.00% 0.11% 0.29% 0.22% 0.12%
## ntree OOB 1 2 3 4 5
## 300: 0.14% 0.00% 0.11% 0.25% 0.22% 0.20%
## ntree OOB 1 2 3 4 5
## 300: 0.14% 0.00% 0.08% 0.33% 0.18% 0.20%
## ntree OOB 1 2 3 4 5
## 300: 0.15% 0.00% 0.15% 0.21% 0.13% 0.32%
rf.model <- randomForest(classe ~ ., data = rf.data)
pred <- predict(rf.model, validateCV, type = "class")
## Produce a confusion matrix
confusionMatrix(pred, validateCV$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    0 1139    2    0    0
##          C    0    0 1024    0    0
##          D    0    0    0  964    2
##          E    0    0    0    0 1080
##
## Overall Statistics
##
##                Accuracy : 0.9993
##                  95% CI : (0.9983, 0.9998)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9991
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   0.9981   1.0000   0.9982
## Specificity            1.0000   0.9996   1.0000   0.9996   1.0000
## Pos Pred Value         1.0000   0.9982   1.0000   0.9979   1.0000
## Neg Pred Value         1.0000   1.0000   0.9996   1.0000   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1935   0.1740   0.1638   0.1835
## Detection Prevalence   0.2845   0.1939   0.1740   0.1641   0.1835
## Balanced Accuracy      1.0000   0.9998   0.9990   0.9998   0.9991
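The validation accuracy of the random forest and its uncertainty can be checked from the counts in the confusion matrix. The sketch below assumes that the reported 95% CI is an exact binomial interval, which is our understanding of what `confusionMatrix` reports; the variable names are our own.

```r
# Counts taken from the random forest confusion matrix above.
correct <- 1674 + 1139 + 1024 + 964 + 1080
total   <- correct + 2 + 2   # four validation observations were misclassified

accuracy   <- correct / total
error_rate <- 1 - accuracy   # estimated out-of-sample error

# Exact (Clopper-Pearson) binomial confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int

round(accuracy, 4)
round(error_rate, 4)
round(ci, 4)
```

With only 4 errors in 5885 validation observations, the estimated out-of-sample error is below 0.1%.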
## Plot data that were incorrectly predicted
validateCV$isGoodPred <- validateCV$classe == pred
qplot(accel_arm_x, accel_arm_y, col = isGoodPred, data = validateCV)
## Predict "classe" from the testing dataset
predict(rf.model, test.df)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
In this analysis, we first explored the relationships between a few predictors and the response variable. Because the dataset contains so many predictors, we then applied machine learning algorithms to capture the relationship between the predictors and the response more quickly and comprehensively. Specifically, we compared a classification tree with a random forest classifier for predicting the “classe” variable and evaluated the accuracy of each model on the validation dataset. The validation accuracies are summarized below.
| Method | Accuracy |
|---|---|
| Classification Tree | 0.8477 |
| Random Forest | 0.9993 |
In the end, the random forest classifier had the higher accuracy of the two models: about 99.9% on the validation data, compared with about 85% for the classification tree. Therefore, we chose the random forest classifier as our final predictive model. Its predictions of the “classe” variable for the 20 test cases are provided in the table below.
| Test Set | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prediction | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |