The advent of small sensors has made it possible to collect large amounts of activity data relatively inexpensively. These devices are changing the way we drive, operate machinery, and take measurements. Quality control engineers often quantify how much a manufacturing process produces, but they rarely quantify how well it is produced.
In this project, data from accelerometers mounted on the belt, arm, and bell of a manufacturing mechanism operated by six expert machinists will be used to predict the quality of the parts they produced. The machinists were asked to create a part in five different ways: one correct and four incorrect. The variable we want to predict is classe, which has five levels: A, B, C, D, and E, with A being the highest quality and E the lowest. To predict these quality outcomes, I will fit several predictive models to the data set, compare their out-of-sample errors, and select the best model to make final predictions on twenty test cases. Let’s begin!
setwd("C:/Users/Kelli/Documents")
machineTraining <- read.csv("machine_training.csv")
machineTest <- read.csv("machine_testing.csv")
sum(is.na(machineTraining))
## [1] 1287472
sum(is.na(machineTest))
## [1] 2000
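To see where those NA values live, a quick per-column count is useful (a diagnostic sketch; output omitted):
#Count the NA values in each column to see whether the missingness is
#concentrated in specific columns or scattered throughout
naCounts <- colSums(is.na(machineTraining))
table(naCounts)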
Exploring the data this way shows that many variables consist almost entirely of missing or NA values; these will need to be removed before fitting a model. In addition, the first seven columns will be removed, as they contain descriptive information unrelated to the production measurements.
training <- machineTraining[,colSums(is.na(machineTraining))==0 & colSums(machineTraining=="")==0]
testing <- machineTest[,colSums(is.na(machineTest)) == 0 & colSums(machineTest=="")==0]
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
#Double check for any missing data
sum(is.na(training))
## [1] 0
sum(is.na(testing))
## [1] 0
#Check new dimensions to ensure data integrity
dim(training)
## [1] 19622 53
dim(testing)
## [1] 20 53
Now both data sets contain the same 52 predictor variables; the 53rd column in the training set is the outcome, classe.
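As a quick consistency check, the following sketch (output omitted) confirms which columns, if any, differ between the two cleaned data sets:
#Columns present in one cleaned data set but not the other; only the
#final, non-predictor column should appear
setdiff(names(training), names(testing))
setdiff(names(testing), names(training))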
Next, I split the training data into two subsets for cross-validation: one for training and one for testing. Observations in the test subset are held out of model fitting and used only for prediction, which provides an unbiased estimate of the error rate the model would achieve in production. Holding out a subset of the data this way is essential for detecting overfitting and honestly measuring the model’s performance.
library(caret)
#Set seed for reproducibility
set.seed(99)
#Create an 80/20 split of the data
machineTrainingIndex <- createDataPartition(training$classe, p=0.8, list=FALSE, times=1)
#Training subset that contains 80% of the observations
machineSubTrain <- training[machineTrainingIndex,]
#Testing subset that contains 20% of the observations
machineSubTest <- training[-machineTrainingIndex,]
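Because createDataPartition samples within each level of classe, the split is stratified. A quick check (output omitted) confirms the proportions:
#Fraction of observations in the training subset (should be ~0.8)
NROW(machineSubTrain)/NROW(training)
#Class distributions should be nearly identical in the two subsets
round(prop.table(table(machineSubTrain$classe)), 3)
round(prop.table(table(machineSubTest$classe)), 3)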
The first model I chose to evaluate was a decision tree, since classe, the variable we want to predict, is categorical. Decision trees classify observations into categories through a sequence of splits on the explanatory variables.
library(rpart)
library(rattle)
classeTree <- rpart(classe ~ ., data=machineSubTrain, method="class")
fancyRpartPlot(classeTree, main="Classification Tree", sub="")
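Beyond the plot, rpart’s complexity parameter table (inspection only; output omitted) shows how the tree’s internally cross-validated error changes with each split, which can guide pruning:
#Complexity parameter table: relative and cross-validated error
#at each size of the tree
printcp(classeTree)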
To evaluate the decision tree’s performance, I made predictions on the testing subset. This is data the model has never seen, so it gives a good estimate of how the model will perform on new data.
#Predict classe for the held-out subset
predTree <- predict(classeTree, newdata=machineSubTest, type="class")
predTree <- as.data.frame(predTree)
#Attach the predictions to the test subset for inspection
subTestTree <- cbind(machineSubTest, predTree)
#Calculate accuracy and out-of-sample error
accuracyTree <- sum(predTree$predTree==machineSubTest$classe)/NROW(machineSubTest)
oseTree <- (1-accuracyTree)*100
sprintf("The accuracy is %s%% and the out-of-sample error is %s%%.", round(accuracyTree*100, digits = 2), round(oseTree, digits = 2))
## [1] "The accuracy is 75.61% and the out-of-sample error is 24.39%."
To improve accuracy and reduce the out-of-sample error, I next fit a random forest. A random forest is an ensemble of many decision trees, each grown on a different bootstrap sample and considering a random subset of predictors at each split, which makes it well suited to nonlinear classification problems.
library(randomForest)
library(e1071)
#Use 10-fold cross-validation (the default fold count) repeated 3 times
cvCtrl <- trainControl(method="repeatedcv", repeats = 3)
classeRf <- train(classe ~ ., data=machineSubTrain, method="rf", trControl=cvCtrl)
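Before evaluating on the held-out subset, the fitted object can be inspected to see how the repeated cross-validation accuracy varied with mtry, the number of predictors sampled at each split (output omitted):
#Resampling profile: cross-validated accuracy for each candidate mtry
print(classeRf)
#Plot accuracy against the number of randomly selected predictors
plot(classeRf)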
Using the same process as with the decision tree, I evaluated the random forest model on the testing subset.
#Predict classe for the held-out subset
predRf <- predict(classeRf, newdata=machineSubTest, type="raw")
predRf <- as.data.frame(predRf)
#Attach the predictions to the test subset for inspection
subTestRf <- cbind(machineSubTest, predRf)
#Calculate the accuracy and out-of-sample error
accuracyRf <- sum(predRf$predRf==machineSubTest$classe)/NROW(machineSubTest)
oseRf <- (1-accuracyRf)*100
sprintf("The accuracy is %s%% and the out-of-sample error is %s%%.", round(accuracyRf*100, digits = 2), round(oseRf, digits = 2))
## [1] "The accuracy is 99.01% and the out-of-sample error is 0.99%."
A nice feature of random forest models is that they measure the importance of each predictor, shown in the plot below. Variables plotted farther to the right have a higher mean decrease in Gini impurity: splitting on them does more to separate the classes, so they contribute more to the model’s predictions.
varImpRf <- varImp(classeRf)
plot(varImpRf, main = "Importance of Top 25 Predictors", top = 25)
To better understand the performance of both models, I constructed a confusion matrix for each one. A confusion matrix tabulates predictions against the true classes, so correct predictions appear on the diagonal and each type of error appears off it.
#Recompute the predictions as factors for confusionMatrix
predTree <- predict(classeTree, newdata=machineSubTest, type="class")
predRf <- predict(classeRf, newdata=machineSubTest, type="raw")
#Decision Tree Confusion Matrix
confusionMatrix(factor(predTree), factor(machineSubTest$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1010 135 10 28 24
## B 39 468 57 45 61
## C 32 60 554 106 86
## D 15 72 40 421 37
## E 20 24 23 43 513
##
## Overall Statistics
##
## Accuracy : 0.7561
## 95% CI : (0.7423, 0.7694)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6906
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9050 0.6166 0.8099 0.6547 0.7115
## Specificity 0.9298 0.9362 0.9123 0.9500 0.9656
## Pos Pred Value 0.8368 0.6985 0.6611 0.7197 0.8234
## Neg Pred Value 0.9610 0.9105 0.9579 0.9335 0.9370
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2575 0.1193 0.1412 0.1073 0.1308
## Detection Prevalence 0.3077 0.1708 0.2136 0.1491 0.1588
## Balanced Accuracy 0.9174 0.7764 0.8611 0.8024 0.8386
#Random Forest Confusion Matrix
confusionMatrix(factor(predRf), factor(machineSubTest$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1113 9 0 0 0
## B 2 749 3 0 1
## C 1 1 677 13 4
## D 0 0 4 630 1
## E 0 0 0 0 715
##
## Overall Statistics
##
## Accuracy : 0.9901
## 95% CI : (0.9864, 0.9929)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9874
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9973 0.9868 0.9898 0.9798 0.9917
## Specificity 0.9968 0.9981 0.9941 0.9985 1.0000
## Pos Pred Value 0.9920 0.9921 0.9727 0.9921 1.0000
## Neg Pred Value 0.9989 0.9968 0.9978 0.9960 0.9981
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2837 0.1909 0.1726 0.1606 0.1823
## Detection Prevalence 0.2860 0.1925 0.1774 0.1619 0.1823
## Balanced Accuracy 0.9971 0.9925 0.9920 0.9891 0.9958
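For a compact side-by-side comparison, the overall accuracies can also be pulled directly from the confusionMatrix objects (a convenience sketch; output omitted):
cmTree <- confusionMatrix(factor(predTree), factor(machineSubTest$classe))
cmRf <- confusionMatrix(factor(predRf), factor(machineSubTest$classe))
#Overall accuracy of each model, side by side
data.frame(model = c("Decision Tree", "Random Forest"),
           accuracy = c(cmTree$overall["Accuracy"], cmRf$overall["Accuracy"]))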
After comparing the results, I chose the random forest model to make predictions on the twenty test cases, as its out-of-sample error was far lower. Below are the final predictions made by the model, which correctly predicted 100% of the product quality outcomes!
finalPredictions <- predict(classeRf, testing, type="raw")
print(finalPredictions)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E