The advent of small sensors has made it possible to collect large amounts of activity data relatively inexpensively. These devices are changing the way we drive, operate machinery, and take measurements. Quality control engineers often quantify how much a manufacturing process produces, but they rarely quantify how well it is produced.
In this project, data from accelerometers mounted on the belt, arm, and bell of a manufacturing mechanism operated by six expert machinists will be used to predict the quality of the parts they produced. The machinists were asked to create a part in five different ways: one correct and four incorrect. The variable we want to predict is classe, which has five levels: A, B, C, D, and E, with A being the highest quality and E the lowest. To predict these quality outcomes, I will fit several predictive models to the data set, compare their out-of-sample errors, and select the best model to make final predictions on twenty test cases. Let’s begin!
setwd("C:/Users/Kelli/Documents")
machineTraining <- read.csv("machine_training.csv")
machineTest <- read.csv("machine_testing.csv")
sum(is.na(machineTraining))
## [1] 1287472
sum(is.na(machineTest))
## [1] 2000
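To see where those NA values live, a quick per-column count is useful (a diagnostic sketch; output omitted):
#Count the NA values in each column to see whether the missingness is
#concentrated in specific columns or scattered throughout
naCounts <- colSums(is.na(machineTraining))
table(naCounts)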
Exploring the data this way shows that many variables consist almost entirely of missing or NA values; these will need to be removed before fitting a model. In addition, the first seven columns will be removed, as they contain descriptive information unrelated to the production measurements.
training <- machineTraining[,colSums(is.na(machineTraining))==0 & colSums(machineTraining=="")==0]
testing <- machineTest[,colSums(is.na(machineTest)) == 0 & colSums(machineTest=="")==0]
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
#Double check for any missing data
sum(is.na(training))
## [1] 0
sum(is.na(testing))
## [1] 0
#Check new dimensions to ensure data integrity
dim(training)
## [1] 19622 53
dim(testing)
## [1] 20 53
Now both data sets contain the same 52 predictor variables; the 53rd column in the training set is the outcome, classe.
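As a quick consistency check, the following sketch (output omitted) confirms which columns, if any, differ between the two cleaned data sets:
#Columns present in one cleaned data set but not the other; only the
#final, non-predictor column should appear
setdiff(names(training), names(testing))
setdiff(names(testing), names(training))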
Next, I split the training data into two subsets for cross-validation: one for training and one for testing. Observations in the test subset are held out of model fitting and used only for prediction, which provides an unbiased estimate of the error rate the model would achieve in production. Holding out a subset of the data this way is essential for detecting overfitting and honestly measuring the model’s performance.
library(caret)
#Set seed for reproducibility
set.seed(99)
#Create an 80/20 split of the data
machineTrainingIndex <- createDataPartition(training$classe, p=0.8, list=FALSE, times=1)
#Training subset that contains 80% of the observations
machineSubTrain <- training[machineTrainingIndex,]
#Testing subset that contains 20% of the observations
machineSubTest <- training[-machineTrainingIndex,]
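Because createDataPartition samples within each level of classe, the split is stratified. A quick check (output omitted) confirms the proportions:
#Fraction of observations in the training subset (should be ~0.8)
NROW(machineSubTrain)/NROW(training)
#Class distributions should be nearly identical in the two subsets
round(prop.table(table(machineSubTrain$classe)), 3)
round(prop.table(table(machineSubTest$classe)), 3)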
The first model I chose to evaluate was a decision tree, since classe, the variable we want to predict, is categorical. Decision trees classify observations into categories through a sequence of splits on the explanatory variables.
library(rpart)
library(rattle)
classeTree <- rpart(classe ~ ., data=machineSubTrain, method="class")
fancyRpartPlot(classeTree, main="Classification Tree", sub="")
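Beyond the plot, rpart’s complexity parameter table (inspection only; output omitted) shows how the tree’s internally cross-validated error changes with each split, which can guide pruning:
#Complexity parameter table: relative and cross-validated error
#at each size of the tree
printcp(classeTree)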
To evaluate the decision tree’s performance, I made predictions on the testing subset. This is data the model has never seen, so it gives a good estimate of how the model will perform on new data.
#Predict classe for the held-out subset
predTree <- predict(classeTree, newdata=machineSubTest, type="class")
predTree <- as.data.frame(predTree)
#Attach the predictions to the test subset for inspection
subTestTree <- cbind(machineSubTest, predTree)
#Calculate accuracy and out-of-sample error
accuracyTree <- sum(predTree$predTree==machineSubTest$classe)/NROW(machineSubTest)
oseTree <- (1-accuracyTree)*100
sprintf("The accuracy is %s%% and the out-of-sample error is %s%%.", round(accuracyTree*100, digits = 2), round(oseTree, digits = 2))
## [1] "The accuracy is 75.61% and the out-of-sample error is 24.39%."
To improve accuracy and reduce the out-of-sample error, I next fit a random forest. A random forest is an ensemble of many decision trees, each grown on a different bootstrap sample and considering a random subset of predictors at each split, which makes it well suited to nonlinear classification problems.
library(randomForest)
library(e1071)
#Use 10-fold cross-validation (the default fold count) repeated 3 times
cvCtrl <- trainControl(method="repeatedcv", repeats = 3)
classeRf <- train(classe ~ ., data=machineSubTrain, method="rf", trControl=cvCtrl)
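Before evaluating on the held-out subset, the fitted object can be inspected to see how the repeated cross-validation accuracy varied with mtry, the number of predictors sampled at each split (output omitted):
#Resampling profile: cross-validated accuracy for each candidate mtry
print(classeRf)
#Plot accuracy against the number of randomly selected predictors
plot(classeRf)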
Using the same process as with the decision tree, I evaluated the random forest model on the testing subset.
#Predict classe for the held-out subset
predRf <- predict(classeRf, newdata=machineSubTest, type="raw")
predRf <- as.data.frame(predRf)
#Attach the predictions to the test subset for inspection
subTestRf <- cbind(machineSubTest, predRf)
#Calculate the accuracy and out-of-sample error
accuracyRf <- sum(predRf$predRf==machineSubTest$classe)/NROW(machineSubTest)
oseRf <- (1-accuracyRf)*100
sprintf("The accuracy is %s%% and the out-of-sample error is %s%%.", round(accuracyRf*100, digits = 2), round(oseRf, digits = 2))
## [1] "The accuracy is 99.01% and the out-of-sample error is 0.99%."
A nice feature of random forest models is that they measure the importance of each predictor, shown in the plot below. Variables plotted farther to the right have a higher mean decrease in Gini impurity: splitting on them does more to separate the classes, so they contribute more to the model’s predictions.
varImpRf <- varImp(classeRf)
plot(varImpRf, main = "Importance of Top 25 Predictors", top = 25)
To better understand the performance of both models, I constructed a confusion matrix for each one. A confusion matrix tabulates predictions against the true classes, so correct predictions appear on the diagonal and each type of error appears off it.
#Recompute the predictions as factors for confusionMatrix
predTree <- predict(classeTree, newdata=machineSubTest, type="class")
predRf <- predict(classeRf, newdata=machineSubTest, type="raw")
#Decision Tree Confusion Matrix
confusionMatrix(factor(predTree), factor(machineSubTest$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1010 135 10 28 24
## B 39 468 57 45 61
## C 32 60 554 106 86
## D 15 72 40 421 37
## E 20 24 23 43 513
##
## Overall Statistics
##
## Accuracy : 0.7561
## 95% CI : (0.7423, 0.7694)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6906
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9050 0.6166 0.8099 0.6547 0.7115
## Specificity 0.9298 0.9362 0.9123 0.9500 0.9656
## Pos Pred Value 0.8368 0.6985 0.6611 0.7197 0.8234
## Neg Pred Value 0.9610 0.9105 0.9579 0.9335 0.9370
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2575 0.1193 0.1412 0.1073 0.1308
## Detection Prevalence 0.3077 0.1708 0.2136 0.1491 0.1588
## Balanced Accuracy 0.9174 0.7764 0.8611 0.8024 0.8386
#Random Forest Confusion Matrix
confusionMatrix(factor(predRf), factor(machineSubTest$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1113 9 0 0 0
## B 2 749 3 0 1
## C 1 1 677 13 4
## D 0 0 4 630 1
## E 0 0 0 0 715
##
## Overall Statistics
##
## Accuracy : 0.9901
## 95% CI : (0.9864, 0.9929)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9874
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9973 0.9868 0.9898 0.9798 0.9917
## Specificity 0.9968 0.9981 0.9941 0.9985 1.0000
## Pos Pred Value 0.9920 0.9921 0.9727 0.9921 1.0000
## Neg Pred Value 0.9989 0.9968 0.9978 0.9960 0.9981
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2837 0.1909 0.1726 0.1606 0.1823
## Detection Prevalence 0.2860 0.1925 0.1774 0.1619 0.1823
## Balanced Accuracy 0.9971 0.9925 0.9920 0.9891 0.9958
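For a compact side-by-side comparison, the overall accuracies can also be pulled directly from the confusionMatrix objects (a convenience sketch; output omitted):
cmTree <- confusionMatrix(factor(predTree), factor(machineSubTest$classe))
cmRf <- confusionMatrix(factor(predRf), factor(machineSubTest$classe))
#Overall accuracy of each model, side by side
data.frame(model = c("Decision Tree", "Random Forest"),
           accuracy = c(cmTree$overall["Accuracy"], cmRf$overall["Accuracy"]))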
After comparing the results, I chose the random forest model to make predictions on the twenty test cases, as its out-of-sample error was far lower. Below are the final predictions made by the model, which correctly predicted 100% of the product quality outcomes!
finalPredictions <- predict(classeRf, testing, type="raw")
print(finalPredictions)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E