The data below are used to create and analyze a confusion matrix and to construct an ROC curve.
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## scored.class scored.probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
##
##       0   1
##   0 119  30
##   1   5  27
The rows represent the prediction model’s values.
The columns represent the (actual) target’s values.
The model predicted 119 0’s that were actually 0 and 30 0’s that were actually 1.
It predicted 5 1’s that were actually 0 and 27 1’s that were actually 1.
I am considering 0 to be positive and 1 to be negative.
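Reading the table with that convention, the cell counts can be stored for quick checks of the metrics that follow. These counts are taken directly from the table above; the variable names are only for the checks and do not appear in the code at the end.

TP <- 119  # predicted 0, actual 0 (true positive, with 0 as the positive class)
FP <- 30   # predicted 0, actual 1 (false positive)
FN <- 5    # predicted 1, actual 0 (false negative)
TN <- 27   # predicted 1, actual 1 (true negative)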
The accuracy is the ratio of the number of correct predictions to the total number of predictions.
## [1] 0.8066298
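As a check against the value above, the accuracy can be recomputed from the counts defined earlier:

(TP + TN) / (TP + FP + FN + TN)  # (119 + 27) / 181 = 0.8066298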
The error rate is the ratio of the number of incorrect predictions to the total number of predictions.
## [1] 0.1933702
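The same counts give the error rate:

(FP + FN) / (TP + FP + FN + TN)  # (30 + 5) / 181 = 0.1933702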
The accuracy and the error rate sum to 1.
## [1] 1
The precision is the ratio of the true positives (values predicted to be 0 that were actually 0) to all of the positive predictions (every value that was predicted to be 0).
## [1] 0.7986577
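Checking the precision from the counts defined earlier:

TP / (TP + FP)  # 119 / 149 = 0.7986577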
The sensitivity is the ratio of the true positives (values predicted to be 0 that were actually 0) to the true positives plus false negatives (all values whose target value is 0).
## [1] 0.9596774
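Checking the sensitivity from the counts:

TP / (TP + FN)  # 119 / 124 = 0.9596774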
The specificity is the ratio of the true negatives (values for which both the prediction and the target are 1) to the true negatives plus false positives (all values whose target value is 1).
## [1] 0.4736842
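Checking the specificity from the counts:

TN / (TN + FP)  # 27 / 57 = 0.4736842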
The F1 score is equal to 2 × Precision × Sensitivity / (Precision + Sensitivity).
## [1] 0.8717949
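Checking the F1 score from the counts:

precision <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)
2 * precision * sensitivity / (precision + sensitivity)  # 0.8717949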
If all of the predictions are correct, then there are only true positives and true negatives. In that case, the precision will equal 1 and the sensitivity will equal 1. The F1 score would then equal 1. That is the maximum boundary of the F1 score.
The other extreme is when there are no true positives. In that case the precision equals 0 and the sensitivity equals 0, so the formula gives 0/0 and the F1 score is undefined; in practice it is treated as 0 in this situation. The F1 score is therefore bounded between 0 and 1.
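A quick numeric illustration of these boundary cases (the helper function and the counts passed to it are hypothetical, not taken from the data above):

f1_from_counts <- function(tp, fp, fn) {
  p <- tp / (tp + fp)   # precision
  s <- tp / (tp + fn)   # sensitivity
  2 * p * s / (p + s)
}
f1_from_counts(100, 0, 0)   # every prediction correct: F1 = 1
f1_from_counts(0, 10, 10)   # no true positives: 0/0, returned as NaN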
The ROC curve is a plot of the true positive rate (sensitivity) versus the false positive rate (1 − specificity). It is created by varying the cut-off, which is the probability threshold: probabilities below the cut-off result in a prediction of 0, and probabilities at or above it result in a prediction of 1.
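As an illustration, a single point on the curve can be computed for a hypothetical cut-off of 0.5, assuming df holds the data shown above (with 0 as the positive class):

cutoff <- 0.5
pred <- ifelse(df$scored.probability >= cutoff, 1, 0)
tpr <- sum(pred == 0 & df$class == 0) / sum(df$class == 0)  # sensitivity at this cut-off
fpr <- sum(pred == 0 & df$class == 1) / sum(df$class == 1)  # 1 - specificity at this cut-off
c(fpr = fpr, tpr = tpr)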
The area under the curve is 0.8488964.
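The area was accumulated with the trapezoidal rule. A minimal, generic sketch of that rule (the function is illustrative and assumes fpr and tpr are ROC coordinates ordered by increasing fpr):

trapezoid_auc <- function(fpr, tpr) {
  # each segment contributes its width times the average of its two endpoint heights
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}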
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.9597
## Specificity : 0.4737
## Pos Pred Value : 0.7987
## Neg Pred Value : 0.8438
## Prevalence : 0.6851
## Detection Rate : 0.6575
## Detection Prevalence : 0.8232
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 0
##
The caret package yielded the same confusion matrix, sensitivity, and specificity values that I found above, and its positive predictive value matches the precision calculated above.
##
## Call:
## roc.default(response = df$class, predictor = df$scored.probability)
##
## Data: df$scored.probability in 124 controls (df$class 0) < 57 cases (df$class 1).
## Area under the curve: 0.8503
The ROC curve from the pROC package produces a curve of the same shape as the ROC curve created above. The value of the area under the curve calculated from the pROC method is 0.8503. This is approximately equal to the area I calculated using the trapezoid method, which was 0.8489.
df <- read.csv('https://raw.githubusercontent.com/swigodsky/Data621/master/classification_output_data.csv')
head(df)
table(df$scored.class, df$class)
acc <- function(df){
  totalnum <- length(df$scored.class)
  numRight <- length(which(df$scored.class==df$class))
  accuracy <- numRight/totalnum
  return(accuracy)
}
accuracy <- acc(df)
print(accuracy)
err <- function(df){
  totalnum <- length(df$scored.class)
  numWrong <- length(which(df$scored.class!=df$class))
  error <- numWrong/totalnum
  return(error)
}
error <- err(df)
print(error)
print(accuracy+error)
prec <- function(df){
  true_pos <- length(which((df$scored.class==0)&(df$class==0)))
  all_pos <- length(which(df$scored.class==0))  # all values predicted to be 0 (true positives + false positives)
  precision <- true_pos/all_pos
  return(precision)
}
precision <- prec(df)
print(precision)
sens <- function(df){
  true_pos <- length(which((df$scored.class==0)&(df$class==0)))
  false_neg <- length(which((df$scored.class==1)&(df$class==0)))
  sensitivity <- true_pos/(true_pos+false_neg)
  return(sensitivity)
}
sensitivity <- sens(df)
print(sensitivity)
spec <- function(df){
  true_neg <- length(which((df$scored.class==1)&(df$class==1)))
  false_pos <- length(which((df$scored.class==0)&(df$class==1)))
  specificity <- true_neg/(true_neg+false_pos)
  return(specificity)
}
specificity <- spec(df)
print(specificity)
f1 <- function(df){
  true_pos <- length(which((df$scored.class==0)&(df$class==0)))
  all_pos <- length(which(df$scored.class==0))  # all values predicted to be 0
  precision <- true_pos/all_pos
  false_neg <- length(which((df$scored.class==1)&(df$class==0)))
  sensitivity <- true_pos/(true_pos+false_neg)
  f1 <- 2*precision*sensitivity/(precision+sensitivity)
  return(f1)
}
f1 <- f1(df)
print(f1)
library(ggplot2)
roc <- function(df){
  roc_tester <- data.frame(o_m_specificity=NA, sensitivity=NA)[numeric(0), ]
  auc <- 0
  for (cutoff in seq(0, 1.0, 0.01)){
    test_df <- df  # make a copy of df
    # set scored (predicted) values in test_df according to whether the probability
    # is above or below the cut-off threshold
    test_df$scored.class[test_df$scored.probability < cutoff] <- 0
    test_df$scored.class[test_df$scored.probability >= cutoff] <- 1
    spec_val <- spec(test_df)
    sens_val <- sens(test_df)
    roc_tester <- rbind(roc_tester, list(o_m_specificity=1-spec_val, sensitivity=sens_val))
    # calculating area of trapezoid for each set of data points
    if (cutoff >= 0.1){
      num_values <- nrow(roc_tester)
      base2 <- roc_tester$sensitivity[num_values]
      base1 <- roc_tester$sensitivity[num_values-1]
      height2 <- roc_tester$o_m_specificity[num_values]
      height1 <- roc_tester$o_m_specificity[num_values-1]
      area <- .5*(base1+base2)*(height2-height1)
      auc <- auc + area
    }
  }
  roc_plot <- ggplot(roc_tester, aes(x = o_m_specificity, y = sensitivity)) +
    geom_point() +
    labs(x="False Positive Rate (1-specificity)", y="True Positive Rate (sensitivity)", title="ROC Curve")
  return(list(roc_plot=roc_plot, auc_val=auc))
}
roc_vals <- roc(df)
roc_vals$roc_plot
library(caret)
library(e1071)
values <- factor(df$class)
pred <- factor(df$scored.class)
confusionMatrix(pred, values)
library(pROC)
pROC::roc(df$class, df$scored.probability)
plot.roc(df$class, df$scored.probability, main="ROC Curve Using pROC Package")