The data below are used to create and analyze a confusion matrix and to construct an ROC curve.
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## scored.class scored.probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
##
##       0   1
##   0 119  30
##   1   5  27
The rows represent the prediction model’s values.
The columns represent the (actual) target’s values.
The model predicted 119 0’s that were actually 0 and 30 0’s that were actually 1.
It predicted 5 1’s that were actually 0 and 27 1’s that were actually 1.
I am considering 0 to be positive and 1 to be negative.
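Reading the table with that convention, the cell counts can be stored for quick checks of the metrics that follow. These counts are taken directly from the table above; the variable names are only for the checks and do not appear in the code at the end.

TP <- 119  # predicted 0, actual 0 (true positive, with 0 as the positive class)
FP <- 30   # predicted 0, actual 1 (false positive)
FN <- 5    # predicted 1, actual 0 (false negative)
TN <- 27   # predicted 1, actual 1 (true negative)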
The accuracy is the ratio of the number of correct predictions to the total number of predictions.
## [1] 0.8066298
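As a check against the value above, the accuracy can be recomputed from the counts defined earlier:

(TP + TN) / (TP + FP + FN + TN)  # (119 + 27) / 181 = 0.8066298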
The error rate is the ratio of the number of incorrect predictions to the total number of predictions.
## [1] 0.1933702
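The same counts give the error rate:

(FP + FN) / (TP + FP + FN + TN)  # (30 + 5) / 181 = 0.1933702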
The accuracy and the error rate sum to 1.
## [1] 1
The precision is the ratio of the true positives (values predicted to be 0 that were actually 0) to all of the positive predictions (every value that was predicted to be 0).
## [1] 0.7986577
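Checking the precision from the counts defined earlier:

TP / (TP + FP)  # 119 / 149 = 0.7986577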
The sensitivity is the ratio of the true positives (values predicted to be 0 that were actually 0) to the true positives plus false negatives (all values whose target value is 0).
## [1] 0.9596774
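Checking the sensitivity from the counts:

TP / (TP + FN)  # 119 / 124 = 0.9596774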
The specificity is the ratio of the true negatives (values for which both the prediction and the target are 1) to the true negatives plus false positives (all values whose target value is 1).
## [1] 0.4736842
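Checking the specificity from the counts:

TN / (TN + FP)  # 27 / 57 = 0.4736842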
The F1 score is equal to 2 × Precision × Sensitivity / (Precision + Sensitivity).
## [1] 0.8717949
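Checking the F1 score from the counts:

precision <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)
2 * precision * sensitivity / (precision + sensitivity)  # 0.8717949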
If all of the predictions are correct, then there are only true positives and true negatives. In that case, the precision will equal 1 and the sensitivity will equal 1. The F1 score would then equal 1. That is the maximum boundary of the F1 score.
The other extreme is when there are no true positives. In that case the precision equals 0 and the sensitivity equals 0, so the formula gives 0/0 and the F1 score is undefined; in practice it is treated as 0 in this situation. The F1 score is therefore bounded between 0 and 1.
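A quick numeric illustration of these boundary cases (the helper function and the counts passed to it are hypothetical, not taken from the data above):

f1_from_counts <- function(tp, fp, fn) {
  p <- tp / (tp + fp)   # precision
  s <- tp / (tp + fn)   # sensitivity
  2 * p * s / (p + s)
}
f1_from_counts(100, 0, 0)   # every prediction correct: F1 = 1
f1_from_counts(0, 10, 10)   # no true positives: 0/0, returned as NaN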
The ROC curve is a plot of the true positive rate (sensitivity) versus the false positive rate (1 − specificity). It is created by varying the cut-off, which is the probability threshold: probabilities below the cut-off result in a prediction of 0, and probabilities at or above it result in a prediction of 1.
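As an illustration, a single point on the curve can be computed for a hypothetical cut-off of 0.5, assuming df holds the data shown above (with 0 as the positive class):

cutoff <- 0.5
pred <- ifelse(df$scored.probability >= cutoff, 1, 0)
tpr <- sum(pred == 0 & df$class == 0) / sum(df$class == 0)  # sensitivity at this cut-off
fpr <- sum(pred == 0 & df$class == 1) / sum(df$class == 1)  # 1 - specificity at this cut-off
c(fpr = fpr, tpr = tpr)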
The area under the curve is 0.8488964.
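The area was accumulated with the trapezoidal rule. A minimal, generic sketch of that rule (the function is illustrative and assumes fpr and tpr are ROC coordinates ordered by increasing fpr):

trapezoid_auc <- function(fpr, tpr) {
  # each segment contributes its width times the average of its two endpoint heights
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}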
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 30
## 1 5 27
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.9597
## Specificity : 0.4737
## Pos Pred Value : 0.7987
## Neg Pred Value : 0.8438
## Prevalence : 0.6851
## Detection Rate : 0.6575
## Detection Prevalence : 0.8232
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 0
##
The caret package yielded the same confusion matrix, sensitivity, and specificity values that I found above, and its positive predictive value matches the precision calculated above.
##
## Call:
## roc.default(response = df$class, predictor = df$scored.probability)
##
## Data: df$scored.probability in 124 controls (df$class 0) < 57 cases (df$class 1).
## Area under the curve: 0.8503
The ROC curve from the pROC package produces a curve of the same shape as the ROC curve created above. The value of the area under the curve calculated from the pROC method is 0.8503. This is approximately equal to the area I calculated using the trapezoid method, which was 0.8489.
df <- read.csv('https://raw.githubusercontent.com/swigodsky/Data621/master/classification_output_data.csv')
head(df)
table(df$scored.class, df$class)
acc <- function(df){
  totalnum <- length(df$scored.class)
  numRight <- length(which(df$scored.class==df$class))
  accuracy <- numRight/totalnum
  return(accuracy)
}
accuracy <- acc(df)
print(accuracy)
err <- function(df){
  totalnum <- length(df$scored.class)
  numWrong <- length(which(df$scored.class!=df$class))
  error <- numWrong/totalnum
  return(error)
}
error <- err(df)
print(error)
print(accuracy+error)
prec <- function(df){
  true_pos <- length(which((df$scored.class==0)&(df$class==0)))
  all_pos <- length(which(df$scored.class==0))  # all values predicted to be 0 (true positives + false positives)
  precision <- true_pos/all_pos
  return(precision)
}
precision <- prec(df)
print(precision)
sens <- function(df){
  true_pos <- length(which((df$scored.class==0)&(df$class==0)))
  false_neg <- length(which((df$scored.class==1)&(df$class==0)))
  sensitivity <- true_pos/(true_pos+false_neg)
  return(sensitivity)
}
sensitivity <- sens(df)
print(sensitivity)
spec <- function(df){
  true_neg <- length(which((df$scored.class==1)&(df$class==1)))
  false_pos <- length(which((df$scored.class==0)&(df$class==1)))
  specificity <- true_neg/(true_neg+false_pos)
  return(specificity)
}
specificity <- spec(df)
print(specificity)
f1 <- function(df){
  true_pos <- length(which((df$scored.class==0)&(df$class==0)))
  all_pos <- length(which(df$scored.class==0))  # all values predicted to be 0
  precision <- true_pos/all_pos
  false_neg <- length(which((df$scored.class==1)&(df$class==0)))
  sensitivity <- true_pos/(true_pos+false_neg)
  f1 <- 2*precision*sensitivity/(precision+sensitivity)
  return(f1)
}
f1 <- f1(df)
print(f1)
library(ggplot2)
roc <- function(df){
  roc_tester <- data.frame(o_m_specificity=NA, sensitivity=NA)[numeric(0), ]
  auc <- 0
  for (cutoff in seq(0, 1.0, 0.01)){
    test_df <- df  # make a copy of df
    # set scored (predicted) values in test_df according to whether the probability
    # is above or below the cut-off threshold
    test_df$scored.class[test_df$scored.probability < cutoff] <- 0
    test_df$scored.class[test_df$scored.probability >= cutoff] <- 1
    spec_val <- spec(test_df)
    sens_val <- sens(test_df)
    roc_tester <- rbind(roc_tester, list(o_m_specificity=1-spec_val, sensitivity=sens_val))
    # calculating area of trapezoid for each set of data points
    if (cutoff >= 0.1){
      num_values <- nrow(roc_tester)
      base2 <- roc_tester$sensitivity[num_values]
      base1 <- roc_tester$sensitivity[num_values-1]
      height2 <- roc_tester$o_m_specificity[num_values]
      height1 <- roc_tester$o_m_specificity[num_values-1]
      area <- .5*(base1+base2)*(height2-height1)
      auc <- auc + area
    }
  }
  roc_plot <- ggplot(roc_tester, aes(x = o_m_specificity, y = sensitivity)) +
    geom_point() +
    labs(x="False Positive Rate (1-specificity)", y="True Positive Rate (sensitivity)", title="ROC Curve")
  return(list(roc_plot=roc_plot, auc_val=auc))
}
roc_vals <- roc(df)
roc_vals$roc_plot
library(caret)
library(e1071)
values <- factor(df$class)
pred <- factor(df$scored.class)
confusionMatrix(pred, values)
library(pROC)
pROC::roc(df$class, df$scored.probability)
plot.roc(df$class, df$scored.probability, main="ROC Curve Using pROC Package")