In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.
1) Download the classification data
## pregnant glucose diastolic skinfold insulin bmi pedigree age class
## 1 7 124 70 33 215 25.5 0.161 37 0
## 2 2 122 76 27 200 35.9 0.483 26 0
## 3 3 107 62 13 48 22.9 0.678 23 1
## 4 1 91 64 24 0 29.2 0.192 21 0
## 5 4 83 86 19 0 29.3 0.317 34 0
## 6 1 100 74 12 46 19.5 0.149 28 0
## 7 9 89 62 0 0 22.5 0.142 33 0
## 8 8 120 78 0 0 25.0 0.409 64 0
## 9 1 79 60 42 48 43.5 0.678 23 0
## 10 2 123 48 32 165 42.1 0.520 26 0
## scored.class scored.probability
## 1 0 0.32845226
## 2 0 0.27319044
## 3 0 0.10966039
## 4 0 0.05599835
## 5 0 0.10049072
## 6 0 0.05515460
## 7 0 0.10711542
## 8 0 0.45994744
## 9 0 0.11702368
## 10 0 0.31536320
## class scored.class scored.probability
## 1 0 0 0.32845226
## 2 0 0 0.27319044
## 3 1 0 0.10966039
## 4 0 0 0.05599835
## 5 0 0 0.10049072
## 6 0 0 0.05515460
## 7 0 0 0.10711542
## 8 0 0 0.45994744
## 9 0 0 0.11702368
## 10 0 0 0.31536320
Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?
## class
## scored.class 1 0
## 1 27 5
## 0 30 119
The confusion matrix summarizes the prediction output. The columns are the actual classes while the rows are the predicted classes. The diagonal of the matrix holds the correctly classified cases (27 true positives and 119 true negatives); the off-diagonal cells hold the 30 false negatives and 5 false positives. In the past, I have worked with an almost identical dataset, so with that background knowledge I can infer that the goal is to classify whether someone has diabetes (is this the Pima dataset?). Since there are more false negatives than false positives, there is a greater chance of a Type II error.
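To make the orientation concrete, the four cells can be read off the table by name. This is a small sketch, assuming the table is stored as matrix.df (rows = scored.class, columns = class), as in the appendix code:
#label the four cells of the confusion matrix explicitly
#rows are predicted (scored.class), columns are actual (class)
TP <- matrix.df["1", "1"] #predicted 1, actual 1
FP <- matrix.df["1", "0"] #predicted 1, actual 0
Fn <- matrix.df["0", "1"] #predicted 0, actual 1
Tn <- matrix.df["0", "0"] #predicted 0, actual 0
c(TP = TP, FP = FP, Fn = Fn, Tn = Tn)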
The formula for accuracy is as follows: \[ Accuracy=\frac { TP+Tn }{ TP+FP+Tn+Fn } \]
Variable definitions: TP = true positive, Tn = true negative, FP = false positive, Fn = false negative.
Let's write a function that implements this formula and returns the prediction accuracy.
#convert the subset data into data frame
classes.df<-data.frame(classification_data2)
#define the conditions for each variable
#(1,1) pairing pertains to true positives
#(0,0) pairing pertains to true negatives
accuracy <- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Tn <- sum(df$class == 0 & df$scored.class == 0)
(TP + Tn)/nrow(df)
}
#accuracy(classes.df)
We can verify that our function is correct by doing the calculation directly (essentially by hand):
A<-(27+119)/(27+5+119+30)
A
## [1] 0.8066298
The formula for the error rate is as follows: \[ ErrorRate=\frac { FP+Fn }{ TP+FP+Tn+Fn } \]
We build a similar function for the error rate:
#define the conditions for each variable
#(0,1) pairing pertains to false positives
#(1,0) pairing pertains to false negatives
error<- function(df)
{
FP <- sum(df$class == 0 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
(FP + Fn)/nrow(df)
}
#error(classes.df)
As with prediction accuracy, we can compute error by hand to verify our function
E<-(5+30)/(27+5+119+30)
E
## [1] 0.1933702
We can also find the error rate by computing the complement of the accuracy as follows: \[ Error=1-Accuracy \]
E2<-1-A
E2
## [1] 0.1933702
We need to verify that the error and the accuracy sum to one. Finding the error as the complement of the accuracy already implies that they add up to one, but we can still check.
E+A
## [1] 1
The precision formula is defined as follows: \[ precision=\frac { TP }{ TP+FP } \]
We define our function similarly to the previous ones:
#our variables pairings are similar to the previous functions we have written
classes.df<-data.frame(classification_data2)
precision<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
FP <- sum(df$class == 0 & df$scored.class == 1)
(TP)/(TP + FP)
}
#precision
We can verify our function with a hand calculation
p<-27/(27+5)
p
## [1] 0.84375
The sensitivity is defined as follows: \[ sensitivity=\frac { TP }{ TP+Fn } \] We define the function as follows:
classes.df<-data.frame(classification_data2)
#our variables pairings are similar to the previous functions we have written
sensitivity<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
TP/(TP + Fn)
}
Compute by hand to verify the function calculation
s<-27/(27+30)
s
## [1] 0.4736842
The specificity is defined as follows
\[ specificity=\frac { Tn }{ Tn+FP } \]
We find it using a written function:
#our variables pairings are similar to the previous functions we have written
specificity<- function(df)
{
Tn <- sum(df$class == 0 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
Tn/(Tn + FP)
}
We calculate by hand to verify:
sp<-(119/(119+5))
sp
## [1] 0.9596774
\[ F1\quad Score=\frac { 2\times precision\times sensitivity }{ precision+sensitivity } \]
Since we previously defined the parameters in the equation, we can recall them to compute the F1 score.
#take the previous functions and nest them into a function for F1 score
F1 <- function(df)
{
precision1<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
FP <- sum(df$class == 0 & df$scored.class == 1)
(TP)/(TP + FP)
}
sensitivity1<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
TP/(TP + Fn)
}
F1 <- (2 * precision1(df) * sensitivity1(df))/(precision1(df) + sensitivity1(df))
return(F1)
}
#F1(classes.df)
Let's verify our function by hand:
f1<-(2*0.84375*0.4736842)/(0.84375+0.4736842)
f1
## [1] 0.6067416
With some algebraic manipulation, we can reduce the F1 score formula as follows: \[ F1Score\quad=\frac { 2TP }{ 2TP+Fn+FP } \]
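For completeness, one way to see the reduction is to substitute the definitions of precision and sensitivity into the F1 formula and simplify:
\[ F1\quad Score=\frac { 2\times \frac { TP }{ TP+FP } \times \frac { TP }{ TP+Fn } }{ \frac { TP }{ TP+FP } +\frac { TP }{ TP+Fn } } =\frac { 2TP^{ 2 } }{ TP(TP+Fn)+TP(TP+FP) } =\frac { 2TP }{ 2TP+Fn+FP } \]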
If we compute this formula by hand, we should get the same result:
f1_a<-(2*27)/(2*27 +30 +5)
f1_a
## [1] 0.6067416
We can build a function for this simpler, reduced version of the F1 score:
#our variables pairings are similar to the previous functions we have written
f1_reduced<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
(2*TP)/(2*TP + Fn + FP)
}
Let's assume we have classified n true positives and no false negatives or false positives. If TP is non-zero and Fn and FP are both zero, then the formula reduces as follows: \[ F1\quad Score=\frac { 2TP }{ 2TP+0+0 } =\frac { 2TP }{ 2TP } =1 \]
Now assume we classified no true positives, with some combination of false negatives and false positives: both non-zero, only false negatives, or only false positives. In each case the F1 score is zero: \[ F1\quad Score=\frac { 2TP }{ 2TP+Fn+FP } =\frac { 0 }{ 0+Fn+FP } =0 \] \[ F1\quad Score=\frac { 2TP }{ 2TP+Fn+FP } =\frac { 0 }{ 0+Fn+0 } =0 \] \[ F1\quad Score=\frac { 2TP }{ 2TP+Fn+FP } =\frac { 0 }{ 0+0+FP } =0 \] The third situation to consider is when TP, Fn, and FP are all non-zero. The denominator is then strictly greater than the numerator, so the F1 score is always a value between zero and one. \[ 0<\frac { 2TP }{ 2TP+Fn+FP } <1 \]
Conceptually, an F1 score of zero tells us that the model gives no information about the positive cases; perhaps a different model should be considered. An F1 score of one means every positive case was classified correctly, with no false positives or false negatives. The F1 score is undefined if the denominator is zero, which is also an indicator that a different model should be considered.
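As a quick numerical spot check of these bounds, here is a sketch that evaluates the reduced formula directly through a small hypothetical helper, f1_check, rather than the functions defined above:
#evaluate the reduced F1 formula for a few arbitrary cell counts
f1_check <- function(TP, Fn, FP) (2*TP)/(2*TP + Fn + FP)
f1_check(27, 30, 5) #the counts from our confusion matrix, about 0.61
f1_check(10, 0, 0) #no errors at all: F1 = 1
f1_check(1, 100, 100) #mostly errors: F1 close to 0 but still positive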
Perhaps this can also be shown algebraically using the non-reduced version of the F1 score formula. Recall the definitions of precision and sensitivity (from Wikipedia):
Precision is the fraction of relevant instances among the retrieved instances. Sensitivity measures the proportion of actual positives that are correctly identified as such.
Start from the claim that the F1 score lies strictly between zero and one: \[ 0<\frac { 2\times precision\times sensitivity }{ precision+sensitivity } <1 \] Multiply all sides of the inequality by the denominator (precision + sensitivity): \[ 0<{2\times precision\times sensitivity } < {precision + sensitivity} \] Divide all sides of the inequality by (2 x precision x sensitivity): \[ 0<{ 1 }<\frac { precision }{ 2\times precision\times sensitivity } +\frac { sensitivity }{ 2\times precision\times sensitivity } \] Cancel out terms: \[ 0<{ 1 }<\frac { 1 }{ 2\times sensitivity } +\frac { 1 }{ 2\times precision } \]
The resulting inequality is undefined if sensitivity and/or precision is zero, so the domain is: \[ sensitivity,\ precision\in (0,\ \infty ) \]
If (true positives + false negatives) = 0, then there are no positive cases in the input data, so any analysis of this case carries no information and supports no conclusion about how positive cases are handled. You want NA or something similar as the ratio result, avoiding a division-by-zero error.
If (true positives + false positives) = 0, then all cases have been predicted to be negative: this is one end of the ROC curve. Again, you want to recognise and report this possibility while avoiding a division-by-zero error.
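One way to handle these edge cases in code is to return NA instead of dividing by zero. The sketch below uses hypothetical safe_precision() and safe_sensitivity() helpers rather than modifying the functions defined earlier:
#return NA when the denominator is zero instead of producing NaN or an error
safe_precision <- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
FP <- sum(df$class == 0 & df$scored.class == 1)
if ((TP + FP) == 0) return(NA) #no cases predicted positive
TP/(TP + FP)
}
safe_sensitivity <- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
if ((TP + Fn) == 0) return(NA) #no positive cases in the data
TP/(TP + Fn)
}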
library(ggplot2)
classification <- subset(classes.df, select = c(scored.probability, class))
myroc = function(df)
{
#define the threshold
threshold = seq(0,1, by= 0.01)
#define a data frame to hold the predicted classes
prediction.df = data.frame(row.names = 1:nrow(df))
x = seq_along(threshold)
y = seq_along(threshold)
for (i in threshold)
{
prob.df = as.numeric(df$scored.probability>i)
prediction.df = cbind(prediction.df, prob.df)
}
for (j in 1:length(threshold))
{
classes.type.df = factor(df$class, levels = c(0,1))
prediction_types.df = factor(prediction.df[,j], levels = c(0,1))
collector.table = table(classes.type.df, prediction_types.df)
#rather than nest more functions, read sensitivity and specificity straight from the entries of this confusion matrix; the code is less verbose
sensitivity_a = collector.table[2,2] / (collector.table[2,2] + collector.table[2,1])
specificity_a = collector.table[1,1] / (collector.table[1,1] + collector.table[1,2])
y[j] = sensitivity_a
x[j] = 1 - specificity_a
}
roc.df = data.frame(false_positive_rate = x, true_positive_rate = y)
roc.visuals = ggplot(roc.df, aes(x=false_positive_rate, y=true_positive_rate)) + geom_step()
roc.visuals = roc.visuals + geom_abline(slope = 1, intercept = 0, colour="blue", lty=2)
#rank-based AUC: sort by predicted probability (descending), count for each positive
#the number of negatives ranked above it, then AUC = 1 - discordant pairs/(positives*negatives)
myauc <- function(outcome, proba)
{
n = length(proba)
positives_sum = sum(outcome)
df = data.frame(out = outcome, prob = proba)
df = df[order(-df$prob),]
df$above = (1:n) - cumsum(df$out) #number of negatives at or above each row
return( 1- sum( df$above * df$out ) / (positives_sum * (n-positives_sum) ) )
}
auc_final_plot = myauc(df$class, df$scored.probability)
results = list("Plot"=roc.visuals, "Area under curve"=auc_final_plot)
results
}
#generate the plot
#myroc(classification)
We essentially take the functions defined above and consolidate all of the output for this question. Since each value was also calculated by hand, the results are easy to verify; a consolidated summary sketch follows the output below.
accuracy(classes.df);
## [1] 0.8066298
error(classes.df);
## [1] 0.1933702
precision(classes.df);
## [1] 0.84375
sensitivity(classes.df);
## [1] 0.4736842
specificity(classes.df);
## [1] 0.9596774
F1(classes.df);
## [1] 0.6067416
myroc(classification)
## $Plot
##
## $`Area under curve`
## [1] 0.8503113
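The same values can also be gathered into a single named vector for a compact summary (a sketch, not part of the required output):
#collect all of the classification metrics in one named vector
metrics <- c(accuracy = accuracy(classes.df), error = error(classes.df), precision = precision(classes.df),
sensitivity = sensitivity(classes.df), specificity = specificity(classes.df), F1 = F1(classes.df))
round(metrics, 4)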
Confusion matrix via the caret package
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked _by_ '.GlobalEnv':
##
## precision, sensitivity, specificity
matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix
## Confusion Matrix and Statistics
##
## class
## scored.class 1 0
## 1 27 5
## 0 30 119
##
## Accuracy : 0.8066
## 95% CI : (0.7415, 0.8615)
## No Information Rate : 0.6851
## P-Value [Acc > NIR] : 0.0001712
##
## Kappa : 0.4916
## Mcnemar's Test P-Value : 4.976e-05
##
## Sensitivity : 0.4737
## Specificity : 0.9597
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7987
## Prevalence : 0.3149
## Detection Rate : 0.1492
## Detection Prevalence : 0.1768
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##
#Confusion matrix table
#caret_matrix$table
The values from the caret confusion matrix are indeed similar to those generated by the user-defined functions.
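If desired, the comparison can be made explicit by pulling individual statistics out of the caret object; the sketch below assumes the confusionMatrix result exposes its statistics through the overall and byClass components, with precision reported as the positive predictive value:
#compare caret's statistics against the hand-rolled functions
caret_matrix$overall["Accuracy"]
accuracy(classes.df)
caret_matrix$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")]
c(sensitivity(classes.df), specificity(classes.df), precision(classes.df))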
library(pROC)
## Warning: package 'pROC' was built under R version 3.4.4
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");
auc(roc(classes.df$class, classes.df$scored.probability))
## Area under the curve: 0.8503
The difference between the plot generated by my function and the one from pROC is marginal; they are more similar than different.
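As a final numeric check (a sketch), the two areas under the curve can be compared directly; the custom value is pulled from the list returned by myroc():
#compare the custom AUC against pROC's AUC
custom_auc <- myroc(classification)$`Area under curve`
proc_auc <- as.numeric(auc(roc(classes.df$class, classes.df$scored.probability)))
round(c(custom = custom_auc, pROC = proc_auc), 4)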
APPENDIX
#1
url <- 'https://raw.githubusercontent.com/vindication09/DATA-621-Week-2/master/classification-output-data.csv'
classification_data <- read.csv(url, header = TRUE)
head(classification_data,10)
#2
classification_data2<-subset(classification_data,select=c(class,scored.class, scored.probability))
head(classification_data2,10)
matrix.df<-with(classification_data2, table(scored.class, class)[2:1,2:1])
matrix.df
#3
#convert the subset data into data frame
classes.df<-data.frame(classification_data2)
#define the conditions for each variable
#(1,1) pairing pertains to true positives
#(0,0) pairing pertains to true negatives
accuracy <- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Tn <- sum(df$class == 0 & df$scored.class == 0)
(TP + Tn)/nrow(df)
}
accuracy(classes.df)
#4
#define the conditions for each variable
#(0,1) pairing pertains to false positives
#(1,0) pairing pertains to false negatives
error<- function(df)
{
FP <- sum(df$class == 0 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
(FP + Fn)/nrow(df)
}
error(classes.df)
#5
#our variables pairings are similar to the previous functions we have written
classes.df<-data.frame(classification_data2)
precision<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
FP <- sum(df$class == 0 & df$scored.class == 1)
(TP)/(TP + FP)
}
precision(classes.df)
#6
classes.df<-data.frame(classification_data2)
#our variables pairings are similar to the previous functions we have written
sensitivity<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
TP/(TP + Fn)
}
sensitivity(classes.df)
#7
#our variables pairings are similar to the previous functions we have written
specificity<- function(df)
{
Tn <- sum(df$class == 0 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
Tn/(Tn + FP)
}
specificity(classes.df)
#8
#take the previous functions and nest them into a function for F1 score
F1 <- function(df)
{
precision1<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
FP <- sum(df$class == 0 & df$scored.class == 1)
(TP)/(TP + FP)
}
sensitivity1<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
TP/(TP + Fn)
}
F1 <- (2 * precision1(df) * sensitivity1(df))/(precision1(df) + sensitivity1(df))
return(F1)
}
F1(classes.df)
#our variables pairings are similar to the previous functions we have written
f1_reduced<- function(df)
{
TP <- sum(df$class == 1 & df$scored.class == 1)
Fn <- sum(df$class == 1 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
(2*TP)/(2*TP + Fn + FP)
}
f1_reduced(classes.df)
#10
library(ggplot2)
classification <- subset(classes.df, select = c(scored.probability, class))
myroc = function(df)
{
#define the threshold
threshold = seq(0,1, by= 0.01)
#define a data frame to hold the predicted classes
prediction.df = data.frame(row.names = 1:nrow(df))
x = seq_along(threshold)
y = seq_along(threshold)
for (i in threshold)
{
prob.df = as.numeric(df$scored.probability>i)
prediction.df = cbind(prediction.df, prob.df)
}
for (j in 1:length(threshold))
{
classes.type.df = factor(df$class, levels = c(0,1))
prediction_types.df = factor(prediction.df[,j], levels = c(0,1))
collector.table = table(classes.type.df, prediction_types.df)
#rather than nest more functions, read sensitivity and specificity straight from the entries of this confusion matrix; the code is less verbose
sensitivity_a = collector.table[2,2] / (collector.table[2,2] + collector.table[2,1])
specificity_a = collector.table[1,1] / (collector.table[1,1] + collector.table[1,2])
y[j] = sensitivity_a
x[j] = 1 - specificity_a
}
roc.df = data.frame(false_positive_rate = x, true_positive_rate = y)
roc.visuals = ggplot(roc.df, aes(x=false_positive_rate, y=true_positive_rate)) + geom_step()
roc.visuals = roc.visuals + geom_abline(slope = 1, intercept = 0, colour="blue", lty=2)
#rank-based AUC: sort by predicted probability (descending), count for each positive
#the number of negatives ranked above it, then AUC = 1 - discordant pairs/(positives*negatives)
myauc <- function(outcome, proba)
{
n = length(proba)
positives_sum = sum(outcome)
df = data.frame(out = outcome, prob = proba)
df = df[order(-df$prob),]
df$above = (1:n) - cumsum(df$out) #number of negatives at or above each row
return( 1- sum( df$above * df$out ) / (positives_sum * (n-positives_sum) ) )
}
auc_final_plot = myauc(df$class, df$scored.probability)
results = list("Plot"=roc.visuals, "Area under curve"=auc_final_plot)
results
}
#generate the plot
myroc(classification)
#11
accuracy(classes.df);
error(classes.df);
precision(classes.df);
sensitivity(classes.df);
specificity(classes.df);
F1(classes.df);
myroc(classification)
#12
library(caret)
matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix
#Confusion matrix table
caret_matrix$table
#13
library(pROC)
plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");
auc(roc(classes.df$class, classes.df$scored.probability))