In this homework assignment, you will work through various classification metrics. You will be asked to create functions in R to carry out the various calculations. You will also investigate some functions in packages that will let you obtain the equivalent results. Finally, you will create graphical output that also can be used to evaluate the output of classification models, such as binary logistic regression.

1) Download the classification data

##    pregnant glucose diastolic skinfold insulin  bmi pedigree age class
## 1         7     124        70       33     215 25.5    0.161  37     0
## 2         2     122        76       27     200 35.9    0.483  26     0
## 3         3     107        62       13      48 22.9    0.678  23     1
## 4         1      91        64       24       0 29.2    0.192  21     0
## 5         4      83        86       19       0 29.3    0.317  34     0
## 6         1     100        74       12      46 19.5    0.149  28     0
## 7         9      89        62        0       0 22.5    0.142  33     0
## 8         8     120        78        0       0 25.0    0.409  64     0
## 9         1      79        60       42      48 43.5    0.678  23     0
## 10        2     123        48       32     165 42.1    0.520  26     0
##    scored.class scored.probability
## 1             0         0.32845226
## 2             0         0.27319044
## 3             0         0.10966039
## 4             0         0.05599835
## 5             0         0.10049072
## 6             0         0.05515460
## 7             0         0.10711542
## 8             0         0.45994744
## 9             0         0.11702368
## 10            0         0.31536320
  2. Subset only the columns we want to use:
     class: the actual class for the observation
     scored.class: the predicted class for the observation (based on a threshold of 0.5)
     scored.probability: the predicted probability of success for the observation
##    class scored.class scored.probability
## 1      0            0         0.32845226
## 2      0            0         0.27319044
## 3      1            0         0.10966039
## 4      0            0         0.05599835
## 5      0            0         0.10049072
## 6      0            0         0.05515460
## 7      0            0         0.10711542
## 8      0            0         0.45994744
## 9      0            0         0.11702368
## 10     0            0         0.31536320

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

##             class
## scored.class   1   0
##            1  27   5
##            0  30 119

The confusion matrix summarizes the prediction output: the columns are the actual classes and the rows are the predicted classes. The diagonal of the matrix contains the correctly classified observations (the true positives and true negatives). Off the diagonal there are 30 false negatives and 5 false positives. In the past, I have worked with an almost identical dataset, so with my background knowledge, I can infer that the goal was to classify someone as having diabetes. Is this the Pima dataset? Since there are more false negatives than false positives, the model is more prone to type II errors.
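As a quick sketch of how the individual cells can be pulled out of that table (this reuses the classification_data2 subset shown above; indexing by the "1"/"0" dimnames rather than by position is simply a way to avoid depending on the row/column ordering):

conf <- with(classification_data2, table(scored.class, class))
TP <- conf["1", "1"]   # predicted 1, actual 1
FP <- conf["1", "0"]   # predicted 1, actual 0
Fn <- conf["0", "1"]   # predicted 0, actual 1
Tn <- conf["0", "0"]   # predicted 0, actual 0
c(TP = TP, FP = FP, Fn = Fn, Tn = Tn)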

  3. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

The formula for accuracy is as follows: \[ Accuracy=\frac { TP+Tn }{ TP+FP+Tn+Fn } \]

Variable definitions: TP = true positive, Tn = true negative, FP = false positive, Fn = false negative.

Let's write a function that takes this formula and returns the prediction accuracy.

#convert the subset data into a data frame
classes.df <- data.frame(classification_data2)

#define the conditions for each variable
#(1,1) pairing pertains to true positives
#(0,0) pairing pertains to true negatives
accuracy <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  Tn <- sum(df$class == 0 & df$scored.class == 0)
  (TP + Tn)/nrow(df)
}

#accuracy(classes.df)

We can verify that our function is correct by doing the calculation directly (almost like doing it by hand):

A<-(27+119)/(27+5+119+30)
A
## [1] 0.8066298
  4. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

The formula for the error rate is as follows: \[ ErrorRate=\frac { FP+Fn }{ TP+FP+Tn+Fn } \]

We build a function similar to the one for prediction accuracy.

#define the conditions for each variable 
#(0,1) pairing pertains to false positives
#(1,0) pairing pertains to false negatives 

error <- function(df)
{
  FP <- sum(df$class == 0 & df$scored.class == 1)
  Fn <- sum(df$class == 1 & df$scored.class == 0)
  (FP + Fn)/nrow(df)
}

#error(classes.df)

As with prediction accuracy, we can compute error by hand to verify our function

E<-(5+30)/(27+5+119+30)
E
## [1] 0.1933702

We can also find the error rate by computing the complement of the accuracy as follows: \[ Error=1-Accuracy \]

E2<-1-A
E2
## [1] 0.1933702

We should verify that the error and the accuracy sum to one. Finding the error as the complement of the accuracy already implies that they add up to one, but we can still check.

E+A
## [1] 1
  5. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

The precision formula is defined as follows: \[ precision=\frac { TP }{ TP+FP } \]

We define our function similarly to the previous functions.

#our variable pairings are similar to the previous functions we have written
classes.df <- data.frame(classification_data2)

precision <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  FP <- sum(df$class == 0 & df$scored.class == 1)
  TP/(TP + FP)
}

#precision

We can verify our function with a hand calculation

p<-27/(27+5)
p
## [1] 0.84375
  6. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall.

The sensitivity is defined as follows: \[ sensitivity=\frac { TP }{ TP+Fn } \] Define function as follows:

classes.df <- data.frame(classification_data2)
#our variable pairings are similar to the previous functions we have written

sensitivity <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  Fn <- sum(df$class == 1 & df$scored.class == 0)
  TP/(TP + Fn)
}

Compute by hand to verify the function calculation

s<-27/(27+30)
s
## [1] 0.4736842
  7. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

The specificity is defined as follows

\[ specificity=\frac { Tn }{ Tn+FP } \]

Find it using a written function:

#our variable pairings are similar to the previous functions we have written

specificity <- function(df)
{
  Tn <- sum(df$class == 0 & df$scored.class == 0)
  FP <- sum(df$class == 0 & df$scored.class == 1)
  Tn/(Tn + FP)
}

Calculate by hand:

sp<-(119/(119+5))
sp
## [1] 0.9596774
  8. Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

\[ F1\quad Score=\frac { 2\times precision\times sensitivity }{ precision+sensitivity } \]

Since we previously defined the parameters in the equation, we can recall them to compute the F1 score.

#take the previous functions and nest them into a function for the F1 score

F1 <- function(df)
{
  precision1 <- function(df)
  {
    TP <- sum(df$class == 1 & df$scored.class == 1)
    FP <- sum(df$class == 0 & df$scored.class == 1)
    TP/(TP + FP)
  }

  sensitivity1 <- function(df)
  {
    TP <- sum(df$class == 1 & df$scored.class == 1)
    Fn <- sum(df$class == 1 & df$scored.class == 0)
    TP/(TP + Fn)
  }

  (2 * precision1(df) * sensitivity1(df))/(precision1(df) + sensitivity1(df))
}

#F1(classes.df)

Let's verify our function by hand:

f1<-(2*0.84375*0.4736842)/(0.84375+0.4736842)
f1
## [1] 0.6067416

With some algebraic manipulation, we can reduce the F1 score formula as follows: \[ F1\quad Score=\frac { 2TP }{ 2TP+Fn+FP } \]
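For completeness, here is one way to see that reduction, substituting precision = TP/(TP+FP) and sensitivity = TP/(TP+Fn) into the original F1 formula: \[ F1\quad Score=\frac { 2\times \frac { TP }{ TP+FP } \times \frac { TP }{ TP+Fn } }{ \frac { TP }{ TP+FP } +\frac { TP }{ TP+Fn } } =\frac { 2{ TP }^{ 2 } }{ TP\left( TP+Fn \right) +TP\left( TP+FP \right) } =\frac { 2TP }{ 2TP+Fn+FP } \]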

If we compute this formula by hand, we should get the same result:

f1_a<-(2*27)/(2*27  +30 +5)
f1_a
## [1] 0.6067416

We can build a function for this simpler, reduced version of the F1 score.

#our variable pairings are similar to the previous functions we have written

f1_reduced <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  Fn <- sum(df$class == 1 & df$scored.class == 0)
  FP <- sum(df$class == 0 & df$scored.class == 1)
  (2*TP)/(2*TP + Fn + FP)
}
  9. Before we move on, let's consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: if 0 < a < 1 and 0 < b < 1, then ab < a.)

Let's recall the reduced version of the F1 score: \[ F1\quad Score=\frac { 2TP }{ 2TP+Fn+FP } \] Before providing a more formal proof, let's examine the F1 score output in a few hypothetical situations.

Let's assume we have classified n true positives and no false negatives or false positives. If TP is nonzero and Fn and FP are both equal to zero, the formula reduces as follows: \[ F1\quad Score=\frac { 2TP }{ 2TP+0+0 } =\frac { 2TP }{ 2TP } =1 \]

Now let's assume we classified no true positives, whether with n false negatives and m false positives, with 0 false negatives and m false positives, or with n false negatives and 0 false positives. In every case the F1 output is zero: \[ F1\quad Score=\frac { 2TP }{ 2TP+Fn+FP } =\frac { 0 }{ 0+Fn+FP } =0 \] \[ F1\quad Score=\frac { 2TP }{ 2TP+Fn+FP } =\frac { 0 }{ 0+Fn+0 } =0 \] \[ F1\quad Score=\frac { 2TP }{ 2TP+Fn+FP } =\frac { 0 }{ 0+0+FP } =0 \] The third situation to consider is when TP, Fn, and FP are all nonzero. Then the denominator is strictly greater than the numerator, so the F1 score is always a value between zero and one: \[ 0<\frac { 2TP }{ 2TP+Fn+FP } <1 \]

Conceptually, if the F1 score is zero, then the model gives us no useful information about positive cases, and perhaps a different model should be considered. If the F1 score is 1, then the model classifies every positive case correctly and produces no false positives or false negatives. The F1 score is undefined if the denominator is zero, which is another indicator that a different model should be considered.
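As a quick numerical sanity check of these bounds (a minimal sketch, not a proof; the simulated counts are arbitrary and only assume the reduced formula above):

# evaluate the reduced F1 formula on random nonnegative counts and look at the range
set.seed(123)
TP <- sample(0:100, 1000, replace = TRUE)
Fn <- sample(0:100, 1000, replace = TRUE)
FP <- sample(0:100, 1000, replace = TRUE)
f1_sim <- (2 * TP) / (2 * TP + Fn + FP)   # NaN only when TP, Fn and FP are all zero
range(f1_sim, na.rm = TRUE)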

Perhaps this can also be shown algebraically using the non-reduced version of the F1 score formula. Recall the definitions of precision and sensitivity (from Wikipedia):

Precision is the fraction of relevant instances among the retrieved instances; sensitivity measures the proportion of actual positives that are correctly identified as such.

Start from the claimed bounds on the F1 score: \[ 0<\frac { 2\times precision\times sensitivity }{ precision+sensitivity } <1 \] Multiply all parts of the inequality by the denominator (precision + sensitivity): \[ 0<{2\times precision\times sensitivity } < {precision + sensitivity} \]

Divide all parts by (2 x precision x sensitivity): \[ 0<{ 1 }<\frac { precision }{ 2\times precision\times sensitivity } +\frac { sensitivity }{ 2\times precision\times sensitivity } \] Cancel out terms: \[ 0<{ 1 }<\frac { 1 }{ 2\times sensitivity } +\frac { 1 }{ 2\times precision } \]

Every step multiplies or divides by a positive quantity, so the chain can be read in reverse: because precision and sensitivity are each at most 1 (and assumed nonzero), the right-hand side is at least 1/2 + 1/2 = 1, which establishes the original bounds on the F1 score.

The resulting inequality is undefined if sensitivity and/or precision is zero, so both quantities must be strictly positive (and, being proportions, they are at most 1): \[ sensitivity,\ precision\ \in \ (0,\ 1] \]

If (true positives + false negatives) = 0, then there are no positive cases in the input data, so any analysis of this case has no information, and hence no conclusion, about how positive cases are handled. You want NA or something similar as the ratio result, avoiding a division-by-zero error.

If (true positives + false positives) = 0, then all cases have been predicted to be negative: this is one end of the ROC curve. Again, you want to recognise and report this possibility while avoiding a division-by-zero error.
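A minimal sketch of how these edge cases could be handled, assuming the same column names as in our data set (returning NA is a design choice of this sketch, not something required by the assignment):

# F1 helper that returns NA instead of dividing by zero when there are
# no positive cases at all (TP, Fn and FP all zero)
f1_safe <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  Fn <- sum(df$class == 1 & df$scored.class == 0)
  FP <- sum(df$class == 0 & df$scored.class == 1)
  denom <- 2*TP + Fn + FP
  if (denom == 0) return(NA_real_)
  (2*TP)/denom
}

#f1_safe(classes.df) should match f1_reduced on this data set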

  10. Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.
library(ggplot2)
classification <- subset(classes.df, select = c(scored.probability, class))

myroc <- function(df)
{
  #define the sequence of thresholds
  threshold <- seq(0, 1, by = 0.01)

  #define a data frame to hold the predicted classes at each threshold
  prediction.df <- data.frame(row.names = 1:nrow(df))
  x <- seq_along(threshold)
  y <- seq_along(threshold)

  for (i in threshold)
  {
    prob.df <- as.numeric(df$scored.probability > i)
    prediction.df <- cbind(prediction.df, prob.df)
  }

  for (j in 1:length(threshold))
  {
    classes.type.df <- factor(df$class, levels = c(0, 1))
    prediction_types.df <- factor(prediction.df[, j], levels = c(0, 1))
    collector.table <- table(classes.type.df, prediction_types.df)

    #rather than introduce more nested functions based on our previous definitions of these
    #parameters, we can get them from the entries in the confusion matrix. The code is not as verbose.
    sensitivity_a <- collector.table[2, 2] / (collector.table[2, 2] + collector.table[2, 1])
    specificity_a <- collector.table[1, 1] / (collector.table[1, 1] + collector.table[1, 2])
    y[j] <- sensitivity_a
    x[j] <- 1 - specificity_a
  }

  roc.df <- data.frame(false_positive_rate = x, true_positive_rate = y)
  roc.visuals <- ggplot(roc.df, aes(x = false_positive_rate, y = true_positive_rate)) + geom_step()
  roc.visuals <- roc.visuals + geom_abline(slope = 1, intercept = 0, colour = "blue", lty = 2)

  #rank-based AUC: one minus the fraction of (positive, negative) pairs that are ordered incorrectly
  myauc <- function(outcome, proba)
  {
    n <- length(proba)
    positives_sum <- sum(outcome)
    df <- data.frame(out = outcome, prob = proba)
    df <- df[order(-df$prob), ]
    df$above <- (1:n) - cumsum(df$out)
    return(1 - sum(df$above * df$out) / (positives_sum * (n - positives_sum)))
  }

  auc_final_plot <- myauc(df$class, df$scored.probability)
  results <- list("Plot" = roc.visuals, "Area under curve" = auc_final_plot)
  results
}

#generate the plot
#myroc(classification)
  11. Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

We essentially take the functions we made above and consolidate all the output for this question. Since we also calculated each value by hand, these values can easily be verified.

accuracy(classes.df);
## [1] 0.8066298
error(classes.df);
## [1] 0.1933702
precision(classes.df);
## [1] 0.84375
sensitivity(classes.df);
## [1] 0.4736842
specificity(classes.df);
## [1] 0.9596774
F1(classes.df);
## [1] 0.6067416
myroc(classification)
## $Plot

## 
## $`Area under curve`
## [1] 0.8503113
  12. Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

Confusion matrix

library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked _by_ '.GlobalEnv':
## 
##     precision, sensitivity, specificity
matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])

#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix
## Confusion Matrix and Statistics
## 
##             class
## scored.class   1   0
##            1  27   5
##            0  30 119
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 
#Confusion matrix table
#caret_matrix$table

The values from the caret confusion matrix match those generated from the user-defined functions.
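The question also mentions caret's sensitivity and specificity functions. A minimal sketch of how they could be applied here (the factor conversion and the positive/negative level arguments are my assumptions about how this data set maps onto caret's interface; the caret:: prefix is needed because our own functions mask caret's):

# caret's sensitivity()/specificity() expect factors for the predictions and the reference
pred <- factor(classes.df$scored.class, levels = c(1, 0))
ref  <- factor(classes.df$class, levels = c(1, 0))

caret::sensitivity(pred, ref, positive = "1")   # should match our sensitivity function
caret::specificity(pred, ref, negative = "0")   # should match our specificity function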

  13. Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?
library(pROC)
## Warning: package 'pROC' was built under R version 3.4.4
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");

auc(roc(classes.df$class, classes.df$scored.probability))
## Area under the curve: 0.8503

The difference between the plot generated by my function and the one from pROC is marginal, and the AUC values agree (0.8503 to the precision reported by pROC).
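A quick numerical check of that agreement (a small sketch reusing the objects defined above; pROC prints a rounded value, so the computed difference should simply be negligible):

# compare the hand-rolled AUC with pROC's estimate
my_auc <- myroc(classification)$`Area under curve`
proc_auc <- as.numeric(auc(roc(classes.df$class, classes.df$scored.probability)))
abs(my_auc - proc_auc)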

APPENDIX

#1
url <- 'https://raw.githubusercontent.com/vindication09/DATA-621-Week-2/master/classification-output-data.csv'
classification_data <- read.csv(url, header = TRUE)
head(classification_data,10)

#2
classification_data2<-subset(classification_data,select=c(class,scored.class, scored.probability))
head(classification_data2,10)

matrix.df<-with(classification_data2, table(scored.class, class)[2:1,2:1])
matrix.df

#3
#convert the subset data into a data frame
classes.df <- data.frame(classification_data2)

#define the conditions for each variable
#(1,1) pairing pertains to true positives
#(0,0) pairing pertains to true negatives
accuracy <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  Tn <- sum(df$class == 0 & df$scored.class == 0)
  (TP + Tn)/nrow(df)
}

accuracy(classes.df)


#4

#define the conditions for each variable 
#(0,1) pairing pertains to false positives
#(1,0) pairing pertains to false negatives 

error <- function(df)
{
  FP <- sum(df$class == 0 & df$scored.class == 1)
  Fn <- sum(df$class == 1 & df$scored.class == 0)
  (FP + Fn)/nrow(df)
}

error(classes.df)

#5
#our variable pairings are similar to the previous functions we have written
classes.df <- data.frame(classification_data2)

precision <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  FP <- sum(df$class == 0 & df$scored.class == 1)
  TP/(TP + FP)
}

precision(classes.df)


#6
classes.df<-data.frame(classification_data2)
#our variable pairings are similar to the previous functions we have written

sensitivity <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  Fn <- sum(df$class == 1 & df$scored.class == 0)
  TP/(TP + Fn)
}

sensitivity(classes.df)



#7
#our variable pairings are similar to the previous functions we have written

specificity <- function(df)
{
  Tn <- sum(df$class == 0 & df$scored.class == 0)
  FP <- sum(df$class == 0 & df$scored.class == 1)
  Tn/(Tn + FP)
}

specificity(classes.df)



#8
#take the previous functions and nest them into a function for the F1 score

F1 <- function(df)
{
  precision1 <- function(df)
  {
    TP <- sum(df$class == 1 & df$scored.class == 1)
    FP <- sum(df$class == 0 & df$scored.class == 1)
    TP/(TP + FP)
  }

  sensitivity1 <- function(df)
  {
    TP <- sum(df$class == 1 & df$scored.class == 1)
    Fn <- sum(df$class == 1 & df$scored.class == 0)
    TP/(TP + Fn)
  }

  (2 * precision1(df) * sensitivity1(df))/(precision1(df) + sensitivity1(df))
}

F1(classes.df)


#our variable pairings are similar to the previous functions we have written

f1_reduced <- function(df)
{
  TP <- sum(df$class == 1 & df$scored.class == 1)
  Fn <- sum(df$class == 1 & df$scored.class == 0)
  FP <- sum(df$class == 0 & df$scored.class == 1)
  (2*TP)/(2*TP + Fn + FP)
}

f1_reduced(classes.df)



#10
library(ggplot2)
classification <- subset(classes.df, select = c(scored.probability, class))

myroc <- function(df)
{
  #define the sequence of thresholds
  threshold <- seq(0, 1, by = 0.01)

  #define a data frame to hold the predicted classes at each threshold
  prediction.df <- data.frame(row.names = 1:nrow(df))
  x <- seq_along(threshold)
  y <- seq_along(threshold)

  for (i in threshold)
  {
    prob.df <- as.numeric(df$scored.probability > i)
    prediction.df <- cbind(prediction.df, prob.df)
  }

  for (j in 1:length(threshold))
  {
    classes.type.df <- factor(df$class, levels = c(0, 1))
    prediction_types.df <- factor(prediction.df[, j], levels = c(0, 1))
    collector.table <- table(classes.type.df, prediction_types.df)

    #rather than introduce more nested functions based on our previous definitions of these
    #parameters, we can get them from the entries in the confusion matrix. The code is not as verbose.
    sensitivity_a <- collector.table[2, 2] / (collector.table[2, 2] + collector.table[2, 1])
    specificity_a <- collector.table[1, 1] / (collector.table[1, 1] + collector.table[1, 2])
    y[j] <- sensitivity_a
    x[j] <- 1 - specificity_a
  }

  roc.df <- data.frame(false_positive_rate = x, true_positive_rate = y)
  roc.visuals <- ggplot(roc.df, aes(x = false_positive_rate, y = true_positive_rate)) + geom_step()
  roc.visuals <- roc.visuals + geom_abline(slope = 1, intercept = 0, colour = "blue", lty = 2)

  #rank-based AUC: one minus the fraction of (positive, negative) pairs that are ordered incorrectly
  myauc <- function(outcome, proba)
  {
    n <- length(proba)
    positives_sum <- sum(outcome)
    df <- data.frame(out = outcome, prob = proba)
    df <- df[order(-df$prob), ]
    df$above <- (1:n) - cumsum(df$out)
    return(1 - sum(df$above * df$out) / (positives_sum * (n - positives_sum)))
  }

  auc_final_plot <- myauc(df$class, df$scored.probability)
  results <- list("Plot" = roc.visuals, "Area under curve" = auc_final_plot)
  results
}

#generate the plot
myroc(classification)

#11
accuracy(classes.df);
error(classes.df);
precision(classes.df);
sensitivity(classes.df);
specificity(classes.df);
F1(classes.df);
myroc(classification)


#12
library(caret)

matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])

#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix

#Confusion matrix table
caret_matrix$table


#13
library(pROC)
plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");
auc(roc(classes.df$class, classes.df$scored.probability))