Packages

library(data.table)

Question 1

Download the classification output data set (attached in Blackboard to the assignment).

Data = fread("https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/classification-output-data.csv", select = c("class","scored.class","scored.probability"))

Question 2

The data set has three key columns we will use:

  • class: the actual class for the observation

  • scored.class: the predicted class for the observation (based on a threshold of 0.5)

  • scored.probability: the predicted probability of success for the observation

In the table() output below, which tabulates scored.class against class, the rows represent the predicted class and the columns represent the actual class.

str(Data)
## Classes 'data.table' and 'data.frame':   181 obs. of  3 variables:
##  $ class             : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ scored.class      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ scored.probability: num  0.328 0.273 0.11 0.056 0.1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(Data)
##      class         scored.class    scored.probability
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.02323   
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.11702   
##  Median :0.0000   Median :0.0000   Median :0.23999   
##  Mean   :0.3149   Mean   :0.1768   Mean   :0.30373   
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.43093   
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.94633
table(Data$scored.class,Data$class)
##    
##       0   1
##   0 119  30
##   1   5  27
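To make the orientation explicit, the same counts can be pulled out with labelled dimensions. This is an illustrative sketch; the Predicted/Actual names and the TP/TN/FP/FN variables are mine, not part of the assignment.

cm = table(Predicted = Data$scored.class, Actual = Data$class)
TN = cm["0", "0"]   # 119 non-events correctly predicted as 0
FP = cm["1", "0"]   #   5 non-events incorrectly predicted as 1
FN = cm["0", "1"]   #  30 events incorrectly predicted as 0
TP = cm["1", "1"]   #  27 events correctly predicted as 1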

Question 3

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions. \[Accuracy=\frac{TP+TN}{TP+FP+TN+FN}\]

Accuracy_ = function(input) {
  # confusion matrix: rows = actual class, columns = predicted class
  tb = table(input$class,input$scored.class)
  TN=tb[1,1]  # actual 0, predicted 0
  TP=tb[2,2]  # actual 1, predicted 1
  FN=tb[2,1]  # actual 1, predicted 0
  FP=tb[1,2]  # actual 0, predicted 1
  
  return((TP+TN)/(TP+FP+TN+FN))
  
}

Testing the Accuracy function yields 0.8066298
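Using the counts from the confusion matrix in Question 2 (TP = 27, TN = 119, FP = 5, FN = 30), this works out to \[\frac{27+119}{27+5+119+30}=\frac{146}{181}\approx 0.8066\]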

Question 4

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

\[\text{Classification Error Rate}=\frac{FP+FN}{TP+FP+TN+FN}\]

CER_ = function(input) {
  tb = table(input$class,input$scored.class)
  TN=tb[1,1]
  TP=tb[2,2]
  FN=tb[2,1]
  FP=tb[1,2]
  
  return((FP+FN)/(TP+FP+TN+FN))
  
}

Testing the Classification Error Rate function yields 0.1933702
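With the same counts, the error rate is the complement of the accuracy: \[\frac{5+30}{181}=\frac{35}{181}\approx 0.1934\]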

Question 5

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions. \[Precision=\frac{TP}{TP+FP}\]

Precision_ = function(input) {
  tb = table(input$class,input$scored.class)
  
  TP=tb[2,2]
  FP=tb[1,2]
  
  return((TP)/(TP+FP))
  
}

Testing the Precision function yields 0.84375
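From the confusion matrix in Question 2, \[\frac{27}{27+5}=\frac{27}{32}=0.84375\]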

Question 6

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions. Sensitivity is also known as recall. \[Sensitivity=\frac{TP}{TP+FN}\]

Sensitivity_ = function(input) {
  tb = table(input$class,input$scored.class)
  
  TP=tb[2,2]
  FN=tb[2,1]
  
  return((TP)/(TP+FN))
  
}

Testing the Sensitivity function yields 0.4736842
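From the confusion matrix in Question 2, \[\frac{27}{27+30}=\frac{27}{57}\approx 0.4737\]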

Question 7

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions. \[Specificity=\frac{TN}{TN+FP}\]

Specificity_ = function(input) {
  tb = table(input$class,input$scored.class)
  TN=tb[1,1]
  FP=tb[1,2]
  
  return((TN)/(TN+FP))
  
}

Testing the Specificity function yields 0.9596774
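From the confusion matrix in Question 2, \[\frac{119}{119+5}=\frac{119}{124}\approx 0.9597\]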

Question 8

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions. \[\text{F1 Score}=\frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision}+\text{Sensitivity}}\]

F1_ = function(input) {
  tb = table(input$class,input$scored.class)
  TP=tb[2,2]
  FN=tb[2,1]
  FP=tb[1,2]
  
  Precision = (TP)/(TP+FP)
  Sensitivity = (TP)/(TP+FN)
  
  return((2*Precision*Sensitivity)/(Precision+Sensitivity))
  
}

Testing the F1 Score function yields 0.6067416
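Equivalently, using the precision and sensitivity computed above, \[\frac{2 \times 0.84375 \times 0.4737}{0.84375+0.4737}\approx 0.6067\]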

Question 9

Before we move on, let’s consider a question that was asked: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1.

Both quantities used to calculate the F1 score, precision and sensitivity, are bounded between 0 and 1, since each is the ratio of TP to a denominator that is at least as large as TP. Because every term is non-negative, the F1 score cannot be negative. And since the product of two numbers in [0, 1] is no larger than either factor, \[2 \times \text{Precision} \times \text{Sensitivity} \le \text{Precision}+\text{Sensitivity},\] so the ratio defining the F1 score can never exceed 1. Therefore the F1 score always lies between 0 and 1.
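As a quick numerical check (an illustrative sketch over a grid of hypothetical precision and sensitivity values, not the data set), the F1 value never leaves the unit interval:

p = seq(0.01, 1, by = 0.01)
grid = expand.grid(Precision = p, Sensitivity = p)
F1 = with(grid, 2 * Precision * Sensitivity / (Precision + Sensitivity))
range(F1)  # smallest and largest F1 on the grid both fall inside (0, 1]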

Question 10

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example), and that reports the calculated area under the curve (AUC).

MY_ROC = function(labels, scores){
  # order the actual labels by predicted probability, highest first
  labels = labels[order(scores, decreasing=TRUE)]
  # cumulative true positive and false positive rates across thresholds
  result = data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)
  
  # area under the curve via the trapezoidal rule
  dFPR = c(diff(result$FPR), 0)
  dTPR = c(diff(result$TPR), 0)
  AUC = round(sum(result$TPR * dFPR) + sum(dTPR * dFPR)/2,4)

  plot(result$FPR,result$TPR,type="l",main ="ROC Curve",ylab="Sensitivity",xlab="1-Specificity")
  abline(a=0,b=1)
  legend(.6,.2,AUC,title = "AUC")
  
}

MY_ROC(Data$class,Data$scored.probability)

Question 11

Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.
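The values below come from applying the functions written in Questions 3 through 8 to the data set; a call along these lines reproduces them (a sketch, assuming those functions and Data are already in the workspace):

metrics = c(Accuracy = Accuracy_(Data),
            ClassificationErrorRate = CER_(Data),
            Precision = Precision_(Data),
            Sensitivity = Sensitivity_(Data),
            Specificity = Specificity_(Data),
            F1Score = F1_(Data))
round(metrics, 7)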

Accuracy: 0.8066298

Classification Error Rate: 0.1933702

Precision: 0.84375

Sensitivity: 0.4736842

Specificity: 0.9596774

F1 Score: 0.6067416

Question 12

Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

My functions return the same values as the corresponding caret functions, as the output below shows.

library("caret")

# confusionMatrix expects factor inputs, so the integer columns are converted
car = confusionMatrix(factor(Data$scored.class), factor(Data$class), positive='1')

#print table
car$table
##           Reference
## Prediction   0   1
##          0 119  30
##          1   5  27
car$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.4736842            0.9596774            0.8437500 
##       Neg Pred Value            Precision               Recall 
##            0.7986577            0.8437500            0.4736842 
##                   F1           Prevalence       Detection Rate 
##            0.6067416            0.3149171            0.1491713 
## Detection Prevalence    Balanced Accuracy 
##            0.1767956            0.7166808
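The standalone sensitivity and specificity functions mentioned in the question can also be called directly. A minimal sketch, assuming the class columns are converted to factors, with "1" as the positive class and "0" as the negative class:

# caret's standalone helpers; the factor() conversion and the
# positive/negative labels are assumptions based on this data set's coding
sensitivity(factor(Data$scored.class), factor(Data$class), positive = "1")
specificity(factor(Data$scored.class), factor(Data$class), negative = "0")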

For comparison, the functions written in Questions 3 through 8 return:

Accuracy: 0.8066298

Classification Error Rate: 0.1933702

Precision: 0.84375

Sensitivity: 0.4736842

Specificity: 0.9596774

F1 Score: 0.6067416

Question 13

Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

The curve and AUC produced by my MY_ROC function are similar to those produced by the pROC package, as the side-by-side plots below show.

library("pROC")

par(mfrow=c(1,2))
plot(roc(Data$class,Data$scored.probability),print.auc=TRUE)
MY_ROC(Data$class,Data$scored.probability)
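For a direct numeric comparison with the AUC shown in the MY_ROC legend, pROC's auc function can be applied to the same roc object (a brief sketch):

# AUC as computed by pROC, for comparison with MY_ROC's trapezoidal estimate
auc(roc(Data$class, Data$scored.probability))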