A. Classification output data set

1. Download

Read in the classification data set and display its column names.

# Read the data
df <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA621_FA18/master/HW2/classification-output-data.csv", stringsAsFactors = FALSE)

names(df)
##  [1] "pregnant"           "glucose"            "diastolic"         
##  [4] "skinfold"           "insulin"            "bmi"               
##  [7] "pedigree"           "age"                "class"             
## [10] "scored.class"       "scored.probability"

2. Table function

The three columns needed for this assignment are:

  1. class: the actual class for the observation
  2. scored.class: the predicted class for the observation (based on a threshold of 0.5)
  3. scored.probability: the predicted probability of success for the observation

The table function creates a confusion matrix; with scored.class passed as the first argument and class as the second, the predicted class appears in the rows and the actual class in the columns.

table(df$scored.class,df$class)
##    
##       0   1
##   0 119  30
##   1   5  27
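
For reference in the functions below, with the predicted class in rows and the actual class in columns (level 0 before level 1), the four confusion-matrix counts sit in fixed cells of this table. A small illustration of the mapping (the metric functions repeat this extraction inline):

cm <- table(df$scored.class, df$class)
TP <- cm[2, 2]  # predicted 1, actual 1: 27
TN <- cm[1, 1]  # predicted 0, actual 0: 119
FP <- cm[2, 1]  # predicted 1, actual 0: 5
FN <- cm[1, 2]  # predicted 0, actual 1: 30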

B. Creating Classification Metrics functions

3. Accuracy

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

mAccuracy <- function(dframe, pclass, aclass){
  # Confusion matrix with predicted class in rows and actual class in columns
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  # Extract the four counts using the cell positions shown in section 2
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  A <- (fTP + fTN) / (fTP + fTN + fFP + fFN)
  return(A)
}
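
As a quick sanity check on mAccuracy (assuming both columns are coded 0/1, as they are here), accuracy is simply the proportion of rows where the predicted class matches the actual class, so the following one-liner should agree with the value reported in section 11:

mean(df$scored.class == df$class)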

4. Error

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

mError <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  E <- (fFP + fFN) / (fTP + fTN + fFP + fFN)
  return(E)
}
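
Since the classification error rate is the complement of accuracy, the two functions should sum to exactly 1; a quick check:

all.equal(mAccuracy(df, "scored.class", "class") + mError(df, "scored.class", "class"), 1)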

5. Precision

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the precision of the predictions.

mPrecision <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  P <- (fTP) / (fTP + fFP)
  return(P)
}
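
Precision can also be computed directly from the data without building the confusion matrix (again assuming 0/1 coding): it is the proportion of predicted positives that are actually positive, and should match mPrecision:

mean(df$class[df$scored.class == 1] == 1)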

6. Sensitivity

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the sensitivity of the predictions.

Sensitivity is also known as recall.

mSensitivity <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  Se <- (fTP) / (fTP + fFN)
  return(Se)
}

7. Specificity

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the specificity of the predictions.

mSpecificity <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  Sp <- (fTN) / (fTN + fFP)
  return(Sp)
}
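
Both rates can also be read directly off the class-conditional predictions (assuming 0/1 coding), which gives an independent check on mSensitivity and mSpecificity:

mean(df$scored.class[df$class == 1] == 1)  # sensitivity: correct among actual positives
mean(df$scored.class[df$class == 0] == 0)  # specificity: correct among actual negatives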

8. F1 Score

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the F1 score of the predictions.

mF1_score <- function(dframe, pclass, aclass){
  ft <- table(dframe[ ,pclass], dframe[ ,aclass])
  fTP <- ft[2,2]; fTN <- ft[1,1]; fFP <- ft[2,1]; fFN <- ft[1,2]
  P <- (fTP) / (fTP + fFP)
  Se <- (fTP) / (fTP + fFN)
  FS <- (2 * P * Se) / (P + Se)
  return(FS)
}
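
Because the F1 score is the harmonic mean of precision and sensitivity, it can also be assembled from the two functions defined above; this should match the mF1_score value reported in section 11:

P  <- mPrecision(df, "scored.class", "class")
Se <- mSensitivity(df, "scored.class", "class")
2 * P * Se / (P + Se)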

9. F1 Bounds

What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1 then ab < a.)

The F1 score is bounded between 0 and 1. Both precision (P) and sensitivity (Se) are ratios of non-negative counts, so each lies between 0 and 1, and the numerator and denominator of F1 = 2*P*Se / (P + Se) are both non-negative, which gives F1 >= 0; F1 approaches 0 as either P or Se approaches 0. For the upper bound, applying the hint with 0 < P <= 1 and 0 < Se <= 1 gives P*Se <= P and P*Se <= Se, so 2*P*Se <= P + Se and therefore F1 <= 1, with equality only when precision and sensitivity are both 1.
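
As a quick numerical illustration of these bounds, evaluating the F1 formula over a grid of precision and sensitivity values strictly between 0 and 1 never leaves the interval (0, 1):

grid <- expand.grid(P = seq(0.01, 0.99, 0.01), Se = seq(0.01, 0.99, 0.01))
F1 <- with(grid, 2 * P * Se / (P + Se))
range(F1)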

10. ROC Curve and AUC

Write a function that generates an ROC curve from a data set with a true classification column (class in our example) and a probability column (scored.probability in our example). Your function should return a list that includes the plot of the ROC curve and a vector that contains the calculated area under the curve (AUC).

mRoc_curve <- function(dframe, pprob, aclass){

  # Totals of actual positives and negatives
  allPos <- sum(dframe[ ,aclass] == 1, na.rm=TRUE)
  allNeg <- sum(dframe[ ,aclass] == 0, na.rm=TRUE)

  # Sweep the classification threshold from 0 to 1 in steps of 0.01
  threshold <- seq(0.0, 1.0, 0.01)
  x <- c()
  y <- c()

  for (i in seq_along(threshold)) {

    TP <- sum(dframe[ ,pprob] >= threshold[i] & dframe[ ,aclass] == 1, na.rm=TRUE)
    TN <- sum(dframe[ ,pprob] < threshold[i] & dframe[ ,aclass] == 0, na.rm=TRUE)

    y[i] <- TP / allPos        # true positive rate (sensitivity)
    x[i] <- 1 - (TN / allNeg)  # false positive rate (1 - specificity)
  }

  plot(x, y, type = "s",
       main = "ROC Curve",
       xlab = "False Positive Rate (1-Specificity)",
       ylab = "True Positive Rate")
  abline(0, 1)

  # Rectangle (step) approximation of the area under the curve
  xd <- c(0, abs(diff(x)))
  fAuc <- sum(xd * y)

  return(fAuc)

}
mRoc_curve(df, "scored.probability", "class")

## [1] 0.8438031
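
The AUC above is a rectangle (step) approximation over a 0.01 threshold grid. As a rough cross-check of how much that approximation matters before comparing against pROC in section 13, a trapezoidal version over the same grid can be computed outside the function (the variable names here are my own, not part of mRoc_curve):

thr <- seq(0, 1, 0.01)
tpr <- sapply(thr, function(t) mean(df$scored.probability[df$class == 1] >= t))
fpr <- sapply(thr, function(t) mean(df$scored.probability[df$class == 0] >= t))
# points run from (1, 1) at threshold 0 toward (0, 0) at threshold 1
sum(diff(-fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)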

11. Functions Output

Accuracy

mAccuracy(df, "scored.class", "class")
## [1] 0.8066298

Error

mError(df, "scored.class", "class")
## [1] 0.1933702

Precision

mPrecision(df, "scored.class", "class")
## [1] 0.84375

Sensitivity

mSensitivity(df, "scored.class", "class")
## [1] 0.4736842

Specificity

mSpecificity(df, "scored.class", "class")
## [1] 0.9596774

F1 Score

mF1_score(df, "scored.class", "class")
## [1] 0.6067416

12. Caret Package

Consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set.

library(caret)
pclass <- factor(df$scored.class,levels = c(1,0))
aclass <- factor(df$class,levels = c(1,0))
confusionMatrix(pclass, aclass)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1  27   5
##          0  30 119
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 
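
caret's standalone sensitivity and specificity functions can be applied with the same factor arguments; with the levels ordered c(1, 0), "1" is treated as the positive class by default, so these calls should reproduce the Sensitivity and Specificity rows above:

sensitivity(pclass, aclass)
specificity(pclass, aclass)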

Based on these results, the sensitivity (0.4737) and specificity (0.9597) reported by caret match the values produced by my functions above.

13. pROC Package

Use the pROC package to generate an ROC curve for the data set.

library(pROC)
plot(roc(df$class, df$scored.probability))

auc(df$class, df$scored.probability)
## Area under the curve: 0.8503

Based on the results, the pROC curve is essentially the same as the one produced by my function; the only difference is that pROC's default x-axis shows specificity running from 1 to 0, while my function plots the false positive rate from 0 to 1.

The AUC values are also very close: 0.8503 from pROC versus 0.8438 from my function, a difference of about 0.006, most likely due to the coarse 0.01 threshold grid and the rectangle approximation used in mRoc_curve.
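
For a side-by-side visual comparison, pROC can also be asked to plot the false positive rate from 0 to 1 the way my function does (assuming the installed pROC version supports the legacy.axes argument):

plot(roc(df$class, df$scored.probability), legacy.axes = TRUE)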