1. Download the classification output data set (attached in Canvas to the assignment).

library(readxl)
Data <- read_excel("~/Big Data Econometrics/Kirk_Individual_Project_2/Data.xlsx", col_types = c("numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric"))
View(Data)

2. Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

attach(Data)
class_columns = c("scored.class","class")
class_data = Data[class_columns]
View(class_data)
conmatrx = table(class_data)
conmatrx
##             class
## scored.class   0   1
##            0 119  30
##            1   5  27
# The rows represent the predicted classifications and the columns represent the actual classifications.
TN = conmatrx[1,1] # predicted 0, actual 0
FN = conmatrx[1,2] # predicted 0, actual 1
FP = conmatrx[2,1] # predicted 1, actual 0
TP = conmatrx[2,2] # predicted 1, actual 1
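
As a quick check on that orientation, the cell counts can be compared against direct tallies from the two columns; this is a small sanity check I am adding, using only base R.

# Each cell should match a direct count from the data
sum(Data$scored.class == 1 & Data$class == 1)  # should equal TP (27)
sum(Data$scored.class == 0 & Data$class == 1)  # should equal FN (30)
# addmargins() appends row and column totals, which also makes the orientation easy to read
addmargins(conmatrx)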

3. Accuracy

Accuracy = (TN+TP)/(TN+FN+FP+TP)
Accuracy
## [1] 0.8066298
(119+27)/(119+30+5+27)
## [1] 0.8066298

4. Classification Error Rate

CER = (FP+FN)/(TN+FN+FP+TP)
CER
## [1] 0.1933702
(5+30)/(5+30+119+27)
## [1] 0.1933702
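
Because every observation is either classified correctly or incorrectly, the classification error rate is simply the complement of accuracy, which gives a quick sanity check:

# Accuracy and the classification error rate partition all observations, so they sum to 1
Accuracy + CER
1 - Accuracy  # identical to CER above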

5. Precision

Precision = TP/(TP+FP)
Precision
## [1] 0.84375
27/(27+5)
## [1] 0.84375

6. Sensitivity

Sensitivity = TP/(TP+FN)
Sensitivity
## [1] 0.4736842
27/(27+30)
## [1] 0.4736842

7. Specificity

Specificity = TN/(TN+FP)
Specificity
## [1] 0.9596774
119/(119+5)
## [1] 0.9596774

8. F1 Score

F1_Score = (2*Precision*Sensitivity)/(Precision+Sensitivity)
F1_Score
## [1] 0.6067416
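
Since question 11 below asks for these metrics from created R functions, one option is to wrap the calculations above into helpers that take the scored data frame directly. The sketch below is my own packaging (the helper names are hypothetical), assuming the data frame has scored.class and class columns as in this data set.

# Hypothetical helpers (not part of the original code) that rebuild the confusion-matrix
# cells from the data frame and apply the same formulas as above
conf_cells <- function(df) {
  tab <- table(df$scored.class, df$class)  # rows = predicted, columns = actual
  list(TN = tab[1, 1], FN = tab[1, 2], FP = tab[2, 1], TP = tab[2, 2])
}
accuracy_fn    <- function(df) { m <- conf_cells(df); (m$TN + m$TP) / (m$TN + m$FN + m$FP + m$TP) }
error_rate_fn  <- function(df) { m <- conf_cells(df); (m$FP + m$FN) / (m$TN + m$FN + m$FP + m$TP) }
precision_fn   <- function(df) { m <- conf_cells(df); m$TP / (m$TP + m$FP) }
sensitivity_fn <- function(df) { m <- conf_cells(df); m$TP / (m$TP + m$FN) }
specificity_fn <- function(df) { m <- conf_cells(df); m$TN / (m$TN + m$FP) }
f1_fn <- function(df) {
  p <- precision_fn(df)
  s <- sensitivity_fn(df)
  (2 * p * s) / (p + s)
}

Each helper should reproduce the corresponding scalar computed above (e.g., accuracy_fn(Data) should return the same value as Accuracy).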

9. Let’s consider the following question: What are the bounds on the F1 score? Show that the F1 score will always be between 0 and 1. (Hint: If 0 < a < 1 and 0 < b < 1 then ab < a)

# Precision and Sensitivity both fall between 0 and 1.
# Following the hint: if 0 < a < 1 and 0 < b < 1, then ab < a and ab < b.
# So Precision*Sensitivity is no larger than either input, which means
# 2*Precision*Sensitivity <= Precision + Sensitivity and therefore F1 <= 1.
# Since the numerator and denominator are both positive, F1 > 0 as well.
# The boundary cases can be checked numerically. Let's start by setting both to 1:

2*1*1/(1+1)
## [1] 1
# Now set both to 0
2*0*0/(0+0)
## [1] NaN
# Let's set Precision to 0 and Sensitivity to 1 (The answer will be the same if we swap the variables)
2*1*0/(1+0)
## [1] 0
# The extreme cases agree with the algebraic argument above: the F1 score is always
# bounded between 0 and 1, and it is only undefined in the degenerate case where
# Precision and Sensitivity are both 0.
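
To illustrate the bound away from the extremes as well, a small grid sweep over interior values of Precision and Sensitivity can confirm the computed F1 never leaves the interval (this check is an addition of mine, not part of the formal argument):

# Evaluate F1 on a grid of Precision/Sensitivity values strictly between 0 and 1
grid <- expand.grid(p = seq(0.01, 0.99, by = 0.01), s = seq(0.01, 0.99, by = 0.01))
f1_grid <- with(grid, (2 * p * s) / (p + s))
range(f1_grid)  # stays strictly between 0 and 1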

10. Write a function that generates an ROC curve from a data set with a true classification column (i.e., class) and a probability column (i.e., scored.probability). Your function should return the plot of the ROC curve and the calculated area under the ROC curve (AUC). Note that I recommend using a sequence of thresholds ranging from 0 to 1 at 0.01 intervals.

library(ggplot2)
ROC_curve <- function(x){
  # Sweep classification thresholds from 0 to 1 in 0.01 steps
  thresh <- seq(0, 1, by = 0.01)
  TPR <- c()
  FPR <- c()
  for (i in 1:length(thresh)){
    # Classify each observation as 1 if its scored probability meets the threshold
    scoredclass <- ifelse(x$scored.probability >= thresh[i], 1, 0)
    rev_df <- data.frame(scored.class = scoredclass, class = x$class)
    df_table <- with(rev_df, table(scored.class, class))
    # Cells are indexed column-wise: [1] = TN, [2] = FP, [3] = FN, [4] = TP.
    # Thresholds at which every observation lands in a single predicted class produce a
    # one-row table, so the true positive rate comes back NA and those thresholds are
    # dropped from the plot and the AUC sum.
    TPR[i] <- df_table[4] / (df_table[4] + df_table[3])
    FPR[i] <- df_table[2] / (df_table[2] + df_table[1])
  }
  plot_df <- data.frame(TPR, FPR)
  ROC_plot <- ggplot(plot_df, aes(x = FPR, y = TPR)) +
    geom_point() +
    geom_line(col = "blue") +
    geom_abline(intercept = 0, slope = 1) +
    labs(title = "ROC", x = "1 - Specificity", y = "Sensitivity")

  # Approximate the AUC with the trapezoid rule over the (FPR, TPR) points
  height <- (plot_df$TPR[-1] + plot_df$TPR[-length(plot_df$TPR)]) / 2
  width <- -diff(plot_df$FPR)
  AUC <- sum(height * width, na.rm = TRUE)

  # Return both the plot and the AUC in a single list
  return(list(ROC_plot = ROC_plot, AUC = round(AUC, 4)))
}
ROC_list <- ROC_curve(Data)
ROC_list$ROC_plot
## Warning: Removed 9 rows containing missing values (geom_point).

ROC_list$AUC
## [1] 0.8503

11. Use your created R functions and the provided classification output data set to produce all of the classification metrics discussed above.

list(Accuracy = Accuracy, Classification_Error_Rate = CER, Precision = Precision, Sensitivity = Sensitivity, Specificity = Specificity, F1_Score = F1_Score)
## $Accuracy
## [1] 0.8066298
## 
## $Classification_Error_Rate
## [1] 0.1933702
## 
## $Precision
## [1] 0.84375
## 
## $Sensitivity
## [1] 0.4736842
## 
## $Specificity
## [1] 0.9596774
## 
## $F1_Score
## [1] 0.6067416
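
If the hypothetical helper functions sketched after the F1 section are used instead, the same list can be produced directly from the data frame:

# Equivalent list built from the hypothetical helpers defined earlier
list(Accuracy = accuracy_fn(Data),
     Classification_Error_Rate = error_rate_fn(Data),
     Precision = precision_fn(Data),
     Sensitivity = sensitivity_fn(Data),
     Specificity = specificity_fn(Data),
     F1_Score = f1_fn(Data))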

12. Investigate the caret package. In particular, consider the functions confusionMatrix, sensitivity, and specificity. Apply the functions to the data set. How do the results compare with your own functions?

library(caret)
## Loading required package: lattice
?confusionMatrix
## starting httpd help server ...
##  done
library(e1071)
# Reorder the table so class 1 appears first and caret treats it as the positive class
CM_table <- with(Data, table(scored.class, class)[2:1,2:1])
confusionMatrix(CM_table)
## Confusion Matrix and Statistics
## 
##             class
## scored.class   1   0
##            1  27   5
##            0  30 119
##                                           
##                Accuracy : 0.8066          
##                  95% CI : (0.7415, 0.8615)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.0001712       
##                                           
##                   Kappa : 0.4916          
##                                           
##  Mcnemar's Test P-Value : 4.976e-05       
##                                           
##             Sensitivity : 0.4737          
##             Specificity : 0.9597          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3149          
##          Detection Rate : 0.1492          
##    Detection Prevalence : 0.1768          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
## 
sensitivity(CM_table)
## [1] 0.4736842
specificity(CM_table)
## [1] 0.9596774
# These results match the values from my own calculations above.
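
confusionMatrix() can also be called on the prediction and reference vectors directly (coerced to factors) instead of a pre-built table; setting positive = "1" keeps the positive class consistent with the reordered table above.

# Alternative call on factors; positive = "1" marks class 1 as the positive class
confusionMatrix(data = factor(Data$scored.class, levels = c(0, 1)),
                reference = factor(Data$class, levels = c(0, 1)),
                positive = "1")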

13. Investigate the pROC package. Use it to generate an ROC curve for the data set. How do the results compare with your own functions?

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
myROC <- roc(Data$class, Data$scored.probability, plot=T, asp=NA, main = "ROC Curve", ret="tp")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

myROC["auc"]
## $auc
## Area under the curve: 0.8503
# These results match the AUC and ROC curve from my ROC_curve function above.
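
pROC also exposes the area directly through auc() and can redraw the curve with ggplot2 via ggroc(), which makes a visual comparison with the manual plot straightforward (a small optional addition):

# auc() extracts the same area reported above; ggroc() plots the roc object with ggplot2
auc(myROC)
ggroc(myROC) + labs(title = "ROC Curve (pROC)")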