DATA 621 – Business Analytics and Data Mining

Homework #2

Critical Thinking Group 2

Elina Azrilyan

March 8th, 2020

Step 1.

Download the classification output data set.

df <- read.csv("https://raw.githubusercontent.com/che10vek/Data621/master/classification-output-data.csv")
head(df,10)

##    pregnant glucose diastolic skinfold insulin  bmi pedigree age class
## 1         7     124        70       33     215 25.5    0.161  37     0
## 2         2     122        76       27     200 35.9    0.483  26     0
## 3         3     107        62       13      48 22.9    0.678  23     1
## 4         1      91        64       24       0 29.2    0.192  21     0
## 5         4      83        86       19       0 29.3    0.317  34     0
## 6         1     100        74       12      46 19.5    0.149  28     0
## 7         9      89        62        0       0 22.5    0.142  33     0
## 8         8     120        78        0       0 25.0    0.409  64     0
## 9         1      79        60       42      48 43.5    0.678  23     0
## 10        2     123        48       32     165 42.1    0.520  26     0
##    scored.class scored.probability
## 1             0         0.32845226
## 2             0         0.27319044
## 3             0         0.10966039
## 4             0         0.05599835
## 5             0         0.10049072
## 6             0         0.05515460
## 7             0         0.10711542
## 8             0         0.45994744
## 9             0         0.11702368
## 10            0         0.31536320

Step 2.

The data set has three key columns we will use:

- class: the actual class for the observation
- scored.class: the predicted class for the observation (based on a threshold of 0.5)
- scored.probability: the predicted probability of success for the observation

Use the table() function to get the raw confusion matrix for this scored dataset. Make sure you understand the output. In particular, do the rows represent the actual or predicted class? The columns?

cmtable<-table(df$class, df$scored.class)
cmtable

##    
##       0   1
##   0 119   5
##   1  30  27

Interpreting the output of this confusion matrix: the rows represent actual values, and the columns represent predicted values, because table() uses its first argument for rows and its second for columns. Let’s rename the rows and columns to make this clearer.

colnames(cmtable) <- c("Predicted No", "Predicted Yes")
rownames(cmtable) <- c("Actual No", "Actual Yes")
cmtable

##             
##              Predicted No Predicted Yes
##   Actual No           119             5
##   Actual Yes           30            27
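As an alternative to renaming after the fact, the orientation can be made visible when the table is built. The sketch below assumes two small hypothetical vectors (not the homework data) and passes them to table() as named arguments, so that each dimension carries its own label.

```r
# Hypothetical toy vectors (not the homework data), used only to show labeling.
actual    <- c(0, 0, 1, 1, 1, 0)
predicted <- c(0, 1, 1, 0, 1, 0)

# Naming the arguments makes table() label each dimension,
# so the row/column orientation is visible in the printed output.
cm <- table(Actual = actual, Predicted = predicted)
cm

# The first argument becomes the rows, the second the columns.
names(dimnames(cm))
```

With labeled dimensions, swapping the argument order changes the printed labels as well, which makes an accidental transposition easier to spot.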

Step 3.

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the accuracy of the predictions.

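get_accuracy <- function(actual, predicted)
{
  # Rows are predicted classes and columns are actual classes here,
  # the reverse of the orientation used in Step 2.
  cmtable <- as.matrix(table(predicted, actual))
  TN <- cmtable[1,1]  # predicted 0, actual 0
  FN <- cmtable[1,2]  # predicted 0, actual 1
  FP <- cmtable[2,1]  # predicted 1, actual 0
  TP <- cmtable[2,2]  # predicted 1, actual 1
  return((TP + TN) / (TN + FN + TP + FP))
}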
get_accuracy(df$class, df$scored.class)

## [1] 0.8066298
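As a check on the function, the same figure can be worked out by hand from the confusion matrix in Step 2, reading TP = 27, TN = 119, FP = 5 and FN = 30 from that table:

$$
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} = \frac{27 + 119}{27 + 5 + 119 + 30} = \frac{146}{181} \approx 0.8066
$$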

Step 4.

Write a function that takes the data set as a dataframe, with actual and predicted classifications identified, and returns the classification error rate of the predictions.

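get_class_err_rate <- function(actual, predicted)
{
  # Same orientation as in get_accuracy: rows predicted, columns actual.
  cmtable <- as.matrix(table(predicted, actual))
  TN <- cmtable[1,1]  # predicted 0, actual 0
  FN <- cmtable[1,2]  # predicted 0, actual 1
  FP <- cmtable[2,1]  # predicted 1, actual 0
  TP <- cmtable[2,2]  # predicted 1, actual 1
  return((FP + FN) / (TN + FN + TP + FP))
}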
get_class_err_rate(df$class, df$scored.class)

## [1] 0.1933702
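The error rate can be verified by hand in the same way, using the misclassified counts from the Step 2 confusion matrix (FP = 5, FN = 30) over all 181 observations:

$$
\text{Classification error rate} = \frac{FP + FN}{TP + FP + TN + FN} = \frac{5 + 30}{27 + 5 + 119 + 30} = \frac{35}{181} \approx 0.1934
$$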

Verify that you get an accuracy and an error rate that sums to one.

get_accuracy(df$class, df$scored.class) + get_class_err_rate(df$class, df$scored.class)

## [1] 1
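The two rates sum to one because every observation is either classified correctly or not, so the numerators TP + TN and FP + FN together cover the whole denominator. A minimal sketch of an independent cross-check follows; it does not use the functions above. Instead it rebuilds label vectors from the Step 2 counts (an assumption made only to keep the example self-contained) and computes both rates directly as proportions.

```r
# Rebuild actual/predicted vectors from the confusion-matrix counts in Step 2
# (119 true negatives, 5 false positives, 30 false negatives, 27 true positives).
actual    <- c(rep(0, 119), rep(0, 5), rep(1, 30), rep(1, 27))
predicted <- c(rep(0, 119), rep(1, 5), rep(0, 30), rep(1, 27))

# The mean of a logical vector is the proportion of TRUE values.
accuracy   <- mean(actual == predicted)  # share of matching labels
error_rate <- mean(actual != predicted)  # share of mismatched labels

accuracy
error_rate
accuracy + error_rate
```

Computed this way, the results should agree with the values returned by get_accuracy and get_class_err_rate, since both approaches count the same matches and mismatches.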