Evaluation of classification models

The limitations of accuracy

2-class problem: With imbalanced classes, accuracy hides the few examples the model gets wrong (ex: if 99% of examples belong to one class, a model that always predicts that class is 99% accurate but never detects the other class)
Overfitting: A model that fits a given dataset too closely and does not generalize to new data
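
To illustrate how accuracy can mislead on an imbalanced 2-class problem, here is a minimal sketch. The label counts (990 negatives, 10 positives) and the always-negative "model" are made up for illustration and are not from the notes.

```r
# Hypothetical imbalanced labels: 990 negatives, 10 positives
actual <- factor(c(rep("neg", 990), rep("pos", 10)), levels = c("neg", "pos"))

# A trivial "model" that always predicts the majority class
predicted <- factor(rep("neg", length(actual)), levels = c("neg", "pos"))

# Accuracy looks excellent...
accuracy <- mean(predicted == actual)

# ...but not a single positive example is detected
positives_found <- sum(predicted == "pos" & actual == "pos")

print(accuracy)         # 0.99
print(positives_found)  # 0
```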

Metrics for evaluation

Confusion matrix
a: True positive (TP)
b: False negative (FN)
c: False positive (FP)
d: True negative (TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (a + d) / (a + b + c + d)
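
A small sketch of the accuracy formula, assuming the usual layout with actual classes in rows and predicted classes in columns. The four counts are hypothetical.

```r
# Hypothetical 2 x 2 confusion matrix laid out as in the notes:
#   row 1: a (TP), b (FN)
#   row 2: c (FP), d (TN)
cm <- matrix(c(40, 10,
                5, 45),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("positive", "negative"),
                             predicted = c("positive", "negative")))

# Correct predictions sit on the diagonal (a and d)
accuracy <- sum(diag(cm)) / sum(cm)

print(cm)
print(accuracy)  # (40 + 45) / 100 = 0.85
```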

Precision metric
Precision = Number of correctly identified positives / Number of positive predictions (correct or not) = TP / (TP + FP)
Measures the fraction of positive predictions that are correct
With perfect precision, there are no false positives

Recall/sensitivity
Recall = Number of correctly identified positives / All actual positives (true positives and false negatives, which were supposed to be positives) = TP / (TP + FN)
With perfect recall, there are no false negatives
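
The two definitions can be checked on a toy example. The label vectors below are hypothetical (1 = positive, 0 = negative).

```r
# Hypothetical actual and predicted labels (1 = positive, 0 = negative)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

tp <- sum(predicted == 1 & actual == 1)  # correctly identified positives
fp <- sum(predicted == 1 & actual == 0)  # predicted positive, actually negative
fn <- sum(predicted == 0 & actual == 1)  # predicted negative, actually positive

precision <- tp / (tp + fp)  # 3 / 5 = 0.6
recall    <- tp / (tp + fn)  # 3 / 4 = 0.75

print(precision)
print(recall)
```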

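Precision and recall trade off against each other: improving one usually lowers the other
We decide which metric to use based on the cost of each kind of error (false positives vs. false negatives)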
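
One common way the tradeoff shows up is through the decision threshold applied to a model's scores. The sketch below assumes a model that outputs a score per example; the scores, labels, and the two thresholds are hypothetical.

```r
# Hypothetical model scores and true labels (1 = positive)
scores <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20)
actual <- c(1,    1,    0,    1,    0,    1,    0,    0)

# Precision and recall when everything scoring >= threshold is called positive
precision_recall <- function(threshold) {
  predicted <- as.integer(scores >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

strict  <- precision_recall(0.85)  # few positive predictions
lenient <- precision_recall(0.35)  # many positive predictions

print(strict)   # precision 1.00, recall 0.50
print(lenient)  # precision about 0.67, recall 1.00
```

A strict threshold suits cases where false positives are expensive; a lenient one suits cases where missing a positive is expensive.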

F1 Score
Harmonic mean of precision and recall: a balance of both that is high only when both are high
F1 = 2 * (precision * recall) / (precision + recall)
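
A short sketch of why the harmonic mean is used rather than the plain average. The helper `f1_score` is a hypothetical name, and returning 0 when both inputs are 0 is a common convention assumed here, not something stated in the notes.

```r
# F1 as the harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  # Assumed convention: report 0 instead of NaN when both inputs are 0
  if (precision + recall == 0) {
    return(0)
  }
  2 * precision * recall / (precision + recall)
}

# A lopsided model: high precision, poor recall
lopsided_f1   <- f1_score(0.9, 0.1)   # 0.18
lopsided_mean <- mean(c(0.9, 0.1))    # 0.5, hides the weak recall

# A balanced model
balanced_f1 <- f1_score(0.5, 0.5)     # 0.5

print(c(lopsided_f1, lopsided_mean, balanced_f1))
```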

Increasing model complexity

Decision trees: Allowing maximum depth to be higher
When only using one feature, the prediction error is higher
If model is too complex, reduce depth
If model is too simple, increase depth
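
As a rough sketch of how depth controls complexity, the following fits two rpart trees on R's built-in mtcars data (not the COVID data used later). The `minsplit` and `cp` values are assumptions chosen so that depth is the only limiting factor.

```r
library(rpart)

# Shallow tree: at most one split
shallow <- rpart(mpg ~ ., data = mtcars, method = "anova",
                 control = rpart.control(maxdepth = 1, minsplit = 5, cp = 0))

# Deeper tree: up to four levels of splits
deep <- rpart(mpg ~ ., data = mtcars, method = "anova",
              control = rpart.control(maxdepth = 4, minsplit = 5, cp = 0))

# Number of terminal nodes and training error for a fitted tree
count_leaves <- function(model) sum(model$frame$var == "<leaf>")
train_mse    <- function(model) mean((mtcars$mpg - predict(model, mtcars))^2)

print(c(shallow = count_leaves(shallow), deep = count_leaves(deep)))
print(c(shallow = train_mse(shallow),    deep = train_mse(deep)))
```

Lower training error from the deeper tree does not by itself mean better predictions on new data; that needs a held-out test set.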

Linear regression models: Reducing the regularization penalty
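
To see the effect of the penalty, here is a sketch using ridge regression as one example of a regularized linear model (the notes do not say which penalty is meant). It uses the closed-form ridge solution on R's built-in mtcars data; `ridge_coefficients` and the two lambda values are hypothetical.

```r
# Closed-form ridge solution: (X'X + lambda * I)^-1 X'y
ridge_coefficients <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Standardized predictors and centered response, so no intercept is needed
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

strong_penalty <- ridge_coefficients(X, y, lambda = 100)
weak_penalty   <- ridge_coefficients(X, y, lambda = 0.1)

# Training error for a given coefficient vector
train_sse <- function(beta) sum((y - X %*% beta)^2)

# A weaker penalty allows larger coefficients and a closer fit to the training data
print(c(strong = sum(strong_penalty^2), weak = sum(weak_penalty^2)))
print(c(strong = train_sse(strong_penalty), weak = train_sse(weak_penalty)))
```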

Boosted ensemble: Increasing the number of trees
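
A hand-rolled sketch of boosting with squared-error loss, built from one-split rpart trees on R's built-in mtcars data. This is only meant to show the effect of the number of trees; it is not how a dedicated boosting package is implemented, and `boost_training_mse` and the learning rate of 0.1 are assumptions.

```r
library(rpart)

# Fit n_trees stumps in sequence, each one to the residuals left by the previous ones,
# and return the training MSE of the combined prediction
boost_training_mse <- function(n_trees, learning_rate = 0.1) {
  features   <- mtcars[, setdiff(names(mtcars), "mpg")]
  prediction <- rep(mean(mtcars$mpg), nrow(mtcars))

  for (i in seq_len(n_trees)) {
    work <- features
    work$residual <- mtcars$mpg - prediction
    stump <- rpart(residual ~ ., data = work, method = "anova",
                   control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
    prediction <- prediction + learning_rate * predict(stump, work)
  }

  mean((mtcars$mpg - prediction)^2)
}

few_trees  <- boost_training_mse(5)
many_trees <- boost_training_mse(100)

print(c(few = few_trees, many = many_trees))
```

More trees keep lowering the training error, which is exactly why the number of trees acts as a complexity setting that should be checked against held-out data.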

Evaluating a classification model

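# Load the COVID dataset into R
# stringsAsFactors = TRUE reproduces the Factor columns shown below (R >= 4.0 reads strings as character by default)
covid <- read.csv("C:/Users/Monte Richardson/Desktop/Data Science Dojo/covid_19_data.csv", header = TRUE, na.strings = c(""), stringsAsFactors = TRUE)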

# Explore the data set
dim(covid)
## [1] 4247    8
str(covid)
## 'data.frame':    4247 obs. of  8 variables:
##  $ SNo            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ObservationDate: Factor w/ 47 levels "01/22/2020","01/23/2020",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Province.State : Factor w/ 182 levels " Montreal, QC",..: 4 8 21 45 47 51 52 53 54 57 ...
##  $ Country.Region : Factor w/ 111 levels " Azerbaijan",..: 59 59 59 59 59 59 59 59 59 59 ...
##  $ Last.Update    : Factor w/ 1190 levels "1/22/2020 17:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Confirmed      : num  1 14 6 1 0 26 2 1 4 1 ...
##  $ Deaths         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Recovered      : num  0 0 0 0 0 0 0 0 0 0 ...

Data exploration and cleaning

# View the summary statistics of the dataset
summary(covid)
##       SNo         ObservationDate   Province.State        Country.Region
##  Min.   :   1   03/08/2020: 255   Anhui    :  47   Mainland China:1451  
##  1st Qu.:1062   03/07/2020: 225   Beijing  :  47   US            : 783  
##  Median :2124   03/06/2020: 199   Chongqing:  47   Australia     : 201  
##  Mean   :2124   03/05/2020: 173   Fujian   :  47   Canada        : 133  
##  3rd Qu.:3186   03/04/2020: 160   Gansu    :  47   Hong Kong     :  47  
##  Max.   :4247   03/03/2020: 151   (Other)  :2514   Japan         :  47  
##                 (Other)   :3084   NA's     :1498   (Other)       :1585  
##               Last.Update     Confirmed           Deaths       
##  2020-02-01T19:43:03:  63   Min.   :    0.0   Min.   :   0.00  
##  2020-02-01T19:53:03:  63   1st Qu.:    1.0   1st Qu.:   0.00  
##  1/31/2020 23:59    :  62   Median :    9.0   Median :   0.00  
##  2020-02-24T23:33:02:  60   Mean   :  586.9   Mean   :  17.53  
##  1/30/20 16:00      :  58   3rd Qu.:   99.5   3rd Qu.:   1.00  
##  1/29/20 19:30      :  54   Max.   :67707.0   Max.   :2986.00  
##  (Other)            :3887                                      
##    Recovered      
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :    1.0  
##  Mean   :  187.9  
##  3rd Qu.:   16.0  
##  Max.   :45235.0  
## 
# Any missing values?
# Count the missing values
sum(is.na(covid$Deaths))
## [1] 0
# Remove SNo, ObservationDate, Province.State, and Last.Update (columns 1, 2, 3, and 5)
covid.data <- covid[ , -c(1, 2, 3, 5)]

# Keep only the six countries of interest (%>% and filter() come from dplyr)
library(dplyr)
covid2 <- covid.data %>%
  filter(Country.Region %in% c('Japan', 'US', 'Mainland China', 'Italy', 'Iran', 'South Korea'))

# View the data structure
str(covid2)
## 'data.frame':    2385 obs. of  4 variables:
##  $ Country.Region: Factor w/ 111 levels " Azerbaijan",..: 59 59 59 59 59 59 59 59 59 59 ...
##  $ Confirmed     : num  1 14 6 1 0 26 2 1 4 1 ...
##  $ Deaths        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Recovered     : num  0 0 0 0 0 0 0 0 0 0 ...
summary(covid2)
##         Country.Region   Confirmed         Deaths          Recovered      
##  Mainland China:1451   Min.   :    0   Min.   :   0.00   Min.   :    0.0  
##  US            : 783   1st Qu.:    2   1st Qu.:   0.00   1st Qu.:    0.0  
##  Japan         :  47   Median :   58   Median :   0.00   Median :    2.0  
##  South Korea   :  47   Mean   : 1024   Mean   :  31.06   Mean   :  332.3  
##  Italy         :  38   3rd Qu.:  281   3rd Qu.:   2.00   3rd Qu.:   86.0  
##  Iran          :  19   Max.   :67707   Max.   :2986.00   Max.   :45235.0  
##  (Other)       :   0
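# Cast the categorical attribute Country.Region to factor (it is a predictor here; the target is Deaths)
# Note: as.factor() keeps unused levels, which is why all 111 levels and "(Other): 0" still appear above
covid2$Country.Region <- as.factor(covid2$Country.Region)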
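
The str() output above still lists 111 levels after filtering to six countries. A minimal sketch of that behavior on a hypothetical toy factor, with droplevels() as one way to remove the unused levels:

```r
# Hypothetical factor with four levels: France, Italy, Japan, US
regions <- factor(c("US", "Japan", "Italy", "France", "US"))

# Subsetting keeps every original level, even those with no rows left
kept <- regions[regions %in% c("US", "Japan")]

levels_before <- nlevels(as.factor(kept))   # 4: as.factor() leaves an existing factor unchanged
levels_after  <- nlevels(droplevels(kept))  # 2: only Japan and US remain

print(c(before = levels_before, after = levels_after))
```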

Building the model

# Randomly select 70% of the data as training set
set.seed(27)
train.index <- sample(1:nrow(covid2), 0.7*nrow(covid2))
covid.train <- covid2[train.index, ]
dim(covid.train)
## [1] 1669    4
summary(covid.train$Deaths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   35.22    2.00 2959.00
# Use the remaining 30% as the testing data
covid.test <- covid2[-train.index,] 
dim(covid.test)
## [1] 716   4
summary(covid.test$Deaths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   21.35    2.00 2986.00

Fit a decision tree model to the training set

# rpart: pass a formula with the variable of interest on the left of ~ and the predictors on the right
# method = "anova" fits a regression tree, which matches the numeric target Deaths and the output below
library(rpart)
covid.dt.model <- rpart(Deaths ~ Recovered + Confirmed + Country.Region, data = covid.train, method = "anova")

# The decision tree object
print(covid.dt.model)
## n= 1669 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 1669 122342500.0   35.224690  
##   2) Recovered< 3072.5 1650   6386908.0    7.438788  
##     4) Confirmed< 12349.5 1641    592748.0    3.231566 *
##     5) Confirmed>=12349.5 9    468900.2  774.555600 *
##   3) Recovered>=3072.5 19   4054171.0 2448.211000 *

Visualizing the model

# Let's examine the tree (rpart.plot() comes from the rpart.plot package)
library(rpart.plot)
rpart.plot(covid.dt.model)

This model is too simplistic: with only three leaves it can predict only three distinct values. It needs more features to be informative.

Evaluating the model

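# Predict deaths on the test set (an anova tree returns the numeric mean of each leaf)
covid.dt.predictions <- predict(covid.dt.model, covid.test, type = "vector")
# Cross-tabulate predicted against actual deaths
# Both are numeric, so this is a 3 x 32 contingency table rather than a true confusion matrix
covid.dt.confusion <- table(covid.dt.predictions, covid.test$Deaths)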
print(covid.dt.confusion)
##                     
## covid.dt.predictions   0   1   2   3   4   5   6   7   8   9  10  11  12
##     3.2315661182206  447  85  44  33  12  14  31   7   5   1   1   2   6
##     774.555555555556   0   0   0   0   0   0   0   0   0   0   0   0   0
##     2448.21052631579   0   0   0   0   0   0   0   0   0   0   0   0   0
##                     
## covid.dt.predictions  13  17  19  20  22  28  29  40 124 125 145 197 549
##     3.2315661182206    5   3   3   1   2   1   1   1   1   1   1   1   0
##     774.555555555556   0   0   0   0   0   0   0   0   0   0   0   0   1
##     2448.21052631579   0   0   0   0   0   0   0   0   0   0   0   0   0
##                     
## covid.dt.predictions 1457 1596 1921 2346 2727 2986
##     3.2315661182206     0    0    0    0    0    0
##     774.555555555556    0    0    0    0    0    0
##     2448.21052631579    1    1    1    1    1    1
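# Accuracy
# Caution: the table is not square, so diag() picks only cells [1,1], [2,2], and [3,3];
# the result (447 / 716) is not a meaningful classification accuracy for a numeric target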
covid.dt.accuracy <- sum(diag(covid.dt.confusion)) / sum(covid.dt.confusion)
print(covid.dt.accuracy)
## [1] 0.6243017
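# Precision
# Caution: row 2 is the leaf predicting 774.56 and column 2 is an actual value of 1, so this is not a true precision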
covid.dt.precision <- covid.dt.confusion[2,2] / sum(covid.dt.confusion[2,])
print(covid.dt.precision)
## [1] 0
# Recall
# Caution: same mismatch as above (row 2 vs. column 2), so this is not a true recall
covid.dt.recall <- covid.dt.confusion[2,2] / sum(covid.dt.confusion[,2])
print(covid.dt.recall)
## [1] 0
# F1 score (NaN here because precision + recall is 0, giving 0 / 0)
covid.dt.F1 <- 2 * covid.dt.precision * covid.dt.recall / (covid.dt.precision + covid.dt.recall)
print(covid.dt.F1)
## [1] NaN
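
Because the tree above is an anova (regression) tree, accuracy, precision, recall, and F1 do not describe it well. One possible alternative, sketched below on hypothetical numbers rather than the COVID data, is to report regression errors such as RMSE and MAE, or to turn the target into two classes first. The cutoff of 10 deaths is an arbitrary assumption for illustration.

```r
# Hypothetical actual and predicted death counts (not taken from the COVID data)
actual    <- c(0, 0, 1, 3, 0, 12, 549, 2986)
predicted <- c(3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 774.6, 2448.2)

# Regression metrics suited to a numeric target
rmse <- sqrt(mean((actual - predicted)^2))
mae  <- mean(abs(actual - predicted))

# Alternative: binarize with an assumed cutoff to obtain a real 2 x 2 confusion matrix
cutoff <- 10
cm <- table(predicted = factor(predicted >= cutoff, levels = c(TRUE, FALSE)),
            actual    = factor(actual >= cutoff,    levels = c(TRUE, FALSE)))

accuracy  <- sum(diag(cm)) / sum(cm)                   # 7 / 8
precision <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])    # 2 / 2
recall    <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])    # 2 / 3

print(c(rmse = rmse, mae = mae))
print(cm)
print(c(accuracy = accuracy, precision = precision, recall = recall))
```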