Introduction

F1 Score: In statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy. It is calculated from the precision and recall of the test.

Recall: The recall is the number of true positive results divided by the number of all samples that should have been identified as positive.

\[ \text{Recall} = \frac{\text{True Pos}}{\text{True Pos} + \text{False Neg}} \]

Precision: The precision is the number of true positive results divided by the number of all positive results, including those not identified correctly.

\[ \text{Precision} = \frac{\text{True Pos}}{\text{True Pos} + \text{False Pos}} \]

Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.

The F1 score is the harmonic mean of the precision and recall.

\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero.
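As a quick numeric illustration of the harmonic mean, the short sketch below computes precision, recall, and F1 from hypothetical confusion-matrix counts (the TP/FP/FN values are made up for illustration and are not taken from the data sets used later):

TP <- 90; FP <- 30; FN <- 10                              # hypothetical counts, for illustration only
precision_ex <- TP / (TP + FP)                            # 0.75
recall_ex    <- TP / (TP + FN)                            # 0.90
2 * precision_ex * recall_ex / (precision_ex + recall_ex) # F1, the harmonic mean, ~0.818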

Data:

I am going to evaluate the F1 score, accuracy, precision, and recall on simulated (abstract) cancer data sets. The first data set is balanced, consisting of 500 cancer and 500 non-cancer patients; the second and third data sets are imbalanced. The second data set consists of 800 cancer and 200 non-cancer patients, and the third consists of 200 cancer and 800 non-cancer patients.

Data1: 500 cancer and 500 non-cancer patients.

Data2: 800 cancer and 200 non-cancer patients.

Data3: 200 cancer and 800 non-cancer patients.

Objective:

Compare accuracy, precision, recall, and F1 score across the three data sets and evaluate the results.



Load Libraries

#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr) 
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## The following object is masked from 'package:readr':
## 
##     spec
library(formattable)
# Preparing the workspace
# Clear the workspace: 
rm(list = ls())
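The simulations below rely on sample(), so the exact counts and scores change from run to run. The original analysis does not show a seed; adding one (any value, the one below is arbitrary) would make a re-run reproducible, though not necessarily identical to the outputs shown here:

# Optional: fix the RNG seed so the sampled data are reproducible (arbitrary seed value)
set.seed(2021)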

Score Comparison for Data Set 1 (Data1: 500 cancer and 500 non-cancer patients)

# Simulate the true labels and the predicted labels, both roughly 50% "Y" / 50% "N"
ind <- sample(c("Y","N"), 1000, replace=T, prob=c(0.5,0.5))

pred <- sample(c("Y","N"), 1000, replace=T, prob=c(0.5, 0.5))

# Force the first 500 predictions to match the truth, so the "classifier" is correct
# for those cases and for roughly half of the remaining ones (~75% overall)
ind[1:500] <- pred[1:500]
# prediction data frame
df1 <- data.frame(pred, ind)
str(df1)
## 'data.frame':    1000 obs. of  2 variables:
##  $ pred: chr  "Y" "Y" "N" "Y" ...
##  $ ind : chr  "Y" "Y" "N" "Y" ...
#Convert to Categorical Variable 
df1$pred <-  factor(df1$pred)
df1$ind <-  factor(df1$ind)

str(df1)
## 'data.frame':    1000 obs. of  2 variables:
##  $ pred: Factor w/ 2 levels "N","Y": 2 2 1 2 2 1 2 2 1 2 ...
##  $ ind : Factor w/ 2 levels "N","Y": 2 2 1 2 2 1 2 2 1 2 ...
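Before computing any scores, a quick check (a small addition, not part of the original output) confirms that the simulated truth labels are roughly balanced:

# Class balance of the simulated truth labels; should be roughly 50% N / 50% Y
prop.table(table(df1$ind))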

Accuracy Score

acc<-mean(ind == pred)
acc
## [1] 0.755
#Confusion Matrix with Yardstick library
conf_mat(estimate=pred, truth= ind, data=df1)
##           Truth
## Prediction   N   Y
##          N 361 125
##          Y 120 394
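As a sanity check, the same numbers can be recomputed by hand from the confusion matrix above. Note that yardstick treats the first factor level ("N") as the event by default (see the message printed when the package was attached), so the precision and recall reported below are with respect to the "N" class:

# Cells of the confusion matrix above, with the first level "N" taken as the event
tp <- 361   # predicted N, truth N
fp <- 125   # predicted N, truth Y
fn <- 120   # predicted Y, truth N
tn <- 394   # predicted Y, truth Y

(tp + tn) / (tp + fp + fn + tn)   # accuracy  = 0.755
p <- tp / (tp + fp); p            # precision ~ 0.743
r <- tp / (tp + fn); r            # recall    ~ 0.751
2 * p * r / (p + r)               # F1        ~ 0.747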

Precision

pre <- precision(data=df1, estimate=pred, truth=ind)
pre
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.743

Recall (Sensitivity)

rec <- recall(data=df1, estimate=pred, truth=ind)
rec
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.751

F1 Score

f1 <- f_meas(data=df1, estimate=pred, truth=ind)
f1
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.747
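If cancer ("Y") should instead be treated as the positive class, the same yardstick functions accept the event_level argument mentioned in the package start-up message (a sketch; it was not run in the original analysis, so no output is shown):

# Treat the second factor level ("Y", cancer) as the event
precision(data = df1, truth = ind, estimate = pred, event_level = "second")
recall(data = df1, truth = ind, estimate = pred, event_level = "second")
f_meas(data = df1, truth = ind, estimate = pred, event_level = "second")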

Score Comparison for Data Set 2 (Data2: 800 cancer and 200 non-cancer patients)

ind2 <- sample(c("Y","N"), 1000, replace=T, prob=c(0.8,0.2))

pred2<- sample(c("Y","N"), 1000, replace=T, prob=c(0.8, 0.2))

ind2[1:500] <- pred2[1:500]
# prediction data frame
df2 <- data.frame(pred2, ind2)
str(df2)
## 'data.frame':    1000 obs. of  2 variables:
##  $ pred2: chr  "Y" "Y" "Y" "Y" ...
##  $ ind2 : chr  "Y" "Y" "Y" "Y" ...
#Convert to Categorical Variable 
df2$pred2 <-  factor(df2$pred2)
df2$ind2 <-  factor(df2$ind2)

str(df2)
## 'data.frame':    1000 obs. of  2 variables:
##  $ pred2: Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ ind2 : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...

Accuracy Score

acc2<-mean(ind2 == pred2)
acc2
## [1] 0.851
#Confusion Matrix with Yardstick library
conf_mat(estimate=pred2, truth= ind2, data=df2)
##           Truth
## Prediction   N   Y
##          N 109  80
##          Y  69 742

Precision

pre2 <- precision(data=df2, estimate=pred2, truth=ind2)
pre2
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.577

Recall (Sensitivity)

rec2 <- recall(data=df2, estimate=pred2, truth=ind2)
rec2
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.612

F1 Score

f1_2 <- f_meas(data=df2, estimate=pred2, truth=ind2)
f1_2
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.594

Score Comparison for Data Set 3 (Data3: 200 cancer and 800 non-cancer patients)

ind3 <- sample(c("Y","N"), 1000, replace=T, prob=c(0.2,0.8))

pred3<- sample(c("Y","N"), 1000, replace=T, prob=c(0.2, 0.8))

ind3[1:500] <- pred3[1:500]
# prediction data frame
df3 <- data.frame(pred3, ind3)
str(df3)
## 'data.frame':    1000 obs. of  2 variables:
##  $ pred3: chr  "N" "N" "N" "N" ...
##  $ ind3 : chr  "N" "N" "N" "N" ...
#Convert to Categorical Variable 
df3$pred3 <-  factor(df3$pred3)
df3$ind3 <-  factor(df3$ind3)

str(df3)
## 'data.frame':    1000 obs. of  2 variables:
##  $ pred3: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ind3 : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

Accuracy Score

acc3<-mean(ind3 == pred3)
acc3
## [1] 0.848
#Confusion Matrix with Yardstick library
conf_mat(estimate=pred3, truth= ind3, data=df3)
##           Truth
## Prediction   N   Y
##          N 718  79
##          Y  73 130

Precision

pre3 <- precision(data=df3, estimate=pred3, truth=ind3)
pre3
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.901

Recall (Sensitivity)

rec3 <- recall(data=df3, estimate=pred3, truth=ind3)
rec3
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.908

F1 Score

f1_3 <- f_meas(data=df3, estimate=pred3, truth=ind3)
f1_3
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.904
# Summary table of all scores (tibble() replaces the deprecated data_frame())
df_new <- tibble(
  id = 1:4,
  Model = c('Accuracy', 'Precision', 'Sensitivity', 'F1 Score'),
  DF1_Scores = c(acc,  pre$.estimate,  rec$.estimate,  f1$.estimate),
  DF2_Scores = c(acc2, pre2$.estimate, rec2$.estimate, f1_2$.estimate),
  DF3_Scores = c(acc3, pre3$.estimate, rec3$.estimate, f1_3$.estimate)
)
first_formatter <- formatter("span", 
                                 style = ~ style(color = "grey",
                                                 font.weight = "bold"))

formattable(df_new,
            list(Model = first_formatter,
                 DF1_Scores = first_formatter,
                 DF2_Scores = first_formatter,
                 DF3_Scores = first_formatter))
id  Model        DF1_Scores  DF2_Scores  DF3_Scores
 1  Accuracy     0.755       0.851       0.848
 2  Precision    0.7427984   0.5767196   0.9008783
 3  Sensitivity  0.7505198   0.6123596   0.9077118
 4  F1 Score     0.7466391   0.5940054   0.9042821
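As an aside, yardstick's metric_set() offers a more compact way to collect the same four metrics per data set (a sketch, not run in the original analysis):

# Bundle accuracy, precision, recall and F1 into a single metric function
four_metrics <- metric_set(accuracy, precision, recall, f_meas)
four_metrics(df1, truth = ind,  estimate = pred)
four_metrics(df2, truth = ind2, estimate = pred2)
four_metrics(df3, truth = ind3, estimate = pred3)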

We can notice that accuracy and the F1 score are aligned for the balanced data set, so accuracy is a good measure when the data set is balanced. However, for the imbalanced data sets, accuracy and the F1 score are no longer aligned.

Precision measures how many of the cases the model flags as positive are truly positive; it focuses on minimizing false positives. With the balanced data set, precision is almost identical to accuracy and recall; however, the larger the share of cancer patients (positives), the lower the precision score we get (note that yardstick treats the first factor level, "N", as the event by default, as discussed after the Data Set 1 confusion matrix). And when precision is low, the F1 score is low as well.

Recall is a very important tool when we want to minimize the chance of missing positive cases (false negatives). These are typically situations where missing a positive case has a much bigger cost than wrongly classifying something as positive.

In our case, if precision is low, we predict cancer for people who do not have it, which may trigger further investigation (increased cost). If recall is low, we fail to predict cancer for people who actually have it, which may cost a life; this is more severe than a false alarm.

In this study, our goal is to understand the impact of the F1 score. The F1 score becomes more important when the data are imbalanced, because it balances out precision and recall.
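For instance, plugging the Data2 precision and recall from the table above into the F1 formula reproduces the reported score:

# F1 for Data2, recomputed from its precision and recall
p2 <- 0.5767196; r2 <- 0.6123596
2 * p2 * r2 / (p2 + r2)   # ~0.594, matching the f_meas result above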



