Assignment 2B

Author

Michael Mayne

Data at a Glance

 library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Penguins_Raw <- read_csv("https://raw.githubusercontent.com/acatlin/data/refs/heads/master/penguin_predictions.csv")

Rows: 93 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): .pred_class, sex
dbl (1): .pred_female

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Penguins_Raw%>%
   count(.pred_class, sort = TRUE, name = "Predictions")

# A tibble: 2 × 2
  .pred_class Predictions
  <chr>             <int>
1 male                 54
2 female               39

Pre-Coding Approach

For Assignment 2B, I am essentially asked to outline the rate of error for the penguins sex predictions assuming we are to consistently assume male (since it is the value most commonly suggested by the data after a cursory look.. Then build confusion matrices with fixed values. I will first calculate the null rate by filtering and counting the number times the actual sex is female. Then taking that result over 100. As for the confusion matrix problems , I intend to solve them manually for simplicity. I will count via filter as before make a table labeling the values in which the:

Actual sex and predictions are male(TP).

Prediction is male but true sex is female (FP)

Prediction is female but true sex is male (FN)

Prediction is female and true sex is female (TN).

Plotting the Demographic of the Actual Sex

ggplot(Penguins_Raw, aes(x = sex)) +
 geom_bar(fill= "lightblue") +
  labs( title = "Actual Sex Distrubtion of Penguins")

Calculating Error Rate

(sum(Penguins_Raw$sex == "female"))/93

[1] 0.4193548

According to the calculation by showing that we have 93 confirmed observations. We can calulate the null error rate by taking the least common option (sex= female) and comparing it to the total.

This leaves us with a value of .4193 or essentially a 42% error rate.

This can also predict how successful a model can be assuming that model will predict the most likely result. We can see the success by its inverse which should be about 58%.

Manual Probability Threshold

Below several tables are made assuming different probability thresholds. The table are manual for simplicity.

#Assuming 0.2 Probability 

areal_answer <- factor(c(0,0,1,1,1,1,1,1,1,1))
apredict_answer <-factor(c(0,1,1,1,0,1,1,1,1,1))

Matrix_table_A <- table(Real = areal_answer, Predicted = apredict_answer)

print(Matrix_table_A)

    Predicted
Real 0 1
   0 1 1
   1 1 7

According to this table this assumes that there is small change of a positive value and thus often gets positive results but they are false positives at a high rate. Also few false negatives as it needs to be fairly false.

#Assuming 0.5 Probability 

breal_answer <- factor(c(0,0,1,1,1,1,1,1,1,1))
bpredict_answer <-factor(c(0,1,1,1,0,1,0,0,1,0))

Matrix_table_B <- table(Real = breal_answer, Predicted = bpredict_answer)

print(Matrix_table_B)

    Predicted
Real 0 1
   0 1 1
   1 4 4

This table provides an equal rate of right and wrong answers. At 0.5 there is no bias in the predictions making the solution as an equal guess.

#Assuming 0.8 Probability 

creal_answer <- factor(c(0,0,1,1,1,1,1,1,1,1))
cpredict_answer <-factor(c(0,1,0,0,0,0,0,0,1,0))

Matrix_table_C <- table(Real = creal_answer, Predicted = cpredict_answer)

print(Matrix_table_C)

    Predicted
Real 0 1
   0 1 1
   1 7 1

This table is similar to the first table, with the only difference is that the values favor a false negative than a false positive. So the device will err on the side of giving a false negative than to assume a positive result.

Stats for Each Probability Threshold

# 0.2 Probability

#True Positive (real & predict = 1)
TP_A = 7
#True Negative (real & predict = 0)
TN_A = 1
#False Positive (real= 0 , predict = 1)
FP_A = 1
#False Negative (real = 1, predict = 0)
FN_A = 1

Accuracy_A <- (TP_A+TN_A)/10
Precision_A <- (TP_A)/(TP_A + FP_A)
Recall_A <- (TP_A)/ (TP_A + FN_A)
F1_A = (2*(Precision_A*Recall_A))/(Precision_A+Recall_A)

# 0.5 Probability

#True Positive (real & predict = 1)
TP_B = 4
#True Negative (real & predict = 0)
TN_B = 1
#False Positive (real= 0 , predict = 1)
FP_B = 4
#False Negative (real = 1, predict = 0)
FN_B = 1

Accuracy_B <- (TP_B+TN_B)/10
Precision_B <- (TP_B)/(TP_B + FP_B)
Recall_B <- (TP_B)/ (TP_B + FN_B)
F1_B = (2*(Precision_B*Recall_B))/(Precision_B+Recall_B)

# 0.8 Probability

#True Positive (real & predict = 1)
TP_C = 1
#True Negative (real & predict = 0)
TN_C = 1
#False Positive (real= 0 , predict = 1)
FP_C = 1
#False Negative (real = 1, predict = 0)
FN_C = 7

Accuracy_C <- (TP_C+TN_C)/10
Precision_C <- (TP_C)/(TP_C + FP_C)
Recall_C <- (TP_C)/ (TP_C + FN_C)
F1_C = (2*(Precision_C*Recall_C))/(Precision_C+Recall_C)

Collective table of All Data set values

Probability_Thresh <- data.frame(

Accuracy = c(Accuracy_A, Accuracy_B, Accuracy_C),
Precision = c(Precision_A, Precision_B, Precision_C),
Recall = c(Recall_A, Recall_B, Recall_C),
F1 = c( F1_A, F1_B, F1_C)
)

print(Probability_Thresh)

  Accuracy Precision Recall        F1
1      0.8     0.875  0.875 0.8750000
2      0.5     0.500  0.800 0.6153846
3      0.2     0.500  0.125 0.2000000

A you can see. You can end with a unique overview of the different success of each table and their values.

Threshold Use Cases:

A 0.2 Probability threshold is useful when creating a device that could screen for an infectious disease. This is because it is more viable for a false positive to occur and the patient would not have the disease. If a false negative was to occur there is a chance that treatment would be impossible to act in time and the individual will end up infecting others.

a 0.8. Probability threshold is useful with a sport scoring device as it is more important than a player does not get a free point that could win a game than the player being denied a point.

-End of Report