Rows: 93 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): .pred_class, sex
dbl (1): .pred_female
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Penguins_Raw %>% count(.pred_class, sort = TRUE, name = "Predictions")
# A tibble: 2 × 2
.pred_class Predictions
<chr> <int>
1 male 54
2 female 39
Pre-Coding Approach
For Assignment 2B, I am essentially asked to outline the error rate for the penguin sex predictions if we consistently predict male (since it is the most common value in the data after a cursory look), and then to build confusion matrices with fixed values. I will first calculate the null error rate by filtering and counting the number of times the actual sex is female, then dividing that count by the 93 total observations. As for the confusion matrix problems, I intend to solve them manually for simplicity. I will count via filter as before and make a table labeling the four cases (a code sketch follows this list):
Prediction and actual sex are both male (TP).
Prediction is male but actual sex is female (FP).
Prediction is female but actual sex is male (FN).
Prediction and actual sex are both female (TN).
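As a minimal sketch of this plan (assuming the Penguins_Raw columns shown above: sex, .pred_class, and .pred_female), the null rate and the four counts could be computed like this:
library(dplyr)

# Null error rate: share of the less common class (female) among all rows
n_female <- Penguins_Raw %>% filter(sex == "female") %>% nrow()
null_rate <- n_female / nrow(Penguins_Raw)

# Confusion-matrix counts, treating "male" as the positive class
TP <- sum(Penguins_Raw$.pred_class == "male" & Penguins_Raw$sex == "male")
FP <- sum(Penguins_Raw$.pred_class == "male" & Penguins_Raw$sex == "female")
FN <- sum(Penguins_Raw$.pred_class == "female" & Penguins_Raw$sex == "male")
TN <- sum(Penguins_Raw$.pred_class == "female" & Penguins_Raw$sex == "female")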
Plotting the Distribution of the Actual Sex
ggplot(Penguins_Raw, aes(x = sex)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Actual Sex Distribution of Penguins")
Calculating Error Rate
sum(Penguins_Raw$sex == "female") / 93
[1] 0.4193548
The calculation shows that, out of 93 confirmed observations, we can calculate the null error rate by taking the count of the less common class (sex = female) and dividing it by the total.
This leaves us with a value of 0.4193, or roughly a 42% error rate.
This also predicts how successful a null model would be if it always predicted the most common class: its accuracy is the inverse, about 58%.
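As a quick check, the same figures can be computed without hard-coding the total (a minimal sketch using the columns above):
null_error <- sum(Penguins_Raw$sex == "female") / nrow(Penguins_Raw)
1 - null_error # roughly 0.58: the accuracy of always predicting "male"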
Manual Probability Threshold
Below, several confusion matrices are made assuming different probability thresholds. The tables are built manually for simplicity.
The first table assumes a low threshold, so only a small chance of a positive value is needed to get a positive label. It therefore returns positive results often, producing false positives at a high rate, but few false negatives, since a case must look strongly negative to be labeled negative.
The second table is similar to the first, with the only difference being that the values favor a false negative over a false positive: with a high threshold, the device errs on the side of giving a false negative rather than assuming a positive result.
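For reference, here is a minimal sketch of how the same tables could be rebuilt programmatically, assuming a row is labeled "female" whenever .pred_female meets the threshold (confusion_at is a hypothetical helper name, not part of any package):
# Hypothetical helper: confusion counts at an arbitrary threshold on .pred_female
confusion_at <- function(df, threshold) {
  pred <- ifelse(df$.pred_female >= threshold, "female", "male")
  table(Predicted = pred, Actual = df$sex)
}

confusion_at(Penguins_Raw, 0.2)
confusion_at(Penguins_Raw, 0.8)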
As you can see, each threshold produces a different balance of errors, and comparing the tables and their values gives an overview of how successful each one is.
Threshold Use Cases:
A 0.2 probability threshold is useful when creating a device that screens for an infectious disease. It is more acceptable for a false positive to occur, because the patient would simply turn out not to have the disease; if a false negative were to occur, there is a chance treatment would come too late and the individual would go on to infect others.
A 0.8 probability threshold is useful for a sports scoring device, as it is more important that a player does not get a free point that could win a game than that a player is occasionally denied a point.