My main approach to this assignment is to properly understand the null error rate, confusion matrices, and thresholds. Loading the data into R from a .csv file should not be a problem, but applying the proper functions to achieve the task will be my greatest hurdle. I will use Claude to help me understand the best way to approach datasets with classification problems and machine learning. I'm sure it will be very useful for truly understanding what I'm trying to do.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
0.2 Threshold - With a low threshold, the model labels more cases as positive. You catch more true positives, but you also raise more “false alarms” (false positives). This is useful when screening for diseases, where a false positive is better than a false negative (missing a real case).
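0.8 Threshold - With a high threshold, the model labels a case as positive only when it is highly confident. This reduces false positives but misses more true positives. Staying with the medical field, take surgery as an example: you should only operate when there is high confidence that the patient needs it, because a false positive (an unnecessary surgery) is costly.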
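
To make the two thresholds concrete, here is a minimal sketch in base R. The `scores` and `actual` vectors are made-up toy values, not the assignment's data, and `confusion_at()` is a hypothetical helper written just for this illustration. It computes the null error rate and a confusion matrix at each threshold.

```r
# Hypothetical toy data (NOT the assignment dataset):
# predicted probabilities and the true class (1 = positive, 0 = negative).
scores <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95)
actual <- c(0,    0,    1,    0,    0,    1,    0,    0,    1,    1)

# Null error rate: the error you get by always predicting the majority class.
null_error_rate <- 1 - max(table(actual)) / length(actual)

# Hypothetical helper: 2x2 confusion matrix for a given threshold.
# Rows are the predicted class, columns are the actual class.
confusion_at <- function(threshold) {
  predicted <- factor(as.integer(scores >= threshold), levels = c(0, 1))
  table(predicted = predicted, actual = factor(actual, levels = c(0, 1)))
}

cm_low  <- confusion_at(0.2)  # low threshold: more cases flagged as positive
cm_high <- confusion_at(0.8)  # high threshold: only confident cases flagged

print(null_error_rate)
print(cm_low)
print(cm_high)
```

On this toy data, the 0.2 threshold produces four false positives and no false negatives, while the 0.8 threshold produces no false positives and two false negatives, which is the trade-off described above.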