Spring 2025, Exam 1

Your name: Kylie Carr

Question 1 (35 points)

Fix the code below

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Attempt to prepare data with errors
q1_mpg_data <- tibble(
  id = 1:20,
  odd = id %% 2 == 1,
  manufacturer = rep(c('Audi', 'Ford'), 10),
  year = rep(1999:2008, 2),
  efficiency = ifelse(manufacturer == "Ford", 20, 30) * id + 5
)

q2_model_info <- tibble( 
  manufacturer = c('Audi', 'Ford', 'Honda', NA), 
  model = c('a4', 'f-150', 'civic', 'accord')
)

# Merge and analyze
q1_merged <- q1_mpg_data %>% 
  filter(odd) %>% 
  rename(brand = manufacturer) %>% 
  inner_join(q2_model_info, by = c('brand' = 'manufacturer')) %>% 
  group_by(year, model) %>% 
  mutate(efficiency_avg = mean(efficiency))

print(q1_merged)
## # A tibble: 10 × 7
## # Groups:   year, model [5]
##       id odd   brand  year efficiency model efficiency_avg
##    <int> <lgl> <chr> <int>      <dbl> <chr>          <dbl>
##  1     1 TRUE  Audi   1999         35 a4               185
##  2     3 TRUE  Audi   2001         95 a4               245
##  3     5 TRUE  Audi   2003        155 a4               305
##  4     7 TRUE  Audi   2005        215 a4               365
##  5     9 TRUE  Audi   2007        275 a4               425
##  6    11 TRUE  Audi   1999        335 a4               185
##  7    13 TRUE  Audi   2001        395 a4               245
##  8    15 TRUE  Audi   2003        455 a4               305
##  9    17 TRUE  Audi   2005        515 a4               365
## 10    19 TRUE  Audi   2007        575 a4               425
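
One way to see why only Audi rows appear in the output above: every odd id belongs to Audi, so filter(odd) removes all Ford rows before the join. The following is a minimal sketch of that check, assuming the same construction of id, odd, and manufacturer as in q1_mpg_data; the names q1_check and q1_counts are hypothetical and used only for illustration.

```r
library(dplyr)
library(tibble)

# Rebuild only the columns needed for the check
# (q1_check is an illustrative name, not part of the exam code)
q1_check <- tibble(
  id = 1:20,
  odd = id %% 2 == 1,
  manufacturer = rep(c("Audi", "Ford"), 10)
)

# Count rows for each manufacturer/odd combination
q1_counts <- q1_check %>% count(manufacturer, odd)
print(q1_counts)
```

If Ford rows are expected in the merged result, the filter(odd) step would need to change; whether that is intended depends on the expected answer, which is not shown here.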

Question 2 (35 points)

Complete the following tasks using appropriate R functions and the mpg dataset from ggplot2:
- Filter for vehicles with highway mileage (hwy) above 30.
- Arrange the result by descending highway mileage.
- Create a new field called difference that shows the difference between highway (hwy) and city (cty) mileage.
- Create a new field called title that shows both the manufacturer and model (e.g., toyota rav4).
- Select only the title, cty, difference, and hwy columns.
- Use esq or plot to show the hwy versus cty mileage in a scatterplot.

# Load the mpg dataset
library(tidyverse)
data(mpg)
t_raw_mpg <- tibble(mpg)

t_clean <- t_raw_mpg %>% 
  filter(hwy > 30) %>% 
  arrange(desc(hwy)) %>% 
  mutate(difference = hwy - cty, 
         title = paste(manufacturer, model)) %>% 
  select(title, cty, difference, hwy)


library(ggplot2)

ggplot(t_clean) +
 aes(x = hwy, y = cty, colour = title) +
 geom_point(size = 2.75) +
 scale_color_hue(direction = 1) +
 theme_minimal() +
 theme(axis.title.y = element_text(size = 16L), axis.title.x = element_text(size = 16L), 
 axis.text.y = element_text(size = 14L), axis.text.x = element_text(size = 14L))

Q2a (30 points)

Create a new tibble using the mpg dataset (do not re-use your one from above).

Create a model predicting whether each car is a compact or subcompact (i.e., class). Calculate accuracy, precision, and recall. Explain which metric would be the most important for an insurance company trying to identify fraudulent claims for further investigation (assume they prioritize not missing any potential frauds).

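My Answer: Recall is the most important metric for identifying fraud, assuming the company prioritizes not missing any potential fraud. Recall is the share of actual fraud cases that the model flags, TP / (TP + FN), so maximizing it minimizes false negatives (missed frauds). The trade-off is more false positives, i.e., legitimate claims flagged for investigation, which is acceptable here because flagged claims are only reviewed further.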
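
To make the recall versus precision trade-off concrete, here is a minimal sketch using hypothetical counts. The numbers are invented purely for illustration and do not come from any data in this exam; the variable names are likewise assumptions.

```r
# Hypothetical fraud-screening counts (illustration only, not real data)
tp_demo <- 90    # fraudulent claims that were flagged
fn_demo <- 10    # fraudulent claims that were missed
fp_demo <- 200   # legitimate claims that were flagged
tn_demo <- 9700  # legitimate claims that were not flagged

# Recall: share of actual frauds that were caught
recall_demo <- tp_demo / (tp_demo + fn_demo)

# Precision: share of flagged claims that were actually fraudulent
precision_demo <- tp_demo / (tp_demo + fp_demo)

print(c(recall = recall_demo, precision = precision_demo))
```

In this made-up scenario most frauds are caught even though most flagged claims turn out to be legitimate, which is the kind of outcome a recall-focused insurer would accept.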

t_mpg <- tibble(mpg) %>% 
  filter(class %in% c('compact', 'subcompact')) %>%
  # med_milage is the average of city and highway mileage
  mutate(med_milage = (cty + hwy) / 2) %>% 
  select(manufacturer, model, year, cty, hwy, class, med_milage)
t_model <- t_mpg %>% 
  mutate(predict_class = ifelse(med_milage >= 22.5, 'compact', 'subcompact'))

table(t_model$predict_class, t_model$class)
##             
##              compact subcompact
##   compact         33         23
##   subcompact      14         12
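# Counts from the table above, treating 'compact' as the positive class:
# TP = 33, FP = 23, FN = 14, TN = 12
accuracy <- (33 + 12) / (33 + 23 + 14 + 12)
precision <- 33 / (33 + 23)
recall <- 33 / (33 + 14)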

print(paste0('Accuracy is ', accuracy))
## [1] "Accuracy is 0.548780487804878"
print(paste0('Precision is ', precision))
## [1] "Precision is 0.589285714285714"
print(paste0('Recall is ', recall))
## [1] "Recall is 0.702127659574468"
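
The metrics above are typed in by hand from the printed table. As a sketch of an alternative, the counts can be read from the confusion table by name so they update if the threshold changes. This assumes 'compact' is treated as the positive class and reuses the 22.5 cutoff from the answer; the names sketch_model, conf_mat, and metrics are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Rebuild the predictions with the same rule as above (cutoff of 22.5)
sketch_model <- mpg %>%
  filter(class %in% c("compact", "subcompact")) %>%
  mutate(
    med_milage = (cty + hwy) / 2,
    predict_class = ifelse(med_milage >= 22.5, "compact", "subcompact")
  )

# Rows are predictions, columns are actual classes
conf_mat <- table(
  predicted = sketch_model$predict_class,
  actual = sketch_model$class
)

# Pull each cell by name, with 'compact' as the positive class
tp <- conf_mat["compact", "compact"]
fp <- conf_mat["compact", "subcompact"]
fn <- conf_mat["subcompact", "compact"]
tn <- conf_mat["subcompact", "subcompact"]

metrics <- c(
  accuracy = (tp + tn) / sum(conf_mat),
  precision = tp / (tp + fp),
  recall = tp / (tp + fn)
)
print(metrics)
```

Indexing the table by row and column name avoids transcription mistakes and makes it explicit which class is counted as positive.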