Your name: Kylie Carr
Fix the code below
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Attempt to prepare data with errors
q1_mpg_data <- tibble(
id = 1:20,
odd = id %% 2 == 1,
manufacturer = rep(c('Audi', 'Ford'), 10),
year = rep(1999:2008, 2),
efficiency = ifelse(manufacturer == "Ford", 20, 30) * id + 5
)
q2_model_info <- tibble(
manufacturer = c('Audi', 'Ford', 'Honda', NA),
model = c('a4', 'f-150', 'civic', 'accord')
)
# Merge and analyze
q1_merged <- q1_mpg_data %>%
filter(odd) %>%
rename(brand = manufacturer) %>%
inner_join(q2_model_info, by = c('brand' = 'manufacturer')) %>%
group_by(year, model) %>%
mutate(efficiency_avg = mean(efficiency))
print(q1_merged)
## # A tibble: 10 × 7
## # Groups: year, model [5]
## id odd brand year efficiency model efficiency_avg
## <int> <lgl> <chr> <int> <dbl> <chr> <dbl>
## 1 1 TRUE Audi 1999 35 a4 185
## 2 3 TRUE Audi 2001 95 a4 245
## 3 5 TRUE Audi 2003 155 a4 305
## 4 7 TRUE Audi 2005 215 a4 365
## 5 9 TRUE Audi 2007 275 a4 425
## 6 11 TRUE Audi 1999 335 a4 185
## 7 13 TRUE Audi 2001 395 a4 245
## 8 15 TRUE Audi 2003 455 a4 305
## 9 17 TRUE Audi 2005 515 a4 365
## 10 19 TRUE Audi 2007 575 a4 425
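A quick check, assuming `q1_mpg_data` is built exactly as above, helps explain why only Audi rows appear in the merged result. Because `manufacturer` alternates Audi, Ford and `id` starts at 1, every odd `id` is an Audi, so `filter(odd)` drops every Ford row before the join. The name `brand_counts` is introduced here only for illustration.

```r
library(dplyr)

# Rebuild the input exactly as defined above
q1_mpg_data <- tibble(
  id = 1:20,
  odd = id %% 2 == 1,
  manufacturer = rep(c('Audi', 'Ford'), 10),
  year = rep(1999:2008, 2),
  efficiency = ifelse(manufacturer == "Ford", 20, 30) * id + 5
)

# Count rows per manufacturer, split by the odd flag
brand_counts <- q1_mpg_data %>%
  count(manufacturer, odd)

print(brand_counts)
```

If Ford rows are wanted in the result, the filter condition or the input data would need to change; the join itself matches both Audi and Ford.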
Complete the following tasks using appropriate R functions and the
mpg dataset from ggplot2. Filter for vehicles with highway mileage (hwy)
above 30. Arrange the result by descending highway mileage. Create a new
field called difference that shows the difference between highway (hwy)
and city mileage (cty). Create a new field called title that shows both
the manufacturer and model (e.g., toyota rav4). Select only the title,
cty, difference, and hwy columns. Use esquisse or ggplot to show the hwy
versus cty mileage in a scatterplot.
# Load the mpg dataset
library(tidyverse)
data(mpg)
t_raw_mpg <- tibble(mpg)
t_clean <- t_raw_mpg %>%
filter(hwy > 30) %>%
arrange(desc(hwy)) %>%
mutate(difference = hwy - cty,
title = paste(manufacturer, model)) %>%
select(title, cty, difference, hwy)
library(ggplot2)
ggplot(t_clean) +
aes(x = hwy, y = cty, colour = title) +
geom_point(size = 2.75) +
scale_color_hue(direction = 1) +
theme_minimal() +
theme(axis.title.y = element_text(size = 16L), axis.title.x = element_text(size = 16L),
axis.text.y = element_text(size = 14L), axis.text.x = element_text(size = 14L))
Create a new tibble using the mpg dataset (do not re-use your one from above).
Create a model predicting if each car is a compact or subcompact (i.e., its class). Calculate accuracy, precision, and recall. Explain which metric would be the most important for an insurance company trying to identify fraudulent claims for further investigation (assume they prioritize not missing any potential frauds).
My Answer: Recall is the most important metric when identifying fraud, assuming the company prioritizes not missing any. Recall is the share of actual frauds that the model flags, TP / (TP + FN), so maximizing it catches as many true positives as possible, even if that means some false positives (legitimate claims sent for investigation).
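As a rough illustration of the fraud scenario, the sketch below uses made-up counts (they are assumptions, not data from any real insurer) to show how a screen can have low precision and still have high recall. All `fraud_` names are hypothetical.

```r
# Hypothetical counts for a fraud screen (illustration only, not real data)
fraud_tp <- 80    # fraudulent claims that were flagged
fraud_fn <- 20    # fraudulent claims that were missed
fraud_fp <- 150   # legitimate claims that were flagged
fraud_tn <- 9750  # legitimate claims that were not flagged

# Recall: share of actual frauds that were caught
fraud_recall <- fraud_tp / (fraud_tp + fraud_fn)

# Precision: share of flagged claims that were actually fraud
fraud_precision <- fraud_tp / (fraud_tp + fraud_fp)

# Accuracy: share of all claims classified correctly
fraud_accuracy <- (fraud_tp + fraud_tn) /
  (fraud_tp + fraud_fn + fraud_fp + fraud_tn)

print(c(recall = fraud_recall,
        precision = fraud_precision,
        accuracy = fraud_accuracy))
```

With these assumed counts, accuracy looks high mainly because most claims are legitimate, while recall is the number that reports how many frauds slipped through.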
# Keep compact and subcompact cars and add the mean of city and highway mileage
t_mpg <- tibble(mpg) %>%
filter(class == 'compact' | class == 'subcompact') %>%
mutate(med_milage = (cty + hwy) / 2) %>%
# select() keeps the needed columns without the deprecated multi-row summarise()
select(manufacturer, model, year, cty, hwy, class, med_milage)
t_model <- t_mpg %>%
mutate(predict_class = ifelse(med_milage >= 22.5, 'compact', 'subcompact'))
table(t_model$predict_class, t_model$class)
##
## compact subcompact
## compact 33 23
## subcompact 14 12
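# Counts come from the table above (rows = predicted, columns = actual),
# treating compact as the positive class
accuracy <- (33 + 12) / (33 + 23 + 14 + 12)
precision <- 33 / (33 + 23)
recall <- 33 / (33 + 14)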
print(paste0('Accuracy is ', accuracy))
## [1] "Accuracy is 0.548780487804878"
print(paste0('Precision is ', precision))
## [1] "Precision is 0.589285714285714"
print(paste0('Recall is ', recall))
## [1] "Recall is 0.702127659574468"
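The three metrics can also be read off the confusion table by name instead of typing the counts by hand. This is a minimal sketch: it copies the counts printed above into a matrix so it runs on its own, and it assumes compact is the positive class, as the hand calculation does. The names `conf_mat`, `positive`, `negative`, and `metrics` are introduced here for illustration.

```r
# Confusion counts copied from the table above
# (rows = predicted class, columns = actual class)
conf_mat <- matrix(
  c(33, 14, 23, 12),  # filled column by column
  nrow = 2,
  dimnames = list(
    predicted = c('compact', 'subcompact'),
    actual = c('compact', 'subcompact')
  )
)

positive <- 'compact'     # assumed positive class
negative <- 'subcompact'

tp <- conf_mat[positive, positive]  # predicted compact, actually compact
fp <- conf_mat[positive, negative]  # predicted compact, actually subcompact
fn <- conf_mat[negative, positive]  # predicted subcompact, actually compact
tn <- conf_mat[negative, negative]  # predicted subcompact, actually subcompact

metrics <- c(
  accuracy = (tp + tn) / sum(conf_mat),
  precision = tp / (tp + fp),
  recall = tp / (tp + fn)
)

print(round(metrics, 3))
```

With the real objects in scope, `conf_mat` could instead be set to `table(t_model$predict_class, t_model$class)`, which has the same row and column layout.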