ACCT 426/BUDA 450

Spring 2024, Exam 1

Your name: Connor Lewis

Question 1 (30%)

Fix the code below

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
margin <- 0.5

ugly_tibble <- tibble(
  id = c(1:5),
  region = c('a', 'b', 'a', 'b', NA),
  sales = c(23, 44, 44, 22, 100) )


# Remove NA values
# Create a profit field using margin
# Sort by profit
# Remove id column
clean_tibble <- ugly_tibble %>% 
  filter(!is.na(region)) %>% 
  mutate(profit = sales * margin) %>% 
  arrange(profit) %>% 
  select(-id)
  
# Find the number of rows per region, as well as the 
# sum of sales
group_tibble <- clean_tibble %>% 
  group_by(region) %>% 
  summarise(n_rows = n(),
            sum_of_sales = sum(sales))

# Use the cor function to find the correlation 
# between sales and profit
cor(clean_tibble$sales, clean_tibble$profit)
## [1] 1
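
The correlation of exactly 1 is expected here, because profit is computed as sales times a positive constant. A minimal sketch of that reasoning follows; the sales values and the noise vector are made up for illustration and are not part of the exam data.

```r
# Hypothetical values, chosen only for illustration
sales  <- c(23, 44, 44, 22)
margin <- 0.5

# profit is an exact positive multiple of sales, so the two are perfectly correlated
profit  <- sales * margin
r_exact <- cor(sales, profit)

# Adding hand-picked noise breaks the exact linear relationship
profit_noisy <- profit + c(1, -2, 3, -1)
r_noisy      <- cor(sales, profit_noisy)

print(r_exact)
print(r_noisy)
```

In other words, a correlation of 1 between sales and profit says nothing about the data beyond how the profit column was constructed.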

Question 2

Complete the following tasks with this tibble.

# Data tibbles used for following tasks
# Do not modify
t <- tibble(
  id = 1:100,
  region = rep(c('WV', 'VA', 'CA', 'TX', 'FL'), 20),
  sales = (1 + runif(100)) ^ 4
)

t[1, ]$sales <- NA
t[2, ]$region <- NA

state_names <- tibble(
  region = c('WV', 'VA', 'CA', 'TX', 'FL'),
  state_name = c('West Virginia', 'Virginia', 'California', 'Texas', 'Florida')
)

Q2a (20 points)

Create a tibble called wv_sales that shows all sales for West Virginia (WV). Place the rows in order by sales. Round sales to the nearest integer (whole number). Exclude any values that are less than 5.

wv_sales <- t %>% 
  filter(region == "WV") %>% 
  arrange(sales) %>% 
  mutate(sales_rounded = round(sales)) %>% 
  filter(sales_rounded >= 5)

print(wv_sales)
## # A tibble: 10 × 4
##       id region sales sales_rounded
##    <int> <chr>  <dbl>         <dbl>
##  1    61 WV      5.73             6
##  2    11 WV      6.36             6
##  3    46 WV      6.73             7
##  4    71 WV      8.29             8
##  5    41 WV      8.47             8
##  6    86 WV      9.03             9
##  7    81 WV      9.31             9
##  8    51 WV     12.8             13
##  9    56 WV     14.1             14
## 10    91 WV     14.9             15
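
Two details of the answer above are worth illustrating: filter() drops rows where the condition evaluates to NA, and filtering on the rounded value can keep rows that filtering on the raw value would exclude. The sketch below uses a small made-up tibble (toy), not the exam data, so both effects are easy to see.

```r
library(dplyr)

# Hypothetical toy data, not the exam tibble
toy <- tibble(id = 1:4, sales = c(NA, 4.4, 4.6, 7.2))

# filter() keeps only rows where the condition is TRUE;
# the NA row is dropped without an explicit is.na() check
kept_raw <- toy %>% 
  filter(sales >= 5)

# Rounding first changes which rows pass: 4.6 rounds to 5 and is kept
kept_rounded <- toy %>% 
  mutate(sales_rounded = round(sales)) %>% 
  filter(sales_rounded >= 5)

print(kept_raw)
print(kept_rounded)
```

Which of the two readings the question intends is a judgment call; the answer above filters on the rounded value.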

Q2b (20 points)

Create a tibble called sales summary that shows grouped sales for each state.

Join it with the state names tibble.

Show the median sales by state name. Be sure to fix any issues with NA values.

sales_summary <- t %>% 
  filter(!is.na(sales), !is.na(region)) %>% 
  group_by(region) %>% 
  summarize(median_sales = median(sales)) %>% 
  left_join(state_names, by = 'region') %>% 
  select(-region)
          
print(sales_summary)
## # A tibble: 5 × 2
##   median_sales state_name   
##          <dbl> <chr>        
## 1         7.85 California   
## 2         5.45 Florida      
## 3         4.34 Texas        
## 4         6.91 Virginia     
## 5         5.73 West Virginia
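
As an aside, missing sales can be handled either by filtering the rows out first, as above, or by passing na.rm = TRUE to median(). The sketch below compares the two on a small made-up tibble; toy and names_toy are hypothetical stand-ins for t and state_names.

```r
library(dplyr)

# Hypothetical toy data standing in for t and state_names
toy <- tibble(
  region = c('WV', 'WV', 'WV', 'VA', 'VA', NA),
  sales  = c(2, NA, 8, 3, 5, 9)
)
names_toy <- tibble(
  region     = c('WV', 'VA'),
  state_name = c('West Virginia', 'Virginia')
)

# Option 1: drop rows with missing sales or region before summarizing
by_filter <- toy %>% 
  filter(!is.na(sales), !is.na(region)) %>% 
  group_by(region) %>% 
  summarize(median_sales = median(sales)) %>% 
  left_join(names_toy, by = 'region')

# Option 2: keep rows with missing sales and let median() skip the NA values
by_na_rm <- toy %>% 
  filter(!is.na(region)) %>% 
  group_by(region) %>% 
  summarize(median_sales = median(sales, na.rm = TRUE)) %>% 
  left_join(names_toy, by = 'region')

print(by_filter)
print(by_na_rm)
```

Both options give the same medians here. The missing region still has to be filtered out in either case, otherwise it would form its own NA group with no matching state name.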

Q2c (10 points)

Use ggplot to create a box plot of the t tibble. Show region as x, and sales as y.

library(ggplot2)
t %>%
  filter(!is.na(region)) %>%
  ggplot(aes(x = region, y = sales)) +
  geom_boxplot() +
  labs(x = "Region", y = "Sales", title = "Region vs Sales")
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
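
The warning comes from the one row whose sales value is NA, which the boxplot drops. A minimal sketch of one way to avoid it, assuming it is acceptable to discard that row before plotting, is shown below on a made-up tibble (toy is hypothetical, not the exam data).

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, not the exam tibble
toy <- tibble(
  region = c('WV', 'WV', 'VA', 'VA', NA),
  sales  = c(NA, 6, 3, 5, 9)
)

# Remove rows with a missing region or missing sales before plotting,
# so the boxplot has nothing left to drop
plot_data <- toy %>% 
  filter(!is.na(region), !is.na(sales))

p <- ggplot(plot_data, aes(x = region, y = sales)) +
  geom_boxplot() +
  labs(x = "Region", y = "Sales", title = "Region vs Sales")

print(p)
```

An alternative is geom_boxplot(na.rm = TRUE), which removes the missing values silently instead of with a warning.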

Q3 (10 points)

Accuracy, precision, and recall differ in significant ways. Which measure would we care most about in each of the following situations?

  1. Investigating fraud. We never want to miss a case, even if we have to investigate a lot of false positives.

Recall

  2. Screening for a minor medical condition. We want to balance false positives and false negatives equally.

Accuracy

  3. Testing before giving a difficult chemotherapy treatment. We never want to give this treatment unless it is absolutely necessary.

Precision
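
For reference, the three measures can be computed from the counts in a confusion matrix. The sketch below uses made-up counts (tp, fp, fn, tn are hypothetical) only to show how the formulas differ.

```r
# Hypothetical confusion-matrix counts, chosen only for illustration
tp <- 40  # true positives
fp <- 10  # false positives
fn <- 5   # false negatives
tn <- 45  # true negatives

# Accuracy: share of all predictions that are correct
accuracy <- (tp + tn) / (tp + fp + fn + tn)

# Precision: share of predicted positives that are truly positive
precision <- tp / (tp + fp)

# Recall: share of actual positives that the model catches
recall <- tp / (tp + fn)

print(c(accuracy = accuracy, precision = precision, recall = recall))
```

Recall is penalized only by false negatives (missed cases), precision only by false positives, and accuracy counts both kinds of error equally.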