Your name: Connor Lewis
Fix the code below
library(tidyverse)
margin <- 0.5
ugly_tibble <- tibble(
  id = 1:5,
  region = c('a', 'b', 'a', 'b', NA),
  sales = c(23, 44, 44, 22, 100)
)
# Remove NA values
# Create a profit field using margin
# Sort by profit
# Remove id column
clean_tibble <- ugly_tibble %>%
  filter(!is.na(region)) %>%
  mutate(profit = sales * margin) %>%
  arrange(profit) %>%
  select(-id)
# Find the number of rows per region, as well as the
# sum of sales
group_tibble <- clean_tibble %>%
  group_by(region) %>%
  summarise(
    n_rows = n(),
    sum_of_sales = sum(sales)
  )
# Use the cor function to find the correlation
# between sales and profit
cor(clean_tibble$sales, clean_tibble$profit)
## [1] 1
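The correlation is exactly 1 because profit is defined as sales times a positive constant, so the two columns are perfectly linearly related. A minimal sketch of that reasoning, reusing the four sales values that survive the NA filter (the names `r_pos` and `r_neg` are introduced here for illustration only):

```r
# Sales values from clean_tibble after the NA row is removed
sales <- c(23, 44, 44, 22)
margin <- 0.5

# profit is a positive multiple of sales, so the relationship is perfectly linear
profit <- sales * margin

r_pos <- cor(sales, profit)   # positive multiple
r_neg <- cor(sales, -profit)  # flipping the sign flips the direction of the relationship
```

Any positive margin gives the same result, which is why this correlation says nothing about the data beyond how profit was constructed.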
Complete the following tasks with this tibble.
# Data tibbles used for following tasks
# Do not modify
t <- tibble(
  id = 1:100,
  region = rep(c('WV', 'VA', 'CA', 'TX', 'FL'), 20),
  sales = (1 + runif(100)) ^ 4
)
t[1, ]$sales <- NA
t[2, ]$region <- NA
state_names <- tibble(
  region = c('WV', 'VA', 'CA', 'TX', 'FL'),
  state_name = c('West Virginia', 'Virginia', 'California', 'Texas', 'Florida')
)
Create a tibble called WV sales that shows all sales for that state. Place the rows in order by sales. Round sales to the nearest integer (whole number). Exclude any values that are less than 5.
wv_sales <- t %>%
  filter(region == "WV") %>%
  arrange(sales) %>%
  mutate(sales_rounded = round(sales)) %>%
  filter(sales_rounded >= 5)
print(wv_sales)
## # A tibble: 10 × 4
## id region sales sales_rounded
## <int> <chr> <dbl> <dbl>
## 1 61 WV 5.73 6
## 2 11 WV 6.36 6
## 3 46 WV 6.73 7
## 4 71 WV 8.29 8
## 5 41 WV 8.47 8
## 6 86 WV 9.03 9
## 7 81 WV 9.31 9
## 8 51 WV 12.8 13
## 9 56 WV 14.1 14
## 10 91 WV 14.9 15
Create a tibble called sales summary that shows grouped sales for each state.
Join it with the state names tibble.
Show the median sales by state name. Be sure to fix any issues with NA values.
sales_summary <- t %>%
  filter(!is.na(sales), !is.na(region)) %>%
  group_by(region) %>%
  summarize(median_sales = median(sales)) %>%
  left_join(state_names, by = 'region') %>%
  select(-region)
print(sales_summary)
## # A tibble: 5 × 2
## median_sales state_name
## <dbl> <chr>
## 1 7.85 California
## 2 5.45 Florida
## 3 4.34 Texas
## 4 6.91 Virginia
## 5 5.73 West Virginia
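The answer above handles NA values by filtering out incomplete rows before grouping. An alternative, sketched here on a small hypothetical tibble (`toy`, `toy_names`, and `toy_summary` are illustrative names, not part of the assignment), keeps rows with missing sales and lets `median()` skip them through `na.rm = TRUE`:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not the assignment's random values
toy <- tibble(
  region = c("WV", "WV", "VA", "VA", NA),
  sales  = c(4, NA, 10, 20, 7)
)
toy_names <- tibble(
  region     = c("WV", "VA"),
  state_name = c("West Virginia", "Virginia")
)

toy_summary <- toy %>%
  filter(!is.na(region)) %>%   # a missing region cannot be matched to a state name
  group_by(region) %>%
  summarise(median_sales = median(sales, na.rm = TRUE)) %>%  # ignore missing sales inside the median
  left_join(toy_names, by = "region") %>%
  select(state_name, median_sales)
```

Both approaches should give the same medians; `na.rm = TRUE` is mainly useful when the same summary also computes columns that should still count the rows with missing sales.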
Use ggplot to create a box plot of the t tibble. Show region as x, and sales as y.
library(ggplot2)
t %>%
  filter(!is.na(region)) %>%
  ggplot() +
  aes(x = region, y = sales) +
  geom_boxplot() +
  labs(x = "Region", y = "Sales", title = "Region vs Sales")
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
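The warning comes from the one row whose sales value was set to NA, which the region filter does not remove. One way to avoid it, sketched on a small hypothetical tibble (`toy`, `plot_data`, and `p` are illustrative names), is to drop missing sales before plotting:

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical toy data with one missing sales value and one missing region
toy <- tibble(
  region = c("WV", "VA", "CA", NA, "WV"),
  sales  = c(NA, 3.2, 7.5, 4.1, 6.0)
)

# Remove rows that are missing either plotted variable
plot_data <- toy %>%
  filter(!is.na(region), !is.na(sales))

# Build the box plot from complete rows only, so nothing is silently dropped
p <- ggplot(plot_data, aes(x = region, y = sales)) +
  geom_boxplot() +
  labs(x = "Region", y = "Sales", title = "Region vs Sales")
```

Filtering explicitly makes the exclusion visible in the code rather than leaving it to the plotting layer.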
There are significant differences between accuracy, precision, and recall. Which measure would we care most about in each of the following situations?
Recall
Accuracy
Precision
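As a reminder of how the three measures differ, here is a minimal sketch that computes each one from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical confusion-matrix counts
tp <- 40  # true positives: predicted positive, actually positive
fp <- 10  # false positives: predicted positive, actually negative
fn <- 5   # false negatives: predicted negative, actually positive
tn <- 45  # true negatives: predicted negative, actually negative

# Share of all predictions that were correct
accuracy <- (tp + tn) / (tp + fp + fn + tn)

# Of everything predicted positive, the share that really was positive
precision <- tp / (tp + fp)

# Of everything actually positive, the share that was found
recall <- tp / (tp + fn)
```

Recall matters most when missing a true positive is costly, precision when a false alarm is costly, and accuracy when both kinds of error matter about equally and the classes are reasonably balanced.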