Assignment 3 Cheddar Makes You Deader

Author

Shanna Dubay

Introduction

I looked into the correlation and causation of a few different sets of data points before this course. I had taken a few courses and programs designed for regression and causation analysis. I also follow David McCandless who makes wonderful visualizations that deal with these phenomenons of correlation that do not have definitive causation.

When looking into the data sets of deaths and cheese, I was able to spot a column that was a good joining column of “year”. I knew right away I would follow through with a time series since that seemed the most obvious choice.

I wanted to find similar trends that would help correlate the data sets to greater than an 80% positive connection strength. I noticed early on that mozzarella and cheddar had higher productions consistantly through the years and since we can not use mozzarella I chose to work with cheddar.

Code

library(tidyverse)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")
filtered_out_deaths <- deaths|> filter (Sex == "Both sexes")
filtered_out_deaths <- filtered_out_deaths|> filter (filtered_out_deaths$Race == "All races")
filtered_out_deaths <- filtered_out_deaths|> filter (filtered_out_deaths$`Age group (years)` == "All Ages")
filtered_out_deaths <- filtered_out_deaths|> filter (filtered_out_deaths$`Injury mechanism` == "Cut/pierce")
filtered_out_deaths <- filtered_out_deaths|> filter (filtered_out_deaths$`Injury intent` == "All Intentions")
merged_sets <- filtered_out_deaths |>
select(Year, Deaths, `Injury intent`, `Injury mechanism`) |>
inner_join(cheese, by = c("Year" = "year"))
filtered_merged_sets <- merged_sets |>
mutate(Normalized_Deaths = (Deaths - min(Deaths, na.rm = TRUE))/
(max(Deaths, na.rm = TRUE) - min(Deaths, na.rm = TRUE)),
Normalized_Cheddar = (cheddar - min(cheddar, na.rm = TRUE))/
(max(cheddar, na.rm = TRUE) - min(cheddar, na.rm = TRUE)))
plot_data_selected_years<-filtered_merged_sets |> 
filter(Year %in% c(2001, 2006, 2008, 2009, 2010, 2016))
correlation_value <- cor(plot_data_selected_years$Normalized_Deaths, plot_data_selected_years$Normalized_Cheddar)
ggplot(plot_data_selected_years) +
geom_point(aes(x = Year, y = Normalized_Deaths, color = "Deaths"), size = 4, alpha = 0.6) +
geom_point(aes(x = Year, y = Normalized_Cheddar, color = "Cheddar Production"), size = 4, alpha = 0.6) +
geom_smooth(aes(x = Year, y = Normalized_Deaths), method = "loess", se = FALSE, color = "gold") +
geom_smooth(aes(x = Year, y = Normalized_Cheddar), method = "loess", se = FALSE, color = "darkgreen") +
scale_color_manual(values = c("Deaths" = "darkgreen", "Cheddar Production" = "gold")) +
labs(title = "Cheddar Makes You Deader!", subtitle = paste("Correlation Coefficient (r) =", round(correlation_value, 3)), x = "Year", y = "Scaled Value", color = "Legend") + theme_minimal()

Analysis and Reflections

I was able to filter out the data category by category until I pulled the data points that made the most sense for the assignment. For this visualization I chose the scatter plot with points to reflect all the places both sets of data lined up. I started with a line plot but combining a smoothing line made them very hard to read. I was able to remove years that did not line up well with the two sets and this left me with a coefficient of 93%. Finally, I went with dark green and gold colors to represent the upcoming St. Patty’s Day.

The visual could easily mislead viewers to believe that during the last century the amount of deaths were clearly connected to the amount of cheddar cheese production. This visual would be spurious since there are missing data points in the years and only the mechanism of cut/pierce was used for deaths. I feel this is not an ethical choice as it is not giving all the data available to allow the viewers to make the right connections about cheddar cheese and deaths.