library(tidyverse)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")

Two things that seem related do not necessarily cause one another. To show how statistics can be misused, this project examines a spurious correlation between Swiss cheese consumption and injury deaths. The data can be made to suggest that eating more Swiss cheese causes more injury deaths, but that conclusion is wrong. The project demonstrates how data can make unrelated things appear connected and why graphs should be interpreted with caution.
cheese_data <- read_csv("https://jsuleiman.com/datasets/cheese.csv", show_col_types = FALSE)
swiss_cheese <- cheese_data |>
  select(year, swiss)
print("Columns in swiss_cheese:")
[1] "Columns in swiss_cheese:"
print(names(swiss_cheese))
[1] "year"  "swiss"
mortality_data <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv", show_col_types = FALSE)
filtered_deaths <- mortality_data |>
  filter(`Sex` == "Both sexes",
         `Age group (years)` == "All Ages",
         `Race` == "All races",
         `Injury mechanism` == "All Mechanisms",
         `Injury intent` == "All Intentions") |>
  group_by(Year) |>
  summarize(total_deaths = sum(Deaths, na.rm = TRUE)) |>
  ungroup() |>
  rename(year = Year)
print("Columns in filtered_deaths:")
[1] "Columns in filtered_deaths:"
print(names(filtered_deaths))
[1] "year"         "total_deaths"
combined_data <- swiss_cheese |>
  inner_join(filtered_deaths, by = "year")
print("Merged dataset preview:")
[1] "Merged dataset preview:"
print(combined_data)
# A tibble: 18 × 3
    year swiss total_deaths
   <dbl> <dbl>        <dbl>
 1  1999  1.09       148286
 2  2000  1.02       148209
 3  2001  1.12       157078
 4  2002  1.09       161269
 5  2003  1.13       164002
 6  2004  1.2        167184
 7  2005  1.24       173753
 8  2006  1.23       179065
 9  2007  1.24       182479
10  2008  1.1        181226
11  2009  1.16       177154
12  2010  1.18       180811
13  2011  1.14       187464
14  2012  1.09       190385
15  2013  1          192945
16  2014  1.02       199752
17  2015  1.05       214008
18  2016  1.06       231991
set.seed(321)
combined_data <- combined_data |>
  mutate(adjusted_mortality = swiss * 10 + rnorm(nrow(combined_data), mean = 0, sd = 0.5))
correlation_value <- cor(combined_data$swiss, combined_data$adjusted_mortality, use = "complete.obs")
ggplot(combined_data, aes(x = swiss, y = adjusted_mortality)) +
  geom_point(size = 3, color = "darkgreen") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "purple") +
  labs(
    title = "Exploring the Curious Correlation: Swiss Cheese vs. Injury Mortality",
    subtitle = paste("Correlation coefficient (r):", round(correlation_value, 2)),
    x = "Swiss Cheese Consumption (lbs per capita)",
    y = "Modified Injury Intent Mortality Rate"
  ) +
  theme_minimal()

The graph suggests that eating Swiss cheese is linked to injury deaths, but the link is not real. A correlation coefficient above 0.8 means the two plotted values rise and fall together; it does not prove that one causes the other.
I created this graph by plotting Swiss cheese consumption against a modified injury mortality value. To strengthen the connection, I built that value as a calculated field: the Swiss cheese numbers multiplied by 10, plus random noise. Because the field is derived from the cheese numbers themselves, the points fall close to a straight line. To give the relationship a realistic appearance, I also included a trend line.
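The construction described above can be checked with a short sketch. It assumes the rounded `swiss` and `total_deaths` values shown in the merged dataset preview, so results may differ slightly from the full-precision data. The names `r_constructed`, `r_actual`, and `r_scaled_only` are illustrative and not part of the original analysis.

```r
# Values copied from the merged dataset preview (rounded as printed).
swiss <- c(1.09, 1.02, 1.12, 1.09, 1.13, 1.20, 1.24, 1.23, 1.24,
           1.10, 1.16, 1.18, 1.14, 1.09, 1.00, 1.02, 1.05, 1.06)
total_deaths <- c(148286, 148209, 157078, 161269, 164002, 167184,
                  173753, 179065, 182479, 181226, 177154, 180811,
                  187464, 190385, 192945, 199752, 214008, 231991)

# Rebuild the calculated field: cheese values scaled by 10 plus random noise.
set.seed(321)
adjusted_mortality <- swiss * 10 + rnorm(length(swiss), mean = 0, sd = 0.5)

# Correlation with the calculated field, which is derived from swiss itself.
r_constructed <- cor(swiss, adjusted_mortality)

# Correlation with the actual death counts, for comparison.
r_actual <- cor(swiss, total_deaths)

# Without any noise, scaling alone gives a perfect correlation.
r_scaled_only <- cor(swiss, swiss * 10)

print(round(c(constructed = r_constructed,
              actual = r_actual,
              scaled_only = r_scaled_only), 2))
```

Comparing `r_constructed` with `r_actual` shows how much of the apparent relationship comes from the calculated field rather than from the death counts.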
Although the graph's design is convincing, there is no real explanation for why cheese could be harmful. In this case, a matching pattern does not mean cause and effect. Without the full context, someone could assume that eating cheese is dangerous, when in fact the match is an artifact of how the numbers were prepared.
Graphs like this one can mislead people into believing false claims. If this graph were presented without any explanation, people might truly believe that cheese is connected to injuries. Deceptive graphs of this kind have been used in studies, journalism, and advertisements to persuade people of things that are not true. That is why data analysts need to be truthful and make sure their graphs do not mislead.
AI helped me complete this task. At first, I wasn't sure how to make the link between eating Swiss cheese and injury deaths seem more convincing. AI showed me that a calculated field built with multiplication and random noise produces a higher correlation. After I made the change, the correlation value went above 0.8, meeting the assignment's requirements.
I also learned that even when a relationship isn't real, small changes to the data can make a graph seem more credible. AI also helped me simplify my writing by suggesting clearer explanations of how misleading graphs can deceive readers, which made my points easier to follow. In future assignments, I'll be careful when looking at correlations because I understand how small changes in data can lead to misinformation.
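How much a small change in the noise level shifts the correlation can be sketched with a standard result: if y = a * x + noise, with the noise independent of x, the expected correlation is a * sd(x) / sqrt(a^2 * var(x) + noise_sd^2). The helper `expected_r` is a hypothetical name used only for illustration, and the `swiss` values are the rounded ones from the merged dataset preview.

```r
# Values copied from the merged dataset preview (rounded as printed).
swiss <- c(1.09, 1.02, 1.12, 1.09, 1.13, 1.20, 1.24, 1.23, 1.24,
           1.10, 1.16, 1.18, 1.14, 1.09, 1.00, 1.02, 1.05, 1.06)

# Hypothetical helper: expected correlation between x and a * x + noise,
# where the noise is independent of x and has standard deviation noise_sd.
expected_r <- function(x, a, noise_sd) {
  signal_sd <- abs(a) * sd(x)
  sign(a) * signal_sd / sqrt(signal_sd^2 + noise_sd^2)
}

# Try several noise levels around the sd = 0.5 used in the analysis.
noise_levels <- c(0, 0.25, 0.5, 1, 2)
r_by_noise <- sapply(noise_levels,
                     function(s) expected_r(swiss, a = 10, noise_sd = s))

print(data.frame(noise_sd = noise_levels, expected_r = round(r_by_noise, 2)))
```

A single sample of 18 points will not match these expected values exactly, but the pattern holds: the less noise added to a scaled copy of the cheese numbers, the closer the correlation gets to 1.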