Code
library(tidyverse)
library(corrr)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")library(tidyverse)
library(corrr)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")Correlation is not causation, however audiences often think of correlations in the context of causation, even unintentionally. We have to be mindful to be aware that causation may exist, but causation isn’t implied merely because two variables correlate with each other. This exercise is to show that strong correlation can exist between two completely unrelated variables, and for this demonstration, I am using Swiss cheese and drowning deaths of youths aged 15-24.
cheese_cloth <- cheese |> #cheese cloth filters
select(!c(blue,mozzarella))#CGPT to help with simple rename calls, since rename() was not working for me
names(deaths)[5] <- "Injury_mechanisms"
names(deaths)[6] <- "Injury_intent"
names(deaths)[3] <- "Age_group"#filter for data exploration. used in conjunction with a now deleted plotting to see how each filtered aspect looked
deaths_filtered <- deaths |>
filter(Age_group != "All Ages", Sex == "Both sexes", Race == "All races", Injury_mechanisms != "All Mechanisms", Injury_intent == "All Intentions")#data filtering and joining
deaths_drowning <- deaths_filtered |>
filter(Injury_mechanisms == "Drowning", Age_group == "15–24")
swiss_cheese <- cheese_cloth |>
filter(year >= 1999) |>
select(year, swiss)
swiss_cheese <- swiss_cheese |>
rename(Year = year, Swiss = swiss)
swiss_waterboarding <- full_join(swiss_cheese, deaths_drowning, by = "Year") #R for Data Science, CGPT to confirm full_join by syntax
swiss_waterboarding <- swiss_waterboarding |>
mutate(deathsPerCap = Deaths / Population*75000, .after = Deaths) #deaths per 75000 peopleThe cheese data set was clean and straight forward initially, but had a significant amount of missing values under blue cheese. The only major modification I kept was to select out mozzarella and blue cheese. Initially, I had pivoted the data set to be longer in order to plot them all together to understand their trends, but because the cheddar and neufchatel existed much higher on the scale, that didn’t prove very fruitful. Instead, I just had a temporary line plot to visualize the shape of the cheese consumption.
The deaths data set was a bit more untidy. I had some difficulties with the column headers not recognizing my adjustments- there must have been some white space or special characters I couldn’t see- so I forced a few new column headers to manage the data tidying process. I initially explored the inclusive all age groups for injury mechanisms, but found that I liked the shape of the 15-24 age group for drowning incidents. Like with the cheese data set, I had a temporary plot to peak through the data.
The assignment called for a trend line comparison between the two variables, so I wanted to make sure that was included, and wanted to be sure to show the coefficient on this plotting. For this visualization, I wanted a clean and simple graph with a linear trend line.
For the second visualization, I wanted to mirror the style of Tyler Vigen’s Spurious Correlations as much as possible. I was actually already aware of this website due to seeing the correlation of margarine consumption and divorce rates in Maine. Initially, despite having similar shapes, the plotting of the two lines didn’t sufficiently overlap, nor was there significant correlation, resting in the .50 range. I tried mutation to z-score normalize, but that didn’t solve the issue and the scaling of the y-axis was completely off. I then realized that since cheese consumption was measured in per capita consumption, it would only make sense to measure deaths per capita as well, and this achieved most of the goals.
I also did trim off a few years on either end to remove completely differing data and NA values- this correlation is definitely SPURIOUS!
#this code calculates correlation, and then stretches to a pivot-longer style DF with x,y,r values
corr_cheese <- swiss_waterboarding |>
filter(Year>2002)|>
select(Swiss,deathsPerCap)|>
correlate(diagonal = 1)
stretched_cheese <- corr_cheese |> #cheese has been stretched
stretch()
stretched_cheese[2,]# A tibble: 1 × 3
x y r
<chr> <chr> <dbl>
1 Swiss deathsPerCap 0.875
library(ggtext)
swiss_waterboarding |>
filter(Year >2002) |>
filter(Year < 2017) |>
ggplot(aes(x= deathsPerCap, y = Swiss)) +
geom_point() +
geom_smooth(method = "lm")+
labs(
title = "**Youth Drowning Deaths Correlate
with Swiss Cheese Consumption**",
subtitle= "Youths 15-24 at risk",
y = "Swiss Cheese Consumption per Capita",
x = "Youth Deaths per 75k population"
)+
annotate("text", x = 1.175, y = 1.1, label = paste0("r = ", round(stretched_cheese$r[2], 3)), fontface = "bold", color ="blue", family = "mono")+ #CGPT to properly paste
theme_minimal()+
theme(panel.grid = element_blank(),
plot.title = element_markdown()) #CGPT, I had theme_minimal overwriting panel.grid #ggplot reference for ggtextswiss_waterboarding |>
filter(Year > 2002)|>
filter(Year < 2017)|>
ggplot(aes(x = Year, y = deathsPerCap))+
geom_point(shape = 15)+
geom_point(aes(x=Year, y = Swiss), color = "red4")+ #CGPT, pull color out of aes
geom_line(linetype = 2)+
geom_line(
aes(x=Year, y= Swiss), color = "red4")+
scale_y_continuous(
sec.axis= sec_axis(~., name = "Cheese Consumption")
)+
labs(
title = "**Youth ages 15-24 drowning deaths** <br> correlate with <br> <span style='color:red4;'>**Swiss Cheese Consumption**</span>", #CGPT to find this code to modify title
x = "",
y = "Deaths",
caption = "Cheese consumption in lb per capita; Deaths per 75k in population"
)+
theme_minimal()+
theme(plot.title = element_markdown(hjust = 0.5), #CGPT to fix title position
panel.grid.major.x = element_blank(),
axis.title.y.right = element_text(color = "red4"),
axis.text.y.right = element_text(color = "red4"), #CGPT for color. how do you remember all of these!
axis.ticks.x.bottom = element_line(linewidth = 1),
axis.line.x.bottom = element_line(),
panel.grid.minor.x = element_blank())Both of these visualizations could be very misleading to viewers. For one, the initial visualization shows a linear trend line, but its margin of error is pretty large. A loess model would show that the data points aren’t exactly linear. The subsequent double y-axis plot is misleading for a number of reasons: it cuts off a few prior years where the data deviates much more significantly from each other; it also measures deaths per 75k population. While the deaths per capita correlation itself seems unaffected by the population (it’s just moving the decimal point on the deaths per capita variable), the 75k population is completely arbitrary to make the two lines overlap as much as possible. A reader, especially a less savvy one, might not realize this.
The correlation between Swiss cheese consumption and youth drowning deaths may be relatively high for these select years, but unless these youths are filling up on Swiss cheese less than an hour before swimming, there of course is absolutely no causal relationship. When there is a data set that involves many variables, some may correlate, and some may even have a casual relationship - this information is often what we’re chasing- but this is a good exercise in being mindful and judicious in our decision making.
In the course of fun, I think the ethical implications of the specific choices I made to visualize these completely unrelated data sets are minimal; but if this were real data intended for a serious audience, then these visualizations only serve to distort the truth. This exercise in particular highlighted the dangers of visualizations that include two y-axes.
R for Data Science and the ggplot Tidyverse website were instrumental in some of the data transformations and plotting, and were the first stop in understanding which code to apply in order to achieve results. ChatGPT was used primarily for troubleshooting and to achieve final touches.
ChatGPT Session link below: Note, the initial session had an issue with creating a link. It appears I mistakenly added an image and that prevented the session from exporting to a link. I included that session transcript in an attached file and a screenshot of the “image” in question. I was having difficulty plotting with the age group brackets because the the variables used an endash not a hyphen. Copy/pasting a cell from excel caused CGPT to read it as an image.
https://chatgpt.com/share/67d64039-e058-800d-acf1-c49947ff38d3
Tidyverse ggplot2 reference
https://ggplot2.tidyverse.org/reference/index.html
Wickham, Hadley et al. R for Data Science: Import, Tidy, Transform, Visualize and Model Data. O’Reilly. 2023