Code
library(tidyverse)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")Correlation is a measure of the strength of the linear association between two quantitative variables. When visualizing the relationships between observations in data, it is important to understand when these relationships are coincidental; not all correlated variables have a true cause-and-effect relationship.
We can demonstrate this logical fallacy by visualizing a correlation between annual US cheese consumption per capita and US annual deaths by injury mechanism. My visualization uses a scatterplot and trendline to imply a relationship between Hispanic cheese consumption rates and deaths by poisoning.
library(tidyverse)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")I first filtered the deaths dataset by injury mechanism, age group, race, injury intent, and sex, selected for year, injury mechanism, and death count, and used pivot_wider() to increase the number of columns in the new dataset.
deaths2 <- deaths |> filter(`Injury mechanism` != "All Mechanisms" & `Age group (years)` == "All Ages" & Race == "All races" & `Injury intent` == "All Intentions" & Sex == "Both sexes") |>
select(Year, `Injury mechanism`, Deaths) |>
rename(year = Year, death_type = `Injury mechanism`, death_count = Deaths) |>
pivot_wider(names_from = death_type, values_from = death_count)I then selected for year and ‘Hispanic’ cheese type, joined the results to the deaths2 dataset, and constructed the correlation matrix using cor().
hispcheesedeaths <- cheese |>
select(year, hispanic) |>
inner_join(deaths2, by = "year")
hispcheesedeaths |>
select(-year) |>
cor() hispanic Cut/pierce Drowning Fall
hispanic 1.0000000 0.37086174 0.55441499 0.9610017
Cut/pierce 0.3708617 1.00000000 0.17063452 0.2206411
Drowning 0.5544150 0.17063452 1.00000000 0.5567061
Fall 0.9610017 0.22064112 0.55670608 1.0000000
Fire/hot object or substance -0.8449676 -0.09090047 -0.53402721 -0.8667681
Firearm 0.8357115 0.22761514 0.64881514 0.9257407
Motor vehicle traffic -0.7126225 0.22880203 -0.36438044 -0.7506909
All Other Transport -0.6559104 -0.31775277 -0.44540194 -0.6067667
Poisoning 0.9463094 0.30911141 0.63334924 0.9791790
Suffocation 0.9458343 0.17177657 0.54091944 0.9938990
All Other Specified -0.4557235 -0.16056723 -0.01576699 -0.4647084
Unspecified -0.4610547 -0.14550347 0.05042755 -0.3734979
Fire/hot object or substance Firearm
hispanic -0.84496765 0.8357115
Cut/pierce -0.09090047 0.2276151
Drowning -0.53402721 0.6488151
Fall -0.86676812 0.9257407
Fire/hot object or substance 1.00000000 -0.7172177
Firearm -0.71721766 1.0000000
Motor vehicle traffic 0.88872560 -0.5418749
All Other Transport 0.57250666 -0.4904084
Poisoning -0.78997216 0.9589757
Suffocation -0.86387501 0.9113808
All Other Specified 0.52456683 -0.3167888
Unspecified 0.52310835 -0.1104524
Motor vehicle traffic All Other Transport
hispanic -0.7126225 -0.6559104
Cut/pierce 0.2288020 -0.3177528
Drowning -0.3643804 -0.4454019
Fall -0.7506909 -0.6067667
Fire/hot object or substance 0.8887256 0.5725067
Firearm -0.5418749 -0.4904084
Motor vehicle traffic 1.0000000 0.4476599
All Other Transport 0.4476599 1.0000000
Poisoning -0.6450346 -0.5999390
Suffocation -0.7582158 -0.5886872
All Other Specified 0.4785087 0.3578379
Unspecified 0.5946724 0.5042059
Poisoning Suffocation All Other Specified
hispanic 0.9463094 0.9458343 -0.45572346
Cut/pierce 0.3091114 0.1717766 -0.16056723
Drowning 0.6333492 0.5409194 -0.01576699
Fall 0.9791790 0.9938990 -0.46470837
Fire/hot object or substance -0.7899722 -0.8638750 0.52456683
Firearm 0.9589757 0.9113808 -0.31678878
Motor vehicle traffic -0.6450346 -0.7582158 0.47850867
All Other Transport -0.5999390 -0.5886872 0.35783788
Poisoning 1.0000000 0.9654581 -0.40415779
Suffocation 0.9654581 1.0000000 -0.47571858
All Other Specified -0.4041578 -0.4757186 1.00000000
Unspecified -0.2982013 -0.3468012 0.52552447
Unspecified
hispanic -0.46105471
Cut/pierce -0.14550347
Drowning 0.05042755
Fall -0.37349794
Fire/hot object or substance 0.52310835
Firearm -0.11045238
Motor vehicle traffic 0.59467240
All Other Transport 0.50420589
Poisoning -0.29820134
Suffocation -0.34680116
All Other Specified 0.52552447
Unspecified 1.00000000
When viewing the correlation matrix in hispcheesedeaths, I saw a strong correlation coefficient of r = 0.9463094 between Hispanic cheese consumption and poisoning as an injury mechanism. E.coli and listeria contamination in fresh cheeses is an actual concern, so I chose these observations for my visualization.
I created a scatterplot to visualize this correlation, and added a trendline using geom_smooth(). A strong positive correlation is apparent in the visual - but is Hispanic cheese consumption really the cause of an increase in poisoning deaths?
hispcheesedeaths |>
ggplot(aes(x = Poisoning, y = hispanic)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, col = "orange") +
labs(title = "Murderous Manchego: Is There a Poison Panic over Hispanic Cheese?") +
xlab("United States annual poisoning deaths") +
ylab("Annual per capita Hispanic cheese consumption (lbs)")While the trendline indicates a strong positive correlation between Hispanic cheese consumption and poisoning deaths in the US, ‘falls’ and ‘firearms’ as injury mechanisms also had correlation coefficients over 0.8 with Hispanic cheese consumption. What we can only infer from the correlation matrix that Hispanic cheese consumption is rising, along with the recording of deaths due to poisoning, falls, and firearms - not that Hispanic cheeses are responsible for these deaths.
Before sounding the alarm on killer cheeses (or other spurious relationships between observations), it is important to first confirm that the correlation is real; any potential causative relationships between observations must then be systematically explored through experimentation.