Assignment 3

Author

Justin Vachon

Introduction

This project is an example of correlation, which is when two variables move together. Even though cheese consumption increases alongside the number of suicides, that does not mean one affects the other; the only similarity is that both increase over the same years. Causation is when one variable directly affects another, for example, time spent studying and exam score. I chose these variables because they have absolutely nothing to do with each other, yet at first glance you would think they have a direct link, which makes them a great example for this project about correlation vs. causation.

Relationship Visualization

library(tidyverse)
Cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
Deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")
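# Lowercase the column names of both data sets so they can be referenced consistently
colnames(Cheese) <- tolower(colnames(Cheese))
colnames(Deaths) <- tolower(colnames(Deaths))

# Keep only the overall suicide totals (all ages, both sexes, all races, all mechanisms)
deaths_filtered <- Deaths %>%
  filter(
    `injury intent` == "Suicide",
    `age group (years)` == "All Ages",
    sex == "Both sexes",
    race == "All races",
    `injury mechanism` == "All Mechanisms"
  )

# Join cheese consumption to the filtered suicide deaths by year
merged_data <- merge(Cheese, deaths_filtered, by = "year")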

merged_data |>
  ggplot(aes(x = muenster, y = deaths)) + 
  geom_point() + 
  geom_smooth(formula = y ~ x, method = "lm") + 
  labs(x = "Muenster Consumption (pounds) Per-Capita", y = "Deaths Per Year by Suicide", title = "Does Muenster Increase the Rate of Suicide?")

Correlation Coefficient

correlation_coefficient <- cor(merged_data$muenster, merged_data$deaths)
print(correlation_coefficient)
[1] 0.955551
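
A coefficient this high can come from nothing more than two series rising over the same years. The sketch below illustrates that idea with simulated numbers, not the cheese or mortality data: `series_a` and `series_b` are hypothetical variables that each follow their own upward trend plus independent random noise. Comparing the correlation of the raw values with the correlation of the year-to-year changes is one simple way to see how much of the relationship is just the shared trend.

```r
# Simulated illustration only: these are NOT the cheese or mortality data.
set.seed(42)

year <- 2000:2009

# Two hypothetical series that both rise over time but are otherwise unrelated:
# each is a straight-line trend plus its own independent random noise.
series_a <- 0.1 * (year - 1999) + rnorm(10, sd = 0.02)
series_b <- 100 * (year - 1999) + rnorm(10, sd = 20)

# Correlation of the raw values is dominated by the shared upward trend.
r_levels <- cor(series_a, series_b)

# Correlation of the year-to-year changes removes the constant trend,
# leaving only the independent noise to be compared.
r_changes <- cor(diff(series_a), diff(series_b))

print(r_levels)
print(r_changes)
```

If the correlation of the changes is much weaker than the correlation of the raw values, the strong relationship in the raw values is probably driven mostly by both series trending in the same direction.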

Analysis and Reflection

  1. Technical Implementation

    My data preparation process for this project involved a lot of trial and error, mainly because we needed a correlation coefficient greater than 0.8. I originally wanted to compare cheese consumption to the homicide rate, but none of the cheeses met the correlation coefficient requirement. Still, I think this is a great example of how things can correlate just by accident or by luck. I chose a linear trend line because I think it sends the “message” better, especially when you’re trying to convince someone cheese has something to do with suicide rates.

  2. Critical Analysis

    I think this visualization can mislead viewers because the fit looks so good. If there were no titles, anyone would think there was some sort of causation between x and y. This is a spurious relationship because, in reality, the consumption of cheese, specifically Muenster, does not have anything to do with suicide rates; at least, there’s no research showing that it does. Graphs with strong correlation but no causation can be extremely dangerous because they give people the wrong idea and spread misinformation.
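
The trial-and-error search described under Technical Implementation could be sketched roughly as follows. This is a minimal illustration on a made-up data frame: `toy`, `cheese_a`, and `cheese_b` are hypothetical names with simulated values, standing in for a merged table with one column per cheese. It computes the correlation of every candidate column with `deaths` and lists the ones that pass the 0.8 threshold. Checking many columns this way also shows why a strong correlation can turn up by luck: the more candidates are tried, the more likely one of them clears the bar.

```r
# Simulated illustration only: column names and values are hypothetical.
set.seed(1)

toy <- data.frame(
  year     = 2000:2009,
  cheese_a = seq(1, 2, length.out = 10) + rnorm(10, sd = 0.02),  # trends upward
  cheese_b = rnorm(10),                                           # no trend at all
  deaths   = seq(100, 200, length.out = 10) + rnorm(10, sd = 3)   # trends upward
)

# Every column except the key and the outcome is a candidate predictor.
candidates <- setdiff(names(toy), c("year", "deaths"))

# Correlation of each candidate column with the outcome.
cors <- sapply(toy[candidates], function(col) cor(col, toy$deaths))

# Candidates ranked from strongest positive correlation downward.
print(sort(cors, decreasing = TRUE))

# Candidates whose correlation exceeds the 0.8 requirement in absolute value.
passing <- names(cors)[abs(cors) > 0.8]
print(passing)
```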