Assignment 3

Author

Azam Jiwa

Introduction

In this assignment, I explore the difference between correlation and causation by examining the relationship between cheese consumption and poisoning deaths in the United States. Correlation is a statistical association between two variables, while causation means that a change in one variable produces a change in the other. The project illustrates how a correlation can be spurious: cheese consumption and poisoning deaths may appear to move together, but there is no causal relationship between them.

The variables I chose to examine are cheese consumption (specifically, cheddar cheese and Hispanic cheese) from the cheese dataset and poisoning deaths from the injury mortality dataset. The analysis shows how these correlations can appear to suggest a relationship even though there is no real-world causal link.

Relationship Visualization

A. Primary Visualization

The first visualization plots cheddar cheese consumption against poisoning deaths. Despite the apparent correlation in the plot, the correlation coefficient for cheddar cheese and poisoning deaths is only 0.236, which is weak. I then compared the other cheese types and found that Hispanic cheese consumption has a much stronger correlation with poisoning deaths, at 0.946. The scatter plots below show both relationships, each with a fitted regression line to highlight the trend.

Code
suppressMessages(library(tidyverse))

cheese <- suppressMessages(read_csv("https://jsuleiman.com/datasets/cheese.csv"))
deaths <- suppressMessages(read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv"))

# Keep totals across all sexes, ages, races, and intents, broken out by injury mechanism
deaths_filtered <- deaths %>%
  filter(Sex == "Both sexes", 
         `Age group (years)` == "All Ages", 
         Race == "All races", 
         `Injury mechanism` != "All Mechanisms", 
         `Injury intent` == "All Intentions") %>%
  select(Year, `Injury mechanism`, Deaths)

poisoning_deaths <- deaths_filtered %>%
  filter(`Injury mechanism` == "Poisoning")

# Match the key column name used in the deaths data
cheese <- cheese %>% rename(Year = year)

# Keep only the years present in both datasets
combined_data <- cheese %>%
  inner_join(poisoning_deaths, by = "Year")

ggplot(combined_data, aes(x = cheddar, y = Deaths)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Cheddar Cheese Consumption vs Poisoning Deaths",
    x = "Cheddar Cheese Consumption (lbs per capita)",
    y = "Number of Poisoning Deaths"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Code
correlation <- cor(combined_data$cheddar, combined_data$Deaths, use = "complete.obs")
print(paste("Correlation between cheddar cheese consumption and poisoning deaths:", round(correlation, 2)))
[1] "Correlation between cheddar cheese consumption and poisoning deaths: 0.24"
Code
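# Correlation of each cheese type with poisoning deaths.
# No `use` argument is set here, so a column with missing values yields NA.
cor_cheddar <- cor(combined_data$cheddar, combined_data$Deaths)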
cor_swiss <- cor(combined_data$swiss, combined_data$Deaths)
cor_blue <- cor(combined_data$blue, combined_data$Deaths)
cor_brick <- cor(combined_data$brick, combined_data$Deaths)
cor_muenster <- cor(combined_data$muenster, combined_data$Deaths)
cor_neufchatel <- cor(combined_data$neufchatel, combined_data$Deaths)
cor_hispanic <- cor(combined_data$hispanic, combined_data$Deaths)

correlations <- c(
  cheddar = cor_cheddar,
  swiss = cor_swiss,
  blue = cor_blue,
  brick = cor_brick,
  muenster = cor_muenster,
  neufchatel = cor_neufchatel,
  hispanic = cor_hispanic
)

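# sort() drops NA values by default, so an NA correlation is omitted from the
# output (presumably why blue does not appear below)
correlations <- sort(correlations, decreasing = TRUE)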
correlations
  hispanic   muenster neufchatel    cheddar      swiss      brick 
 0.9463094  0.9049784  0.6777556  0.2358524 -0.2373612 -0.6331987 
Code
ggplot(combined_data, aes(x = hispanic, y = Deaths)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship between Hispanic Cheese Consumption and Poisoning Deaths",
    x = "Hispanic Cheese Consumption (lbs per capita)",
    y = "Number of Poisoning Deaths"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

B. Analysis & Reflection

Technical Implementation
The data preparation involved filtering the injury mortality dataset to totals across all sexes, ages, races, and intents for the poisoning injury mechanism, and then joining it with the cheese consumption dataset by year. To quantify each relationship, I used the cor() function in R to calculate the correlation coefficient between the number of poisoning deaths and the consumption of each type of cheese. The visualizations are scatter plots with regression lines, chosen to show how convincing a spurious correlation can look.
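
The per-cheese correlation step can also be written as a single call. The sketch below is a minimal illustration on a small made-up data frame; the `toy` name and all of its values are hypothetical and are not the assignment data. It assumes only the same layout as the real data: one numeric column per cheese type plus a `Deaths` column.

```r
# Hypothetical toy data standing in for combined_data (all values are made up).
toy <- data.frame(
  Year    = 2000:2005,
  cheddar = c(9.8, 9.9, 9.6, 9.4, 10.3, 10.1),
  blue    = c(NA, 0.2, 0.2, 0.3, 0.3, 0.3),  # one missing value, to show NA handling
  Deaths  = c(20, 22, 26, 29, 31, 33)
)

# Every column except the key and the outcome is treated as a cheese type.
cheese_cols <- setdiff(names(toy), c("Year", "Deaths"))

# Correlate each cheese column with Deaths. use = "complete.obs" drops the
# rows with a missing value for that pair instead of returning NA.
toy_cors <- sapply(
  toy[cheese_cols],
  function(x) cor(x, toy$Deaths, use = "complete.obs")
)

sort(toy_cors, decreasing = TRUE)
```

Without `use = "complete.obs"`, `cor()` returns `NA` for any column that contains a missing value, and `sort()` drops `NA` entries by default. That may be why `blue` is absent from the sorted output above.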

Critical Analysis
The correlation between cheddar cheese consumption and poisoning deaths is weak (r = 0.236), suggesting little linear relationship. In contrast, Hispanic cheese has a very strong correlation (r = 0.946), which might initially suggest a significant link. However, a strong correlation does not imply that cheese consumption causes poisoning deaths. The relationship is spurious: the observed correlation is likely due to other, unmeasured factors or to chance. Because correlation does not imply causation, the visualization could mislead viewers into thinking that cheese consumption is linked to poisoning deaths when there is no causal connection between them.
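
One way two unrelated series can end up strongly correlated is a shared trend over time. The sketch below uses simulated data, not the assignment data, and assumes for illustration only that both series simply rise with the year. The names `series_a` and `series_b` are hypothetical. Correlating year-to-year changes instead of raw levels removes the shared trend.

```r
set.seed(42)  # make the simulated data reproducible

year <- 2000:2019
n <- length(year)

# Two series generated independently of each other; each rises over time
# with its own random noise added.
series_a <- 0.5 * (year - 2000) + rnorm(n)
series_b <- 300 * (year - 2000) + rnorm(n, sd = 500)

# Correlation of the raw levels reflects the shared upward trend.
cor_levels <- cor(series_a, series_b)

# Correlation of year-to-year changes (first differences) removes the trend.
cor_changes <- cor(diff(series_a), diff(series_b))

c(levels = cor_levels, changes = cor_changes)
```

In a simulation like this, the correlation of the levels is typically high even though neither series influences the other, while the correlation of the changes is typically much closer to zero. Whether a shared time trend explains the cheese and poisoning result would need to be checked against the actual data.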

Final Thoughts

In conclusion, while the visualizations and correlation analysis initially suggest a connection between cheese consumption and poisoning deaths, this is an example of a spurious correlation. The variables are not causally related, and the correlation seen here is either coincidental or driven by unmeasured factors. This assignment emphasizes the critical distinction between correlation and causation, a concept that is vital for accurately interpreting data.

Generative AI Use

I used ChatGPT to draft the report structure, brainstorm visualization choices, and refine analysis language. Here is the transcript.

You said:

make me a graph using r with the title relationship betweeen hispanic cheese consumption and poisoning deaths. x axis hispanic cheese conumption, and y axis number of poisoning deaths.

ChatGPT said:

To create a graph in R with the title “Relationship Between Hispanic Cheese Consumption and Poisoning Deaths” and the specified axes, you can follow this R code:

# Example Data (Replace with actual data)
hispanic_cheese_consumption <- c(10, 20, 30, 40, 50)  # Example values for Hispanic cheese consumption
poisoning_deaths <- c(2, 4, 6, 8, 10)  # Example values for poisoning deaths

# Create the plot
plot(hispanic_cheese_consumption, poisoning_deaths,
     xlab = "Hispanic Cheese Consumption",
     ylab = "Number of Poisoning Deaths",
     main = "Relationship Between Hispanic Cheese Consumption and Poisoning Deaths",
     pch = 19,      # Plot points as solid circles
     col = "blue")  # Set the color of the points

In this example:

  • hispanic_cheese_consumption and poisoning_deaths are example data vectors. You can replace them with your actual data.

  • The plot() function is used to create a scatter plot.

  • xlab and ylab set the labels for the x and y axes.

  • main sets the title for the graph.

  • The points are plotted using pch = 19 (solid circles) and the color is set to blue.

You can modify the data values as per your dataset. Let me know if you need further assistance!