Code
library(tidyverse)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")library(tidyverse)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")colnames(cheese)[1] "year" "cheddar" "mozzarella" "swiss" "blue"
[6] "brick" "muenster" "neufchatel" "hispanic"
colnames(deaths) [1] "Year"
[2] "Sex"
[3] "Age group (years)"
[4] "Race"
[5] "Injury mechanism"
[6] "Injury intent"
[7] "Deaths"
[8] "Population"
[9] "Age Specific Rate"
[10] "Age Specific Rate Standard Error"
[11] "Age Specific Rate Lower Confidence Limit"
[12] "Age Specific Rate Upper Confidence Limit"
[13] "Age Adjusted Rate"
[14] "Age Adjusted Rate Standard Error"
[15] "Age Adjusted Rate Lower Confidence Limit"
[16] "Age Adjusted Rate Upper Confidence Limit"
[17] "Unit"
colnames(cheese)[1] "year" "cheddar" "mozzarella" "swiss" "blue"
[6] "brick" "muenster" "neufchatel" "hispanic"
head(cheese)# A tibble: 6 × 9
year cheddar mozzarella swiss blue brick muenster neufchatel hispanic
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995 9.04 7.89 1.09 0.16 0.04 0.41 2.04 NA
2 1996 9.19 8.22 1.07 0.17 0.04 0.39 2.11 0.25
3 1997 9.51 8.16 0.99 0.18 0.03 0.37 2.25 0.25
4 1998 9.6 8.33 1.01 NA 0.03 0.34 2.2 0.27
5 1999 10.0 8.74 1.09 NA 0.03 0.28 2.26 0.3
6 2000 9.87 9.05 1.02 NA 0.03 0.3 2.39 0.33
head(deaths)# A tibble: 6 × 17
Year Sex `Age group (years)` Race `Injury mechanism` `Injury intent`
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2016 Both sexes All Ages All r… All Mechanisms All Intentions
2 2015 Both sexes All Ages All r… All Mechanisms All Intentions
3 2014 Both sexes All Ages All r… All Mechanisms All Intentions
4 2013 Both sexes All Ages All r… All Mechanisms All Intentions
5 2012 Both sexes All Ages All r… All Mechanisms All Intentions
6 2011 Both sexes All Ages All r… All Mechanisms All Intentions
# ℹ 11 more variables: Deaths <dbl>, Population <dbl>,
# `Age Specific Rate` <dbl>, `Age Specific Rate Standard Error` <dbl>,
# `Age Specific Rate Lower Confidence Limit` <dbl>,
# `Age Specific Rate Upper Confidence Limit` <dbl>,
# `Age Adjusted Rate` <dbl>, `Age Adjusted Rate Standard Error` <dbl>,
# `Age Adjusted Rate Lower Confidence Limit` <dbl>,
# `Age Adjusted Rate Upper Confidence Limit` <dbl>, Unit <chr>
Correlation vs. Causation - Correlation is when two things are related, but one doesn’t necessarily cause the other. It just means that they move together in a similar pattern. Causation is when one thing directly causes another. Correlation shows a relationship between two points and causation shows a cause and effect relationship.
Chosen variables - The variables I’ve chosen for my analysis is “year”, and “cheddar”. To compare these variables i will use a scatter plot to visualize the relationship between these variables.
Purpose of the exercise - The purpose of this exercise is to evaluate the possible relationships between two variables that are seemingly unrelated and how there may be correlation or possibly causation.
First I need to filter the data to make sure I wont have redundant data.
deaths_clean <- deaths |>
filter(
Sex == "All Sexes",
`Age group (years)` == "All Ages",
Race == "All races",
`Injury intent` == "All Intentions",
`Injury mechanism` != "All Mechanisms"
) |>
select(Year, `Injury mechanism`, Deaths)I need to reshape the deaths data to pivot in long format for plotting the data
deaths_wide <- deaths_clean |>
rename(year = Year) |>
pivot_wider(names_from = `Injury mechanism`, values_from = Deaths)I need to also remove mozzerella from the list per the assignment directions.
cheese_filtered <- cheese |>
select(-mozzarella)merged death and cheese data
merged_data <- cheese_filtered |>
left_join(deaths_wide, by = "year")I now need to compute the correlation now that my data is merged.
cor_matrix <- merged_data |>
select(-year) |>
cor()Now i can plot my correlation matrix
library(ggcorrplot)Warning: package 'ggcorrplot' was built under R version 4.4.3
ggcorrplot(cor_matrix,
type = "lower",
lab = TRUE,
title = "Correlation Between Cheese Consumption and Mortality")before plotting I need to wrangle my data
deaths_filtered <- deaths |>
filter(Sex == "All Sexes",
`Age group (years)` == "All Ages",
Race == "All races",
`Injury mechanism` != "All Mechanisms",
`Injury intent` == "All Intentions") |>
select(Year, `Deaths`) |>
mutate(Deaths = as.numeric(Deaths))names(merged_data)[1] "year" "cheddar" "swiss" "blue" "brick"
[6] "muenster" "neufchatel" "hispanic"
ggplot(merged_data, aes(x = factor(`year`), y = neufchatel)) +
geom_point(color = "magenta", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "solid") +
labs(title = "Neufchatel Consumption vs Year",
x = "Year",
y = "Neufchatel Consumption") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))`geom_smooth()` using formula = 'y ~ x'
filtered my data to clean it to make sure there wasn’t redundancy
Had to reshape the deaths data
As per the assignment directions I had to filter mozzarella out of the dataset
Merged the cheese and death data
After merging the data I needed to compute the correlation and visualize which data point is closest to a correlation of 0.8
Needed to wrangle my data before plotting
Now that my data was ready for plotting I plotted a scatterplot to show the correlation between my two points of choice.
I chose a scatter plot for my visualization as in chapter 8 of the text book it states that a scatter plot is the better option to visualize a relationship between two data points. “It’s much easier to see the relationship between two variables in a parallel coordinates plot with just two axes (similar to a slope chart). An alternative visual approach is the scatter plot.” - Chapter 8
After prepping my data I discovered that neufchatel cheese has the strongest correlation at 0.79
Having a strong visual trend in my visualization may cause some false assumptions that more cheese consumption leads to higher injury death rates.
There is a spurious relationship since there is no actual logical link to cheese consumption and injury deaths. The trend could be due to underlying factors or just coincidence.
The assignment may be about demonstrating how a spurious correlation can be visually misleading
which can potentially be harmful and influence fake correlations causing even more misinformation.
- used AI to assist me in creating an outline for a scatter plot showing Spurious Correlation. https://chatgpt.com/share/67d4ee62-6af8-8002-8865-41132854c6ef
- found out I can use ggcorrplot to plot a heat map to better show correlation.
https://cran.r-project.org/web/packages/ggcorrplot/readme/README.html