Assignment 3

Author

Brady Heath

Code
library(tidyverse)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")
Code
colnames(cheese)
[1] "year"       "cheddar"    "mozzarella" "swiss"      "blue"      
[6] "brick"      "muenster"   "neufchatel" "hispanic"  
Code
colnames(deaths)
 [1] "Year"                                    
 [2] "Sex"                                     
 [3] "Age group (years)"                       
 [4] "Race"                                    
 [5] "Injury mechanism"                        
 [6] "Injury intent"                           
 [7] "Deaths"                                  
 [8] "Population"                              
 [9] "Age Specific Rate"                       
[10] "Age Specific Rate Standard Error"        
[11] "Age Specific Rate Lower Confidence Limit"
[12] "Age Specific Rate Upper Confidence Limit"
[13] "Age Adjusted Rate"                       
[14] "Age Adjusted Rate Standard Error"        
[15] "Age Adjusted Rate Lower Confidence Limit"
[16] "Age Adjusted Rate Upper Confidence Limit"
[17] "Unit"                                    
Code
colnames(cheese)
[1] "year"       "cheddar"    "mozzarella" "swiss"      "blue"      
[6] "brick"      "muenster"   "neufchatel" "hispanic"  
Code
head(cheese)
# A tibble: 6 × 9
   year cheddar mozzarella swiss  blue brick muenster neufchatel hispanic
  <dbl>   <dbl>      <dbl> <dbl> <dbl> <dbl>    <dbl>      <dbl>    <dbl>
1  1995    9.04       7.89  1.09  0.16  0.04     0.41       2.04    NA   
2  1996    9.19       8.22  1.07  0.17  0.04     0.39       2.11     0.25
3  1997    9.51       8.16  0.99  0.18  0.03     0.37       2.25     0.25
4  1998    9.6        8.33  1.01 NA     0.03     0.34       2.2      0.27
5  1999   10.0        8.74  1.09 NA     0.03     0.28       2.26     0.3 
6  2000    9.87       9.05  1.02 NA     0.03     0.3        2.39     0.33
Code
head(deaths)
# A tibble: 6 × 17
   Year Sex        `Age group (years)` Race   `Injury mechanism` `Injury intent`
  <dbl> <chr>      <chr>               <chr>  <chr>              <chr>          
1  2016 Both sexes All Ages            All r… All Mechanisms     All Intentions 
2  2015 Both sexes All Ages            All r… All Mechanisms     All Intentions 
3  2014 Both sexes All Ages            All r… All Mechanisms     All Intentions 
4  2013 Both sexes All Ages            All r… All Mechanisms     All Intentions 
5  2012 Both sexes All Ages            All r… All Mechanisms     All Intentions 
6  2011 Both sexes All Ages            All r… All Mechanisms     All Intentions 
# ℹ 11 more variables: Deaths <dbl>, Population <dbl>,
#   `Age Specific Rate` <dbl>, `Age Specific Rate Standard Error` <dbl>,
#   `Age Specific Rate Lower Confidence Limit` <dbl>,
#   `Age Specific Rate Upper Confidence Limit` <dbl>,
#   `Age Adjusted Rate` <dbl>, `Age Adjusted Rate Standard Error` <dbl>,
#   `Age Adjusted Rate Lower Confidence Limit` <dbl>,
#   `Age Adjusted Rate Upper Confidence Limit` <dbl>, Unit <chr>

Introduction

Correlation vs. Causation - Correlation is when two things are related, but one doesn’t necessarily cause the other. It just means that they move together in a similar pattern. Causation is when one thing directly causes another. Correlation shows a relationship between two points and causation shows a cause and effect relationship.

Chosen variables - The variables I’ve chosen for my analysis is “year”, and “cheddar”. To compare these variables i will use a scatter plot to visualize the relationship between these variables.

Purpose of the exercise - The purpose of this exercise is to evaluate the possible relationships between two variables that are seemingly unrelated and how there may be correlation or possibly causation.

Relationship Visualization

First I need to filter the data to make sure I wont have redundant data.

Code
deaths_clean <- deaths |>
  filter(
    Sex == "All Sexes",
    `Age group (years)` == "All Ages",
    Race == "All races",
    `Injury intent` == "All Intentions",
    `Injury mechanism` != "All Mechanisms"
  ) |>
  select(Year, `Injury mechanism`, Deaths)

I need to reshape the deaths data to pivot in long format for plotting the data

Code
deaths_wide <- deaths_clean |>
  rename(year = Year) |>
  pivot_wider(names_from = `Injury mechanism`, values_from = Deaths)

I need to also remove mozzerella from the list per the assignment directions.

Code
cheese_filtered <- cheese |>
  select(-mozzarella)

merged death and cheese data

Code
merged_data <- cheese_filtered |>
  left_join(deaths_wide, by = "year")

I now need to compute the correlation now that my data is merged.

Code
cor_matrix <- merged_data |>
  select(-year) |>
  cor()

Now i can plot my correlation matrix

Code
library(ggcorrplot)
Warning: package 'ggcorrplot' was built under R version 4.4.3
Code
ggcorrplot(cor_matrix,
           type = "lower",
           lab = TRUE,
           title = "Correlation Between Cheese Consumption and Mortality")

before plotting I need to wrangle my data

Code
deaths_filtered <- deaths |> 
  filter(Sex == "All Sexes", 
         `Age group (years)` == "All Ages", 
         Race == "All races", 
         `Injury mechanism` != "All Mechanisms", 
         `Injury intent` == "All Intentions") |> 
  select(Year, `Deaths`) |> 
  mutate(Deaths = as.numeric(Deaths))
Code
names(merged_data)
[1] "year"       "cheddar"    "swiss"      "blue"       "brick"     
[6] "muenster"   "neufchatel" "hispanic"  
Code
ggplot(merged_data, aes(x = factor(`year`), y = neufchatel)) +
  geom_point(color = "magenta", alpha = 0.7) +  
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "solid") +
  labs(title = "Neufchatel Consumption vs Year",
       x = "Year",
       y = "Neufchatel Consumption") +
  theme_minimal() +                        
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
`geom_smooth()` using formula = 'y ~ x'

Data Preparation

  1. filtered my data to clean it to make sure there wasn’t redundancy

  2. Had to reshape the deaths data

  3. As per the assignment directions I had to filter mozzarella out of the dataset

  4. Merged the cheese and death data

  5. After merging the data I needed to compute the correlation and visualize which data point is closest to a correlation of 0.8

  6. Needed to wrangle my data before plotting

  7. Now that my data was ready for plotting I plotted a scatterplot to show the correlation between my two points of choice.

Visualization Choice

I chose a scatter plot for my visualization as in chapter 8 of the text book it states that a scatter plot is the better option to visualize a relationship between two data points. “It’s much easier to see the relationship between two variables in a parallel coordinates plot with just two axes (similar to a slope chart). An alternative visual approach is the scatter plot.” - Chapter 8

How I Achieved the Correlation

After prepping my data I discovered that neufchatel cheese has the strongest correlation at 0.79

How the visualization could mislead viewers

Having a strong visual trend in my visualization may cause some false assumptions that more cheese consumption leads to higher injury death rates.

Spurious Relationship

There is a spurious relationship since there is no actual logical link to cheese consumption and injury deaths. The trend could be due to underlying factors or just coincidence.

Ethical Implications

The assignment may be about demonstrating how a spurious correlation can be visually misleading

which can potentially be harmful and influence fake correlations causing even more misinformation.

AI or Other Assistance Citation:


- used AI to assist me in creating an outline for a scatter plot showing Spurious Correlation. https://chatgpt.com/share/67d4ee62-6af8-8002-8865-41132854c6ef

- found out I can use ggcorrplot to plot a heat map to better show correlation.

https://cran.r-project.org/web/packages/ggcorrplot/readme/README.html