Assignment 3

Author

Brady Heath

Code

library(tidyverse)
cheese <- read_csv("https://jsuleiman.com/datasets/cheese.csv")
deaths <- read_csv("https://jsuleiman.com/datasets/Injury_Mortality__United_States.csv")

Code

colnames(cheese)

[1] "year"       "cheddar"    "mozzarella" "swiss"      "blue"      
[6] "brick"      "muenster"   "neufchatel" "hispanic"

Code

colnames(deaths)

 [1] "Year"                                    
 [2] "Sex"                                     
 [3] "Age group (years)"                       
 [4] "Race"                                    
 [5] "Injury mechanism"                        
 [6] "Injury intent"                           
 [7] "Deaths"                                  
 [8] "Population"                              
 [9] "Age Specific Rate"                       
[10] "Age Specific Rate Standard Error"        
[11] "Age Specific Rate Lower Confidence Limit"
[12] "Age Specific Rate Upper Confidence Limit"
[13] "Age Adjusted Rate"                       
[14] "Age Adjusted Rate Standard Error"        
[15] "Age Adjusted Rate Lower Confidence Limit"
[16] "Age Adjusted Rate Upper Confidence Limit"
[17] "Unit"

Code

colnames(cheese)

[1] "year"       "cheddar"    "mozzarella" "swiss"      "blue"      
[6] "brick"      "muenster"   "neufchatel" "hispanic"

Code

head(cheese)

# A tibble: 6 × 9
   year cheddar mozzarella swiss  blue brick muenster neufchatel hispanic
  <dbl>   <dbl>      <dbl> <dbl> <dbl> <dbl>    <dbl>      <dbl>    <dbl>
1  1995    9.04       7.89  1.09  0.16  0.04     0.41       2.04    NA   
2  1996    9.19       8.22  1.07  0.17  0.04     0.39       2.11     0.25
3  1997    9.51       8.16  0.99  0.18  0.03     0.37       2.25     0.25
4  1998    9.6        8.33  1.01 NA     0.03     0.34       2.2      0.27
5  1999   10.0        8.74  1.09 NA     0.03     0.28       2.26     0.3 
6  2000    9.87       9.05  1.02 NA     0.03     0.3        2.39     0.33

Code

head(deaths)

# A tibble: 6 × 17
   Year Sex        `Age group (years)` Race   `Injury mechanism` `Injury intent`
  <dbl> <chr>      <chr>               <chr>  <chr>              <chr>          
1  2016 Both sexes All Ages            All r… All Mechanisms     All Intentions 
2  2015 Both sexes All Ages            All r… All Mechanisms     All Intentions 
3  2014 Both sexes All Ages            All r… All Mechanisms     All Intentions 
4  2013 Both sexes All Ages            All r… All Mechanisms     All Intentions 
5  2012 Both sexes All Ages            All r… All Mechanisms     All Intentions 
6  2011 Both sexes All Ages            All r… All Mechanisms     All Intentions 
# ℹ 11 more variables: Deaths <dbl>, Population <dbl>,
#   `Age Specific Rate` <dbl>, `Age Specific Rate Standard Error` <dbl>,
#   `Age Specific Rate Lower Confidence Limit` <dbl>,
#   `Age Specific Rate Upper Confidence Limit` <dbl>,
#   `Age Adjusted Rate` <dbl>, `Age Adjusted Rate Standard Error` <dbl>,
#   `Age Adjusted Rate Lower Confidence Limit` <dbl>,
#   `Age Adjusted Rate Upper Confidence Limit` <dbl>, Unit <chr>

Introduction

Correlation vs. Causation - Correlation is when two things are related, but one doesn’t necessarily cause the other. It just means that they move together in a similar pattern. Causation is when one thing directly causes another. Correlation shows a relationship between two points and causation shows a cause and effect relationship.

Chosen variables - The variables I’ve chosen for my analysis is “year”, and “cheddar”. To compare these variables i will use a scatter plot to visualize the relationship between these variables.

Purpose of the exercise - The purpose of this exercise is to evaluate the possible relationships between two variables that are seemingly unrelated and how there may be correlation or possibly causation.

Relationship Visualization

First I need to filter the data to make sure I wont have redundant data.

Code

deaths_clean <- deaths |>
  filter(
    Sex == "All Sexes",
    `Age group (years)` == "All Ages",
    Race == "All races",
    `Injury intent` == "All Intentions",
    `Injury mechanism` != "All Mechanisms"
  ) |>
  select(Year, `Injury mechanism`, Deaths)

I need to reshape the deaths data to pivot in long format for plotting the data

Code

deaths_wide <- deaths_clean |>
  rename(year = Year) |>
  pivot_wider(names_from = `Injury mechanism`, values_from = Deaths)

I need to also remove mozzerella from the list per the assignment directions.

Code

cheese_filtered <- cheese |>
  select(-mozzarella)

merged death and cheese data

Code

merged_data <- cheese_filtered |>
  left_join(deaths_wide, by = "year")

I now need to compute the correlation now that my data is merged.

Code

cor_matrix <- merged_data |>
  select(-year) |>
  cor()

Now i can plot my correlation matrix

Code

library(ggcorrplot)

Warning: package 'ggcorrplot' was built under R version 4.4.3

Code

ggcorrplot(cor_matrix,
           type = "lower",
           lab = TRUE,
           title = "Correlation Between Cheese Consumption and Mortality")

before plotting I need to wrangle my data

Code

deaths_filtered <- deaths |> 
  filter(Sex == "All Sexes", 
         `Age group (years)` == "All Ages", 
         Race == "All races", 
         `Injury mechanism` != "All Mechanisms", 
         `Injury intent` == "All Intentions") |> 
  select(Year, `Deaths`) |> 
  mutate(Deaths = as.numeric(Deaths))

Code

names(merged_data)

[1] "year"       "cheddar"    "swiss"      "blue"       "brick"     
[6] "muenster"   "neufchatel" "hispanic"

Code

ggplot(merged_data, aes(x = factor(`year`), y = neufchatel)) +
  geom_point(color = "magenta", alpha = 0.7) +  
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "solid") +
  labs(title = "Neufchatel Consumption vs Year",
       x = "Year",
       y = "Neufchatel Consumption") +
  theme_minimal() +                        
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

`geom_smooth()` using formula = 'y ~ x'

Data Preparation

filtered my data to clean it to make sure there wasn’t redundancy
Had to reshape the deaths data
As per the assignment directions I had to filter mozzarella out of the dataset
Merged the cheese and death data
After merging the data I needed to compute the correlation and visualize which data point is closest to a correlation of 0.8
Needed to wrangle my data before plotting
Now that my data was ready for plotting I plotted a scatterplot to show the correlation between my two points of choice.

Visualization Choice

I chose a scatter plot for my visualization as in chapter 8 of the text book it states that a scatter plot is the better option to visualize a relationship between two data points. “It’s much easier to see the relationship between two variables in a parallel coordinates plot with just two axes (similar to a slope chart). An alternative visual approach is the scatter plot.” - Chapter 8

How I Achieved the Correlation

After prepping my data I discovered that neufchatel cheese has the strongest correlation at 0.79

How the visualization could mislead viewers

Having a strong visual trend in my visualization may cause some false assumptions that more cheese consumption leads to higher injury death rates.

Spurious Relationship

There is a spurious relationship since there is no actual logical link to cheese consumption and injury deaths. The trend could be due to underlying factors or just coincidence.

Ethical Implications

The assignment may be about demonstrating how a spurious correlation can be visually misleading

which can potentially be harmful and influence fake correlations causing even more misinformation.

AI or Other Assistance Citation:

- used AI to assist me in creating an outline for a scatter plot showing Spurious Correlation. https://chatgpt.com/share/67d4ee62-6af8-8002-8865-41132854c6ef

- found out I can use ggcorrplot to plot a heat map to better show correlation.

https://cran.r-project.org/web/packages/ggcorrplot/readme/README.html