Assignment 1: Lying with Scatterplots

Death and Cheese Data Import

To understand the relationship between muenster cheese consumption and death intents, I imported the death data and cheese data using read_csv. I then used the glimpse() command to look at the cheese data and death data and see the variables present in each data set. Below is the output of the glimpse commands:

library(tidyverse)
cheese <- read_csv("http://jamessuleiman.com/teaching/datasets/cheese.csv") 
deaths <- read_csv("http://jamessuleiman.com/teaching/datasets/Injury_Mortality__United_States.csv")
glimpse(cheese)

## Rows: 24
## Columns: 9
## $ year       <dbl> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004…
## $ cheddar    <dbl> 9.04, 9.19, 9.51, 9.60, 10.01, 9.87, 9.89, 9.76, 9.38, 10.…
## $ mozzarella <dbl> 7.89, 8.22, 8.16, 8.33, 8.74, 9.05, 9.35, 9.38, 9.45, 9.68…
## $ swiss      <dbl> 1.09, 1.07, 0.99, 1.01, 1.09, 1.02, 1.12, 1.09, 1.13, 1.20…
## $ blue       <dbl> 0.16, 0.17, 0.18, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ brick      <dbl> 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03…
## $ muenster   <dbl> 0.41, 0.39, 0.37, 0.34, 0.28, 0.30, 0.28, 0.28, 0.27, 0.25…
## $ neufchatel <dbl> 2.04, 2.11, 2.25, 2.20, 2.26, 2.39, 2.21, 2.33, 2.30, 2.34…
## $ hispanic   <dbl> NA, 0.25, 0.25, 0.27, 0.30, 0.33, 0.37, 0.42, 0.45, 0.48, …

glimpse(deaths)

## Rows: 98,280
## Columns: 17
## $ Year                                       <dbl> 2016, 2015, 2014, 2013, 20…
## $ Sex                                        <chr> "Both sexes", "Both sexes"…
## $ `Age group (years)`                        <chr> "All Ages", "All Ages", "A…
## $ Race                                       <chr> "All races", "All races", …
## $ `Injury mechanism`                         <chr> "All Mechanisms", "All Mec…
## $ `Injury intent`                            <chr> "All Intentions", "All Int…
## $ Deaths                                     <dbl> 231991, 214008, 199752, 19…
## $ Population                                 <dbl> 323127513, 321418820, 3188…
## $ `Age Specific Rate`                        <dbl> 71.795496, 66.582287, 62.6…
## $ `Age Specific Rate Standard Error`         <dbl> 0.1490602, 0.1439275, 0.14…
## $ `Age Specific Rate Lower Confidence Limit` <dbl> 71.503338, 66.300189, 62.3…
## $ `Age Specific Rate Upper Confidence Limit` <dbl> 72.087654, 66.864384, 62.9…
## $ `Age Adjusted Rate`                        <dbl> 68.98224, 63.86611, 60.127…
## $ `Age Adjusted Rate Standard Error`         <dbl> 0.1460097, 0.1405070, 0.13…
## $ `Age Adjusted Rate Lower Confidence Limit` <dbl> 68.69606, 63.59072, 59.859…
## $ `Age Adjusted Rate Upper Confidence Limit` <dbl> 69.26842, 64.14151, 60.395…
## $ Unit                                       <chr> "per 100,000 population", …

Filter, Join and Correlate

Since I’m only interested in year, death count, death intent, and muenster cheese consumption, I will filter out the other variables in the deaths data frame. Then I pivot the data into a wide format and do an inner_join with the cheese data (filtering for muenster). Finally, I run a correlation matrix on the data frame. Below we see the highest correlation value for muenster consumption and suicide (death intent) at 0.95.

Note: I did have issues calculating the correlation of muenster cheese and death intent because each injury mechanism has a death intent, so each death intent variable had multiple values. I would get an error when I would run the correlation matrix because I did not have unique values. To fix this, I took the sum of all the injury intents (for all injury mechanisms) for a given year and then calculated the correlation matrix.

deaths_2 <- deaths %>% filter(`Injury mechanism` != "All Mechanisms" & `Age group (years)`== "All Ages" & `Injury intent` != "All Intentions" &
                    Race == "All races" &
                    Sex == "Both sexes") %>% select(Year, 'Injury intent', Deaths) %>% rename(year = Year, death_intent = 'Injury intent', death_count = Deaths) %>% 
  pivot_wider(names_from = death_intent, values_from = death_count, values_fn = sum) %>% drop_na()
  
death_by_muenster <- cheese %>% select(year, muenster) %>% inner_join(deaths_2, by = 'year')
death_by_muenster %>% select(-year) %>% cor()

##                          muenster Unintentional    Suicide    Homicide
## muenster                1.0000000    0.89182471  0.9555510 -0.25343850
## Unintentional           0.8918247    1.00000000  0.9541055 -0.03848175
## Suicide                 0.9555510    0.95410552  1.0000000 -0.23414469
## Homicide               -0.2534385   -0.03848175 -0.2341447  1.00000000
## Undetermined            0.2930508    0.64639007  0.4640975  0.28688435
## Legal intervention/war  0.8720061    0.82498442  0.8602972 -0.11764121
##                        Undetermined Legal intervention/war
## muenster                  0.2930508              0.8720061
## Unintentional             0.6463901              0.8249844
## Suicide                   0.4640975              0.8602972
## Homicide                  0.2868843             -0.1176412
## Undetermined              1.0000000              0.3248518
## Legal intervention/war    0.3248518              1.0000000

Lying with the Scatterplot

To imply causation between muenster cheese consumption and death by suicide, I use the ggplot() function and graph the muenster cheese consumption in pounds by annual deaths with suicide intent. I drive home my false message (that increased muenster cheese consumption increases death by suicide) using the geom_smooth function to fit multiple lines to the scatterplot. Finally, I create a title that suggests muenster cheese consumption increases the risk of death by suicide.

death_by_muenster %>% ggplot(aes(x = Suicide, y = muenster)) + geom_point() + geom_smooth() + xlab("annual deaths with suicide intent") + ylab("per capita muenster cheese consumption in pounds") + ggtitle("Consuming muenster cheese increases risk of death by suicide") + theme_minimal()

Assignment 1: Lying with Scatterplots

Jess DiBiase

01/30/2021

Death and Cheese Data Import

Filter, Join and Correlate

Lying with the Scatterplot