Assignment Objective: You are an analyst for Initech Analytica (IA), a nefarious policy research company. IA has been hired by the Oat Milk Advocacy Group to create a misinformation campaign against the dairy industry. In the notes for this unit, I gave an example of lying with scatter plots using a spurious correlation between mozzarella and deaths by poisoning. Your assignment is to create a similar chart selecting a different variety of cheese (i.e., you may not use mozzarella).

Datasets: I loaded both datasets from Professor Suleiman’s website, which were called “cheese” and “injury mortality rates for the US.” I then used glimpse() to take a look at the variables in the two datasets.
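
The sketch below shows roughly how this loading step can be done; the file names are placeholders, since the actual locations of the files on Professor Suleiman’s website are not shown here.

```r
library(tidyverse)

# Placeholder file names; substitute the actual paths/URLs from the course website.
cheese <- read_csv("cheese.csv")
deaths <- read_csv("injury_mortality_rates_us.csv")

glimpse(cheese)
glimpse(deaths)
```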

## Rows: 24
## Columns: 9
## $ year       <dbl> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,…
## $ cheddar    <dbl> 9.04, 9.19, 9.51, 9.60, 10.01, 9.87, 9.89, 9.76, 9.38, 10.2…
## $ mozzarella <dbl> 7.89, 8.22, 8.16, 8.33, 8.74, 9.05, 9.35, 9.38, 9.45, 9.68,…
## $ swiss      <dbl> 1.09, 1.07, 0.99, 1.01, 1.09, 1.02, 1.12, 1.09, 1.13, 1.20,…
## $ blue       <dbl> 0.16, 0.17, 0.18, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ brick      <dbl> 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,…
## $ muenster   <dbl> 0.41, 0.39, 0.37, 0.34, 0.28, 0.30, 0.28, 0.28, 0.27, 0.25,…
## $ neufchatel <dbl> 2.04, 2.11, 2.25, 2.20, 2.26, 2.39, 2.21, 2.33, 2.30, 2.34,…
## $ hispanic   <dbl> NA, 0.25, 0.25, 0.27, 0.30, 0.33, 0.37, 0.42, 0.45, 0.48, 0…
## Rows: 98,280
## Columns: 17
## $ Year                                       <dbl> 2016, 2015, 2014, 2013, 201…
## $ Sex                                        <chr> "Both sexes", "Both sexes",…
## $ `Age group (years)`                        <chr> "All Ages", "All Ages", "Al…
## $ Race                                       <chr> "All races", "All races", "…
## $ `Injury mechanism`                         <chr> "All Mechanisms", "All Mech…
## $ `Injury intent`                            <chr> "All Intentions", "All Inte…
## $ Deaths                                     <dbl> 231991, 214008, 199752, 192…
## $ Population                                 <dbl> 323127513, 321418820, 31885…
## $ `Age Specific Rate`                        <dbl> 71.795496, 66.582287, 62.64…
## $ `Age Specific Rate Standard Error`         <dbl> 0.1490602, 0.1439275, 0.140…
## $ `Age Specific Rate Lower Confidence Limit` <dbl> 71.503338, 66.300189, 62.37…
## $ `Age Specific Rate Upper Confidence Limit` <dbl> 72.087654, 66.864384, 62.92…
## $ `Age Adjusted Rate`                        <dbl> 68.98224, 63.86611, 60.1272…
## $ `Age Adjusted Rate Standard Error`         <dbl> 0.1460097, 0.1405070, 0.136…
## $ `Age Adjusted Rate Lower Confidence Limit` <dbl> 68.69606, 63.59072, 59.8592…
## $ `Age Adjusted Rate Upper Confidence Limit` <dbl> 69.26842, 64.14151, 60.3952…
## $ Unit                                       <chr> "per 100,000 population", "…

I simplified the deaths dataset by filtering it down to the rows that aggregate across the other grouping variables (the “All Mechanisms” injury mechanism, “All Ages” age group, “Both sexes”, and “All races” levels), so that only the injury intent varies from row to row. After filtering, I selected just the year, the injury intent, and the death count, and then pivoted the subset from long to wide format, with one column per injury intent, so it is easier to read. Next I created a new dataset called “death by cheese” by taking the muenster column from the cheese dataset and inner joining it with the deaths subset by the year variable. Finally, I used this joined dataset to compute a correlation matrix between muenster consumption and the death counts for each injury intent.
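
A minimal sketch of this wrangling, assuming the cheese and deaths objects from the loading step; the level labels are taken from the glimpse() output above, and the names deaths_wide and death_by_cheese are my own.

```r
deaths_wide <- deaths %>%
  filter(
    `Injury mechanism`  == "All Mechanisms",
    `Age group (years)` == "All Ages",
    Sex                 == "Both sexes",
    Race                == "All races",
    `Injury intent`     != "All Intentions"   # keep only the individual intents
  ) %>%
  select(Year, `Injury intent`, Deaths) %>%
  pivot_wider(names_from = `Injury intent`, values_from = Deaths)

# Join annual muenster consumption to the death counts by year,
# then correlate muenster with each injury intent.
death_by_cheese <- cheese %>%
  select(year, muenster) %>%
  inner_join(deaths_wide, by = c("year" = "Year"))

cor(select(death_by_cheese, -year))
```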

##                          muenster Unintentional    Suicide    Homicide
## muenster                1.0000000    0.86434680  0.9555510 -0.25343850
## Unintentional           0.8643468    1.00000000  0.9304585  0.02860829
## Suicide                 0.9555510    0.93045853  1.0000000 -0.23414469
## Homicide               -0.2534385    0.02860829 -0.2341447  1.00000000
## Undetermined            0.2930508    0.65625952  0.4640975  0.28688435
## Legal intervention/war  0.8720061    0.83474497  0.8602972 -0.11764121
##                        Undetermined Legal intervention/war
## muenster                  0.2930508              0.8720061
## Unintentional             0.6562595              0.8347450
## Suicide                   0.4640975              0.8602972
## Homicide                  0.2868843             -0.1176412
## Undetermined              1.0000000              0.3248518
## Legal intervention/war    0.3248518              1.0000000

Scatterplot: The correlation matrix showed several injury intents with correlations above 0.80 (highly correlated) with muenster consumption, any of which could be used for this chart. From this matrix, I chose the “Suicide” injury intent because it had the highest correlation, at 0.956. Below is the scatter plot (including a misleading title and trend line) that shows this correlated relationship.
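
A sketch of how such a chart can be drawn from the joined data, assuming the death_by_cheese object from the previous step; the misleading title and the axis labels are my own wording.

```r
ggplot(death_by_cheese, aes(x = muenster, y = Suicide)) +
  geom_point() +
  geom_smooth(method = "lm") +   # linear trend line makes the relationship look causal
  labs(
    title = "Muenster cheese consumption drives deaths by suicide",
    x = "Muenster cheese consumption",
    y = "Deaths by suicide"
  )
```

The `geom_smooth()` message below is the note ggplot2 prints when the trend line is fitted without an explicit formula.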

## `geom_smooth()` using formula 'y ~ x'