Data Preparation

# load data
library(tidyverse)
library(readr)
library(psych)
library(corrgram)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

What is the deathrate based off disease, race, and age

Cases

What are the cases, and how many are there?

708 total causes - showing year, leading cause of death, ethnicity, and death/rates

Data collection

Describe the method of data collection. Data was gathered from NYC Open Data

Type of study

What type of study is this (observational/experiment)? Obervational

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam

leadingCauses <- read_csv("New_York_City_Leading_Causes_of_Death.csv")
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   `Leading Cause` = col_character(),
##   Sex = col_character(),
##   `Race Ethnicity` = col_character(),
##   Deaths = col_character(),
##   `Death Rate` = col_character(),
##   `Age Adjusted Death Rate` = col_character()
## )
head(leadingCauses, 10)
## # A tibble: 10 x 7
##     Year `Leading Cause`        Sex   `Race Ethnicity` Deaths `Death Rate`
##    <int> <chr>                  <chr> <chr>            <chr>  <chr>       
##  1  2010 Assault (Homicide: Y8~ M     Black Non-Hispa~ 299    35.1        
##  2  2011 Mental and Behavioral~ M     Not Stated/Unkn~ 5      .           
##  3  2011 Diseases of Heart (I0~ M     Black Non-Hispa~ 1840   215.7       
##  4  2008 Certain Conditions or~ F     Other Race/ Eth~ .      .           
##  5  2014 Accidents Except Drug~ F     Hispanic         64     5.1         
##  6  2007 Intentional Self-Harm~ M     Not Stated/Unkn~ 5      .           
##  7  2012 Accidents Except Drug~ M     Black Non-Hispa~ 152    17.8        
##  8  2009 All Other Causes       M     Asian and Pacif~ 220    43.1        
##  9  2013 Diseases of Heart (I0~ F     Asian and Pacif~ 437    72.8        
## 10  2014 Accidents Except Drug~ M     Other Race/ Eth~ 12     .           
## # ... with 1 more variable: `Age Adjusted Death Rate` <chr>

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The dependent variable is Death rates. The death rate is quantitative data.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The indepedent variables are disease, race, and age.

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

leadingCauses$Deaths[leadingCauses$Deaths=="."] <- NA
leadingCauses$`Death Rate`[leadingCauses$`Death Rate`=="."] <- NA
leadingCauses$`Age Adjusted Death Rate`[leadingCauses$`Age Adjusted Death Rate`=="."] <- NA
leadingCauses <- na.omit(leadingCauses)


leadingCauses$Deaths <- as.numeric(leadingCauses$Deaths)
leadingCauses$`Death Rate` <- as.numeric(leadingCauses$`Death Rate`)
leadingCauses$`Age Adjusted Death Rate` <- as.numeric(leadingCauses$`Age Adjusted Death Rate`)
summary(leadingCauses)
##       Year      Leading Cause          Sex            Race Ethnicity    
##  Min.   :2007   Length:708         Length:708         Length:708        
##  1st Qu.:2008   Class :character   Class :character   Class :character  
##  Median :2010   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2010                                                           
##  3rd Qu.:2012                                                           
##  Max.   :2014                                                           
##      Deaths         Death Rate     Age Adjusted Death Rate
##  Min.   :  13.0   Min.   :  2.40   Min.   :  2.50         
##  1st Qu.: 107.0   1st Qu.: 11.60   1st Qu.: 12.15         
##  Median : 209.5   Median : 18.35   Median : 20.35         
##  Mean   : 591.5   Mean   : 53.44   Mean   : 53.46         
##  3rd Qu.: 501.2   3rd Qu.: 64.62   3rd Qu.: 77.55         
##  Max.   :7050.0   Max.   :491.40   Max.   :350.70
describe(leadingCauses)
##                         vars   n    mean     sd  median trimmed    mad
## Year                       1 708 2010.48   2.30 2010.00 2010.48   2.97
## Leading Cause*             2 708     NaN     NA      NA     NaN     NA
## Sex*                       3 708     NaN     NA      NA     NaN     NA
## Race Ethnicity*            4 708     NaN     NA      NA     NaN     NA
## Deaths                     5 708  591.47 981.18  209.50  356.16 199.41
## Death Rate                 6 708   53.44  76.52   18.35   35.60  15.05
## Age Adjusted Death Rate    7 708   53.46  69.10   20.35   39.14  17.57
##                            min    max  range skew kurtosis    se
## Year                    2007.0 2014.0    7.0 0.01    -1.25  0.09
## Leading Cause*             Inf   -Inf   -Inf   NA       NA    NA
## Sex*                       Inf   -Inf   -Inf   NA       NA    NA
## Race Ethnicity*            Inf   -Inf   -Inf   NA       NA    NA
## Deaths                    13.0 7050.0 7037.0 3.15    11.71 36.88
## Death Rate                 2.4  491.4  489.0 2.48     6.84  2.88
## Age Adjusted Death Rate    2.5  350.7  348.2 1.81     2.68  2.60
ggplot(data = leadingCauses, aes(leadingCauses$Year, leadingCauses$Deaths)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill = Year))

ggplot(data = leadingCauses, aes(leadingCauses$`Race Ethnicity`, leadingCauses$Deaths)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill = `Race Ethnicity`))

leadingCauses %>%
  group_by(leadingCauses$`Race Ethnicity`) %>%
  summarize(mean = mean(Deaths), sd = sd(Deaths), median = median(Deaths))
## # A tibble: 4 x 4
##   `leadingCauses$\`Race Ethnicity\``  mean    sd median
##   <chr>                              <dbl> <dbl>  <dbl>
## 1 Asian and Pacific Islander          149.  176.    60 
## 2 Black Non-Hispanic                  624.  715.   256.
## 3 Hispanic                            423.  458.   186 
## 4 White Non-Hispanic                 1173. 1601.   330.
ggplot(data = leadingCauses, aes(leadingCauses$`Leading Cause`, leadingCauses$Deaths)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill = `Leading Cause`)) +
  theme(legend.position="none")

nums <- unlist(lapply(leadingCauses, is.numeric)) 
nums1 <- leadingCauses[ , nums]
cor(nums1)
##                                 Year       Deaths   Death Rate
## Year                     1.000000000 -0.001807682 -0.003462548
## Deaths                  -0.001807682  1.000000000  0.972060559
## Death Rate              -0.003462548  0.972060559  1.000000000
## Age Adjusted Death Rate -0.028267102  0.826825203  0.919520032
##                         Age Adjusted Death Rate
## Year                                 -0.0282671
## Deaths                                0.8268252
## Death Rate                            0.9195200
## Age Adjusted Death Rate               1.0000000
corrgram(nums1, order=TRUE, 
         lower.panel=panel.shade,
        upper.panel=panel.pie, 
        text.panel=panel.txt,
        main="Data")

Reference:https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam