Blog 4

Libraries

# load data
library(tidyverse)
library(readr)
library(psych)
library(corrgram)
library(plotly)

Data

Data was gathered from NYC Open Data. The type of data is Observational. The dependent variable is Death rates. The death rate is quantitative data. The independent variables are disease, race, and age.

Informational regrading this data can be found by visiting the link below:

https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam

Exploratory Data Analysis & Inference

leadingCauses <- read_csv("https://raw.githubusercontent.com/mandiemannz/CUNY-MSDS-Data-621/master/New_York_City_Leading_Causes_of_Death.csv")
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   `Leading Cause` = col_character(),
##   Sex = col_character(),
##   `Race Ethnicity` = col_character(),
##   Deaths = col_character(),
##   `Death Rate` = col_character(),
##   `Age Adjusted Death Rate` = col_character()
## )
head(leadingCauses, 10)
## # A tibble: 10 x 7
##     Year `Leading Cause`        Sex   `Race Ethnicity` Deaths `Death Rate`
##    <int> <chr>                  <chr> <chr>            <chr>  <chr>       
##  1  2010 Influenza (Flu) and P~ F     Hispanic         228    18.7        
##  2  2008 Accidents Except Drug~ F     Hispanic         68     5.8         
##  3  2013 Accidents Except Drug~ M     White Non-Hispa~ 271    20.1        
##  4  2010 Cerebrovascular Disea~ M     Hispanic         140    12.3        
##  5  2009 Assault (Homicide: Y8~ M     Black Non-Hispa~ 255    30          
##  6  2012 Mental and Behavioral~ F     Other Race/ Eth~ .      .           
##  7  2012 Cerebrovascular Disea~ F     Asian and Pacif~ 102    17.5        
##  8  2009 Essential Hypertensio~ M     Asian and Pacif~ 26     5.1         
##  9  2010 All Other Causes       F     White Non-Hispa~ 2140   149.7       
## 10  2009 Alzheimer's Disease (~ F     Other Race/ Eth~ .      .           
## # ... with 1 more variable: `Age Adjusted Death Rate` <chr>

An overview of the data found from leading causes of diseases.

As show above, the data is in a messy format, so we must apply some techniques to clean and better make sense of the data.

Data Cleaning

The below code helps make sense of the data. Some “NA” data was formatted as “.”, so we apply a transformation to make “.” into NA, so they can easily be removed. We also convert some variables to numeric, as when imported into R they come in as character variables.

leadingCauses$Deaths[leadingCauses$Deaths=="."] <- NA
leadingCauses$`Death Rate`[leadingCauses$`Death Rate`=="."] <- NA
leadingCauses$`Age Adjusted Death Rate`[leadingCauses$`Age Adjusted Death Rate`=="."] <- NA
leadingCauses <- na.omit(leadingCauses)


leadingCauses$Deaths <- as.numeric(leadingCauses$Deaths)
leadingCauses$`Death Rate` <- as.numeric(leadingCauses$`Death Rate`)
leadingCauses$`Age Adjusted Death Rate` <- as.numeric(leadingCauses$`Age Adjusted Death Rate`)

Summary Statistics

The below summary and describe show some summary statistics, such as mean/median/range for variables in the dataset.

summary(leadingCauses)
##       Year      Leading Cause          Sex            Race Ethnicity    
##  Min.   :2007   Length:708         Length:708         Length:708        
##  1st Qu.:2008   Class :character   Class :character   Class :character  
##  Median :2010   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2010                                                           
##  3rd Qu.:2012                                                           
##  Max.   :2014                                                           
##      Deaths         Death Rate     Age Adjusted Death Rate
##  Min.   :  13.0   Min.   :  2.40   Min.   :  2.50         
##  1st Qu.: 107.0   1st Qu.: 11.60   1st Qu.: 12.15         
##  Median : 209.5   Median : 18.35   Median : 20.35         
##  Mean   : 591.5   Mean   : 53.44   Mean   : 53.46         
##  3rd Qu.: 501.2   3rd Qu.: 64.62   3rd Qu.: 77.55         
##  Max.   :7050.0   Max.   :491.40   Max.   :350.70
describe(leadingCauses)
##                         vars   n    mean     sd  median trimmed    mad
## Year                       1 708 2010.48   2.30 2010.00 2010.48   2.97
## Leading Cause*             2 708     NaN     NA      NA     NaN     NA
## Sex*                       3 708     NaN     NA      NA     NaN     NA
## Race Ethnicity*            4 708     NaN     NA      NA     NaN     NA
## Deaths                     5 708  591.47 981.18  209.50  356.16 199.41
## Death Rate                 6 708   53.44  76.52   18.35   35.60  15.05
## Age Adjusted Death Rate    7 708   53.46  69.10   20.35   39.14  17.57
##                            min    max  range skew kurtosis    se
## Year                    2007.0 2014.0    7.0 0.01    -1.25  0.09
## Leading Cause*             Inf   -Inf   -Inf   NA       NA    NA
## Sex*                       Inf   -Inf   -Inf   NA       NA    NA
## Race Ethnicity*            Inf   -Inf   -Inf   NA       NA    NA
## Deaths                    13.0 7050.0 7037.0 3.15    11.71 36.88
## Death Rate                 2.4  491.4  489.0 2.48     6.84  2.88
## Age Adjusted Death Rate    2.5  350.7  348.2 1.81     2.68  2.60

The data show that the # of individuals per ethnicity seems evenly distributed - there are about the same amount per group in the data.

leadingCauses %>%
  group_by(leadingCauses$`Race Ethnicity`) %>%
  summarize(n=n())
## # A tibble: 4 x 2
##   `leadingCauses$\`Race Ethnicity\``     n
##   <chr>                              <int>
## 1 Asian and Pacific Islander           177
## 2 Black Non-Hispanic                   178
## 3 Hispanic                             177
## 4 White Non-Hispanic                   176

The data also has an even number of males and females.

leadingCauses %>%
  group_by(leadingCauses$Sex) %>%
  summarize(n=n())
## # A tibble: 2 x 2
##   `leadingCauses$Sex`     n
##   <chr>               <int>
## 1 F                     354
## 2 M                     354

Visual Analysis

The following interactive bargraph shows the count of deaths by year.

p1 <- ggplot(data = leadingCauses, aes(leadingCauses$Year, leadingCauses$Deaths)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill = Year))

ggplotly(p1)

The following graph shows the number of deaths broken down by Ethnicity.

p2 <- ggplot(data = leadingCauses, aes(leadingCauses$`Race Ethnicity`, leadingCauses$Deaths)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill = `Race Ethnicity`)) +
  theme(legend.position="none")

ggplotly(p2)

According to the graph, white non-Hispanic have the most reported deaths; followed by black non-Hispanic.

The following graph shows the leading causes by number of deaths.

p3 <- ggplot(data = leadingCauses, aes(leadingCauses$`Leading Cause`, leadingCauses$Deaths)) + 
  stat_summary(fun.y = sum,
               geom = "bar", aes(fill = `Leading Cause`)) +
  theme(legend.position="none") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

ggplotly(p3)

The data seems to show that the leading cause of death is those related to the heart.

nums <- unlist(lapply(leadingCauses, is.numeric)) 
nums1 <- leadingCauses[ , nums]
cor(nums1)
##                                 Year       Deaths   Death Rate
## Year                     1.000000000 -0.001807682 -0.003462548
## Deaths                  -0.001807682  1.000000000  0.972060559
## Death Rate              -0.003462548  0.972060559  1.000000000
## Age Adjusted Death Rate -0.028267102  0.826825203  0.919520032
##                         Age Adjusted Death Rate
## Year                                 -0.0282671
## Deaths                                0.8268252
## Death Rate                            0.9195200
## Age Adjusted Death Rate               1.0000000

ANOVA Models

In order to test if there’s a significant difference between the death rate of different ethnicities, we’re going to use an ANOVA test (since there are more than 2 variables we would like to test for.)

\[H_0: \mu_{white} = \mu_{black} = \mu_{asian} = \mu_{hispanic}\]

\[H_a: \mu_{white} \neq \mu_{black} \neq \mu_{asian} \neq \mu_{hispanic} \]

model <- lm(leadingCauses$`Death Rate` ~ leadingCauses$`Race Ethnicity`, leadingCauses)
anova(model)
## Analysis of Variance Table
## 
## Response: leadingCauses$`Death Rate`
##                                 Df  Sum Sq Mean Sq F value    Pr(>F)    
## leadingCauses$`Race Ethnicity`   3  375826  125275  23.428 1.835e-14 ***
## Residuals                      704 3764387    5347                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value from our model is less than 0.05, we reject the \(H_0\) and accept the \(H_a\), that there is significant evidence that there is a difference in death rate between different ethnicity.

model <- lm(leadingCauses$`Death Rate` ~ leadingCauses$Sex, leadingCauses)
anova(model)
## Analysis of Variance Table
## 
## Response: leadingCauses$`Death Rate`
##                    Df  Sum Sq Mean Sq F value Pr(>F)
## leadingCauses$Sex   1    2940  2939.8  0.5017  0.479
## Residuals         706 4137273  5860.2

Since the p-value from our model is higher than 0.05, we fail to reject the \(H_0\), that there is not enough significant evidence that there is a difference in death rate between different sexes.

Conclusions

After analyzing the data from the leadingcauses dataset, the data shows that white non-Hispanic account for most of the deaths within this dataset. The data also shows that the leading cause of death is those related to heart issues. An interesting follow-up study would be to see what factors impact the heart the most - and ways to prevent them.

The data seems to suggest that there is some significant difference between the death rate and ethnicties, whereas there isn’t a significant difference between the deathrate and sex.

Amanda Arce

May 15, 2020