Blog 4
Libraries
# load data
library(tidyverse)
library(readr)
library(psych)
library(corrgram)
library(plotly)Data
Data was gathered from NYC Open Data. The type of data is Observational. The dependent variable is Death rates. The death rate is quantitative data. The independent variables are disease, race, and age.
Informational regrading this data can be found by visiting the link below:
https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam
Exploratory Data Analysis & Inference
leadingCauses <- read_csv("https://raw.githubusercontent.com/mandiemannz/CUNY-MSDS-Data-621/master/New_York_City_Leading_Causes_of_Death.csv")## Parsed with column specification:
## cols(
## Year = col_integer(),
## `Leading Cause` = col_character(),
## Sex = col_character(),
## `Race Ethnicity` = col_character(),
## Deaths = col_character(),
## `Death Rate` = col_character(),
## `Age Adjusted Death Rate` = col_character()
## )
head(leadingCauses, 10)## # A tibble: 10 x 7
## Year `Leading Cause` Sex `Race Ethnicity` Deaths `Death Rate`
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 2010 Influenza (Flu) and P~ F Hispanic 228 18.7
## 2 2008 Accidents Except Drug~ F Hispanic 68 5.8
## 3 2013 Accidents Except Drug~ M White Non-Hispa~ 271 20.1
## 4 2010 Cerebrovascular Disea~ M Hispanic 140 12.3
## 5 2009 Assault (Homicide: Y8~ M Black Non-Hispa~ 255 30
## 6 2012 Mental and Behavioral~ F Other Race/ Eth~ . .
## 7 2012 Cerebrovascular Disea~ F Asian and Pacif~ 102 17.5
## 8 2009 Essential Hypertensio~ M Asian and Pacif~ 26 5.1
## 9 2010 All Other Causes F White Non-Hispa~ 2140 149.7
## 10 2009 Alzheimer's Disease (~ F Other Race/ Eth~ . .
## # ... with 1 more variable: `Age Adjusted Death Rate` <chr>
An overview of the data found from leading causes of diseases.
As show above, the data is in a messy format, so we must apply some techniques to clean and better make sense of the data.
Data Cleaning
The below code helps make sense of the data. Some “NA” data was formatted as “.”, so we apply a transformation to make “.” into NA, so they can easily be removed. We also convert some variables to numeric, as when imported into R they come in as character variables.
leadingCauses$Deaths[leadingCauses$Deaths=="."] <- NA
leadingCauses$`Death Rate`[leadingCauses$`Death Rate`=="."] <- NA
leadingCauses$`Age Adjusted Death Rate`[leadingCauses$`Age Adjusted Death Rate`=="."] <- NA
leadingCauses <- na.omit(leadingCauses)
leadingCauses$Deaths <- as.numeric(leadingCauses$Deaths)
leadingCauses$`Death Rate` <- as.numeric(leadingCauses$`Death Rate`)
leadingCauses$`Age Adjusted Death Rate` <- as.numeric(leadingCauses$`Age Adjusted Death Rate`)Summary Statistics
The below summary and describe show some summary statistics, such as mean/median/range for variables in the dataset.
summary(leadingCauses)## Year Leading Cause Sex Race Ethnicity
## Min. :2007 Length:708 Length:708 Length:708
## 1st Qu.:2008 Class :character Class :character Class :character
## Median :2010 Mode :character Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2014
## Deaths Death Rate Age Adjusted Death Rate
## Min. : 13.0 Min. : 2.40 Min. : 2.50
## 1st Qu.: 107.0 1st Qu.: 11.60 1st Qu.: 12.15
## Median : 209.5 Median : 18.35 Median : 20.35
## Mean : 591.5 Mean : 53.44 Mean : 53.46
## 3rd Qu.: 501.2 3rd Qu.: 64.62 3rd Qu.: 77.55
## Max. :7050.0 Max. :491.40 Max. :350.70
describe(leadingCauses)## vars n mean sd median trimmed mad
## Year 1 708 2010.48 2.30 2010.00 2010.48 2.97
## Leading Cause* 2 708 NaN NA NA NaN NA
## Sex* 3 708 NaN NA NA NaN NA
## Race Ethnicity* 4 708 NaN NA NA NaN NA
## Deaths 5 708 591.47 981.18 209.50 356.16 199.41
## Death Rate 6 708 53.44 76.52 18.35 35.60 15.05
## Age Adjusted Death Rate 7 708 53.46 69.10 20.35 39.14 17.57
## min max range skew kurtosis se
## Year 2007.0 2014.0 7.0 0.01 -1.25 0.09
## Leading Cause* Inf -Inf -Inf NA NA NA
## Sex* Inf -Inf -Inf NA NA NA
## Race Ethnicity* Inf -Inf -Inf NA NA NA
## Deaths 13.0 7050.0 7037.0 3.15 11.71 36.88
## Death Rate 2.4 491.4 489.0 2.48 6.84 2.88
## Age Adjusted Death Rate 2.5 350.7 348.2 1.81 2.68 2.60
The data show that the # of individuals per ethnicity seems evenly distributed - there are about the same amount per group in the data.
leadingCauses %>%
group_by(leadingCauses$`Race Ethnicity`) %>%
summarize(n=n())## # A tibble: 4 x 2
## `leadingCauses$\`Race Ethnicity\`` n
## <chr> <int>
## 1 Asian and Pacific Islander 177
## 2 Black Non-Hispanic 178
## 3 Hispanic 177
## 4 White Non-Hispanic 176
The data also has an even number of males and females.
leadingCauses %>%
group_by(leadingCauses$Sex) %>%
summarize(n=n())## # A tibble: 2 x 2
## `leadingCauses$Sex` n
## <chr> <int>
## 1 F 354
## 2 M 354
Visual Analysis
The following interactive bargraph shows the count of deaths by year.
p1 <- ggplot(data = leadingCauses, aes(leadingCauses$Year, leadingCauses$Deaths)) +
stat_summary(fun.y = sum,
geom = "bar", aes(fill = Year))
ggplotly(p1)The following graph shows the number of deaths broken down by Ethnicity.
p2 <- ggplot(data = leadingCauses, aes(leadingCauses$`Race Ethnicity`, leadingCauses$Deaths)) +
stat_summary(fun.y = sum,
geom = "bar", aes(fill = `Race Ethnicity`)) +
theme(legend.position="none")
ggplotly(p2)According to the graph, white non-Hispanic have the most reported deaths; followed by black non-Hispanic.
The following graph shows the leading causes by number of deaths.
p3 <- ggplot(data = leadingCauses, aes(leadingCauses$`Leading Cause`, leadingCauses$Deaths)) +
stat_summary(fun.y = sum,
geom = "bar", aes(fill = `Leading Cause`)) +
theme(legend.position="none") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p3)The data seems to show that the leading cause of death is those related to the heart.
nums <- unlist(lapply(leadingCauses, is.numeric))
nums1 <- leadingCauses[ , nums]
cor(nums1)## Year Deaths Death Rate
## Year 1.000000000 -0.001807682 -0.003462548
## Deaths -0.001807682 1.000000000 0.972060559
## Death Rate -0.003462548 0.972060559 1.000000000
## Age Adjusted Death Rate -0.028267102 0.826825203 0.919520032
## Age Adjusted Death Rate
## Year -0.0282671
## Deaths 0.8268252
## Death Rate 0.9195200
## Age Adjusted Death Rate 1.0000000
ANOVA Models
In order to test if there’s a significant difference between the death rate of different ethnicities, we’re going to use an ANOVA test (since there are more than 2 variables we would like to test for.)
\[H_0: \mu_{white} = \mu_{black} = \mu_{asian} = \mu_{hispanic}\]
\[H_a: \mu_{white} \neq \mu_{black} \neq \mu_{asian} \neq \mu_{hispanic} \]
model <- lm(leadingCauses$`Death Rate` ~ leadingCauses$`Race Ethnicity`, leadingCauses)
anova(model)## Analysis of Variance Table
##
## Response: leadingCauses$`Death Rate`
## Df Sum Sq Mean Sq F value Pr(>F)
## leadingCauses$`Race Ethnicity` 3 375826 125275 23.428 1.835e-14 ***
## Residuals 704 3764387 5347
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value from our model is less than 0.05, we reject the \(H_0\) and accept the \(H_a\), that there is significant evidence that there is a difference in death rate between different ethnicity.
model <- lm(leadingCauses$`Death Rate` ~ leadingCauses$Sex, leadingCauses)
anova(model)## Analysis of Variance Table
##
## Response: leadingCauses$`Death Rate`
## Df Sum Sq Mean Sq F value Pr(>F)
## leadingCauses$Sex 1 2940 2939.8 0.5017 0.479
## Residuals 706 4137273 5860.2
Since the p-value from our model is higher than 0.05, we fail to reject the \(H_0\), that there is not enough significant evidence that there is a difference in death rate between different sexes.
Conclusions
After analyzing the data from the leadingcauses dataset, the data shows that white non-Hispanic account for most of the deaths within this dataset. The data also shows that the leading cause of death is those related to heart issues. An interesting follow-up study would be to see what factors impact the heart the most - and ways to prevent them.
The data seems to suggest that there is some significant difference between the death rate and ethnicties, whereas there isn’t a significant difference between the deathrate and sex.