Data Preparation
# load data
library(tidyverse)
library(readr)
library(psych)
library(corrgram)Research question
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
What is the deathrate based off disease, race, and age
Cases
What are the cases, and how many are there?
708 total causes - showing year, leading cause of death, ethnicity, and death/rates
Data collection
Describe the method of data collection. Data was gathered from NYC Open Data
Type of study
What type of study is this (observational/experiment)? Obervational
Data Source
If you collected the data, state self-collected. If not, provide a citation/link.
https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam
leadingCauses <- read_csv("New_York_City_Leading_Causes_of_Death.csv")## Parsed with column specification:
## cols(
## Year = col_integer(),
## `Leading Cause` = col_character(),
## Sex = col_character(),
## `Race Ethnicity` = col_character(),
## Deaths = col_character(),
## `Death Rate` = col_character(),
## `Age Adjusted Death Rate` = col_character()
## )
head(leadingCauses, 10)## # A tibble: 10 x 7
## Year `Leading Cause` Sex `Race Ethnicity` Deaths `Death Rate`
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 2010 Assault (Homicide: Y8~ M Black Non-Hispa~ 299 35.1
## 2 2011 Mental and Behavioral~ M Not Stated/Unkn~ 5 .
## 3 2011 Diseases of Heart (I0~ M Black Non-Hispa~ 1840 215.7
## 4 2008 Certain Conditions or~ F Other Race/ Eth~ . .
## 5 2014 Accidents Except Drug~ F Hispanic 64 5.1
## 6 2007 Intentional Self-Harm~ M Not Stated/Unkn~ 5 .
## 7 2012 Accidents Except Drug~ M Black Non-Hispa~ 152 17.8
## 8 2009 All Other Causes M Asian and Pacif~ 220 43.1
## 9 2013 Diseases of Heart (I0~ F Asian and Pacif~ 437 72.8
## 10 2014 Accidents Except Drug~ M Other Race/ Eth~ 12 .
## # ... with 1 more variable: `Age Adjusted Death Rate` <chr>
Dependent Variable
What is the response variable? Is it quantitative or qualitative?
The dependent variable is Death rates. The death rate is quantitative data.
Independent Variable
You should have two independent variables, one quantitative and one qualitative.
The indepedent variables are disease, race, and age.
Relevant summary statistics
Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
leadingCauses$Deaths[leadingCauses$Deaths=="."] <- NA
leadingCauses$`Death Rate`[leadingCauses$`Death Rate`=="."] <- NA
leadingCauses$`Age Adjusted Death Rate`[leadingCauses$`Age Adjusted Death Rate`=="."] <- NA
leadingCauses <- na.omit(leadingCauses)
leadingCauses$Deaths <- as.numeric(leadingCauses$Deaths)
leadingCauses$`Death Rate` <- as.numeric(leadingCauses$`Death Rate`)
leadingCauses$`Age Adjusted Death Rate` <- as.numeric(leadingCauses$`Age Adjusted Death Rate`)summary(leadingCauses)## Year Leading Cause Sex Race Ethnicity
## Min. :2007 Length:708 Length:708 Length:708
## 1st Qu.:2008 Class :character Class :character Class :character
## Median :2010 Mode :character Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2014
## Deaths Death Rate Age Adjusted Death Rate
## Min. : 13.0 Min. : 2.40 Min. : 2.50
## 1st Qu.: 107.0 1st Qu.: 11.60 1st Qu.: 12.15
## Median : 209.5 Median : 18.35 Median : 20.35
## Mean : 591.5 Mean : 53.44 Mean : 53.46
## 3rd Qu.: 501.2 3rd Qu.: 64.62 3rd Qu.: 77.55
## Max. :7050.0 Max. :491.40 Max. :350.70
describe(leadingCauses)## vars n mean sd median trimmed mad
## Year 1 708 2010.48 2.30 2010.00 2010.48 2.97
## Leading Cause* 2 708 NaN NA NA NaN NA
## Sex* 3 708 NaN NA NA NaN NA
## Race Ethnicity* 4 708 NaN NA NA NaN NA
## Deaths 5 708 591.47 981.18 209.50 356.16 199.41
## Death Rate 6 708 53.44 76.52 18.35 35.60 15.05
## Age Adjusted Death Rate 7 708 53.46 69.10 20.35 39.14 17.57
## min max range skew kurtosis se
## Year 2007.0 2014.0 7.0 0.01 -1.25 0.09
## Leading Cause* Inf -Inf -Inf NA NA NA
## Sex* Inf -Inf -Inf NA NA NA
## Race Ethnicity* Inf -Inf -Inf NA NA NA
## Deaths 13.0 7050.0 7037.0 3.15 11.71 36.88
## Death Rate 2.4 491.4 489.0 2.48 6.84 2.88
## Age Adjusted Death Rate 2.5 350.7 348.2 1.81 2.68 2.60
ggplot(data = leadingCauses, aes(leadingCauses$Year, leadingCauses$Deaths)) +
stat_summary(fun.y = sum,
geom = "bar", aes(fill = Year))ggplot(data = leadingCauses, aes(leadingCauses$`Race Ethnicity`, leadingCauses$Deaths)) +
stat_summary(fun.y = sum,
geom = "bar", aes(fill = `Race Ethnicity`))leadingCauses %>%
group_by(leadingCauses$`Race Ethnicity`) %>%
summarize(mean = mean(Deaths), sd = sd(Deaths), median = median(Deaths))## # A tibble: 4 x 4
## `leadingCauses$\`Race Ethnicity\`` mean sd median
## <chr> <dbl> <dbl> <dbl>
## 1 Asian and Pacific Islander 149. 176. 60
## 2 Black Non-Hispanic 624. 715. 256.
## 3 Hispanic 423. 458. 186
## 4 White Non-Hispanic 1173. 1601. 330.
ggplot(data = leadingCauses, aes(leadingCauses$`Leading Cause`, leadingCauses$Deaths)) +
stat_summary(fun.y = sum,
geom = "bar", aes(fill = `Leading Cause`)) +
theme(legend.position="none")nums <- unlist(lapply(leadingCauses, is.numeric))
nums1 <- leadingCauses[ , nums]
cor(nums1)## Year Deaths Death Rate
## Year 1.000000000 -0.001807682 -0.003462548
## Deaths -0.001807682 1.000000000 0.972060559
## Death Rate -0.003462548 0.972060559 1.000000000
## Age Adjusted Death Rate -0.028267102 0.826825203 0.919520032
## Age Adjusted Death Rate
## Year -0.0282671
## Deaths 0.8268252
## Death Rate 0.9195200
## Age Adjusted Death Rate 1.0000000
corrgram(nums1, order=TRUE,
lower.panel=panel.shade,
upper.panel=panel.pie,
text.panel=panel.txt,
main="Data")