library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# load data
library("readxl")
data <- read.csv("https://data.cityofnewyork.us/api/views/jb7j-dtam/rows.csv")
head(data)
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
How are deaths changing over time? Are there more deaths for males than for females? What is the trend for each race/ethnicity group? In this project we estimate these trends with a trend analysis, and we use multiple regression analysis to predict the value of one variable from the values of two or more other variables.
What are the cases, and how many are there?
count(data, Year)
## # A tibble: 8 x 2
## Year n
## <int> <int>
## 1 2007 141
## 2 2008 136
## 3 2009 135
## 4 2010 138
## 5 2011 141
## 6 2012 134
## 7 2013 133
## 8 2014 136
Each case is one leading cause of death in New York City for a given year, sex, and race/ethnicity group, starting in 2007.
The data cover 8 years (2007 to 2014), and each year includes the other variables shown below, for a total of 1,094 observations.
glimpse(data)
## Observations: 1,094
## Variables: 7
## $ Year <int> 2010, 2008, 2013, 2010, 2009, 2012, 2012, 200…
## $ Leading.Cause <fct> "Influenza (Flu) and Pneumonia (J09-J18)", "A…
## $ Sex <fct> F, F, M, M, M, F, F, M, F, F, M, F, F, M, F, …
## $ Race.Ethnicity <fct> Hispanic, Hispanic, White Non-Hispanic, Hispa…
## $ Deaths <fct> 228, 68, 271, 140, 255, ., 102, 26, 2140, ., …
## $ Death.Rate <fct> 18.7, 5.8, 20.1, 12.3, 30, ., 17.5, 5.1, 149.…
## $ Age.Adjusted.Death.Rate <fct> 23.1, 6.6, 17.9, 21.4, 30, ., 20.7, 7.2, 93.9…
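The output above shows `Deaths`, `Death.Rate`, and `Age.Adjusted.Death.Rate` stored as factors, with `.` marking missing values. Calling `as.numeric()` directly on a factor returns its internal level codes rather than the printed values, so the columns need to go through character first. A minimal sketch of that conversion, using a small toy data frame (the values and the helper name `to_number` are illustrative assumptions, not part of the dataset or the original analysis):

```r
# Toy data frame mimicking the column types above; values are illustrative only
toy <- data.frame(
  Deaths = factor(c("228", "68", ".", "140")),
  Death.Rate = factor(c("18.7", "5.8", ".", "12.3"))
)

# Hypothetical helper: convert factor -> character -> numeric.
# "." cannot be parsed as a number and becomes NA (warning suppressed).
to_number <- function(x) suppressWarnings(as.numeric(as.character(x)))

toy$Deaths <- to_number(toy$Deaths)
toy$Death.Rate <- to_number(toy$Death.Rate)

# Sum the non-missing values: 228 + 68 + 140
sum(toy$Deaths, na.rm = TRUE)
```

The same helper could be applied to the three factor columns of `data` before any summing or modelling.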
Describe the method of data collection.
The data are downloaded from NYC Open Data (https://data.cityofnewyork.us/api/views/jb7j-dtam/rows.csv).
What type of study is this (observational/experiment)?
This is an observational study. It is not an experiment because there is no treatment group or control group.
If you collected the data, state self-collected. If not, provide a citation/link.
Cause of death is derived from the NYC death certificate, which is issued for every death that occurs in New York City. The data are provided by the Department of Health and Mental Hygiene (DOHMH) and published on NYC Open Data.
What is the response variable? Is it quantitative or qualitative?
The dependent variable is the age-adjusted death rate, which is a quantitative variable.
You should have two independent variables, one quantitative and one qualitative.
The death rate is the quantitative independent variable, and the leading cause of death is the qualitative independent variable.
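With these variables, the multiple regression mentioned in the research question could take the following shape. This is only a sketch under stated assumptions: the toy rows pair a few rate values like those previewed in the `glimpse()` output, the cause labels `Cause A` and `Cause B` are hypothetical placeholders, and the columns are assumed to have already been converted to numeric.

```r
# Toy data; the cause labels are hypothetical and the rows are illustrative only
toy <- data.frame(
  Death.Rate = c(18.7, 5.8, 20.1, 12.3, 30.0, 17.5, 5.1),
  Age.Adjusted.Death.Rate = c(23.1, 6.6, 17.9, 21.4, 30.0, 20.7, 7.2),
  Leading.Cause = factor(c("Cause A", "Cause B", "Cause A", "Cause B",
                           "Cause A", "Cause B", "Cause A"))
)

# Response: age-adjusted death rate (quantitative)
# Predictors: death rate (quantitative) and leading cause (qualitative)
fit <- lm(Age.Adjusted.Death.Rate ~ Death.Rate + Leading.Cause, data = toy)

# One intercept, one slope for Death.Rate, one offset for the second cause level
coef(fit)
```

`lm()` encodes the factor as dummy variables automatically, so each cause other than the reference level gets its own offset.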
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(data)
## Year Leading.Cause Sex
## Min. :2007 All Other Causes : 96 F:554
## 1st Qu.:2008 Diseases of Heart (I00-I09, I11, I13, I20-I51): 96 M:540
## Median :2010 Influenza (Flu) and Pneumonia (J09-J18) : 96
## Mean :2010 Malignant Neoplasms (Cancer: C00-C97) : 96
## 3rd Qu.:2012 Diabetes Mellitus (E10-E14) : 92
## Max. :2014 Cerebrovascular Disease (Stroke: I60-I69) : 90
## (Other) :528
## Race.Ethnicity Deaths Death.Rate
## Asian and Pacific Islander:177 . :138 . :386
## Black Non-Hispanic :178 5 : 28 13 : 7
## Hispanic :177 8 : 22 17.3 : 7
## Not Stated/Unknown :200 6 : 21 11.4 : 6
## Other Race/ Ethnicity :186 10 : 15 16.3 : 6
## White Non-Hispanic :176 7 : 15 18 : 6
## (Other):855 (Other):676
## Age.Adjusted.Death.Rate
## . :386
## 17.9 : 6
## 21.4 : 6
## 6.3 : 6
## 14.1 : 5
## 14.8 : 5
## (Other):680
library(dplyr)
library(ggplot2)
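# Deaths is a factor: convert via character so the values (not the level codes) are summed; "." becomes NA
data %>% group_by(Year, Sex) %>%
  summarise(Total = sum(suppressWarnings(as.numeric(as.character(Deaths))), na.rm = TRUE)) %>%
  ggplot(aes(Year, Total)) + geom_line(aes(color = Sex)) + ylab('Total Deaths') + facet_wrap(~Sex)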
# Total deaths per year and race/ethnicity group
g <- group_by(data, Year, Race.Ethnicity)
# Convert the factor via character so the values (not the level codes) are summed; "." becomes NA
s <- summarise(g, Total_Death = sum(suppressWarnings(as.numeric(as.character(Deaths))), na.rm = TRUE))
ggplot(s, aes(Year, Total_Death)) + geom_point() +
  geom_line(aes(color = Race.Ethnicity)) + ylab('Total Deaths') +
  ggtitle('2007-2014 Death Trend') +
  # The lines use the color aesthetic, so the legend title is set on color rather than fill
  guides(color = guide_legend(title = "Race Ethnicity")) +
  theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"))
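The plots above show the trends visually. One simple way to put a number on such a trend is to regress the yearly total on `Year` and read off the slope. The sketch below assumes a hypothetical data frame `yearly` with made-up totals; it is not the project's actual result.

```r
# Hypothetical yearly totals, made up for illustration only
yearly <- data.frame(
  Year = 2007:2014,
  Total = c(100, 98, 97, 95, 96, 93, 92, 90)
)

# Linear trend: the Year coefficient is the average change in Total per year
trend <- lm(Total ~ Year, data = yearly)
coef(trend)[["Year"]]

# Predicted total for a later year, assuming the linear trend continues
predict(trend, newdata = data.frame(Year = 2015))
```

A negative slope indicates that totals fall over time. Fitting the model separately for each sex or race/ethnicity group would give one slope per line in the plots.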