library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data Preparation

# load data (read.csv is part of base R, so no extra package is needed)
data <- read.csv("https://data.cityofnewyork.us/api/views/jb7j-dtam/rows.csv")
head(data)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

How are deaths changing over time? Are there more deaths among males than females? What is the trend for each race/ethnicity group? In this project we look for trends over time with a trend analysis, and we use multiple regression to predict the value of one variable from the values of two or more other variables.

Cases

What are the cases, and how many are there?

count(data, Year)
## # A tibble: 8 x 2
##    Year     n
##   <int> <int>
## 1  2007   141
## 2  2008   136
## 3  2009   135
## 4  2010   138
## 5  2011   141
## 6  2012   134
## 7  2013   133
## 8  2014   136

Each case is an observation from the New York City Leading Causes of Death data for a given year: a leading cause of death, by sex and ethnicity, in New York City since 2007.

The data span 8 years (2007 to 2014), and each year includes the other variables shown below, for a total of 1,094 observations.

glimpse(data)
## Observations: 1,094
## Variables: 7
## $ Year                    <int> 2010, 2008, 2013, 2010, 2009, 2012, 2012, 200…
## $ Leading.Cause           <fct> "Influenza (Flu) and Pneumonia (J09-J18)", "A…
## $ Sex                     <fct> F, F, M, M, M, F, F, M, F, F, M, F, F, M, F, …
## $ Race.Ethnicity          <fct> Hispanic, Hispanic, White Non-Hispanic, Hispa…
## $ Deaths                  <fct> 228, 68, 271, 140, 255, ., 102, 26, 2140, ., …
## $ Death.Rate              <fct> 18.7, 5.8, 20.1, 12.3, 30, ., 17.5, 5.1, 149.…
## $ Age.Adjusted.Death.Rate <fct> 23.1, 6.6, 17.9, 21.4, 30, ., 20.7, 7.2, 93.9…
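
The glimpse output shows that Deaths, Death.Rate, and Age.Adjusted.Death.Rate were read as factors, with "." standing in for missing values, so they need converting before any arithmetic. A minimal sketch of that conversion follows; the data frame `toy`, its values, and the helper `to_number` are made up for illustration and are not part of the dataset.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  Deaths = factor(c("228", "68", ".", "140")),
  Death.Rate = factor(c("18.7", "5.8", ".", "12.3"))
)

# as.numeric() on a factor returns the internal level codes, not the values,
# so convert to character first. "." cannot be parsed and becomes NA.
# to_number is a hypothetical helper name, not a base R function.
to_number <- function(x) suppressWarnings(as.numeric(as.character(x)))

toy$Deaths <- to_number(toy$Deaths)
toy$Death.Rate <- to_number(toy$Death.Rate)

str(toy)
sum(toy$Deaths, na.rm = TRUE)
```

Another option, assuming "." is the only non-numeric entry in those columns, would be to pass na.strings = "." to read.csv() so the columns are parsed as numbers at load time.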

Data collection

Describe the method of data collection.

The data are downloaded as a CSV file from NYC Open Data (https://data.cityofnewyork.us).

Type of study

What type of study is this (observational/experiment)?

This is an observational study. It is not an experiment because there is no treatment group or control group.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Cause of death is derived from the NYC death certificate, which is issued for every death that occurs in New York City. The data are provided by the Department of Health and Mental Hygiene (DOHMH) and published on NYC Open Data.

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The dependent variable is the age-adjusted death rate, which is a quantitative variable.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The death rate is the quantitative independent variable, and the cause of death is the qualitative independent variable.
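
With these variables, the multiple regression could be fit with lm(), using the age-adjusted death rate as the response and the death rate plus the cause of death as predictors. A rough sketch on invented numbers follows; the data frame `toy`, the cause labels, and every value in it are hypothetical, and on the real data the rate columns would first need to be converted to numeric.

```r
# Hypothetical toy data: all values are invented to make the sketch runnable.
toy <- data.frame(
  Age.Adjusted.Death.Rate = c(10, 12, 15, 19, 22, 24, 30, 33),
  Death.Rate = c(8, 11, 13, 18, 20, 25, 27, 35),
  Leading.Cause = factor(rep(c("Cause A", "Cause B"), times = 4))
)

# Multiple regression with one quantitative and one qualitative predictor.
# lm() encodes the factor as dummy variables automatically.
fit <- lm(Age.Adjusted.Death.Rate ~ Death.Rate + Leading.Cause, data = toy)

summary(fit)
coef(fit)
```

The coefficient on Death.Rate is the estimated change in the age-adjusted rate per unit of death rate, and the Leading.Cause coefficient is the shift relative to the baseline cause.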

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(data)
##       Year                                             Leading.Cause Sex    
##  Min.   :2007   All Other Causes                              : 96   F:554  
##  1st Qu.:2008   Diseases of Heart (I00-I09, I11, I13, I20-I51): 96   M:540  
##  Median :2010   Influenza (Flu) and Pneumonia (J09-J18)       : 96          
##  Mean   :2010   Malignant Neoplasms (Cancer: C00-C97)         : 96          
##  3rd Qu.:2012   Diabetes Mellitus (E10-E14)                   : 92          
##  Max.   :2014   Cerebrovascular Disease (Stroke: I60-I69)     : 90          
##                 (Other)                                       :528          
##                     Race.Ethnicity     Deaths      Death.Rate 
##  Asian and Pacific Islander:177    .      :138   .      :386  
##  Black Non-Hispanic        :178    5      : 28   13     :  7  
##  Hispanic                  :177    8      : 22   17.3   :  7  
##  Not Stated/Unknown        :200    6      : 21   11.4   :  6  
##  Other Race/ Ethnicity     :186    10     : 15   16.3   :  6  
##  White Non-Hispanic        :176    7      : 15   18     :  6  
##                                    (Other):855   (Other):676  
##  Age.Adjusted.Death.Rate
##  .      :386            
##  17.9   :  6            
##  21.4   :  6            
##  6.3    :  6            
##  14.1   :  5            
##  14.8   :  5            
##  (Other):680
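library(dplyr)
library(ggplot2)

# Deaths is a factor with "." for missing values; as.numeric() alone would
# return level codes, so convert via character and drop the resulting NAs.
data %>%
  mutate(Deaths = suppressWarnings(as.numeric(as.character(Deaths)))) %>%
  group_by(Year, Sex) %>%
  summarise(Total = sum(Deaths, na.rm = TRUE)) %>%
  ggplot(aes(Year, Total)) + geom_line(aes(color = Sex)) + ylab('Total Death') + facet_wrap(~Sex)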

# Total deaths per year for each race/ethnicity group
# (convert the Deaths factor via character so real counts are summed).
s <- data %>%
  mutate(Deaths = suppressWarnings(as.numeric(as.character(Deaths)))) %>%
  group_by(Year, Race.Ethnicity) %>%
  summarise(Total_Death = sum(Deaths, na.rm = TRUE))

ggplot(s, aes(Year, Total_Death)) + geom_point() +
  geom_line(aes(color = Race.Ethnicity)) + ylab('Total Death') +
  ggtitle('2007-2014 Death Trend') +
  guides(color = guide_legend(title = "Race Ethnicity")) +
  theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"))
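
One simple way to put a number on the trend these plots show is to regress the yearly totals on year and read off the slope. The sketch below assumes a linear trend and uses invented totals; the data frame `yearly` and its values are hypothetical, not results from the dataset.

```r
# Hypothetical yearly totals, invented for illustration only.
yearly <- data.frame(
  Year = 2007:2014,
  Total = c(520, 515, 505, 500, 498, 490, 485, 480)
)

# Simple linear trend: the Year coefficient estimates the change per year.
trend <- lm(Total ~ Year, data = yearly)
coef(trend)["Year"]
```

A negative slope would indicate falling totals over the period. The same fit could be repeated per sex or per race/ethnicity group to compare trends.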