Overview

The Multiple Cause of Death database contains mortality and population counts for all U.S. states. Data are based on death certificates for U.S. residents. Each death certificate contains a single underlying cause of death, up to twenty additional multiple causes, and demographic data. Multiple causes of death include not only the underlying cause but also the immediate cause of death and all other intermediate and contributory conditions listed on the death certificate. We want to see the underlying and multiple causes of death of males, females, and infants in the year 2020, as well as differences in death rates among races. Death rates by gender and race are calculated per 100,000 people, whereas infant death rates are calculated per 1,000 infants.

Data Sources and Variables

  1. Data: Multiple Cause of Death File, 2020, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program http://wonder.cdc.gov/mcd-icd10-expanded.html

  2. Year: Only 2020

  3. Gender: Male and Female

  4. Race: White, Black/African American, American Indian/Alaska Native, Native Hawaiian/Pacific Islander, Asian, More than one race.

  5. Population: The population estimates are U.S. Census Bureau estimates of U.S. national, state, and county resident populations.

  6. Crude Death Rate = (number of deaths / population) * 100,000

  7. Death Cause: Underlying cause-of-death is selected from the conditions entered by the physician on the cause of death section of the death certificate.
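
The rate formula in item 6 can be sketched as a small R helper. The function name `crude_rate` and the counts below are hypothetical illustrations, not values from the CDC WONDER files. The `per` argument is an assumption added here to switch between the per-100,000 scale used for gender and race and the per-1,000 scale used for infants.

```r
# Hypothetical helper: crude rate = (deaths / population) * per
crude_rate <- function(deaths, population, per = 100000) {
  stopifnot(all(population > 0))
  deaths / population * per
}

# Made-up counts, for illustration only
rate_general <- crude_rate(deaths = 5000, population = 500000)
rate_infant  <- crude_rate(deaths = 12, population = 2000, per = 1000)

print(rate_general)  # deaths per 100,000 people
print(rate_infant)   # deaths per 1,000 infants
```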

Packages Used

  1. dplyr: Data manipulation operations such as applying filter, selecting specific columns, sorting data, adding or deleting columns and aggregating data.

  2. sqldf: Run SQL statements on R data frames for convenience.

  3. onewaytests: Perform normality tests, including Shapiro-Wilk, and assess the normality of each group through plots.

  4. ggplot2: Make visualizations such as bar plots.

Research Questions

  1. How are death trends by month and death trends by days of the week different?

  2. What are the top 10 underlying and multiple causes of death? Are there any differences in the top 5 multiple causes of death between males and females?

  3. Are there any differences in crude death rates between male and female?

  4. Is there any impact of COVID on the number of deaths?

  5. How do drug and alcohol induced deaths differ between male and female?

  6. Are death places of males and females similar?

  7. Are there any differences in crude death rates for people of different races?

  8. What are the top 5 causes of infant deaths and how do deaths differ within the infant age groups?

Analysis

#Reading the data
library(dplyr) #needed for %>% and rename() below
DeathCause <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Gender_Population1.txt")
#Removing NA values
DeathCause <- na.omit(DeathCause)
#Renaming Crude.Rate
DeathCause <- DeathCause %>% rename(CrudeDeathRate = `Crude.Rate`) #Crude Death Rate is per 100,000

Normality test for Crude Death Rate for Male and Female

Because we use the crude death rate throughout the project, I checked whether it is approximately normal for both males and females using the Shapiro-Wilk test. From the output, both p-values (0.587 for females and 0.994 for males) are greater than the significance level of 0.05, so we do not reject normality: the distribution of the crude death rate is not significantly different from a normal distribution for either gender.

#Using Shapiro-Wilk test for normality of the crude death rate for both male and female
library(onewaytests)
onewaytests::nor.test(CrudeDeathRate~Gender, data = DeathCause)

  Shapiro-Wilk Normality Test (alpha = 0.05) 
-------------------------------------------------- 
  data : CrudeDeathRate and Gender 

   Level Statistic   p.value   Normality
1 Female 0.9811101 0.5868239  Not reject
2   Male 0.9935746 0.9941627  Not reject
-------------------------------------------------- 
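
As a cross-check that does not depend on the onewaytests package, the same per-group test can be sketched with base R's `shapiro.test`. The data frame `sim` below is simulated, not the CDC data, and the group size of 51 is an assumption inferred from the 102 observations in the regression later in this report.

```r
set.seed(436)  # simulated data, not the CDC values

sim <- data.frame(
  Gender = rep(c("Female", "Male"), each = 51),
  CrudeDeathRate = c(rnorm(51, mean = 1000, sd = 160),
                     rnorm(51, mean = 1130, sd = 160))
)

# Shapiro-Wilk p-value within each gender
p_values <- tapply(sim$CrudeDeathRate, sim$Gender,
                   function(x) shapiro.test(x)$p.value)

# TRUE means normality is NOT rejected at alpha = 0.05
not_rejected <- p_values > 0.05

print(p_values)
print(not_rejected)
```

A p-value above 0.05 corresponds to "Not reject" in the onewaytests output.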

Deaths Trend by Days of the week

I used ggplot2 to make bar plots of the deaths trend by day of the week. We can see from the bar plot that weekdays have more deaths than weekends. Tuesday, Wednesday, and Thursday have more deaths than the other days.

library(ggplot2)
Month_Week_Deaths <- read.csv("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Month_Week_Deaths.csv")
#Making barplot
ggplot(data=Month_Week_Deaths, aes(x=factor(Weekday.Code), y=Deaths, fill=factor(Weekday))) + geom_bar(stat="identity",  position = "dodge") +
 labs(x= "Day of the week ", y="Deaths") + theme(panel.grid.major = element_blank()) + labs(title = "Days of the week Death Trends in the US in 2020", subtitle = "1=Sun, 2=Mon, 3=Tues, 4=Wed, 5=Thurs, 6=Fri, 7=Sat") + labs(fill="Days")

Deaths Trend by Month

We can see higher deaths in April and December, and lower deaths in February, June, and September.

Top 10 Underlying Cause of Deaths

Each death certificate identifies a single underlying cause of death. When grouped by gender, COVID-19 deaths are the highest, followed by atherosclerotic heart disease and unspecified malignant neoplasms of the bronchus or lung. From the output, we can also see that Alzheimer disease deaths are higher for females.

library(sqldf) #needed for sqldf() below
Underlying_Cause <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Underlying Cause of Death.txt")
#Dropping the empty column X and renaming the cause columns
Underlying_Cause <- Underlying_Cause %>% select(-X) %>% rename(Underlying_Causes = 'Underlying.Cause.of.death', CauseOfDeath_Code = 'Underlying.Cause.of.death.Code')

newData<- sqldf("SELECT Underlying_Causes, CauseOfDeath_Code, Gender, SUM(Deaths) AS Total_Deaths FROM Underlying_Cause GROUP BY Underlying_Causes, Gender ORDER BY Total_Deaths DESC LIMIT 10 ")
newData
                                     Underlying_Causes CauseOfDeath_Code Gender
1                                             COVID-19             U07.1   Male
2                                             COVID-19             U07.1 Female
3                        Atherosclerotic heart disease             I25.1   Male
4                       Alzheimer disease, unspecified             G30.9 Female
5                        Atherosclerotic heart disease             I25.1 Female
6                                 Unspecified dementia               F03 Female
7  Bronchus or lung, unspecified - Malignant neoplasms             C34.9   Male
8  Bronchus or lung, unspecified - Malignant neoplasms             C34.9 Female
9   Chronic obstructive pulmonary disease, unspecified             J44.9 Female
10            Acute myocardial infarction, unspecified             I21.9   Male
   Total_Deaths
1        192512
2        158319
3         97732
4         88090
5         72123
6         71011
7         70988
8         61306
9         60594
10        60206

Top 10 Multiple Causes of Deaths

Hypertension accounts for the most deaths, followed by cardiac arrest and COVID-19. Multiple causes of death include not only the underlying cause but also the immediate cause of death and all other intermediate and contributory conditions listed on the death certificate.

Multiple.Cause.of.Death <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Multiple Cause of Death.txt")

Top 5 Multiple Causes of Deaths for Males and Females

Hypertension is one of the major causes of death for both males and females, followed by cardiac arrest. Males have notably more deaths involving atherosclerotic heart disease, mental and behavioral disorders, and COVID-19.

Deaths due to COVID in 2020

Deaths increased markedly from March onward because of COVID.

Gender VS Population

Before comparing deaths, it is important to know how the male and female populations differed in 2020, because a large population gap would make a comparison of death counts misleading. Males and females have similar populations in 2020.

#boxplot
boxplot(Population~Gender, DeathCause, main="Gender VS Population ",
   xlab="Gender", ylab="Population in 2020", outcol="red", col=c("grey","pink"))

Deaths by Gender

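Males have a higher crude death rate than females. The regression estimates that the male rate is about 128.6 deaths per 100,000 higher than the female rate (p = 0.000144), which is significant evidence that the crude death rate differs by gender.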

#boxplot
boxplot(CrudeDeathRate~Gender, DeathCause, main="Gender VS Crude Death Rate",
   xlab="Gender", ylab="Crude Death Rate per 100000", outcol="red", col=c("grey","pink")) 

model_Gender <- lm(data=DeathCause, CrudeDeathRate ~ Gender)
summary(model_Gender)

Call:
lm(formula = CrudeDeathRate ~ Gender, data = DeathCause)

Residuals:
    Min      1Q  Median      3Q     Max 
-433.41  -97.35    2.98  115.09  444.59 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   996.43      23.00  43.330  < 2e-16 ***
GenderMale    128.58      32.52   3.954 0.000144 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 164.2 on 100 degrees of freedom
Multiple R-squared:  0.1352,    Adjusted R-squared:  0.1265 
F-statistic: 15.63 on 1 and 100 DF,  p-value: 0.0001438
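
With a single two-level predictor, the slope in `lm` is simply the difference between the two group means, and its p-value matches a pooled-variance two-sample t-test. A minimal sketch of that equivalence on simulated data follows; `sim` and its values are assumptions for illustration, not the CDC data.

```r
set.seed(1)  # simulated data, not the CDC values

sim <- data.frame(
  Gender = factor(rep(c("Female", "Male"), each = 51)),
  CrudeDeathRate = c(rnorm(51, 1000, 160), rnorm(51, 1130, 160))
)

fit <- lm(CrudeDeathRate ~ Gender, data = sim)

# Slope for GenderMale versus the raw difference in group means
group_means <- tapply(sim$CrudeDeathRate, sim$Gender, mean)
mean_diff <- unname(group_means["Male"] - group_means["Female"])
slope <- unname(coef(fit)["GenderMale"])

# Pooled-variance t-test gives the same p-value as the lm slope
tt <- t.test(CrudeDeathRate ~ Gender, data = sim, var.equal = TRUE)
p_lm <- summary(fit)$coefficients["GenderMale", "Pr(>|t|)"]

print(c(slope = slope, mean_diff = mean_diff))
print(c(p_lm = p_lm, p_ttest = tt$p.value))
```

Read this way, the GenderMale estimate of 128.58 in the output above is the gap between the male and female mean crude death rates.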

Comparing Drug and Alcohol induced deaths between Male and Female

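Males have higher drug-induced and alcohol-induced crude death rates than females in the plot. In the ANOVA, however, the gender effect is not significant at the 0.05 level (p = 0.22), which is unsurprising with only six observations. The cause category (alcohol, drug, or other) has a significant effect on the crude death rate (p = 0.00138).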

suppressMessages(library(dplyr))
DrugAlcohol <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Drug:Alcohol Induced Deaths for Male and Female.txt")

#Renaming the category column and removing the empty column X
DrugAlcohol <- DrugAlcohol %>%  rename(Drug_Alcohol_Other = 'UCD...Drug.Alcohol.Induced.Code') %>% select(-c(X))

ggplot(data=DrugAlcohol, aes(x=Drug_Alcohol_Other, y=Crude.Rate, fill=Gender)) + geom_bar(stat="identity",  position = "dodge") +
 labs(x= " ", y="Crude Death Rate") + theme(panel.grid.major = element_blank()) + labs(title = "Males have Higher Drug, and Alcohol Induced Deaths than Females", subtitle = "A=Alcohol, D=Drug, O=Other Non Alcohol or Non Drug")

DrugAlcohol <- as.data.frame(DrugAlcohol)

modell <- aov(data=DrugAlcohol, Crude.Rate ~ Gender + Drug_Alcohol_Other )
summary(modell)
                   Df  Sum Sq Mean Sq F value  Pr(>F)   
Gender              1    2642    2642   3.108 0.21994   
Drug_Alcohol_Other  2 1232657  616328 725.193 0.00138 **
Residuals           2    1700     850                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Drug Poisonings Deaths for Male and Female

Unintentional overdoses account for the most drug poisoning deaths for both males and females. Males have higher suicide death rates due to drug poisonings. Overall, males have higher drug-related death rates than females.

Alcohol Related Deaths for Male and Female

Deaths due to all other alcohol induced causes are significantly higher compared to deaths due to alcohol poisonings for both males and females. Overall, males have higher alcohol related deaths than females.

Leading Causes of Deaths in Male and Female

#Male

Male_Death <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Male_Death Causes.txt")
#Removing NA column and renaming death cause code
Male_Death <- Male_Death %>% select(c(-X)) %>% rename(DeathCauseCode = `UCD...15.Leading.Causes.of.Death.Code`) %>% rename(DeathCauseName = `UCD...15.Leading.Causes.of.Death`) %>% rename(CrudeRate = Crude.Rate)

#Removing '#' from Death Cause Names to make it look nicer
Male_Death$DeathCauseName<-gsub("#","",as.character(Male_Death$DeathCauseName))

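#Taking the top 9 causes by crude rate (a bare DESC after the table name is read as an alias and does not sort)
newData3<- sqldf("SELECT DeathCauseName, DeathCauseCode, CrudeRate FROM Male_Death ORDER BY CrudeRate DESC LIMIT 9")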
newData3
                      DeathCauseName DeathCauseCode CrudeRate
1                  Diseases of heart      GR113-054     235.9
2                Malignant neoplasms      GR113-019     195.8
3                           COVID-19      GR113-137     118.6
4                          Accidents      GR113-112      82.1
5 Chronic lower respiratory diseases      GR113-082      45.0
6          Cerebrovascular diseases       GR113-070      42.9
7                 Diabetes mellitus       GR113-046      35.5
8                 Alzheimer disease       GR113-052      25.4
9   Intentional self-harm (suicide)       GR113-124      22.5

#Female

Female_Death <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Female_Death Causes.txt")
#Removing NA column and renaming death cause code
Female_Death <- Female_Death %>% select(c(-X)) %>% rename(DeathCauseCode = `UCD...15.Leading.Causes.of.Death.Code`) %>% rename(DeathCauseName = `UCD...15.Leading.Causes.of.Death`) %>% rename(CrudeRate = Crude.Rate)

#Removing '#' from Death Cause Names to make it look nicer
Female_Death$DeathCauseName<-gsub("#","",as.character(Female_Death$DeathCauseName))

#Taking the top 9 causes by crude rate
newData3<- sqldf("SELECT DeathCauseName, DeathCauseCode, CrudeRate FROM Female_Death ORDER BY CrudeRate DESC LIMIT 9")
newData3
                      DeathCauseName DeathCauseCode CrudeRate
1                 Diseases of heart       GR113-054     187.9
2                Malignant neoplasms      GR113-019     170.2
3                          COVID-19       GR113-137      94.7
4                 Alzheimer disease       GR113-052      55.6
5           Cerebrovascular diseases      GR113-070      54.2
6 Chronic lower respiratory diseases      GR113-082      47.7
7 Accidents (unintentional injuries)      GR113-112      40.5
8                 Diabetes mellitus       GR113-046      26.7
9           Influenza and pneumonia       GR113-076      15.4

Place of deaths for Males and Females

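We compare places of death for males and females. Because the query below groups without an ORDER BY clause, its LIMIT 10 returns the first five places in alphabetical order rather than the five most common. In the regression, gender has no significant effect on the number of deaths (p = 0.474), while place of death does: most places have significantly fewer deaths than the baseline, decedent's home.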

PlaceOfDeath <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Place of Death.txt")
PlaceOfDeath <- PlaceOfDeath %>% select(-c(X, Population, Crude.Rate)) %>% rename(PlaceOfdeath = Place.of.Death, PlaceOfDeathCode= Place.of.Death.Code)

newData4<- sqldf("SELECT PlaceOfDeath, Gender, SUM(Deaths) AS Deaths FROM PlaceOfDeath GROUP BY PlaceOfDeath, Gender LIMIT 10 ")
newData4
                          PlaceOfdeath Gender Deaths
1                      Decedent's home Female 517058
2                      Decedent's home   Male 610909
3                     Hospice facility Female 104184
4                     Hospice facility   Male 100487
5   Medical Facility - Dead on Arrival Female   3252
6   Medical Facility - Dead on Arrival   Male   6421
7         Medical Facility - Inpatient Female 456685
8         Medical Facility - Inpatient   Male 567152
9  Medical Facility - Outpatient or ER Female  78836
10 Medical Facility - Outpatient or ER   Male 123632
PlaceOfDeath$PlaceOfdeath <- as.factor(PlaceOfDeath$PlaceOfdeath)
model <- lm(data=PlaceOfDeath, Deaths~PlaceOfdeath + Gender)
summary(model)

Call:
lm(formula = Deaths ~ PlaceOfdeath + Gender, data = PlaceOfDeath)

Residuals:
   Min     1Q Median     3Q    Max 
-73110 -11862      0  11862  73110 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                       554231      38668  14.333
PlaceOfdeathHospice facility                     -461648      51558  -8.954
PlaceOfdeathMedical Facility - Dead on Arrival   -559147      51558 -10.845
PlaceOfdeathMedical Facility - Inpatient          -52065      51558  -1.010
PlaceOfdeathMedical Facility - Outpatient or ER  -462750      51558  -8.975
PlaceOfdeathNursing home/long term care          -269990      51558  -5.237
PlaceOfdeathOther                                -450744      51558  -8.743
PlaceOfdeathPlace of death unknown               -563660      51558 -10.933
GenderMale                                         19505      25779   0.757
                                                Pr(>|t|)    
(Intercept)                                     1.91e-06 ***
PlaceOfdeathHospice facility                    4.41e-05 ***
PlaceOfdeathMedical Facility - Dead on Arrival  1.25e-05 ***
PlaceOfdeathMedical Facility - Inpatient          0.3462    
PlaceOfdeathMedical Facility - Outpatient or ER 4.34e-05 ***
PlaceOfdeathNursing home/long term care           0.0012 ** 
PlaceOfdeathOther                               5.15e-05 ***
PlaceOfdeathPlace of death unknown              1.19e-05 ***
GenderMale                                        0.4740    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51560 on 7 degrees of freedom
Multiple R-squared:  0.9736,    Adjusted R-squared:  0.9434 
F-statistic: 32.27 on 8 and 7 DF,  p-value: 7.533e-05
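
The grouped query earlier in this section has no ORDER BY, so its ten rows are in alphabetical order. A minimal base R sketch of ranking places by total deaths is shown here, using only the ten rows printed in that output. The remaining places in the file are not included, so this is not a full ranking, and the object names `places` and `totals` are hypothetical.

```r
# The ten rows printed by the grouped query above
places <- data.frame(
  PlaceOfdeath = rep(c("Decedent's home",
                       "Hospice facility",
                       "Medical Facility - Dead on Arrival",
                       "Medical Facility - Inpatient",
                       "Medical Facility - Outpatient or ER"), each = 2),
  Gender = rep(c("Female", "Male"), times = 5),
  Deaths = c(517058, 610909, 104184, 100487, 3252, 6421,
             456685, 567152, 78836, 123632)
)

# Sum over gender, then sort from most to fewest deaths
totals <- aggregate(Deaths ~ PlaceOfdeath, data = places, FUN = sum)
totals <- totals[order(-totals$Deaths), ]

print(totals, row.names = FALSE)
```

In SQL terms, the same ranking would need `ORDER BY Deaths DESC` before the `LIMIT`.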

Deaths by Race

White people have the highest crude death rate, followed by Black or African American, American Indian or Alaska Native, and Native Hawaiian or Other Pacific Islander people. Asian people and people of more than one race have the lowest crude death rates. We also have significant evidence that race has an effect on the crude death rate.

Race_Data <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Race:Deaths.txt")
Race_Data<- Race_Data %>% select(c(-X)) %>% rename(Race = Single.Race.6, Race_Code = Single.Race.6.Code, Crude_Rate = Crude.Rate)

newData5<- sqldf("SELECT Race, Crude_Rate FROM Race_Data ORDER BY Crude_Rate DESC ")
newData5
                                       Race Crude_Rate
1                                     White     1112.8
2                 Black or African American     1025.1
3          American Indian or Alaska Native      612.1
4 Native Hawaiian or Other Pacific Islander      576.6
5                                     Asian      461.9
6                        More than one race      190.7

Deaths by Race and Gender

White people have the highest crude death rate, followed by Black or African American, American Indian or Alaska Native, and Native Hawaiian or Other Pacific Islander people, for both males and females. Males have higher death rates than females for all races. Race and gender both have a significant effect on the death rate. The coefficient for GenderMale in the regression output is positive (116.02), indicating that being male is associated with a crude rate about 116 deaths per 100,000 higher than for females of the same race.

Higher crude rates can be found in developed countries like the United States despite high life expectancy, because these countries typically have a much higher proportion of older people and lower recent birth rates.

Race_Gender <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Race:Gender.txt")

Race_Gender<- Race_Gender %>% select(c(-X)) %>% rename(Race = Single.Race.6, Race_Code = Single.Race.6.Code, Crude_Rate = Crude.Rate)

#Arranging Crude_Rate in decreasing order
newData5<- sqldf("SELECT Race, Gender, Crude_Rate FROM Race_Gender ORDER BY Crude_Rate DESC ")
newData5
                                        Race Gender Crude_Rate
1                                      White   Male     1170.6
2                  Black or African American   Male     1134.5
3                                      White Female     1056.1
4                  Black or African American Female      924.5
5           American Indian or Alaska Native   Male      663.3
6  Native Hawaiian or Other Pacific Islander   Male      644.8
7           American Indian or Alaska Native Female      560.2
8  Native Hawaiian or Other Pacific Islander Female      506.3
9                                      Asian   Male      506.1
10                                     Asian Female      421.4
11                        More than one race   Male      213.5
12                        More than one race Female      168.2
newData5$Gender <- factor(newData5$Gender)

#Compute the model 

#for race
model4 <- lm(Crude_Rate ~ Race, data = Race_Gender)
summary(model4)

Call:
lm(formula = Crude_Rate ~ Race, data = Race_Gender)

Residuals:
    Min      1Q  Median      3Q     Max 
-105.00  -52.98    0.00   52.98  105.00 

Coefficients:
                                              Estimate Std. Error t value
(Intercept)                                     611.75      63.33   9.660
RaceAsian                                      -148.00      89.56  -1.653
RaceBlack or African American                   417.75      89.56   4.665
RaceMore than one race                         -420.90      89.56  -4.700
RaceNative Hawaiian or Other Pacific Islander   -36.20      89.56  -0.404
RaceWhite                                       501.60      89.56   5.601
                                              Pr(>|t|)    
(Intercept)                                   7.05e-05 ***
RaceAsian                                      0.14951    
RaceBlack or African American                  0.00345 ** 
RaceMore than one race                         0.00333 ** 
RaceNative Hawaiian or Other Pacific Islander  0.70007    
RaceWhite                                      0.00138 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 89.56 on 6 degrees of freedom
Multiple R-squared:  0.9621,    Adjusted R-squared:  0.9304 
F-statistic: 30.42 on 5 and 6 DF,  p-value: 0.0003434
#for race and gender
model5 <- lm(Crude_Rate ~ Race + Gender, data = Race_Gender)
summary(model5)

Call:
lm(formula = Crude_Rate ~ Race + Gender, data = Race_Gender)

Residuals:
       1        2        3        4        5        6        7        8 
  6.4583  -6.4583  15.6583 -15.6583 -46.9917  46.9917 -11.2417  11.2417 
       9       10       11       12 
  0.7583  -0.7583  35.3583 -35.3583 

Coefficients:
                                              Estimate Std. Error t value
(Intercept)                                     553.74      30.06  18.422
RaceAsian                                      -148.00      39.36  -3.761
RaceBlack or African American                   417.75      39.36  10.615
RaceMore than one race                         -420.90      39.36 -10.695
RaceNative Hawaiian or Other Pacific Islander   -36.20      39.36  -0.920
RaceWhite                                       501.60      39.36  12.745
GenderMale                                      116.02      22.72   5.106
                                              Pr(>|t|)    
(Intercept)                                   8.67e-06 ***
RaceAsian                                     0.013150 *  
RaceBlack or African American                 0.000128 ***
RaceMore than one race                        0.000124 ***
RaceNative Hawaiian or Other Pacific Islander 0.399876    
RaceWhite                                     5.29e-05 ***
GenderMale                                    0.003752 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.36 on 5 degrees of freedom
Multiple R-squared:  0.9939,    Adjusted R-squared:  0.9866 
F-statistic: 135.6 on 6 and 5 DF,  p-value: 2.275e-05
summary(model5)$coefficients[ , 1]
                                  (Intercept) 
                                     553.7417 
                                    RaceAsian 
                                    -148.0000 
                RaceBlack or African American 
                                     417.7500 
                       RaceMore than one race 
                                    -420.9000 
RaceNative Hawaiian or Other Pacific Islander 
                                     -36.2000 
                                    RaceWhite 
                                     501.6000 
                                   GenderMale 
                                     116.0167 
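
Because the model is additive, a fitted crude rate is the intercept plus the relevant race term plus the gender term. The sketch below recomputes two fitted values from the coefficients printed above. The short names in `coefs` and the helper `predict_rate` are hypothetical, and the baseline race is taken to be American Indian or Alaska Native because it has no coefficient of its own.

```r
# Coefficients copied from the model5 output above (short names are hypothetical)
coefs <- c(intercept = 553.7417, asian = -148.0, black = 417.75,
           multi = -420.9, nhpi = -36.2, white = 501.6, male = 116.0167)

# Fitted crude rate = intercept + race term + gender term
predict_rate <- function(race = NULL, male = FALSE) {
  race_effect <- if (is.null(race)) 0 else coefs[[race]]
  gender_effect <- if (male) coefs[["male"]] else 0
  coefs[["intercept"]] + race_effect + gender_effect
}

white_male <- predict_rate("white", male = TRUE)
baseline_female <- predict_rate()  # baseline race, female

# Observed White male crude rate from the table above is 1170.6
residual_white_male <- 1170.6 - white_male

print(white_male)
print(baseline_female)
print(residual_white_male)
```

The White male residual of about -0.76 agrees with the tenth residual printed in the model summary.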

Infant Mortality

The graph presents the top 5 causes of infant deaths and their respective age groups. Age groups <1 have the highest death rates. The death rate is per 1,000 infants. The perinatal period begins at 22 weeks of pregnancy and ends a week after the child's birth.

Limitations

  1. Inconsistencies between data sources might lead to bias in death rates.

  2. Bias also results from undercounts of some population groups in the census, particularly young black males, young white males, and elderly persons, resulting in an overestimation of death rates.

Findings

  1. Hypertension is one of the major causes of death in males and females.

  2. Males have notably more deaths involving atherosclerotic heart disease, mental and behavioral disorders, and COVID-19.

  3. Males have higher alcohol-induced, drug-induced, and other (non-alcohol, non-drug) crude death rates than females, although the gender effect is not statistically significant in the ANOVA (p = 0.22).

  4. Unintentional drug overdose deaths and suicide deaths are higher for males than for females.

  5. Gender has no significant effect on the number of deaths once place of death is accounted for (p = 0.474).

  6. White people have the highest crude death rate, followed by Black or African American, American Indian or Alaska Native, and Native Hawaiian or Other Pacific Islander people; Asian people and people of more than one race have the lowest.

  7. Age groups <1 have the highest death rates, and certain conditions originating in the perinatal period are one of the major causes of infant deaths.

Tools Used

  1. R Programming

  2. Microsoft Excel

  3. Tableau

References

Centers for Disease Control and Prevention, National Center for Health Statistics. Multiple Cause of Death 2018-2020 on CDC WONDER Online Database, released in 2021. Data are from the Multiple Cause of Death Files, 2018-2020, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program. Accessed at http://wonder.cdc.gov/mcd-icd10-expanded.html on Apr 25, 2022 1:13:10 PM.