Overview
The Multiple Cause of Death database contains mortality and population counts for all U.S. states. Data are based on death certificates for U.S. residents. Each death certificate contains a single underlying cause of death, up to twenty additional multiple causes, and demographic data. This report examines the underlying and multiple causes of death for males, females, and infants in 2020, as well as differences in death rates among races. Multiple causes of death include not only the underlying cause but also the immediate cause of death and all other intermediate and contributory conditions listed on the death certificate. Death rates by gender and race are calculated per 100,000 people, whereas infant death rates are calculated per 1,000 infants.
Data Sources and Variables
Data: Multiple Cause of Death File, 2020, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program http://wonder.cdc.gov/mcd-icd10-expanded.html
Year: Only 2020
Gender: Male and Female
Race: White, Black/African American, American Indian/Alaska Native, Native Hawaiian/Pacific Islander, Asian, More than one race.
Population: The population estimates are U.S. Census Bureau estimates of U.S. national, state, and county resident populations.
Crude Death Rate = (number of deaths / population) * 100,000
Death Cause: Underlying cause-of-death is selected from the conditions entered by the physician on the cause of death section of the death certificate.
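The crude death rate formula listed above can be sketched in R as follows. This is a minimal illustration, not the CDC WONDER implementation: the function name `crude_rate` and its `per` argument are hypothetical, and the counts are made-up numbers chosen for easy arithmetic.

```r
# Hypothetical helper: crude death rate per `per` people.
crude_rate <- function(deaths, population, per = 100000) {
  stopifnot(all(population > 0))
  deaths / population * per
}

# Made-up counts, for illustration only (not taken from the data set).
rate_general <- crude_rate(deaths = 500, population = 50000)           # per 100,000
rate_infant  <- crude_rate(deaths = 6, population = 1000, per = 1000)  # per 1,000

print(rate_general)
print(rate_infant)
```

The `per` argument is what separates the gender and race rates (per 100,000) from the infant rates (per 1,000) used later in the report.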
Packages Used
dplyr: Data manipulation operations such as applying filter, selecting specific columns, sorting data, adding or deleting columns and aggregating data.
sqldf: Run SQL statements on R data frames for convenience.
onewaytests: Perform normality tests, including Shapiro-Wilk, and assess the normality of each group through plots.
ggplot2: Make visualizations such as bar plots.
Research Questions
How are death trends by month and death trends by days of the week different?
What are the top 10 underlying and multiple causes of death? Are there any differences in the top 5 multiple causes of death between males and females?
Are there any differences in crude death rates between male and female?
Is there any impact of COVID on the number of deaths?
How do drug and alcohol induced deaths differ between male and female?
Are death places of males and females similar?
Are there any differences in crude death rates for people of different races?
What are the top 5 causes of infant deaths and how do deaths differ within the infant age groups?
Analysis
#Loading the packages needed for the pipe (%>%), rename(), and sqldf()
library(dplyr)
library(sqldf)
#Reading the data
DeathCause <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Gender_Population1.txt")
#Removing rows with NA values
DeathCause <- na.omit(DeathCause)
#Renaming Crude.Rate (the crude death rate is per 100,000)
DeathCause <- DeathCause %>% rename(CrudeDeathRate = `Crude.Rate`)
Normality test for Crude Death Rate for Male and Female
Because the crude death rate is used throughout this project, I checked its normality for both males and females using the Shapiro-Wilk test. From the output, both p-values (0.587 for females and 0.994 for males) are greater than the significance level of 0.05, so we do not reject the null hypothesis of normality: the distribution of the crude death rate in each group is not significantly different from a normal distribution.
#Using the Shapiro-Wilk test for normality of the crude death rate for both male and female
library(onewaytests)
onewaytests::nor.test(CrudeDeathRate~Gender, data = DeathCause)
Shapiro-Wilk Normality Test (alpha = 0.05)
--------------------------------------------------
data : CrudeDeathRate and Gender
Level Statistic p.value Normality
1 Female 0.9811101 0.5868239 Not reject
2 Male 0.9935746 0.9941627 Not reject
--------------------------------------------------
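As a cross-check, the same per-group test can be run with base R's `shapiro.test`. The sketch below uses simulated data, since the original file is not reproduced here; the data frame `sim`, its group sizes, and the means and standard deviations are assumptions for illustration only.

```r
# Simulated stand-in for the real data (assumed: 51 rates per gender).
set.seed(436)
sim <- data.frame(
  Gender = rep(c("Female", "Male"), each = 51),
  CrudeDeathRate = c(rnorm(51, mean = 1000, sd = 160),
                     rnorm(51, mean = 1125, sd = 160))
)

# Run the Shapiro-Wilk test within each gender and collect the p-values.
p_values <- sapply(split(sim$CrudeDeathRate, sim$Gender),
                   function(x) shapiro.test(x)$p.value)

# A p-value above alpha means normality is NOT rejected for that group.
alpha <- 0.05
decision <- ifelse(p_values > alpha, "Not reject", "Reject")
print(data.frame(p.value = p_values, Normality = decision))
```

The decision rule in the last step is the one to read off the table above: a large p-value is evidence consistent with normality, not against it.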
Deaths Trend by Days of the week
I used ggplot2 to make bar plots of deaths by day of the week. The bar plot shows that weekdays have more deaths than weekends. Tuesday, Wednesday, and Thursday have more deaths than the other days.
library(ggplot2)
Month_Week_Deaths <- read.csv("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Month_Week_Deaths.csv")
#Making barplot
ggplot(data=Month_Week_Deaths, aes(x=factor(Weekday.Code), y=Deaths, fill=factor(Weekday))) + geom_bar(stat="identity", position = "dodge") +
labs(x= "Day of the week ", y="Deaths") + theme(panel.grid.major = element_blank()) + labs(title = "Days of the week Death Trends in the US in 2020", subtitle = "1=Sun, 2=Mon, 3=Tues, 4=Wed, 5=Thurs, 6=Fri, 7=Sat") + labs(fill="Days")
Deaths Trend by Month
We can see higher deaths in April and December, and lower deaths in February, June, and September.
Top 10 Underlying Cause of Deaths
Each death certificate identifies a single underlying cause of death. When deaths are grouped by gender, COVID-19 accounts for the most deaths, followed by atherosclerotic heart disease; unspecified malignant neoplasms of the bronchus or lung also appear in the top 10 for both males and females. The output also shows that Alzheimer disease deaths are higher for females: the female count (88,090) ranks fourth, while the male count does not reach the top 10.
Underlying_Cause <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Underlying Cause of Death.txt")
Underlying_Cause <-Underlying_Cause %>% select(-X) %>% rename(Underlying_Causes = 'Underlying.Cause.of.death', CauseOfDeath_Code='Underlying.Cause.of.death.Code')
newData<- sqldf("SELECT Underlying_Causes, CauseOfDeath_Code, Gender, SUM(Deaths) AS Total_Deaths FROM Underlying_Cause GROUP BY Underlying_Causes, Gender ORDER BY Total_Deaths DESC LIMIT 10 ")
newData
   Underlying_Causes                                    CauseOfDeath_Code Gender Total_Deaths
1  COVID-19                                             U07.1             Male         192512
2  COVID-19                                             U07.1             Female       158319
3  Atherosclerotic heart disease                        I25.1             Male          97732
4  Alzheimer disease, unspecified                       G30.9             Female        88090
5  Atherosclerotic heart disease                        I25.1             Female        72123
6  Unspecified dementia                                 F03               Female        71011
7  Bronchus or lung, unspecified - Malignant neoplasms  C34.9             Male          70988
8  Bronchus or lung, unspecified - Malignant neoplasms  C34.9             Female        61306
9  Chronic obstructive pulmonary disease, unspecified   J44.9             Female        60594
10 Acute myocardial infarction, unspecified             I21.9             Male          60206
Top 10 Multiple Causes of Deaths
Hypertension accounts for the most deaths, followed by cardiac arrest and COVID-19. Multiple causes of death include not only the underlying cause but also the immediate cause of death and all other intermediate and contributory conditions listed on the death certificate.
Multiple.Cause.of.Death <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Multiple Cause of Death.txt")
Top 5 Multiple Causes of Deaths for Males and Females
Hypertension is one of the major causes of death for both males and females, followed by cardiac arrest. Males have markedly higher counts of deaths involving atherosclerotic heart disease, mental and behavioral disorders, and COVID-19.
Deaths due to COVID in 2020
Deaths increased sharply from March onward due to COVID-19.
Gender VS Population
It is important to know how the male and female populations compare in 2020, because this tells us whether comparing deaths between the two groups is fair. Males and females have similar populations in 2020.
#boxplot
boxplot(Population~Gender, DeathCause, main="Gender VS Population ",
xlab="Gender", ylab="Population in 2020", outcol="red", col=c("grey","pink"))
Deaths by Gender
Males have a higher crude death rate than females. The regression output gives significant evidence that the crude death rate depends on gender: the GenderMale coefficient is 128.58 deaths per 100,000 (p = 0.000144).
#boxplot
boxplot(CrudeDeathRate~Gender, DeathCause, main="Gender VS Crude Death Rate",
xlab="Gender", ylab="Crude Death Rate per 100000", outcol="red", col=c("grey","pink"))
model_Gender <- lm(data=DeathCause, CrudeDeathRate ~ Gender)
summary(model_Gender)
Call:
lm(formula = CrudeDeathRate ~ Gender, data = DeathCause)
Residuals:
Min 1Q Median 3Q Max
-433.41 -97.35 2.98 115.09 444.59
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 996.43 23.00 43.330 < 2e-16 ***
GenderMale 128.58 32.52 3.954 0.000144 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 164.2 on 100 degrees of freedom
Multiple R-squared: 0.1352, Adjusted R-squared: 0.1265
F-statistic: 15.63 on 1 and 100 DF, p-value: 0.0001438
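With a single two-level predictor, this regression is equivalent to a pooled-variance two-sample t-test: the intercept is the mean of the reference group (Female) and the `GenderMale` coefficient is the difference between the group means. A minimal sketch on simulated data (the data frame `sim` and its parameters are assumptions, not the project data):

```r
# Simulated stand-in for the real data (assumed: 51 rates per gender).
set.seed(1)
sim <- data.frame(
  Gender = factor(rep(c("Female", "Male"), each = 51)),
  CrudeDeathRate = c(rnorm(51, 1000, 160), rnorm(51, 1125, 160))
)

fit <- lm(CrudeDeathRate ~ Gender, data = sim)
group_means <- tapply(sim$CrudeDeathRate, sim$Gender, mean)

# Slope = mean(Male) - mean(Female); intercept = mean(Female).
slope <- unname(coef(fit)["GenderMale"])
mean_diff <- unname(group_means["Male"] - group_means["Female"])

# The slope's p-value matches the equal-variance t-test p-value.
p_lm <- summary(fit)$coefficients["GenderMale", "Pr(>|t|)"]
p_t  <- t.test(CrudeDeathRate ~ Gender, data = sim, var.equal = TRUE)$p.value

print(c(slope = slope, mean_diff = mean_diff))
print(c(p_lm = p_lm, p_t = p_t))
```

Read this way, the output above implies a mean female crude death rate of 996.43 and a mean male rate of 996.43 + 128.58 = 1125.01 deaths per 100,000.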
Comparing Drug and Alcohol induced deaths between Male and Female
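Males have higher drug- and alcohol-induced crude death rates than females in the bar plot. In the ANOVA, however, the gender effect is not statistically significant (p = 0.22), while the cause category (alcohol, drug, or other) has a significant effect on the crude death rate (p = 0.00138).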
suppressMessages(library(dplyr))
DrugAlcohol <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Drug:Alcohol Induced Deaths for Male and Female.txt")
#Renaming the induced-cause code column and removing the empty column X
DrugAlcohol <- DrugAlcohol %>% rename(Drug_Alcohol_Other = 'UCD...Drug.Alcohol.Induced.Code') %>% select(-c(X))
ggplot(data=DrugAlcohol, aes(x=Drug_Alcohol_Other, y=Crude.Rate, fill=Gender)) + geom_bar(stat="identity", position = "dodge") +
labs(x= " ", y="Crude Death Rate") + theme(panel.grid.major = element_blank()) + labs(title = "Males have Higher Drug, and Alcohol Induced Deaths than Females", subtitle = "A=Alcohol, D=Drug, O=Other Non Alcohol or Non Drug")
DrugAlcohol <- as.data.frame(DrugAlcohol)
modell <- aov(data=DrugAlcohol, Crude.Rate ~ Gender + Drug_Alcohol_Other )
summary(modell)
Df Sum Sq Mean Sq F value Pr(>F)
Gender 1 2642 2642 3.108 0.21994
Drug_Alcohol_Other 2 1232657 616328 725.193 0.00138 **
Residuals 2 1700 850
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
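The degrees of freedom in this table show that the ANOVA is fitted to only six rows (2 genders by 3 categories), which leaves two residual degrees of freedom and little power to detect a gender effect. The sketch below reproduces that layout with made-up rates; the values in `toy` are hypothetical, and only the degrees of freedom carry over to the real analysis.

```r
# Hypothetical crude rates: one value per gender/category cell.
toy <- data.frame(
  Gender   = factor(rep(c("Female", "Male"), times = 3)),
  Category = factor(rep(c("A", "D", "O"), each = 2)),
  Rate     = c(6, 16, 18, 39, 1030, 1100)
)

fit <- aov(Rate ~ Gender + Category, data = toy)
print(summary(fit))

# Degrees of freedom: 6 rows - 1 (intercept) - 1 (Gender) - 2 (Category) = 2.
resid_df <- df.residual(fit)
print(resid_df)
```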
Drug Poisonings Deaths for Male and Female
Unintentional overdoses are the largest category of drug poisoning deaths for both males and females. Males have higher suicide death rates due to drug poisonings. Overall, males have higher drug-related deaths than females.
Alcohol Related Deaths for Male and Female
Deaths due to all other alcohol-induced causes are much higher than deaths due to alcohol poisoning for both males and females. Overall, males have higher alcohol-related deaths than females.
Leading Causes of Deaths in Male and Female
#Male
Male_Death <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Male_Death Causes.txt")
#Removing NA column and renaming death cause code
Male_Death <- Male_Death %>% select(c(-X)) %>% rename(DeathCauseCode = `UCD...15.Leading.Causes.of.Death.Code`) %>% rename(DeathCauseName = `UCD...15.Leading.Causes.of.Death`) %>% rename(CrudeRate = Crude.Rate)
#Removing '#' from Death Cause Names to make it look nicer
Male_Death$DeathCauseName<-gsub("#","",as.character(Male_Death$DeathCauseName))
#Taking the 9 causes with the highest crude rates (ORDER BY is required for DESC to sort the rows)
newData3<- sqldf("SELECT DeathCauseName, DeathCauseCode, CrudeRate FROM Male_Death ORDER BY CrudeRate DESC LIMIT 9")
newData3
DeathCauseName DeathCauseCode CrudeRate
1 Diseases of heart GR113-054 235.9
2 Malignant neoplasms GR113-019 195.8
3 COVID-19 GR113-137 118.6
4 Accidents GR113-112 82.1
5 Chronic lower respiratory diseases GR113-082 45.0
6 Cerebrovascular diseases GR113-070 42.9
7 Diabetes mellitus GR113-046 35.5
8 Alzheimer disease GR113-052 25.4
9 Intentional self-harm (suicide) GR113-124 22.5
#Female
Female_Death <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Female_Death Causes.txt")
#Removing NA column and renaming death cause code
Female_Death <- Female_Death %>% select(c(-X)) %>% rename(DeathCauseCode = `UCD...15.Leading.Causes.of.Death.Code`) %>% rename(DeathCauseName = `UCD...15.Leading.Causes.of.Death`) %>% rename(CrudeRate = Crude.Rate)
#Removing '#' from Death Cause Names to make it look nicer
Female_Death$DeathCauseName<-gsub("#","",as.character(Female_Death$DeathCauseName))
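#Taking the 9 causes with the highest crude rates (ORDER BY is required for DESC to sort the rows)
newData3<- sqldf("SELECT DeathCauseName, DeathCauseCode, CrudeRate FROM Female_Death ORDER BY CrudeRate DESC LIMIT 9")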
newData3
DeathCauseName DeathCauseCode CrudeRate
1 Diseases of heart GR113-054 187.9
2 Malignant neoplasms GR113-019 170.2
3 COVID-19 GR113-137 94.7
4 Alzheimer disease GR113-052 55.6
5 Cerebrovascular diseases GR113-070 54.2
6 Chronic lower respiratory diseases GR113-082 47.7
7 Accidents (unintentional injuries) GR113-112 40.5
8 Diabetes mellitus GR113-046 26.7
9 Influenza and pneumonia GR113-076 15.4
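To compare the two tables directly, the printed rates can be joined on the cause code and expressed as a male-to-female ratio. The sketch below retypes the values from the two outputs above; the object names `male`, `female`, and `both` are illustrative and not part of the original analysis.

```r
# Crude rates per 100,000, copied from the two printed tables above.
male <- data.frame(
  Code = c("GR113-054", "GR113-019", "GR113-137", "GR113-112", "GR113-082",
           "GR113-070", "GR113-046", "GR113-052", "GR113-124"),
  Male = c(235.9, 195.8, 118.6, 82.1, 45.0, 42.9, 35.5, 25.4, 22.5)
)
female <- data.frame(
  Code = c("GR113-054", "GR113-019", "GR113-137", "GR113-052", "GR113-070",
           "GR113-082", "GR113-112", "GR113-046", "GR113-076"),
  Female = c(187.9, 170.2, 94.7, 55.6, 54.2, 47.7, 40.5, 26.7, 15.4)
)

# Inner join keeps only the causes that appear in both top-9 lists.
both <- merge(male, female, by = "Code")
both$Ratio <- round(both$Male / both$Female, 2)

# Sort from the most male-skewed cause to the most female-skewed one.
both <- both[order(-both$Ratio), ]
print(both, row.names = FALSE)
```

Eight causes appear in both lists. Accidents (GR113-112) show the largest male excess, with a ratio of about 2.03, while Alzheimer disease (GR113-052) is the most female-skewed cause, with a ratio of about 0.46. Suicide appears only in the male list and influenza and pneumonia only in the female list.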
Place of deaths for Males and Females
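We look at where males and females died. The table lists five of the eight place categories in alphabetical order; the query has no ORDER BY clause, so they are not ranked by count. In the regression, gender has no significant effect on the number of deaths once place is accounted for (p = 0.474), whereas place of death has a significant effect on the number of deaths.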
PlaceOfDeath <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Place of Death.txt")
PlaceOfDeath <- PlaceOfDeath %>% select(-c(X, Population, Crude.Rate)) %>% rename(PlaceOfdeath = Place.of.Death, PlaceOfDeathCode= Place.of.Death.Code)
newData4<- sqldf("SELECT PlaceOfDeath, Gender, SUM(Deaths) AS Deaths FROM PlaceOfDeath GROUP BY PlaceOfDeath, Gender LIMIT 10 ")
newData4
PlaceOfdeath Gender Deaths
1 Decedent's home Female 517058
2 Decedent's home Male 610909
3 Hospice facility Female 104184
4 Hospice facility Male 100487
5 Medical Facility - Dead on Arrival Female 3252
6 Medical Facility - Dead on Arrival Male 6421
7 Medical Facility - Inpatient Female 456685
8 Medical Facility - Inpatient Male 567152
9 Medical Facility - Outpatient or ER Female 78836
10 Medical Facility - Outpatient or ER Male 123632
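The rows above are in alphabetical order of place rather than ranked by count. A base R sketch that ranks the five listed places by total deaths, using the counts printed above (the names `places` and `totals` are illustrative, and the three places not shown in the table are left out):

```r
# Deaths by place and gender, copied from the printed table above.
places <- data.frame(
  Place = rep(c("Decedent's home", "Hospice facility",
                "Medical Facility - Dead on Arrival",
                "Medical Facility - Inpatient",
                "Medical Facility - Outpatient or ER"), each = 2),
  Gender = rep(c("Female", "Male"), times = 5),
  Deaths = c(517058, 610909, 104184, 100487, 3252, 6421,
             456685, 567152, 78836, 123632)
)

# Sum over gender, then sort places from most to fewest deaths.
totals <- aggregate(Deaths ~ Place, data = places, FUN = sum)
totals <- totals[order(-totals$Deaths), ]
print(totals, row.names = FALSE)
```

In SQL, the same ranking would need an explicit `ORDER BY` on the summed deaths before the `LIMIT`.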
PlaceOfDeath$PlaceOfdeath <- as.factor(PlaceOfDeath$PlaceOfdeath)
model <- lm(data=PlaceOfDeath, Deaths~PlaceOfdeath + Gender)
summary(model)
Call:
lm(formula = Deaths ~ PlaceOfdeath + Gender, data = PlaceOfDeath)
Residuals:
Min 1Q Median 3Q Max
-73110 -11862 0 11862 73110
Coefficients:
                                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                                        554231      38668  14.333 1.91e-06 ***
PlaceOfdeathHospice facility                      -461648      51558  -8.954 4.41e-05 ***
PlaceOfdeathMedical Facility - Dead on Arrival    -559147      51558 -10.845 1.25e-05 ***
PlaceOfdeathMedical Facility - Inpatient           -52065      51558  -1.010   0.3462
PlaceOfdeathMedical Facility - Outpatient or ER   -462750      51558  -8.975 4.34e-05 ***
PlaceOfdeathNursing home/long term care           -269990      51558  -5.237   0.0012 **
PlaceOfdeathOther                                 -450744      51558  -8.743 5.15e-05 ***
PlaceOfdeathPlace of death unknown                -563660      51558 -10.933 1.19e-05 ***
GenderMale                                          19505      25779   0.757   0.4740
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 51560 on 7 degrees of freedom
Multiple R-squared: 0.9736, Adjusted R-squared: 0.9434
F-statistic: 32.27 on 8 and 7 DF, p-value: 7.533e-05
Deaths by Race
White people have the highest crude death rate, followed by Black or African American, American Indian or Alaska Native, and Native Hawaiian or Other Pacific Islander people. Asian people and people of more than one race have the lowest death rates. There is also significant evidence that race has an effect on the crude death rate.
Race_Data <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Race:Deaths.txt")
Race_Data<- Race_Data %>% select(c(-X)) %>% rename(Race = Single.Race.6, Race_Code = Single.Race.6.Code, Crude_Rate = Crude.Rate)
newData5<- sqldf("SELECT Race, Crude_Rate FROM Race_Data ORDER BY Crude_Rate DESC ")
newData5
Race Crude_Rate
1 White 1112.8
2 Black or African American 1025.1
3 American Indian or Alaska Native 612.1
4 Native Hawaiian or Other Pacific Islander 576.6
5 Asian 461.9
6 More than one race 190.7
Deaths by Race and Gender
The ordering by race is the same for males and females: White people have the highest death rate, followed by Black or African American, American Indian or Alaska Native, and Native Hawaiian or Other Pacific Islander people. Males have a higher death rate than females for every race. Both race and gender have a significant effect on the death rate. The positive coefficient for GenderMale in the regression output (116.02) indicates that being male is associated with a higher crude rate relative to females, holding race constant.
Crude rates can be high in developed countries such as the United States despite high life expectancy, because these countries typically have a much higher proportion of older people and lower recent birth rates.
Race_Gender <- read.delim("~/Desktop/Final Project-STAT 436/Analysis of Deaths in US/Race:Gender.txt")
Race_Gender<- Race_Gender %>% select(c(-X)) %>% rename(Race = Single.Race.6, Race_Code = Single.Race.6.Code, Crude_Rate = Crude.Rate)
#Arranging Crude_Rate in decreasing order
newData5<- sqldf("SELECT Race, Gender, Crude_Rate FROM Race_Gender ORDER BY Crude_Rate DESC ")
newData5
Race Gender Crude_Rate
1 White Male 1170.6
2 Black or African American Male 1134.5
3 White Female 1056.1
4 Black or African American Female 924.5
5 American Indian or Alaska Native Male 663.3
6 Native Hawaiian or Other Pacific Islander Male 644.8
7 American Indian or Alaska Native Female 560.2
8 Native Hawaiian or Other Pacific Islander Female 506.3
9 Asian Male 506.1
10 Asian Female 421.4
11 More than one race Male 213.5
12 More than one race Female 168.2
#Converting Gender to a factor in the data frame that the models below are fitted on
Race_Gender$Gender <- factor(Race_Gender$Gender)
#Compute the model
#for race
model4 <- lm(Crude_Rate ~ Race, data = Race_Gender)
summary(model4)
Call:
lm(formula = Crude_Rate ~ Race, data = Race_Gender)
Residuals:
Min 1Q Median 3Q Max
-105.00 -52.98 0.00 52.98 105.00
Coefficients:
                                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                                     611.75      63.33   9.660 7.05e-05 ***
RaceAsian                                      -148.00      89.56  -1.653  0.14951
RaceBlack or African American                   417.75      89.56   4.665  0.00345 **
RaceMore than one race                         -420.90      89.56  -4.700  0.00333 **
RaceNative Hawaiian or Other Pacific Islander   -36.20      89.56  -0.404  0.70007
RaceWhite                                       501.60      89.56   5.601  0.00138 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 89.56 on 6 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9304
F-statistic: 30.42 on 5 and 6 DF, p-value: 0.0003434
#for race and gender
model5 <- lm(Crude_Rate ~ Race + Gender, data = Race_Gender)
summary(model5)
Call:
lm(formula = Crude_Rate ~ Race + Gender, data = Race_Gender)
Residuals:
1 2 3 4 5 6 7 8
6.4583 -6.4583 15.6583 -15.6583 -46.9917 46.9917 -11.2417 11.2417
9 10 11 12
0.7583 -0.7583 35.3583 -35.3583
Coefficients:
                                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                                     553.74      30.06  18.422 8.67e-06 ***
RaceAsian                                      -148.00      39.36  -3.761 0.013150 *
RaceBlack or African American                   417.75      39.36  10.615 0.000128 ***
RaceMore than one race                         -420.90      39.36 -10.695 0.000124 ***
RaceNative Hawaiian or Other Pacific Islander   -36.20      39.36  -0.920 0.399876
RaceWhite                                       501.60      39.36  12.745 5.29e-05 ***
GenderMale                                      116.02      22.72   5.106 0.003752 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.36 on 5 degrees of freedom
Multiple R-squared: 0.9939, Adjusted R-squared: 0.9866
F-statistic: 135.6 on 6 and 5 DF, p-value: 2.275e-05
summary(model5)$coefficients[ , 1]
(Intercept)
553.7417
RaceAsian
-148.0000
RaceBlack or African American
417.7500
RaceMore than one race
-420.9000
RaceNative Hawaiian or Other Pacific Islander
-36.2000
RaceWhite
501.6000
GenderMale
116.0167
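These coefficients can be checked by refitting the additive model on the twelve crude rates printed in the race-by-gender table above. The sketch below retypes those values; the object names `rg`, `fit`, `b`, and `white_male` are illustrative, and R's default alphabetical factor levels are assumed to give the same baselines as the original fit (American Indian or Alaska Native, Female).

```r
# Crude rates per 100,000, copied from the race-by-gender table above.
races <- c("White", "Black or African American",
           "American Indian or Alaska Native",
           "Native Hawaiian or Other Pacific Islander",
           "Asian", "More than one race")
rg <- data.frame(
  Race   = factor(rep(races, each = 2)),
  Gender = factor(rep(c("Male", "Female"), times = 6)),
  Crude_Rate = c(1170.6, 1056.1, 1134.5, 924.5, 663.3, 560.2,
                 644.8, 506.3, 506.1, 421.4, 213.5, 168.2)
)

fit <- lm(Crude_Rate ~ Race + Gender, data = rg)
b <- coef(fit)

# Fitted rate for White males = intercept + race effect + gender effect.
white_male <- unname(b["(Intercept)"] + b["RaceWhite"] + b["GenderMale"])
print(round(b, 4))
print(round(white_male, 2))
```

The fitted value for White males is 553.7417 + 501.6 + 116.0167, or about 1171.36, against an observed rate of 1170.6. Because the design is balanced, the `GenderMale` coefficient equals the difference between the average male rate and the average female rate across the six races.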
Infant Mortality
The graph presents the top 5 causes of infant deaths and their respective age groups. The age group <1 has the highest death rates. The death rate is per 1,000 infants. The perinatal period begins at 22 weeks of pregnancy and ends one week after the child’s birth.
Limitations
Inconsistencies between data sources might lead to bias in death rates.
Bias also results from undercounts of some population groups in the census, particularly young black males, young white males, and elderly persons, resulting in an overestimation of death rates.
Findings
Hypertension is one of the major causes of death in males and females.
Males have markedly higher atherosclerotic heart disease, mental and behavioral disorder, and COVID-19 deaths.
Males have higher alcohol-induced, drug-induced, and other (non-alcohol, non-drug) crude death rates than females, although the gender effect is not statistically significant in the ANOVA (p = 0.22).
Unintentional drug overdose deaths and suicide deaths are higher for males than for females.
The regression shows no significant difference in the number of deaths by place between males and females.
White people have the highest death rate, followed by Black or African American, American Indian or Alaska Native, and Native Hawaiian or Other Pacific Islander people; Asian people and people of more than one race have the lowest rates.
The age group <1 has the highest death rates, and certain conditions originating in the perinatal period are one of the major causes of infant deaths.
Tools Used
R Programming
Microsoft Excel
Tableau
References
Centers for Disease Control and Prevention, National Center for Health Statistics. Multiple Cause of Death 2018-2020 on CDC WONDER Online Database, released in 2021. Data are from the Multiple Cause of Death Files, 2018-2020, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program. Accessed at http://wonder.cdc.gov/mcd-icd10-expanded.html on Apr 25, 2022 1:13:10 PM.