Introduction

Background

Almost 20,000 infants die every year in the United States alone due to complications including birth defects, Sudden Infant Death Syndrome (SIDS), and maternal pregnancy complications. For every 1,000 babies born in the United States, 5.6 die within a year of birth, making the US one of the developed countries with the highest infant mortality rates. The infant mortality rate is widely used as one of the best measurements of population health, and analyzing differences in it can bring to light potential health system inequalities in the United States. Understanding the maternal demographic factors that contribute to infant mortality is thus vital to determining where health disparities lie, and it has profound public health implications. Examining key factors such as maternal education level, marital status, demographic characteristics (age and race), and whether or not the baby was born in a hospital may be instrumental in establishing patterns, disparities, and potential risk factors for infant mortality. The findings of this research could aid the development of targeted interventions and inform public policies that improve maternal and infant health.

Other pre-existing health factors that have previously been linked to higher infant mortality rates will also be analyzed. Tobacco use, hypertension prior to pregnancy, and diabetes prior to pregnancy have all been shown to be correlated with premature birth. Studies funded by the CDC have found that tobacco use is associated with an increased likelihood that a mother delivers her child prematurely, and premature birth is a leading cause of death and disability among newborns. Hypertension in mothers has also been shown to lead to premature birth and low birth weight in newborns. The key factor in this result is the reduced flow of blood through the placenta to the child caused by high blood pressure. Diabetes has been shown to have a similar effect in inducing premature birth. Taking these illnesses and habits into consideration is necessary to determine the true association between social factors and infant mortality.

Prior studies have evaluated the effect of demographic factors on infant mortality rates, but very few use data past 2017. Past studies inspecting social determinants and infant mortality rates also did so qualitatively. The few recent studies that have examined infant mortality rates do so only in specific counties or states, and not in relation to, or while controlling for, maternal health issues such as tobacco use or cardiovascular disease. There is therefore a need to quantitatively analyze recent data on how different social factors may affect infant mortality rates across the US as a whole.

Research Objective

The primary objective of the study is to determine which demographic, maternal, and birthplace characteristics are associated with an increase in the infant mortality rate, defined as the number of infants who die within one year of birth per 1,000 live births. The information gained from this study could be vital not only in providing the best care to patients of different demographics, but also in shaping public health policy that supplies demographics at increased risk of infant mortality with the resources they need during childbirth and their child’s first year of life. The secondary objective of the study is to confirm the positive association of tobacco use, pre-pregnancy hypertension, and pre-pregnancy diabetes with infant mortality.

Dataset Overview

The data used in this analysis come from the CDC WONDER online database, were collected by the National Center for Health Statistics, and track infant mortality rates from 2017 to 2021. The dataset provides the number of deaths, the number of live births, and the death rate for several categories of characteristics, including mother’s characteristics, maternal risk factors, pregnancy risk factors, and delivery characteristics, among others. Rows with death rates deemed unreliable by the CDC, either due to low population counts or unreliable data collection, were removed. Rows with unreported or unknown characteristics were also removed.

Predictor Variables:

Race: Uses the Single Race 6 classification, which includes six categories: White, Asian, Black or African American, American Indian or Alaska Native, Native Hawaiian or Pacific Islander, and More than one race.

Age: Age is separated into 9 categories, starting with under 15, then continuing with 15-19, and increasing at intervals of 5 until getting to 50 and over.

Metro: Variable is ‘Metro’ if the mother’s place of living is classified as in a metro area, ‘Nonmetro’ otherwise.

Married: Variable is ‘Married’ if mother is married, ‘Nonmarried’ otherwise.

Education: Split into 5 categories: middle school, some high school, high school, some college, and college.

Hospital: Variable is ‘Hospital’ if the baby was born in a hospital, ‘Not in Hospital’ otherwise.

Tobacco Use: Variable is ‘Yes’ if tobacco was used during pregnancy, ‘No’ otherwise.

Hypertension: Variable is ‘Yes’ if mother had hypertension prior to pregnancy, ‘No’ otherwise.

Diabetes: Variable is ‘Yes’ if mother had diabetes prior to pregnancy, ‘No’ otherwise.

Response Variables:

Deaths: Number of infant deaths within a year of being born

Births: Total number of live births

Death Rate: Also Infant Mortality/Infant Mortality Rate. Number of deaths divided by number of births times 1000. For example, a death rate of 14.2 means that for every 1000 live babies born, on average, 14.2 of them die within the first year of life.
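As a small illustration of the death rate formula above, the following R sketch reproduces the 14.2 example. The counts are hypothetical and were chosen only to give that rate; they are not taken from the dataset.

```r
# Hypothetical counts, chosen only to reproduce the 14.2 example above
deaths <- 142
births <- 10000

# Death rate: deaths per 1,000 live births
death_rate <- deaths / births * 1000
death_rate
```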

Exploratory Data Analysis

Initial EDA was done to determine, by visual inspection, whether the average infant mortality rate differed substantially between the levels of each characteristic. An example bar chart, pictured below, shows the average infant mortality rate grouped by race. Further bar charts are listed in the appendix. The bar chart suggests that the infant mortality rate differs between races, although further statistical analysis is needed to confirm this.

A heat map of race against hypertension, shown below, reveals the relationship between the two. The race variable had a similar relationship with tobacco use and diabetes (heat maps shown in the appendix). Each heat map cell differs considerably in color from those around it, indicating an association, and potential dependence, between the demographic characteristics and the health issues. It follows that the regression analysis should separate the two in order to estimate each characteristic’s individual effect on infant mortality.

Finally, the following table shows how often each level of a characteristic appears in the dataset. These counts matter because the most common level of each characteristic serves as the baseline in the regression models. The count tables show that White race, age 30-34, college education, married, living in a metro area, and birth in a hospital are the most common levels of each characteristic. All tables follow this format and can be seen in the appendix.

Race Count
White 2686272
Black 568673
Asian 218221
Multiple 95512
Native 31770
Other 9601
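A minimal sketch of how the most common level can be made the regression baseline in R, using the race counts from the table above. The variable names are illustrative, and the use of `relevel()` is an assumption about one way to set the reference level, not a description of the code actually used for this report.

```r
# Race counts copied from the table above
race_counts <- c(White = 2686272, Black = 568673, Asian = 218221,
                 Multiple = 95512, Native = 31770, Other = 9601)

# The most common level becomes the baseline (reference) level
baseline <- names(which.max(race_counts))

# A factor with that level first is treated as the reference by glm()
race <- relevel(factor(names(race_counts)), ref = baseline)
levels(race)[1]
```

With the baseline set this way, every coefficient for race is interpreted relative to White mothers.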

Methodology

Chi-Square Test and Fisher Exact Test

Before creating a regression model, it is important to test whether the death rate differs between the levels within each characteristic. This will be done using a combination of Chi-Square Tests and Fisher Exact Tests. Both are well suited to testing for association in categorical data, which is the type of data used in this analysis. Fisher Exact Tests are computationally intensive, however, and are best used when the sample size is low. Chi-Square Tests, on the other hand, are less computationally intensive, and with a large enough sample their results closely approximate those of the Fisher Exact Test. The conventional p-value cutoff of 0.05 will be used to decide whether to reject the null hypothesis that the infant mortality rate is the same across levels. Of the nine predictor variables (including tobacco use, pre-pregnancy hypertension, and pre-pregnancy diabetes), only the hospital variable had few enough observations to warrant the use of a Fisher Exact Test. A Chi-Square Test was used for the other eight variables.
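The choice between the two tests can be sketched as follows. The table is hypothetical, and the rule of thumb used here (switch to the exact test when any expected cell count falls below 5) is an assumption for illustration; the report does not state the exact criterion that was applied.

```r
# Hypothetical levels-by-outcome table (not the report's data)
tbl <- matrix(
  c(9950, 4980,   # alive, by group
    50,   20),    # dead, by group
  ncol = 2,
  dimnames = list(group = c("A", "B"), outcome = c("alive", "dead"))
)

# Expected cell counts under the null hypothesis of no association
expected <- chisq.test(tbl)$expected

# Assumed rule of thumb: use the exact test if any expected count is below 5
use_fisher <- any(expected < 5)
result <- if (use_fisher) fisher.test(tbl) else chisq.test(tbl)

use_fisher      # FALSE for this table
result$p.value  # compared against the 0.05 cutoff
```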

Poisson Regression

Because infant mortality is modeled as a rate, Poisson regression is suitable for estimating how much higher the infant mortality rate is for a specific demographic level than for the baseline. The exponentiated coefficients, reported here as odds ratios, are strictly rate ratios in a Poisson model. Each of the six characteristic predictor variables will first be modeled individually using a Poisson regression with infant mortality rate as the response variable and that characteristic as the predictor variable. However, as previous research has found, other health factors may independently contribute to the infant mortality rate. Tobacco use during pregnancy, pre-pregnancy hypertension (not to be confused with preeclampsia, which is a pregnancy-induced blood pressure condition), and pre-pregnancy diabetes have been linked to a higher infant mortality rate. To control for these positive predictors, each characteristic variable will then be modeled again, adjusted for tobacco use, pre-pregnancy hypertension, and pre-pregnancy diabetes. The adjusted model will keep infant mortality rate as the response variable, but will have the characteristic of interest, tobacco use, pre-pregnancy hypertension, and pre-pregnancy diabetes as predictor variables. The slope coefficient for the characteristic will then be the slope for that variable holding each of the three health factors constant. Variables with a p-value below the 0.05 significance threshold in the adjusted Poisson regression model will be kept for the final model and displayed in a results table.

The unadjusted Poisson regression model is as follows:

\[\log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_n x_{ni}\]
\[\lambda_i = \exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_n x_{ni})\]
\[Y_i \sim \text{Poisson}(\lambda_i)\]

Where the index of observation is each demographic grouping \(i\) with data \(x_{i}\) and infant mortality \(\lambda_i\), and \(\beta_1, \dots, \beta_n\) are the coefficients for the non-baseline levels of the race, age, education, metro, married, or hospital variable. Baseline characteristics are the most common characteristics in the dataset, which are: White, Age 30-34, College education, Married, Metro, and Born in Hospital.
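The difference between an unadjusted and an adjusted fit can be sketched in R as below. This is an illustration under stated assumptions, not the report's code: the data frame `toy` and its column names are hypothetical, and the sketch models death counts with a `log(births)` offset, a common formulation for rates, whereas the model calls shown in the appendix use `death_rate` directly as the response.

```r
# Hypothetical grouped data (not the report's dataset). Deaths are constructed so that
# the "exposed" level has exactly twice the baseline rate and tobacco use multiplies
# the rate by 1.5; tobacco use is more common in the exposed level.
toy <- data.frame(
  group   = factor(c("baseline", "baseline", "exposed", "exposed")),
  tobacco = factor(c("No", "Yes", "No", "Yes")),
  births  = c(30000, 10000, 10000, 30000),
  deaths  = c(150, 75, 100, 450)
)

# Unadjusted model: the characteristic only. The offset turns counts into rates.
unadjusted <- glm(deaths ~ group + offset(log(births)),
                  family = poisson, data = toy)

# Adjusted model: the characteristic plus the health factor being controlled for.
adjusted <- glm(deaths ~ group + tobacco + offset(log(births)),
                family = poisson, data = toy)

# Exponentiated coefficients are ratios relative to the baseline level.
ratio_unadjusted <- exp(coef(unadjusted))[["groupexposed"]]
ratio_adjusted   <- exp(coef(adjusted))[["groupexposed"]]
ci_adjusted      <- exp(confint.default(adjusted))["groupexposed", ]  # Wald 95% interval

round(c(unadjusted = ratio_unadjusted, adjusted = ratio_adjusted), 2)
```

In this toy example the unadjusted ratio is about 2.44, while the adjusted ratio recovers the constructed value of 2.00, because part of the unadjusted difference is explained by tobacco use.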

Stepwise Backwards Selection

Characteristic variables found to be significantly associated with infant mortality rate will all be included in another Poisson regression model, still with infant mortality rate as the response variable. A stepwise backwards selection method, which starts with the full model containing all predictors and iteratively removes non-significant variables, will be used to determine the predictor variables most strongly associated with infant mortality rate. The Akaike Information Criterion (AIC) of each model and the statistical significance of each variable will be assessed at every step. Any variable with a p-value above the 0.05 significance threshold will be removed, and the model will be rerun on the remaining variables. If, after the non-significant variable is removed, the remaining variables are all statistically significant and removing a further variable does not decrease the AIC, the model will be considered the “best” at modeling the data.
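A backward selection loop of this kind can be sketched with R's built-in tools. Note the assumptions: the data are simulated, the column names are hypothetical, and `step()` drops terms by AIC alone rather than by the p-value rule described above, so this is a related procedure rather than the exact one used in the report. `drop1()` is included because it reports a p-value for removing each term.

```r
set.seed(1)

# Simulated grouped data (hypothetical): race and education affect the rate, noise does not
n <- 200
toy <- data.frame(
  race      = factor(sample(c("A", "B"), n, replace = TRUE)),
  education = factor(sample(c("college", "high_school"), n, replace = TRUE)),
  noise     = factor(sample(c("x", "y"), n, replace = TRUE)),
  births    = 10000
)
rate <- 0.005 *
  ifelse(toy$race == "B", 2, 1) *
  ifelse(toy$education == "high_school", 1.5, 1)
toy$deaths <- rpois(n, lambda = rate * toy$births)

# Full model containing every candidate predictor
full <- glm(deaths ~ race + education + noise + offset(log(births)),
            family = poisson, data = toy)

# Likelihood-ratio p-value for dropping each term from the full model
drop1(full, test = "Chisq")

# AIC-based backward elimination starting from the full model
final <- step(full, direction = "backward", trace = 0)
formula(final)
```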

Results

After completion of Chi-Square or Fisher Exact Tests on each predictor variable, all variables showed statistically significant evidence to reject the null hypothesis that the infant mortality rate is the same across the levels of the characteristic. Based on the results of each test, it can be concluded that the infant mortality rate is associated with each characteristic. The results of all Chi-Square and Fisher Exact Tests are located in the appendix.

First, tobacco use, pre-pregnancy hypertension, and pre-pregnancy diabetes were each independently associated with an increased likelihood of infant mortality and were significant predictors of higher infant mortality rates in each of the adjusted models. This analysis therefore supports prior research linking tobacco use, pre-pregnancy hypertension, and pre-pregnancy diabetes to increased infant mortality. The odds ratios associated with each are displayed in the table below:

Health Issue Odds Ratio 95% Confidence Interval
Tobacco Use 1.99 (1.71,2.32)
Diabetes 1.74 (1.46,2.07)
Hypertension 1.54 (1.34,1.78)

The unadjusted and adjusted Poisson regression models revealed that, compared to the baseline of White, age 30-34, at least a college education, married, and birth in a hospital, the following were all significantly associated with an increase in infant mortality rate: Black or African American and American Indian or Alaskan Native race, age under 15, age 15-19, age 20-24, age 45-49, some college education, high school education, some high school education, unmarried, and birth not in hospital. Compared to baseline, the following were the most strongly associated with an increase in infant mortality rate: some high school education is associated with 2.42 (95% CI: [1.87, 3.13]) times the infant mortality rate in the unadjusted model, only completed high school education is associated with 2.35 (95% CI: [1.83, 3.02]) times the infant mortality rate in the unadjusted model, age under 15 is associated with 2.36 (95% CI: [1.32, 4.23]) times the infant mortality rate in the adjusted model, and birth not in hospital is associated with 2.24 (95% CI: [1.73, 2.90]) times the infant mortality rate in the adjusted model. The full results are shown in the table below:

Odds Ratios for Characteristics (Individual Poisson Regression)
Unadjusted odds ratios are from a Poisson regression model for infant mortality rates (out of 1000). Adjusted odds ratios adjust for tobacco use, diabetes, and hypertension, which have all been shown to be positively associated with increased infant mortality. Baseline characteristics are the most common characteristics in the dataset, which are: White, Age 30-34, College education, Married, and Born in Hospital.
Characteristic Model Type Odds Ratio 95% Confidence Interval
Race
Black or African American Unadjusted 1.94 (1.66,2.27)
Black or African American Adjusted 1.92 (1.64,2.25)
American Indian or Alaskan Native Unadjusted 1.14 (0.9,1.45)
American Indian or Alaskan Native Adjusted 1.38 (1.07,1.78)
Age
Under 15 Unadjusted 1.25 (0.71,2.21)
Under 15 Adjusted 2.36 (1.32,4.23)
15-19 Unadjusted 0.92 (0.68,1.24)
15-19 Adjusted 1.45 (1.06,1.99)
20-24 Unadjusted 1.14 (0.92,1.41)
20-24 Adjusted 1.38 (1.11,1.72)
45-49 Unadjusted 1.13 (0.73,1.75)
45-49 Adjusted 2.12 (1.35,3.34)
Education
Some College Unadjusted 1.37 (1.01,1.87)
Some College Adjusted 1.15 (0.84,1.57)
High School Unadjusted 2.35 (1.83,3.02)
High School Adjusted 1.75 (1.34,2.29)
Some High School Unadjusted 2.42 (1.87,3.13)
Some High School Adjusted 1.97 (1.51,2.56)
Marriage Status
Unmarried Unadjusted 2.12 (1.65,2.73)
Unmarried Adjusted 1.83 (1.63,2.75)
Birthplace
Not In Hospital Adjusted 2.24 (1.73,2.9)
Only characteristics with p-values of less than 0.05 for both the adjusted and unadjusted models are included.
Birthplace had a p-value of less than 0.05 only for the adjusted model.
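As a worked check on how a table entry relates to a model coefficient, the sketch below converts the `raceBlack` estimate and standard error from the unadjusted race model output in the appendix into a ratio and interval. The use of a Wald interval (estimate plus or minus about 1.96 standard errors) is an assumption; the report does not state which interval method produced the table.

```r
# raceBlack row of the unadjusted race model output in the appendix
estimate  <- 0.662753
std_error <- 0.080261

# Exponentiate the coefficient and the ends of an assumed Wald 95% interval
ratio <- exp(estimate)
ci    <- exp(estimate + c(-1, 1) * qnorm(0.975) * std_error)

round(c(ratio = ratio, lower = ci[1], upper = ci[2]), 2)
```

This gives 1.94 with an interval of (1.66, 2.27), matching the unadjusted Black or African American row in the table above.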

Stepwise backwards regression was done on the following Poisson regression model:

\[\log(\lambda_i) = \beta x_{i}\]

Where the index of observation is each demographic grouping \(i\) with data \(x_{i}\) and infant mortality \(\lambda_i\), and \(\beta\) is the vector of regression coefficients \([\beta_{\text{race = Black}}, \beta_{\text{race = Native}}, \beta_{\text{age = Under 15}}, \beta_{\text{age = 15-19}}, \beta_{\text{age = 20-24}}, \beta_{\text{age = 45-49}}, \beta_{\text{education = some college}}, \beta_{\text{education = high school}}, \beta_{\text{education = some high school}}, \beta_{\text{married = unmarried}}, \beta_{\text{hospital = not in hospital}}]\)

The first step found that age was not a significant predictor of infant mortality rate when controlling for race, education, whether the mother was married, and whether the birth took place in a hospital or not. It was thus removed from the model. Poisson regression was then done again using the following Poisson regression model:

\[\log(\lambda_i) = \beta x_{i}\]

Where the index of observation is each demographic grouping \(i\) with data \(x_{i}\) and infant mortality \(\lambda_i\), and \(\beta\) is the vector of regression coefficients \([\beta_{\text{race = Black}}, \beta_{\text{race = Native}}, \beta_{\text{education = some college}}, \beta_{\text{education = high school}}, \beta_{\text{education = some high school}}, \beta_{\text{married = unmarried}}, \beta_{\text{hospital = not in hospital}}]\)

No variables in the second model were found to have a p-value above the significance threshold of 0.05 and removing the least significant variable from this model resulted in an increase in the AIC of the model. This model was then deemed final and the odds-ratio results for each predictor variable are displayed below.

Odds Ratios for Characteristics after Backwards Regression
Unadjusted odds ratios are from a poisson regression model after backwards selection for infant mortality rates (out of 1000). Baseline characteristics are for most common characteristics in the dataset which are: White, College education, Married, and Born in Hospital
Characteristic Odds Ratio 95% Confidence Interval
Race
Black or African American 1.74 (1.53,1.98)
American Indian or Alaskan Native 1.32 (1.05,1.66)
Education
Some College 1.31 (1.11,1.54)
High School 1.51 (1.3,1.76)
Some High School 1.70 (1.44,2.01)
Marriage Status
Unmarried 1.28 (1.14,1.43)
Birthplace
Not In Hospital 2.14 (1.63,2.82)
Only characteristics with p-values of less than 0.05 for the adjusted and unadjusted Poisson regression models and the backwards regression model are included.

As seen in the table above, the predictors that had the strongest relationship with an increase in the infant mortality rate compared to baseline were the following: birth not in hospital is associated with 2.14 (95% CI: [1.63, 2.82]) times the infant mortality rate, Black or African American race is associated with 1.74 (95% CI: [1.53, 1.98]) times the infant mortality rate, and only some high school education is associated with 1.70 (95% CI: [1.44, 2.01]) times the infant mortality rate.

Discussion

Conclusion and Potential Solutions

Based on the results of the individual Poisson regression models and the stepwise backward selection, it was found that, along with the known risk factors of tobacco use during pregnancy, hypertension prior to pregnancy, and diabetes prior to pregnancy, the following are all associated with a higher likelihood of infant mortality: Black or African American and American Indian or Alaskan Native mothers compared to White mothers, unmarried mothers compared to married mothers, and mothers giving birth outside a hospital compared to mothers giving birth in a hospital. Fewer years of maternal education are also associated with a higher likelihood of infant mortality.

The information in this analysis has large implications for the US health system and US health policy on infant care. Efforts are needed in communities that have large Black or African American and American Indian or Alaskan Native populations, or where access to a hospital is sparse, to provide the support systems required for infant care. Furthermore, support systems for single mothers should be expanded, as infants in that population are more likely to die within a year of birth. A lack of education is also correlated with increased infant mortality, so mothers with less education should have additional resources to care for their child through the first year of life and beyond.

Several public health policies could be implemented to target infant mortality more specifically and alleviate stress on the health care system. Based on the factors that contribute to infant mortality, a three-tiered approach to improving infant care could be applied.

The first tier is access: increase access to prenatal and postnatal care and to child-raising support systems. Because babies born outside hospitals have a much greater chance of dying within a year of birth, increasing access to hospitals or care centers could substantially increase the likelihood of infant survival. Mothers without immediate social support, in this case the presence of a partner, have also shown higher rates of infant mortality. Providing mothers with more immediate community support to raise a child may help decrease the infant mortality rate.

The second tier is literacy. Studies show that Black or African American people and people with lower levels of education have lower levels of health literacy. Educational campaigns focused on post-pregnancy maternal and infant health and targeted at populations with low health literacy may help improve the infant mortality rate. Education is important for health care providers as well. Cultural competency training can help providers understand the diverse needs of mothers from underrepresented populations, such as mothers who are Black or African American, American Indian or Alaskan Native, unmarried, or without a college education. A greater understanding of these populations could lead to better care and a higher infant survival rate.

The last, and perhaps most important, tier is economic support. Although the Women, Infants, and Children (WIC) program already provides economic support to new mothers, increasing funding to this program may help mothers more feasibly care for their newborns. Increasing funding to paid leave programs may also help mothers who need to work for financial reasons give their infants proper care. Finally, increased economic support may improve quality of living for underrepresented groups and may give their children a higher chance of survival past the first year of life.

Limitations

There are limitations in the analysis done in this paper, and most stem from the process of data collection and the data itself. The data do not provide information for individual patients, which prevents a more robust analysis of each predictor and its effect on infant mortality. The data are instead grouped by demographic characteristics, and a data request can include a maximum of five predictors. This prevented building a model of death rate with more than five predictors. This limitation could lead to the exclusion of at least one predictor that may have had an effect on infant mortality and could have slightly skewed the odds-ratio results. Another limitation is that several infant mortality rates were deemed unreliable. Because each pairing of categories appeared only once in the dataset, imputation of the missing data was not possible, and rows with ‘unreliable’ infant mortality rates were instead deleted. This deletion could have a detrimental effect on the Poisson regression model and may have greatly reduced its power. There are also limitations to using a Poisson regression model. The model assumes that the mean and variance of the infant mortality rate are equal, which may not be the case. It also assumes a constant infant mortality rate over time, which empirically is not the case, as the infant mortality rate changes from year to year. Finally, any outliers in the data have a greater chance of skewing the estimates in the Poisson regression model.
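The equal mean and variance assumption can be checked roughly from the model output. The sketch below uses the residual deviance and degrees of freedom reported for the unadjusted race model in the appendix. Treating the ratio of the two as a dispersion indicator is a common rule of thumb and an assumption here, and because that model was fitted to rates rather than counts, the value is indicative only.

```r
# Values from the unadjusted race model output in the appendix
residual_deviance <- 150.95
residual_df       <- 73

# A ratio well above 1 suggests more variability than the Poisson model assumes
dispersion_ratio <- residual_deviance / residual_df
round(dispersion_ratio, 2)
```

The ratio is about 2.07, which would be consistent with the concern raised above that the variance exceeds the mean.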

Future Work

There is vast opportunity for further research on this subject. Public policies are continuously changing and so are infant mortality rates. The infant mortality rate in the United States was 99.9 per 1,000 live births in 1916 and 6.9 in 2000, and is currently 5.6. A future study of data from 2022 onward could show how the impact of each variable has changed compared to 2017-2021. Further studies could also use data on individual mothers for a more accurate and complete analysis of predictor variables, their interactions with other predictor variables, and their association with infant mortality rate. This would also avoid the problem of unreliable data and would allow more than 5 predictors to be included in the model. A final avenue for future research would be an experimental study of how increased social support, greater resources for prenatal and postnatal care, or increased economic support for new mothers affects the infant mortality rate. Such a study could go a long way toward showing the positive effect these three support systems have on decreasing infant mortality and could be more influential in changing US public health policy.

Citations

America’s Health Literacy: Why We Need Accessible Health Information. An Issue Brief From the U.S. Department of Health and Human Services. 2008.

Cadez-Martin, A. & Tan, B. & Fox, S. & Matusko, N. & Gadepalli, S., (2022) “Effects of Social Determinants of Health on Infant Mortality in Washtenaw and Wayne County, Michigan”, Undergraduate Journal of Public Health 6. doi: https://doi.org/10.3998/ujph.2313

CDC. (2022, July 14). Type 1 or Type 2 Diabetes and Pregnancy | CDC. Centers for Disease Control and Prevention. https://www.cdc.gov/pregnancy/diabetes-types.html#:~:text=Early%20(Preterm)%20Birth

CDCTobaccoFree. (2019, May 29). Smoking During Pregnancy. Centers for Disease Control and Prevention. https://www.cdc.gov/tobacco/basic_information/health_effects/pregnancy/index.htm#:~:text=Health%20Effects%20of%20Smoking%20and%20Secondhand%20Smoke%20on%20Babies

Chaudhry, S. I., Herrin, J., Phillips, C., Butler, J., Mukerjhee, S., Murillo, J., Onwuanyi, A., Seto, T. B., Spertus, J., & Krumholz, H. M. (2011). Racial disparities in health literacy and access to care among patients with heart failure. Journal of cardiac failure, 17(2), 122–127. https://doi.org/10.1016/j.cardfail.2010.09.016

Federal Resources for Women. (n.d.). DOL. https://www.dol.gov/agencies/wb/federal-agency-resources#:~:text=Women%2C%20Infants%20and%20Children%20Program

How might high blood pressure affect you and your baby? (2022, July 23). Mayo Clinic. https://www.mayoclinic.org/healthy-lifestyle/pregnancy-week-by-week/in-depth/pregnancy/art-20046098#:~:text=High%20blood%20pressure%20during%20pregnancy%20poses%20the%20following%20risks%3A

Infant Mortality | Maternal and Infant Health | Reproductive Health | CDC. (2022, September 8). Www.cdc.gov. https://www.cdc.gov/reproductivehealth/maternalinfanthealth/infantmortality.htm#:~:text=Almost%2020%2C000%20infants%20died%20in

Orischak, M., Fru, D., Kelly, E., & DeFranco, E. (2022). Social determinants of infant mortality amongst births to non-Hispanic Black women. American Journal of Obstetrics and Gynecology, 226(1). Elsevier. https://doi.org/10.1016/j.ajog.2021.11.1164

Reno, R., & Hyder, A. (2018). The Evidence Base for Social Determinants of Health as Risk Factors for Infant Mortality: A Systematic Scoping Review. Journal of Health Care for the Poor and Underserved 29(4), 1188-1208. https://doi.org/10.1353/hpu.2018.0091.

Singh, G. K., & Yu, S. M. (2019). Infant Mortality in the United States, 1915-2017: Large Social Inequalities have Persisted for Over a Century. International journal of MCH and AIDS, 8(1), 19–31. https://doi.org/10.21106/ijma.271

Appendix

Exploratory Data Analysis

Age Count
30-34 1103128
25-29 1014095
20-24 641748
35-39 582185
15-19 142165
40-44 119313
45-49 6311
Under 15 1104

Place of Living Count
Metro 2742674
Nonmetro 483084

Marriage Status Count
Married 1934275
Unmarried 1291483

Education Count
College 1576678
High School 939527
Some College 667617
Some High School 288376
Middle School 105627

Birthplace Count
In Hospital 3503100
Not in Hospital 74725

Chi-Square and Fisher Exact Test Results

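library(dplyr)  # provides group_by() and summarise()

# For each test below, build a levels-by-outcome table of infants alive vs. dead at one year
tobacco_use_data <- data |>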
  group_by(tobacco_use) |>
  summarise(alive = sum(n) - sum(n_deaths),
          dead = sum(n_deaths))

tobacco_use_table <- matrix(c(tobacco_use_data$alive, tobacco_use_data$dead), ncol = 2)

tobacco_result <- chisq.test(tobacco_use_table)
tobacco_result #tobacco use associated with higher infant mortality rates
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tobacco_use_table
## X-squared = 1356.7, df = 1, p-value < 2.2e-16

hypertension_data <- data |>
  group_by(hypertension) |>
  summarise(alive = sum(n) - sum(n_deaths),
          dead = sum(n_deaths))

hypertension_table <- as.table(matrix(c(hypertension_data$alive, hypertension_data$dead), ncol = 2))

hypertension_result <- chisq.test(hypertension_table)
hypertension_result
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hypertension_table
## X-squared = 296.89, df = 1, p-value < 2.2e-16

diabetes_data <- data |>
  group_by(diabetes) |>
  summarise(alive = sum(n) - sum(n_deaths),
          dead = sum(n_deaths))

diabetes_table <- as.table(matrix(c(diabetes_data$alive, diabetes_data$dead), ncol = 2))

diabetes_result <- chisq.test(diabetes_table)
diabetes_result
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  diabetes_table
## X-squared = 142.66, df = 1, p-value < 2.2e-16

# All three tests have very small p-values: tobacco use, hypertension, and diabetes are each significantly associated with infant mortality

hospital_data <- data2 |>
  group_by(hospital) |>
  summarise(alive = sum(n) - sum(n_deaths),
          dead = sum(n_deaths))

hospital_table <- as.table(matrix(c(hospital_data$alive, hospital_data$dead), ncol = 2))

hospital_result <- fisher.test(hospital_table)
hospital_result
## 
##  Fisher's Exact Test for Count Data
## 
## data:  hospital_table
## p-value = 0.01294
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7829404 0.9728141
## sample estimates:
## odds ratio 
##  0.8739892

married_data <- data1 |>
  group_by(married) |>
  summarise(alive = sum(n) - sum(n_deaths),
          dead = sum(n_deaths))

married_table <- as.table(matrix(c(married_data$alive,  married_data$dead), ncol = 2))

married_result <- chisq.test(married_table)
married_result
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  married_table
## X-squared = 2123, df = 1, p-value < 2.2e-16

metro_data <- data1 |>
  group_by(metro) |>
  summarise(alive = sum(n) - sum(n_deaths),
          dead = sum(n_deaths))

metro_table <- as.table(matrix(c(metro_data$alive,  metro_data$dead), ncol = 2))

metro_result <- chisq.test(metro_table)
metro_result
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  metro_table
## X-squared = 76.917, df = 1, p-value < 2.2e-16

age_data <- data |>
  group_by(age) |>
  summarise(alive = sum(n) - sum(n_deaths),
          dead = sum(n_deaths))

age_table <- as.table(matrix(c(age_data$alive,  age_data$dead), ncol = 2))

age_result <- chisq.test(age_table)
age_result
## 
##  Pearson's Chi-squared test
## 
## data:  age_table
## X-squared = 872.91, df = 7, p-value < 2.2e-16

race_data <- data |>
  group_by(race) |>
  summarise(alive = sum(n) - sum(n_deaths),
          dead = sum(n_deaths))

race_table <- as.table(matrix(c(race_data$alive,  race_data$dead), ncol = 2))

race_result <- chisq.test(race_table)
race_result
## 
##  Pearson's Chi-squared test
## 
## data:  race_table
## X-squared = 2806, df = 5, p-value < 2.2e-16
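# Collapse the stratified counts to one row per maternal education level.
education_data <- data2 |>
  group_by(education) |>
  summarise(alive = sum(n) - sum(n_deaths),
            dead = sum(n_deaths))

# 5 x 2 table: rows are the education levels, columns are alive and dead.
education_table <- as.table(matrix(c(education_data$alive, education_data$dead), ncol = 2))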

education_result <- chisq.test(education_table)
education_result
## 
##  Pearson's Chi-squared test
## 
## data:  education_table
## X-squared = 2316.9, df = 4, p-value < 2.2e-16
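
The six tests above repeat the same three steps: collapse counts by group, build an alive/dead table, and test for association. With samples this large, p-values will be tiny even for modest differences, so it can help to report the group-level rates next to the test. The helper below is a base-R sketch of that pattern. The function name `association_test` and the toy data are hypothetical; only the column names `n` and `n_deaths` follow the code above.

```r
# Hypothetical stratified counts, NOT from the study data.
toy_counts <- data.frame(
  married  = c("Married", "Married", "Unmarried", "Unmarried"),
  n        = c(50000, 30000, 40000, 20000),
  n_deaths = c(200, 130, 280, 150)
)

# Collapse counts by one grouping variable, then return the alive/dead table,
# the deaths per 1,000 births in each group, and a chi-squared test.
association_test <- function(df, var) {
  counts <- rowsum(
    cbind(alive = df$n - df$n_deaths, dead = df$n_deaths),
    group = df[[var]]
  )
  list(
    counts        = counts,
    rate_per_1000 = 1000 * counts[, "dead"] / rowSums(counts),
    test          = chisq.test(counts)
  )
}

res <- association_test(toy_counts, "married")
res$counts
res$rate_per_1000
res$test$p.value
```

The same call could be reused for each grouping variable, for example `association_test(data1, "metro")`, assuming the data frames carry the `n` and `n_deaths` columns used above.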

Full Model Results

## 
## Call:
## glm(formula = death_rate ~ race, family = "poisson", data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.168445   0.063907  33.931  < 2e-16 ***
## raceAsian    -0.635793   0.200194  -3.176  0.00149 ** 
## raceBlack     0.662753   0.080261   8.258  < 2e-16 ***
## raceMultiple  0.007619   0.135164   0.056  0.95505    
## raceNative    0.134721   0.123242   1.093  0.27433    
## raceOther    -0.256081   0.230927  -1.109  0.26746    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 272.67  on 78  degrees of freedom
## Residual deviance: 150.95  on 73  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4
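
Because a Poisson model uses a log link, each coefficient is a log ratio of fitted mean `death_rate` relative to the reference race category (the level that does not appear in the table). The sketch below exponentiates the estimates printed above and adds approximate Wald 95% intervals. The numbers are copied from the output; the object names are illustrative.

```r
# Estimates and standard errors copied from the race-only model output.
est <- c(Asian = -0.635793, Black = 0.662753, Multiple = 0.007619,
         Native = 0.134721, Other = -0.256081)
se  <- c(Asian = 0.200194, Black = 0.080261, Multiple = 0.135164,
         Native = 0.123242, Other = 0.230927)

# Wald interval on the log scale, then back-transformed.
z <- qnorm(0.975)
race_ratios <- cbind(
  ratio = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)

round(race_ratios, 2)
```

Read this way, the Black coefficient corresponds to a ratio of roughly 1.94 and the Asian coefficient to roughly 0.53, each relative to the reference category.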
## 
## Call:
## glm(formula = death_rate ~ race + tobacco_use + hypertension + 
##     diabetes, family = "poisson", data = data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.71711    0.08617  19.928  < 2e-16 ***
## raceAsian       -0.18445    0.20837  -0.885    0.376    
## raceBlack        0.65049    0.08071   8.059 7.67e-16 ***
## raceMultiple     0.16414    0.14021   1.171    0.242    
## raceNative       0.32002    0.12884   2.484    0.013 *  
## raceOther        0.19526    0.23805   0.820    0.412    
## tobacco_useYes   0.64939    0.08044   8.073 6.84e-16 ***
## hypertensionYes  0.32167    0.07856   4.095 4.23e-05 ***
## diabetesYes      0.46131    0.09286   4.968 6.78e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 272.674  on 78  degrees of freedom
## Residual deviance:  69.666  on 70  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4
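
One way to gauge how much the three health covariates add to the race-only model is to compare residual deviances. The sketch below uses only the deviances and degrees of freedom printed in the two race models. Treating the difference as chi-squared is an approximation, and a rough one here because the response is a rate (`death_rate`) rather than an integer count.

```r
# Residual deviances and degrees of freedom copied from the two outputs above.
dev_race_only <- 150.95
df_race_only  <- 73
dev_adjusted  <- 69.666
df_adjusted   <- 70

# Drop in deviance from adding tobacco_use, hypertension, and diabetes.
dev_drop <- dev_race_only - dev_adjusted
df_drop  <- df_race_only - df_adjusted

# Approximate likelihood-ratio p-value.
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

c(deviance_drop = dev_drop, df = df_drop, p_value = p_value)
```

With the fitted model objects available, `anova(fit_small, fit_large, test = "Chisq")` performs the same comparison; the object names `fit_small` and `fit_large` are hypothetical.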
## 
## Call:
## glm(formula = death_rate ~ age, family = "poisson", data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.31652    0.07616  30.415   <2e-16 ***
## age15        0.22360    0.29096   0.768   0.4422    
## age15-19    -0.08460    0.15391  -0.550   0.5825    
## age20-24     0.13533    0.10933   1.238   0.2158    
## age25-29     0.14529    0.10550   1.377   0.1685    
## age35-39     0.18536    0.10613   1.747   0.0807 .  
## age40-44    -0.00697    0.13496  -0.052   0.9588    
## age45-49     0.11996    0.22257   0.539   0.5899    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 272.67  on 78  degrees of freedom
## Residual deviance: 265.90  on 71  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = death_rate ~ age + tobacco_use + hypertension + 
##     diabetes, family = "poisson", data = data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.68317    0.09915  16.977  < 2e-16 ***
## age15            0.85695    0.29780   2.878  0.00401 ** 
## age15-19         0.37134    0.16098   2.307  0.02107 *  
## age20-24         0.32215    0.11203   2.876  0.00403 ** 
## age25-29         0.19152    0.10579   1.810  0.07023 .  
## age35-39         0.11391    0.10672   1.067  0.28582    
## age40-44         0.26663    0.13791   1.933  0.05319 .  
## age45-49         0.75331    0.23144   3.255  0.00113 ** 
## tobacco_useYes   0.77229    0.08184   9.436  < 2e-16 ***
## hypertensionYes  0.52368    0.07780   6.731 1.68e-11 ***
## diabetesYes      0.65366    0.09280   7.044 1.87e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 272.67  on 78  degrees of freedom
## Residual deviance: 126.15  on 68  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = death_rate ~ metro, family = "poisson", data = data1)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    2.55255    0.06977   36.59   <2e-16 ***
## metroNonmetro -0.13990    0.11758   -1.19    0.234    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 83.762  on 25  degrees of freedom
## Residual deviance: 82.328  on 24  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = death_rate ~ metro + tobacco_use + hypertension + 
##     diabetes, family = "poisson", data = data1)
## 
## Coefficients: (1 not defined because of singularities)
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         1.6751     0.1576  10.628  < 2e-16 ***
## metroNonmetro                       0.1254     0.1284   0.976    0.329    
## tobacco_useYes                      0.5069     0.1272   3.984 6.78e-05 ***
## hypertensionUnknown or Not Stated   1.0288     0.2415   4.260 2.04e-05 ***
## hypertensionYes                     0.5120     0.1205   4.249 2.15e-05 ***
## diabetesUnknown or Not Stated           NA         NA      NA       NA    
## diabetesYes                         0.6486     0.1264   5.132 2.86e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 83.762  on 25  degrees of freedom
## Residual deviance: 32.340  on 20  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4
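
The `NA` row for `diabetesUnknown or Not Stated` is what R reports when a column of the design matrix is an exact linear combination of other columns. A plausible cause, which cannot be confirmed from the output alone, is that the same rows are coded `Unknown or Not Stated` for both hypertension and diabetes. The toy example below reproduces that situation with invented data; none of these values come from `data1`, and the level name is shortened to `Unknown`.

```r
# Invented data: "Unknown" occurs on exactly the same rows for both factors.
toy_alias <- data.frame(
  hypertension = factor(c("No", "Yes", "No", "Yes", "Unknown", "Unknown")),
  diabetes     = factor(c("No", "No", "Yes", "Yes", "Unknown", "Unknown")),
  deaths       = c(10, 18, 16, 30, 5, 7)
)

# The cross-tabulation shows that "Unknown" never pairs with "No" or "Yes".
table(toy_alias$hypertension, toy_alias$diabetes)

alias_fit <- glm(deaths ~ hypertension + diabetes,
                 family = poisson, data = toy_alias)

# The second of the two identical indicator columns cannot be estimated.
coef(alias_fit)
```

If the real data behave the same way, cross-tabulating `hypertension` against `diabetes` in `data1` would show it, and the unknown rows could then be dropped or handled as a single category before fitting.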
## 
## Call:
## glm(formula = death_rate ~ married, family = "poisson", data = data1)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        2.0016     0.1108  18.060  < 2e-16 ***
## marriedUnmarried   0.7527     0.1286   5.855 4.76e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 83.762  on 25  degrees of freedom
## Residual deviance: 45.686  on 24  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = death_rate ~ married + tobacco_use + hypertension + 
##     diabetes, family = "poisson", data = data1)
## 
## Coefficients: (1 not defined because of singularities)
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         1.4903     0.1468  10.150  < 2e-16 ***
## marriedUnmarried                    0.6036     0.1331   4.536 5.74e-06 ***
## tobacco_useYes                      0.3518     0.1237   2.844 0.004458 ** 
## hypertensionUnknown or Not Stated   0.8670     0.2237   3.876 0.000106 ***
## hypertensionYes                     0.4346     0.1201   3.619 0.000296 ***
## diabetesUnknown or Not Stated           NA         NA      NA       NA    
## diabetesYes                         0.4986     0.1230   4.053 5.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 83.762  on 25  degrees of freedom
## Residual deviance: 11.219  on 20  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = death_rate ~ education, family = "poisson", data = data2)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.9038     0.0965  19.728  < 2e-16 ***
## educationSome College       0.3176     0.1575   2.017   0.0437 *  
## educationHigh School        0.8547     0.1279   6.683 2.34e-11 ***
## educationSome High School   0.8822     0.1305   6.762 1.36e-11 ***
## educationMiddle School      0.1303     0.2050   0.635   0.5251    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 214.52  on 43  degrees of freedom
## Residual deviance: 143.20  on 39  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = death_rate ~ education + tobacco_use + hypertension + 
##     diabetes, family = "poisson", data = data2)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                1.54869    0.11398  13.587  < 2e-16 ***
## educationSome College      0.13695    0.16020   0.855  0.39262    
## educationHigh School       0.55725    0.13628   4.089 4.33e-05 ***
## educationSome High School  0.67943    0.13459   5.048 4.46e-07 ***
## educationMiddle School     0.19109    0.20803   0.919  0.35833    
## tobacco_useYes             0.70042    0.10698   6.547 5.85e-11 ***
## hypertensionYes            0.30297    0.09608   3.153  0.00161 ** 
## diabetesYes                0.49589    0.10914   4.543 5.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 214.519  on 43  degrees of freedom
## Residual deviance:  87.326  on 36  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = death_rate ~ hospital, family = "poisson", data = data2)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              2.33451    0.05261  44.377   <2e-16 ***
## hospitalNot in Hospital  0.19435    0.10783   1.802   0.0715 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 214.52  on 43  degrees of freedom
## Residual deviance: 211.38  on 42  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = death_rate ~ hospital + tobacco_use + hypertension + 
##     diabetes, family = "poisson", data = data2)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               1.3916     0.1159  12.004  < 2e-16 ***
## hospitalNot in Hospital   0.8067     0.1326   6.083 1.18e-09 ***
## tobacco_useYes            1.0162     0.1030   9.863  < 2e-16 ***
## hypertensionYes           0.5864     0.1080   5.431 5.62e-08 ***
## diabetesYes               0.8141     0.1157   7.035 1.99e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 214.519  on 43  degrees of freedom
## Residual deviance:  85.305  on 39  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 5
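
Every model above reports `AIC: Inf`. R produces this when a Poisson `glm` is given a non-integer response, which is consistent with `death_rate` being a rate rather than a count. A common alternative is to model the death counts directly and pass the log of the number of births as an offset. The sketch below uses invented data. The column names `n` and `n_deaths` mirror those in the test code earlier; the per-1,000 scaling of `death_rate` and everything else are assumptions.

```r
# Invented strata, NOT from the study data.
toy_births <- data.frame(
  tobacco_use = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  n           = c(52000, 48000, 61000, 9000, 7500, 11000),
  n_deaths    = c(265, 250, 300, 95, 70, 120)
)
toy_births$death_rate <- 1000 * toy_births$n_deaths / toy_births$n

# Rate as response: non-integer values leave the Poisson log-likelihood
# undefined, so R warns and the AIC is reported as Inf.
rate_fit <- suppressWarnings(
  glm(death_rate ~ tobacco_use, family = poisson, data = toy_births)
)

# Counts as response, with log(births) as an offset.
count_fit <- glm(n_deaths ~ tobacco_use + offset(log(n)),
                 family = poisson, data = toy_births)

AIC(rate_fit)
AIC(count_fit)

# Rate ratio for tobacco users versus non-users.
rate_ratio <- exp(coef(count_fit))[["tobacco_useYes"]]
rate_ratio
```

The offset form weights each stratum by its number of births and yields a finite AIC, so models can be compared on that scale. Whether it changes the conclusions drawn above would need to be checked on the actual data.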