Child Mortality

Author

Wilfried Bilong

The topic of this project is infant mortality. It’s an extremely sad subject, but it’s important to look at how children fare after they are born, and doing so may help us find trends that can be prevented in the future. I’ll be exploring infant mortality rates over the years and the variables that go with them. These include race, year, infant mortality rate, live births (the baby is alive at birth), neonatal deaths (deaths in the first 28 days of life), and postneonatal deaths (deaths from 28 days to 11 months old). The data set is named Infant Mortality, and it’s sourced directly from the CDC.

Data

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggfortify)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(readr)
infantmortality <- read_csv("Desktop/Data 110/infantmortality.csv")

Rows: 60 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Maternal Race or Ethnicity
dbl (8): Year, Infant Mortality Rate, Neonatal Mortality Rate, Postneonatal ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(RColorBrewer)
library(alluvial)
library(ggalluvial)

# Base canvas only: with no geom layer yet, this draws empty axes
ggplot(infantmortality, aes(x = `Infant Mortality Rate`, y = `Neonatal Mortality Rate`)) + 
  theme_minimal(base_size = 12)

p1 <- ggplot(infantmortality, aes(x = `Infant Mortality Rate`, y = `Neonatal Mortality Rate`)) + 
  labs(title = "Neonatal Vs Infant Mortality", caption = "Source = CDC") +
  theme_minimal(base_size = 12) 
p1 + geom_point()

Warning: Removed 10 rows containing missing values (`geom_point()`).
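
The warning above means some rows lack one of the two plotted rates. A quick way to see where missing values sit is sketched below on hypothetical toy data; the data frame `toy` and its column names `imr` and `nmr` are assumptions for illustration, not the columns of the CDC file.

```r
# Hypothetical toy data; column names are illustrative, not the CDC file's.
toy <- data.frame(
  imr = c(4.0, 5.0, NA, 6.0, 11.0),
  nmr = c(2.7, 3.3, 4.1, NA, 7.0)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows missing either value; a scatter plot of imr vs nmr drops these.
rows_dropped <- sum(!complete.cases(toy))
print(rows_dropped)
```

On the real data, the same idea would be colSums(is.na(infantmortality)), which should show which columns hold the missing values.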

A simple scatter plot lets us check whether there’s a linear relationship between the two variables. The result speaks for itself: although this is a scatter plot, the points are barely scattered at all, which indicates a very strong positive correlation.

p2 <- p1 + geom_point() + geom_smooth(color = "red")
p2

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 10 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 10 rows containing missing values (`geom_point()`).

p3 <- p2 + geom_smooth(method = 'lm', formula = y~x)
p3

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 10 rows containing non-finite values (`stat_smooth()`).
Removed 10 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 10 rows containing missing values (`geom_point()`).

In the graphs above we can see a loess smooth (p2) and a linear regression line (added in p3), each with a confidence band, supporting the point that there is a linear relationship between the Infant Mortality Rate and the Neonatal Mortality Rate.

cor(infantmortality$`Infant Mortality Rate`, infantmortality$`Neonatal Mortality Rate`)

[1] NA

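In this case the goal was to compute the correlation, but it came out as “NA”. The reason is that 10 rows have missing values, and cor() does not drop missing values by default. I chose instead to fit a linear model, which drops incomplete rows automatically, to measure the relationship and eventually write out an equation.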
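
For reference, cor() returns NA whenever either input contains a missing value, unless the use argument tells it to drop incomplete pairs. A minimal sketch on hypothetical toy data follows; the data frame `toy` and its columns are assumptions for illustration only.

```r
# Hypothetical toy data standing in for the two rate columns;
# the NA values mimic the rows with missing rates in the real file.
toy <- data.frame(
  imr = c(4.0, 5.0, 6.0, NA, 11.0),
  nmr = c(2.7, 3.3, 3.9, 4.1, NA)
)

# Default behaviour: any NA in either vector makes the result NA.
r_default <- cor(toy$imr, toy$nmr)

# Drop incomplete pairs before computing the correlation.
r_complete <- cor(toy$imr, toy$nmr, use = "complete.obs")

print(r_default)
print(r_complete)
```

Applied to the real columns with use = "complete.obs", the result should be close to the square root of the model's Multiple R-squared (0.9726), or about 0.986, since a one-predictor linear model and the correlation use the same complete rows.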

EQ1 <- lm(`Neonatal Mortality Rate` ~ `Infant Mortality Rate`, data = infantmortality)
summary(EQ1)


Call:
lm(formula = `Neonatal Mortality Rate` ~ `Infant Mortality Rate`, 
    data = infantmortality)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52434 -0.15839  0.00774  0.10615  0.53983 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.19487    0.08221    2.37   0.0218 *  
`Infant Mortality Rate`  0.61726    0.01494   41.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.237 on 48 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.9726,    Adjusted R-squared:  0.9721 
F-statistic:  1707 on 1 and 48 DF,  p-value: < 2.2e-16

I’ll be using NMR for Neonatal Mortality Rate and IMR for Infant Mortality Rate. Based on the regression output above, the equation of the fitted line is

NMR = 0.617(IMR) + 0.195

We can already see from the graphs that the relationship between IMR and NMR is almost completely linear. The slope of about 0.62 means that each one-point increase in IMR is associated with an increase of about 0.62 points in NMR. In other words, when an infant death occurs, it is more likely than not to have been neonatal (in the first 28 days of life).
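
To make the equation concrete, here is a small sketch that plugs a few IMR values into it. The coefficients are rounded from the regression output above; the helper name `predict_nmr` and the example IMR values are hypothetical, chosen only for illustration.

```r
# Coefficients copied (rounded) from the summary(EQ1) output.
intercept <- 0.195
slope <- 0.617

# Hypothetical helper: predicted NMR for a given IMR.
predict_nmr <- function(imr) slope * imr + intercept

# Example IMR values, for illustration only.
example_imr <- c(4, 5, 10)
predicted <- predict_nmr(example_imr)

# Predicted NMR and the share of IMR it represents.
print(data.frame(imr = example_imr,
                 nmr = predicted,
                 share = predicted / example_imr))
```

For an IMR of 5, the line gives 0.617 × 5 + 0.195 = 3.28. With the fitted model object itself, predict(EQ1, newdata = ...) should give the same kind of result using the unrounded coefficients.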

Data Visualization

ggalluv <- ggplot(infantmortality,
                  aes(x = Year, y = `Infant Mortality Rate`, alluvium = `Maternal Race or Ethnicity`)) + 
  theme_bw() + 
  geom_alluvium(aes(fill = `Maternal Race or Ethnicity`), 
                color = "white", 
                width = .1, 
                alpha = .8, 
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Set2") + 
  scale_x_continuous(limits = c(2007, 2016)) + 
  ggtitle("Infant Mortality Based on Race") +
  ylab("Infant Mortality Rate")
ggalluv

Warning: Removed 10 rows containing non-finite values (`stat_alluvium()`).

The graph above is an alluvial plot of infant mortality rate by maternal race or ethnicity. It shows the changes over time and reveals that infants of Black mothers were consistently at the top of the “ranks” in infant mortality over the years.
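
A claim like "one group ranks highest every year" can also be checked numerically. The sketch below uses base R on hypothetical toy data in the same long layout (one row per group and year); the group labels and rates are made up for illustration and are not CDC values.

```r
# Hypothetical toy data: one row per group and year (values are made up).
toy <- data.frame(
  year  = rep(2007:2009, each = 3),
  group = rep(c("A", "B", "C"), times = 3),
  rate  = c(12.1, 5.4, 4.8,
            11.9, 5.5, 4.6,
            11.6, 5.2, NA),
  stringsAsFactors = FALSE
)

# Split by year and pick the group with the highest rate in each year.
# which.max() ignores NA values.
top_by_year <- sapply(split(toy, toy$year),
                      function(d) d$group[which.max(d$rate)])
print(top_by_year)

# TRUE when the same group leads in every year.
same_leader <- length(unique(top_by_year)) == 1
print(same_leader)
```

The same pattern should carry over to the real data by substituting Year, `Maternal Race or Ethnicity`, and `Infant Mortality Rate` for the toy columns.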

Summary and Conclusion

The data that I chose to look at was the infant mortality data set. It gives us information on infant deaths from 2007 to 2016 and the changes that occurred over time. In the process of using this data, the main challenge I personally had was simply syntax. Getting certain results was hard because of things like missing punctuation that I really had to search to find. Other than that, the data itself was relatively clean, apart from 10 rows with missing rates that the plotting and modeling functions dropped automatically, so no further cleaning was necessary.

Sadly it wasn’t a surprise to me, but it was still interesting to see through the data and some graphs I made (which I didn’t put up here) that Black people consistently had the highest rates of infant mortality over this 10-year period. It was also interesting to see in my correlation graph the connection between Neonatal Mortality and Infant Mortality in general. In the news and on social media we’ve heard stories of malpractice with newborn infants. With African Americans especially, it’s been reported from place to place that they are disproportionately maltreated at the hands of medical professionals, particularly during childbirth. In an interesting way, this data seemed to support those reports, which to a layperson may have seemed merely theoretical. If I were to continue this project, I would take more time to look into that, possibly mixing in other data sets to see why that is the case.

As previously stated, the correlation graph brought us to the conclusion that if a baby dies in infancy, the death is more likely to occur within the first 28 days of life, and that holds for children across the board. The question that arises now is what that relationship would look like if the variable of race were added; it’s possible that the correlation would become even stronger, further supporting that point. All in all, I feel the project was successful and the data set was great to work with.