Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Introduction
Last semester, I conducted various analysis on maternal morbidity which
showed that maternal mortality was on the increase for all racial/ethnic
groups in the United States, as well as NYC. I also showed that the
maternal mortality rate in the United States was higher than other G-7
counties. Maternal mortality is widely considered a proxy for the health
of the population. Another such proxy is the under-five mortality rate.
The under-five mortality rate is defined as the probability per 1,000
that a newborn baby will die before reaching age five. For this analysis
I wanted to determine if 1)female under-five mortality rate is
increasing 2. investigate the difference between under-five mortality
rate for females and males.
Data & Minor Wrangling
Data can be found at https://data.who.int/. Mortality rates are standardized
per 1000 live births. The years included in the analysis range from 1932
- 2021. The data was filtered to include mortality rates in the United
States.
library(tidyverse)
library(broom)
url <- "https://raw.githubusercontent.com/greggmaloy/Data_605/main/2322814_ALL_LATEST.csv"
my_data <- read_csv(url)
print(head(my_data))
## # A tibble: 6 × 13
## IND_UUID IND_NAME IND_CODE DIM_GEO_CODE_M49 country DIM_TIME DIM_TIME_TYPE
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 2322814 Mortality r… MDG_000… 288 Ghana 1932 YEAR
## 2 2322814 Mortality r… MDG_000… 288 Ghana 1933 YEAR
## 3 2322814 Mortality r… MDG_000… 76 Brazil 1934 YEAR
## 4 2322814 Mortality r… MDG_000… 288 Ghana 1934 YEAR
## 5 2322814 Mortality r… MDG_000… 76 Brazil 1935 YEAR
## 6 2322814 Mortality r… MDG_000… 288 Ghana 1935 YEAR
## # ℹ 6 more variables: DIM_SEX <chr>, DIM_VALUE_TYPE <chr>, VALUE_NUMERIC <dbl>,
## # VALUE_NUMERIC_LOWER <dbl>, VALUE_NUMERIC_UPPER <dbl>,
## # DIM_PUBLISH_STATE_CODE <chr>
filtered_data <- my_data %>%
filter(country == 'United States of America') %>%
filter(DIM_SEX!="Total")
Female and Male Linear Models and Plot
Two linear models were fit. One for male data and one for female data.
The dependent variable is mortality per 1,000 live births, and the
independent variable is year.
The multiple R-squared values (0.9085 male vs 0.9084 female) are relatively high and indicate the variance of the dependent variable (mortality) is accounted for by the model. The coefficient estimates for both were statistically significant (p-value <2e-16). There was greater change of morality for one year increase for males (-0.50244 male vs -0.38659 female).
The linear plot displays that mortality rates are both decreasing and converging for both sexes to between 6.8 per 1000 live births male and 5.7 per 1000 live births female.
#female LMF
female_data <- filtered_data %>%
filter(DIM_SEX == 'Female')
model_female <- lm(VALUE_NUMERIC ~ DIM_TIME, data = female_data)
summary(model_female)
##
## Call:
## lm(formula = VALUE_NUMERIC ~ DIM_TIME, data = female_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6380 -2.4592 0.0759 2.0914 4.7825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 782.16633 29.13482 26.85 <2e-16 ***
## DIM_TIME -0.38659 0.01467 -26.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.588 on 70 degrees of freedom
## Multiple R-squared: 0.9084, Adjusted R-squared: 0.9071
## F-statistic: 694.2 on 1 and 70 DF, p-value: < 2.2e-16
#male LMF
male_data <- filtered_data %>%
filter(DIM_SEX == 'Male')
model_male <- lm(VALUE_NUMERIC ~ DIM_TIME, data = male_data)
summary(model_male)
##
## Call:
## lm(formula = VALUE_NUMERIC ~ DIM_TIME, data = male_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.826 -3.340 0.212 2.549 6.225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1016.00524 37.83685 26.85 <2e-16 ***
## DIM_TIME -0.50244 0.01906 -26.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.36 on 70 degrees of freedom
## Multiple R-squared: 0.9085, Adjusted R-squared: 0.9072
## F-statistic: 695.2 on 1 and 70 DF, p-value: < 2.2e-16
annotate_data <- filtered_data %>%
group_by(DIM_SEX) %>%
arrange(DIM_SEX, DIM_TIME) %>%
filter(row_number() == 1 | row_number() == n()) %>%
ungroup()
#Plot
ggplot(filtered_data, aes(x = DIM_TIME, y = VALUE_NUMERIC, color = DIM_SEX)) +
geom_point(alpha = 0.5) +
geom_smooth(data = filtered_data %>% filter(DIM_SEX == 'Male'),
method = "lm", se = FALSE, aes(color = "Male")) +
geom_smooth(data = filtered_data %>% filter(DIM_SEX == 'Female'),
method = "lm", se = FALSE, aes(color = "Female")) +
geom_point(data = annotate_data, aes(x = DIM_TIME, y = VALUE_NUMERIC),
size = 3, shape = 8) +
geom_text(data = annotate_data, aes(x = DIM_TIME, y = VALUE_NUMERIC, label = VALUE_NUMERIC),
vjust = -1, hjust = 1.5, size = 3, color = "black") +
scale_color_manual(values = c("Male" = "light blue", "Female" = "pink")) +
labs(title = "VALUE_NUMERIC vs. DIM_TIME by DIM_SEX",
x = "DIM_TIME", y = "VALUE_NUMERIC") +
theme_minimal()
Residual Analysis
Residuals vs Fitted
The data points are not scattered randomly around zero, but rather a are
‘U’ shape, which suggests the model is not be linear.
QQ Residuals
The tails ends of the Q-Q plot deviate from the straight line and
indicate that the data is not normally distributed.
# Plotting for Males
par(mfrow = c(2, 2)) # Prepare layout for 4 plots
# Residuals vs. Fitted for Male
ggplot(model_male, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs. Fitted for Males", x = "Fitted Values", y = "Residuals") +
theme_minimal()
# Q-Q plot Male
qqnorm(residuals(model_male), pch = 16, cex = 1.5, main = "", ylab = "Quantiles of Residuals", xlab = "Theoretical Quantiles")
qqline(residuals(model_male), col = "red", lwd = 2)
title("Q-Q Plot Male Residuals", cex.main = 1)
# Plotting Females
par(mfrow = c(2, 2))
# Residuals vs. Fitted for Female
ggplot(model_female, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs. Fitted for Females", x = "Fitted Values", y = "Residuals") +
theme_minimal()
# Q-Q plot for Female
qqnorm(residuals(model_female), pch = 16, cex = 1.5, main = "", ylab = "Quantiles of Residuals", xlab = "Theoretical Quantiles")
qqline(residuals(model_female), col = "red", lwd = 2)
title("Q-Q Plot Female Residuals", cex.main = 1)
Conclusion
The residuals vs fitted plot suggests that the relationship between time
and mortality is not linear and other model types should be considered.
The QQ plot indicates that the residuals are not normal distributed and
suggests a linear model is not a good fit for this data. Furthermore, I
now realize that with regard to time series data, the independence
condition is violated because mortality rates from one year are
correlated with rates from previous years. Linear regression was not
appropriate for this analysis. Regardless, it is clear that under-five
mortality for both females and males is decreasing in the United States,
with male rates decreasing faster in the time period covered because
male under-five mortality was historically higher.