Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Introduction
Last semester, I conducted various analysis on maternal morbidity which showed that maternal mortality was on the increase for all racial/ethnic groups in the United States, as well as NYC. I also showed that the maternal mortality rate in the United States was higher than other G-7 counties. Maternal mortality is widely considered a proxy for the health of the population. Another such proxy is the under-five mortality rate. The under-five mortality rate is defined as the probability per 1,000 that a newborn baby will die before reaching age five. For this analysis I wanted to determine if 1)female under-five mortality rate is increasing 2. investigate the difference between under-five mortality rate for females and males.

Data & Minor Wrangling
Data can be found at https://data.who.int/. Mortality rates are standardized per 1000 live births. The years included in the analysis range from 1932 - 2021. The data was filtered to include mortality rates in the United States.

library(tidyverse)
library(broom)
url <- "https://raw.githubusercontent.com/greggmaloy/Data_605/main/2322814_ALL_LATEST.csv"
my_data <- read_csv(url)
print(head(my_data))
## # A tibble: 6 × 13
##   IND_UUID IND_NAME     IND_CODE DIM_GEO_CODE_M49 country DIM_TIME DIM_TIME_TYPE
##      <dbl> <chr>        <chr>               <dbl> <chr>      <dbl> <chr>        
## 1  2322814 Mortality r… MDG_000…              288 Ghana       1932 YEAR         
## 2  2322814 Mortality r… MDG_000…              288 Ghana       1933 YEAR         
## 3  2322814 Mortality r… MDG_000…               76 Brazil      1934 YEAR         
## 4  2322814 Mortality r… MDG_000…              288 Ghana       1934 YEAR         
## 5  2322814 Mortality r… MDG_000…               76 Brazil      1935 YEAR         
## 6  2322814 Mortality r… MDG_000…              288 Ghana       1935 YEAR         
## # ℹ 6 more variables: DIM_SEX <chr>, DIM_VALUE_TYPE <chr>, VALUE_NUMERIC <dbl>,
## #   VALUE_NUMERIC_LOWER <dbl>, VALUE_NUMERIC_UPPER <dbl>,
## #   DIM_PUBLISH_STATE_CODE <chr>
filtered_data <- my_data %>%
  filter(country == 'United States of America') %>%
    filter(DIM_SEX!="Total")

Female and Male Linear Models and Plot
Two linear models were fit. One for male data and one for female data. The dependent variable is mortality per 1,000 live births, and the independent variable is year.

The multiple R-squared values (0.9085 male vs 0.9084 female) are relatively high and indicate the variance of the dependent variable (mortality) is accounted for by the model. The coefficient estimates for both were statistically significant (p-value <2e-16). There was greater change of morality for one year increase for males (-0.50244 male vs -0.38659 female).

The linear plot displays that mortality rates are both decreasing and converging for both sexes to between 6.8 per 1000 live births male and 5.7 per 1000 live births female.

#female LMF
female_data <- filtered_data %>% 
  filter(DIM_SEX == 'Female')

model_female <- lm(VALUE_NUMERIC ~ DIM_TIME, data = female_data)

summary(model_female)
## 
## Call:
## lm(formula = VALUE_NUMERIC ~ DIM_TIME, data = female_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6380 -2.4592  0.0759  2.0914  4.7825 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 782.16633   29.13482   26.85   <2e-16 ***
## DIM_TIME     -0.38659    0.01467  -26.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.588 on 70 degrees of freedom
## Multiple R-squared:  0.9084, Adjusted R-squared:  0.9071 
## F-statistic: 694.2 on 1 and 70 DF,  p-value: < 2.2e-16
#male LMF
male_data <- filtered_data %>% 
  filter(DIM_SEX == 'Male')

model_male <- lm(VALUE_NUMERIC ~ DIM_TIME, data = male_data)

summary(model_male)
## 
## Call:
## lm(formula = VALUE_NUMERIC ~ DIM_TIME, data = male_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.826 -3.340  0.212  2.549  6.225 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1016.00524   37.83685   26.85   <2e-16 ***
## DIM_TIME      -0.50244    0.01906  -26.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.36 on 70 degrees of freedom
## Multiple R-squared:  0.9085, Adjusted R-squared:  0.9072 
## F-statistic: 695.2 on 1 and 70 DF,  p-value: < 2.2e-16
annotate_data <- filtered_data %>%
  group_by(DIM_SEX) %>%
  arrange(DIM_SEX, DIM_TIME) %>%
  filter(row_number() == 1 | row_number() == n()) %>%
  ungroup()

#Plot

ggplot(filtered_data, aes(x = DIM_TIME, y = VALUE_NUMERIC, color = DIM_SEX)) +
  geom_point(alpha = 0.5) +  
  geom_smooth(data = filtered_data %>% filter(DIM_SEX == 'Male'), 
              method = "lm", se = FALSE, aes(color = "Male")) +
  geom_smooth(data = filtered_data %>% filter(DIM_SEX == 'Female'), 
              method = "lm", se = FALSE, aes(color = "Female")) +
  geom_point(data = annotate_data, aes(x = DIM_TIME, y = VALUE_NUMERIC), 
             size = 3, shape = 8) +  
  geom_text(data = annotate_data, aes(x = DIM_TIME, y = VALUE_NUMERIC, label = VALUE_NUMERIC), 
            vjust = -1, hjust = 1.5, size = 3, color = "black") +  
  scale_color_manual(values = c("Male" = "light blue", "Female" = "pink")) +
  labs(title = "VALUE_NUMERIC vs. DIM_TIME by DIM_SEX",
       x = "DIM_TIME", y = "VALUE_NUMERIC") +
  theme_minimal()

Residual Analysis
Residuals vs Fitted
The data points are not scattered randomly around zero, but rather a are ‘U’ shape, which suggests the model is not be linear.

QQ Residuals
The tails ends of the Q-Q plot deviate from the straight line and indicate that the data is not normally distributed.

# Plotting for Males
par(mfrow = c(2, 2)) # Prepare layout for 4 plots

# Residuals vs. Fitted for Male
ggplot(model_male, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. Fitted for Males", x = "Fitted Values", y = "Residuals") +
  theme_minimal()

# Q-Q plot Male
qqnorm(residuals(model_male), pch = 16, cex = 1.5, main = "", ylab = "Quantiles of Residuals", xlab = "Theoretical Quantiles")
qqline(residuals(model_male), col = "red", lwd = 2)
title("Q-Q Plot Male Residuals", cex.main = 1)

# Plotting Females
par(mfrow = c(2, 2)) 

# Residuals vs. Fitted for Female
ggplot(model_female, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. Fitted for Females", x = "Fitted Values", y = "Residuals") +
  theme_minimal()

# Q-Q plot for Female
qqnorm(residuals(model_female), pch = 16, cex = 1.5, main = "", ylab = "Quantiles of Residuals", xlab = "Theoretical Quantiles")
qqline(residuals(model_female), col = "red", lwd = 2)
title("Q-Q Plot Female Residuals", cex.main = 1)

Conclusion
The residuals vs fitted plot suggests that the relationship between time and mortality is not linear and other model types should be considered. The QQ plot indicates that the residuals are not normal distributed and suggests a linear model is not a good fit for this data. Furthermore, I now realize that with regard to time series data, the independence condition is violated because mortality rates from one year are correlated with rates from previous years. Linear regression was not appropriate for this analysis. Regardless, it is clear that under-five mortality for both females and males is decreasing in the United States, with male rates decreasing faster in the time period covered because male under-five mortality was historically higher.