Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Data: Adult Mortality Rate (2019-2021)

https://www.kaggle.com/datasets/mikhail1681/adult-mortality-rate-2019-2021

The data contains the columns listed in the colnames() output below.

Load data

raw_data <- read.csv("https://raw.githubusercontent.com/RonBalaban/CUNY-SPS-R/main/Adult%20mortality%20rate%20(2019-2021).csv")


colnames(raw_data)
##  [1] "Countries"                          "Continent"                         
##  [3] "Average_Pop.thousands.people."      "Average_GDP.M.."                   
##  [5] "Average_GDP_per_capita..."          "Average_HEXP..."                   
##  [7] "Development_level"                  "AMR_female.per_1000_female_adults."
##  [9] "AMR_male.per_1000_male_adults."     "Average_CDR"

Check the relationship between average population and average crude mortality rate

# Make linear model of response variable average crude mortality rate by predictor average population
mortality_population.lm <- lm(Average_CDR ~ Average_Pop.thousands.people., data=raw_data)
mortality_population.lm
## 
## Call:
## lm(formula = Average_CDR ~ Average_Pop.thousands.people., data = raw_data)
## 
## Coefficients:
##                   (Intercept)  Average_Pop.thousands.people.  
##                     8.161e+00                     -4.110e-07
# Get summary of our model
summary(mortality_population.lm)
## 
## Call:
## lm(formula = Average_CDR ~ Average_Pop.thousands.people., data = raw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9896 -1.7593 -0.4998  1.2313 10.2422 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    8.161e+00  2.449e-01  33.319   <2e-16 ***
## Average_Pop.thousands.people. -4.110e-07  1.472e-06  -0.279    0.781    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.963 on 154 degrees of freedom
## Multiple R-squared:  0.0005056,  Adjusted R-squared:  -0.005985 
## F-statistic: 0.0779 on 1 and 154 DF,  p-value: 0.7805

Check residuals

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
ggplot(mortality_population.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color='red') +
  labs(x = "Fitted Values", y = "Residuals")

Quantile-vs-Quantile

qqnorm(resid(mortality_population.lm))
qqline(resid(mortality_population.lm))
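
As a numeric complement to the Q-Q plot, a Shapiro-Wilk test can be run on the residuals. The sketch below is only an illustration on R's built-in cars data set, not the mortality data; in this analysis one would pass resid(mortality_population.lm) instead. A small p-value would suggest that the residuals depart from normality.

```r
# Illustration on the built-in `cars` data set (an assumption for this sketch,
# chosen so the code runs without downloading the mortality data).
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test of normality on the model residuals
sw <- shapiro.test(resid(fit))

sw$statistic  # W statistic, between 0 and 1; values near 1 are consistent with normality
sw$p.value    # p-value of the test
```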

Analysis

par(mfrow=c(2,2))
plot(mortality_population.lm)

  • As we can see, there doesn't seem to be a strong relationship between the average population and average crude mortality rate. The residuals are centered near zero, with a median value of \(-0.4998\), but the min value \((-6.9896)\) and max value \((10.2422)\) are not equidistant from it, which points to a longer right tail. The 1st and 3rd quartiles, interestingly enough, are roughly equidistant, with values of \(-1.7593\) and \(1.2313\) respectively. Once we head towards the +2 quantile, the points on the Q-Q plot pull away from the reference line, indicating that the residuals skew right.

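  • Looking at the test statistic for the slope, the t-value is minuscule at \(-0.279\), with \(P(>|t|) = 0.781\), far above 0.05, so the slope is not statistically distinguishable from zero. The \(<2e-16\) with the 3 asterisks (***) belongs to the intercept, not to the predictor. The relationship between population and crude mortality rate has an intercept of \(8.161e+00\) and a slope of \(-4.110e-07\), which is very weak.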

  • Likewise, when looking at the R-squared values (the model has 154 residual degrees of freedom), we see that the Multiple R-squared is \(0.0005056\) and the Adjusted R-squared is \(-0.005985\). The linear model with total population as the independent variable accounts for almost none of the variance in the crude mortality rate, so population is not a good predictor of it.
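
The figures quoted in the bullets above can be cross-checked by hand. The sketch below copies four numbers from the summary() output and recomputes the slope's t-value (estimate divided by standard error), its two-sided p-value, and the R-squared, which for a single predictor equals F / (F + residual df). The variable names are our own and are not part of the original analysis.

```r
# Values copied from the summary(mortality_population.lm) output
estimate  <- -4.110e-07   # slope estimate
std_error <- 1.472e-06    # slope standard error
f_stat    <- 0.0779       # F-statistic on 1 and 154 DF
df_resid  <- 154          # residual degrees of freedom

# t-value is the estimate divided by its standard error
t_value <- estimate / std_error                    # about -0.279

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)    # about 0.78

# With one predictor, R-squared = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)          # about 0.0005

c(t_value = t_value, p_value = p_value, r_squared = r_squared)
```

The recomputed values agree with the printed summary up to rounding, which confirms that the large p-value, not the intercept's <2e-16, is the one that describes the predictor.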

Let's check whether average health expenditure is a better predictor of mortality rate

# Make linear model of response variable average crude mortality rate by predictor average health expenditure
mortality_Hexp.lm <- lm(Average_CDR ~ Average_HEXP..., data=raw_data)
mortality_Hexp.lm
## 
## Call:
## lm(formula = Average_CDR ~ Average_HEXP..., data = raw_data)
## 
## Coefficients:
##     (Intercept)  Average_HEXP...  
##       7.9801765        0.0001403
# Get summary of our model
summary(mortality_Hexp.lm)
## 
## Call:
## lm(formula = Average_CDR ~ Average_HEXP..., data = raw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0881 -1.7851 -0.5841  1.1844 10.2985 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.9801765  0.2784704   28.66   <2e-16 ***
## Average_HEXP... 0.0001403  0.0001264    1.11    0.269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.952 on 154 degrees of freedom
## Multiple R-squared:  0.007933,   Adjusted R-squared:  0.001491 
## F-statistic: 1.231 on 1 and 154 DF,  p-value: 0.2689

Check residuals

ggplot(mortality_Hexp.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color='red') +
  labs(x = "Fitted Values", y = "Residuals")

Quantile-vs-Quantile

qqnorm(resid(mortality_Hexp.lm))
qqline(resid(mortality_Hexp.lm))

Analysis

par(mfrow=c(2,2))
plot(mortality_Hexp.lm)

  • As we can see, this model is not a good fit either, much like the prior model. The intercept \((7.9801765)\) is close to the prior model's \((8.161e+00)\), and the slope \((0.0001403)\) is again nearly zero. The residuals are centered around a median value \((-0.5841)\) close to 0, with the min \((-7.0881)\) being closer to it than the max \((10.2985)\). The t-value for the slope is small \((1.11)\) and its p-value is not significant \((0.269)\). The Multiple R-squared value of \(0.007933\) and Adjusted R-squared value of \(0.001491\) show that health expenditure explains less than 1% of the variance, so a linear relationship between healthcare expenditure and mortality rate is not supported either. There are likely other factors at play that influence the crude mortality rate, such as wars, street violence, and drug deaths, and we'd need a more refined model to understand which factors matter and how much weight each carries on the mortality rate in countries.
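
When comparing several simple regressions like the two above, it can help to collect the headline statistics in one table so each figure is read from the right model. The sketch below uses a hypothetical helper, summarise_fit(), which is our own and not part of the original analysis, and demonstrates it on R's built-in mtcars data set so that it runs without the mortality data. In this analysis one would pass mortality_population.lm and mortality_Hexp.lm instead.

```r
# Hypothetical helper: one row of headline statistics for a
# single-predictor lm() fit (row 2 of the coefficient table is the slope).
summarise_fit <- function(fit) {
  s <- summary(fit)
  data.frame(
    predictor     = rownames(s$coefficients)[2],
    slope         = s$coefficients[2, "Estimate"],
    p_value       = s$coefficients[2, "Pr(>|t|)"],
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    stringsAsFactors = FALSE
  )
}

# Illustration on the built-in `mtcars` data set (an assumption for this sketch)
fits <- list(
  lm(mpg ~ wt,   data = mtcars),
  lm(mpg ~ qsec, data = mtcars)
)

# Stack the per-model rows into a single comparison table
comparison <- do.call(rbind, lapply(fits, summarise_fit))
print(comparison)
```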