Week 11 Discussion

The data contains the following columns:

Countries: Country of study.
Continent: Continent location of the country.
Average_Pop(thousands people): Average population of the country under study for 2019-2021 in thousands.
Average_GDP(M$): Average GDP of the country under study for 2019-2021 in millions of dollars.
Average_GDP_per_capita: Average GDP per capita of the country under study for 2019-2021 in dollars.
Average_HEXP($): Health Expenditure Per capita in the country under study in dollars.
Development_level: Level of development of the country under study, calculated from GDP per capita alone. [Please note that this dataset derives the indicator only from GDP per capita. The United Nations (UN) has no unambiguous classification of countries into developed, developing and backward based on a single indicator such as GDP per capita; it uses a wider range of economic, social and quality indicators to determine the level of development of countries.]
AMR_female(per_1000_female_adults): Average mortality of adult women in the country under study (per 1000 adult women per year) for 2019-2023.
AMR_male(per_1000_male_adults): Average mortality of adult men in the country under study (per 1000 adult men per year) for 2019-2023.
Average_CDR: Average crude mortality rate for 2019–2021 in the country under study.

Load data

raw_data <- read.csv("https://raw.githubusercontent.com/RonBalaban/CUNY-SPS-R/main/Adult%20mortality%20rate%20(2019-2021).csv")


colnames(raw_data)

##  [1] "Countries"                          "Continent"                         
##  [3] "Average_Pop.thousands.people."      "Average_GDP.M.."                   
##  [5] "Average_GDP_per_capita..."          "Average_HEXP..."                   
##  [7] "Development_level"                  "AMR_female.per_1000_female_adults."
##  [9] "AMR_male.per_1000_male_adults."     "Average_CDR"
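
The column names printed above differ from the data dictionary because read.csv() runs headers through make.names() by default (check.names = TRUE), which replaces characters such as parentheses, spaces and $ with dots. A small sketch of that mapping, using three of the headers from the column list; the variable names original and sanitized are only for illustration:

```r
# Headers as they appear in the data dictionary
original <- c("Average_Pop(thousands people)",
              "Average_GDP(M$)",
              "Average_HEXP($)")

# make.names() is what read.csv() applies when check.names = TRUE
sanitized <- make.names(original)

# Side-by-side view of the raw and sanitized names
name_map <- data.frame(original, sanitized)
name_map
```

Passing check.names = FALSE to read.csv() would keep the original headers, at the cost of needing backticks around them in model formulas.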

Check relationship between average population vs average crude mortality rate

# Make linear model of response variable average crude mortality rate by predictor average population
mortality_population.lm <- lm(Average_CDR ~ Average_Pop.thousands.people., data=raw_data)
mortality_population.lm

## 
## Call:
## lm(formula = Average_CDR ~ Average_Pop.thousands.people., data = raw_data)
## 
## Coefficients:
##                   (Intercept)  Average_Pop.thousands.people.  
##                     8.161e+00                     -4.110e-07

# Get summary of our model
summary(mortality_population.lm)

## 
## Call:
## lm(formula = Average_CDR ~ Average_Pop.thousands.people., data = raw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9896 -1.7593 -0.4998  1.2313 10.2422 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    8.161e+00  2.449e-01  33.319   <2e-16 ***
## Average_Pop.thousands.people. -4.110e-07  1.472e-06  -0.279    0.781    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.963 on 154 degrees of freedom
## Multiple R-squared:  0.0005056,  Adjusted R-squared:  -0.005985 
## F-statistic: 0.0779 on 1 and 154 DF,  p-value: 0.7805

Check residuals

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

ggplot(mortality_population.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color='red') +
  labs(x = "Fitted Values", y = "Residuals")

Quantile-vs-Quantile

qqnorm(resid(mortality_population.lm))
qqline(resid(mortality_population.lm))
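
A Q-Q plot is a visual check, so it can help to pair it with a formal test of residual normality. The sketch below wraps shapiro.test() from base R in a helper; the name check_residual_normality is hypothetical, and the data are simulated with right-skewed errors purely so the block runs on its own. It is not the mortality dataset.

```r
# Hypothetical helper: Shapiro-Wilk normality test on a model's residuals
check_residual_normality <- function(model) {
  shapiro.test(resid(model))
}

# Simulated stand-in data (NOT the mortality data): flat trend, skewed errors
set.seed(1)
sim <- data.frame(x = runif(156))
sim$y <- 8 + 3 * (rexp(156) - 1)   # exponential errors give a long right tail

sim_fit <- lm(y ~ x, data = sim)
sw <- check_residual_normality(sim_fit)

# A small p-value means the residuals are unlikely to be normal
sw$p.value
```

On the fitted model from this report, the equivalent call would be check_residual_normality(mortality_population.lm).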

Analysis

par(mfrow=c(2,2))
plot(mortality_population.lm)

As we can see, there doesn’t seem to be a strong relationship between the average population and the average crude mortality rate. The residuals are centered reasonably close to zero, with a median of $-0.4998$, but the min value $(-6.9896)$ and max value $(10.2422)$ are not equidistant from it. The 1st and 3rd quartiles, $-1.7593$ and $1.2313$ respectively, are closer to symmetric. Once we head towards the +2 quantile, the Q-Q plot skews heavily to the right.
Looking at the test statistics, the t-value for the slope is minuscule at $-0.279$, with $P(>|t|) = 0.781$, so the slope is not significantly different from zero. The $<2e-16$ p-value with the 3 asterisks (***) belongs to the intercept, not to the population term. The model has an intercept of $8.161e+00$ and a slope of $-4.110e-07$, which is very weak.
The R-squared values, with 154 residual degrees of freedom, tell the same story: the Multiple R-squared is $0.0005056$ and the Adjusted R-squared is $-0.005985$. This linear model, with total population as the independent variable, accounts for almost none of the variance in the crude mortality rate, so population is not a good predictor of it.
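
The statistics discussed above can be re-derived from the printed coefficient table, which is a quick way to check which p-value belongs to which term. A minimal sketch using only numbers copied from the summary() output; the variable names are illustrative:

```r
# Values copied from the summary() output for the slope term
estimate  <- -4.110e-07   # slope for Average_Pop.thousands.people.
std_error <-  1.472e-06   # its standard error
df_resid  <-  154         # residual degrees of freedom

# t statistic = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# With a single predictor, R-squared can be recovered from the F statistic:
# R^2 = F / (F + residual df)
f_stat    <- 0.0779
r_squared <- f_stat / (f_stat + df_resid)

c(t = t_value, p = p_value, r2 = r_squared)
```

Up to rounding of the printed inputs, the results should land close to the reported t-value of $-0.279$, p-value of $0.781$ and Multiple R-squared of $0.0005056$.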

Let’s check to see if average health expenditure is a better predictor of the crude mortality rate

# Make linear model of response variable average crude mortality rate by predictor average health expenditure
mortality_Hexp.lm <- lm(Average_CDR ~ Average_HEXP..., data=raw_data)
mortality_Hexp.lm

## 
## Call:
## lm(formula = Average_CDR ~ Average_HEXP..., data = raw_data)
## 
## Coefficients:
##     (Intercept)  Average_HEXP...  
##       7.9801765        0.0001403

# Get summary of our model
summary(mortality_Hexp.lm)

## 
## Call:
## lm(formula = Average_CDR ~ Average_HEXP..., data = raw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0881 -1.7851 -0.5841  1.1844 10.2985 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.9801765  0.2784704   28.66   <2e-16 ***
## Average_HEXP... 0.0001403  0.0001264    1.11    0.269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.952 on 154 degrees of freedom
## Multiple R-squared:  0.007933,   Adjusted R-squared:  0.001491 
## F-statistic: 1.231 on 1 and 154 DF,  p-value: 0.2689

Check residuals

ggplot(mortality_Hexp.lm, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color='red') +
  labs(x = "Fitted Values", y = "Residuals")

Quantile-vs-Quantile

qqnorm(resid(mortality_Hexp.lm))
qqline(resid(mortality_Hexp.lm))

Analysis

par(mfrow=c(2,2))
plot(mortality_Hexp.lm)

As we can see, this model is not a good fit either, much like the prior model. The intercept $(7.9801765)$ is close to the first model’s, and the slope $(0.0001403)$ is again small. The residuals are centered around a median value $(-0.5841)$ close to 0, with the min $(-7.0881)$ being closer to it than the max $(10.2985)$. The t-value is larger in magnitude $(1.11)$ but still small, and the p-value $(0.269)$ is not significant. Along with the Multiple R-squared value of $0.007933$ and Adjusted R-squared value of $0.001491$, this shows that healthcare expenditure does not have a meaningful linear relationship with the crude mortality rate either. There are definitely other factors at play that influence the crude mortality rate, such as wars/street violence/drug deaths/etc., and we’d need a more refined model to understand which factors matter and how much weight each carries in the mortality rate across countries.
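
To compare single-predictor models like the two above without copying numbers by hand, the key statistics can be pulled straight from summary(). The helper name slope_stats is hypothetical, and the data below are simulated only so the sketch runs on its own; they are not the mortality dataset.

```r
# Hypothetical helper: slope, t, p and R-squared from a one-predictor lm
slope_stats <- function(model) {
  s <- summary(model)
  coefs <- s$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(
    predictor = rownames(coefs)[2],
    slope     = coefs[2, "Estimate"],
    t_value   = coefs[2, "t value"],
    p_value   = coefs[2, "Pr(>|t|)"],
    r_squared = s$r.squared,
    adj_r_sq  = s$adj.r.squared
  )
}

# Simulated data: y depends on x1 but not on x2
set.seed(42)
sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
sim$y <- 8 + 0.5 * sim$x1 + rnorm(100)

# One row per model, ready for side-by-side reading
comparison <- rbind(slope_stats(lm(y ~ x1, data = sim)),
                    slope_stats(lm(y ~ x2, data = sim)))
comparison
```

With the models from this report, the call would be rbind(slope_stats(mortality_population.lm), slope_stats(mortality_Hexp.lm)).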

Ron Balaban

2024-04-05

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Data: Adult Mortality Rate (2019-2021)
