May 2, 2020

Part 1 - Introduction


Does the GDP per capita of a country in a given year predict the number of doctors per thousand people?

In this presentation, we’ll explore the idea that richer countries have more doctors.

It seems like a fair assumption, but let’s use data to determine whether this is actually true.

If there is a relationship, we can then evaluate the strength of the association.

Part 2 - Data


The data we’re using today was sourced from the World Bank. We’ll be looking at two datasets:

  • GDP per capita in US dollars, adjusted for inflation
  • Doctors per 1000 people

Both datasets include these metrics by country and year. After we combine both datasets, we have just over 4000 country-year observations.
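As a rough sketch of the combining step, the two tables can be joined on country and year. The toy values and the `country` and `year` column names are assumptions for illustration; only `gdp_per_capita` and `docs_per_k` are names that appear in the models in Part 4.

```r
# Toy stand-ins for the two World Bank tables (values are made up).
gdp <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2000, 2001, 2000, 2001),
  gdp_per_capita = c(1000, 1100, 20000, 21000)
)
docs <- data.frame(
  country = c("A", "B", "B", "C"),
  year = c(2000, 2000, 2001, 2000),
  docs_per_k = c(0.2, 2.5, 2.6, 1.0)
)

# Inner join: keep only country-years present in both tables.
df <- merge(gdp, docs, by = c("country", "year"))
print(df)
```

With `merge()`'s default inner join, a country-year that is missing from either table is dropped from the result.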






I discovered the datasets by exploring the Gapminder website. I became familiar with Gapminder through one of Hans Rosling’s TED talks years ago, which can be found here.

Part 3 - Exploratory data analysis (1)


Distribution of Values

Both distributions are right-skewed. The skew is stronger for GDP: country-year observations are more heavily concentrated at low GDP values than they are at low doctor counts.
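One quick numeric check for right skew is to compare the mean with the median: in a right-skewed sample the long upper tail typically pulls the mean above the median. The sketch below uses simulated lognormal values as a stand-in for GDP per capita; the parameters are arbitrary assumptions, not estimates from the World Bank data.

```r
set.seed(1)
# Simulated stand-in for GDP per capita (lognormal, so right-skewed).
gdp_sim <- rlnorm(2000, meanlog = 8, sdlog = 1.5)

mean_gdp <- mean(gdp_sim)
median_gdp <- median(gdp_sim)

# A mean well above the median points to a long right tail.
print(c(mean = mean_gdp, median = median_gdp))

# After a log transform the mean and median sit much closer together.
print(c(mean = mean(log(gdp_sim)), median = median(log(gdp_sim))))
```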

Part 3 - Exploratory data analysis (2)


Observed Relationship Between Metrics

Visually, there appears to be a positive correlation between these two variables, but most of the points on this scatter plot hug the left side, where GDP per capita is low. Using a log scale might make the relationship easier to interpret.

Part 3 - Exploratory data analysis (3)


Log Scale Scatter Plots

The possibility of correlation is clearer when GDP is placed on a log scale. In the plot on the left, however, the points are now condensed near the bottom of the y axis. The plot on the right shows both variables on a log scale, where the correlation is much more apparent.
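A minimal base R sketch of these two log-scale views follows. The plotting code behind the slides is not shown, so this is only one way to draw them, and it uses simulated data whose parameters are assumptions chosen to give a broadly similar shape.

```r
set.seed(1)
n <- 500
# Simulated stand-ins for the two metrics (not the World Bank data).
gdp_per_capita <- rlnorm(n, meanlog = 8, sdlog = 1.5)
docs_per_k <- exp(-5.9 + 0.66 * log(gdp_per_capita) + rnorm(n))

par(mfrow = c(1, 2))  # two panels side by side
# Left: log scale on the x axis only.
plot(gdp_per_capita, docs_per_k, log = "x",
     xlab = "GDP per capita (log scale)", ylab = "Doctors per 1000")
# Right: log scale on both axes.
plot(gdp_per_capita, docs_per_k, log = "xy",
     xlab = "GDP per capita (log scale)",
     ylab = "Doctors per 1000 (log scale)")
```

The `log` argument of `plot()` only changes the axis scaling; the underlying values are left untransformed.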

Part 3 - Exploratory data analysis (4)


Correlation between GDP per Capita and Doctors/1000:

## [1] 0.473999

Correlation between log(GDP per capita) and Doctors/1000:

## [1] 0.5748664

Correlation between log(GDP per capita) and log(Doctors/1000):

## [1] 0.7120528


Looking at these variables on a log scale results in a higher correlation. This may be because there are many poor country-year observations and far fewer rich ones, as seen in the right-skewed distributions.
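The three correlations can be reproduced with `cor()`, which computes the Pearson correlation by default. Pearson correlation measures linear association, so when the underlying relationship is closer to a power law, correlating the logs tends to give a larger value. The sketch below uses simulated data (the parameters are assumptions, not the World Bank values) to illustrate the three calls.

```r
set.seed(1)
n <- 2000
# Simulated power-law relationship with multiplicative noise.
gdp_per_capita <- rlnorm(n, meanlog = 8, sdlog = 1.5)
docs_per_k <- exp(-5.9 + 0.66 * log(gdp_per_capita) + rnorm(n))

r_raw <- cor(gdp_per_capita, docs_per_k)               # both untransformed
r_logx <- cor(log(gdp_per_capita), docs_per_k)         # log GDP only
r_loglog <- cor(log(gdp_per_capita), log(docs_per_k))  # both logged
print(c(raw = r_raw, log_x = r_logx, log_log = r_loglog))
```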

Part 4 - Inference (1)

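# Linear regression: doctors per 1000 on GDP per capita (both untransformed)
# The fit is stored as lm_base so it does not share a name with lm() itself.
lm_base <- lm(docs_per_k ~ gdp_per_capita, data = df)
summary(lm_base)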
## 
## Call:
## lm(formula = docs_per_k ~ gdp_per_capita, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9621 -0.9244 -0.3681  0.5752  6.8405 
## 
## Coefficients:
##                    Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept)    1.1255680906 0.0236038824   47.69 <0.0000000000000002 ***
## gdp_per_capita 0.0000335925 0.0000009867   34.05 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.197 on 4000 degrees of freedom
## Multiple R-squared:  0.2247, Adjusted R-squared:  0.2245 
## F-statistic:  1159 on 1 and 4000 DF,  p-value: < 0.00000000000000022

The p-value is small enough to conclude that GDP per capita is a statistically significant predictor. However, the \(R^2\) value is only 0.22, so this is not a strong model.
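To put the slope in more readable units, it can be rescaled to a difference of 10,000 US dollars in GDP per capita. This is simple arithmetic on the estimate printed above. The same line also shows that the \(R^2\) is the square of the first correlation reported in Part 3, which always holds for a regression with a single predictor and an intercept:

\[
0.0000335925 \times 10{,}000 \approx 0.336 \text{ doctors per 1000 people}, \qquad R^2 = r^2 = 0.473999^2 \approx 0.2247
\]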

Part 4 - Inference (2)

# Linear regression: doctors per 1000 on log(GDP per capita)
lm_lx <- lm(docs_per_k ~ log(gdp_per_capita), data = df)
summary(lm_lx)
## 
## Call:
## lm(formula = docs_per_k ~ log(gdp_per_capita), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8556 -0.7507 -0.3191  0.4433  6.4841 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)         -2.78415    0.10035  -27.74 <0.0000000000000002 ***
## log(gdp_per_capita)  0.50995    0.01148   44.43 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.112 on 4000 degrees of freedom
## Multiple R-squared:  0.3305, Adjusted R-squared:  0.3303 
## F-statistic:  1974 on 1 and 4000 DF,  p-value: < 0.00000000000000022

The \(R^2\) value is now 0.33, an improvement over the previous model, but this model still accounts for only about a third of the variance.
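Because GDP enters on a log scale, the slope describes the effect of a proportional change in GDP rather than a one-dollar change. As a worked example from the estimate above (using the natural log, which is what R's `log()` computes by default), doubling GDP per capita is associated with:

\[
0.50995 \times \ln 2 \approx 0.50995 \times 0.6931 \approx 0.353 \text{ additional doctors per 1000 people}
\]

The \(R^2\) again matches the squared correlation from Part 3: \(0.5748664^2 \approx 0.3305\).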

Part 4 - Inference (3)

# Linear regression: log(doctors per 1000) on log(GDP per capita)
lm_lxly <- lm(log(docs_per_k) ~ log(gdp_per_capita), data = df)
summary(lm_lxly)
## 
## Call:
## lm(formula = log(docs_per_k) ~ log(gdp_per_capita), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5946 -0.6035 -0.0948  0.5646  2.8751 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)         -5.87927    0.09051  -64.96 <0.0000000000000002 ***
## log(gdp_per_capita)  0.66387    0.01035   64.14 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.003 on 4000 degrees of freedom
## Multiple R-squared:  0.507,  Adjusted R-squared:  0.5069 
## F-statistic:  4114 on 1 and 4000 DF,  p-value: < 0.00000000000000022

Finally, the \(R^2\) value for this model is 0.507. By this measure, it is the best of the three models.
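With both variables logged, the slope can be read as an elasticity: a 1% increase in GDP per capita is associated with roughly a 0.66% increase in doctors per 1000 people. A worked example for a doubling of GDP, using the estimate above:

\[
2^{0.66387} = e^{0.66387 \ln 2} \approx e^{0.4602} \approx 1.58
\]

That is, about 58% more doctors per 1000 people. The \(R^2\) equals the squared log-log correlation from Part 3 (\(0.7120528^2 \approx 0.5070\)). One caveat when ranking the models: this \(R^2\) measures variance explained in the log of doctors per 1000, so it is not on the same scale as the \(R^2\) of the first two models.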

Part 4 - Inference (4)

Residuals of Fitted Doctors per Thousand People

Note: the model is based on log values.

The residuals show higher variance at lower levels of GDP when predicting the number of doctors. This isn’t surprising, as there’s much more data for poorer countries.
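One way to check for this kind of uneven spread is to compare the residual standard deviation across the range of fitted values and then plot residuals against fitted values. The sketch below is illustrative only: the data are simulated with noise that shrinks as GDP grows (an assumption made to mimic the pattern described above), and the variable names are hypothetical.

```r
set.seed(1)
n <- 2000
# Simulated data whose noise shrinks as GDP grows (an assumption made
# only to mimic the pattern described above; not the World Bank data).
log_gdp <- runif(n, min = 5, max = 11)
noise_sd <- 2 - 0.15 * log_gdp
log_docs <- -5.9 + 0.66 * log_gdp + rnorm(n, mean = 0, sd = noise_sd)

fit <- lm(log_docs ~ log_gdp)
res <- resid(fit)
fitted_vals <- fitted(fit)

# Compare residual spread below and above the median fitted value.
low <- fitted_vals <= median(fitted_vals)
sd_low <- sd(res[low])
sd_high <- sd(res[!low])
print(c(sd_low = sd_low, sd_high = sd_high))

# Residuals vs. fitted values: a funnel shape suggests non-constant variance.
plot(fitted_vals, res,
     xlab = "Fitted log(doctors per 1000)", ylab = "Residual")
abline(h = 0, lty = 2)
```

A clearly larger spread in one half than in the other is a sign that the constant-variance assumption of ordinary least squares may not hold.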

Part 5 - Conclusion

There’s a clear association between GDP per capita and the number of doctors per thousand people a country has in a given year. As a country produces more goods and services, it’s likely that we’ll find more doctors there. There may be collinearity involved, since the economic activity of doctors, hospitals, and the healthcare industry is itself included in GDP. It would make sense that as a country grows, it is likely to invest more in healthcare up to a certain point, but there’s no guarantee that will happen. It’s entirely possible that a country could swell in size and spend almost nothing on the welfare of its people, perhaps investing heavily in its military instead.