There is a widely held belief that the more financial stability people have, the better off and happier they are. This idea is debatable, but this project proceeds on the assumption that it holds.
The World Bank’s stated goal is to reduce poverty. Its main approach is to lend money to developing countries for capital improvement projects. The World Bank measures poverty with many indicators, ranging from environmental issues to educational metrics. A common indicator of a country’s standard of living and economic strength is its gross national income per capita (GNI). According to the World Bank, GNI per capita is the gross national income, converted to U.S. dollars using the World Bank Atlas method, divided by the midyear population. Gross national income is the sum of value added by all resident producers plus any product taxes (less subsidies) not included in the valuation of output plus net receipts of primary income (compensation of employees and property income) from abroad.
For GNI to increase, other factors may need to increase or decrease. The aim of this project is to identify such a factor, one that can be influenced through investment and education to help move a country out of poverty, by using regression analysis to examine the effect of population growth on GNI.
The World Bank makes a large range of data sources (e.g., APIs, UIs and tables) available through its Open Data Initiative. The Country Profiles dataset was chosen for its large set of economic indicators. The csv file was saved to a GitHub repository and imported into a data frame, which had to be transformed because of its unstructured state.
The Country Profiles dataset provides economic and social indicators for 214 countries for the years 1990, 2000 and 2015. Not all of these indicators are available for every country in every year, but what is available is ample for analysis. To ensure that every observation is independent, we will focus only on the indicators for the year 2015, which will be stored in wb2015. After picking the variables to use, there are 167 cases (countries) in the model.
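The year filter and the case count described above can be sketched on a small stand-in table. This is a minimal illustration, not the project's actual import step: `worldbank_toy`, its short column names and its values are hypothetical, and only the `subset(..., year == ...)` pattern mirrors the code used later in this report.

```r
# Hypothetical stand-in for the tidied World Bank data frame; values are made up.
worldbank_toy <- data.frame(
  country    = c("A", "A", "B", "B", "C", "C"),
  year       = c(2000, 2015, 2000, 2015, 2000, 2015),
  gni        = c(900, 1500, 20000, 31000, 400, NA),
  pop_growth = c(2.1, 1.8, 0.4, 0.3, 3.0, 2.9)
)

# Restrict to a single year so that each country contributes one row.
wb2015_toy <- subset(worldbank_toy, year == 2015)

# Count the rows that have both variables present; lm() drops the rest.
usable  <- complete.cases(wb2015_toy[, c("gni", "pop_growth")])
n_cases <- sum(usable)
print(n_cases)
```

Counting complete cases up front makes it clear how many countries a two-variable model will actually use, since `lm()` silently removes rows with missing values.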
The response variable will be GNI, which is a continuous numeric variable in units of US$ per person. A log-transform was necessary to make its distribution approximately normal (the process is shown in Part 3).
After much research into the correlations, sample sizes and conditions necessary for inference, population growth was chosen as the explanatory variable (see the Appendix for further information on this process). As defined by the World Bank, the annual population growth rate for year t is the exponential rate of growth of midyear population from year t-1 to t, expressed as a percentage. Population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. Population growth was chosen mainly because it would be a relatively easy issue to curb through contraception and education, rather than through vast investments in infrastructure and other financial investments.
The data within the Country Profiles dataset is historic data, so it is observational; we have no control over the variables.
The population of interest is the people living in underdeveloped countries. The conclusions of this regression analysis should generalize to all underdeveloped countries. The World Bank attempts to gather at least 2000 observations to obtain one population data point. This ambitious number may make it difficult to get a well-rounded sample from very underdeveloped countries; data collectors may find it easier to stick to the larger towns rather than venturing deep into the rural areas. Assuming that the World Bank tries its best to get a representative sample (and its literature and site suggest that it takes data collection very seriously), there should not be any strong bias in the data.
There are numerous factors that cause large portions of a population to live in poverty. There is, as will be shown, a negative correlation between population growth and GNI: countries with higher population growth tend to have lower GNI. It would be difficult to argue that one causes the other, because too many other variables could be driving the relationship.
The GNI for countries should have a very wide range due to the vast differences between developed and underdeveloped economies. According to the World Bank, countries with a GNI of less than $4,036 are considered to be low/lower-middle income countries (or underdeveloped countries). Taking a look at GNI’s summary statistics, we can see that the median ($6,030) is above this threshold, so most countries are classified as developed:
psych::describe(wb2015$`GNI per capita, Atlas method (current US$)`)
## vars n mean sd median trimmed mad min max range skew
## X1 1 167 13715.45 18637.02 6030 9826.07 7205.44 260 93820 93560 2.06
## kurtosis se
## X1 4.04 1442.18
Also, we can see that the mean is much larger than the median, which implies that the data is heavily skewed to the right. Because of this, we really should be concerned about GNI’s distribution and need to look at its normality.
qqnorm(wb2015$`GNI per capita, Atlas method (current US$)`, main = "Normal Plot of GNI")
qqline(wb2015$`GNI per capita, Atlas method (current US$)`)
It is obvious that GNI is not normally distributed and it is heavily skewed to the right. A log-transform may fix this issue:
wb2015$`log_GNI` <- log10(
wb2015$`GNI per capita, Atlas method (current US$)`)
qqnorm(wb2015$log_GNI, main = "Normal Plot of Log(GNI)" )
qqline(wb2015$log_GNI)
With the log-transform, the distribution follows the normal distribution much more closely. There are still some outliers, but given the size of the sample (167 after dropping NAs), they should not have a significant effect on the regression.
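The effect of the log-transform on a right-skewed variable can be illustrated numerically as well as with a Q-Q plot. The sketch below uses simulated log-normal data, not the World Bank data, and the helper `sample_skew` is a hypothetical function written here for illustration (a simple moment-based skewness, which may differ slightly from the estimator that `psych::describe` reports).

```r
# Simulated right-skewed data standing in for GNI; not the real dataset.
set.seed(1)
sim_gni <- rlnorm(500, meanlog = 9, sdlog = 1)

# Hypothetical helper: moment-based sample skewness, ignoring NAs.
sample_skew <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw <- sample_skew(sim_gni)         # strongly positive for right-skewed data
skew_log <- sample_skew(log10(sim_gni))  # close to 0 after the transform

print(c(raw = skew_raw, log10 = skew_log))
```

A skewness near zero after the transform is consistent with what the second Q-Q plot shows visually, although it is not a formal test of normality.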
First, we look at the summary statistics and normality of population growth:
psych::describe(wb2015$`Population growth (annual %)`)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 213 1.25 1.19 1.2 1.24 1.19 -3.2 5.8 9 0.12 0.84
## se
## X1 0.08
qqnorm(wb2015$`Population growth (annual %)`, main = "Normal Plot of Population Growth")
qqline(wb2015$`Population growth (annual %)`)
The distribution of population growth appears to follow the normal distribution and has very few outliers; with the large sample size, they should not affect the regression. The positive median and mean show that population is growing in most countries.
Let us take an initial look at GNI ~ population growth to see what the relationship is:
plot(wb2015$log_GNI~jitter(wb2015$`Population growth (annual %)`),
main = "GNI ~ Population Growth",
xlab = "Population Growth (%)",
ylab = "log10(GNI) US$")
From the scatter plot, there does appear to be a negative linear relationship between population growth and log10(GNI). This may imply that countries with faster-growing populations, perhaps because of higher birth rates stemming from a lack of contraception, tend to have a lower GNI. We will investigate in the next part to see if this is a reasonable inference.
gni_pop <- lm(wb2015$log_GNI~wb2015$`Population growth (annual %)`)
plot(wb2015$log_GNI~wb2015$`Population growth (annual %)`,
xlab = "Population Growth (%)",
ylab = "log10(GNI) US$")
abline(gni_pop)
qqnorm(gni_pop$residuals, main = "Residuals of GNI ~ Population Growth")
qqline(gni_pop$residuals)
plot(gni_pop$residuals~jitter(gni_pop$fitted.values),
main = "GNI ~ Population Residuals",
xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, lty = 3)
For our 167 cases, the summary output for the least squares fit model for GNI ~ population growth is below.
summary(gni_pop)
##
## Call:
## lm(formula = wb2015$log_GNI ~ wb2015$`Population growth (annual %)`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0713 -0.3904 -0.0798 0.3094 1.6712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.11292 0.06372 64.543 < 2e-16
## wb2015$`Population growth (annual %)` -0.26823 0.03588 -7.476 4.28e-12
##
## (Intercept) ***
## wb2015$`Population growth (annual %)` ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5407 on 165 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.253, Adjusted R-squared: 0.2485
## F-statistic: 55.89 on 1 and 165 DF, p-value: 4.281e-12
We need to perform a hypothesis test to see if it is reasonable to infer that there is a linear relationship between these variables, GNI and population growth. Forming a one-sided hypothesis test on the slope coefficient: \[H_0: \beta_1 = 0 \qquad H_A: \beta_1 < 0\]
To test this, we need to choose a significance level, \(\alpha\), against which to compare the p-value when deciding whether to reject the null hypothesis, \(H_0\). The p-value is the probability, assuming \(H_0\) is true (\(\beta_1 = 0\)), of observing a slope estimate at least as extreme as ours. A common cut-off is \(\alpha = 0.05\), which corresponds to a 95% confidence level. We will use the more conservative \(\alpha = 0.025\) since it is a one-sided test. If our p-value is less than this, there is strong evidence to reject \(H_0\) in favor of \(H_A\).
From the output above, the p-value, \(Pr(>|t|)\), for our estimated slope coefficient, \(\beta_1\), is extremely small (4.28e-12). This reported p-value is two-sided; because the estimated slope is negative, the one-sided p-value is half of it, and either value is \(\ll 0.025\). Therefore, there is strong evidence to reject \(H_0\) and conclude that there is a negative linear relationship between \(log_{10}(GNI)\) and population growth.
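The test can be reproduced directly from the numbers printed in the summary output. The sketch below is an illustration under stated assumptions: it uses the rounded estimate, standard error and residual degrees of freedom copied from the output above, assumes a one-sided alternative of a negative slope, and the variable names are introduced here only for the example. The 95% confidence interval for the slope is an extra check that is not part of the original analysis.

```r
# Values copied (rounded) from the summary(gni_pop) output above.
estimate  <- -0.26823   # slope for population growth
std_error <- 0.03588
df_resid  <- 165
alpha     <- 0.025      # cut-off chosen in the text

# t statistic for H_0: beta_1 = 0.
t_stat <- estimate / std_error

# One-sided p-value for H_A: beta_1 < 0 (lower tail of the t distribution).
p_one_sided <- pt(t_stat, df = df_resid)

# Two-sided p-value, the quantity reported by summary() as Pr(>|t|).
p_two_sided <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the slope.
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

reject_h0 <- p_one_sided < alpha
print(list(t = t_stat, p_one_sided = p_one_sided, ci_95 = ci_95, reject = reject_h0))
```

With the fitted model object available, `confint(gni_pop)` should give the same interval without retyping the coefficients.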
Using the estimated coefficients from the summary output above to construct the equation for the regression model: \[\widehat{log_{10}(GNI)} = 4.112924 - 0.2682326\times population\_growth\]
Looking at the plot again:
The intercept of our regression equation tells us that if the exponential population growth rate is 0, \(log_{10}(GNI)\) would be 4.112924 (GNI \(= 10^{4.112924} \approx\) $12,970), which seems reasonable from the plot above. The slope tells us that for every one-percentage-point increase in the population’s exponential growth rate, \(log_{10}(GNI)\) is expected to decrease by 0.2682326, which corresponds to multiplying GNI by \(10^{-0.2682326} \approx 0.54\), a decrease of about 46%.
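Because the model is fitted on the \(log_{10}\) scale, interpreting it in dollars requires back-transforming with a power of 10. The following sketch does that arithmetic with the coefficients from the regression equation; `predict_gni` is a hypothetical helper defined here for illustration and is not part of the original analysis.

```r
# Coefficients taken from the fitted regression equation.
intercept <- 4.112924
slope     <- -0.2682326

# GNI implied by the model when population growth is 0%.
gni_at_zero_growth <- 10^intercept

# Multiplicative change in GNI for each additional percentage point of growth.
multiplier_per_point <- 10^slope
percent_change       <- (multiplier_per_point - 1) * 100

# Hypothetical helper: model-implied GNI (US$) for a given growth rate (%).
predict_gni <- function(pop_growth) {
  10^(intercept + slope * pop_growth)
}

print(round(gni_at_zero_growth))
print(round(percent_change, 1))
print(round(predict_gni(c(0, 1, 2, 3))))
```

The back-transformed prediction is closer to a typical (median-like) GNI than to a mean, so it should be read as a rough guide rather than an expected dollar value.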
The regression leads us to infer that there is a negative correlation between GNI and population growth. The \(R^2\) value for the model tells us that only ~25% of the variability in \(log_{10}(GNI)\) is explained by population growth, so the model on its own is not a very reliable predictor of GNI.
To check the reliability of our model from a different angle, we can compare it against models for the other years (1990 and 2000) to see if the 2015 model holds over time. Looking at the fitted models for the earlier years (even though the conditions for regression may not hold for these years) and overlaying our 2015 least squares line on each plot, the 2015 line shows the same behavior as the other models, but the linear relationships, like that of our model, do not appear very strong.
wb2000 <- subset(worldbank, year == 2000)
plot.default(log10(wb2000$`GNI per capita, Atlas method (current US$)`)~wb2000$`Population growth (annual %)`, main = "2000: log10(GNI) ~ Population growth", xlab = "Population Growth (%)", ylab = "log10(GNI)")
abline(gni_pop, lty = 3)
text(5.1, 3,"2015", col = "red")
abline(lm(log10(wb2000$`GNI per capita, Atlas method (current US$)`)~wb2000$`Population growth (annual %)`))
text(-3.6, 4.4,"2000", col = "blue")
wb1990 <- subset(worldbank, year == 1990)
plot.default(log10(wb1990$`GNI per capita, Atlas method (current US$)`)~wb1990$`Population growth (annual %)`, main = "1990: log10(GNI) ~ Population growth", xlab = "Population Growth (%)", ylab = "log10(GNI)")
abline(gni_pop, lty = 3)
text(6, 2.8,"2015", col = "red")
abline(lm(log10(wb1990$`GNI per capita, Atlas method (current US$)`)~wb1990$`Population growth (annual %)`))
text(5.5, 2.2,"1990", col = "blue")
Multiple linear regression may be a better tool to pinpoint additional economic or social predictors that affect GNI. However, poverty is a very complex issue. Reducing population growth through countermeasures may improve GNI by allowing for less stressful household situations and fewer people dependent on an underperforming economy where resources may be scarce. The problem with poverty is that it has no single fix, and many factors need to be known and understood before the problem can be seriously addressed.
One of the things learned during this project reminds me of a quote by a renowned economics professor, Ronald H. Coase, who said, “If you torture the data long enough, it will confess.” Meeting the conditions for regression and normality was very difficult. I had to wrangle the data and then backtrack on numerous occasions. The model was not forced to be a very strong indicator of GNI. I chose the strongest model I could come up with and let the model speak for itself.
The data dictionary for all variables is available from this data catalog, where you can click on the info icons. It is also available from this repo for download.
There are a lot of descriptive variables in our data (62). To see which variables could be used in our regression, the correlations with respect to GNI and the number of data points available for each variable (which varies because of NAs) were used as a starting point for the project. After a lot of trial and error, e.g., ruling out variables with heavily skewed distributions or small sample sizes, population growth rate was one of the few viable options. The correlations with \(log_{10}(GNI)\) are provided below.
gni_cor <- as.data.frame(cor(wb2015[, 3:length(wb2015)],
use = "pairwise.complete.obs",
method = "pearson"))
gni_cor <- gni_cor[order(gni_cor$log_GNI, decreasing = TRUE),]
gni_cor <- subset(gni_cor, !is.na(gni_cor[,"log_GNI"]),
select ="log_GNI")
colnames(gni_cor)[1] <- "cor(log_GNI)"
for (i in 3:ncol(wb2015)) {
if (colnames(wb2015[i]) %in% rownames(gni_cor)) {
tmp <- sum(!is.na(wb2015[,i]))
gni_cor[colnames(wb2015[i]),'data_points'] <- tmp
}
}
DT::datatable(gni_cor)
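The same table of correlations and data-point counts can be built without an explicit loop. The sketch below is an alternative formulation on a small made-up data frame, not the code that produced the table above: `toy`, its column names and its values are hypothetical, and `colSums(!is.na(...))` replaces the `for` loop that counts non-missing values per column.

```r
# Hypothetical miniature version of wb2015 with numeric indicator columns only.
toy <- data.frame(
  log_GNI    = c(3.1, 4.2, NA, 3.8, 4.9),
  pop_growth = c(2.5, 0.4, 1.1, NA, 0.2),
  literacy   = c(NA, 99, NA, 71, 98)
)

# Pearson correlation of every column with log_GNI, using pairwise complete rows.
cors <- cor(toy, use = "pairwise.complete.obs", method = "pearson")[, "log_GNI"]

# Non-missing observations per column, computed in one vectorized call.
gni_cor_toy <- data.frame(
  cor_log_GNI = cors,
  data_points = colSums(!is.na(toy))
)

# Strongest positive correlations first, as in the table above.
gni_cor_toy <- gni_cor_toy[order(gni_cor_toy$cor_log_GNI, decreasing = TRUE), ]
print(gni_cor_toy)
```

As in the original loop, `data_points` counts the non-missing values in each column on its own; the number of rows actually shared with `log_GNI` can be smaller.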