In this section, let’s create models to see if we can predict certain behaviors based on our basic education data.
Pairs Plot

It’s no surprise that literacy rate and primary school completion rate are strongly correlated. Reading is taught in primary school, so even allowing for the gap between when a person completes primary school and when they’re considered an adult, we can assume that the primary school completion rate is a reasonable proxy for the effective value of basic education within a nation in any given year.
Interestingly enough, there’s a weak correlation between adult literacy rate and the suicide rate. It’s a bit scary to think that there’s a link between being able to read and suicide. There’s also a weak negative correlation between the primary school completion rate and the murder rate.
Murder
Let’s see if murder can be predicted by basic education statistics. We’ll try multiple regression and then use backwards elimination to get our final model.
## [1] 0.03867512
With an \(R^2\) of about 0.04, this model is worthless. It’s not exactly a surprise that literacy and information about primary school are weak indicators for murder.
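The chunk that produced this number is not shown, so the following is only a minimal sketch of a multiple regression followed by backward elimination. It runs on synthetic data; `murder_rate` is a hypothetical column name, and `step()` eliminates terms by AIC, which may differ from the criterion actually used in this report (for example, dropping the highest p-value).

```r
# Synthetic stand-in data; the real data frame is not shown in this section.
set.seed(42)
n <- 60
df <- data.frame(
  literacy_rate = runif(n, 40, 100),
  pschool_crate = runif(n, 30, 100),
  pschool_erate = runif(n, 50, 100)
)
# murder_rate is a hypothetical column name used for illustration only.
df$murder_rate <- 20 - 0.1 * df$pschool_crate + rnorm(n, sd = 6)

# Full model with all three education predictors.
full <- lm(murder_rate ~ literacy_rate + pschool_crate + pschool_erate, data = df)

# Backward elimination by AIC (an assumption; the report may have used p-values).
reduced <- step(full, direction = "backward", trace = 0)

summary(full)$r.squared     # R^2 of the full model
summary(reduced)$r.squared  # R^2 of the reduced model
formula(reduced)            # which terms survived
```

Because both models are fit on the same rows here, the reduced model can never have a higher \(R^2\) than the full one; it can only be simpler.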
What would a model look like with just primary school completion rate, which showed a weak correlation?
## [1] 0.1062743
It looks like this simple linear regression has better results, but it’s still not robust enough to be useful. With an \(R^2\) of about 0.11, only around 11% of the variance is explained by the model.
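For a regression with a single predictor, \(R^2\) is simply the squared Pearson correlation between the predictor and the response, which is why a weak correlation translates directly into a low \(R^2\). A small sketch on synthetic data (the variables `x` and `y` are illustrative stand-ins, not the report's data):

```r
set.seed(1)
x <- runif(50, 30, 100)                 # stand-in for primary school completion rate
y <- 15 - 0.08 * x + rnorm(50, sd = 5)  # stand-in for murder rate

fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared

# For simple linear regression, R^2 equals the squared correlation.
all.equal(r2, cor(x, y)^2)
```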


Murder with Random Forest
Earlier, we saw that there was a weak negative correlation between the murder rate and the primary school completion rate. Let’s see if a random forest model will produce better predictions.
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .


## Random Forest
##
## 53 samples
## 2 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 53, 53, 53, 53, 53, 53, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.867724 0.1040575 3.835524
##
## Tuning parameter 'mtry' was held constant at a value of 2
Based on the \(R^2\) returned by the random forest, this model isn’t quite up to par either.
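The printed summary above has the shape of a `caret::train()` result with `method = "rf"` and the default bootstrap resampling. A minimal sketch of such a call, assuming the `caret` and `randomForest` packages are installed, and using synthetic data with a hypothetical `murder_rate` column:

```r
library(caret)

# Synthetic stand-in data with the same number of rows as the summary above.
set.seed(53)
n <- 53
df <- data.frame(
  literacy_rate = runif(n, 40, 100),
  pschool_crate = runif(n, 30, 100)
)
# murder_rate is a hypothetical column name used for illustration only.
df$murder_rate <- pmax(0, 12 - 0.08 * df$pschool_crate + rnorm(n, sd = 6))

# Default trainControl: 25 bootstrap resamples.
rf_fit <- train(murder_rate ~ literacy_rate + pschool_crate,
                data = df, method = "rf")

# Resampled performance estimates, averaged over the bootstrap replicates.
rf_fit$results[, c("mtry", "RMSE", "Rsquared", "MAE")]
```

Note that the `Rsquared` reported here is estimated on resampled hold-out data, so it is not directly comparable to the in-sample \(R^2\) of a linear model.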
Suicide
What about suicide? Let’s use a similar procedure to determine whether the data we have available can predict suicide rates based on basic education.
##
## Call:
## lm(formula = suicide_rate ~ literacy_rate + pschool_crate + pschool_erate,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3292 -5.2300 -0.3022 3.1134 14.1280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -140.85195 67.29326 -2.093 0.0493 *
## literacy_rate 1.53450 0.68922 2.226 0.0376 *
## pschool_crate 0.03788 0.24999 0.152 0.8811
## pschool_erate -0.09900 0.20825 -0.475 0.6396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.158 on 20 degrees of freedom
## (46971 observations deleted due to missingness)
## Multiple R-squared: 0.2171, Adjusted R-squared: 0.09969
## F-statistic: 1.849 on 3 and 20 DF, p-value: 0.1708
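This model looks stronger than the one that predicts murder, since the \(R^2\) is higher (0.22), but the adjusted \(R^2\) is only 0.10 and the overall F-test is not significant (p = 0.17). Only literacy rate has a coefficient that is significant at the 5% level. Let’s use backwards elimination to see if a simpler model does any better.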
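The gap between the multiple and adjusted \(R^2\) in the summary above comes from the small sample. With \(n = 24\) observations (20 residual degrees of freedom plus 4 estimated coefficients) and \(p = 3\) predictors, the standard adjustment works out as:

\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.2171)\cdot\frac{24 - 1}{24 - 3 - 1}
= 1 - 0.7829 \times 1.15 \approx 0.0997
\]

This agrees with the reported 0.09969 up to rounding of the inputs.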
## [1] 0.2156279
It looks like the linear regression using one explanatory variable results in the best relative model, with an \(R^2\) of about 0.22. Let’s take a look at how this model looks.


It looks like this regression model isn’t very robust.
Let’s try to build a random forest model using literacy rate and primary school completion rate as the explanatory variables. Using all 3 variables wouldn’t leave us with enough observations to run a reasonable model.
Suicide with Random Forest
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .


## Random Forest
##
## 53 samples
## 2 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 53, 53, 53, 53, 53, 53, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.23475 0.2286956 5.465412
##
## Tuning parameter 'mtry' was held constant at a value of 2
It looks like a random forest model with literacy rate and primary school completion rate is a marginally better predictor of suicide rates than the linear regressions above (\(R^2\) of about 0.23 versus 0.22). Still, even with the best model, the predictions won’t be very convincing.
Vaccination
As we saw above, the rates of MCV and DTP vaccination appear highly correlated. Let’s measure to what degree:
## [1] 0.8002264

Reviewing these metrics, we do indeed find that the two rates are highly correlated: our \(R^2\) is 0.80.
Let’s now look at the predictive power of education on vaccination rates, starting with DTP.
##
## Call:
## lm(formula = DTP_rate ~ literacy_rate + pschool_crate + pschool_erate,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.038 -3.260 2.437 6.253 21.570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.30339 4.38329 12.161 < 2e-16 ***
## literacy_rate 0.07993 0.08763 0.912 0.363264
## pschool_crate 0.27844 0.07928 3.512 0.000599 ***
## pschool_erate 0.28100 0.14464 1.943 0.054056 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.63 on 140 degrees of freedom
## (46851 observations deleted due to missingness)
## Multiple R-squared: 0.3235, Adjusted R-squared: 0.309
## F-statistic: 22.31 on 3 and 140 DF, p-value: 7.209e-12
##
## Call:
## lm(formula = DTP_rate ~ pschool_crate + pschool_erate, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.402 -2.945 1.847 5.661 26.286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.24570 1.67128 31.261 < 2e-16 ***
## pschool_crate 0.38964 0.01899 20.514 < 2e-16 ***
## pschool_erate 0.16083 0.04821 3.336 0.000889 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.99 on 793 degrees of freedom
## (46199 observations deleted due to missingness)
## Multiple R-squared: 0.3885, Adjusted R-squared: 0.3869
## F-statistic: 251.9 on 2 and 793 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = DTP_rate ~ pschool_crate, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.436 -7.055 4.026 9.830 43.092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.20107 1.05214 32.51 <2e-16 ***
## pschool_crate 0.56091 0.01251 44.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.24 on 2987 degrees of freedom
## (44006 observations deleted due to missingness)
## Multiple R-squared: 0.4021, Adjusted R-squared: 0.4019
## F-statistic: 2009 on 1 and 2987 DF, p-value: < 2.2e-16
Starting with all three predictor variables, things seem initially promising: an \(R^2\) of 32%, although the p-value for literacy rate is very high at 0.36. When we remove literacy rate, our \(R^2\) climbs to 39%. When we limit the model to primary school completion rate alone, our \(R^2\) increases yet again to 40%, suggesting that this variable alone is as strong a predictor as any combination. Note, however, that each model is fit on a different number of observations (144, 796, and 2,989) because rows with missing values are dropped, so these \(R^2\) values are not strictly comparable.
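One way to put the three models on an equal footing is to refit them on the same complete rows. The sketch below uses synthetic data with artificially injected missing values; the column names mirror the summaries above, but the data and the missingness pattern are assumptions made for illustration.

```r
# Synthetic stand-in data with missing values injected.
set.seed(7)
n <- 500
df <- data.frame(
  literacy_rate = runif(n, 40, 100),
  pschool_crate = runif(n, 30, 100),
  pschool_erate = runif(n, 50, 100)
)
df$DTP_rate <- pmin(100, 35 + 0.5 * df$pschool_crate + rnorm(n, sd = 12))
df$literacy_rate[sample(n, 400)] <- NA   # literacy is the sparsest column
df$pschool_erate[sample(n, 200)] <- NA

# Keep only rows where every variable used by the largest model is present.
cols <- c("DTP_rate", "literacy_rate", "pschool_crate", "pschool_erate")
complete <- df[complete.cases(df[, cols]), ]

m3 <- lm(DTP_rate ~ literacy_rate + pschool_crate + pschool_erate, data = complete)
m2 <- lm(DTP_rate ~ pschool_crate + pschool_erate, data = complete)
m1 <- lm(DTP_rate ~ pschool_crate, data = complete)

models <- list(m3 = m3, m2 = m2, m1 = m1)
r2 <- sapply(models, function(m) summary(m)$r.squared)
n_obs <- sapply(models, nobs)
r2     # on identical rows, R^2 cannot increase as predictors are removed
n_obs  # all three models now use the same number of observations
```

On a shared sample, any rise in \(R^2\) after dropping a variable disappears, which makes adjusted \(R^2\) or AIC the more useful comparison.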


Looking at measles, we initially see a higher \(R^2\), but it only decreases as we remove variables:
## [1] 0.4293597
## [1] 0.3778673
Vaccinations with Random Forest
Let’s create and test a model with random forest and see if it fares any better. We will begin by filtering the data frame and removing country/year pairings with any null values.
## [1] 144 12
This leaves us with only 144 rows, which is not enough to train our model. As literacy rate fared the worst in our linear regression, let’s remove it and reassess.
## [1] 796 12
This seems appropriate, so let’s move forward.
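The filtering step described here can be sketched in base R as follows. The data are synthetic, the `country` and `year` columns are hypothetical, and the 75/25 train/test split is an assumption, since the report does not state the proportion it used.

```r
# Synthetic stand-in data with missing values injected.
set.seed(123)
n <- 1000
df <- data.frame(
  country       = sample(c("A", "B", "C", "D"), n, replace = TRUE),
  year          = sample(1990:2015, n, replace = TRUE),
  literacy_rate = runif(n, 40, 100),
  pschool_crate = runif(n, 30, 100),
  pschool_erate = runif(n, 50, 100),
  DTP_rate      = runif(n, 20, 100)
)
df$literacy_rate[sample(n, 850)] <- NA
df$pschool_erate[sample(n, 300)] <- NA

# Dropping every row with any NA leaves very few observations.
with_literacy <- df[complete.cases(df), ]
dim(with_literacy)

# Removing the sparsest column first keeps far more rows.
without_literacy <- df[, setdiff(names(df), "literacy_rate")]
without_literacy <- without_literacy[complete.cases(without_literacy), ]
dim(without_literacy)

# Random 75/25 split into training and testing sets (proportion is an assumption).
train_idx <- sample(nrow(without_literacy),
                    size = floor(0.75 * nrow(without_literacy)))
train <- without_literacy[train_idx, ]
test  <- without_literacy[-train_idx, ]
```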
## parsnip model object
##
## Fit time: 421ms
##
## Call:
## randomForest(x = as.data.frame(x), y = y, ntree = ~500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 115.1834
## % Var explained: 34.84
The \(R^2\) (about 0.35) is close to, but a little worse than, the one we arrived at with linear modeling above. That said, it is in the same realm. Let’s test this model against our testing sample:



These are some interesting graphs. At first glance our model seems very successful: the predicted values are on average very close to the actual values for our testing sample. That said, the model becomes significantly less accurate as the actual vaccination rate moves down from 100%. This could be said to fail the condition of constant variability.
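The fit and the test-set check described above can be sketched as follows. This calls `randomForest()` directly rather than through parsnip, runs on synthetic data, and uses arbitrary bin boundaries, so it illustrates the procedure rather than reproducing the report's results.

```r
library(randomForest)

# Synthetic stand-in data; DTP_rate is capped at 100 like a real coverage rate.
set.seed(2024)
n <- 400
dat <- data.frame(
  pschool_crate = runif(n, 30, 100),
  pschool_erate = runif(n, 50, 100)
)
dat$DTP_rate <- pmin(100, 40 + 0.5 * dat$pschool_crate + rnorm(n, sd = 10))

train_idx <- sample(n, size = 300)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

rf <- randomForest(DTP_rate ~ pschool_crate + pschool_erate,
                   data = train, ntree = 500)

# "% Var explained" in the printed summary is 100 times this out-of-bag value.
oob_r2 <- tail(rf$rsq, 1)

# Evaluate on the held-out rows.
pred <- predict(rf, newdata = test)
res  <- test$DTP_rate - pred
test_rmse <- sqrt(mean(res^2))

# RMSE within bands of the actual rate, to see where the model is least accurate.
bins <- cut(test$DTP_rate, breaks = c(0, 70, 85, 100))
rmse_by_bin <- tapply(res, bins, function(r) sqrt(mean(r^2)))
rmse_by_bin
```

Comparing the error across bands of the actual rate gives a numeric counterpart to the visual impression that accuracy degrades away from 100%.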
Gini coefficient
Initially, the logic in searching for correlations between education and inequality was that, as a population becomes more educated, GDP would increase and the population would become more engaged in democracy, which independently might lead to a move away from a rigid class structure. After viewing the data, however, the erratic behavior of this response variable gives us little hope of finding correlations:
## [1] 0.142503
## [1] 0.1097555
Immediately we see a very low \(R^2\).
As linear modeling does not seem to work with a variable of this nature, let’s see if a random forest approach might be more effective:
## parsnip model object
##
## Fit time: 701ms
##
## Call:
## randomForest(x = as.data.frame(x), y = y, ntree = ~500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 55.56297
## % Var explained: 22.07
We find that this model explains more of the variance in the Gini coefficient as a response variable (about 22%, compared with 11% to 14% for the linear models). Let’s take a look at the correlation and the distribution of residuals:



While it is clear the correlation is weak, the trend line is not awful. The distribution of the residuals is fairly interesting: the model seems to be a decent predictor of the Gini coefficient where the residuals fall between -20 and 20, but beyond that range we see some outliers. This raises questions about our testing sample: it is possible that an outsized number of the larger Gini values ended up in our testing set.