Introduction

Prediction is one most the main goal of building a regression model. In one my blogs ( read here), I spoke about mathematics transformation for linear regression. Today, we are going to use the same data set and go over a transformed model to predict some values.

Just to recall, it ’s a real-world data from 2008 collected by WHO, and can be accessed on my github repository here

The variables included follow. Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.

We are going to focus on predicting “LifeExp” based on “TotExp”.

Data Exploration

First, let load the data and have a look on that.

df <- read.csv('https://raw.githubusercontent.com/jnataky/Computational_Mathematics/main/Assignments/who.csv')

glimpse(df)
## Rows: 190
## Columns: 10
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola~
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63,~
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979,~
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976,~
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, 0~
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.0~
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.0~
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, 1~
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 18761~
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 1907~

We are going to build a simple linear regression of LifeExp as a function of TotExp

The Model

We are going here to apply directly a transformation to our data knowing that a simple model not transformed is not fitting the data well (See the blog link mentioned above).

Let transform the variable by applying some power and then build the linear model.

# Transform the data

df$LifeExp <- df$LifeExp**4.6
df$TotExp <- df$TotExp**0.06
df_lm <- lm(data = df, LifeExp~TotExp)
summary(df_lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Here’s what we’ve got by transforming the data:

R square = 0.7298; The model explains 72.98% of variability of the response data (LifeExp) around its mean. rse = 90490000; Larger deviation from the true regression line F-stat = 507.7 on 1 and 188 DF p-value = 2.2e-16; Reject the null hypothesis

Visualize the new model…

# Scatter plot
df %>%
  ggplot(aes(x = TotExp, y = LifeExp)) +
  geom_point()

Residuals plot…

qqnorm(df_lm$residuals)
qqline(df_lm$residuals)

The residuals plot shows there is more constant variability. Looking at different parameters(r square, F-Stat), the transformed model fits the data better.

Forecasting

Now we are going to forest life expectancy when sum of personal and government expenditures TotExp = 14

We know from the model that:

LifeExp = -736527910 + (620060216)*(TotExp)

Let do the predictions:

TotExp2 <- 1.5
LifeExp11 <- -736527910 + (620060216)*(TotExp2)

paste0("Life expectancy when  TotExp^.06 = 1.5 is ", LifeExp11)
## [1] "Life expectancy when  TotExp^.06 = 1.5 is 193562414"
TotExp4 <- 2.5
LifeExp22 = -736527910 + (620060216)*(TotExp4)

paste0("Life expectancy when  TotExp^.06 = 2.5 is ", LifeExp22)
## [1] "Life expectancy when  TotExp^.06 = 2.5 is 813622630"

Transforming back LifeExp = LifeExp^(1/4.6)…

LifeExp21 <- round(LifeExp11**(1/4.6), 0)

paste0("Life expectancy when  TotExp^.06 = 1.5 and  transorming back LifeExp is ", LifeExp21)
## [1] "Life expectancy when  TotExp^.06 = 1.5 and  transorming back LifeExp is 63"
LifeExp41 <- round(LifeExp22**(1/4.6), 0)

paste0("Life expectancy when  TotExp^.06 = 1.5 and  transorming back LifeExp is ", LifeExp41)
## [1] "Life expectancy when  TotExp^.06 = 1.5 and  transorming back LifeExp is 87"

And these predictions seems realistic based on WHO data.

LS0tDQp0aXRsZTogIkxpbmVhciBSZWdyZXNzaW9uIEFwcGxpY2F0aW9uIC0gRm9yZWNhc3RpbmciDQphdXRob3I6ICJKZXJlZCBBdGFreSINCm91dHB1dDogDQogIG9wZW5pbnRybzo6bGFiX3JlcG9ydDogZGVmYXVsdA0KICBodG1sX2RvY3VtZW50Og0KICAgIG51bWJlcl9zZWN0aW9uczogeWVzDQotLS0NCg0KYGBge3IgaW5jbHVkZT1GQUxTRX0NCmtuaXRyOjpvcHRzX2NodW5rJHNldChldmFsID0gVFJVRSwgbWVzc2FnZSA9IEZBTFNFLCB3YXJuaW5nID0gRkFMU0UpDQpgYGANCg0KDQpgYGB7ciBzZXR1cCwgaW5jbHVkZT1GQUxTRX0NCmtuaXRyOjpvcHRzX2NodW5rJHNldChlY2hvID0gVFJVRSkNCmBgYA0KDQoNCg0KIyMgSW50cm9kdWN0aW9uDQoNCg0KUHJlZGljdGlvbiBpcyBvbmUgbW9zdCB0aGUgbWFpbiBnb2FsIG9mIGJ1aWxkaW5nIGEgcmVncmVzc2lvbiBtb2RlbC4NCkluIG9uZSBteSBibG9ncyAoIHJlYWQgW2hlcmVdKGh0dHBzOi8vcnB1YnMuY29tL2puYXRha3kvNzcyODU3KSksIEkgc3Bva2UgYWJvdXQgbWF0aGVtYXRpY3MgdHJhbnNmb3JtYXRpb24gZm9yIGxpbmVhciByZWdyZXNzaW9uLg0KVG9kYXksIHdlIGFyZSBnb2luZyB0byB1c2UgdGhlIHNhbWUgZGF0YSBzZXQgYW5kIGdvIG92ZXIgYSB0cmFuc2Zvcm1lZCBtb2RlbCB0byBwcmVkaWN0IHNvbWUgdmFsdWVzLg0KDQpKdXN0IHRvIHJlY2FsbCwgaXQgJ3MgYSByZWFsLXdvcmxkIGRhdGEgZnJvbSAyMDA4IGNvbGxlY3RlZCBieSBXSE8sIGFuZCBjYW4gYmUgYWNjZXNzZWQNCm9uIG15IGdpdGh1YiByZXBvc2l0b3J5IFtoZXJlXShodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vam5hdGFreS9Db21wdXRhdGlvbmFsX01hdGhlbWF0aWNzL21haW4vQXNzaWdubWVudHMvd2hvLmNzdikNCg0KVGhlIHZhcmlhYmxlcyBpbmNsdWRlZCBmb2xsb3cuDQpDb3VudHJ5OiBuYW1lIG9mIHRoZSBjb3VudHJ5DQpMaWZlRXhwOiBhdmVyYWdlIGxpZmUgZXhwZWN0YW5jeSBmb3IgdGhlIGNvdW50cnkgaW4geWVhcnMNCkluZmFudFN1cnZpdmFsOiBwcm9wb3J0aW9uIG9mIHRob3NlIHN1cnZpdmluZyB0byBvbmUgeWVhciBvciBtb3JlDQpVbmRlcjVTdXJ2aXZhbDogcHJvcG9ydGlvbiBvZiB0aG9zZSBzdXJ2aXZpbmcgdG8gZml2ZSB5ZWFycyBvciBtb3JlDQpUQkZyZWU6IHByb3BvcnRpb24gb2YgdGhlIHBvcHVsYXRpb24gd2l0aG91dCBUQi4NClByb3BNRDogcHJvcG9ydGlvbiBvZiB0aGUgcG9wdWxhdGlvbiB3aG8gYXJlIE1Ecw0KUHJvcFJOOiBwcm9wb3J0aW9uIG9mIHRoZSBwb3B1bGF0aW9uIHdobyBhcmUgUk5zDQpQZXJzRXhwOiBtZWFuIHBlcnNvbmFsIGV4cGVuZGl0dXJlcyBvbiBoZWFsdGhjYXJlIGluIFVTIGRvbGxhcnMgYXQgYXZlcmFnZSBleGNoYW5nZSByYXRlDQpHb3Z0RXhwOiBtZWFuIGdvdmVybm1lbnQgZXhwZW5kaXR1cmVzIHBlciBjYXBpdGEgb24gaGVhbHRoY2FyZSwgVVMgZG9sbGFycyBhdCBhdmVyYWdlIGV4Y2hhbmdlIHJhdGUNClRvdEV4cDogc3VtIG9mIHBlcnNvbmFsIGFuZCBnb3Zlcm5tZW50IGV4cGVuZGl0dXJlcy4NCg0KDQpXZSBhcmUgZ29pbmcgdG8gZm9jdXMgb24gcHJlZGljdGluZyAiTGlmZUV4cCIgYmFzZWQgb24gIlRvdEV4cCIuDQoNCg0KDQpgYGB7ciBlY2hvID0gRkFMU0UsIG1lc3NhZ2U9RkFMU0UsIGluY2x1ZGU9RkFMU0V9DQpsaWJyYXJ5KHRpZHl2ZXJzZSkNCg0KDQpgYGAgIA0KDQoNCg0KIyMgRGF0YSBFeHBsb3JhdGlvbg0KDQoNCkZpcnN0LCBsZXQgbG9hZCB0aGUgZGF0YSBhbmQgaGF2ZSBhIGxvb2sgb24gdGhhdC4NCg0KYGBge3J9DQoNCmRmIDwtIHJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vam5hdGFreS9Db21wdXRhdGlvbmFsX01hdGhlbWF0aWNzL21haW4vQXNzaWdubWVudHMvd2hvLmNzdicpDQoNCmdsaW1wc2UoZGYpDQoNCmBgYA0KDQoNCldlIGFyZSBnb2luZyB0byBidWlsZCBhIHNpbXBsZSBsaW5lYXIgcmVncmVzc2lvbiBvZiBMaWZlRXhwIGFzIGEgZnVuY3Rpb24gb2YgVG90RXhwDQoNCg0KDQojIyBUaGUgTW9kZWwNCg0KV2UgYXJlIGdvaW5nIGhlcmUgdG8gYXBwbHkgZGlyZWN0bHkgYSB0cmFuc2Zvcm1hdGlvbiB0byBvdXIgZGF0YSBrbm93aW5nDQp0aGF0IGEgc2ltcGxlIG1vZGVsIG5vdCB0cmFuc2Zvcm1lZCBpcyBub3QgZml0dGluZyB0aGUgZGF0YSB3ZWxsDQooU2VlIHRoZSBibG9nIGxpbmsgbWVudGlvbmVkIGFib3ZlKS4NCg0KTGV0IHRyYW5zZm9ybSB0aGUgdmFyaWFibGUgYnkgYXBwbHlpbmcgc29tZSBwb3dlciBhbmQgdGhlbiBidWlsZCB0aGUgbGluZWFyIG1vZGVsLg0KDQoNCmBgYHtyfQ0KDQojIFRyYW5zZm9ybSB0aGUgZGF0YQ0KDQpkZiRMaWZlRXhwIDwtIGRmJExpZmVFeHAqKjQuNg0KZGYkVG90RXhwIDwtIGRmJFRvdEV4cCoqMC4wNg0KDQpgYGANCg0KDQoNCmBgYHtyfQ0KDQpkZl9sbSA8LSBsbShkYXRhID0gZGYsIExpZmVFeHB+VG90RXhwKQ0Kc3VtbWFyeShkZl9sbSkNCg0KYGBgDQoNCg0KDQpIZXJlJ3Mgd2hhdCB3ZSd2ZSBnb3QgYnkgdHJhbnNmb3JtaW5nIHRoZSBkYXRhOg0KDQpSIHNxdWFyZSA9IDAuNzI5ODsgVGhlIG1vZGVsIGV4cGxhaW5zIDcyLjk4JSBvZiB2YXJpYWJpbGl0eSBvZiB0aGUgcmVzcG9uc2UgZGF0YSAoTGlmZUV4cCkgYXJvdW5kIGl0cyBtZWFuLg0KcnNlID0gOTA0OTAwMDA7IExhcmdlciBkZXZpYXRpb24gZnJvbSB0aGUgdHJ1ZSByZWdyZXNzaW9uIGxpbmUNCkYtc3RhdCA9IDUwNy43IG9uIDEgYW5kIDE4OCBERg0KcC12YWx1ZSA9IDIuMmUtMTY7IFJlamVjdCB0aGUgbnVsbCBoeXBvdGhlc2lzDQoNCg0KVmlzdWFsaXplIHRoZSBuZXcgbW9kZWwuLi4NCg0KYGBge3J9DQoNCiMgU2NhdHRlciBwbG90DQpkZiAlPiUNCiAgZ2dwbG90KGFlcyh4ID0gVG90RXhwLCB5ID0gTGlmZUV4cCkpICsNCiAgZ2VvbV9wb2ludCgpDQoNCg0KYGBgDQoNClJlc2lkdWFscyBwbG90Li4uDQoNCmBgYHtyfQ0KDQpxcW5vcm0oZGZfbG0kcmVzaWR1YWxzKQ0KcXFsaW5lKGRmX2xtJHJlc2lkdWFscykNCg0KDQpgYGANCg0KDQpUaGUgcmVzaWR1YWxzIHBsb3Qgc2hvd3MgdGhlcmUgaXMgbW9yZSBjb25zdGFudCB2YXJpYWJpbGl0eS4gTG9va2luZyBhdCBkaWZmZXJlbnQgcGFyYW1ldGVycyhyIHNxdWFyZSwgRi1TdGF0KSwgdGhlIHRyYW5zZm9ybWVkIG1vZGVsIGZpdHMgdGhlIGRhdGEgYmV0dGVyLg0KDQoNCiMjIEZvcmVjYXN0aW5nIA0KDQpOb3cgd2UgYXJlIGdvaW5nIHRvIGZvcmVzdCBsaWZlIGV4cGVjdGFuY3kgd2hlbiBzdW0gb2YgcGVyc29uYWwgYW5kIGdvdmVybm1lbnQgZXhwZW5kaXR1cmVzIA0KVG90RXhwID0gMTQNCg0KV2Uga25vdyBmcm9tIHRoZSBtb2RlbCB0aGF0Og0KDQpMaWZlRXhwID0gLTczNjUyNzkxMCArICg2MjAwNjAyMTYpKihUb3RFeHApDQoNCkxldCBkbyB0aGUgcHJlZGljdGlvbnM6DQoNCmBgYHtyfQ0KDQpUb3RFeHAyIDwtIDEuNQ0KTGlmZUV4cDExIDwtIC03MzY1Mjc5MTAgKyAoNjIwMDYwMjE2KSooVG90RXhwMikNCg0KcGFzdGUwKCJMaWZlIGV4cGVjdGFuY3kgd2hlbiAgVG90RXhwXi4wNiA9IDEuNSBpcyAiLCBMaWZlRXhwMTEpDQoNClRvdEV4cDQgPC0gMi41DQpMaWZlRXhwMjIgPSAtNzM2NTI3OTEwICsgKDYyMDA2MDIxNikqKFRvdEV4cDQpDQoNCnBhc3RlMCgiTGlmZSBleHBlY3RhbmN5IHdoZW4gIFRvdEV4cF4uMDYgPSAyLjUgaXMgIiwgTGlmZUV4cDIyKQ0KDQoNCmBgYA0KDQoNClRyYW5zZm9ybWluZyBiYWNrIExpZmVFeHAgPSBMaWZlRXhwXigxLzQuNikuLi4NCg0KYGBge3J9DQoNCkxpZmVFeHAyMSA8LSByb3VuZChMaWZlRXhwMTEqKigxLzQuNiksIDApDQoNCnBhc3RlMCgiTGlmZSBleHBlY3RhbmN5IHdoZW4gIFRvdEV4cF4uMDYgPSAxLjUgYW5kICB0cmFuc29ybWluZyBiYWNrIExpZmVFeHAgaXMgIiwgTGlmZUV4cDIxKQ0KDQoNCkxpZmVFeHA0MSA8LSByb3VuZChMaWZlRXhwMjIqKigxLzQuNiksIDApDQoNCnBhc3RlMCgiTGlmZSBleHBlY3RhbmN5IHdoZW4gIFRvdEV4cF4uMDYgPSAxLjUgYW5kICB0cmFuc29ybWluZyBiYWNrIExpZmVFeHAgaXMgIiwgTGlmZUV4cDQxKQ0KDQoNCmBgYA0KDQoNCkFuZCB0aGVzZSBwcmVkaWN0aW9ucyBzZWVtcyByZWFsaXN0aWMgYmFzZWQgb24gV0hPIGRhdGEuDQo=