Overview
Here, we will demonstrate the power of mathematics transformation when performing linear regression.
We are going to use a real-world data from 2008 collected by WHO, and can be accessed on my github repository here
The variables included follow. Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.
Data Exploration
Let first load the data, and take a look on how it looks like:
df <- read.csv('https://raw.githubusercontent.com/jnataky/Computational_Mathematics/main/Assignments/who.csv')
glimpse(df)
## Rows: 190
## Columns: 10
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola~
## $ LifeExp <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63,~
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979,~
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976,~
## $ TBFree <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, 0~
## $ PropMD <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.0~
## $ PropRN <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.0~
## $ PersExp <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, 1~
## $ GovtExp <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 18761~
## $ TotExp <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 1907~
We are going to build a simple linear regression of LifeExp as a function of TotExp
LifeExp~TotExp: Simple linear regression
It is good to visualize the regression first
df %>%
ggplot(aes(x = TotExp, y = LifeExp)) +
geom_point()

Here, we are going to build the model and have a look on its summary
# Linear model
df_lm <- lm(data = df, LifeExp~TotExp)
summary(df_lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
From the summary above, we realize that:
R square = 0.2577; The model explains 25.77% of variability of the response data (LifeExp) around its mean. rse = 9.371; spread is lower F-stat = 65.26 on 1 and 188 DF p-value = 7.714e-14; Reject the null hypothesis
Let now check on linear model assumptions and verify if a linear model is suitable for this data set.
Assumptions
Let take a look on each variable and its residuals…
- Normality
qqnorm(df_lm$residuals)
qqline(df_lm$residuals)

Residuals are not nearly normal.
- Linearity
The scatter plot will have to check the linearity…
# Scatter plot
df %>%
ggplot(aes(x = TotExp, y = LifeExp)) +
geom_point()

Looking at the above plot, we can say at first glace that the data don’t look linear.
- Constant variability
ggplot(data = df_lm, aes(x = .fitted, y = .resid)) +
geom_point()+
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")

The variability on the residuals is not constant.
At the end, we can say that the assumptions of simple linear regression are not met. That said, it is a good time to do some transformation. We are going to add power on the variables to transfrom the data.
Take Away
It is important to think of mathematics transformations when a linear model can’t fit the data. Here, I choose some value arbitrary to power the variables but you can use PowerTransform function of “car” library for not guessing on which power you need for your variables.
As we can see, the transformed variables has significantly improve the model. The residuals plot shows there is more constant variability and also comparing different parameters(r square, F-Stat), that’s it, the transformed model fits the data better.
LS0tDQp0aXRsZTogIkxpbmVhciBSZWdyZXNzaW9uIC0gTWF0aGVtYXRpY3MgVHJhbnNmb3JtYXRpb24iDQphdXRob3I6ICJKZXJlZCBBdGFreSINCm91dHB1dDogDQogIG9wZW5pbnRybzo6bGFiX3JlcG9ydDogZGVmYXVsdA0KICBodG1sX2RvY3VtZW50Og0KICAgIG51bWJlcl9zZWN0aW9uczogeWVzDQotLS0NCg0KYGBge3IgaW5jbHVkZT1GQUxTRX0NCmtuaXRyOjpvcHRzX2NodW5rJHNldChldmFsID0gVFJVRSwgbWVzc2FnZSA9IEZBTFNFLCB3YXJuaW5nID0gRkFMU0UpDQpgYGANCg0KDQpgYGB7ciBzZXR1cCwgaW5jbHVkZT1GQUxTRX0NCmtuaXRyOjpvcHRzX2NodW5rJHNldChlY2hvID0gVFJVRSkNCmBgYA0KDQoNCg0KDQojIyBPdmVydmlldw0KDQpIZXJlLCB3ZSB3aWxsIGRlbW9uc3RyYXRlIHRoZSBwb3dlciBvZiBtYXRoZW1hdGljcyB0cmFuc2Zvcm1hdGlvbiB3aGVuDQpwZXJmb3JtaW5nIGxpbmVhciByZWdyZXNzaW9uLg0KDQpXZSBhcmUgZ29pbmcgdG8gdXNlIGEgcmVhbC13b3JsZCBkYXRhIGZyb20gMjAwOCBjb2xsZWN0ZWQgYnkgV0hPLCBhbmQgY2FuIGJlIGFjY2Vzc2VkDQpvbiBteSBnaXRodWIgcmVwb3NpdG9yeSBbaGVyZV0oaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2puYXRha3kvQ29tcHV0YXRpb25hbF9NYXRoZW1hdGljcy9tYWluL0Fzc2lnbm1lbnRzL3doby5jc3YpDQoNClRoZSB2YXJpYWJsZXMgaW5jbHVkZWQgZm9sbG93Lg0KQ291bnRyeTogbmFtZSBvZiB0aGUgY291bnRyeQ0KTGlmZUV4cDogYXZlcmFnZSBsaWZlIGV4cGVjdGFuY3kgZm9yIHRoZSBjb3VudHJ5IGluIHllYXJzDQpJbmZhbnRTdXJ2aXZhbDogcHJvcG9ydGlvbiBvZiB0aG9zZSBzdXJ2aXZpbmcgdG8gb25lIHllYXIgb3IgbW9yZQ0KVW5kZXI1U3Vydml2YWw6IHByb3BvcnRpb24gb2YgdGhvc2Ugc3Vydml2aW5nIHRvIGZpdmUgeWVhcnMgb3IgbW9yZQ0KVEJGcmVlOiBwcm9wb3J0aW9uIG9mIHRoZSBwb3B1bGF0aW9uIHdpdGhvdXQgVEIuDQpQcm9wTUQ6IHByb3BvcnRpb24gb2YgdGhlIHBvcHVsYXRpb24gd2hvIGFyZSBNRHMNClByb3BSTjogcHJvcG9ydGlvbiBvZiB0aGUgcG9wdWxhdGlvbiB3aG8gYXJlIFJOcw0KUGVyc0V4cDogbWVhbiBwZXJzb25hbCBleHBlbmRpdHVyZXMgb24gaGVhbHRoY2FyZSBpbiBVUyBkb2xsYXJzIGF0IGF2ZXJhZ2UgZXhjaGFuZ2UgcmF0ZQ0KR292dEV4cDogbWVhbiBnb3Zlcm5tZW50IGV4cGVuZGl0dXJlcyBwZXIgY2FwaXRhIG9uIGhlYWx0aGNhcmUsIFVTIGRvbGxhcnMgYXQgYXZlcmFnZSBleGNoYW5nZSByYXRlDQpUb3RFeHA6IHN1bSBvZiBwZXJzb25hbCBhbmQgZ292ZXJubWVudCBleHBlbmRpdHVyZXMuDQoNCiAgDQoNCmBgYHtyIGVjaG8gPSBGQUxTRSwgbWVzc2FnZT1GQUxTRSwgaW5jbHVzZT1GQUxTRX0NCmxpYnJhcnkodGlkeXZlcnNlKQ0KDQpgYGANCiAgDQoNCiMjIERhdGEgRXhwbG9yYXRpb24NCg0KTGV0IGZpcnN0IGxvYWQgdGhlIGRhdGEsIGFuZCB0YWtlIGEgbG9vayBvbiBob3cgaXQgbG9va3MgbGlrZToNCmBgYHtyfQ0KDQpkZiA8LSByZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2puYXRha3kvQ29tcHV0YXRpb25hbF9NYXRoZW1hdGljcy9tYWluL0Fzc2lnbm1lbnRzL3doby5jc3YnKQ0KDQpnbGltcHNlKGRmKQ0KDQpgYGANCg0KDQpXZSBhcmUgZ29pbmcgdG8gYnVpbGQgYSBzaW1wbGUgbGluZWFyIHJlZ3Jlc3Npb24gb2YgTGlmZUV4cCBhcyBhIGZ1bmN0aW9uIG9mIFRvdEV4cA0KDQojIyBMaWZlRXhwflRvdEV4cDogU2ltcGxlIGxpbmVhciByZWdyZXNzaW9uDQoNCg0KSXQgaXMgZ29vZCB0byB2aXN1YWxpemUgdGhlIHJlZ3Jlc3Npb24gZmlyc3QNCg0KYGBge3J9DQoNCmRmICU+JQ0KICBnZ3Bsb3QoYWVzKHggPSBUb3RFeHAsIHkgPSBMaWZlRXhwKSkgKw0KICBnZW9tX3BvaW50KCkNCg0KDQpgYGANCg0KDQpIZXJlLCB3ZSBhcmUgZ29pbmcgdG8gYnVpbGQgdGhlIG1vZGVsIGFuZCBoYXZlIGEgbG9vayBvbiBpdHMgc3VtbWFyeQ0KDQpgYGB7cn0NCg0KIyBMaW5lYXIgbW9kZWwNCg0KZGZfbG0gPC0gbG0oZGF0YSA9IGRmLCBMaWZlRXhwflRvdEV4cCkNCnN1bW1hcnkoZGZfbG0pDQoNCg0KYGBgDQoNCg0KDQoNCkZyb20gdGhlIHN1bW1hcnkgYWJvdmUsIHdlIHJlYWxpemUgdGhhdDoNCg0KUiBzcXVhcmUgPSAwLjI1Nzc7IFRoZSBtb2RlbCBleHBsYWlucyAyNS43NyUgb2YgdmFyaWFiaWxpdHkgb2YgdGhlIHJlc3BvbnNlIGRhdGEgKExpZmVFeHApIGFyb3VuZCBpdHMgbWVhbi4NCnJzZSA9IDkuMzcxOyBzcHJlYWQgaXMgbG93ZXINCkYtc3RhdCA9IDY1LjI2IG9uIDEgYW5kIDE4OCBERg0KcC12YWx1ZSA9IDcuNzE0ZS0xNDsgUmVqZWN0IHRoZSBudWxsIGh5cG90aGVzaXMNCg0KDQpMZXQgbm93IGNoZWNrIG9uIGxpbmVhciBtb2RlbCBhc3N1bXB0aW9ucyBhbmQgdmVyaWZ5IGlmIGEgbGluZWFyIG1vZGVsDQppcyBzdWl0YWJsZSBmb3IgdGhpcyBkYXRhIHNldC4NCg0KKipBc3N1bXB0aW9ucyoqDQoNCg0KTGV0IHRha2UgYSBsb29rIG9uIGVhY2ggdmFyaWFibGUgYW5kIGl0cyByZXNpZHVhbHMuLi4NCg0KDQoxLiBOb3JtYWxpdHkgDQoNCmBgYHtyfQ0KDQpxcW5vcm0oZGZfbG0kcmVzaWR1YWxzKQ0KcXFsaW5lKGRmX2xtJHJlc2lkdWFscykNCg0KDQpgYGANCg0KDQpSZXNpZHVhbHMgYXJlIG5vdCBuZWFybHkgbm9ybWFsLg0KDQoNCjIuIExpbmVhcml0eQ0KDQpUaGUgc2NhdHRlciBwbG90IHdpbGwgaGF2ZSB0byBjaGVjayB0aGUgbGluZWFyaXR5Li4uDQoNCmBgYHtyfQ0KDQojIFNjYXR0ZXIgcGxvdA0KZGYgJT4lDQogIGdncGxvdChhZXMoeCA9IFRvdEV4cCwgeSA9IExpZmVFeHApKSArDQogIGdlb21fcG9pbnQoKQ0KDQoNCmBgYA0KDQpMb29raW5nIGF0IHRoZSBhYm92ZSBwbG90LCB3ZSBjYW4gc2F5IGF0IGZpcnN0IGdsYWNlIHRoYXQgdGhlIGRhdGEgZG9uJ3QgbG9vayBsaW5lYXIuDQoNCg0KDQozLiBDb25zdGFudCB2YXJpYWJpbGl0eQ0KDQpgYGB7cn0NCg0KZ2dwbG90KGRhdGEgPSBkZl9sbSwgYWVzKHggPSAuZml0dGVkLCB5ID0gLnJlc2lkKSkgKw0KICBnZW9tX3BvaW50KCkrDQogIA0KICBnZW9tX2hsaW5lKHlpbnRlcmNlcHQgPSAwLCBsaW5ldHlwZSA9ICJkYXNoZWQiKSArDQogIHhsYWIoIkZpdHRlZCB2YWx1ZXMiKSArDQogIHlsYWIoIlJlc2lkdWFscyIpDQoNCg0KYGBgDQoNCg0KVGhlIHZhcmlhYmlsaXR5IG9uIHRoZSByZXNpZHVhbHMgaXMgbm90IGNvbnN0YW50Lg0KDQoNCg0KQXQgdGhlIGVuZCwgd2UgY2FuIHNheSB0aGF0IHRoZSBhc3N1bXB0aW9ucyBvZiBzaW1wbGUgbGluZWFyIHJlZ3Jlc3Npb24gYXJlIG5vdCBtZXQuDQpUaGF0IHNhaWQsIGl0IGlzIGEgZ29vZCB0aW1lIHRvIGRvIHNvbWUgdHJhbnNmb3JtYXRpb24uIFdlIGFyZSBnb2luZyB0byBhZGQgcG93ZXIgb24NCnRoZSB2YXJpYWJsZXMgdG8gdHJhbnNmcm9tIHRoZSBkYXRhLg0KDQoNCiMjIE1vZGVsIHRyYW5zZm9ybWVkIA0KDQpgYGB7cn0NCg0KIyBUcmFuc2Zvcm0gdGhlIGRhdGENCg0KZGYkTGlmZUV4cCA8LSBkZiRMaWZlRXhwKio0LjYNCmRmJFRvdEV4cCA8LSBkZiRUb3RFeHAqKjAuMDYNCg0KYGBgDQoNCg0KDQpgYGB7cn0NCiMgUmUtcnVuIHRoZSBtb2RlbA0KDQpkZl9sbTIgPC0gbG0oZGF0YSA9IGRmLCBMaWZlRXhwflRvdEV4cCkNCnN1bW1hcnkoZGZfbG0yKQ0KDQpgYGANCg0KDQoNCkhlcmUncyB3aGF0IHdlJ3ZlIGdvdCBieSB0cmFuc2Zvcm1pbmcgdGhlIGRhdGE6DQoNClIgc3F1YXJlID0gMC43Mjk4OyBUaGUgbW9kZWwgZXhwbGFpbnMgNzIuOTglIG9mIHZhcmlhYmlsaXR5IG9mIHRoZSByZXNwb25zZSBkYXRhIChMaWZlRXhwKSBhcm91bmQgaXRzIG1lYW4uDQpyc2UgPSA5MDQ5MDAwMDsgTGFyZ2VyIGRldmlhdGlvbiBmcm9tIHRoZSB0cnVlIHJlZ3Jlc3Npb24gbGluZQ0KRi1zdGF0ID0gNTA3Ljcgb24gMSBhbmQgMTg4IERGDQpwLXZhbHVlID0gMi4yZS0xNjsgUmVqZWN0IHRoZSBudWxsIGh5cG90aGVzaXMNCg0KDQpWaXN1YWxpemUgdGhlIG5ldyBtb2RlbC4uLg0KDQpgYGB7cn0NCg0KIyBTY2F0dGVyIHBsb3QNCmRmICU+JQ0KICBnZ3Bsb3QoYWVzKHggPSBUb3RFeHAsIHkgPSBMaWZlRXhwKSkgKw0KICBnZW9tX3BvaW50KCkNCg0KDQpgYGANCg0KUmVzaWR1YWxzIHBsb3QuLi4NCg0KYGBge3J9DQoNCnFxbm9ybShkZl9sbTIkcmVzaWR1YWxzKQ0KcXFsaW5lKGRmX2xtMiRyZXNpZHVhbHMpDQoNCg0KYGBgDQoNCg0KDQojIyBUYWtlIEF3YXkNCg0KSXQgaXMgaW1wb3J0YW50IHRvIHRoaW5rIG9mIG1hdGhlbWF0aWNzIHRyYW5zZm9ybWF0aW9ucyB3aGVuIGEgbGluZWFyIG1vZGVsIGNhbid0IGZpdCB0aGUgZGF0YS4NCkhlcmUsIEkgY2hvb3NlIHNvbWUgdmFsdWUgYXJiaXRyYXJ5IHRvIHBvd2VyIHRoZSB2YXJpYWJsZXMgYnV0IHlvdSBjYW4gdXNlIFBvd2VyVHJhbnNmb3JtIGZ1bmN0aW9uIG9mICJjYXIiIGxpYnJhcnkgZm9yIG5vdCBndWVzc2luZyBvbiB3aGljaCBwb3dlciB5b3UgbmVlZCBmb3IgeW91ciB2YXJpYWJsZXMuDQoNCkFzIHdlIGNhbiBzZWUsIHRoZSB0cmFuc2Zvcm1lZCB2YXJpYWJsZXMgaGFzIHNpZ25pZmljYW50bHkgaW1wcm92ZSB0aGUgbW9kZWwuDQpUaGUgcmVzaWR1YWxzIHBsb3Qgc2hvd3MgdGhlcmUgaXMgbW9yZSBjb25zdGFudCB2YXJpYWJpbGl0eSBhbmQgYWxzbyBjb21wYXJpbmcgZGlmZmVyZW50IHBhcmFtZXRlcnMociBzcXVhcmUsIEYtU3RhdCksIHRoYXQncyBpdCwgdGhlIHRyYW5zZm9ybWVkIG1vZGVsIGZpdHMgdGhlIGRhdGEgYmV0dGVyLg0KDQo=