Overview

Here, we will demonstrate the power of mathematics transformation when performing linear regression.

We are going to use a real-world data from 2008 collected by WHO, and can be accessed on my github repository here

The variables included follow. Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.

Data Exploration

Let first load the data, and take a look on how it looks like:

df <- read.csv('https://raw.githubusercontent.com/jnataky/Computational_Mathematics/main/Assignments/who.csv')

glimpse(df)
## Rows: 190
## Columns: 10
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola~
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63,~
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979,~
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976,~
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, 0~
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.0~
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.0~
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, 1~
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 18761~
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 1907~

We are going to build a simple linear regression of LifeExp as a function of TotExp

LifeExp~TotExp: Simple linear regression

It is good to visualize the regression first

df %>%
  ggplot(aes(x = TotExp, y = LifeExp)) +
  geom_point()

Here, we are going to build the model and have a look on its summary

# Linear model

df_lm <- lm(data = df, LifeExp~TotExp)
summary(df_lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

From the summary above, we realize that:

R square = 0.2577; The model explains 25.77% of variability of the response data (LifeExp) around its mean. rse = 9.371; spread is lower F-stat = 65.26 on 1 and 188 DF p-value = 7.714e-14; Reject the null hypothesis

Let now check on linear model assumptions and verify if a linear model is suitable for this data set.

Assumptions

Let take a look on each variable and its residuals…

  1. Normality
qqnorm(df_lm$residuals)
qqline(df_lm$residuals)

Residuals are not nearly normal.

  1. Linearity

The scatter plot will have to check the linearity…

# Scatter plot
df %>%
  ggplot(aes(x = TotExp, y = LifeExp)) +
  geom_point()

Looking at the above plot, we can say at first glace that the data don’t look linear.

  1. Constant variability
ggplot(data = df_lm, aes(x = .fitted, y = .resid)) +
  geom_point()+
  
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

The variability on the residuals is not constant.

At the end, we can say that the assumptions of simple linear regression are not met. That said, it is a good time to do some transformation. We are going to add power on the variables to transfrom the data.

Model transformed

# Transform the data

df$LifeExp <- df$LifeExp**4.6
df$TotExp <- df$TotExp**0.06
# Re-run the model

df_lm2 <- lm(data = df, LifeExp~TotExp)
summary(df_lm2)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Here’s what we’ve got by transforming the data:

R square = 0.7298; The model explains 72.98% of variability of the response data (LifeExp) around its mean. rse = 90490000; Larger deviation from the true regression line F-stat = 507.7 on 1 and 188 DF p-value = 2.2e-16; Reject the null hypothesis

Visualize the new model…

# Scatter plot
df %>%
  ggplot(aes(x = TotExp, y = LifeExp)) +
  geom_point()

Residuals plot…

qqnorm(df_lm2$residuals)
qqline(df_lm2$residuals)

Take Away

It is important to think of mathematics transformations when a linear model can’t fit the data. Here, I choose some value arbitrary to power the variables but you can use PowerTransform function of “car” library for not guessing on which power you need for your variables.

As we can see, the transformed variables has significantly improve the model. The residuals plot shows there is more constant variability and also comparing different parameters(r square, F-Stat), that’s it, the transformed model fits the data better.

LS0tDQp0aXRsZTogIkxpbmVhciBSZWdyZXNzaW9uIC0gTWF0aGVtYXRpY3MgVHJhbnNmb3JtYXRpb24iDQphdXRob3I6ICJKZXJlZCBBdGFreSINCm91dHB1dDogDQogIG9wZW5pbnRybzo6bGFiX3JlcG9ydDogZGVmYXVsdA0KICBodG1sX2RvY3VtZW50Og0KICAgIG51bWJlcl9zZWN0aW9uczogeWVzDQotLS0NCg0KYGBge3IgaW5jbHVkZT1GQUxTRX0NCmtuaXRyOjpvcHRzX2NodW5rJHNldChldmFsID0gVFJVRSwgbWVzc2FnZSA9IEZBTFNFLCB3YXJuaW5nID0gRkFMU0UpDQpgYGANCg0KDQpgYGB7ciBzZXR1cCwgaW5jbHVkZT1GQUxTRX0NCmtuaXRyOjpvcHRzX2NodW5rJHNldChlY2hvID0gVFJVRSkNCmBgYA0KDQoNCg0KDQojIyBPdmVydmlldw0KDQpIZXJlLCB3ZSB3aWxsIGRlbW9uc3RyYXRlIHRoZSBwb3dlciBvZiBtYXRoZW1hdGljcyB0cmFuc2Zvcm1hdGlvbiB3aGVuDQpwZXJmb3JtaW5nIGxpbmVhciByZWdyZXNzaW9uLg0KDQpXZSBhcmUgZ29pbmcgdG8gdXNlIGEgcmVhbC13b3JsZCBkYXRhIGZyb20gMjAwOCBjb2xsZWN0ZWQgYnkgV0hPLCBhbmQgY2FuIGJlIGFjY2Vzc2VkDQpvbiBteSBnaXRodWIgcmVwb3NpdG9yeSBbaGVyZV0oaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2puYXRha3kvQ29tcHV0YXRpb25hbF9NYXRoZW1hdGljcy9tYWluL0Fzc2lnbm1lbnRzL3doby5jc3YpDQoNClRoZSB2YXJpYWJsZXMgaW5jbHVkZWQgZm9sbG93Lg0KQ291bnRyeTogbmFtZSBvZiB0aGUgY291bnRyeQ0KTGlmZUV4cDogYXZlcmFnZSBsaWZlIGV4cGVjdGFuY3kgZm9yIHRoZSBjb3VudHJ5IGluIHllYXJzDQpJbmZhbnRTdXJ2aXZhbDogcHJvcG9ydGlvbiBvZiB0aG9zZSBzdXJ2aXZpbmcgdG8gb25lIHllYXIgb3IgbW9yZQ0KVW5kZXI1U3Vydml2YWw6IHByb3BvcnRpb24gb2YgdGhvc2Ugc3Vydml2aW5nIHRvIGZpdmUgeWVhcnMgb3IgbW9yZQ0KVEJGcmVlOiBwcm9wb3J0aW9uIG9mIHRoZSBwb3B1bGF0aW9uIHdpdGhvdXQgVEIuDQpQcm9wTUQ6IHByb3BvcnRpb24gb2YgdGhlIHBvcHVsYXRpb24gd2hvIGFyZSBNRHMNClByb3BSTjogcHJvcG9ydGlvbiBvZiB0aGUgcG9wdWxhdGlvbiB3aG8gYXJlIFJOcw0KUGVyc0V4cDogbWVhbiBwZXJzb25hbCBleHBlbmRpdHVyZXMgb24gaGVhbHRoY2FyZSBpbiBVUyBkb2xsYXJzIGF0IGF2ZXJhZ2UgZXhjaGFuZ2UgcmF0ZQ0KR292dEV4cDogbWVhbiBnb3Zlcm5tZW50IGV4cGVuZGl0dXJlcyBwZXIgY2FwaXRhIG9uIGhlYWx0aGNhcmUsIFVTIGRvbGxhcnMgYXQgYXZlcmFnZSBleGNoYW5nZSByYXRlDQpUb3RFeHA6IHN1bSBvZiBwZXJzb25hbCBhbmQgZ292ZXJubWVudCBleHBlbmRpdHVyZXMuDQoNCiAgDQoNCmBgYHtyIGVjaG8gPSBGQUxTRSwgbWVzc2FnZT1GQUxTRSwgaW5jbHVzZT1GQUxTRX0NCmxpYnJhcnkodGlkeXZlcnNlKQ0KDQpgYGANCiAgDQoNCiMjIERhdGEgRXhwbG9yYXRpb24NCg0KTGV0IGZpcnN0IGxvYWQgdGhlIGRhdGEsIGFuZCB0YWtlIGEgbG9vayBvbiBob3cgaXQgbG9va3MgbGlrZToNCmBgYHtyfQ0KDQpkZiA8LSByZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2puYXRha3kvQ29tcHV0YXRpb25hbF9NYXRoZW1hdGljcy9tYWluL0Fzc2lnbm1lbnRzL3doby5jc3YnKQ0KDQpnbGltcHNlKGRmKQ0KDQpgYGANCg0KDQpXZSBhcmUgZ29pbmcgdG8gYnVpbGQgYSBzaW1wbGUgbGluZWFyIHJlZ3Jlc3Npb24gb2YgTGlmZUV4cCBhcyBhIGZ1bmN0aW9uIG9mIFRvdEV4cA0KDQojIyBMaWZlRXhwflRvdEV4cDogU2ltcGxlIGxpbmVhciByZWdyZXNzaW9uDQoNCg0KSXQgaXMgZ29vZCB0byB2aXN1YWxpemUgdGhlIHJlZ3Jlc3Npb24gZmlyc3QNCg0KYGBge3J9DQoNCmRmICU+JQ0KICBnZ3Bsb3QoYWVzKHggPSBUb3RFeHAsIHkgPSBMaWZlRXhwKSkgKw0KICBnZW9tX3BvaW50KCkNCg0KDQpgYGANCg0KDQpIZXJlLCB3ZSBhcmUgZ29pbmcgdG8gYnVpbGQgdGhlIG1vZGVsIGFuZCBoYXZlIGEgbG9vayBvbiBpdHMgc3VtbWFyeQ0KDQpgYGB7cn0NCg0KIyBMaW5lYXIgbW9kZWwNCg0KZGZfbG0gPC0gbG0oZGF0YSA9IGRmLCBMaWZlRXhwflRvdEV4cCkNCnN1bW1hcnkoZGZfbG0pDQoNCg0KYGBgDQoNCg0KDQoNCkZyb20gdGhlIHN1bW1hcnkgYWJvdmUsIHdlIHJlYWxpemUgdGhhdDoNCg0KUiBzcXVhcmUgPSAwLjI1Nzc7IFRoZSBtb2RlbCBleHBsYWlucyAyNS43NyUgb2YgdmFyaWFiaWxpdHkgb2YgdGhlIHJlc3BvbnNlIGRhdGEgKExpZmVFeHApIGFyb3VuZCBpdHMgbWVhbi4NCnJzZSA9IDkuMzcxOyBzcHJlYWQgaXMgbG93ZXINCkYtc3RhdCA9IDY1LjI2IG9uIDEgYW5kIDE4OCBERg0KcC12YWx1ZSA9IDcuNzE0ZS0xNDsgUmVqZWN0IHRoZSBudWxsIGh5cG90aGVzaXMNCg0KDQpMZXQgbm93IGNoZWNrIG9uIGxpbmVhciBtb2RlbCBhc3N1bXB0aW9ucyBhbmQgdmVyaWZ5IGlmIGEgbGluZWFyIG1vZGVsDQppcyBzdWl0YWJsZSBmb3IgdGhpcyBkYXRhIHNldC4NCg0KKipBc3N1bXB0aW9ucyoqDQoNCg0KTGV0IHRha2UgYSBsb29rIG9uIGVhY2ggdmFyaWFibGUgYW5kIGl0cyByZXNpZHVhbHMuLi4NCg0KDQoxLiBOb3JtYWxpdHkgDQoNCmBgYHtyfQ0KDQpxcW5vcm0oZGZfbG0kcmVzaWR1YWxzKQ0KcXFsaW5lKGRmX2xtJHJlc2lkdWFscykNCg0KDQpgYGANCg0KDQpSZXNpZHVhbHMgYXJlIG5vdCBuZWFybHkgbm9ybWFsLg0KDQoNCjIuIExpbmVhcml0eQ0KDQpUaGUgc2NhdHRlciBwbG90IHdpbGwgaGF2ZSB0byBjaGVjayB0aGUgbGluZWFyaXR5Li4uDQoNCmBgYHtyfQ0KDQojIFNjYXR0ZXIgcGxvdA0KZGYgJT4lDQogIGdncGxvdChhZXMoeCA9IFRvdEV4cCwgeSA9IExpZmVFeHApKSArDQogIGdlb21fcG9pbnQoKQ0KDQoNCmBgYA0KDQpMb29raW5nIGF0IHRoZSBhYm92ZSBwbG90LCB3ZSBjYW4gc2F5IGF0IGZpcnN0IGdsYWNlIHRoYXQgdGhlIGRhdGEgZG9uJ3QgbG9vayBsaW5lYXIuDQoNCg0KDQozLiBDb25zdGFudCB2YXJpYWJpbGl0eQ0KDQpgYGB7cn0NCg0KZ2dwbG90KGRhdGEgPSBkZl9sbSwgYWVzKHggPSAuZml0dGVkLCB5ID0gLnJlc2lkKSkgKw0KICBnZW9tX3BvaW50KCkrDQogIA0KICBnZW9tX2hsaW5lKHlpbnRlcmNlcHQgPSAwLCBsaW5ldHlwZSA9ICJkYXNoZWQiKSArDQogIHhsYWIoIkZpdHRlZCB2YWx1ZXMiKSArDQogIHlsYWIoIlJlc2lkdWFscyIpDQoNCg0KYGBgDQoNCg0KVGhlIHZhcmlhYmlsaXR5IG9uIHRoZSByZXNpZHVhbHMgaXMgbm90IGNvbnN0YW50Lg0KDQoNCg0KQXQgdGhlIGVuZCwgd2UgY2FuIHNheSB0aGF0IHRoZSBhc3N1bXB0aW9ucyBvZiBzaW1wbGUgbGluZWFyIHJlZ3Jlc3Npb24gYXJlIG5vdCBtZXQuDQpUaGF0IHNhaWQsIGl0IGlzIGEgZ29vZCB0aW1lIHRvIGRvIHNvbWUgdHJhbnNmb3JtYXRpb24uIFdlIGFyZSBnb2luZyB0byBhZGQgcG93ZXIgb24NCnRoZSB2YXJpYWJsZXMgdG8gdHJhbnNmcm9tIHRoZSBkYXRhLg0KDQoNCiMjIE1vZGVsIHRyYW5zZm9ybWVkIA0KDQpgYGB7cn0NCg0KIyBUcmFuc2Zvcm0gdGhlIGRhdGENCg0KZGYkTGlmZUV4cCA8LSBkZiRMaWZlRXhwKio0LjYNCmRmJFRvdEV4cCA8LSBkZiRUb3RFeHAqKjAuMDYNCg0KYGBgDQoNCg0KDQpgYGB7cn0NCiMgUmUtcnVuIHRoZSBtb2RlbA0KDQpkZl9sbTIgPC0gbG0oZGF0YSA9IGRmLCBMaWZlRXhwflRvdEV4cCkNCnN1bW1hcnkoZGZfbG0yKQ0KDQpgYGANCg0KDQoNCkhlcmUncyB3aGF0IHdlJ3ZlIGdvdCBieSB0cmFuc2Zvcm1pbmcgdGhlIGRhdGE6DQoNClIgc3F1YXJlID0gMC43Mjk4OyBUaGUgbW9kZWwgZXhwbGFpbnMgNzIuOTglIG9mIHZhcmlhYmlsaXR5IG9mIHRoZSByZXNwb25zZSBkYXRhIChMaWZlRXhwKSBhcm91bmQgaXRzIG1lYW4uDQpyc2UgPSA5MDQ5MDAwMDsgTGFyZ2VyIGRldmlhdGlvbiBmcm9tIHRoZSB0cnVlIHJlZ3Jlc3Npb24gbGluZQ0KRi1zdGF0ID0gNTA3Ljcgb24gMSBhbmQgMTg4IERGDQpwLXZhbHVlID0gMi4yZS0xNjsgUmVqZWN0IHRoZSBudWxsIGh5cG90aGVzaXMNCg0KDQpWaXN1YWxpemUgdGhlIG5ldyBtb2RlbC4uLg0KDQpgYGB7cn0NCg0KIyBTY2F0dGVyIHBsb3QNCmRmICU+JQ0KICBnZ3Bsb3QoYWVzKHggPSBUb3RFeHAsIHkgPSBMaWZlRXhwKSkgKw0KICBnZW9tX3BvaW50KCkNCg0KDQpgYGANCg0KUmVzaWR1YWxzIHBsb3QuLi4NCg0KYGBge3J9DQoNCnFxbm9ybShkZl9sbTIkcmVzaWR1YWxzKQ0KcXFsaW5lKGRmX2xtMiRyZXNpZHVhbHMpDQoNCg0KYGBgDQoNCg0KDQojIyBUYWtlIEF3YXkNCg0KSXQgaXMgaW1wb3J0YW50IHRvIHRoaW5rIG9mIG1hdGhlbWF0aWNzIHRyYW5zZm9ybWF0aW9ucyB3aGVuIGEgbGluZWFyIG1vZGVsIGNhbid0IGZpdCB0aGUgZGF0YS4NCkhlcmUsIEkgY2hvb3NlIHNvbWUgdmFsdWUgYXJiaXRyYXJ5IHRvIHBvd2VyIHRoZSB2YXJpYWJsZXMgYnV0IHlvdSBjYW4gdXNlIFBvd2VyVHJhbnNmb3JtIGZ1bmN0aW9uIG9mICJjYXIiIGxpYnJhcnkgZm9yIG5vdCBndWVzc2luZyBvbiB3aGljaCBwb3dlciB5b3UgbmVlZCBmb3IgeW91ciB2YXJpYWJsZXMuDQoNCkFzIHdlIGNhbiBzZWUsIHRoZSB0cmFuc2Zvcm1lZCB2YXJpYWJsZXMgaGFzIHNpZ25pZmljYW50bHkgaW1wcm92ZSB0aGUgbW9kZWwuDQpUaGUgcmVzaWR1YWxzIHBsb3Qgc2hvd3MgdGhlcmUgaXMgbW9yZSBjb25zdGFudCB2YXJpYWJpbGl0eSBhbmQgYWxzbyBjb21wYXJpbmcgZGlmZmVyZW50IHBhcmFtZXRlcnMociBzcXVhcmUsIEYtU3RhdCksIHRoYXQncyBpdCwgdGhlIHRyYW5zZm9ybWVkIG1vZGVsIGZpdHMgdGhlIGRhdGEgYmV0dGVyLg0KDQo=