R Notebook

First, I imported the data.

df <- as.data.frame(read.csv("who.csv"))
df

We can see from the scatterplot, that this data is probably not associated linearly, although there appears to be a strong positive correlation of some kind. Additionally, in the Q-Q plot, we can see a systematic over and underestimation, with the former happening at the middle, and the latter at the tails.

plot(df$LifeExp ~ df$TotExp)

model1 <- lm(df$LifeExp ~ df$TotExp)
summary(model1)


Call:
lm(formula = df$LifeExp ~ df$TotExp)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.764  -4.778   3.154   7.116  13.292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
df$TotExp   6.297e-05  7.795e-06   8.079 7.71e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.371 on 188 degrees of freedom
Multiple R-squared:  0.2577,    Adjusted R-squared:  0.2537 
F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

plot(model1)

We can examing this further by looking at the summary of the model as below. On a 67 year life span, the residual error is 9 years. That’s a significant deviation! Additionally, this model only explains 25% of the variance, as denoted by the $R^2$, and adjusted $R^2$ values. When $\alpha = .05$, we can reject the null hypothesis because the p-value of our test is $8 \cdot 10 ^{-14}$.

mean(df$LifeExp)
summary(model1)
critical.f <- qf(p = .05, df1 = 1, df2 = 188)
critical.f

Below, I transformed the data and plotted it. From this plot, we can obsever a far more linear relationships between the new vectors.

life.exp.new <- df$LifeExp**4.6
total.expenditures.new <- df$TotExp**.06
plot(life.exp.new~total.expenditures.new)

plot(life.exp.new~ total.expenditures.new)

model2 <- lm(life.exp.new ~ total.expenditures.new)
plot(model2)

The residual plots are far more random, the Q-Q plot has a better fit near the middle, and even the scalled residuals seem to be randomly distributed. Furthermore, we can examine the summary from this second model.

summary(model2)


Call:
lm(formula = life.exp.new ~ total.expenditures.new)

Residuals:
       Min         1Q     Median         3Q        Max 
-308616089  -53978977   13697187   59139231  211951764 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -736527910   46817945  -15.73   <2e-16 ***
total.expenditures.new  620060216   27518940   22.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 90490000 on 188 degrees of freedom
Multiple R-squared:  0.7298,    Adjusted R-squared:  0.7283 
F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

This model has a superior F-statistic and the R^2 value says it explains 75% rather than 25% of the variance in the model. However, the residual standard error is astronomical, particuarly given the scale of our transformed data. Using the results from this second, model, we can predict life expectancy as below.

total.exp1 = 1.5
total.exp2 = 2.5
life.expectancy1 = -736527910 + total.exp1 * 620060216
life.expectancy2 = -736527910 + total.exp2 * 620060216
life.expectancy1**(1/4.6)

[1] 63.31153

life.expectancy2**(1/4.6)

[1] 86.50645

The first scenario predicts 63 years and the second predict 86.5.

Below, I build a multiple regression model as below

model3 <- lm(df$LifeExp ~ df$PropMD+df$TotExp+df$PropMD*df$TotExp)
this <- summary(model3)
this


Call:
lm(formula = df$LifeExp ~ df$PropMD + df$TotExp + df$PropMD * 
    df$TotExp)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.320  -4.132   2.098   6.540  13.074 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          6.277e+01  7.956e-01  78.899  < 2e-16 ***
df$PropMD            1.497e+03  2.788e+02   5.371 2.32e-07 ***
df$TotExp            7.233e-05  8.982e-06   8.053 9.39e-14 ***
df$PropMD:df$TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.765 on 186 degrees of freedom
Multiple R-squared:  0.3574,    Adjusted R-squared:  0.3471 
F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

mean(df$TotExp)

[1] 41695.49

This third model explains 35% of the variance, but has a similar F-statistic as model 3. This model appears to be overfitting.

b0 <- this$coefficients[1]

b1 <- this$coefficients[2]
b2 <- this$coefficients[3]
b3 <- this$coefficients[4]

life.exp <- b0 + b1*.03 + b2 *14 + b3*14*.03
life.exp

This does not seem realisitic because 107 is not a real life expectancy anywhere, government and personal expenditures are only $14 and 1/30 people are doctors in this scenario.

LS0tCnRpdGxlOiAiUiBOb3RlYm9vayIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQpGaXJzdCwgSSBpbXBvcnRlZCB0aGUgZGF0YS4KYGBge3J9CmRmIDwtIGFzLmRhdGEuZnJhbWUocmVhZC5jc3YoIndoby5jc3YiKSkKZGYKYGBgCldlIGNhbiBzZWUgZnJvbSB0aGUgc2NhdHRlcnBsb3QsIHRoYXQgdGhpcyBkYXRhIGlzIHByb2JhYmx5IG5vdCBhc3NvY2lhdGVkIGxpbmVhcmx5LCBhbHRob3VnaCB0aGVyZSBhcHBlYXJzIHRvIGJlIGEgc3Ryb25nIHBvc2l0aXZlIGNvcnJlbGF0aW9uIG9mIHNvbWUga2luZC4gQWRkaXRpb25hbGx5LCBpbiB0aGUgUS1RIHBsb3QsIHdlIGNhbiBzZWUgYSBzeXN0ZW1hdGljIG92ZXIgYW5kIHVuZGVyZXN0aW1hdGlvbiwgd2l0aCB0aGUgZm9ybWVyIGhhcHBlbmluZyBhdCB0aGUgbWlkZGxlLCBhbmQgdGhlIGxhdHRlciBhdCB0aGUgdGFpbHMuCmBgYHtyfQpwbG90KGRmJExpZmVFeHAgfiBkZiRUb3RFeHApCm1vZGVsMSA8LSBsbShkZiRMaWZlRXhwIH4gZGYkVG90RXhwKQpwbG90KG1vZGVsMSkKYGBgCldlIGNhbiBleGFtaW5nIHRoaXMgZnVydGhlciBieSBsb29raW5nIGF0IHRoZSBzdW1tYXJ5IG9mIHRoZSBtb2RlbCBhcyBiZWxvdy4gT24gYSA2NyB5ZWFyIGxpZmUgc3BhbiwgdGhlIHJlc2lkdWFsIGVycm9yIGlzIDkgeWVhcnMuIFRoYXQncyBhIHNpZ25pZmljYW50IGRldmlhdGlvbiEgQWRkaXRpb25hbGx5LCB0aGlzIG1vZGVsIG9ubHkgZXhwbGFpbnMgMjUlIG9mIHRoZSB2YXJpYW5jZSwgYXMgZGVub3RlZCBieSB0aGUgJFJeMiQsIGFuZCBhZGp1c3RlZCAkUl4yJCB2YWx1ZXMuIFdoZW4gJFxhbHBoYSA9IC4wNSQsIHdlIGNhbiByZWplY3QgdGhlIG51bGwgaHlwb3RoZXNpcyBiZWNhdXNlIHRoZSBwLXZhbHVlIG9mIG91ciB0ZXN0IGlzICQ4IFxjZG90IDEwIF57LTE0fSQuCmBgYHtyfQptZWFuKGRmJExpZmVFeHApCnN1bW1hcnkobW9kZWwxKQpjcml0aWNhbC5mIDwtIHFmKHAgPSAuMDUsIGRmMSA9IDEsIGRmMiA9IDE4OCkKY3JpdGljYWwuZgpgYGAKQmVsb3csIEkgdHJhbnNmb3JtZWQgdGhlIGRhdGEgYW5kIHBsb3R0ZWQgaXQuIEZyb20gdGhpcyBwbG90LCB3ZSBjYW4gb2JzZXZlciBhIGZhciBtb3JlIGxpbmVhciByZWxhdGlvbnNoaXBzIGJldHdlZW4gdGhlIG5ldyB2ZWN0b3JzLgoKYGBge3J9CmxpZmUuZXhwLm5ldyA8LSBkZiRMaWZlRXhwKio0LjYKdG90YWwuZXhwZW5kaXR1cmVzLm5ldyA8LSBkZiRUb3RFeHAqKi4wNgpwbG90KGxpZmUuZXhwLm5ld350b3RhbC5leHBlbmRpdHVyZXMubmV3KQpgYGAKYGBge3J9CnBsb3QobGlmZS5leHAubmV3fiB0b3RhbC5leHBlbmRpdHVyZXMubmV3KQptb2RlbDIgPC0gbG0obGlmZS5leHAubmV3IH4gdG90YWwuZXhwZW5kaXR1cmVzLm5ldykKcGxvdChtb2RlbDIpCmBgYApUaGUgcmVzaWR1YWwgcGxvdHMgYXJlIGZhciBtb3JlIHJhbmRvbSwgdGhlIFEtUSBwbG90IGhhcyBhIGJldHRlciBmaXQgbmVhciB0aGUgbWlkZGxlLCBhbmQgZXZlbiB0aGUgc2NhbGxlZCByZXNpZHVhbHMgc2VlbSB0byBiZSByYW5kb21seSBkaXN0cmlidXRlZC4gRnVydGhlcm1vcmUsIHdlIGNhbiBleGFtaW5lIHRoZSBzdW1tYXJ5IGZyb20gdGhpcyBzZWNvbmQgbW9kZWwuCmBgYHtyfQpzdW1tYXJ5KG1vZGVsMikKYGBgClRoaXMgbW9kZWwgaGFzIGEgc3VwZXJpb3IgRi1zdGF0aXN0aWMgYW5kIHRoZSBSXjIgdmFsdWUgc2F5cyBpdCBleHBsYWlucyA3NSUgcmF0aGVyIHRoYW4gMjUlIG9mIHRoZSB2YXJpYW5jZSBpbiB0aGUgbW9kZWwuIEhvd2V2ZXIsIHRoZSByZXNpZHVhbCBzdGFuZGFyZCBlcnJvciBpcyBhc3Ryb25vbWljYWwsIHBhcnRpY3Vhcmx5IGdpdmVuIHRoZSBzY2FsZSBvZiBvdXIgdHJhbnNmb3JtZWQgZGF0YS4gVXNpbmcgdGhlIHJlc3VsdHMgZnJvbSB0aGlzIHNlY29uZCwgbW9kZWwsIHdlIGNhbiBwcmVkaWN0IGxpZmUgZXhwZWN0YW5jeSBhcyBiZWxvdy4KYGBge3J9CnRvdGFsLmV4cDEgPSAxLjUKdG90YWwuZXhwMiA9IDIuNQpsaWZlLmV4cGVjdGFuY3kxID0gLTczNjUyNzkxMCArIHRvdGFsLmV4cDEgKiA2MjAwNjAyMTYKbGlmZS5leHBlY3RhbmN5MiA9IC03MzY1Mjc5MTAgKyB0b3RhbC5leHAyICogNjIwMDYwMjE2CmxpZmUuZXhwZWN0YW5jeTEqKigxLzQuNikKbGlmZS5leHBlY3RhbmN5MioqKDEvNC42KQpgYGAKVGhlIGZpcnN0IHNjZW5hcmlvIHByZWRpY3RzIDYzIHllYXJzIGFuZCB0aGUgc2Vjb25kIHByZWRpY3QgODYuNS4KCkJlbG93LCBJIGJ1aWxkIGEgbXVsdGlwbGUgcmVncmVzc2lvbiBtb2RlbCBhcyBiZWxvdwpgYGB7cn0KbW9kZWwzIDwtIGxtKGRmJExpZmVFeHAgfiBkZiRQcm9wTUQrZGYkVG90RXhwK2RmJFByb3BNRCpkZiRUb3RFeHApCnRoaXMgPC0gc3VtbWFyeShtb2RlbDMpCnRoaXMKbWVhbihkZiRUb3RFeHApCmBgYApUaGlzIHRoaXJkIG1vZGVsIGV4cGxhaW5zIDM1JSBvZiB0aGUgdmFyaWFuY2UsIGJ1dCBoYXMgYSBzaW1pbGFyIEYtc3RhdGlzdGljIGFzIG1vZGVsIDMuIFRoaXMgbW9kZWwgYXBwZWFycyB0byBiZSBvdmVyZml0dGluZy4KCmBgYHtyfQpiMCA8LSB0aGlzJGNvZWZmaWNpZW50c1sxXQoKYjEgPC0gdGhpcyRjb2VmZmljaWVudHNbMl0KYjIgPC0gdGhpcyRjb2VmZmljaWVudHNbM10KYjMgPC0gdGhpcyRjb2VmZmljaWVudHNbNF0KCmxpZmUuZXhwIDwtIGIwICsgYjEqLjAzICsgYjIgKjE0ICsgYjMqMTQqLjAzCmxpZmUuZXhwCmBgYAoKVGhpcyBkb2VzIG5vdCBzZWVtIHJlYWxpc2l0aWMgYmVjYXVzZSAxMDcgaXMgbm90IGEgcmVhbCBsaWZlIGV4cGVjdGFuY3kgYW55d2hlcmUsIGdvdmVybm1lbnQgYW5kIHBlcnNvbmFsIGV4cGVuZGl0dXJlcyBhcmUgb25seSAkMTQgYW5kIDEvMzAgcGVvcGxlIGFyZSBkb2N0b3JzIGluIHRoaXMgc2NlbmFyaW8u