First, I imported the data.
df <- as.data.frame(read.csv("who.csv"))
df
We can see from the scatterplot, that this data is probably not associated linearly, although there appears to be a strong positive correlation of some kind. Additionally, in the Q-Q plot, we can see a systematic over and underestimation, with the former happening at the middle, and the latter at the tails.
plot(df$LifeExp ~ df$TotExp)
model1 <- lm(df$LifeExp ~ df$TotExp)
summary(model1)
Call:
lm(formula = df$LifeExp ~ df$TotExp)
Residuals:
Min 1Q Median 3Q Max
-24.764 -4.778 3.154 7.116 13.292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
df$TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.371 on 188 degrees of freedom
Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
plot(model1)
We can examing this further by looking at the summary of the model as below. On a 67 year life span, the residual error is 9 years. That’s a significant deviation! Additionally, this model only explains 25% of the variance, as denoted by the \(R^2\), and adjusted \(R^2\) values. When \(\alpha = .05\), we can reject the null hypothesis because the p-value of our test is \(8 \cdot 10 ^{-14}\).
mean(df$LifeExp)
summary(model1)
critical.f <- qf(p = .05, df1 = 1, df2 = 188)
critical.f
Below, I transformed the data and plotted it. From this plot, we can obsever a far more linear relationships between the new vectors.
life.exp.new <- df$LifeExp**4.6
total.expenditures.new <- df$TotExp**.06
plot(life.exp.new~total.expenditures.new)
plot(life.exp.new~ total.expenditures.new)
model2 <- lm(life.exp.new ~ total.expenditures.new)
plot(model2)
The residual plots are far more random, the Q-Q plot has a better fit near the middle, and even the scalled residuals seem to be randomly distributed. Furthermore, we can examine the summary from this second model.
summary(model2)
Call:
lm(formula = life.exp.new ~ total.expenditures.new)
Residuals:
Min 1Q Median 3Q Max
-308616089 -53978977 13697187 59139231 211951764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -736527910 46817945 -15.73 <2e-16 ***
total.expenditures.new 620060216 27518940 22.53 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 90490000 on 188 degrees of freedom
Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
This model has a superior F-statistic and the R^2 value says it explains 75% rather than 25% of the variance in the model. However, the residual standard error is astronomical, particuarly given the scale of our transformed data. Using the results from this second, model, we can predict life expectancy as below.
total.exp1 = 1.5
total.exp2 = 2.5
life.expectancy1 = -736527910 + total.exp1 * 620060216
life.expectancy2 = -736527910 + total.exp2 * 620060216
life.expectancy1**(1/4.6)
[1] 63.31153
life.expectancy2**(1/4.6)
[1] 86.50645
The first scenario predicts 63 years and the second predict 86.5.
Below, I build a multiple regression model as below
model3 <- lm(df$LifeExp ~ df$PropMD+df$TotExp+df$PropMD*df$TotExp)
this <- summary(model3)
this
Call:
lm(formula = df$LifeExp ~ df$PropMD + df$TotExp + df$PropMD *
df$TotExp)
Residuals:
Min 1Q Median 3Q Max
-27.320 -4.132 2.098 6.540 13.074
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
df$PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
df$TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
df$PropMD:df$TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.765 on 186 degrees of freedom
Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
mean(df$TotExp)
[1] 41695.49
This third model explains 35% of the variance, but has a similar F-statistic as model 3. This model appears to be overfitting.
b0 <- this$coefficients[1]
b1 <- this$coefficients[2]
b2 <- this$coefficients[3]
b3 <- this$coefficients[4]
life.exp <- b0 + b1*.03 + b2 *14 + b3*14*.03
life.exp
This does not seem realisitic because 107 is not a real life expectancy anywhere, government and personal expenditures are only $14 and 1/30 people are doctors in this scenario.