who_data <- read.csv(text = getURL("https://raw.githubusercontent.com/charlsjoseph/data605wk12/master/who.csv"), header = T, stringsAsFactors =F)
kable(who_data[sample(nrow(who_data), 10), ], align='l', row.names=FALSE)
| Country | LifeExp | InfantSurvival | Under5Survival | TBFree | PropMD | PropRN | PersExp | GovtExp | TotExp |
|---|---|---|---|---|---|---|---|---|---|
| Brazil | 72 | 0.981 | 0.980 | 0.99945 | 0.0010466 | 0.0034814 | 371 | 13940 | 14311 |
| Namibia | 61 | 0.955 | 0.939 | 0.99342 | 0.0002921 | 0.0030020 | 165 | 3888 | 4053 |
| Belarus | 69 | 0.994 | 0.992 | 0.99929 | 0.0047587 | 0.0124571 | 204 | 11315 | 11519 |
| Turkey | 73 | 0.976 | 0.974 | 0.99968 | 0.0015694 | 0.0029448 | 383 | 18632 | 19015 |
| Armenia | 69 | 0.979 | 0.976 | 0.99920 | 0.0036987 | 0.0049189 | 88 | 1856 | 1944 |
| Chad | 46 | 0.876 | 0.791 | 0.99430 | 0.0000330 | 0.0002387 | 22 | 234 | 256 |
| Bulgaria | 73 | 0.990 | 0.988 | 0.99959 | 0.0002535 | 0.0045532 | 272 | 11550 | 11822 |
| Venezuela (Bolivarian Republic of) | 74 | 0.982 | 0.979 | 0.99948 | 0.0017653 | 0.0010298 | 247 | 10528 | 10775 |
| Slovenia | 78 | 0.997 | 0.996 | 0.99985 | 0.0023603 | 0.0078516 | 1495 | 55233 | 56728 |
| South Africa | 51 | 0.944 | 0.931 | 0.99002 | 0.0007214 | 0.0038205 | 437 | 10920 | 11357 |
summary(who_data)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455
## Median :0.9992 Median :0.0010474 Median :0.0027584
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387
## PersExp GovtExp TotExp
## Min. : 3.00 Min. : 10.0 Min. : 13
## 1st Qu.: 36.25 1st Qu.: 559.5 1st Qu.: 584
## Median : 199.50 Median : 5385.0 Median : 5541
## Mean : 742.00 Mean : 40953.5 Mean : 41696
## 3rd Qu.: 515.25 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :6350.00 Max. :476420.0 Max. :482750
who.lm <- lm(LifeExp ~ TotExp, data = who_data)
plot(who_data$TotExp, who_data$LifeExp, xlab = 'Total Expenditure - $', ylab = 'Average Life Expectancy- in Years', main='Avg. Life Expectancy Vs. Expenditure')
abline(who.lm, col="red")
Lets find the summary of linear regression and analyze the key parameters.
summary(who.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.753374534 0.753536611 85.933 < 2e-16 ***
## TotExp 0.000062970 0.000007795 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Regression model
\[ Avg Life Expectancy=64.75337453+0.00006297∗Total Expenditure \] The model explains on every unit of increase in Total expenditure, the life expenditure goes up by 0.00006297.
plot(who.lm, which=c(1,1))
The resduals are not scattererd, but having an unsual pattern which tells that relation doesnt hold the linear behaviour. P-value is very less which explains there is a strong relation between LifeExp and TotExp. Adjusted R^2 and R^2 is very less which explains the totalExp is not only an explanatory variable and model can be possibly more accurate or better.
#raise total expenditure to the power of 0.06
who_data$TotExp_1 <- (who_data$TotExp)^0.06
#raise life expectancy to the power of 4.6
who_data$LifeExp_1 <- (who_data$LifeExp)^4.6
#build new linear model
who_1.lm <- lm(LifeExp_1 ~ TotExp_1, data = who_data)
plot(who_data$TotExp_1, who_data$LifeExp_1, xlab = 'Total Expenditure(in US Dollars) raised to 0.06', ylab = 'Average Life Expectancy(in Years) raised to 4.6', main='Avg. Life Expectancy Vs. Expenditure')
abline(who_1.lm, col="red")
This explains a better linear relation. Lets check the model key paramaters.
summary(who_1.lm)
##
## Call:
## lm(formula = LifeExp_1 ~ TotExp_1, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_1 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Regression model is
\[LifeExp1=−736527909+620060216∗TotExp1 \]
plot(who_1.lm, which=c(1,1))
The residual looks to be scatter and model looks to follow the linear pattern. Adjusted R^2 and R^2 looks to be closer to 1 and better than previous model.
result1 <- (-736527909 + (620060216 * 1.5))^(1/4.6)
result2 <- (-736527909 + (620060216 * 2.5))^(1/4.6)
result1
## [1] 63.31153
result2
## [1] 86.50645
Estimated life expectancy is 63.31153 years when TotExp(0.06) is 1.5. Estimated life expectancy is 86.50645 years when TotExp(0.06) is 2.5.
who_2.lm <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
summary(who_2.lm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.772703255 0.795605238 78.899 < 2e-16 ***
## PropMD 1497.493952519 278.816879652 5.371 2.32e-07 ***
## TotExp 0.000072333 0.000008982 8.053 9.39e-14 ***
## PropMD:TotExp -0.006025686 0.001472357 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Regression model is \[ Average Life Expectancy=62.77270326+1497.49395252×PropMD+0.00007233×TotExp−0.00602569×PropMD∗TotExp \]
p-Value for PropMD, TotExp and PropMD * TotExp are highly relevant to estimate average life expectancy. R2 and Adj. R2 values decreased compared to earlier
TotExp = 14
PropMD = 0.03
forcast.life.exp = 62.77270326 + (1497.49395252 * PropMD) + (0.00007233 * TotExp) - (0.00602569 * PropMD * TotExp)
forcast.life.exp
## [1] 107.696
The result doesnt looks to be realistic. This could be due to in-efficiency of the model variables.