Data set :
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Reading data :
who <- read.csv("who.csv")
head(who)
## Country LifeExp InfantSurvival Under5Survival TBFree
## 1 Afghanistan 42 0.835 0.743 0.99769
## 2 Albania 71 0.985 0.983 0.99974
## 3 Algeria 71 0.967 0.962 0.99944
## 4 Andorra 82 0.997 0.996 0.99983
## 5 Angola 41 0.846 0.740 0.99656
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991
## PropMD PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294 20 92 112
## 2 0.001143127 0.004614439 169 3128 3297
## 3 0.001060478 0.002091362 108 5184 5292
## 4 0.003297297 0.003500000 2589 169725 172314
## 5 0.000070400 0.001146162 36 1620 1656
## 6 0.000142857 0.002773810 503 12543 13046
3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5
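# x is the value of TotExp^.06; the coefficients come from the earlier transformed model
life_exp <- function(x) {
  y <- -736527910 + 620060216 * x  # fitted value on the transformed scale
  y^(1 / 4.6)                      # invert the 4.6 power to get years
}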
#Life expectancy when TotExp^.06 =1.5
life_exp(1.5)
## [1] 63.31153
#Life expectancy when TotExp^.06 =2.5
life_exp(2.5)
## [1] 86.50645
Conclusion:
When TotExp^.06 = 1.5, the forecast life expectancy is 63.31 years, and when TotExp^.06 = 2.5, it is 86.51 years.
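The arithmetic behind these two forecasts can be written out as a worked equation. This is a sketch that assumes, judging from the code above, that the fitted model has a 4.6-power response and TotExp^.06 as its predictor; the symbols b_0, b_1 and x are labels introduced here for illustration.

```latex
\begin{aligned}
\widehat{\text{LifeExp}} &= \left(b_0 + b_1 x\right)^{1/4.6}, \qquad x = \text{TotExp}^{0.06} \\
x = 1.5:\quad \widehat{\text{LifeExp}} &= \left(-736527910 + 620060216 \cdot 1.5\right)^{1/4.6} = 193562414^{1/4.6} \approx 63.31 \\
x = 2.5:\quad \widehat{\text{LifeExp}} &= \left(-736527910 + 620060216 \cdot 2.5\right)^{1/4.6} = 813622630^{1/4.6} \approx 86.51
\end{aligned}
```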
4. Build the following multiple regression model and interpret the F-statistic, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
REGRESSION MODEL :
m3 = lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who)
# scatter plot with the regression line from model m1 overlaid
ggplot(m3, aes(TotExp, LifeExp)) + geom_point(colour = "orange", size = 2) +
  geom_abline(slope = round(m1$coefficients[2], 4), intercept = round(m1$coefficients[1], 4)) +
  labs(title = "Total Expenditures vs. Life Expectancy", x = "Total Expenditures", y = "Life Expectancy")

RESIDUAL PLOT :
# residual plot
ggplot(m3, aes(.fitted, .resid)) + geom_point(color = "red", size=2) +labs(title = "Fitted Values vs Residuals") +labs(x = "Fitted Values") +labs(y = "Residuals")

# normal plot
qqnorm(resid(m3))
qqline(resid(m3))

CONCLUSION :
summary(m3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
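The F-statistic is 34.49 on 3 and 186 degrees of freedom, and its p-value is again effectively 0 (< 2.2e-16), so the model as a whole is statistically significant. Each coefficient, including the PropMD:TotExp interaction, has a p-value well below 0.05. The residual standard error is 8.765. The R^2 is 0.3574 (adjusted 0.3471), so the model explains only about 35.74% of the variability in life expectancy.
In this new model, the residual plot and the Q-Q plot indicate that the residuals are not normally distributed. Combined with the low R^2, this means the model does not describe the relationship between TotExp, PropMD and LifeExp well.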
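As a rough consistency check on the figures quoted above, the F-statistic and adjusted R^2 can be recomputed from the reported R^2 and degrees of freedom. This is a minimal sketch that uses only the rounded values printed by summary(m3), so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied from the summary(m3) output (rounded as printed)
r2       <- 0.3574  # multiple R-squared
f_stat   <- 34.49   # reported F-statistic
p        <- 3       # predictors: PropMD, TotExp, PropMD:TotExp
df_resid <- 186     # residual degrees of freedom

# F-statistic implied by R-squared: (R2 / p) / ((1 - R2) / df_resid)
f_from_r2 <- (r2 / p) / ((1 - r2) / df_resid)

# Adjusted R-squared; note that n - 1 = p + df_resid
adj_r2 <- 1 - (1 - r2) * (p + df_resid) / df_resid

# Upper-tail p-value of the overall F-test
p_value <- pf(f_stat, p, df_resid, lower.tail = FALSE)

print(round(c(f_from_r2 = f_from_r2, adj_r2 = adj_r2), 4))
print(p_value)
```

Both recomputed values should agree with the printed summary to within rounding. This confirms that the significant F-test and the modest R^2 describe the same fit: the predictors clearly matter, yet most of the variation in LifeExp is left unexplained.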
5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
summary(m3)$coefficients[1] + .03* summary(m3)$coefficients[2] + 14*summary(m3)$coefficients[3] + (.03*14)*summary(m3)$coefficients[4]
## [1] 107.696
CONCLUSION :
When PropMD = 0.03 and TotExp = 14, the forecast LifeExp is 107.70 years. This is unrealistic: the highest life expectancy in the dataset is 83 years.
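To see where the 107.7 years comes from, the forecast can be split into the contribution of each term. This is a minimal sketch that hard-codes the rounded coefficients printed by summary(m3), so the total differs slightly from the value computed from the fitted object; the names `b` and `forecast_terms` are hypothetical helpers introduced here.

```r
# Coefficients as printed by summary(m3) (rounded, so totals are approximate)
b <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
       TotExp = 7.233e-05, interaction = -6.026e-03)

# Contribution of each model term to the forecast
forecast_terms <- function(prop_md, tot_exp) {
  c(intercept   = unname(b["intercept"]),
    PropMD      = unname(b["PropMD"]) * prop_md,
    TotExp      = unname(b["TotExp"]) * tot_exp,
    interaction = unname(b["interaction"]) * prop_md * tot_exp)
}

terms <- forecast_terms(0.03, 14)
print(round(terms, 4))
forecast <- sum(terms)
print(forecast)
```

Almost all of the increase over the intercept comes from the PropMD term. PropMD = 0.03 is roughly nine times the largest PropMD value among the six rows shown by head(who), so the forecast is probably an extrapolation well beyond the data. With the fitted model in memory, `predict(m3, newdata = data.frame(PropMD = 0.03, TotExp = 14))` should give the same forecast without copying coefficients by hand.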