real-world data from 2008. The variables included follow:
who <- read_csv("who.csv")
## Parsed with column specification:
## cols(
## Country = col_character(),
## LifeExp = col_double(),
## InfantSurvival = col_double(),
## Under5Survival = col_double(),
## TBFree = col_double(),
## PropMD = col_double(),
## PropRN = col_double(),
## PersExp = col_double(),
## GovtExp = col_double(),
## TotExp = col_double()
## )
head(who)
who %>% ggplot(mapping = aes(x=TotExp, y = jitter(LifeExp))) +
geom_point() +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
scale_y_continuous( labels = scales::number)+ ggtitle("LifeExp~TotExp") +
ylab("LifeExp")
#simple linear regression
lm_slt<- lm( LifeExp~TotExp,who)
summary(lm_slt)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
par(mfrow=c(2,2))
plot(lm_slt)
par(mfrow=c(2,2))
hist(lm_slt$residuals, main = "Histogram of Residuals", xlab= "")
plot(lm_slt$residuals, fitted(lm_slt))
The Plot of Adjust vs Residuals are seems to be following some pattern and making this not meeting the conditon of normal distrubited Residuals. Adjusted R^2 is just 25% , which is not so much encourging as we are only able to explain 25% of the variation within the data The p-value is less than the significance level (usually 0.05) i.e. 7.714e-14, which indicate that our model fit the data well.
head(who)
who$LifeExp_p <- who$LifeExp^4.6
who$TotExp_p <- who$TotExp^.06
who %>% ggplot(mapping = aes(x=TotExp_p, y = (LifeExp_p))) +
geom_point() +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
scale_y_continuous( labels = scales::number)+ ggtitle("LifeExp~TotExp") +
ylab("LifeExp")
#simple linear regression
lm_sltp<- lm( LifeExp_p~TotExp_p,who)
summary(lm_sltp)
##
## Call:
## lm(formula = LifeExp_p ~ TotExp_p, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_p 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_sltp)
par(mfrow=c(2,2))
hist(lm_sltp$residuals, main = "Histogram of Residuals", xlab= "")
plot(lm_sltp$residuals, fitted(lm_sltp))
Surprised! the new transformation with Power gave us the very good result almost 72% of variance can be exmaplined in the data, and P-vlaue is still les than .05, and hence very good fit for our dataset. Since LifeExp_p is positive powered and very large number, we have very high standard error with 188 degree of freedom.
test_dt= data.frame(TotExp_p = 1.5)
test_result <- predict(lm_sltp,test_dt)
# logb(29305338,4.6)*10
print(paste('life expectancy when TotExp^.06 =1.5. is :->',logb(test_result,4.6)))
## [1] "life expectancy when TotExp^.06 =1.5. is :-> 12.5035429738939"
LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
lm_slt_m <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp),who)
summary(lm_slt_m)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_slt_m)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
+ All the Coefficients are looking significant + Adjusted R-squared is just 34% which says that we are only able to explain 34 % variations of the data. + P-value is well below .05 and indicating a well suited for the data.
Overall this model may not explain all the data points , as we can see from the “Residuals vs Leverage” plot that we have few important data points which are pulling the model to some extreme result and they are impacting the overall model output.
test_dt2= data.frame(PropMD=.03 ,TotExp = 14)
test_result <- predict(lm_slt_m,test_dt2)
# logb(29305338,4.6)*10
print(paste('LifeExp when PropMD=.03 and TotExp = 14. is :->',test_result))
## [1] "LifeExp when PropMD=.03 and TotExp = 14. is :-> 107.696003708062"
(who[which(who$PropMD>.029 | who$TotExp < 15),])
Based on data points , with who\(PropMD>.029 | who\)TotExp < 15 , it looks like 107 years of Life expectancy may be too much here, since sum of personal and government expenditures is less and number of MDs are less we don’t expect good life expectancy.