1. Download the dataset from the source.
lifexp <- read.table("https://raw.githubusercontent.com/angus001/Data605/master/Assign12_rawdata.csv", header = TRUE, sep = ",")
colnames(lifexp)
## [1] "Country" "LifeExp" "InfantSurvival" "Under5Survival"
## [5] "TBFree" "PropMD" "PropRN" "PersExp"
## [9] "GovtExp" "TotExp"
head(lifexp)
## Country LifeExp InfantSurvival Under5Survival TBFree
## 1 Afghanistan 42 0.835 0.743 0.99769
## 2 Albania 71 0.985 0.983 0.99974
## 3 Algeria 71 0.967 0.962 0.99944
## 4 Andorra 82 0.997 0.996 0.99983
## 5 Angola 41 0.846 0.740 0.99656
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991
## PropMD PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294 20 92 112
## 2 0.001143127 0.004614439 169 3128 3297
## 3 0.001060478 0.002091362 108 5184 5292
## 4 0.003297297 0.003500000 2589 169725 172314
## 5 0.000070400 0.001146162 36 1620 1656
## 6 0.000142857 0.002773810 503 12543 13046
pairs(lifexp,gap=0.5, col = 'navyblue')
lifexplm <- lm(LifeExp ~ TotExp, data = lifexp )
summary(lifexplm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = lifexp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.669 -4.096 3.443 7.260 13.450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.466e+01 9.070e-01 71.291 < 2e-16 ***
## TotExp 6.093e-05 9.484e-06 6.424 1.99e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.763 on 138 degrees of freedom
## Multiple R-squared: 0.2302, Adjusted R-squared: 0.2246
## F-statistic: 41.27 on 1 and 138 DF, p-value: 1.993e-09
{plot(lifexp$TotExp, lifexp$LifeExp, main = "Life Expectancy vs. Total Medical Expenditure",
col = 'navyblue', pch = 16, xlab = "Total Expenditure (Adjusted Dollars)", ylab = "Life Expectancy (Years)")
abline(lifexplm, col = "red")}
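As a hedged aside, the R-squared and F-statistic printed by `summary()` can be recomputed by hand from the residuals. The sketch below uses only the six rows shown by `head(lifexp)` (an assumption made to keep the example self-contained, so its numbers differ from the full-data fit); `toy`, `fit`, `r2_manual`, and `f_manual` are illustrative names.

```r
# Six rows copied from the head(lifexp) output; a toy subset, not the full data.
toy <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)
fit <- lm(LifeExp ~ TotExp, data = toy)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((toy$LifeExp - mean(toy$LifeExp))^2)
r2_manual <- 1 - ss_res / ss_tot

# With a single predictor, F = R^2 / (1 - R^2) * (n - 2)
n <- nrow(toy)
f_manual <- r2_manual / (1 - r2_manual) * (n - 2)

# Both hand-computed values should agree with what summary() reports
all.equal(r2_manual, summary(fit)$r.squared)
all.equal(f_manual, unname(summary(fit)$fstatistic["value"]))
```

Applying the same identity to the full-data output above gives 0.2302 / (1 - 0.2302) * 138 ≈ 41.27, which matches the reported F-statistic.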
Question 2. Raise life expectancy to the 4.6 power and TotExp to the 0.06 power, then fit another linear regression on the two transformed variables.
The regression line fits the transformed scatterplot much more closely than before. R-squared rises from 0.23 to 0.71, so the model now explains about 71 percent of the variance in transformed life expectancy across countries. The F-statistic increases from 41.27 to 338.7, and its p-value (< 2.2e-16) indicates that the relationship is statistically significant.
lifexp$LifeExp46 <- lifexp$LifeExp^4.6
lifexp$TotExp06 <- lifexp$TotExp^0.06
lifexplm2 <- lm(LifeExp46 ~ TotExp06, data = lifexp)
summary(lifexplm2)
##
## Call:
## lm(formula = LifeExp46 ~ TotExp06, data = lifexp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -306134143 -57556675 17946146 62597037 210341349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -720554749 56186881 -12.82 <2e-16 ***
## TotExp06 609874086 33137634 18.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 92060000 on 138 degrees of freedom
## Multiple R-squared: 0.7105, Adjusted R-squared: 0.7084
## F-statistic: 338.7 on 1 and 138 DF, p-value: < 2.2e-16
{plot(lifexp$TotExp06, lifexp$LifeExp46, main = "Version 2 Life Expectancy vs. Total Medical Expenditure",
col = 'navyblue', pch = 16, xlab = "Total Expenditure (Raised to the 0.06 Power)", ylab = "Life Expectancy (Raised to the 4.6 Power)")
abline(lifexplm2, col = "red")
}
colnames(lifexp)
## [1] "Country" "LifeExp" "InfantSurvival" "Under5Survival"
## [5] "TBFree" "PropMD" "PropRN" "PersExp"
## [9] "GovtExp" "TotExp" "LifeExp46" "TotExp06"
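Because the response was raised to the 4.6 power, a fitted value from this model has to be raised to the power 1/4.6 to get back to years. The sketch below plugs in the coefficients printed by `summary(lifexplm2)`; the input `x <- 1.5` (a value of TotExp^0.06) and the helper name `predict_lifeexp` are hypothetical choices for illustration.

```r
# Coefficients copied from the summary(lifexplm2) output
b0 <- -720554749   # intercept
b1 <-  609874086   # slope on TotExp^0.06

# Hypothetical helper: predict life expectancy in years from TotExp^0.06
predict_lifeexp <- function(totexp06) {
  lifeexp46 <- b0 + b1 * totexp06  # prediction on the transformed scale
  lifeexp46^(1 / 4.6)              # undo the 4.6 power
}

x <- 1.5  # assumed example value of TotExp^0.06
predict_lifeexp(x)
```

The back-transform is only defined when the transformed prediction is positive, which for these coefficients means TotExp^0.06 above roughly 1.18.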
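With multiple regression, the predictors together explain about 92% of the variance in life expectancy (R-squared = 0.9229). TotExp and GovtExp have large p-values (0.17 and 0.18), so they are removed in the subsequent backward elimination. PersExp has no estimate in the full model because TotExp is the sum of PersExp and GovtExp, which makes the three columns linearly dependent; once TotExp and GovtExp are dropped, PersExp becomes estimable and is highly significant. Most residuals lie within a few years of zero (the quartiles are about -1.2 and 1.8), although the largest negative residual is about -15.8 years.
The residual Q-Q plot also follows an approximately straight line, so the model appears reasonably robust.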
lifexplm3 <-lm(LifeExp ~ TotExp+GovtExp+InfantSurvival+Under5Survival+TBFree+PropMD+PropRN+PersExp, data = lifexp )
summary(lifexplm3)
##
## Call:
## lm(formula = LifeExp ~ TotExp + GovtExp + InfantSurvival + Under5Survival +
## TBFree + PropMD + PropRN + PersExp, data = lifexp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8274 -1.1990 0.4179 1.7742 6.8482
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.073e+03 1.611e+02 -6.658 6.79e-10 ***
## TotExp 1.093e-03 7.970e-04 1.372 0.1725
## GovtExp -1.085e-03 8.090e-04 -1.342 0.1820
## InfantSurvival 1.094e+02 4.878e+01 2.242 0.0267 *
## Under5Survival 5.525e+01 2.870e+01 1.925 0.0564 .
## TBFree 9.836e+02 1.668e+02 5.898 2.91e-08 ***
## PropMD 6.546e+02 2.606e+02 2.512 0.0132 *
## PropRN -3.257e+02 1.311e+02 -2.485 0.0142 *
## PersExp NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.16 on 132 degrees of freedom
## Multiple R-squared: 0.9229, Adjusted R-squared: 0.9188
## F-statistic: 225.7 on 7 and 132 DF, p-value: < 2.2e-16
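The `NA` row for PersExp and the note "1 not defined because of singularities" arise when one predictor is an exact linear combination of others. A minimal check, assuming the six rows shown by `head(lifexp)` are representative of the full data, is to compare TotExp with PersExp + GovtExp; the data frame name `toy` is illustrative.

```r
# Columns copied from the head(lifexp) output; a toy subset, not the full data.
toy <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  PersExp = c(20, 169, 108, 2589, 36, 503),
  GovtExp = c(92, 3128, 5184, 169725, 1620, 12543),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# If this holds for every row, TotExp carries no information beyond the other two
all(toy$TotExp == toy$PersExp + toy$GovtExp)

# lm() cannot estimate all three coefficients; the last dependent one comes back NA
coef(lm(LifeExp ~ TotExp + GovtExp + PersExp, data = toy))
```

Dropping any one of the three expenditure columns removes the dependency, which is why PersExp gets an estimate once TotExp and GovtExp leave the model.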
Remove the variables with the largest p-values (TotExp and GovtExp, both with p > 0.1).
# Drop both non-significant expenditure terms in a single update
lifexplm4 <- update(lifexplm3, . ~ . - TotExp - GovtExp)
summary(lifexplm4)
##
## Call:
## lm(formula = LifeExp ~ InfantSurvival + Under5Survival + TBFree +
## PropMD + PropRN + PersExp, data = lifexp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8016 -1.1772 0.4368 1.8917 6.8701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.070e+03 1.607e+02 -6.659 6.62e-10 ***
## InfantSurvival 1.099e+02 4.866e+01 2.258 0.0256 *
## Under5Survival 5.467e+01 2.862e+01 1.910 0.0582 .
## TBFree 9.808e+02 1.663e+02 5.898 2.88e-08 ***
## PropMD 6.583e+02 2.599e+02 2.533 0.0125 *
## PropRN -3.184e+02 1.302e+02 -2.445 0.0158 *
## PersExp 1.555e-03 2.639e-04 5.891 2.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.152 on 133 degrees of freedom
## Multiple R-squared: 0.9227, Adjusted R-squared: 0.9192
## F-statistic: 264.4 on 6 and 133 DF, p-value: < 2.2e-16
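The manual removal above can also be automated. The following is only a sketch of p-value-based backward elimination, not the procedure used in this report: the helper `backward_eliminate` and the cutoff `alpha = 0.05` are hypothetical, it assumes all predictors are numeric, and it runs on R's built-in `mtcars` data as a stand-in so that the example is self-contained.

```r
# Hypothetical helper: repeatedly drop the predictor with the largest p-value
# until every remaining p-value is at or below alpha.
# Assumes numeric predictors, so coefficient names match term names.
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    terms_left <- rownames(coefs)[-1]   # skip the intercept row
    pvals <- coefs[-1, "Pr(>|t|)"]
    if (length(pvals) == 0 || max(pvals) <= alpha) break
    worst <- terms_left[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Stand-in data: mtcars ships with R and is not the data used in this report
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- backward_eliminate(full)
summary(reduced)$coefficients
```

Dropping one term at a time matters because p-values change after each refit, as PersExp shows above once the collinear expenditure terms are removed.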
Residual analysis.
plot(fitted(lifexplm4), resid(lifexplm4), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")
qqnorm(resid(lifexplm4), col = "blue")
qqline(resid(lifexplm4), col = "red")
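A Q-Q plot is a visual check; one numeric complement is the Shapiro-Wilk test, `shapiro.test()` in base R's `stats` package. The sketch below uses the built-in `mtcars` data as a stand-in so that it runs without downloading anything; with the objects from this report the call would be `shapiro.test(resid(lifexplm4))`.

```r
# Stand-in model on mtcars (ships with R); not the model fitted in this report
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Null hypothesis: the residuals come from a normal distribution.
# A small p-value (say, below 0.05) is evidence against normality.
sw <- shapiro.test(resid(fit))
sw$statistic
sw$p.value
```

The test and the plot answer slightly different questions, so it is reasonable to read them together rather than rely on either alone.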