The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.
who <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DATA605_homework/master/data605_week12/who.csv?token=AX_Wu-t3_xslaF9_0F270eMa2nmi-hwEks5aG1hawA%3D%3D")
plot(who$TotExp,who$LifeExp)
From the plot, it is hard to see a linear trend between the average life expectancy for the country (LifeExp) and the sum of personal and government expenditures (TotExp).
Simple linear regression
lifeexp.lm <- lm(who$LifeExp ~ who$TotExp, data=who)
summary(lifeexp.lm)
##
## Call:
## lm(formula = who$LifeExp ~ who$TotExp, data = who)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -24.764  -4.778   3.154   7.116  13.292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## who$TotExp  6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
ANOVA
lifeexp_lm_anova <- anova(lifeexp.lm)
lifeexp_lm_anova
## Analysis of Variance Table
##
## Response: who$LifeExp
##             Df  Sum Sq Mean Sq F value    Pr(>F)
## who$TotExp   1  5731.3  5731.3  65.264 7.714e-14 ***
## Residuals  188 16509.5    87.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
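Linear regression model: \(\widehat{LifeExp} = 64.75 + 6.297 \times 10^{-5} \times TotExp\)
F value: 65.264
R^2: 0.2577
standard error of the TotExp coefficient: 7.795e-06
p-value: 7.714e-14
The R^2 of 0.2577 means the model explains only about 26% of the variance in LifeExp, so the linear relationship between LifeExp and TotExp is weak, even though the small p-value shows that the slope is significantly different from zero.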
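The F value and R^2 reported above can be recovered directly from the sums of squares in the ANOVA table. The following is a minimal sketch of that arithmetic; the variable names are illustrative and the numbers are copied (rounded as printed) from the table above.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_model <- 5731.3    # Sum Sq for who$TotExp
ss_resid <- 16509.5   # Sum Sq for Residuals
df_model <- 1
df_resid <- 188

# R^2 is the share of the total sum of squares explained by the model.
r_squared <- ss_model / (ss_model + ss_resid)

# The F value is the ratio of the model mean square to the residual mean square.
f_value <- (ss_model / df_model) / (ss_resid / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_model, df_resid, lower.tail = FALSE)

round(r_squared, 4)
f_value
p_value
```

Because the inputs are rounded, the results agree with the summary output only up to rounding.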
Residual Analysis
Use the residual analysis techniques to check the validity of the assumptions.
plot(fitted(lifeexp.lm),resid(lifeexp.lm))
abline(0,0)
qqnorm(resid(lifeexp.lm))
qqline(resid(lifeexp.lm))
In the residual plot, the residuals are not uniformly scattered about zero, and they tend to decrease moving to the right. Overall, this plot tells us that using the sum of personal and government expenditures as the sole predictor in the regression model does not sufficiently explain the data.
The Q-Q plot shows that the residuals do not follow the reference line closely: the two ends diverge significantly from it. This behavior indicates that the residuals are not normally distributed, again suggesting that using only the sum of personal and government expenditures as a predictor in the model is insufficient to explain the data.
We may be able to construct a model that produces tighter residual values and better predictions by including more predictors.
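The visual Q-Q check can be complemented with a formal normality test on the residuals. The sketch below uses the Shapiro-Wilk test from base R; it is an addition to the analysis above, not part of it, and it runs on R's built-in cars data set as a stand-in (an assumption made only so the example is self-contained without who.csv).

```r
# Stand-in data: R's built-in `cars` data set, used only so the sketch runs
# without who.csv. Substitute lifeexp.lm for `fit` to test the model above.
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal.
sw <- shapiro.test(resid(fit))

sw$statistic
sw$p.value   # a small p-value is evidence against normally distributed residuals
```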
# raise LifeExp to the 4.6 power and TotExp to the 0.06 power
t_LifeExp <- who$LifeExp^4.6
t_TotExp <- who$TotExp^0.06
plot(t_TotExp, t_LifeExp)
From the plot, we can see a clear positive trend between the transformed average life expectancy for the country (t_LifeExp) and the transformed sum of personal and government expenditures (t_TotExp).
Simple linear regression
tlifeexp.lm <- lm(t_LifeExp ~ t_TotExp, data=who)
summary(tlifeexp.lm)
##
## Call:
## lm(formula = t_LifeExp ~ t_TotExp, data = who)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -308616089  -53978977   13697187   59139231  211951764
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## t_TotExp     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
ANOVA
tlifeexp_lm_anova <- anova(tlifeexp.lm)
tlifeexp_lm_anova
## Analysis of Variance Table
##
## Response: t_LifeExp
##            Df     Sum Sq    Mean Sq F value    Pr(>F)
## t_TotExp    1 4.1575e+18 4.1575e+18   507.7 < 2.2e-16 ***
## Residuals 188 1.5395e+18 8.1889e+15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
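Linear regression model: \(\widehat{LifeExp^{4.6}} = -736527910 + 620060216 \times TotExp^{0.06}\)
F value: 507.7
R^2: 0.7298
standard error of the t_TotExp coefficient: 27518940
p-value: < 2.2e-16
The R^2 of 0.7298 means the model explains about 73% of the variance in the transformed response, which indicates a strong linear relationship between LifeExp^4.6 and TotExp^0.06.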
Residual Analysis
Use the residual analysis techniques to check the validity of the assumptions.
plot(fitted(tlifeexp.lm),resid(tlifeexp.lm))
abline(0,0)
qqnorm(resid(tlifeexp.lm))
qqline(resid(tlifeexp.lm))
The residuals are scattered about zero with roughly uniform variance.
The Q-Q plot shows that the residuals follow the reference line.
Taken together, the model based on the transformed variables is the better of the two.
# split the dataset into train and test sets
rows <- nrow(who)
f <- 0.5
upper_bound <- floor(f * rows)
# sample() is not seeded here, so the split changes from run to run
permuted_who.dat <- who[sample(rows), ]
train.dat <- permuted_who.dat[1:upper_bound, ]
test.dat <- permuted_who.dat[(upper_bound+1):rows, ]
# train the regression model on the transformed variables as in #2;
# the transformations are written inside the formula with I() so that
# predict() re-applies them to the columns of test.dat
tnlifeexp.lm <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data=train.dat)
# use the trained model to compute the predicted outputs (on the LifeExp^4.6 scale) for the test set
predicted.dat <- predict(tnlifeexp.lm, newdata=test.dat)
# find the difference between the predicted and measured values
delta <- predicted.dat - test.dat$LifeExp^4.6
# use a t-test to see how well the model trained on train.dat predicts life expectancy for the countries in test.dat
t.test(delta, conf.level = 0.95)
##
## One Sample t-test
##
## data: delta
## t = 0.65251, df = 94, p-value = 0.5157
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -31902844 63136054
## sample estimates:
## mean of x
## 15616605
We obtain a 95 percent confidence interval of [-31902844, 63136054] for the mean prediction error on the LifeExp^4.6 scale. The interval includes zero, so the mean prediction error is not significantly different from zero (p-value = 0.5157). Thus, we conclude that the model is reasonably good at predicting values in the test.dat data set when trained on the train.dat data set.
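The confidence interval printed by t.test() can be cross-checked from the reported mean, t statistic, and degrees of freedom. This is a small sketch of that check; the variable names are illustrative and the inputs are copied from the output above.

```r
# Values copied from the one-sample t-test output above.
mean_delta <- 15616605
t_stat <- 0.65251
df <- 94

# t = mean / standard error, so the standard error can be backed out.
se <- mean_delta / t_stat

# 95% interval: mean +/- t quantile * standard error.
half_width <- qt(0.975, df) * se
ci <- c(mean_delta - half_width, mean_delta + half_width)
ci
```

The result should agree with the interval in the output up to the rounding of the printed t statistic.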
When TotExp^.06 =1.5
LifeExp_1.5 <- -736527910 + 620060216 * 1.5
cat("life expectancy when TotExp^.06 =1.5 is ",LifeExp_1.5^(1/4.6))
## life expectancy when TotExp^.06 =1.5 is 63.31153
When TotExp^.06 =2.5
LifeExp_2.5 <- -736527910 + 620060216 * 2.5
cat("life expectancy when TotExp^.06 =2.5 is ",LifeExp_2.5^(1/4.6))
## life expectancy when TotExp^.06 =2.5 is 86.50645
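The two forecasts above follow the same steps: evaluate the fitted line on the transformed scale, then undo the 4.6 power. As a minimal sketch, these steps can be wrapped in a helper that takes TotExp on its original scale. The function name predict_life_exp and the other variable names are hypothetical; the coefficients are those of the transformed model fitted on the full data set.

```r
# Coefficients of the transformed model, copied from the summary output.
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: forecast LifeExp (in years) from TotExp (in US dollars).
# Returns NaN when b0 + b1 * tot_exp^0.06 is negative.
predict_life_exp <- function(tot_exp) {
  transformed <- b0 + b1 * tot_exp^0.06   # prediction on the LifeExp^4.6 scale
  transformed^(1 / 4.6)                   # back-transform to years
}

# TotExp values whose 0.06 power equals 1.5 and 2.5.
tot_exp_1.5 <- 1.5^(1 / 0.06)
tot_exp_2.5 <- 2.5^(1 / 0.06)

preds <- predict_life_exp(c(tot_exp_1.5, tot_exp_2.5))
preds
```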
\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
Multiple regression model
mllifeexp.lm <- lm(LifeExp ~ PropMD+TotExp+(PropMD*TotExp), data=who)
summary(mllifeexp.lm )
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -27.320  -4.132   2.098   6.540  13.074
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
ANOVA
mllifeexp_lm_anova <- anova(mllifeexp.lm)
mllifeexp_lm_anova
## Analysis of Variance Table
##
## Response: LifeExp
##                Df  Sum Sq Mean Sq F value    Pr(>F)
## PropMD          1  2966.5  2966.5  38.610 3.313e-09 ***
## TotExp          1  3696.2  3696.2  48.106 6.471e-11 ***
## PropMD:TotExp   1  1286.9  1286.9  16.749 6.353e-05 ***
## Residuals     186 14291.1    76.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
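Linear regression model: \(\widehat{LifeExp} = 62.77 + 1497 \times PropMD + 7.233 \times 10^{-5} \times TotExp - 6.026 \times 10^{-3} \times PropMD \times TotExp\)
F-statistic: 34.49 on 3 and 186 DF
R^2: 0.3574
residual standard error: 8.765
p-value: < 2.2e-16
The R^2 of 0.3574 means the model explains only about 36% of the variance in LifeExp. This is higher than for the single-predictor model on the untransformed data (0.2577), but the fit is still weak, even though all three terms are statistically significant.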
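Because of the interaction term, the fitted effect of TotExp on LifeExp is not a single number: it equals b2 + b3 x PropMD and so depends on the proportion of MDs. The sketch below works this out from the printed coefficients; the names totexp_slope and prop_md_zero_slope are illustrative.

```r
# Coefficients copied from the summary output above (rounded as printed).
b_totexp <- 7.233e-05        # TotExp
b_interaction <- -6.026e-03  # PropMD:TotExp

# Fitted change in LifeExp per extra dollar of TotExp at a given PropMD.
totexp_slope <- function(prop_md) b_totexp + b_interaction * prop_md

totexp_slope(0)      # with PropMD = 0 the slope is just the TotExp coefficient
totexp_slope(0.03)   # negative: the interaction outweighs the main effect

# PropMD at which the fitted TotExp slope changes sign.
prop_md_zero_slope <- -b_totexp / b_interaction
prop_md_zero_slope
```

Under this fit, the TotExp slope turns negative once PropMD exceeds roughly 0.012.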
Residual Analysis
Use the residual analysis techniques to check the validity of the assumptions.
plot(fitted(mllifeexp.lm),resid(mllifeexp.lm))
abline(0,0)
qqnorm(resid(mllifeexp.lm))
qqline(resid(mllifeexp.lm))
In the residual plot, the residuals are not uniformly scattered about zero, and they tend to decrease moving to the right. Overall, this plot tells us that using the sum of personal and government expenditures, the proportion of the population who are MDs, and the product of these two variables as the predictors in the regression model does not sufficiently explain the data.
The Q-Q plot shows that the residuals do not follow the reference line closely: the two ends diverge significantly from it. This behavior indicates that the residuals are not normally distributed, suggesting that using the sum of personal and government expenditures, the proportion of the population who are MDs, and the product of these two variables as predictors in the model is insufficient to explain the data.
We may be able to construct a model that produces tighter residual values and better predictions by including more predictors.
lifeexp <- 62.77 + 1497 * 0.03 + 7.233e-05 * 14 - 6.026e-03 * 0.03 * 14
The forecast for PropMD = 0.03 and TotExp = 14, an average life expectancy of about 107.7 years, does not seem realistic, since it is well above the life expectancy of any country. It is not hard to understand that a regression model whose assumptions are not met is not appropriate for predicting the average life expectancy for a country.