Data 605 HW 12
library(knitr)
library(rmdformats)
## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=31)
# Loading packages
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Data Dictionary
The attached who.csv dataset contains real-world data from 2008.
| Field Name | Description |
|---|---|
| Country | name of the country |
| LifeExp | average life expectancy for the country in years |
| InfantSurvival | proportion of those surviving to one year or more |
| Under5Survival | proportion of those surviving to five years or more |
| TBFree | proportion of the population without TB. |
| PropMD | proportion of the population who are MDs |
| PropRN | proportion of the population who are RNs |
| PersExp | mean personal expenditures on healthcare in US dollars at average exchange rate |
| GovtExp | mean government expenditures per capita on healthcare, US dollars at average exchange rate |
| TotExp | sum of personal and government expenditures. |
Data Prep
Country LifeExp InfantSurvival Under5Survival TBFree PropMD
1 Afghanistan 42 0.835 0.743 0.99769 0.000228841
2 Albania 71 0.985 0.983 0.99974 0.001143127
3 Algeria 71 0.967 0.962 0.99944 0.001060478
4 Andorra 82 0.997 0.996 0.99983 0.003297297
5 Angola 41 0.846 0.740 0.99656 0.000070400
PropRN PersExp GovtExp TotExp
1 0.000572294 20 92 112
2 0.004614439 169 3128 3297
3 0.002091362 108 5184 5292
4 0.003500000 2589 169725 172314
5 0.001146162 36 1620 1656
| Name | who_df |
| Number of rows | 190 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Country | 0 | 1 | FALSE | 190 | Afg: 1, Alb: 1, Alg: 1, And: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| LifeExp | 0 | 1 | 67.38 | 10.85 | 40.00 | 61.25 | 70.00 | 75.00 | 83.00 | ▂▂▃▇▅ |
| InfantSurvival | 0 | 1 | 0.96 | 0.04 | 0.84 | 0.94 | 0.98 | 0.99 | 1.00 | ▁▁▂▂▇ |
| Under5Survival | 0 | 1 | 0.95 | 0.06 | 0.73 | 0.93 | 0.97 | 0.99 | 1.00 | ▁▁▁▂▇ |
| TBFree | 0 | 1 | 1.00 | 0.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▂▇ |
| PropMD | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | ▇▁▁▁▁ |
| PropRN | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.07 | ▇▁▁▁▁ |
| PersExp | 0 | 1 | 742.00 | 1354.00 | 3.00 | 36.25 | 199.50 | 515.25 | 6350.00 | ▇▁▁▁▁ |
| GovtExp | 0 | 1 | 40953.49 | 86140.65 | 10.00 | 559.50 | 5385.00 | 25680.25 | 476420.00 | ▇▁▁▁▁ |
| TotExp | 0 | 1 | 41695.49 | 87449.85 | 13.00 | 584.00 | 5541.00 | 26331.00 | 482750.00 | ▇▁▁▁▁ |
Question 1
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
Ans:
Scatter plot of LifeExp~TotExp
ggplot(who_df, aes(x = TotExp, y = LifeExp)) + geom_point() + labs(title = "Average Life Expectancy according to Total Expenditures",
x = "TotExp (personal + government spendings)", y = "LifeExp (in years)")
Call:
lm(formula = LifeExp ~ TotExp, data = who_df)
Residuals:
Min 1Q Median 3Q Max
-24.764 -4.778 3.154 7.116 13.292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.371 on 188 degrees of freedom
Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Check for normality of the response variable LifeExp
Shapiro-Wilk normality test
data: who_df$LifeExp
W = 0.92182, p-value = 1.558e-08
[1] 3.891398
The linear model can be expressed as
\(LifeExp = 64.75 + 6.30e^(-5) \times TotExp\).
Interpret the F statistic, R2, standard error, and p-value value.
We do have a F-statistic of 65.26, which is a lot higher than the critical value. With p-value that is less than 0.05, at 7.71e-14 , we understand that there is a convincing evidence to the reject the null hypothesis, which means the linear model fits the data better than an intercept only model.
The standard error of the regression provides the absolute measure of the typical distance that the data points fall from the regression line. Comparatively, we have some very tiny standard errors as compared to the coefficients from this model, i.e. 7.535e-01 to 6.475e+01 for the intercept and 7.795e-06 to 6.297e-05 for the TotExp predictor.
R-squared provides the relative measure of the percentage of the dependent variable variance. We’ve an adjusted R-sqaured of 25.4%, which is not excellent. It tells us that the model only explains 25.4% of the variance of the dependent variable in LifeExp. Take R2 = .2577. We can extract the value of R by taking the square root of it. Thus R is 0.50. The larger the number of R the stronger the relationship between the variables. Generally, the R lied between 0.5 and 0.7 is considered a moderate correlation between the dependent and independent variables.
Next, let’s check for the assumptions of the linear model.
Call:
lm(formula = LifeExp ~ TotExp, data = who_df)
Coefficients:
(Intercept) TotExp
6.475e+01 6.297e-05
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma.lm(lmobj = life_exp.lm)
Value p-value Decision
Global Stat 56.737011 1.405e-11 Assumptions NOT satisfied!
Skewness 30.532757 3.283e-08 Assumptions NOT satisfied!
Kurtosis 0.002804 9.578e-01 Assumptions acceptable.
Link Function 26.074703 3.285e-07 Assumptions NOT satisfied!
Heteroscedasticity 0.126747 7.218e-01 Assumptions acceptable.
So, as seen in the results above, I don’t think the model passes the basic assumptions of simple linear model. In particular, the Link function, Skewness, resulting in Global Stat to fail.
Question 2
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is better?
Ans:
who_df2 <- transmute(who_df, LifeExp = LifeExp^4.6, TotExp = TotExp^0.06)
life_exp.lm2 <- lm(LifeExp ~ TotExp, data = who_df2)
summary(life_exp.lm2)
Call:
lm(formula = LifeExp ~ TotExp, data = who_df2)
Residuals:
Min 1Q Median 3Q Max
-308616089 -53978977 13697187 59139231 211951764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -736527910 46817945 -15.73 <2e-16 ***
TotExp 620060216 27518940 22.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 90490000 on 188 degrees of freedom
Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
[1] 3.891398
The linear model is as follows,
\(LifeExp = -736527910 + 620060216 \times TotExp\)
Interpret the F statistic, R2, standard error, and p-value value.
- Compared to the first model, this one gives us a F-statistics of 508, which is definitely greater than the F critical value of 3.89 and it is a lot larger than the F-statistics seen in the first model. With a p-value < 2.2e-16, we can see it’s definitely much smaller than the p-value seen in the first model. That tells me that this model has a stronger relationship between the dependent variable and independent variable.
- The coefficient of TotExp is 22.53 times of the standard error of TotExp. Likewise for the coefficient estimate of the intercept, it is 15.73 times bigger than the standard error. This ensures that there is relatively less variability in the slope and intercept estimates.
- R2 is .7283, which means the correlation coefficient R is 0.85. This indicates the transformed variables has a much stronger relationship compared to the previous model with the original variables.
par(mfrow = c(2, 2))
# Scatter plot of Life Expectations (in yrs.) against Total Expenditures.
plot(who_df2$TotExp, who_df2$LifeExp, xlab = "Expenditures", ylab = "Years", main = "Life Expectation vs Total Expenditures",
col = "orange", pch = 8)
abline(life_exp.lm2)
# Residuals plot
plot(life_exp.lm2$fitted.values, life_exp.lm2$residuals, xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, lty = 4, col = "blue")
# Distribution of residuals
hist(life_exp.lm2$residuals, xlab = "Residuals")
# Normal Q-Q Plot
qqnorm(life_exp.lm2$residuals)
qqline(life_exp.lm2$residuals)With the exception of some skewness of residuals, the model is acceptable in terms of all other criteria for assumptions.
Question 3
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
Ans:
Life Expectancy is 63.31 years when TotExp(.06) = 1.5
# TotExp^.06 = 1.5
life_exp_fc = 620060216 * 1.5 - 736527910
life_exp_back_transformed <- life_exp_fc^(1/4.6)
life_exp_back_transformed[1] 63.31153
Life Expectancy is 86.51 years when TotExp(.06) = 2.5
# TotExp^.06 = 2.5
life_exp_fc = 620060216 * 2.5 - 736527910
life_exp_back_transformed <- life_exp_fc^(1/4.6)
life_exp_back_transformed[1] 86.50645
Question 4
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
\(LifeExp = b_0+b_1 \times PropMd + b2 \times TotExp +b_3 \times (PropMD*TotExp)\)
Ans:
life_exp.reg <- lm(LifeExp ~ PropMD + TotExp + (TotExp * PropMD), data = who_df)
summary(life_exp.reg)
Call:
lm(formula = LifeExp ~ PropMD + TotExp + (TotExp * PropMD), data = who_df)
Residuals:
Min 1Q Median 3Q Max
-27.320 -4.132 2.098 6.540 13.074
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.765 on 186 degrees of freedom
Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Find the F critical value.
[1] 2.653165
The multiple regression model is as follows,
\(LifeExp = 6.277e+1 + 1.497e+3 \times PropMD + 7.233e-5 \times TotExp - 6.026e-3 \times (PropMD*TotExp)\)
We see that F-statistics is 34.49, which is much greater than 2.62. Meaning we reject the null hypothesis at 5% significance level, which is then interpreted as this model is better fitted than an intercept-only model. The p-value <2.2e-16 also confirms that.
The standard error of intercept, and all 3 variables are significantly smaller than their coefficient estimates. That’s another good sign for a good model fit.
The R-squared explains 35.74 % of the data’s variation in the dependent variable. R, which is .5978, indicates that the relationship between the dependent variable and the independent variables is moderately correlated.
par(mfrow = c(2, 2))
# Residuals plot
plot(life_exp.reg$fitted.values, life_exp.reg$residuals, xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, lty = 3, col = "blue")
# Distirbution of residtuals
hist(life_exp.reg$residuals, xlab = "Residuals")
# Normality check
qqnorm(life_exp.reg$residuals)
qqline(life_exp.reg$residuals)Independence of Errors: From the residuals plot, we can see that the distribution of errors is not random and there is a pattern. Thus, we can say there is an autocorrelation effect among the residuals and the independence assumption is not valid.
Homoscedasticity: From the residuals plot, we can see that variation of residuals does not seem to be constant along the line drawn from 0 residual.
Normality of Error Distribution: The residuals histogram shows a left-skewed distribution and the qq plot also demonstrates that the residuals are skewed and not normally distributed.
Linearity: There is no linearity of residuals as shown in the Normal Q-Q plot.
So the model violates all assumptions. I don’t think this is an appropriate model.
Question 5
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
Ans:
The forecast is 107.68 years given PropMD = .03 and TotExp = 14.
I think the forecast is realistic as long as the input, namely, independent variables that are given are realistic. Think about having 3% of population to be MDs, coupled with an elevated level of total expenditures of 14 Trillions or whichever unit it is. Since the previous Life Expectancy was 86.51 years given the total expenditures is 2.5 Trillions or whichever unit it is. Forecast is always leveraged to predict something that we are currently not at. So I do think there is a possibility that the human beings, or population, at large, reaches a life expectancy of 120 years old at some point in the future.
[1] 107.6785