Data 605 HW 12

library(knitr)
library(rmdformats)

## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=31)
# Loading packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(utils)
library(ggplot2)
library(skimr)

Data Dictionary

The attached who.csv dataset contains real-world data from 2008.

Field Name Description
Country name of the country
LifeExp average life expectancy for the country in years
InfantSurvival proportion of those surviving to one year or more
Under5Survival proportion of those surviving to five years or more
TBFree proportion of the population without TB.
PropMD proportion of the population who are MDs
PropRN proportion of the population who are RNs
PersExp mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp sum of personal and government expenditures.

Data Prep

setwd("/Users/dpong/Data 605/Week 12/homework/")
who_df <- read.csv("who.csv")
head(who_df, 5)
      Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
1 Afghanistan      42          0.835          0.743 0.99769 0.000228841
2     Albania      71          0.985          0.983 0.99974 0.001143127
3     Algeria      71          0.967          0.962 0.99944 0.001060478
4     Andorra      82          0.997          0.996 0.99983 0.003297297
5      Angola      41          0.846          0.740 0.99656 0.000070400
       PropRN PersExp GovtExp TotExp
1 0.000572294      20      92    112
2 0.004614439     169    3128   3297
3 0.002091362     108    5184   5292
4 0.003500000    2589  169725 172314
5 0.001146162      36    1620   1656
skim(who_df)
Data summary
Name who_df
Number of rows 190
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Country 0 1 FALSE 190 Afg: 1, Alb: 1, Alg: 1, And: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
LifeExp 0 1 67.38 10.85 40.00 61.25 70.00 75.00 83.00 ▂▂▃▇▅
InfantSurvival 0 1 0.96 0.04 0.84 0.94 0.98 0.99 1.00 ▁▁▂▂▇
Under5Survival 0 1 0.95 0.06 0.73 0.93 0.97 0.99 1.00 ▁▁▁▂▇
TBFree 0 1 1.00 0.00 0.99 1.00 1.00 1.00 1.00 ▁▁▁▂▇
PropMD 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.04 ▇▁▁▁▁
PropRN 0 1 0.00 0.01 0.00 0.00 0.00 0.01 0.07 ▇▁▁▁▁
PersExp 0 1 742.00 1354.00 3.00 36.25 199.50 515.25 6350.00 ▇▁▁▁▁
GovtExp 0 1 40953.49 86140.65 10.00 559.50 5385.00 25680.25 476420.00 ▇▁▁▁▁
TotExp 0 1 41695.49 87449.85 13.00 584.00 5541.00 26331.00 482750.00 ▇▁▁▁▁

Question 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

Ans:

Scatter plot of LifeExp~TotExp

ggplot(who_df, aes(x = TotExp, y = LifeExp)) + geom_point() + labs(title = "Average Life Expectancy according to Total Expenditures", 
    x = "TotExp (personal + government spendings)", y = "LifeExp (in years)")

life_exp.lm <- lm(LifeExp ~ TotExp, data = who_df)
summary(life_exp.lm)

Call:
lm(formula = LifeExp ~ TotExp, data = who_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.764  -4.778   3.154   7.116  13.292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.371 on 188 degrees of freedom
Multiple R-squared:  0.2577,    Adjusted R-squared:  0.2537 
F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Check for normality of the response variable LifeExp

shapiro.test(who_df$LifeExp)

    Shapiro-Wilk normality test

data:  who_df$LifeExp
W = 0.92182, p-value = 1.558e-08
# getting the F critical value
qf(0.95, 1, 188)
[1] 3.891398

The linear model can be expressed as

\(LifeExp = 64.75 + 6.30e^(-5) \times TotExp\).

Interpret the F statistic, R2, standard error, and p-value value.

  • We do have a F-statistic of 65.26, which is a lot higher than the critical value. With p-value that is less than 0.05, at 7.71e-14 , we understand that there is a convincing evidence to the reject the null hypothesis, which means the linear model fits the data better than an intercept only model.

  • The standard error of the regression provides the absolute measure of the typical distance that the data points fall from the regression line. Comparatively, we have some very tiny standard errors as compared to the coefficients from this model, i.e. 7.535e-01 to 6.475e+01 for the intercept and 7.795e-06 to 6.297e-05 for the TotExp predictor.

  • R-squared provides the relative measure of the percentage of the dependent variable variance. We’ve an adjusted R-sqaured of 25.4%, which is not excellent. It tells us that the model only explains 25.4% of the variance of the dependent variable in LifeExp. Take R2 = .2577. We can extract the value of R by taking the square root of it. Thus R is 0.50. The larger the number of R the stronger the relationship between the variables. Generally, the R lied between 0.5 and 0.7 is considered a moderate correlation between the dependent and independent variables.

Next, let’s check for the assumptions of the linear model.

# Global Validation of Linear Models Assumptions (Gvlma Object)
gvlma::gvlma.lm(life_exp.lm)

Call:
lm(formula = LifeExp ~ TotExp, data = who_df)

Coefficients:
(Intercept)       TotExp  
  6.475e+01    6.297e-05  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma.lm(lmobj = life_exp.lm) 

                       Value   p-value                   Decision
Global Stat        56.737011 1.405e-11 Assumptions NOT satisfied!
Skewness           30.532757 3.283e-08 Assumptions NOT satisfied!
Kurtosis            0.002804 9.578e-01    Assumptions acceptable.
Link Function      26.074703 3.285e-07 Assumptions NOT satisfied!
Heteroscedasticity  0.126747 7.218e-01    Assumptions acceptable.

So, as seen in the results above, I don’t think the model passes the basic assumptions of simple linear model. In particular, the Link function, Skewness, resulting in Global Stat to fail.

Question 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is better?

Ans:

who_df2 <- transmute(who_df, LifeExp = LifeExp^4.6, TotExp = TotExp^0.06)
life_exp.lm2 <- lm(LifeExp ~ TotExp, data = who_df2)
summary(life_exp.lm2)

Call:
lm(formula = LifeExp ~ TotExp, data = who_df2)

Residuals:
       Min         1Q     Median         3Q        Max 
-308616089  -53978977   13697187   59139231  211951764 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -736527910   46817945  -15.73   <2e-16 ***
TotExp       620060216   27518940   22.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 90490000 on 188 degrees of freedom
Multiple R-squared:  0.7298,    Adjusted R-squared:  0.7283 
F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
# Getting the F critical value
qf(0.95, 1, 188)
[1] 3.891398

The linear model is as follows,

\(LifeExp = -736527910 + 620060216 \times TotExp\)

Interpret the F statistic, R2, standard error, and p-value value.

  • Compared to the first model, this one gives us a F-statistics of 508, which is definitely greater than the F critical value of 3.89 and it is a lot larger than the F-statistics seen in the first model. With a p-value < 2.2e-16, we can see it’s definitely much smaller than the p-value seen in the first model. That tells me that this model has a stronger relationship between the dependent variable and independent variable.
  • The coefficient of TotExp is 22.53 times of the standard error of TotExp. Likewise for the coefficient estimate of the intercept, it is 15.73 times bigger than the standard error. This ensures that there is relatively less variability in the slope and intercept estimates.
  • R2 is .7283, which means the correlation coefficient R is 0.85. This indicates the transformed variables has a much stronger relationship compared to the previous model with the original variables.
par(mfrow = c(2, 2))

# Scatter plot of Life Expectations (in yrs.) against Total Expenditures.
plot(who_df2$TotExp, who_df2$LifeExp, xlab = "Expenditures", ylab = "Years", main = "Life Expectation vs Total Expenditures", 
    col = "orange", pch = 8)
abline(life_exp.lm2)

# Residuals plot
plot(life_exp.lm2$fitted.values, life_exp.lm2$residuals, xlab = "Fitted Values", 
    ylab = "Residuals")
abline(h = 0, lty = 4, col = "blue")

# Distribution of residuals
hist(life_exp.lm2$residuals, xlab = "Residuals")

# Normal Q-Q Plot
qqnorm(life_exp.lm2$residuals)
qqline(life_exp.lm2$residuals)

With the exception of some skewness of residuals, the model is acceptable in terms of all other criteria for assumptions.

Question 3

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5. 

Ans:

Life Expectancy is 63.31 years when TotExp(.06) = 1.5

# TotExp^.06 = 1.5
life_exp_fc = 620060216 * 1.5 - 736527910
life_exp_back_transformed <- life_exp_fc^(1/4.6)
life_exp_back_transformed
[1] 63.31153

Life Expectancy is 86.51 years when TotExp(.06) = 2.5

# TotExp^.06 = 2.5
life_exp_fc = 620060216 * 2.5 - 736527910
life_exp_back_transformed <- life_exp_fc^(1/4.6)
life_exp_back_transformed
[1] 86.50645

Question 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

\(LifeExp = b_0+b_1 \times PropMd + b2 \times TotExp +b_3 \times (PropMD*TotExp)\)

Ans:

life_exp.reg <- lm(LifeExp ~ PropMD + TotExp + (TotExp * PropMD), data = who_df)
summary(life_exp.reg)

Call:
lm(formula = LifeExp ~ PropMD + TotExp + (TotExp * PropMD), data = who_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.320  -4.132   2.098   6.540  13.074 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.765 on 186 degrees of freedom
Multiple R-squared:  0.3574,    Adjusted R-squared:  0.3471 
F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Find the F critical value.

qf(0.95, 3, 186)
[1] 2.653165

The multiple regression model is as follows,

\(LifeExp = 6.277e+1 + 1.497e+3 \times PropMD + 7.233e-5 \times TotExp - 6.026e-3 \times (PropMD*TotExp)\)

We see that F-statistics is 34.49, which is much greater than 2.62. Meaning we reject the null hypothesis at 5% significance level, which is then interpreted as this model is better fitted than an intercept-only model. The p-value <2.2e-16 also confirms that.

The standard error of intercept, and all 3 variables are significantly smaller than their coefficient estimates. That’s another good sign for a good model fit.

The R-squared explains 35.74 % of the data’s variation in the dependent variable. R, which is .5978, indicates that the relationship between the dependent variable and the independent variables is moderately correlated.

par(mfrow = c(2, 2))

# Residuals plot
plot(life_exp.reg$fitted.values, life_exp.reg$residuals, xlab = "Fitted Values", 
    ylab = "Residuals")
abline(h = 0, lty = 3, col = "blue")

# Distirbution of residtuals
hist(life_exp.reg$residuals, xlab = "Residuals")

# Normality check
qqnorm(life_exp.reg$residuals)
qqline(life_exp.reg$residuals)

  1. Independence of Errors: From the residuals plot, we can see that the distribution of errors is not random and there is a pattern. Thus, we can say there is an autocorrelation effect among the residuals and the independence assumption is not valid.

  2. Homoscedasticity: From the residuals plot, we can see that variation of residuals does not seem to be constant along the line drawn from 0 residual.

  3. Normality of Error Distribution: The residuals histogram shows a left-skewed distribution and the qq plot also demonstrates that the residuals are skewed and not normally distributed.

  4. Linearity: There is no linearity of residuals as shown in the Normal Q-Q plot.

So the model violates all assumptions. I don’t think this is an appropriate model.

Question 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Ans:

The forecast is 107.68 years given PropMD = .03 and TotExp = 14.

I think the forecast is realistic as long as the input, namely, independent variables that are given are realistic. Think about having 3% of population to be MDs, coupled with an elevated level of total expenditures of 14 Trillions or whichever unit it is. Since the previous Life Expectancy was 86.51 years given the total expenditures is 2.5 Trillions or whichever unit it is. Forecast is always leveraged to predict something that we are currently not at. So I do think there is a possibility that the human beings, or population, at large, reaches a life expectancy of 120 years old at some point in the future.

life_exp.fc2 <- 62.77 + 1497 * 0.03 + 7.233e-05 * 14 - 0.006026 * (0.03 * 14)
life_exp.fc2
[1] 107.6785