ASSIGNMENT 12

Problem

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Answer

Load data from CSV file

data <- read.csv("who.csv")

# Show Entire dataset
DT::datatable(data)

Data Exploration

str(data)

## 'data.frame':    190 obs. of  10 variables:
##  $ Country       : Factor w/ 190 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ LifeExp       : int  42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num  0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num  0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num  0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num  0.0002288 0.0011431 0.0010605 0.0032973 0.0000704 ...
##  $ PropRN        : num  0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : int  20 169 108 2589 36 503 484 88 3181 3788 ...
##  $ GovtExp       : int  92 3128 5184 169725 1620 12543 19170 1856 187616 189354 ...
##  $ TotExp        : int  112 3297 5292 172314 1656 13046 19654 1944 190797 193142 ...

summary(data)

##                 Country       LifeExp      InfantSurvival   Under5Survival  
##  Afghanistan        :  1   Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Albania            :  1   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Algeria            :  1   Median :70.00   Median :0.9785   Median :0.9745  
##  Andorra            :  1   Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##  Angola             :  1   3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##  Antigua and Barbuda:  1   Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##  (Other)            :184                                                    
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##                                                                            
##     GovtExp             TotExp      
##  Min.   :    10.0   Min.   :    13  
##  1st Qu.:   559.5   1st Qu.:   584  
##  Median :  5385.0   Median :  5541  
##  Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :476420.0   Max.   :482750  
##

1.

Provide a scatterplot of LifeExp~TotExp,

Scatterplot of Life Expectancy and Total Expenditure

# Linear regression model build
lm <- lm(LifeExp ~ TotExp, data=data)

plot(data$TotExp, data$LifeExp, xlab = "Total Expenditures", ylab = "Life Expectancy", main="Life Expectancy vs Total Expenditures")
abline(lm)

Run simple linear regression. Do not transform the variables.

Run summary function to provide the F statistics, R-squared, standard error, and p-values

LifeExp_simple.lm <- lm(LifeExp ~ TotExp, data = data)

Provide and interpret the F statistics, R^2, standard error,and p-values only.

summary(LifeExp_simple.lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)    
## (Intercept) 64.753374534  0.753536611  85.933  < 2e-16 ***
## TotExp       0.000062970  0.000007795   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The F-statistic is 65.26 with a p-value of 7.714e-14 (test for overall regression model). P-value for F-test and TotExp is less than 0.05 but Adjusted R-squared is way too low, 0.2537, which suggests that we can reject the null hypothesis (regression model is same as intercept-only model) and accept the alternative that this simple linear regression model provides a better fit than the intercept-only model. The $R^2$=0.2577 value is not strong, which tells us that the model only accounts for 25.77% of the variation in the data. Standard error, 9.371, is higher than what we want it to be.

About F-statistic in linear regression model: http://blog.minitab.com/blog/adventures-in-statistics-2/what-is-the-f-test-of-overall-significance-in-regression-analysis

Discuss whether the assumptions of simple linear regression met.

# Residuals variability plot
plot(LifeExp_simple.lm$fitted.values, LifeExp_simple.lm$residuals, 
     xlab="Fitted Values", ylab="Residuals",
     main="Residuals Plot 1 for Linear Model")
abline(h=0,lm)

hist(resid(LifeExp_simple.lm), main = "Histogram of Residuals", xlab = "Residuals")

qqnorm(LifeExp_simple.lm$residuals)
qqline(LifeExp_simple.lm$residuals)

# Residual Analysis
ols_plot_resid_qq(LifeExp_simple.lm)

ols_plot_resid_hist(LifeExp_simple.lm)

ols_plot_resid_fit(LifeExp_simple.lm)

crPlots(LifeExp_simple.lm)

# Homoscedasticity 
ncvTest(LifeExp_simple.lm)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.599177, Df = 1, p = 0.10692

# Independence
durbinWatsonTest(LifeExp_simple.lm)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.06872515      1.802451   0.172
##  Alternative hypothesis: rho != 0

gvlma(LifeExp_simple.lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Coefficients:
## (Intercept)       TotExp  
## 64.75337453   0.00006297  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = LifeExp_simple.lm) 
## 
##                        Value          p-value                   Decision
## Global Stat        56.737011 0.00000000001405 Assumptions NOT satisfied!
## Skewness           30.532757 0.00000003282766 Assumptions NOT satisfied!
## Kurtosis            0.002804 0.95777263030755    Assumptions acceptable.
## Link Function      26.074703 0.00000032845930 Assumptions NOT satisfied!
## Heteroscedasticity  0.126747 0.72182921484679    Assumptions acceptable.

Looking at residuals plots it is clear that there is no constant variability and that residuals are not normally distributed. This model is not a good model.

2.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6).

Create new column with $LifeExp^{4.6}$ values

data$LifeExp_4.6 <- (data$LifeExp)^4.6

Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06).

Create new column with $TotExp^{4.6}$ values

data$TotExp_0.06 <- (data$TotExp)^0.06

Plot LifeExp^4.6 as a function of TotExp^.06.

Plot total expenditures vs life expectancy for who data set

lm <- lm(LifeExp_4.6 ~ TotExp_0.06, data = data)

plot(data$TotExp_0.06, data$LifeExp_4.6, xlab = "Total Expenditures ^ 0.06", ylab = "Life Expectancy ^ 4.6", main='Average Life Expectancy vs Expenditure')
abline(lm)

Re-run the simple regression model using the transformed variables

Run simple linear regression on transformed LifeExp and TotExp

data_transformed.lm <- lm(LifeExp_4.6 ~ TotExp_0.06, data = data)

Provide and interpret the F statistics, R^2, standard error, and p-values.

Run summary function and get the F statistics, R-squared, standard error, and p-values

summary(data_transformed.lm)

## 
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_0.06, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_0.06  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Which model is “better?”

# Residuals variability plot
plot(data_transformed.lm$fitted.values, data_transformed.lm$residuals, 
     xlab="Fitted Values", ylab="Residuals",
     main="Residuals Plot 2 for Linear Model")
abline(h=0)

hist(resid(data_transformed.lm), main = "Histogram of Residuals", xlab = "Residuals")

qqnorm(data_transformed.lm$residuals)
qqline(data_transformed.lm$residuals)

# Residual Analysis
ols_plot_resid_qq(data_transformed.lm)

ols_plot_resid_hist(data_transformed.lm)

ols_plot_resid_fit(data_transformed.lm)

crPlots(data_transformed.lm)

# Homoscedasticity 
ncvTest(data_transformed.lm)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.4278017, Df = 1, p = 0.51307

# Independence
durbinWatsonTest(data_transformed.lm)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.04038156      1.909075   0.536
##  Alternative hypothesis: rho != 0

gvlma(data_transformed.lm)

## 
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_0.06, data = data)
## 
## Coefficients:
## (Intercept)  TotExp_0.06  
##  -736527909    620060216  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = data_transformed.lm) 
## 
##                      Value    p-value                   Decision
## Global Stat        27.8117 0.00001362 Assumptions NOT satisfied!
## Skewness           17.1372 0.00003478 Assumptions NOT satisfied!
## Kurtosis            7.4581 0.00631520 Assumptions NOT satisfied!
## Link Function       2.9866 0.08395884    Assumptions acceptable.
## Heteroscedasticity  0.2299 0.63158532    Assumptions acceptable.

As per QQ plot line and residual vs fitted graph suggests that at least model 2 is closer to normal distribution than model 1. However, Since mean < median, we can say that the model is still negatively skewed. Residual standard error is 90490000 but since variables are rescaled, we cannot really say residual standard error for model 2 is higher than model 1. Relative to the size of TotExp_power coefficient, residual standard error rather decreased in model 2.

The F-statistic for the transformed model is 507.7 (same degrees of freedom as model from 1.) is much better vs. the the previous model. The p-value is even better. In the transformed model the R2=0.7298, which is a lot better than the previous model (25.77%), as it means the explained variability is 72.9% compared to 25% for the 1st model. As can be inferred, the second model is better than the first one.

3.

Using the results from 2, forecast life expectancy when TotExp^.06 =1.5.

Transformed model from (2) using equation:

\[ LifeExp_{4.6} = -736527910 + 620060216*TotExp^{0.06} \]

forecast_LifeExp <- function(TotExp_0.06_value){
  y <- -736527910 + 620060216 *(TotExp_0.06_value)
  y <- y^(1/4.6)
  y
  #return (-736527910 + 620060216 * (TotExp_0.06_value))^(1/4.6)
}

Forecast life expectancy when $TotExp^{.06}$ =1.5

forecast_LifeExp(1.5)

## [1] 63.31153

Then forecast life expectancy when TotExp^.06=2.5.

Then forecast life expectancy when $TotExp^{.06}$=2.5

forecast_LifeExp(2.5)

## [1] 86.50645

Predicting the values at 1.5 and 2.5 provides the following results.

The prediction at 1.5 is 63 years.

The prediction at 2.5 is 87 years.

4.

Build the following multiple regression model

\[ LifeExp = b_0 + b_1 * PropMd + b_2 * TotExp + b_3 * PropMD * TotExp \]

#LifeExp_multiple.lm <- lm(data$LifeExp ~ data$PropMD + data$TotExp + (data$PropMD * data$TotExp))
data$PropMD_TotExp_0.06 <- data$PropMD * data$TotExp_0.06
#LifeExp_multiple.lm <- lm(data$LifeExp_4.6 ~ data$PropMD + data$TotExp_0.06 + data$PropMD:data$TotExp_0.06)
LifeExp_multiple.lm <- lm(data$LifeExp_4.6 ~ data$PropMD + data$TotExp_0.06 + data$PropMD_TotExp_0.06)

Interpret the F Statistics, R^2, standard error, and p-values.

summary(LifeExp_multiple.lm)

## 
## Call:
## lm(formula = data$LifeExp_4.6 ~ data$PropMD + data$TotExp_0.06 + 
##     data$PropMD_TotExp_0.06)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -296470018  -47729263   12183210   60285515  212311883 
## 
## Coefficients:
##                             Estimate   Std. Error t value Pr(>|t|)    
## (Intercept)               -724418697     50825046 -14.253   <2e-16 ***
## data$PropMD              47273338389  22577050673   2.094   0.0376 *  
## data$TotExp_0.06           604795792     30232703  20.005   <2e-16 ***
## data$PropMD_TotExp_0.06 -21214671638  11308191249  -1.876   0.0622 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88520000 on 186 degrees of freedom
## Multiple R-squared:  0.7441, Adjusted R-squared:   0.74 
## F-statistic: 180.3 on 3 and 186 DF,  p-value: < 2.2e-16

How good is the model?

hist(resid(LifeExp_multiple.lm), xlab = "Residuals")

plot(fitted(LifeExp_multiple.lm), resid(LifeExp_multiple.lm))

The F-statistic from our multiple regression model is 180.3. The p-value is strong for the model, and strong for all of the factors except PropMD x TotExp0.06 (0.0622). Finally, the $R^2$=0.7441, which means that the model accounts for 74.41% of variability in the data.

5.

Forecast LifeExp when PropMD=.03 and TotExp = 14.

LifeExp_multiple.lm$coefficients

##             (Intercept)             data$PropMD        data$TotExp_0.06 
##              -724418697             47273338389               604795792 
## data$PropMD_TotExp_0.06 
##            -21214671638

\[ LifeExp_{4.6} = 724418697 + 47273338389 * PropMD + 604795792 * TotExp_{0.06} + -21214671638 * PropMD * TotExp_{0.06} \]

forecast_LifeExp_2 <- function(PropMD, TotExp_0.06){
  LifeExp_4.6 <- 724418697 + 47273338389 * PropMD + 604795792 * TotExp_0.06 + -21214671638 * PropMD * TotExp_0.06
  return(LifeExp_4.6^(1/4.6))
}

forecast_LifeExp_2(.03, 14^0.06)

## [1] 106.3698

Does this forecast seem realistic? Why or why not?

This seems like an outlier as life expectancy of 106 does not look realistic. The problem with this model is that we are using the proportion of MDs living in a country as well as the total expenditure as independent variables for life expectancy, and we really do not know if these actually make a true difference in real life.

Life expentancy of propMD around 3%

LifeExp_propMD_.03 <- subset(data, data$PropMD >.02 & data$PropMD < .04, select = c("LifeExp", "PropMD", "TotExp"))
head(LifeExp_propMD_.03)

##     LifeExp     PropMD TotExp
## 45       80 0.03322813  40749
## 146      82 0.03512903 281653

The forecast for proportion of population that are medical doctors is 3% and total expenditure of $14 is about 82 years. Looking at the data below for propMD that falls around 3%, I do see that 80 to 82 years of life expectancy is present; however, the total expenditure is a lot higher than 14. So, I do not think that this is a realistic forecast.

Life expentancy of total expenditure around 14

The minimum total expenditure in the dataset is $13.

min(data$TotExp)

## [1] 13

To get an idea of the life expectancy of countries with total expenditure close to $14, below is a list of entries with total expenditures under $100. As you can see, only 1 entry has a total expenditure that is closest to $14, and the life expectancy of this country is 49 years. Also, for all the countries in this list, the propMD is very low (less than 1%).

LifeExp_TotExp_.04 <- subset(data, data$TotExp >= 13 & data$TotExp < 100, select = c("LifeExp", "PropMD", "TotExp"))
LifeExp_TotExp_.04

##     LifeExp      PropMD TotExp
## 14       63 0.000274894     87
## 28       49 0.000024500     13
## 47       47 0.000096100     71
## 56       63 0.000045800     88
## 58       56 0.000023900     70
## 70       53 0.000107505     87
## 118      62 0.000194783     80
## 122      42 0.000021500     94

Based on actual data, I would say that a life expectancy of 82 years for a country with propMD of 3% and total expenditure of 14 is not realistic. Countries with propMD of about 3% tend to have total expenditure that is way above $14.

DATA 605 : Homework #12

Debabrata Kabiraj

April 25, 2020