Data Discription

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country

LifeExp: average life expectancy for the country in years

InfantSurvival: proportion of those surviving to one year or more

Under5Survival: proportion of those surviving to five years or more

TBFree: proportion of the population without TB.

PropMD: proportion of the population who are MDs

PropRN: proportion of the population who are RNs

PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate

GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate

TotExp: sum of personal and government expenditures.

Libraries

library(readr)
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ purrr   0.3.4     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)

Import data

who_data <- read_csv('who.csv')
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(who_data)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##     GovtExp             TotExp      
##  Min.   :    10.0   Min.   :    13  
##  1st Qu.:   559.5   1st Qu.:   584  
##  Median :  5385.0   Median :  5541  
##  Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :476420.0   Max.   :482750
skim(who_data)
Data summary
Name who_data
Number of rows 190
Number of columns 10
_______________________
Column type frequency:
character 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Country 0 1 4 41 0 190 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
LifeExp 0 1 67.38 10.85 40.00 61.25 70.00 75.00 83.00 ▂▂▃▇▅
InfantSurvival 0 1 0.96 0.04 0.84 0.94 0.98 0.99 1.00 ▁▁▂▂▇
Under5Survival 0 1 0.95 0.06 0.73 0.93 0.97 0.99 1.00 ▁▁▁▂▇
TBFree 0 1 1.00 0.00 0.99 1.00 1.00 1.00 1.00 ▁▁▁▂▇
PropMD 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.04 ▇▁▁▁▁
PropRN 0 1 0.00 0.01 0.00 0.00 0.00 0.01 0.07 ▇▁▁▁▁
PersExp 0 1 742.00 1354.00 3.00 36.25 199.50 515.25 6350.00 ▇▁▁▁▁
GovtExp 0 1 40953.49 86140.65 10.00 559.50 5385.00 25680.25 476420.00 ▇▁▁▁▁
TotExp 0 1 41695.49 87449.85 13.00 584.00 5541.00 26331.00 482750.00 ▇▁▁▁▁

Problem 1.

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

# Scatter plot
ggplot(who_data, aes(x = TotExp, y = LifeExp ))+geom_point(color = 'blue')+
  labs(x = 'Total Expenditure', y = 'Life Expectancy', title = 'Life Expectancy vs Total Expenditures')+geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# Linear regression model
life_exp_lm <- lm(LifeExp ~ TotExp, data=who_data)

# summary of model
summary(life_exp_lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
# residuals plot
hist(life_exp_lm$residuals, main = "histogram of residual")

plot(life_exp_lm$fitted.values, life_exp_lm$residuals, 
     xlab="Fitted Values", ylab="Residuals",
     main="Residuals Plot")

qqnorm(life_exp_lm$residuals)
qqline(life_exp_lm$residuals)

Conclusion:

  1. The adjusted R-squared shows that the model only explains 25.37% of variation in life expectancy

  2. The Standard Error is approximately 8x smaller then the corresponding coefficient.

  3. P value shows the total expenditure is statistically significant in this model. Therefore, reject the null hypothesis.

  4. F-Statistic is large and it is indicates the strong relationship exists between total expenditure and life expectancy.

  5. The residual plot shows residuals are leftly skewed.

Problem 2.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

# Create new variable - LifeExp^4.6 
who_data$LifeExp4.6 <- who_data$LifeExp^4.6

# Create new variable - TotExp^0.06 
who_data$TotExp0.06 <- who_data$TotExp^0.06


# Scatter plot
ggplot(who_data, aes(x = TotExp0.06, y = LifeExp4.6 ))+geom_point(color = 'blue')+
  labs(x = 'Total Expenditure^0.06', y = 'Life Expectancy^4.6', title = 'Life Expectancy vs Total Expenditures')+geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# Linear regression model
life_exp_lm2 <- lm(LifeExp4.6 ~ TotExp0.06, data=who_data)

# summary of model
summary(life_exp_lm2)
## 
## Call:
## lm(formula = LifeExp4.6 ~ TotExp0.06, data = who_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp0.06   620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
# residuals plot
hist(life_exp_lm2$residuals, main = "histogram of residual")

plot(life_exp_lm2$fitted.values, life_exp_lm2$residuals, 
     xlab="Fitted Values", ylab="Residuals",
     main="Residuals Plot")

qqnorm(life_exp_lm2$residuals)
qqline(life_exp_lm2$residuals)

Conclusion:

The model is greatly improved, the adjusted R-square increase to an acceptable range. This model is definitely better than the model in problem 1.

  1. The adjusted R-squared shows that the model only explains 72.83% of variation in life

  2. P value shows the total expenditure is statistically significant in this model. Therefore, reject the null hypothesis.

3.F-Statistic is large and it is indicates the strong relationship exists between total expenditure and life expectancy.

  1. The residual plot shows residuals are slightly left skewed.

Problem 3.

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

forecast<-data.frame(TotExp0.06=c(1.5,2.5))
predict(life_exp_lm2, forecast, interval = "predict")^(1/4.6)
##        fit      lwr      upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414

When TotExo^0.06 = 1.5, the forecast for life expectancy is 63.31 years. When TotExo^0.06 = 2.5, the forecast for life expectancy is 86.50 years.

Problem 4.

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

life_exp_lm3 <- lm(LifeExp ~ PropMD + TotExp + TotExp * PropMD, who_data)
summary(life_exp_lm3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + TotExp * PropMD, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
# residuals plot
hist(life_exp_lm3$residuals, main = "histogram of residual")

plot(life_exp_lm3$fitted.values, life_exp_lm3$residuals, 
     xlab="Fitted Values", ylab="Residuals",
     main="Residuals Plot")

qqnorm(life_exp_lm3$residuals)
qqline(life_exp_lm3$residuals)

Conclusion:

  1. The adjusted R-squared shows that the model only explains 34.71% of variation in life expectancy

  2. The Standard Error 8.765 and F-statistic is 34.49.

  3. P value shows the total expenditure is statistically significant in this model. Therefore, reject the null hypothesis.

4.F-Statistic is indicates there is relationship exists between dependent and independent variables.

  1. The residual plot shows residuals are leftly skewed.

Eventhough the model included more variables, but the model performs worse than the model from the problem 2 (the scaled simple linear model).

Problem 5.

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

forecast2 <- data.frame(PropMD=0.03, TotExp=14)
predict(life_exp_lm3,forecast2,interval="predict")
##       fit      lwr      upr
## 1 107.696 84.24791 131.1441

Conclusion:

The model predicts the life expectancy of 107.69 years when Total Expenditure is 14. According to the summay of data, this is not realistic since the max value of life expectancy is 83.