Getting Started

The attached who.csv dataset contains real-world Health Organization data from 2008. The variables included follow. Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.

library(readr)

#Import Dataset
who <- read_csv("~/who.csv")
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
knitr::kable(head(who[,c(1,2,6,8,9,10)]))
Country LifeExp PropMD PersExp GovtExp TotExp
Afghanistan 42 0.0002288 20 92 112
Albania 71 0.0011431 169 3128 3297
Algeria 71 0.0010605 108 5184 5292
Andorra 82 0.0032973 2589 169725 172314
Angola 41 0.0000704 36 1620 1656
Antigua and Barbuda 73 0.0001429 503 12543 13046
#View the first few rows of the data frame
head(who)
## # A tibble: 6 × 10
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghanis…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania        71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria        71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra        82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
## 5 Angola         41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigua …      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>
#Check the structure of the imported data
str(data)
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"), 
##     envir = .GlobalEnv, overwrite = TRUE)

Exercise 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

# Load the ggplot2 library
library(ggplot2)

# Create a scatterplot with ggplot
ggplot(data = who, aes(x = TotExp, y = LifeExp)) +
  geom_point() +
  labs(x = "Total Expenditure", y = "Life Expectancy", title = "Scatterplot of Life Expectancy vs Total Expenditure") +
  geom_smooth(method = "lm", se = FALSE)  # Add linear regression line without confidence interval
## `geom_smooth()` using formula = 'y ~ x'

# Run simple linear regression
slr_model <- lm(LifeExp ~ TotExp, data = who)

# Summary of the regression model
summary(slr_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
# Diagnostic plots for the regression model
plot(slr_model)

In the scatterplot, the points do not follow a linear trend, the relationship between the predictor (Total Expenditure) and the response variable (Life Expectancy) should be approximately non-linear. So, the assumptions of simple linear regression are violated in our model.

Exercise 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

# Raise the life expectancy variable to the power of 4.6 and the total expenditures variable to the power of 0.6.
who$LifeExp_transformed <- who$LifeExp ^ 4.6
who$TotExp_transformed <- who$TotExp ^ 0.06

# Create a scatterplot with ggplot2
ggplot(who, aes(x = TotExp_transformed, y = LifeExp_transformed)) +
  geom_point() +
  labs(x = "Total Expenditure^0.06", y = "Life Expectancy^4.6", title = "Transformed Relationship")

# Run simple linear regression with transformed variables
slr_model_transformed <- lm(LifeExp_transformed ~ TotExp_transformed, data = who)

# Summary of the transformed regression model
summary(slr_model_transformed)
## 
## Call:
## lm(formula = LifeExp_transformed ~ TotExp_transformed, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -736527910   46817945  -15.73   <2e-16 ***
## TotExp_transformed  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
# Extract residuals from the transformed linear regression model
residuals_transformed <- residuals(slr_model_transformed)

# Create a histogram of residuals
hist(residuals_transformed, breaks = 20, col = "skyblue", border = "black", 
     main = "Histogram of Transformed Residuals", xlab = "Residuals", ylab = "Frequency")

This histogram displays closer liner distribution of residuals from the transformed linear regression model to assess the normality assumption and the distribution of the residuals. It is clear that the simple regression model using the transformed variables is better than that of without transforming variables.

Exercise 3

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

newdata <- data.frame(TotExp_transformed = c(1.5, 2.5))

# Predict life expectancy for new data points
predicted_life_exp <- predict(slr_model_transformed, newdata = newdata, interval = "predict")^(1/4.6)
predicted_life_exp
##        fit      lwr      upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414

The transformed model predicts a life expectancy of approximately 63.31 years for a Total Expenditure of $860.705 (transformed to TotExp^0.06 = 1.5) with 95% confidence, that the true average life expectancy for this Total Expenditure level falls between approximately 35.94 and 73.01 years.

The model predicts a life expectancy of approximately 86.51 years for a Total Expenditure of $4,288,777 (transformed to TotExp^0.06 = 2.5) with 95% confidence, that the true average life expectancy for this Total Expenditure level falls between approximately 81.81 and 90.43 years.

Exercise 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

The lm() function will be using to perform the multiple regression analysis

# Multiple linear regression model build
multi_lm <- lm(LifeExp ~ PropMD + TotExp + TotExp:PropMD, data=who)
# Linear regression model summary
summary(multi_lm)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + TotExp:PropMD, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
# Plot residual variability
plot(multi_lm, which = 1)

# Residuals Q-Q plot
qqnorm(multi_lm$residuals)
qqline(multi_lm$residuals)

A residual standard error of 8.765 suggests that, on average, the model’s predictions are off by around 8.765 years.F-statistic of 34.49 suggests that the model explains more variance than what would be expected by random chance. R-squared of 0.3574 means that only 35.74% of the variability in life expectancy. The residuals plot indicates that there is evidence of non-constant variability and non-normality in the residuals which violates one of the assumptions of linear regression.

Exercise 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# Assuming you have the coefficients from your multiple regression model
# Replace b0, b1, b2, b3 with the coefficients from your model summary
b0 <- 6.277e+01 # replace with your intercept coefficient
b1 <- 1.497e+03 # replace with your PropMD coefficient
b2 <- 7.233e-05 # replace with your TotExp coefficient
b3 <- -6.026e-03 # replace with your interaction coefficient

# Given values
PropMD <- 0.03
TotExp <- 14

# Calculate the predicted Life Expectancy
LifeExp_Forecast <- b0 + b1 * PropMD + b2 * TotExp + b3 * PropMD * TotExp
LifeExp_Forecast
## [1] 107.6785

For a given values of PropMD=.03 and TotExp = 14, the estimated life expectancy is 107.

This forecast might be realistic because I think if we have strong enough psychological well beings, clean water, fresh food, nutritious intakes, daily workouts, less stress, overall physical health in our happy lives, we may live longer with the aid sophisticated medical technology and treatment.