HW 12 RBalaban

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures

Read in data

who_data <- read.csv("https://raw.githubusercontent.com/RonBalaban/CUNY-SPS-R/main/who.csv")
head(who_data)

##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp
## 1 0.000572294      20      92    112
## 2 0.004614439     169    3128   3297
## 3 0.002091362     108    5184   5292
## 4 0.003500000    2589  169725 172314
## 5 0.001146162      36    1620   1656
## 6 0.002773810     503   12543  13046

plot(who_data, main="WHO Variable Correlation")

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

# Scatterplot of LifeExp (response) against TotExp (predictor)
ggplot(who_data, aes(x = TotExp, y = LifeExp)) +
  geom_point() +
  labs(x = "Total Average Healthcare Expenditure", y = "Average Life Expectancy") +
  geom_smooth(method = lm)

## `geom_smooth()` using formula = 'y ~ x'

# Make simple linear regression
Lexp_Texp.lm <- lm(LifeExp~TotExp, data=who_data)

# Histogram of residual values
hist(resid(Lexp_Texp.lm), main = "Residuals Histogram", xlab = "Residuals")

#Q-Q plot
qqnorm(Lexp_Texp.lm$residuals)
qqline(Lexp_Texp.lm$residuals)

# Get summary of our model
summary(Lexp_Texp.lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The Multiple R-squared is \(0.2577\) and the Adjusted R-squared is \(0.2537\), so this model explains only about 25% of the variance in life expectancy. The scatterplot shows why: life expectancy climbs steeply across the low-expenditure countries and then flattens out, so the relationship is strongly curved and a straight line fits it poorly. A model that leaves about 75% of the variance unexplained is a weak description of the data, so we need something beyond an untransformed simple linear regression.
For a good model, we’d want the standard error to be at least 5-10 times smaller than its corresponding coefficient (Linear Regression Using R, Pg. 21). The standard error for TotExp is \(7.795e-06\), and the coefficient is \(6.297e-05\). Dividing the coefficient by the standard error gives a ratio of \(8.079\), which is the t-value, so the slope is estimated fairly precisely relative to its size. The residual standard error is \(9.371\) on 188 degrees of freedom, meaning a typical prediction misses the observed life expectancy by about 9 years.
The p-value for the slope is \(7.71e-14\): if the true slope were zero, there would be only a tiny chance of observing a t-value at least as extreme as \(8.079\). That is strong evidence that life expectancy is associated with total expenditure, but it does not by itself show that the association is linear.
The F-statistic is \(65.26\) on 1 and 188 degrees of freedom. It compares the current model (using the variable ‘TotExp’) to a model that only has the intercept parameter (Linear Regression Using R, Pg. 23). With a single predictor it is the square of the slope’s t-value and carries the same p-value, \(7.714e-14\).
Looking at both the numbers above and the plots, the assumptions of simple linear regression are not met: the relationship is not linear, the residuals do not have constant variance, and they aren’t normally distributed. The residual summary is lopsided, with a median of \(3.154\), a minimum of \(-24.764\) and a maximum of \(13.292\), which indicates a long left tail. Hence, a simple linear regression on the untransformed variables is not a good model to describe the relationship between LifeExp and TotExp.
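The link between the t-value, the F-statistic, R^2 and the p-value can be checked by hand. The sketch below only reuses numbers copied from the `summary(Lexp_Texp.lm)` output above and does not refit the model; the variable names are illustrative. Both versions of F should land near the reported \(65.26\), up to rounding of the printed inputs.

```r
# Values copied from the printed summary(Lexp_Texp.lm); nothing is refit here.
slope     <- 6.297e-05   # TotExp estimate
slope_se  <- 7.795e-06   # TotExp standard error
r_squared <- 0.2577      # Multiple R-squared
df_resid  <- 188         # residual degrees of freedom

# t-value is the estimate divided by its standard error
t_value <- slope / slope_se

# With one predictor, F equals t^2 and also R^2 / (1 - R^2) * residual df
f_from_t  <- t_value^2
f_from_r2 <- r_squared / (1 - r_squared) * df_resid

# Upper-tail p-value of the reported F-statistic on 1 and 188 DF
p_value <- pf(65.26, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(c(t = t_value, F_from_t = f_from_t, F_from_R2 = f_from_r2))
print(p_value)
```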

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

# Needed transformations
life4.6 <- who_data$LifeExp^4.6
texp.06 <- who_data$TotExp^0.06

# Re-run simple regression model with transformed variables
fit2.lm <- lm(life4.6 ~ texp.06)
summary(fit2.lm)

## 
## Call:
## lm(formula = life4.6 ~ texp.06)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## texp.06      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

# Scatterplot
plot(life4.6~texp.06, 
     xlab="Total Expenditure ^ 0.06", ylab="Life Expectancy ^ 4.6",
     main="Total Expenditures vs Life Expectancy (Transformed)")
abline(fit2.lm)

# Residuals
hist(resid(fit2.lm), main = "Histogram of Residuals", xlab = "residuals")

plot(fitted(fit2.lm), resid(fit2.lm))

# QQ plot
qqnorm(fit2.lm$residuals)
qqline(fit2.lm$residuals)

The Multiple R-squared and Adjusted R-squared are now \(0.7298\) and \(0.7283\), so this model accounts for about 73% of the variance in the transformed life expectancy, compared with about 25% for the first model. The slope’s standard error (\(27518940\)) is small relative to its coefficient (\(620060216\)), giving a t-value of \(22.53\). The p-value is also smaller than in the first model, at \(< 2.2e-16\), and the F-statistic is much larger, at \(507.7\) on 1 and 188 degrees of freedom. The residual standard error (\(90490000\)) is in units of LifeExp^4.6, so it can’t be compared directly with the \(9.371\) years from the first model. Looking at the residual plots, the Q-Q plot is closer to a straight line and the histogram is closer to a normal shape. The variability is fairly constant, albeit with a couple of outliers. On all of these measures, this is a better model than the first.
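The question calls TotExp^.06 “nearly a log transform”. A quick way to see this is to compare the two transforms on a few expenditure values. The sketch below uses only the six TotExp values printed by `head(who_data)` above, not the full dataset, and the variable names are illustrative. A correlation close to 1 means the two transforms are almost linearly related over this range.

```r
# TotExp values copied from the head(who_data) output (first six countries only)
tot_exp <- c(112, 3297, 5292, 172314, 1656, 13046)

# The power transform used in the model, and the log transform it approximates
power_transform <- tot_exp^0.06
log_transform   <- log(tot_exp)

# Correlation between the two transformed versions of the same values
transform_cor <- cor(power_transform, log_transform)
print(transform_cor)
```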

3. Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

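The fitted line has the form \(y= b + mx = -736527910 + 620060216x\), where \(y\) is LifeExp^4.6 and \(x\) is TotExp^.06. To get a forecast in years, the predicted \(y\) has to be raised to the power \(1/4.6\).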

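# Forecast life expectancy in years from x = TotExp^0.06,
# using the (rounded) coefficients printed by summary(fit2.lm)
lexp_forecast <- function(x) {
  y <- -736527910 + 620060216 * x  # predicted LifeExp^4.6
  y^(1 / 4.6)                      # undo the 4.6 power to get back to years
}

# Compute
lexp_forecast(1.5) # 63.31153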

## [1] 63.31153

lexp_forecast(2.5) # 86.50645

## [1] 86.50645
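Written out step by step, the two forecasts from the function above work as follows. The coefficients are the rounded values printed by `summary(fit2.lm)`, and the final figures are rounded to one decimal place.

```latex
\[
\widehat{LifeExp}\,\big|_{x = 1.5}
  = \left(-736527910 + 620060216 \times 1.5\right)^{1/4.6}
  = \left(193562414\right)^{1/4.6}
  \approx 63.3
\]
\[
\widehat{LifeExp}\,\big|_{x = 2.5}
  = \left(-736527910 + 620060216 \times 2.5\right)^{1/4.6}
  = \left(813622630\right)^{1/4.6}
  \approx 86.5
\]
```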

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)

# Multiple regression model
fit3.lm <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who_data)
summary(fit3.lm)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

# Plots
hist(resid(fit3.lm), xlab = "residuals")

plot(fitted(fit3.lm), resid(fit3.lm))

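The model with an additional predictor and an interaction term is better than the original model (Lexp_Texp.lm). All three terms (PropMD, TotExp, and PropMD*TotExp) have small p-values, and each standard error is several times smaller than its coefficient, with absolute t-values between about 4.1 and 8.1. However, the Multiple R-squared is only \(0.3574\) (Adjusted \(0.3471\)), so roughly 65% of the variance is still unexplained, and the residual standard error of \(8.765\) years on 186 degrees of freedom is only slightly lower than the \(9.371\) of the first model. The F-statistic is \(34.49\) on 3 and 186 degrees of freedom, which is not as large as in the prior model (fit2), but its p-value of \(< 2.2e-16\) is small, so the model as a whole is statistically significant. The residuals are skewed, with a long left tail (median \(2.098\), minimum \(-27.320\), maximum \(13.074\)), and they show inconsistent variance and a non-normal distribution. This linear model is better than the first, but worse than the second.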
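The ranking of the three models can be cross-checked by recomputing the adjusted R-squared from the printed Multiple R-squared values. This is a minimal sketch: `adj_r2` is a hypothetical helper, not part of the original analysis, and it assumes n = 190 countries (residual degrees of freedom plus estimated coefficients, i.e. 188 + 2 and 186 + 4). Because the inputs are rounded, the results may differ from the printed values in the last digit.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

n_countries <- 190  # assumption: 188 residual DF + 2 coefficients in the simple models

adjusted <- c(
  Lexp_Texp.lm = adj_r2(0.2577, n_countries, p = 1),  # LifeExp ~ TotExp
  fit2.lm      = adj_r2(0.7298, n_countries, p = 1),  # transformed variables
  fit3.lm      = adj_r2(0.3574, n_countries, p = 3)   # PropMD, TotExp, interaction
)

# Sort from best to worst by adjusted R-squared
print(sort(adjusted, decreasing = TRUE))
```

Note that fit2.lm models LifeExp^4.6 instead of LifeExp, so its R-squared describes variance on a different scale from the other two.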

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# Test data
test_data <- data.frame(TotExp = c(14), PropMD = c(.03))

# Predict life expectancy from model
predicted_life_exp <- predict(fit3.lm, newdata = test_data)
print(predicted_life_exp)

##       1 
## 107.696

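A forecasted life expectancy of about 107.7 years is not realistic: it is far above the life expectancy of any country in the rows printed above, where the highest is 82. The inputs are also an implausible combination. PropMD = .03 is roughly ten times the largest value in those rows (0.0033 for Andorra), while TotExp = 14 is far below the smallest one (112 for Afghanistan), so the model is extrapolating well outside the kind of data it was fit on. Nearly all of the increase over the intercept comes from the PropMD term and not from medical expenditure.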
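To see where the 107.7 years comes from, the prediction can be broken into the contribution of each term. The sketch below uses the rounded coefficients printed by `summary(fit3.lm)`, so the total differs slightly from the `predict()` output above; the variable names are illustrative.

```r
# Coefficients copied (rounded) from the summary(fit3.lm) output
b_intercept   <- 6.277e+01
b_prop_md     <- 1.497e+03
b_tot_exp     <- 7.233e-05
b_interaction <- -6.026e-03

# Inputs from the question
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted life expectancy, in years
contributions <- c(
  intercept   = b_intercept,
  PropMD      = b_prop_md * prop_md,
  TotExp      = b_tot_exp * tot_exp,
  interaction = b_interaction * prop_md * tot_exp
)

print(round(contributions, 4))
print(sum(contributions))
```

The PropMD term alone adds roughly 45 years on top of the intercept, while the TotExp and interaction terms are negligible at this expenditure level.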