The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.

library(readr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
urlfile<-"https://raw.githubusercontent.com/catcho1632/DATA605/main/Assignment12_dataset"
who<-read_csv(url(urlfile))
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(who)
## # A tibble: 6 × 10
##   Country  LifeExp Infan…¹ Under…² TBFree  PropMD  PropRN PersExp GovtExp TotExp
##   <chr>      <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1 Afghani…      42   0.835   0.743  0.998 2.29e-4 5.72e-4      20      92    112
## 2 Albania       71   0.985   0.983  1.00  1.14e-3 4.61e-3     169    3128   3297
## 3 Algeria       71   0.967   0.962  0.999 1.06e-3 2.09e-3     108    5184   5292
## 4 Andorra       82   0.997   0.996  1.00  3.30e-3 3.5 e-3    2589  169725 172314
## 5 Angola        41   0.846   0.74   0.997 7.04e-5 1.15e-3      36    1620   1656
## 6 Antigua…      73   0.99    0.989  1.00  1.43e-4 2.77e-3     503   12543  13046
## # … with abbreviated variable names ¹​InfantSurvival, ²​Under5Survival
  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

The scatter plot of Life Expectancy vs Total Expenditure does not look linear but it rather increases rapidly from the left side of the plot until the trend plateus. Visually, a line would not best fit the entirety of the plot. It seems like there should be 2 different lines to fit this plot.

plot(who$TotExp, who$LifeExp, main = 'Life Expectancy vs Total Expenditures', xlab = 'Total Expenditures', ylab = 'Life Expectancy')

lm_who <- lm(LifeExp~TotExp, data = who)
lm_who
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Coefficients:
## (Intercept)       TotExp  
##   6.475e+01    6.297e-05

Provide and interpret the F statistics, R^2, standard error,and p-values only.

An F-statistic of 65.26 is reported with a p-value of 7.714e-14. Assuming a significance value of 0.05, a p-value of 7.714e-14 is statistically significant and so we can reject the null hypothesis that there is no relationship between Life Expectancy vs Total Expenditures. It also means that obtaining an F-value of 65.26 from chance alone if there was no relationship between the dependent and independent variables is significantly small. From this evidence alone, there is evidence to support that there is a relationship between these variables. This F-statistic indicates that there is a larger proportion of the variance in the dependent variable that is explained by the regression model, relative to the unexplained variance.

However, the F-statistic and p-value alone is not enough to say a linear regression is the best fit model. The R^2 of 0.2577 is reported, which means only 25% of the variance in the dependent variable is explained by this linear regression. The other 75% is unexplained, which seems significant.

The variance not captured by this linear regression is further seen in the plot below. It’s clear that a line may not be the best fit model to explain this data.

plot(who$TotExp, who$LifeExp, main = 'Life Expectancy vs Total Expenditures', xlab = 'Total Expenditures', ylab = 'Life Expectancy')

abline(lm_who)

summary(lm_who)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Next, the assumptions are checked. The summary shows that the residuals are not generally centered around 0. The Q1 and Q3 values are not equal and max and min residuals also deviate from each other. The Q-Q plot confirms that the residuals do not seem to follow a normal distribution. The plot is bowed and does not follow a linear trend well. The Residual vs Fitted line shows that the spread of residuals are also not uniformly spread out and there appears to have more spread left of the plot. Additionally, the cook’s distance is high on the left of the residuals vs leverage plot which means there is disproportionately influential points in this regression model. Overall, the assumptions for a linear regression model are not met.

qqnorm(resid(lm_who))
qqline(resid(lm_who))

par(mfrow=c(2,2))
plot(lm_who)

  1. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

The plot with the adjusted data visually shows a more linear plot. The equation of the line is:

\[LifeExp^{4.6}=620060216*TotExp^{0.06}-736527909\]

The F-statistic 507.7 for a p-value of < 2.2e-16. This means that with the assumption that there is no relationship between the dependent and independent variables, the chance of getting an F-statistic of 507.7 is very small. The p-value is significantly lower than the significance value of 0.05 and therefore this linear regression is statistically significant and the null hypothesis that there is no relationship can be rejected. The higher F-statistic of 507.7 is greater than the first model and therefore more variance of the dependent variable is captured by this model. Additionally, the R^2 increased to .73, which is a significant improvement in the amount of variation captured by this model than the 25% from the last. Overall, this model does a better job capturing the variance of the data.

who_1 = who
who_1$LifeExp = who$LifeExp^4.6
who_1$TotExp = who_1$TotExp^0.06

plot(who_1$TotExp, who_1$LifeExp, main = 'Life Expectancy vs Total Expenditures', xlab = 'Total Expenditures', ylab = 'Life Expectancy')

lm_who_1 = lm(LifeExp~TotExp, data = who_1)
lm_who_1
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_1)
## 
## Coefficients:
## (Intercept)       TotExp  
##  -736527909    620060216
summary(lm_who_1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

This can be further seen when the regression line is plotted against the scatter plot. The line does a better job of explaining this relationship.

plot(who_1$TotExp, who_1$LifeExp, main = 'Life Expectancy vs Total Expenditures', xlab = 'Total Expenditures', ylab = 'Life Expectancy')

abline(lm_who_1)

  1. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

The linear equation is: \[LifeExp^{4.6}=620060216*TotExp^{0.06}-736527909\]

# Forecast for TotExp^0.06 = 1.5
LifeExp_1.5 = (620060216*1.5-736527909)^(1/4.6)
TotExp_1.5 = 1.5^(1/0.06)

print(paste0('For a total expenditure of ',round(TotExp_1.5,3), ' the forecasted life expectancy is ',round(LifeExp_1.5,3), ' years'))
## [1] "For a total expenditure of 860.705 the forecasted life expectancy is 63.312 years"
# Forecast for TotExp^.06 = 2.5
LifeExp_1.5 = (620060216*2.5-736527909)^(1/4.6)
TotExp_1.5 = 2.5^(1/0.06)

print(paste0('For a total expenditure of ',round(TotExp_1.5,3), ' the forecasted life expectancy is ',round(LifeExp_1.5,3), ' years'))
## [1] "For a total expenditure of 4288777.125 the forecasted life expectancy is 86.506 years"
  1. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

The equation is: \[LifeExp=62.77+ 1497 * PropMd + 0.00007233 * TotExp - 0.006026 * PropMD * TotExp\]

An F-statistic of 34.49 with 3 numerator degrees of freedom and 186 denominator degrees of freedom, and a p-value of < 2.2e-16 (which is basically 0), shows that the model is statistically significant and that the predictor variables are collectively contributing to the prediction of the outcome variable. The low p-value is less than 0.05, so the null hypothesis that there is no relationship between the variables and the outcome can be rejected. Although this F-statistic is smaller than when TotExp was the only predictor, the increase in DOF (3) means there is more variance from the residuals that is being divided in the denominator. There the R^2 is looked at next. The R^2 is 0.3574, which means 36% of the variance in the independent variable can be explained by this model. I would say adding more variables did not help the model to capture more variance. It seems having the TotExp and adjusting it through mathematical manipulation did a better job at capturing the variation in LifeExp.

lm_who_2 = lm(LifeExp~PropMD+TotExp+I(PropMD*TotExp), data = who)

lm_who_2
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + I(PropMD * TotExp), 
##     data = who)
## 
## Coefficients:
##        (Intercept)              PropMD              TotExp  I(PropMD * TotExp)  
##          6.277e+01           1.497e+03           7.233e-05          -6.026e-03
summary(lm_who_2)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + I(PropMD * TotExp), 
##     data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD              1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp              7.233e-05  8.982e-06   8.053 9.39e-14 ***
## I(PropMD * TotExp) -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
  1. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

\[LifeExp=62.77+ 1497 * PropMd + 0.00007233 * TotExp - 0.006026 * PropMD * TotExp\]

This model forecasts that the life expectancy is 108 years old if 3% of the population are MDs and total expenditure is 14. This is contrary to what I would expect since you would think it would require more than $14 from personal and goverment expenditure and more than a 3% population of doctors to push the average life expectancy to such a high value. This model does not make sense and I would not rely on its forecasting.

LifeExp_mult = 62.77+ 1497 * 0.03 + 0.00007233 * 14 - 0.006026 * 0.03 * 14
print(paste0('The life expectancy if 3% of the population are MDs and total expenditure is 14 is ',round(LifeExp_mult,3), ' years'))
## [1] "The life expectancy if 3% of the population are MDs and total expenditure is 14 is 107.678 years"