Task 1:

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression met.

Solution 1:

The scatter plot is shown below:

who_data <- read.csv("https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/main/who.csv")

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(who_data, aes(x = TotExp, y = LifeExp)) + geom_point() + labs(title = "Life Expectancy vs. Expenditures", x = "Sum of Personal and Government Expenditures", y = "Average Life Expectancy (Years)")

The linear regression equation is calculated below:

simple_model <- lm(LifeExp ~ TotExp, data = who_data)
summary(simple_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The simple linear regression results in the equation \(y=64.75+0.00006297x\), where \(x\) represents the sum of personal and government expenditures and \(y\) represents life expectancy.

The multiple R-squared is 0.2577, indicating that about 25.77% of the variance in life expectancy is explained by variations in the sum of personal and government expenditures.

Both p-values are extremely small, far less than 0.001. For the y-intercept, this is strong evidence that the y-intercept is not 0\(^1\). For the slope, this is strong evidence of a linear relationship between life expectancy and the sum of personal and government expenditures.

The F-statistic provides no additional information in this model, since it has only one additional parameter.

The standard error for the intercept is about 0.7535, which is approximately 86 times smaller than the intercept estimate. This means that there is relatively little variability in the y-intercept estimate. The standard error for the slope is about 0.000007795, which is approximately 8 times smaller than the slope estimate. This indicates more uncertainty than for the intercept, but is still in the target range for a good model of 5-10.

Despite these strong indicators, this model fails the assumption of linearity. We can see from the scatterplot that the relationship between the variables does not appear to be linear.

\(^1\)Obviously.

Task 2:

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, R^2, standard error, and p-values. Which model is “better”?

Solution 2:

The data is transformed according to the given criteria below:

transformed_who_data <- who_data %>%
  mutate(LifeExp = LifeExp^4.6) %>%
  mutate(TotExp = TotExp^0.06)

The scatter plot is shown below:

ggplot(transformed_who_data, aes(x = TotExp, y = LifeExp)) + geom_point() + labs(title = "Adjusted Life Expectancy vs. Adjusted Expenditures", x = "Adjusted Sum of Personal and Government Expenditures", y = "Adjusted Average Life Expectancy (Years)")

The linear equation is calculated below:

adjusted_model <- lm(LifeExp ~ TotExp, data = transformed_who_data)
summary(adjusted_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = transformed_who_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The simple linear regression results in the equation \(y=-736527909+620060216x\), where \(x\) represents the sum of personal and government expenditures raised to the power of 0.06 and \(y\) represents life expectancy raised to the power of 4.6.

Th multiple R-squared is 0.7298, indicating that about 72.98% of the variance in adjusted life expectancy is explained by variations in the adjusted sum of personal and government expenditures.

Both p-values are extremely small, far less than 0.001. For the y-intercept, this is strong evidence that the y-intercept is not 0. For the slope, this is strong evidence of a linear relationship between adjusted life expectancy and the adjusted sum of personal and government expenditures.

As before, the F-statistic provides no additional information in this model, since it has only one additional parameter.

The standard errors both result in t-values with a magnitude greater than 10, indicating there is a small amount of variability in these estimates.

I would say that this new adjusted model is better. Although both p-values are infinitesimal, the p-value for the adjusted model is more than two orders of magnitude smaller than the p-value for the original model. Additionally, the adjusted model has a significantly larger t-value for the slope. Our original simple model explains 25.77% of the variance in life expectancy, while our adjusted model explains almost 73% of the variance in adjusted life expectancy, once again a point in favor of the adjusted model.

Task 3:

Using the results from 3, forecast life expectancy when TotExp^.06=1.5. Then forecast life expectancy when TotExp^.06=2.5.

Solution 3:

When TotExp^.06 = 1.5, our regression equation would predict an adjusted life expectancy of \(y=-736527909+620060216(1.5)=193,562,415\). This number represents life expectancy raised to the power of 4.6, so we have to raise it to the power of \(\frac{1}{4.6}\) to determine life expectancy. \(193,562,415^\frac{1}{4.6}=63.31153\), so we would forecast a life expectancy of about 63 years for this adjusted total expenditure.

When TotExp^.06 = 2.5, our regression equation would predict an adjusted life expectancy of \(y=-736527909+620060216(2.5)=813,622,631\). This number represents life expectancy raised to the power of 4.6, so we have to raise it to the power of \(\frac{1}{4.6}\) to determine life expectancy. \(813,622,631^\frac{1}{4.6}=86.50645\), so we would forecast a life expectancy of about 86.5 years for this adjusted total expenditure.

Task 4:

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp + b3 x PropMD x TotExp

Solution 4:

multiple_data <- who_data %>% mutate(combined_statistic = PropMD * TotExp)

multiple_model <- lm(LifeExp ~ PropMD + TotExp + combined_statistic, data = multiple_data)

summary(multiple_model)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + combined_statistic, 
##     data = multiple_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD              1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp              7.233e-05  8.982e-06   8.053 9.39e-14 ***
## combined_statistic -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

This model results in the equation \(\text{LifeExp} = 62.77+1497\times \text{PropMD }+0.00007233\times \text{TotExp }-0.006026\times \text{PropMD }\times \text{TotExp}\)

The adjusted R-squared is 0.3471, indicating that about 34.71% of the variance in life expectancy is explained by this model.

All four p-values are less than 0.01. For the y-intercept, this is strong evidence that the y-intercept is not 0. For the coefficients, this is strong evidence of a linear relationship between life expectancy and each of the independent variables.

The F-statistic p-value is small, indicating that the current model is a better fit for the data than a model with one fewer predictor.

The standard errors for the coefficients result in t-values with a magnitude between 4 and 9, indicating some uncertainty in the estimates for these coefficients. Ideally, our t values would be between 5 and 10.

I would say this model is of moderate quality. The p-values are very small, but we would like to have greater t values, and accounting for only about 35% of the variance in the data is not ideal.

Task 5:

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Solution 5:

Using the equation above, these values would result in \(\text{LifeExp} = 62.77+1497(.03)+0.00007233(14)-0.006026(.03)(14)=107.6785\), so we would project a life expectancy of around 107.7 years. This is not a realistic life expectancy forecast, since a tiny percentage of humans live to this age. However, this does not necessarily reflect a problem with the model. The strange outcome simply reflects an unusual combination of factors. As shown in the table below, there are only two countries with PropMD as high as .03 (Cyprus and San Marino). Each of these countries has a TotExp far higher than 14, Cyprus is 40,749 and San Marino is 281,653. So, this combination of inputs produces an unrealistic outcome only because no countries with this combination of factors exists.

similar_cases <- multiple_data %>% filter(PropMD >= .03) %>% select(Country, LifeExp, PropMD, TotExp)

library(knitr)

kable(similar_cases, col.names = c("Country", "Life Expectancy", "MD Proportion", "Total Expenditures"), align = "lccc")
Country Life Expectancy MD Proportion Total Expenditures
Cyprus 80 0.0332281 40749
San Marino 82 0.0351290 281653

To further illustrate this, I have created a scatter plot of PropMD vs. TotExp below. The provided combination of PropMD = .03 and TotExp = 14 is shown as a red star.

ggplot(who_data, aes(x = TotExp, y = PropMD)) + 
  geom_point() + 
  labs(title = "MD Proportion vs. Total Expenditures", x = "Sum of Personal and Government Expenditures", y = "MD Proportion") +
  geom_point(aes(x = 14, y = 0.03), shape = 8, colour = "red")