library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)

# Load required libraries
library(readr)

# URL of the CSV file
url <- "https://tinyurl.com/4n73tm5u"

# Load the data from the URL
data <- read_csv(url, col_types = cols(
  Country = col_character(),
  LifeExp = col_double(),
  InfantSurvival = col_double(),
  Under5Survival = col_double(),
  TBFree = col_double(),
  PropMD = col_double(),
  PropRN = col_double(),
  PersExp = col_double(),
  GovtExp = col_double(),
  TotExp = col_double()
))

head(data)
## # A tibble: 6 × 10
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghanis…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania        71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria        71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra        82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
## 5 Angola         41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigua …      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

# Scatterplot
ggplot(data, aes(x = TotExp, y = LifeExp)) +
  geom_point() +
  labs(x = "Total Expenditures", y = "Life Expectancy") +
  ggtitle("Scatterplot of Life Expectancy vs Total Expenditures")

# Simple linear regression
lm_model <- lm(LifeExp ~ TotExp, data = data)
summary(lm_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The analysis shows that life expectancy might be higher or lower, based on healthcare spending and the number of doctors. this result seems to be reliable and not likely to be just by chance. About 36% of the reasons for different life expectancies among the studied group can be related to the model’s factors. the model isn’t perfect there are some variation it can’t explain. Each of the things we looked at how much money is spent on healthcare, how many doctors there are, and how these two factors work together makes a real difference to life expectancy.

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

data <- data %>%
  mutate(LifeExp_Transformed = LifeExp^4.6,
         TotExp_Transformed = TotExp^0.06)

# Scatterplot
ggplot(data, aes(x = TotExp_Transformed, y = LifeExp_Transformed)) +
  geom_point() +
  labs(x = "Transformed Expenditures", y = "Transformed Life Expectancy") +
  ggtitle("Transformed Life Expectancy vs Transformed Total Expenditures")

lm_model_transformed <- lm(LifeExp_Transformed ~ TotExp_Transformed, data = data)
summary(lm_model_transformed)
## 
## Call:
## lm(formula = LifeExp_Transformed ~ TotExp_Transformed, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -736527910   46817945  -15.73   <2e-16 ***
## TotExp_Transformed  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Its difficult to tell which is the better model, because the first model offer more subtlety , but the second model is simpler. However, the interpretability of the second model’s transformed variables could be less natural.

3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

forecast_1 <- predict(lm_model_transformed, newdata = data.frame(TotExp_Transformed = 1.5))
forecast_2 <- predict(lm_model_transformed, newdata = data.frame(TotExp_Transformed = 2.5))

print(paste("Forecast Life Expectancy for TotExp_Transformed = 1.5:", forecast_1))
## [1] "Forecast Life Expectancy for TotExp_Transformed = 1.5: 193562413.987495"
print(paste("Forecast Life Expectancy for TotExp_Transformed = 2.5:", forecast_2))
## [1] "Forecast Life Expectancy for TotExp_Transformed = 2.5: 813622629.638233"

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

lm_model_multiple <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = data)
summary(lm_model_multiple)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The model is decent but not perfect. It shows us that things like the number of doctors and how much money is spent on healthcare can give us clues about why people might live longer or shorter lives. But there’s still a lot we don’t understand, and the model doesn’t catch everything. We could probably get a better picture by looking at more factors,or using more complex methods.

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# Forecast life expectancy for PropMD = 0.03 and TotExp = 14
forecast_multiple <- predict(lm_model_multiple, newdata = data.frame(PropMD = 0.03, TotExp = 14))
print(paste("Forecasted Life Expectancy:", forecast_multiple))
## [1] "Forecasted Life Expectancy: 107.696003708063"

a life expectancy forecast of over 107 years is not realistic, predicting that people will live on average more than 107 years is not very believable with what we know today about how long people usually live.