library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
# Load required libraries
library(readr)
# URL of the CSV file
url <- "https://tinyurl.com/4n73tm5u"
# Load the data from the URL
data <- read_csv(url, col_types = cols(
Country = col_character(),
LifeExp = col_double(),
InfantSurvival = col_double(),
Under5Survival = col_double(),
TBFree = col_double(),
PropMD = col_double(),
PropRN = col_double(),
PersExp = col_double(),
GovtExp = col_double(),
TotExp = col_double()
))
head(data)
## # A tibble: 6 × 10
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua … 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>
# Scatterplot
ggplot(data, aes(x = TotExp, y = LifeExp)) +
geom_point() +
labs(x = "Total Expenditures", y = "Life Expectancy") +
ggtitle("Scatterplot of Life Expectancy vs Total Expenditures")
# Simple linear regression
lm_model <- lm(LifeExp ~ TotExp, data = data)
summary(lm_model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The analysis shows that life expectancy might be higher or lower, based on healthcare spending and the number of doctors. this result seems to be reliable and not likely to be just by chance. About 36% of the reasons for different life expectancies among the studied group can be related to the model’s factors. the model isn’t perfect there are some variation it can’t explain. Each of the things we looked at how much money is spent on healthcare, how many doctors there are, and how these two factors work together makes a real difference to life expectancy.
data <- data %>%
mutate(LifeExp_Transformed = LifeExp^4.6,
TotExp_Transformed = TotExp^0.06)
# Scatterplot
ggplot(data, aes(x = TotExp_Transformed, y = LifeExp_Transformed)) +
geom_point() +
labs(x = "Transformed Expenditures", y = "Transformed Life Expectancy") +
ggtitle("Transformed Life Expectancy vs Transformed Total Expenditures")
lm_model_transformed <- lm(LifeExp_Transformed ~ TotExp_Transformed, data = data)
summary(lm_model_transformed)
##
## Call:
## lm(formula = LifeExp_Transformed ~ TotExp_Transformed, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_Transformed 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Its difficult to tell which is the better model, because the first model offer more subtlety , but the second model is simpler. However, the interpretability of the second model’s transformed variables could be less natural.
forecast_1 <- predict(lm_model_transformed, newdata = data.frame(TotExp_Transformed = 1.5))
forecast_2 <- predict(lm_model_transformed, newdata = data.frame(TotExp_Transformed = 2.5))
print(paste("Forecast Life Expectancy for TotExp_Transformed = 1.5:", forecast_1))
## [1] "Forecast Life Expectancy for TotExp_Transformed = 1.5: 193562413.987495"
print(paste("Forecast Life Expectancy for TotExp_Transformed = 2.5:", forecast_2))
## [1] "Forecast Life Expectancy for TotExp_Transformed = 2.5: 813622629.638233"
lm_model_multiple <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = data)
summary(lm_model_multiple)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The model is decent but not perfect. It shows us that things like the number of doctors and how much money is spent on healthcare can give us clues about why people might live longer or shorter lives. But there’s still a lot we don’t understand, and the model doesn’t catch everything. We could probably get a better picture by looking at more factors,or using more complex methods.
# Forecast life expectancy for PropMD = 0.03 and TotExp = 14
forecast_multiple <- predict(lm_model_multiple, newdata = data.frame(PropMD = 0.03, TotExp = 14))
print(paste("Forecasted Life Expectancy:", forecast_multiple))
## [1] "Forecasted Life Expectancy: 107.696003708063"
a life expectancy forecast of over 107 years is not realistic, predicting that people will live on average more than 107 years is not very believable with what we know today about how long people usually live.