The attached who.csv dataset contains real-world Health Organization data from 2008. The variables included follow. Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.
library(readr)
#Import Dataset
who <- read_csv("~/who.csv")
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
knitr::kable(head(who[,c(1,2,6,8,9,10)]))
| Country | LifeExp | PropMD | PersExp | GovtExp | TotExp |
|---|---|---|---|---|---|
| Afghanistan | 42 | 0.0002288 | 20 | 92 | 112 |
| Albania | 71 | 0.0011431 | 169 | 3128 | 3297 |
| Algeria | 71 | 0.0010605 | 108 | 5184 | 5292 |
| Andorra | 82 | 0.0032973 | 2589 | 169725 | 172314 |
| Angola | 41 | 0.0000704 | 36 | 1620 | 1656 |
| Antigua and Barbuda | 73 | 0.0001429 | 503 | 12543 | 13046 |
#View the first few rows of the data frame
head(who)
## # A tibble: 6 × 10
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua … 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>
#Check the structure of the imported data
str(data)
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"),
## envir = .GlobalEnv, overwrite = TRUE)
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
# Load the ggplot2 library
library(ggplot2)
# Create a scatterplot with ggplot
ggplot(data = who, aes(x = TotExp, y = LifeExp)) +
geom_point() +
labs(x = "Total Expenditure", y = "Life Expectancy", title = "Scatterplot of Life Expectancy vs Total Expenditure") +
geom_smooth(method = "lm", se = FALSE) # Add linear regression line without confidence interval
## `geom_smooth()` using formula = 'y ~ x'
# Run simple linear regression
slr_model <- lm(LifeExp ~ TotExp, data = who)
# Summary of the regression model
summary(slr_model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
# Diagnostic plots for the regression model
plot(slr_model)
In the scatterplot, the points do not follow a linear trend, the relationship between the predictor (Total Expenditure) and the response variable (Life Expectancy) should be approximately non-linear. So, the assumptions of simple linear regression are violated in our model.
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
# Raise the life expectancy variable to the power of 4.6 and the total expenditures variable to the power of 0.6.
who$LifeExp_transformed <- who$LifeExp ^ 4.6
who$TotExp_transformed <- who$TotExp ^ 0.06
# Create a scatterplot with ggplot2
ggplot(who, aes(x = TotExp_transformed, y = LifeExp_transformed)) +
geom_point() +
labs(x = "Total Expenditure^0.06", y = "Life Expectancy^4.6", title = "Transformed Relationship")
# Run simple linear regression with transformed variables
slr_model_transformed <- lm(LifeExp_transformed ~ TotExp_transformed, data = who)
# Summary of the transformed regression model
summary(slr_model_transformed)
##
## Call:
## lm(formula = LifeExp_transformed ~ TotExp_transformed, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_transformed 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
# Extract residuals from the transformed linear regression model
residuals_transformed <- residuals(slr_model_transformed)
# Create a histogram of residuals
hist(residuals_transformed, breaks = 20, col = "skyblue", border = "black",
main = "Histogram of Transformed Residuals", xlab = "Residuals", ylab = "Frequency")
This histogram displays closer liner distribution of residuals from the transformed linear regression model to assess the normality assumption and the distribution of the residuals. It is clear that the simple regression model using the transformed variables is better than that of without transforming variables.
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
newdata <- data.frame(TotExp_transformed = c(1.5, 2.5))
# Predict life expectancy for new data points
predicted_life_exp <- predict(slr_model_transformed, newdata = newdata, interval = "predict")^(1/4.6)
predicted_life_exp
## fit lwr upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414
The transformed model predicts a life expectancy of approximately 63.31 years for a Total Expenditure of $860.705 (transformed to TotExp^0.06 = 1.5) with 95% confidence, that the true average life expectancy for this Total Expenditure level falls between approximately 35.94 and 73.01 years.
The model predicts a life expectancy of approximately 86.51 years for a Total Expenditure of $4,288,777 (transformed to TotExp^0.06 = 2.5) with 95% confidence, that the true average life expectancy for this Total Expenditure level falls between approximately 81.81 and 90.43 years.
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
The lm() function will be using to perform the multiple regression analysis
# Multiple linear regression model build
multi_lm <- lm(LifeExp ~ PropMD + TotExp + TotExp:PropMD, data=who)
# Linear regression model summary
summary(multi_lm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + TotExp:PropMD, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
# Plot residual variability
plot(multi_lm, which = 1)
# Residuals Q-Q plot
qqnorm(multi_lm$residuals)
qqline(multi_lm$residuals)
A residual standard error of 8.765 suggests that, on average, the model’s predictions are off by around 8.765 years.F-statistic of 34.49 suggests that the model explains more variance than what would be expected by random chance. R-squared of 0.3574 means that only 35.74% of the variability in life expectancy. The residuals plot indicates that there is evidence of non-constant variability and non-normality in the residuals which violates one of the assumptions of linear regression.
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
# Assuming you have the coefficients from your multiple regression model
# Replace b0, b1, b2, b3 with the coefficients from your model summary
b0 <- 6.277e+01 # replace with your intercept coefficient
b1 <- 1.497e+03 # replace with your PropMD coefficient
b2 <- 7.233e-05 # replace with your TotExp coefficient
b3 <- -6.026e-03 # replace with your interaction coefficient
# Given values
PropMD <- 0.03
TotExp <- 14
# Calculate the predicted Life Expectancy
LifeExp_Forecast <- b0 + b1 * PropMD + b2 * TotExp + b3 * PropMD * TotExp
LifeExp_Forecast
## [1] 107.6785
For a given values of PropMD=.03 and TotExp = 14, the estimated life expectancy is 107.
This forecast might be realistic because I think if we have strong enough psychological well beings, clean water, fresh food, nutritious intakes, daily workouts, less stress, overall physical health in our happy lives, we may live longer with the aid sophisticated medical technology and treatment.