# Load the necessary packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
Exp_data <- read_csv("https://raw.githubusercontent.com/SalouaDaouki/Data605/main/Exp_data.csv")
## Rows: 190 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (11): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, P...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
str(Exp_data)
## spc_tbl_ [190 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:190] "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ LifeExp : num [1:190] 42 71 71 82 41 73 75 69 82 80 ...
## $ InfantSurvival: num [1:190] 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
## $ Under5Survival: num [1:190] 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
## $ TBFree : num [1:190] 0.998 1 0.999 1 0.997 ...
## $ PropMD : num [1:190] 2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
## $ PropRN : num [1:190] 0.000572 0.004614 0.002091 0.0035 0.001146 ...
## $ PersExp : num [1:190] 20 169 108 2589 36 ...
## $ GovtExp : num [1:190] 92 3128 5184 169725 1620 ...
## $ TotExp : num [1:190] 112 3297 5292 172314 1656 ...
## $ LifeExp_4.6 : num [1:190] 2.93e+07 3.28e+08 3.28e+08 6.36e+08 2.62e+07 ...
## $ TotExp_.06 : num [1:190] 1.33 1.63 1.67 2.06 1.56 ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. LifeExp = col_double(),
## .. InfantSurvival = col_double(),
## .. Under5Survival = col_double(),
## .. TBFree = col_double(),
## .. PropMD = col_double(),
## .. PropRN = col_double(),
## .. PersExp = col_double(),
## .. GovtExp = col_double(),
## .. TotExp = col_double(),
## .. LifeExp_4.6 = col_double(),
## .. TotExp_.06 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(Exp_data)
## # A tibble: 6 × 12
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua … 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## # ℹ 4 more variables: GovtExp <dbl>, TotExp <dbl>, LifeExp_4.6 <dbl>,
## # TotExp_.06 <dbl>
tail(Exp_data)
## # A tibble: 6 × 12
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Vanuatu 69 0.97 0.964 0.999 1.36e-4 1.63e-3 67
## 2 Venezuel… 74 0.982 0.979 0.999 1.77e-3 1.03e-3 247
## 3 Viet Nam 72 0.985 0.983 0.998 5.22e-4 7.17e-4 37
## 4 Yemen 61 0.925 0.9 0.999 3.10e-4 6.33e-4 39
## 5 Zambia 43 0.898 0.818 0.994 1.08e-4 1.88e-3 36
## 6 Zimbabwe 43 0.945 0.915 0.994 1.58e-4 7.07e-4 21
## # ℹ 4 more variables: GovtExp <dbl>, TotExp <dbl>, LifeExp_4.6 <dbl>,
## # TotExp_.06 <dbl>
summary(Exp_data)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN PersExp
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883 Min. : 3.00
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455 1st Qu.: 36.25
## Median :0.9992 Median :0.0010474 Median :0.0027584 Median : 199.50
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336 Mean : 742.00
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164 3rd Qu.: 515.25
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387 Max. :6350.00
## GovtExp TotExp LifeExp_4.6 TotExp_.06
## Min. : 10.0 Min. : 13 Min. : 23414019 Min. :1.166
## 1st Qu.: 559.5 1st Qu.: 584 1st Qu.:166291095 1st Qu.:1.465
## Median : 5385.0 Median : 5541 Median :307221061 Median :1.677
## Mean : 40953.5 Mean : 41696 Mean :307957212 Mean :1.684
## 3rd Qu.: 25680.2 3rd Qu.: 26331 3rd Qu.:421970229 3rd Qu.:1.842
## Max. :476420.0 Max. :482750 Max. :672603658 Max. :2.193
# Check for missing values
sum(is.na(Exp_data))
## [1] 0
No missing values.
Now, I will provide a scatterplot of LifeExp~TotExp
ggplot(Exp_data, aes(x = LifeExp, y = TotExp)) +
geom_point() +
labs(title = "Life expectancy vs Total expenditures",
x = "Life Expectancy",
y = "Total expenditures")
The scatter plot seems exponential, that may suggests that there is nonlinear association between the “Life Expectancy” and “Total Expenditures”.
Exp_data_lm <- lm(TotExp ~ LifeExp, data = Exp_data)
Exp_data_lm
##
## Call:
## lm(formula = TotExp ~ LifeExp, data = Exp_data)
##
## Coefficients:
## (Intercept) LifeExp
## -234039 4092
\(-234039\) is the y-intercept of the line of the best fit, and \(4092\) is the rate of change or the slope of the line. The regression equation for this model - to predict Total Expenditures based on the value of Life Expectancy is -
\(\text{Total Expenditures} = −234039+4092 \times \text{Life Expectancy}\)
Now let’s plot the line of best fit of the data:
plot(TotExp ~ LifeExp, data=Exp_data)
abline(Exp_data_lm, col = "chartreuse", lwd = 2)
Let’s start by extracting more information about the linear model:
summary(Exp_data_lm)
##
## Call:
## lm(formula = TotExp ~ LifeExp, data = Exp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70785 -47221 -25979 28684 389406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -234038.6 34568.5 -6.770 1.59e-10 ***
## LifeExp 4092.3 506.6 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75540 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
A F-statistic of 65.26 indicates that this simple linear model with the predictor variable, “Life Expectancy”, is statistically good compared to a model with no predictors; intercept-only model.
The Multiple R-squared indicates that 25.77% of the variability in total expenditures is explained by the variation in life expectancy.
The Standard error is 506.6 which is 8 times smaller than the corresponding coefficient (4092.3), so the model is a little bit good.
The p-value is very small which indicates that there is a strong evidence of linear association between life expectancy and total expenditures.
Now, let’s examine the residual plot to check the quality of this model.
plot(fitted(Exp_data_lm),resid(Exp_data_lm))
The residual plot is decreasing in a linear pattern followed by an exponential increasing pattern. This indicates that we need a better model; it suggests that the assumptions of simple linear regression might not be fully met.
Let’s take a look at the Q-Q plot and see what it will reveal:
qqnorm(resid(Exp_data_lm))
qqline(resid(Exp_data_lm))
The shape of the Q-Q plot shows a deviation from the right-side tail, which suggests that the residuals are not normally distributed. It also indicates that the residuals have heavier tail or are more skewed to the right than a normal distribution.
# Set the size of the plotting device
options(repr.plot.width = 8, repr.plot.height = 8)
# Plot the regression diagnostics
plot(Exp_data_lm)
The scale-location plot shows a v-like pattern plus the cluster on the right side of the plot, I think this means that there is no consistency in the variability of the predictor; which may suggests that this regression analysis is not valid and we need a better model.
2. Raise life expectancy to the 4.6 power (i.e., \(LifeExp^{4.6}\)). Raise total expenditures to the 0.06 power (nearly a log transform, \(TotExp^{.06}\)). Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{.06}\), and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values. Which model is “better?”
First, let’s create a new column that has the values of \(LifeExp^{4.6}\) and another column that has the values of \(TotExp^{.06}\): **(the two columns already exist in the data that I loaded from my Github, that is because when I am knitting the document, the link that was provided on blackboard produces an error. So i had to download the Exp_data into my local machine from R, upload it to my Github than…)
This is the code I used originally to add the two columns:
{Exp_data\(LifeExp_4.6 <- Exp_data\)LifeExp ^ 4.6 Exp_data\(TotExp_.06 <- Exp_data\)TotExp ^ 0.06}
Now let’s Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{.06}\), meaning that the \(LifeExp^{4.6}\) is going to be on the y-axis and the \(TotExp^{.06}\) is going to be on the x-axis:
ggplot(Exp_data, aes(x = TotExp_.06, y = LifeExp_4.6)) +
geom_point() +
labs(title = "Transformed Life expectancy vs Transformed Total expenditures",
x = "Transformed Total expenditures",
y = "Transformed Life Expectancy")
The scatter plot shows that there is a linear, positive correlation between the transformed variables.
Exp_data_lm1 <- lm(LifeExp_4.6 ~ TotExp_.06, data = Exp_data)
Exp_data_lm1
##
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_.06, data = Exp_data)
##
## Coefficients:
## (Intercept) TotExp_.06
## -736527909 620060216
\(-736527909\) is the y-intercept of the line of the best fit, and $620060216 $ is the rate of change or the slope of the line (both values are very big). The regression equation for this model - to predict transformed life expectancy based on the value of the transformed total expenditures is -
\(\text{transformed LifeExp} = -736527909+620060216 \times \text{transformed TotExp}\)
Now let’s plot the line of best fit of the data:
plot(LifeExp_4.6 ~ TotExp_.06, data=Exp_data)
abline(Exp_data_lm1, col = "cornflowerblue", lwd = 2)
The line of best fit emphasizes that there is a positive linear
relationship between the TotExp_.06 and LifeExp_4.6.
Let’s start by extracting more information about the linear model:
summary(Exp_data_lm1)
##
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_.06, data = Exp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
A F-statistic of 507.7 indicates that this simple linear model with the predictor variable, “Total Expenditures”, is statistically good; meaning that the explanatory variable, “Life Expectancy” explains a good amount of the variability of the predictor.
The Multiple R-squared indicates that 72.98% of the variability in life expectancy is explained by the variation in total expenditures.
The Standard error is 27518940 which is over 22 times smaller than the corresponding coefficient (620060216), so the model is a good model.
The p-value is very tiny (< 2.2e-16) which indicates that there is a strong evidence of linear association between life expectancy and total expenditures.
Now, let’s examine the residual plot to check the quality of this model.
plot(fitted(Exp_data_lm1),resid(Exp_data_lm1))
The residual plot has data points scattered around in the plot, that indicates that the predictor in the regression model sufficiently explain the data.
Let’s take a look at the Q-Q plot and see what it will reveal:
qqnorm(resid(Exp_data_lm1))
qqline(resid(Exp_data_lm1))
The points plotted in this figure seem to follow almost a straight line, that means that the residuals are close normally distributed.
# Set the size of the plotting device
options(repr.plot.width = 8, repr.plot.height = 8)
# Plot the regression diagnostics
plot(Exp_data_lm1)
I think the shape of the scale-location suggests that the variability of the residuals is constant.
The residuals are randomly scattered around zero with no clear patterns on the second model (with the transformed variables). In addition, the Q-Q plot for the second model seems to follow a straight line better than the first model, where there is a divergence on the right tail of the plot. However, based on the calculated RSS below, the first model has smaller RSS (measures the overall goodness of fit of the model by summing the squared differences between the observed and predicted values) which indicates that it is a better model.
# Get the residuals for each model
residuals <- residuals(Exp_data_lm)
residuals1 <- residuals(Exp_data_lm1)
# Calculate Residual Sum of Squares (RSS) for each residual
RSS <- sum(residuals^2)
RSS1 <- sum(residuals1^2)
RSS
## [1] 1.072912e+12
RSS1
## [1] 1.539508e+18
3. Using the results from 3, forecast life expectancy when \(TotExp^.06 =1.5\). Then forecast life expectancy when \(TotExp^.06=2.5\).
# Create a dataframe with the values of TotExp_0.06
new_data <- data.frame(TotExp_.06 = c(1.5, 2.5))
# Forecast life expectancy
life_expectancy_forecast <- predict(Exp_data_lm1, newdata = new_data)
# Print the forecasted life expectancy
print(life_expectancy_forecast)
## 1 2
## 193562414 813622630
For \(TotExp^0.06 = 1.5\), the forecasted life expectancy is approximately 193,562,414 years.
For \(TotExp^0.06 = 2.5\), the forecasted life expectancy is approximately 813,622,630 years.
4. Build the following multiple regression model and interpret the F Statistics, \(R^2\), standard error, and p-values. How good is the model? \(LifeExp = b_0+b_1 \times PropMd + b_2 \times TotExp +b_3 \times PropMD \times TotExp\)
# Fit the multiple regression model
Exp_data_lm2 <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = Exp_data)
# Summary of the model
summary(Exp_data_lm2)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = Exp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The equation for the multiple regression model is:
\(LifeExp = 62.77+ 1497 \times PropMd + 0.00007233 \times TotExp -0.006026 \times PropMD \times TotExp\)
The F-statistic is 34.49 which indicates that the regression model is good enough; the model explains a good amount of the variance in the dependent variable (LifeExp). That is emphasized by the tiny value of the p-value \(< 2.2\times 10^{-16}\)
The R-squared value is 0.3574, indicating that approximately 35.74% of the variance in LifeExp is explained by the independent variables (PropMD, TotExp, and their product).
The standard error also indicates that the model is a good model; it is low value (8.765).
So this model is really good; it is better than the previous models, let’s confirm with the residuals plot:
plot(fitted(Exp_data_lm2),resid(Exp_data_lm2))
The shape observed on the residuals plot indicates the variability of life expectancy (the dependent variable) is not consistent across different levels of the independent variables (PropMD, TotExp, and their product).
Let’s take a look at the Q-Q plot and see what it will reveal:
qqnorm(resid(Exp_data_lm2))
qqline(resid(Exp_data_lm2))
The slight downward curve observed in the Q-Q plot suggests potential deviations from normality in the residuals.
# Set the size of the plotting device
options(repr.plot.width = 8, repr.plot.height = 8)
# Plot the regression diagnostics
plot(Exp_data_lm2)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
The scale-location plot confirms the assumption that this model is good fit fails; a better model may be constructed for better predictions.
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
# Create a data frame with the forecasted values
forecast_data <- data.frame(PropMD = 0.03, TotExp = 14)
# Use the predict function to forecast LifeExp
forecasted_life_exp <- predict(Exp_data_lm2, newdata = forecast_data)
# Print the forecasted LifeExp value
print(forecasted_life_exp)
## 1
## 107.696
Let’s find the row that corresponds to those values PropMD=.03 and TotExp = 14 and compare it to the predicted value of the life expectancy;
# Find the row where PropMD = 0.03 and TotExp = 14
desired_row <- subset(Exp_data, PropMD == 0.03 & TotExp == 14)
# Print the desired row
print(desired_row)
## # A tibble: 0 × 12
## # ℹ 12 variables: Country <chr>, LifeExp <dbl>, InfantSurvival <dbl>,
## # Under5Survival <dbl>, TBFree <dbl>, PropMD <dbl>, PropRN <dbl>,
## # PersExp <dbl>, GovtExp <dbl>, TotExp <dbl>, LifeExp_4.6 <dbl>,
## # TotExp_.06 <dbl>
Since we cannot find those exact values, let’s find values that are closer in the data frame so we can compare the life expectancy and decide if they are reasonable:
# Set the tolerance level for the search
tolerance <- 1.0
# Find rows where PropMD is close to 0.03 and TotExp is close to 14
desired_row1 <- subset(Exp_data, abs(PropMD - 0.03) <= tolerance & abs(TotExp - 14) <= tolerance)
# Print the desired row
print(desired_row1)
## # A tibble: 1 × 12
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Burundi 49 0.891 0.819 0.993 0.0000245 1.65e-4 3
## # ℹ 4 more variables: GovtExp <dbl>, TotExp <dbl>, LifeExp_4.6 <dbl>,
## # TotExp_.06 <dbl>
Based on the data collected, PropMD = 0.0000245 and TotExp = 13, the corresponding value of LifeExp = 49, I think the value of LifeExp = 107.696 which corresponds to PropMD=.03 and TotExp = 14 is realistic.
In this analysis, we explored real-world healthcare expenditure data from 2008, focusing on life expectancy as the dependent variable. We conducted simple linear regression to understand the relationship between life expectancy and total expenditures, finding a strong association. However, diagnostic plots suggested that the assumptions of linear regression might not be sufficiently met.
To address this, we transformed the variables and reran the regression analysis, finding a stronger relationship between the transformed variables. The multiple regression model, incorporating additional predictors like PropMd, TotExp and their product showed improved performance, explaining a higher proportion of variance in life expectancy.
We used the models to forecast life expectancy under specific conditions, comparing the results to values in the dataset to assess realism. Overall, while the models provided valuable insights into the factors influencing life expectancy, more complex model may be needed to explain the model better.