SDaouki_Assignment12

# Load the necessary packages 
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## corrplot 0.92 loaded

library(ggplot2)

Loading the data into R from my Github

Exp_data <- read_csv("https://raw.githubusercontent.com/SalouaDaouki/Data605/main/Exp_data.csv")

## Rows: 190 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Country
## dbl (11): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, P...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

str(Exp_data)

## spc_tbl_ [190 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country       : chr [1:190] "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ LifeExp       : num [1:190] 42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num [1:190] 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num [1:190] 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num [1:190] 0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num [1:190] 2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num [1:190] 0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : num [1:190] 20 169 108 2589 36 ...
##  $ GovtExp       : num [1:190] 92 3128 5184 169725 1620 ...
##  $ TotExp        : num [1:190] 112 3297 5292 172314 1656 ...
##  $ LifeExp_4.6   : num [1:190] 2.93e+07 3.28e+08 3.28e+08 6.36e+08 2.62e+07 ...
##  $ TotExp_.06    : num [1:190] 1.33 1.63 1.67 2.06 1.56 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   LifeExp = col_double(),
##   ..   InfantSurvival = col_double(),
##   ..   Under5Survival = col_double(),
##   ..   TBFree = col_double(),
##   ..   PropMD = col_double(),
##   ..   PropRN = col_double(),
##   ..   PersExp = col_double(),
##   ..   GovtExp = col_double(),
##   ..   TotExp = col_double(),
##   ..   LifeExp_4.6 = col_double(),
##   ..   TotExp_.06 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(Exp_data)

## # A tibble: 6 × 12
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghanis…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania        71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria        71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra        82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
## 5 Angola         41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigua …      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ℹ 4 more variables: GovtExp <dbl>, TotExp <dbl>, LifeExp_4.6 <dbl>,
## #   TotExp_.06 <dbl>

tail(Exp_data)

## # A tibble: 6 × 12
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Vanuatu        69          0.97           0.964  0.999 1.36e-4 1.63e-3      67
## 2 Venezuel…      74          0.982          0.979  0.999 1.77e-3 1.03e-3     247
## 3 Viet Nam       72          0.985          0.983  0.998 5.22e-4 7.17e-4      37
## 4 Yemen          61          0.925          0.9    0.999 3.10e-4 6.33e-4      39
## 5 Zambia         43          0.898          0.818  0.994 1.08e-4 1.88e-3      36
## 6 Zimbabwe       43          0.945          0.915  0.994 1.58e-4 7.07e-4      21
## # ℹ 4 more variables: GovtExp <dbl>, TotExp <dbl>, LifeExp_4.6 <dbl>,
## #   TotExp_.06 <dbl>

summary(Exp_data)

##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##     GovtExp             TotExp        LifeExp_4.6          TotExp_.06   
##  Min.   :    10.0   Min.   :    13   Min.   : 23414019   Min.   :1.166  
##  1st Qu.:   559.5   1st Qu.:   584   1st Qu.:166291095   1st Qu.:1.465  
##  Median :  5385.0   Median :  5541   Median :307221061   Median :1.677  
##  Mean   : 40953.5   Mean   : 41696   Mean   :307957212   Mean   :1.684  
##  3rd Qu.: 25680.2   3rd Qu.: 26331   3rd Qu.:421970229   3rd Qu.:1.842  
##  Max.   :476420.0   Max.   :482750   Max.   :672603658   Max.   :2.193

# Check for missing values
sum(is.na(Exp_data))

## [1] 0

No missing values.

Visualizing the data:

Now, I will provide a scatterplot of LifeExp~TotExp

ggplot(Exp_data, aes(x = LifeExp, y = TotExp)) +
  geom_point() +
  labs(title = "Life expectancy vs Total expenditures",
       x = "Life Expectancy",
       y = "Total expenditures")

The scatter plot seems exponential, that may suggests that there is nonlinear association between the “Life Expectancy” and “Total Expenditures”.

Simple Linear Regression:

Exp_data_lm <- lm(TotExp ~ LifeExp, data = Exp_data)
Exp_data_lm

## 
## Call:
## lm(formula = TotExp ~ LifeExp, data = Exp_data)
## 
## Coefficients:
## (Intercept)      LifeExp  
##     -234039         4092

$-234039$ is the y-intercept of the line of the best fit, and $4092$ is the rate of change or the slope of the line. The regression equation for this model - to predict Total Expenditures based on the value of Life Expectancy is -

$\text{Total Expenditures} = −234039+4092 \times \text{Life Expectancy}$

Now let’s plot the line of best fit of the data:

plot(TotExp ~ LifeExp, data=Exp_data)
abline(Exp_data_lm, col = "chartreuse", lwd = 2)

Evaluating the quality of the linear model:

Let’s start by extracting more information about the linear model:

summary(Exp_data_lm)

## 
## Call:
## lm(formula = TotExp ~ LifeExp, data = Exp_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70785 -47221 -25979  28684 389406 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -234038.6    34568.5  -6.770 1.59e-10 ***
## LifeExp        4092.3      506.6   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75540 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

A F-statistic of 65.26 indicates that this simple linear model with the predictor variable, “Life Expectancy”, is statistically good compared to a model with no predictors; intercept-only model.
The Multiple R-squared indicates that 25.77% of the variability in total expenditures is explained by the variation in life expectancy.
The Standard error is 506.6 which is 8 times smaller than the corresponding coefficient (4092.3), so the model is a little bit good.
The p-value is very small which indicates that there is a strong evidence of linear association between life expectancy and total expenditures.

Now, let’s examine the residual plot to check the quality of this model.

Residual Plot:

plot(fitted(Exp_data_lm),resid(Exp_data_lm))

The residual plot is decreasing in a linear pattern followed by an exponential increasing pattern. This indicates that we need a better model; it suggests that the assumptions of simple linear regression might not be fully met.

Let’s take a look at the Q-Q plot and see what it will reveal:

qqnorm(resid(Exp_data_lm))
qqline(resid(Exp_data_lm))

The shape of the Q-Q plot shows a deviation from the right-side tail, which suggests that the residuals are not normally distributed. It also indicates that the residuals have heavier tail or are more skewed to the right than a normal distribution.

# Set the size of the plotting device
options(repr.plot.width = 8, repr.plot.height = 8)  

# Plot the regression diagnostics
plot(Exp_data_lm)

The scale-location plot shows a v-like pattern plus the cluster on the right side of the plot, I think this means that there is no consistency in the variability of the predictor; which may suggests that this regression analysis is not valid and we need a better model.

2. Raise life expectancy to the 4.6 power (i.e., $LifeExp^{4.6}$). Raise total expenditures to the 0.06 power (nearly a log transform, $TotExp^{.06}$). Plot $LifeExp^{4.6}$ as a function of $TotExp^{.06}$, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, $R^2$, standard error, and p-values. Which model is “better?”

First, let’s create a new column that has the values of $LifeExp^{4.6}$ and another column that has the values of $TotExp^{.06}$: **(the two columns already exist in the data that I loaded from my Github, that is because when I am knitting the document, the link that was provided on blackboard produces an error. So i had to download the Exp_data into my local machine from R, upload it to my Github than…)

This is the code I used originally to add the two columns:

{Exp_data$LifeExp_4.6 <- Exp_data$LifeExp ^ 4.6 Exp_data$TotExp_.06 <- Exp_data$TotExp ^ 0.06}

Now let’s Plot $LifeExp^{4.6}$ as a function of $TotExp^{.06}$, meaning that the $LifeExp^{4.6}$ is going to be on the y-axis and the $TotExp^{.06}$ is going to be on the x-axis:

ggplot(Exp_data, aes(x = TotExp_.06, y = LifeExp_4.6)) +
  geom_point() +
  labs(title = "Transformed Life expectancy vs Transformed Total expenditures",
       x = "Transformed Total expenditures",
       y = "Transformed Life Expectancy")

The scatter plot shows that there is a linear, positive correlation between the transformed variables.

Simple Linear Regression for the transformed variables:

Exp_data_lm1 <- lm(LifeExp_4.6 ~ TotExp_.06, data = Exp_data)
Exp_data_lm1

## 
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_.06, data = Exp_data)
## 
## Coefficients:
## (Intercept)   TotExp_.06  
##  -736527909    620060216

$-736527909$ is the y-intercept of the line of the best fit, and $620060216 $ is the rate of change or the slope of the line (both values are very big). The regression equation for this model - to predict transformed life expectancy based on the value of the transformed total expenditures is -

$\text{transformed LifeExp} = -736527909+620060216 \times \text{transformed TotExp}$

Now let’s plot the line of best fit of the data:

plot(LifeExp_4.6 ~ TotExp_.06, data=Exp_data)
abline(Exp_data_lm1, col = "cornflowerblue", lwd = 2)

The line of best fit emphasizes that there is a positive linear relationship between the TotExp_.06 and LifeExp_4.6.

Evaluating the quality of the linear model:

Let’s start by extracting more information about the linear model:

summary(Exp_data_lm1)

## 
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_.06, data = Exp_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_.06   620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

A F-statistic of 507.7 indicates that this simple linear model with the predictor variable, “Total Expenditures”, is statistically good; meaning that the explanatory variable, “Life Expectancy” explains a good amount of the variability of the predictor.
The Multiple R-squared indicates that 72.98% of the variability in life expectancy is explained by the variation in total expenditures.
The Standard error is 27518940 which is over 22 times smaller than the corresponding coefficient (620060216), so the model is a good model.
The p-value is very tiny (< 2.2e-16) which indicates that there is a strong evidence of linear association between life expectancy and total expenditures.

Now, let’s examine the residual plot to check the quality of this model.

Residual Plot:

plot(fitted(Exp_data_lm1),resid(Exp_data_lm1))

The residual plot has data points scattered around in the plot, that indicates that the predictor in the regression model sufficiently explain the data.

Let’s take a look at the Q-Q plot and see what it will reveal:

qqnorm(resid(Exp_data_lm1))
qqline(resid(Exp_data_lm1))

The points plotted in this figure seem to follow almost a straight line, that means that the residuals are close normally distributed.

# Set the size of the plotting device
options(repr.plot.width = 8, repr.plot.height = 8)  

# Plot the regression diagnostics
plot(Exp_data_lm1)

I think the shape of the scale-location suggests that the variability of the residuals is constant.

The residuals are randomly scattered around zero with no clear patterns on the second model (with the transformed variables). In addition, the Q-Q plot for the second model seems to follow a straight line better than the first model, where there is a divergence on the right tail of the plot. However, based on the calculated RSS below, the first model has smaller RSS (measures the overall goodness of fit of the model by summing the squared differences between the observed and predicted values) which indicates that it is a better model.

# Get the residuals for each model
residuals <- residuals(Exp_data_lm)
residuals1 <- residuals(Exp_data_lm1)

# Calculate Residual Sum of Squares (RSS) for each residual
RSS <- sum(residuals^2)
RSS1 <- sum(residuals1^2)
RSS

## [1] 1.072912e+12

RSS1

## [1] 1.539508e+18

3. Using the results from 3, forecast life expectancy when $TotExp^.06 =1.5$. Then forecast life expectancy when $TotExp^.06=2.5$.

# Create a dataframe with the values of TotExp_0.06
new_data <- data.frame(TotExp_.06 = c(1.5, 2.5))

# Forecast life expectancy
life_expectancy_forecast <- predict(Exp_data_lm1, newdata = new_data)

# Print the forecasted life expectancy
print(life_expectancy_forecast)

##         1         2 
## 193562414 813622630

For $TotExp^0.06 = 1.5$, the forecasted life expectancy is approximately 193,562,414 years.
For $TotExp^0.06 = 2.5$, the forecasted life expectancy is approximately 813,622,630 years.

4. Build the following multiple regression model and interpret the F Statistics, $R^2$, standard error, and p-values. How good is the model? $LifeExp = b_0+b_1 \times PropMd + b_2 \times TotExp +b_3 \times PropMD \times TotExp$

# Fit the multiple regression model
Exp_data_lm2 <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = Exp_data)

# Summary of the model
summary(Exp_data_lm2)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = Exp_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The equation for the multiple regression model is:

$LifeExp = 62.77+ 1497 \times PropMd + 0.00007233 \times TotExp -0.006026 \times PropMD \times TotExp$

The F-statistic is 34.49 which indicates that the regression model is good enough; the model explains a good amount of the variance in the dependent variable (LifeExp). That is emphasized by the tiny value of the p-value $< 2.2\times 10^{-16}$
The R-squared value is 0.3574, indicating that approximately 35.74% of the variance in LifeExp is explained by the independent variables (PropMD, TotExp, and their product).
The standard error also indicates that the model is a good model; it is low value (8.765).

So this model is really good; it is better than the previous models, let’s confirm with the residuals plot:

Residual Plot:

plot(fitted(Exp_data_lm2),resid(Exp_data_lm2))

The shape observed on the residuals plot indicates the variability of life expectancy (the dependent variable) is not consistent across different levels of the independent variables (PropMD, TotExp, and their product).

Let’s take a look at the Q-Q plot and see what it will reveal:

qqnorm(resid(Exp_data_lm2))
qqline(resid(Exp_data_lm2))

The slight downward curve observed in the Q-Q plot suggests potential deviations from normality in the residuals.

# Set the size of the plotting device
options(repr.plot.width = 8, repr.plot.height = 8)  

# Plot the regression diagnostics
plot(Exp_data_lm2)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

The scale-location plot confirms the assumption that this model is good fit fails; a better model may be constructed for better predictions.

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# Create a data frame with the forecasted values
forecast_data <- data.frame(PropMD = 0.03, TotExp = 14)

# Use the predict function to forecast LifeExp
forecasted_life_exp <- predict(Exp_data_lm2, newdata = forecast_data)

# Print the forecasted LifeExp value
print(forecasted_life_exp)

##       1 
## 107.696

Let’s find the row that corresponds to those values PropMD=.03 and TotExp = 14 and compare it to the predicted value of the life expectancy;

# Find the row where PropMD = 0.03 and TotExp = 14
desired_row <- subset(Exp_data, PropMD == 0.03 & TotExp == 14)

# Print the desired row
print(desired_row)

## # A tibble: 0 × 12
## # ℹ 12 variables: Country <chr>, LifeExp <dbl>, InfantSurvival <dbl>,
## #   Under5Survival <dbl>, TBFree <dbl>, PropMD <dbl>, PropRN <dbl>,
## #   PersExp <dbl>, GovtExp <dbl>, TotExp <dbl>, LifeExp_4.6 <dbl>,
## #   TotExp_.06 <dbl>

Since we cannot find those exact values, let’s find values that are closer in the data frame so we can compare the life expectancy and decide if they are reasonable:

# Set the tolerance level for the search
tolerance <- 1.0 

# Find rows where PropMD is close to 0.03 and TotExp is close to 14
desired_row1 <- subset(Exp_data, abs(PropMD - 0.03) <= tolerance & abs(TotExp - 14) <= tolerance)

# Print the desired row
print(desired_row1)

## # A tibble: 1 × 12
##   Country LifeExp InfantSurvival Under5Survival TBFree    PropMD  PropRN PersExp
##   <chr>     <dbl>          <dbl>          <dbl>  <dbl>     <dbl>   <dbl>   <dbl>
## 1 Burundi      49          0.891          0.819  0.993 0.0000245 1.65e-4       3
## # ℹ 4 more variables: GovtExp <dbl>, TotExp <dbl>, LifeExp_4.6 <dbl>,
## #   TotExp_.06 <dbl>

Based on the data collected, PropMD = 0.0000245 and TotExp = 13, the corresponding value of LifeExp = 49, I think the value of LifeExp = 107.696 which corresponds to PropMD=.03 and TotExp = 14 is realistic.

Summary:

In this analysis, we explored real-world healthcare expenditure data from 2008, focusing on life expectancy as the dependent variable. We conducted simple linear regression to understand the relationship between life expectancy and total expenditures, finding a strong association. However, diagnostic plots suggested that the assumptions of linear regression might not be sufficiently met.

To address this, we transformed the variables and reran the regression analysis, finding a stronger relationship between the transformed variables. The multiple regression model, incorporating additional predictors like PropMd, TotExp and their product showed improved performance, explaining a higher proportion of variance in life expectancy.

We used the models to forecast life expectancy under specific conditions, comparing the results to values in the dataset to assess realism. Overall, while the models provided valuable insights into the factors influencing life expectancy, more complex model may be needed to explain the model better.

SDaouki_Assignment12

Saloua Daouki

2024-04-12

Loading the data into R from my Github

Visualizing the data:

Simple Linear Regression:

Evaluating the quality of the linear model:

Residual Plot:

Simple Linear Regression for the transformed variables:

Evaluating the quality of the linear model:

Residual Plot:

Residual Plot:

Summary: