Assignment 12

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

- Country: name of the country
- LifeExp: average life expectancy for the country in years
- InfantSurvival: proportion of those surviving to one year or more
- Under5Survival: proportion of those surviving to five years or more
- TBFree: proportion of the population without TB
- PropMD: proportion of the population who are MDs
- PropRN: proportion of the population who are RNs
- PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
- GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
- TotExp: sum of personal and government expenditures

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is "better?"

3. Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

4. Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

5. Forecast LifeExp when PropMD = .03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Getting started: Load libraries

library(readr)  # provides read_csv()
library(dplyr)  # provides mutate() and the %>% pipe

Importing and reading the dataset

data <- read_csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/main/who.csv")

## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df <- data
df

## # A tibble: 190 × 10
##    Country  LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##    <chr>      <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
##  1 Afghani…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
##  2 Albania       71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
##  3 Algeria       71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
##  4 Andorra       82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
##  5 Angola        41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
##  6 Antigua…      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
##  7 Argenti…      75          0.986          0.983  1.00  2.78e-3 7.41e-4     484
##  8 Armenia       69          0.979          0.976  0.999 3.70e-3 4.92e-3      88
##  9 Austral…      82          0.995          0.994  1.00  2.33e-3 9.15e-3    3181
## 10 Austria       80          0.996          0.996  1.00  3.61e-3 6.46e-3    3788
## # ℹ 180 more rows
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>

colnames(df)

##  [1] "Country"        "LifeExp"        "InfantSurvival" "Under5Survival"
##  [5] "TBFree"         "PropMD"         "PropRN"         "PersExp"       
##  [9] "GovtExp"        "TotExp"

Step 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

# Create a scatterplot of LifeExp vs. TotExp
plot(df$TotExp, df$LifeExp, xlab="Total Expenditures (TotExp)", ylab="Life Expectancy", main="Scatterplot of Life Expectancy vs. TotExp")

# Run simple linear regression
lm_model <- lm(LifeExp ~ TotExp, data=df)

Are the assumptions of simple linear regression met in this model?

The model shows a statistically significant relationship between TotExp and LifeExp: the F-statistic is 65.26 on 1 and 188 degrees of freedom, with p = 7.714e-14. However, the R-squared value indicates that only about 25.77% of the variability in LifeExp is explained by the linear relationship with TotExp, so other factors likely influence life expectancy as well. Significance alone says nothing about the assumptions, which are examined with the residual plots below.

# Summary of the linear regression model
summary(lm_model)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

INTERPRETING THE ESTIMATE COLUMN OR REGRESSION COEFFICIENT:
- The estimated intercept coefficient (Intercept) of 64.75 represents the predicted value of LifeExp when TotExp is equal to zero.
- The estimated coefficient of 6.297e-05 for TotExp indicates that for each one-dollar increase in TotExp, LifeExp is expected to increase by approximately 6.297e-05 years, or about 0.063 years per additional 1,000 dollars.

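INTERPRETING THE STANDARD ERROR COLUMN (STD):
- The standard error of the TotExp coefficient is 7.795e-06, which is small relative to the estimate of 6.297e-05 (roughly one eighth of it). The standard error of the intercept is 0.7535.
- The residual standard error, which measures the spread of the residuals around the regression line, is 9.371 on 188 degrees of freedom, so observed life expectancies typically deviate from the fitted line by about 9.4 years.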

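INTERPRETING THE t value COLUMN: Each t value is the estimate divided by its standard error (85.933 for the intercept and 8.079 for TotExp). Values this far from zero indicate that the estimates would be very unlikely to occur by chance if the true coefficients were zero.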

INTERPRETING THE Pr(>|t|) COLUMN/SIGNIFICANCE: Both the intercept and the TotExp coefficient are statistically significant (p < 0.001), as indicated by the '***' symbols. This is strong evidence to reject the null hypothesis that these coefficients are equal to zero.
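The quantities in the summary are linked, which gives a quick consistency check. In a simple linear regression the F-statistic equals the square of the slope's t value, the overall p-value is the upper tail of the F distribution, and R-squared can be recovered from F and the degrees of freedom. A minimal sketch, with the numbers copied by hand from the summary output above (the variable names are illustrative only):

```r
# Values copied from the summary(lm_model) output above
t_slope <- 8.079      # t value for TotExp
f_stat  <- 65.26      # F-statistic
df1     <- 1          # numerator degrees of freedom
df2     <- 188        # denominator degrees of freedom

# With one predictor, F is the square of the slope's t value
f_from_t <- t_slope^2

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# R-squared recovered from F and the degrees of freedom
r_squared <- (f_stat * df1) / (f_stat * df1 + df2)

print(c(f_from_t = f_from_t, p_value = p_value, r_squared = r_squared))
```

Small differences from the printed summary are expected because the inputs are rounded.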

# Residuals analysis
# Reuse the simple regression model fitted above
model.lm <- lm_model
# Set up the plot layout (to plot one graph at a time)
par(mfrow=c(1,1))
# Plot diagnostics for the linear regression model
plot(model.lm)

The residual plots indicate that the assumptions of simple linear regression (linearity, independence, normality of the residuals, and homoscedasticity) are not met in this model. The Q-Q plot shows that the residuals are not normally distributed. The residuals vs. fitted plot shows that the spread of the residuals changes systematically with the fitted values, which indicates heteroscedasticity, a violation of the constant-variance assumption.
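The visual reading of the diagnostic plots can be supplemented with simple numeric checks. The sketch below is only an illustration on simulated stand-in data (not who.csv; the data, seed, and variable names are assumptions): it fits a straight line to a curved relationship, runs a Shapiro-Wilk test on the residuals, and compares the residual spread in the lower and upper halves of the fitted values.

```r
# Simulated stand-in data (hypothetical, not who.csv): a curved relationship
set.seed(42)
x <- exp(runif(150, log(10), log(5000)))
y <- 40 + 5 * log(x) + rnorm(150, sd = 2)

fit <- lm(y ~ x)   # straight-line fit to curved data

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(residuals(fit))
print(sw$p.value)

# Rough check of constant variance: compare residual spread in the lower
# and upper halves of the fitted values
lower_half <- fitted(fit) <= median(fitted(fit))
spread <- c(lower = sd(residuals(fit)[lower_half]),
            upper = sd(residuals(fit)[!lower_half]))
print(spread)
```

The same two checks could be applied to the model fitted on the WHO data; a formal alternative for the variance check would be a dedicated heteroscedasticity test from an add-on package.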

Step 2

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is "better?"

# Perform transformations
df_transformed <- df %>%
  mutate(LifeExp_transformed = LifeExp^4.6,
         TotExp_transformed = TotExp^0.06)

# Plot the transformed variables
plot(df_transformed$TotExp_transformed, df_transformed$LifeExp_transformed,
     xlab = "Total Expenditures (Transformed)", ylab = "Life Expectancy (Transformed)",
     main = "Transformed Variables: Life Expectancy vs. Total Expenditures")

# Run simple linear regression with transformed variables
lm_model_transformed <- lm(LifeExp_transformed ~ TotExp_transformed, data = df_transformed)

# Summary of the linear regression model with transformed variables
summary(lm_model_transformed)

## 
## Call:
## lm(formula = LifeExp_transformed ~ TotExp_transformed, data = df_transformed)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -736527910   46817945  -15.73   <2e-16 ***
## TotExp_transformed  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
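Compared with Step 1, the transformed model has a much larger F-statistic (507.7 vs. 65.26, both with p-values far below 0.001) and a higher R-squared (0.7298 vs. 0.2577), so on those measures it fits better. The residual standard errors (90490000 vs. 9.371) are not directly comparable, because the transformed response is in units of years^4.6 rather than years. One way to put both models on the same footing is to back-transform the fitted values and compare errors in years. The following is a minimal sketch on simulated stand-in data (not who.csv; the data, seed, and names such as `sim` and `rmse` are assumptions for illustration):

```r
# Simulated stand-in data (hypothetical, not who.csv), built so that
# life^4.6 is roughly linear in tot^0.06
set.seed(1)
n    <- 200
tot  <- exp(runif(n, log(10), log(10000)))
life <- (-5e8 + 6e8 * tot^0.06 + rnorm(n, sd = 2e7))^(1 / 4.6)
sim  <- data.frame(life = life, tot = tot)

fit_raw   <- lm(life ~ tot, data = sim)
fit_trans <- lm(I(life^4.6) ~ I(tot^0.06), data = sim)

# Back-transform the transformed model's fitted values to years
pred_raw   <- fitted(fit_raw)
pred_trans <- fitted(fit_trans)^(1 / 4.6)

# Root mean squared error of each model, both measured in years
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
print(c(raw = rmse(sim$life, pred_raw),
        transformed = rmse(sim$life, pred_trans)))
```

Applied to the WHO data, the same comparison would use lm_model and lm_model_transformed in place of the simulated fits.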

Step 3

Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

# Life expectancy when TotExp^.06 = 1.5
# The transformed model expects its predictor under the name TotExp_transformed
new_data1 <- data.frame(TotExp_transformed = 1.5)

# Forecast LifeExp^4.6, then undo the 4.6 power to get life expectancy in years
lm_new1 <- predict(lm_model_transformed, newdata = new_data1)^(1/4.6)

# Print the forecast
print(lm_new1)

##        1 
## 63.31153

# Life expectancy when TotExp^.06 = 2.5
new_data2 <- data.frame(TotExp_transformed = 2.5)

# Forecast LifeExp^4.6, then undo the 4.6 power to get life expectancy in years
lm_new2 <- predict(lm_model_transformed, newdata = new_data2)^(1/4.6)

# Print the forecast
print(lm_new2)

##        1 
## 86.50645
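Because Step 2 models LifeExp^4.6 as a function of TotExp^.06, a forecast in years needs the transformed-model coefficients followed by a back-transformation with the power 1/4.6. A hand check under that reading, using the coefficients printed in the Step 2 summary (the helper name `forecast_life_exp` is hypothetical), gives roughly 63.3 and 86.5 years:

```r
# Coefficients copied from the summary(lm_model_transformed) output in Step 2
b0 <- -736527910   # intercept
b1 <-  620060216   # slope for TotExp^0.06

# Hypothetical helper: predict on the transformed scale, then undo the 4.6 power
forecast_life_exp <- function(tot_exp_06) {
  (b0 + b1 * tot_exp_06)^(1 / 4.6)
}

print(forecast_life_exp(c(1.5, 2.5)))
```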

Step 4

Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

 # Fit the multiple regression model with corrected variable names
lm_model2 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = df)

lm_model2

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = df)
## 
## Coefficients:
##   (Intercept)         PropMD         TotExp  PropMD:TotExp  
##     6.277e+01      1.497e+03      7.233e-05     -6.026e-03

# Summary of the multiple regression model
summary(lm_model2)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Interpretation of the Linear Model Summary:

The overall model is significant (F-statistic 34.49 on 3 and 186 degrees of freedom, p < 2.2e-16), and each term is individually significant: PropMD (p = 2.32e-07), TotExp (p = 9.39e-14), and the PropMD:TotExp interaction (p = 6.35e-05). In terms of estimates, the regression model shows:
- an intercept of 62.77 (standard error 0.7956), the predicted LifeExp when both PropMD and TotExp are equal to zero;
- a PropMD coefficient of 1497 (standard error 278.8): when TotExp is zero, an increase of 0.001 in PropMD is associated with an increase of about 1.5 years in LifeExp;
- a TotExp coefficient of 7.233e-05 (standard error 8.982e-06): when PropMD is zero, each additional dollar of TotExp is associated with an increase of 7.233e-05 years in LifeExp;
- an interaction coefficient of -6.026e-03 (standard error 1.472e-03): the effect of each predictor shrinks as the other one increases.

The very low values in the Pr(>|t|) column mean that the null hypothesis of a zero coefficient is rejected for every term, since estimates this large relative to their standard errors would be very unlikely if the true coefficients were zero.

Lastly, the residual standard error is 8.765 years, and the multiple R-squared of 0.3574 (adjusted 0.3471) indicates that roughly 35% of the variability in LifeExp is explained by the combination of PropMD, TotExp and PropMD:TotExp.
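With an interaction term in the model, the slope for PropMD is not a single number: it equals b1 + b3 x TotExp and therefore depends on the spending level. A small sketch using the coefficients printed in the summary above (the helper name `propmd_slope` and the example spending levels are assumptions chosen for illustration):

```r
# Coefficients copied from the summary(lm_model2) output above
b_propmd      <- 1.497e+03
b_interaction <- -6.026e-03

# Hypothetical helper: slope of LifeExp with respect to PropMD at a given TotExp
propmd_slope <- function(tot_exp) b_propmd + b_interaction * tot_exp

# Slope at a few illustrative spending levels (values chosen arbitrarily)
print(propmd_slope(c(0, 1000, 100000)))

# TotExp at which the fitted PropMD slope crosses zero
print(-b_propmd / b_interaction)
```

The negative interaction means the fitted benefit of additional physicians declines as total expenditure rises.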

Model evaluation:

This regression model does not seem appropriate. It explains only about 36% of the variability in LifeExp, and it violates the assumptions of linear regression, notably the assumption of homoscedasticity. Also, the median residual (2.098) is not close to zero and the residuals are not spread symmetrically around it: the minimum is -27.320 and the maximum is 13.074.

# Residuals analysis
# Reuse the multiple regression model fitted above
model.lm2 <- lm_model2
# Set up the plot layout (to plot one graph at a time)
par(mfrow=c(1,1))
# Plot diagnostics for the linear regression model
plot(model.lm2)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Step 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# The multiple regression model from Step 4 is stored in lm_model2

# Create a new data frame with the values of PropMD and TotExp
new_data <- data.frame(PropMD = 0.03, TotExp = 14)

# Predict LifeExp using the multiple regression model
predicted_lifeExp <- predict(lm_model2, newdata = new_data)

# Print the predicted LifeExp
print(predicted_lifeExp)

With the coefficients reported in Step 4, this forecast is approximately 107.7 years.

Does this forecast seem realistic?

No. A national average life expectancy of about 107.7 years is not plausible; the largest value in the rows previewed above is 82 years. The forecast is an extrapolation: PropMD = 0.03 means that 3% of the population are physicians, about eight times the largest PropMD in the previewed rows (3.70e-3), while TotExp = 14 is lower than the smallest personal expenditure alone in those rows (20). Such a combination is unlike the observed countries, and the large PropMD coefficient drives the result: 1497 x 0.03 adds about 45 years to the intercept of 62.77.
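The forecast from the Step 4 model can be checked by hand by plugging PropMD = 0.03 and TotExp = 14 into the fitted equation. The sketch below uses the rounded coefficients printed in the Step 4 summary, so the result may differ slightly in the later decimals from what predict() returns with the full-precision coefficients:

```r
# Rounded coefficients copied from the summary(lm_model2) output in Step 4
b0 <- 6.277e+01    # intercept
b1 <- 1.497e+03    # PropMD
b2 <- 7.233e-05    # TotExp
b3 <- -6.026e-03   # PropMD:TotExp

prop_md <- 0.03
tot_exp <- 14

# LifeExp = b0 + b1*PropMD + b2*TotExp + b3*PropMD*TotExp
pred <- b0 + b1 * prop_md + b2 * tot_exp + b3 * prop_md * tot_exp
print(pred)
```

Almost all of the increase over the intercept comes from the PropMD term (1497 x 0.03), which is why an unusually high PropMD produces an implausible life expectancy.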

# Check whether the new point lies inside the range of the observed data
range(df$PropMD)
range(df$TotExp)

# Prediction interval for the forecast from the multiple regression model
predict(lm_model2, newdata = new_data, interval = "prediction")