Assignment Week 12:

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

  - Country: name of the country
  - LifeExp: average life expectancy for the country in years
  - InfantSurvival: proportion of those surviving to one year or more
  - Under5Survival: proportion of those surviving to five years or more
  - TBFree: proportion of the population without TB
  - PropMD: proportion of the population who are MDs
  - PropRN: proportion of the population who are RNs
  - PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
  - GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
  - TotExp: sum of personal and government expenditures

  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

  2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

  3. Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

  4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

  5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Answer 1:

\[\hat{y} = a_{0} + a_{1}x_{1} + a_{2}x_{2} + \dots + a_{k}x_{k}\]

Loading the Libraries:

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)

Fetching CSV File from GitHub:

urlfile<- "https://raw.githubusercontent.com/uzmabb182/Data605_Assignment/main/who.csv"

who_data<-read_csv(url(urlfile))
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(who_data)
## # A tibble: 6 × 10
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghanis…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania        71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria        71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra        82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
## 5 Angola         41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigua …      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>
nrow(who_data)
## [1] 190
ncol(who_data)
## [1] 10
str(who_data)
## spc_tbl_ [190 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country       : chr [1:190] "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ LifeExp       : num [1:190] 42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num [1:190] 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num [1:190] 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num [1:190] 0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num [1:190] 2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num [1:190] 0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : num [1:190] 20 169 108 2589 36 ...
##  $ GovtExp       : num [1:190] 92 3128 5184 169725 1620 ...
##  $ TotExp        : num [1:190] 112 3297 5292 172314 1656 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   LifeExp = col_double(),
##   ..   InfantSurvival = col_double(),
##   ..   Under5Survival = col_double(),
##   ..   TBFree = col_double(),
##   ..   PropMD = col_double(),
##   ..   PropRN = col_double(),
##   ..   PersExp = col_double(),
##   ..   GovtExp = col_double(),
##   ..   TotExp = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Scatterplot:

who_data.lm <- lm(LifeExp ~ TotExp, who_data)
ggplot(data = who_data, aes(x = TotExp, y = LifeExp)) + 
        geom_point(color='blue') +
        geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Simple Linear Regression:

summary(who_data.lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Interpretation of Coefficients:

The intercept term indicates that when Total Expenditure (TotExp) is zero, the estimated Life Expectancy (LifeExp) is approximately 64.75 years.

The coefficient for TotExp is 6.297e-05: for each one-unit increase in Total Expenditure, Life Expectancy is estimated to increase by approximately 0.00006297 years (about 0.023 days, or roughly 33 minutes).
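To put the slope on a more readable scale, the following sketch converts it from years per unit of TotExp into days and minutes, and rescales it to a 1,000-unit increase. The variable names are illustrative, the slope is copied from the summary output above, and a year is assumed to be 365.25 days.

```r
# Slope copied from the summary output above (years per unit of TotExp)
slope_years <- 6.297e-05

# Assumption: 1 year = 365.25 days
slope_days    <- slope_years * 365.25
slope_minutes <- slope_days * 24 * 60

# Effect of a 1,000-unit increase in TotExp, in years
slope_per_1000 <- slope_years * 1000

print(c(days = slope_days, minutes = slope_minutes, per_1000 = slope_per_1000))
```

Read this way, a 1,000-unit increase in TotExp corresponds to roughly 0.063 additional years under the fitted line.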

Significance of Coefficients:

Both the intercept and the coefficient for TotExp are statistically significant, as indicated by their low p-values (p < 0.001). This suggests that there is a significant linear relationship between Total Expenditure and Life Expectancy.

Overall Fit of the Model:

The multiple R-squared value is 0.2577, showing that approximately 25.77% of the variability in Life Expectancy can be explained by Total Expenditure.

The adjusted R-squared value, which adjusts for the number of predictors in the model, is slightly lower at 0.2537.

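The F-statistic (65.26 on 1 and 188 degrees of freedom) tests the overall significance of the model. Its very low p-value (7.714e-14) suggests that the model as a whole is statistically significant.

The residual standard error is 9.371 on 188 degrees of freedom, meaning observed life expectancies typically deviate from the fitted line by roughly 9 years. The standard error of the TotExp slope is 7.795e-06.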
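For a linear model with an intercept, the F-statistic and the adjusted R-squared can both be recovered from R-squared, the number of observations n, and the number of predictors p. The sketch below checks these identities against the printed output. The function names are hypothetical helpers, not part of any package, and the inputs are the rounded values from the summary, so agreement is only to about three significant figures.

```r
# Hypothetical helper: overall F-statistic from R-squared
f_from_r2 <- function(r2, n, p) {
  (r2 / p) / ((1 - r2) / (n - p - 1))
}

# Hypothetical helper: adjusted R-squared from R-squared
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n  <- 190      # rows in who.csv
p  <- 1        # one predictor (TotExp)
r2 <- 0.2577   # multiple R-squared as printed

f_value <- f_from_r2(r2, n, p)
adj     <- adj_r2(r2, n, p)

# Upper-tail probability of the F distribution gives the model p-value
p_value <- pf(f_value, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(c(F = f_value, adj_r2 = adj, p_value = p_value))
```

The same helpers can be applied to the later models by changing r2 and p.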

Assumptions of Simple Linear Regression:

Linearity: A significant slope does not by itself establish linearity; that has to be judged from the scatterplot and from a plot of residuals against fitted values. The low R-squared and the lopsided residual quantiles (median 3.154, minimum -24.764, maximum 13.292) hint that a straight line may not describe the relationship well.

Independence of Residuals: Each row is a different country, but the output gives no direct evidence on independence. It is usually judged from how the data were collected, with a residual plot used to look for leftover patterns.

Homoscedasticity: Without a plot of residuals against fitted values, it is hard to assess whether the spread of residuals is constant across different levels of Total Expenditure.

Normality of Residuals: The residual standard error measures spread, not shape, so it says nothing about normality. The residual quantiles are asymmetric (the median is 3.154 rather than near zero, and the lower tail reaches -24.764 against an upper tail of 13.292), which points toward left-skewed residuals. A Q-Q plot or a formal test is needed to confirm this.

Summary:

Although the coefficient estimates and overall model fit suggest that the linear regression model is statistically significant and explains a portion of the variability in Life Expectancy, further analysis is needed to assess whether all the assumptions of simple linear regression are met, specifically regarding the independence, homoscedasticity, and normality of residuals.
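The residual checks mentioned above could be carried out along the following lines. This is a minimal sketch: check_lm_assumptions is a hypothetical helper, and the demonstration uses simulated data (not the WHO data) so that the block runs on its own.

```r
# Hypothetical helper: basic residual diagnostics for a fitted lm object
check_lm_assumptions <- function(model) {
  res <- residuals(model)
  fit <- fitted(model)

  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))

  # Curvature or a funnel shape here suggests non-linearity or unequal variance
  plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)

  # Points far from the line suggest non-normal residuals
  qqnorm(res)
  qqline(res)

  # Shapiro-Wilk test of normality (small p-value: evidence against normality)
  shapiro.test(res)
}

# Simulated, deliberately curved data for illustration only
set.seed(1)
x <- runif(100, 1, 100)
y <- 40 + 8 * log(x) + rnorm(100, sd = 2)
demo_fit <- lm(y ~ x)

result <- check_lm_assumptions(demo_fit)
print(result$p.value)
```

With the objects from this report, the call would be check_lm_assumptions(who_data.lm). Base R's plot(who_data.lm) produces a similar set of diagnostic plots.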

Answer 2:

Re-run the Simple Regression Model using the Transformed Variables:

who <- who_data[,c(2,10)]
lifeExp_4.6 <- who$LifeExp^4.6
totExp_0.06 <- who$TotExp^0.06

who2_lm <- lm(lifeExp_4.6~totExp_0.06)

summary(who2_lm)
## 
## Call:
## lm(formula = lifeExp_4.6 ~ totExp_0.06)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## totExp_0.06  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Plotting the Model:

plot(lifeExp_4.6 ~  totExp_0.06)
abline(who2_lm)

Interpretation of Coefficients:

The intercept term suggests that when the transformed Total Expenditure (totExp_0.06) is zero, the estimated transformed Life Expectancy (lifeExp_4.6) is approximately -736,527,910. This value has no practical meaning, because LifeExp^4.6 cannot be negative.

The coefficient for totExp_0.06 is 620,060,216: for each one-unit increase in TotExp^0.06, LifeExp^4.6 is estimated to increase by approximately 620,060,216. Both quantities are on the transformed scales, so this is not a change in years.
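Written out with the estimates from the summary above, the fitted model and its back-transformation to years are as follows. Here x stands for TotExp^0.06, a symbol introduced only for this illustration.

```latex
\[
\widehat{\text{LifeExp}^{4.6}} = -736{,}527{,}910 + 620{,}060{,}216\,x,
\qquad
\widehat{\text{LifeExp}} = \left(-736{,}527{,}910 + 620{,}060{,}216\,x\right)^{1/4.6}
\]
```

The back-transformation is only defined when the linear predictor is non-negative, that is, when x is at least about 1.19.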

Significance of Coefficients:

Both the intercept and the coefficient for totExp_0.06 are highly statistically significant, as indicated by their very low p-values (p < 0.001).

This suggests a significant linear relationship between the transformed variables, TotExp^0.06 and LifeExp^4.6.

Overall Fit of the Model:

The multiple R-squared value is 0.7298, indicating that approximately 72.98% of the variability in LifeExp^4.6 can be explained by TotExp^0.06.

The adjusted R-squared value, which adjusts for the number of predictors in the model, is slightly lower at 0.7283.

The F-statistic tests the overall significance of the model.

With an extremely low p-value (p < 0.001), it suggests that the model as a whole is highly statistically significant.

Residual Standard Error:

The residual standard error is approximately 90,490,000 on the LifeExp^4.6 scale, indicating the typical amount by which the observed values of LifeExp^4.6 deviate from the fitted values.

Comparison of the two Models:

To compare the two models, we can consider the following factors:

R-squared: The second model (R-squared = 0.7298) explains a higher percentage of the variability in Life Expectancy compared to the first model (R-squared = 0.2577).

A higher R-squared value generally indicates a better fit of the model to the data.

F-statistic: The F-statistic for the second model (507.7) is substantially higher than for the first model (65.26). Both are on 1 and 188 degrees of freedom, so the larger value reflects a stronger fit.

Residual Standard Error: The two residual standard errors cannot be compared directly, because they are on different scales: 90,490,000 is in units of LifeExp^4.6, while 9.371 is in years. A fair comparison requires back-transforming the second model's predictions to years.

The second model appears to be better in terms of explaining the variability in Life Expectancy, based on its higher R-squared and F-statistic.
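One way to compare prediction error on a common scale is to back-transform each model's fitted values to years and compute a root-mean-square error. The sketch below assumes a hypothetical helper, rmse_in_years, and runs on simulated stand-in data (not the WHO data) so that it is self-contained.

```r
# Hypothetical helper: RMSE in years after undoing a power transform of the
# response. power = 1 means the response was not transformed.
rmse_in_years <- function(model, life_exp, power = 1) {
  # Negative fitted values cannot be back-transformed, so floor them at 0
  fitted_years <- pmax(fitted(model), 0)^(1 / power)
  sqrt(mean((life_exp - fitted_years)^2))
}

# Simulated stand-in data, for illustration only
set.seed(42)
tot_exp  <- exp(runif(150, log(100), log(1e5)))
life_exp <- 30 + 10 * log10(tot_exp) + rnorm(150, sd = 2)

fit_raw   <- lm(life_exp ~ tot_exp)
fit_power <- lm(I(life_exp^4.6) ~ I(tot_exp^0.06))

rmse_raw   <- rmse_in_years(fit_raw, life_exp)
rmse_power <- rmse_in_years(fit_power, life_exp, power = 4.6)

print(c(raw = rmse_raw, power = rmse_power))
```

With the objects from this report, the calls would be rmse_in_years(who_data.lm, who_data$LifeExp) and rmse_in_years(who2_lm, who$LifeExp, power = 4.6).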

Answer 3:

Forecast Life Expectancy:

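# New data must use the same column name as the predictor in who2_lm
forecast_data <- data.frame(totExp_0.06=c(1.5, 2.5))
# Predict LifeExp^4.6, then back-transform to years
predict(who2_lm, forecast_data) ^ (1/4.6)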
##        1        2 
## 63.31153 86.50645
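The same forecasts can be reproduced by hand from the printed coefficients, which makes the back-transformation step explicit. The variable names below are illustrative; the coefficients are copied from summary(who2_lm) above.

```r
# Coefficients copied from summary(who2_lm)
b0 <- -736527910   # intercept
b1 <- 620060216    # slope for TotExp^0.06

x <- c(1.5, 2.5)   # requested values of TotExp^0.06

# Predictions on the LifeExp^4.6 scale
pred_transformed <- b0 + b1 * x

# Undo the 4.6 power to get life expectancy in years
pred_years <- pred_transformed^(1 / 4.6)

print(round(pred_years, 2))
```

This gives forecasts of about 63.31 years when TotExp^0.06 = 1.5 and about 86.51 years when TotExp^0.06 = 2.5, matching the predict() output.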

Answer 4:

Multiple Regression Model:

urlfile<- "https://raw.githubusercontent.com/uzmabb182/Data605_Assignment/main/who.csv"

who<-read_csv(url(urlfile))
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(who)
## # A tibble: 6 × 10
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghanis…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania        71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria        71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra        82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
## 5 Angola         41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigua …      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>
multiple_regression <- lm(LifeExp ~ PropMD + TotExp + TotExp * PropMD, data=who)
summary(multiple_regression)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + TotExp * PropMD, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Interpretation of Coefficients:

The intercept term indicates that when both PropMD and TotExp are zero, the estimated Life Expectancy is approximately 62.77 years.

The coefficient for PropMD (1.497e+03) indicates that, when TotExp is zero, each one-unit increase in the Proportion of Medical Doctors is associated with an estimated increase of approximately 1497 years in Life Expectancy. Because PropMD is a proportion, a one-unit change (from 0 to 1) is not meaningful; a more useful reading is about 1.5 years per 0.001 increase in PropMD.

The coefficient for TotExp (7.233e-05) suggests that, when PropMD is zero, each one-unit increase in Total Expenditure is associated with an estimated increase of approximately 0.00007233 years in Life Expectancy.

The coefficient for the interaction term (PropMD:TotExp) (-6.026e-03) indicates how the effect of Total Expenditure on Life Expectancy changes depending on the Proportion of Medical Doctors.

It shows that for each one-unit increase in PropMD, the slope for TotExp decreases by approximately 0.006026 years per unit of expenditure. Equivalently, for each 0.001 increase in PropMD, the slope decreases by about 0.000006026.
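Because of the interaction, the slope for TotExp is not a single number but a function of PropMD. The sketch below evaluates that slope at a few illustrative PropMD values and finds where it reaches zero. The coefficients are copied from the summary above; the function name and the chosen PropMD values are assumptions made for illustration.

```r
# Coefficients copied from summary(multiple_regression)
b_exp <- 7.233e-05    # TotExp
b_int <- -6.026e-03   # PropMD:TotExp

# Fitted slope of LifeExp with respect to TotExp at a given PropMD
slope_tot_exp <- function(prop_md) b_exp + b_int * prop_md

# Illustrative PropMD values (not taken from the data)
prop_md_values <- c(0, 0.001, 0.005, 0.01)
print(slope_tot_exp(prop_md_values))

# PropMD at which the fitted TotExp slope becomes zero
prop_md_zero_slope <- -b_exp / b_int
print(prop_md_zero_slope)
```

Under these estimates the TotExp slope shrinks as PropMD grows and changes sign at a PropMD of roughly 0.012.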

Significance of Coefficients:

All coefficients are statistically significant (p < 0.001), indicating that both the individual predictors and the interaction term are significantly associated with Life Expectancy.

Overall Fit of the Model:

The multiple R-squared value is 0.3574, indicating that approximately 35.74% of the variability in Life Expectancy is explained by the predictors included in the model.

The adjusted R-squared value, which adjusts for the number of predictors in the model, is slightly lower at 0.3471.

The F-statistic (34.49 on 3 and 186 degrees of freedom) tests the overall significance of the model.

Its extremely low p-value (< 2.2e-16) suggests that the model as a whole is highly statistically significant.

Residual Standard Error:

The residual standard error is approximately 8.765 years, which is the typical amount by which the observed values deviate from the predicted values.

Summary:

The model provides a statistically significant explanation of Life Expectancy based on the included predictors.

It’s important to note that while the model explains a moderate proportion of the variability in Life Expectancy, there may be other factors not accounted for in the model that influence Life Expectancy.

Answer 5:

Forecast LifeExp when PropMD=.03 and TotExp = 14

# Forecast from the interaction model at the requested values
forecast_data <- data.frame(PropMD=0.03, TotExp=14)
predict(multiple_regression, forecast_data)
##       1 
## 107.696

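This forecast does not seem realistic: a life expectancy of 107.696 years is far above any country's average lifespan. The inputs are also an unusual combination, pairing a PropMD of 0.03 (3% of the population being MDs) with a TotExp of only 14, whereas the rows previewed above have PropMD no larger than 3.30e-3 and TotExp no smaller than 112. The PropMD term alone contributes about 1497 x 0.03 = 44.9 years on top of the intercept of 62.77, so the result is an extrapolation driven by the large PropMD coefficient.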
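Breaking the forecast into its four terms shows where the implausible value comes from. This is a sketch using the coefficients as printed in the summary; because they are rounded to four significant figures, the total differs slightly from the predict() output.

```r
# Coefficients copied (as printed) from summary(multiple_regression)
b0    <- 6.277e+01    # intercept
b_md  <- 1.497e+03    # PropMD
b_exp <- 7.233e-05    # TotExp
b_int <- -6.026e-03   # PropMD:TotExp

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contributions <- c(
  intercept   = b0,
  PropMD      = b_md * prop_md,
  TotExp      = b_exp * tot_exp,
  interaction = b_int * prop_md * tot_exp
)

print(contributions)
print(sum(contributions))
```

Nearly all of the increase over the intercept comes from the PropMD term; the TotExp and interaction terms are negligible at these inputs.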