Assignment 12

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures

#Read in Dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv("/Users/marjetevucinaj/Downloads/who.csv")

Question 1:

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

library(ggplot2) #load library
plot(data$TotExp, data$LifeExp) #abline omitted: a straight line does not appear to suit this scatter

#run simple linear regression
model1 <- lm(LifeExp ~ TotExp, data = data) #LifeExp (dependent/ response) is modeled as a function of TotExp (independent / predictor).
summary(model1)  
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Provide and interpret the F statistics, R^2, standard error, and p-values only.

F-statistic: 65.26 on 1 and 188 degrees of freedom, p-value: 7.714e-14. The F-test measures the overall significance of the model. Because the p-value is far below .05, TotExp is a statistically significant predictor of LifeExp. Significance alone does not show that a straight line fits well; that depends on R^2 and the residual diagnostics.

R-squared: 0.2577, meaning that only 25.77% of the variability in LifeExp is explained by TotExp.

The residual standard error is 9.371, so observed life expectancy typically differs from the fitted line by about 9.4 years. The standard error of the TotExp coefficient is 7.795e-06, which is 8.079 times smaller than the coefficient estimate (6.297e-05); that ratio is the t-value, and it indicates the slope is estimated with acceptable precision.
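The relationship between the coefficient, its standard error, the t-value, and the F-statistic can be checked by hand. The sketch below copies the rounded values from the summary(model1) output rather than refitting the model, so the results agree only up to rounding; the variable names are illustrative and not part of the original analysis.

```r
# Values copied (rounded) from the summary(model1) output
estimate  <- 6.297e-05   # TotExp coefficient
std_error <- 7.795e-06   # standard error of that coefficient
f_stat    <- 65.26       # reported F-statistic on 1 and 188 DF
df_resid  <- 188         # residual degrees of freedom

# The t-value is the estimate divided by its standard error
t_value <- estimate / std_error

# With a single predictor, the F-statistic equals the squared t-value
f_from_t <- t_value^2

# p-value of the F-test: upper tail of the F distribution
p_value <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

cat("t value:", round(t_value, 3), "\n")
cat("t^2:", round(f_from_t, 2), " reported F:", f_stat, "\n")
cat("p-value:", signif(p_value, 3), "\n")
```

Because the model has only one predictor, the p-value of the F-test and the p-value of the TotExp coefficient describe the same test.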

plot(model1)

The residuals vs. fitted plot shows a clear pattern, suggesting the relationship is non-linear. In the Q-Q plot the residuals do not appear to be normal: they are close to the line between -1 and 1 but depart from it in both tails. The scale-location plot is used to check for heteroscedasticity; since its line is not straight, the residuals do not have constant variance. The residuals vs. leverage plot suggests that outliers influence the regression model. Overall, the assumptions of simple linear regression are not met.

hist(resid(model1)) #using this to double check that the distribution is not normal

Question 2:

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?” Model 2 is “better”; explanations are below.

data2 <- data %>% 
  mutate(LifeExp2 = LifeExp^4.6) %>%  
  mutate(TotExp2 = TotExp^.06) # raising to the power question states

model2 <- lm(LifeExp2~TotExp2,data=data2)
summary(model2)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = data2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Provide and interpret the F statistics, R^2, standard error, and p-values.

F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16. The p-value is less than .05, so the model is statistically significant, and the F-statistic is much larger than in model 1.

R-squared is close to .73, which is higher than model 1 (.26), and suggests that 73% of the variability in LifeExp2 is explained by TotExp2.

The standard error of the TotExp2 coefficient is 27518940, which is 22.53 times smaller than the coefficient itself (620060216), so the slope is estimated precisely. A small standard error relative to the coefficient is not a sign of overfitting. The residual standard error (90490000) is in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 from model 1.
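The F-statistic and R^2 of a regression are two views of the same quantity, which makes the comparison between model 1 and model 2 easy to verify. The sketch below uses the standard identity R^2 = F * df1 / (F * df1 + df2) with the rounded values printed in the two summaries; the helper name r_squared_from_f is made up for this illustration.

```r
# Recover R^2 from a reported F-statistic and its degrees of freedom
# (hypothetical helper, not part of the original analysis)
r_squared_from_f <- function(f_stat, df_model, df_resid) {
  (f_stat * df_model) / (f_stat * df_model + df_resid)
}

# F-statistics copied from summary(model1) and summary(model2)
r2_model1 <- r_squared_from_f(65.26, df_model = 1, df_resid = 188)
r2_model2 <- r_squared_from_f(507.7, df_model = 1, df_resid = 188)

cat("Model 1 R^2 from F:", round(r2_model1, 4), "\n")
cat("Model 2 R^2 from F:", round(r2_model2, 4), "\n")
```

Both models have the same degrees of freedom, so the much larger F-statistic of model 2 translates directly into its larger share of explained variance.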

plot(data2$TotExp2, data2$LifeExp2)
abline(model2) #visually, a straight line seems a better fit for the second model than for the model in Question 1

plot(model2)

The residuals vs. fitted plot shows the line mostly at 0 with no clear pattern, suggesting the relationship is approximately linear. The Q-Q plot appears more normal by comparison, with more points following the line, though it still departs from the line in the left tail. The scale-location plot shows a nearly straight line, suggesting homoscedasticity (constant variance of the residuals). The residuals vs. leverage plot still has some outliers, but they appear to influence the regression model less since the line is close to zero.

Question 3:

Using the results from Question 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5. Use the coefficient values from model 2 to find the predicted values: (Intercept) -736527910 and TotExp2 620060216. The model predicts LifeExp^4.6 from TotExp^.06, so the prediction must be raised to the power 1/4.6 to invert the transformation from Question 2 and recover life expectancy in years.

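# Define function: predict LifeExp^4.6 from TotExp^.06, then back-transform to years
forecast_LifeExp <- function(tot_exp3) {
  # intercept and slope copied from summary(model2)
  result <- (-736527910 + 620060216 * tot_exp3)^(1/4.6)
  return(result)
}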

# TotExp^0.06 = 1.5
forecast1 <- forecast_LifeExp(1.5)
cat("Forecast when TotExp^.06 = 1.5 is", forecast1, "\n")
## Forecast when TotExp^.06 = 1.5 is 63.31153
# TotExp^0.06 = 2.5
forecast2 <- forecast_LifeExp(2.5)
cat("Forecast when TotExp^.06 = 2.5 is", forecast2, "\n")
## Forecast when TotExp^.06 = 2.5 is 86.50645
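To put the two forecasts in context, the transformed inputs can be converted back to dollar amounts by inverting the 0.06 power. This is a small sketch of that arithmetic only; the variable names are illustrative.

```r
# Invert TotExp^0.06 to recover total expenditure in US dollars
power <- 0.06

tot_exp_low  <- 1.5^(1 / power)  # TotExp that gives TotExp^.06 = 1.5
tot_exp_high <- 2.5^(1 / power)  # TotExp that gives TotExp^.06 = 2.5

cat("TotExp for TotExp^.06 = 1.5:", round(tot_exp_low), "\n")
cat("TotExp for TotExp^.06 = 2.5:", round(tot_exp_high), "\n")
```

The second value is in the millions of dollars, so whether the 86.5-year forecast is an extrapolation depends on how that compares with the largest TotExp in the data.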

Question 4:

Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

PropMD is the proportion of the population who are MDs. The model uses the original, untransformed data. b0 through b3 are not written in the model formula, since they are the coefficients that lm() estimates for each term.

model4 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data=data)
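#model is named after the question number; in an R formula, PropMD * TotExp alone already expands to both main effects plus the interaction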
summary(model4)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

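F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16, which suggests that the model is statistically significant since the p-value is less than .05. R^2 is 0.3574 and adjusted R^2 is 0.3471, higher than model 1 (0.2577 and 0.2537), suggesting that adding PropMD and the interaction lets the model account for ~35% of the variability in LifeExp. The residual standard error is 8.765, slightly lower than model 1 (9.371). Each coefficient is statistically significant (all p-values below .001); the estimates are about 5.4 (PropMD), 8.1 (TotExp) and 4.1 (PropMD:TotExp) times their standard errors in absolute value, with the interaction the least precisely estimated.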

How good is the model? This model shows some improvement, specifically a higher R^2, but it raises the same concerns as model 1: a linear model may not be a good fit, since the residuals are skewed, their variance is not constant, and they deviate from the line in the Q-Q plot.

plot(model4)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Question 5:

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Based on a Google search, the oldest person on record lived to be 122 years old, which we can consider an outlier, and the maximum life expectancy in this dataset was 83 years. A forecast of over 107 years therefore does not seem realistic as a country average; it could only describe a very small percentage of people.

#define the variables as prompted in question
PropMD5 <- .03
TotExp5 <- 14
forecasted_lifeExp5 <- predict(model4, newdata = data.frame(PropMD = PropMD5, TotExp = TotExp5))
forecasted_lifeExp5
##       1 
## 107.696
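To see where the 107.7 years comes from, the forecast can be rebuilt term by term from the coefficients. The sketch below uses the rounded estimates printed by summary(model4), so it reproduces the predict() result only approximately; the variable names are illustrative.

```r
# Coefficients copied (rounded) from the summary(model4) output
b_intercept   <- 6.277e+01
b_prop_md     <- 1.497e+03
b_tot_exp     <- 7.233e-05
b_interaction <- -6.026e-03

# Inputs given in the question
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contributions <- c(
  intercept   = b_intercept,
  PropMD      = b_prop_md * prop_md,
  TotExp      = b_tot_exp * tot_exp,
  interaction = b_interaction * prop_md * tot_exp
)

forecast_by_hand <- sum(contributions)

print(round(contributions, 4))
cat("Forecast rebuilt from rounded coefficients:", round(forecast_by_hand, 2), "\n")
```

Almost all of the increase over the intercept comes from the PropMD term (about 45 years), while TotExp = 14 contributes next to nothing. The forecast therefore hinges on whether PropMD = .03 lies within the range seen in the data, which range(data$PropMD) would show.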