library(readr)

data <- read_csv("who.csv")
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

#scatter
plot(data$TotExp, data$LifeExp, 
     xlab = "Total Expenditure", ylab = "Life Expectancy",
     main = "Life Expectancy vs Total Expenditure")

model <- lm(LifeExp ~ TotExp, data = data)
summary(model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

F Stat: 65.26
F Stat P Val: 7.714e-14
R^2: .2577


  The F statistic (65.26 on 1 and 188 degrees of freedom) is highly significant, with a p-value of 7.714e-14. The model therefore fits better than a model with only an intercept, so TotExp has some predictive power for LifeExp.
  The R^2 value is low at .2577: the model explains only about 26% of the variability in life expectancy.
  The standard error of the TotExp coefficient (7.795e-06) is about one-eighth of the estimate (6.297e-05), so the slope is estimated fairly precisely. Its p-value (7.71e-14) matches the F test because there is only one predictor. The residual standard error is 9.371, meaning predictions are typically off by about 9.4 years of life expectancy, which is large.
  I do not think the assumptions of linear regression are met. The scatterplot is clearly non-linear: life expectancy rises steeply at low expenditure and flattens out at higher values.
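The quantities in the summary are tied together, which gives a quick consistency check on the interpretation. The sketch below only re-uses numbers printed in the summary above; the variable names are made up for illustration. In a regression with one predictor, the t value is the estimate divided by its standard error, t squared is approximately F, and R^2 can be recovered from F and the residual degrees of freedom.

```r
# Values copied from the summary(model) output above (rounded as printed).
est      <- 6.297e-05   # slope estimate for TotExp
se       <- 7.795e-06   # standard error of the slope
f_stat   <- 65.26       # overall F statistic
df_resid <- 188         # residual degrees of freedom

t_val     <- est / se                      # t statistic for the slope
se_share  <- se / est                      # standard error relative to the estimate
r2_from_f <- f_stat / (f_stat + df_resid)  # R^2 implied by F with one predictor

cat("t value:      ", round(t_val, 2), "\n")
cat("t squared:    ", round(t_val^2, 1), "\n")
cat("SE / estimate:", round(se_share, 3), "\n")
cat("R^2 from F:   ", round(r2_from_f, 4), "\n")
```

Small differences from the printed summary are expected because the inputs are rounded.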



  Other assumptions to check, and how:
  1. Linear relationship: Check the residuals vs fitted plot; the residuals should scatter around zero with no curve.
  2. Autocorrelation: Residuals that are correlated with one another violate the assumption that the errors, and therefore the observations, are independent. This matters mostly for data ordered in time or space.
  3. Multicollinearity: Occurs when two or more predictors are strongly correlated with each other. It does not apply to this model because there is only one predictor; with several predictors it can be checked with a correlation plot.
  4. Heteroskedasticity: Occurs when the variance of the residuals is not constant. It shows up in a residuals vs fitted plot as a pattern in the spread, such as the “trumpet” shape where the points fan out as the fitted values grow.
  5. Normal distribution of error terms: A Q-Q plot compares the quantiles of the residuals with the quantiles of a normal distribution; points close to the straight line indicate approximately normal residuals.
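The checks listed above can be sketched as follows. This is a minimal illustration on simulated data, not the WHO data: `x`, `y`, `sim_fit`, and `sim_fit_log` are hypothetical names, and the data are generated to rise quickly and then flatten, roughly like the scatterplot. The two diagnostic plots come from base R's `plot()` method for `lm` objects.

```r
set.seed(1)
# Simulated stand-in data (NOT the WHO data): y rises quickly, then flattens.
x <- runif(200, min = 10, max = 5000)
y <- 40 + 5 * log(x) + rnorm(200, sd = 2)
sim_fit <- lm(y ~ x)

# Residuals vs fitted (linearity, constant variance) and the normal Q-Q plot.
par(mfrow = c(1, 2))
plot(sim_fit, which = 1)
plot(sim_fit, which = 2)

# Numeric check for curvature: does adding log(x) improve the straight-line fit?
sim_fit_log <- lm(y ~ x + log(x))
curvature_test <- anova(sim_fit, sim_fit_log)
print(curvature_test)
```

A curved band in the residuals vs fitted plot, or a small p-value in the nested-model comparison, suggests the straight-line form is inadequate.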



2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”



  I fit this model two ways: by transforming the columns of the data frame, and by using I() inside the lm() formula. The two gave drastically different results, so I interpret both. They differ because the data frame method further down swaps the exponents (TotExp is raised to 4.6 and LifeExp to .06); with the same exponents, the two methods fit the same model.
  First, the method using I().
model2 <- lm(I(LifeExp^4.6) ~ I(TotExp^.06), data = data)
summary(model2)
## 
## Call:
## lm(formula = I(LifeExp^4.6) ~ I(TotExp^0.06), data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -736527910   46817945  -15.73   <2e-16 ***
## I(TotExp^0.06)  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16


F Stat: 507.7
F Stat P Val: < 2.2e-16
R^2: .7298

  The standard errors are small relative to the estimates (about 4% of the estimate for the slope and about 6% for the intercept), so both coefficients are estimated precisely. Both p-values are below 2e-16: if there were no relationship between the transformed variables, estimates this far into the tails would be extremely unlikely. The residual standard error (90490000) is on the scale of LifeExp^4.6, so it cannot be compared directly with the 9.371 of the first model. R^2 rises from .2577 to .7298, and the transformations straighten out the curved relationship, which makes the linearity assumption far more plausible. This model is a clear improvement over the first.


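  The data frame transformation method:
data2 <- data
# NOTE: the exponents here are swapped relative to the question, which asks for
# LifeExp^4.6 and TotExp^.06. The output below reflects the code as written.
data2$TotExp <- data2$TotExp^4.6
data2$LifeExp <- data2$LifeExp^.06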

model2.1 <- lm(LifeExp ~ TotExp, data = data2)
summary(model2.1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.038175 -0.006180  0.004431  0.008741  0.017524 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 1.286e+00  9.848e-04 1305.760   <2e-16 ***
## TotExp      1.761e-28  6.771e-29    2.601     0.01 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01339 on 188 degrees of freedom
## Multiple R-squared:  0.03475,    Adjusted R-squared:  0.02961 
## F-statistic: 6.767 on 1 and 188 DF,  p-value: 0.01002


F Stat: 6.767
F Stat P Val: .01002
R^2: ~.035


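  The standard error of the TotExp coefficient is about 38% of the estimate (t = 2.601), far less precise than in the other models. The TotExp p-value (0.01) and the F statistic p-value (.01002) are significant at the 5% level, but only barely compared with the earlier models. The R^2 value of about .035 is much worse than the first model's .2577. Because this version raises TotExp to 4.6 and LifeExp to .06, the reverse of what the question asks, it does not answer the question; the I() model is the one to use, and it is clearly better than the untransformed model.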
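To confirm that the two approaches agree when the exponents match, here is a minimal sketch on simulated data (not the WHO data). The data frame `sim` and the column names `LifeExp46` and `TotExp006` are hypothetical and used only for illustration.

```r
set.seed(42)
# Simulated stand-in data (NOT the WHO data); column names mirror the originals.
sim <- data.frame(TotExp = runif(100, min = 10, max = 5000))
sim$LifeExp <- 40 + 4 * log(sim$TotExp) + rnorm(100, sd = 2)

# Method 1: transform inside the formula with I().
fit_formula <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = sim)

# Method 2: transform the columns first, using the same exponents.
sim_t <- data.frame(LifeExp46 = sim$LifeExp^4.6, TotExp006 = sim$TotExp^0.06)
fit_columns <- lm(LifeExp46 ~ TotExp006, data = sim_t)

# Compare coefficients and R^2, ignoring the differing term names.
same_coefs <- isTRUE(all.equal(unname(coef(fit_formula)), unname(coef(fit_columns))))
same_r2 <- isTRUE(all.equal(summary(fit_formula)$r.squared,
                            summary(fit_columns)$r.squared))
print(c(same_coefs = same_coefs, same_r2 = same_r2))
```

Transforming into new columns, rather than overwriting `TotExp` and `LifeExp`, also makes it harder to apply an exponent to the wrong variable.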
plot(data2$TotExp, data2$LifeExp,
     xlab = "Total Exp", ylab = "Life Exp",
     main = "Life Exp as a fn of Total Exp")

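  The plot does not really improve on the first, unless the outliers were removed so the plot could home in on the points clustered near 0.0e+00. It uses the swapped exponents stored in data2. Either transformation method produces the same plot when the same exponents are used.

# Plot with the exponents the question asks for, built from the original
# columns (data2$LifeExp is already transformed and must not be raised again).
plot(data$TotExp^.06, data$LifeExp^4.6,
     xlab = "TotExp^.06", ylab = "LifeExp^4.6",
     main = "LifeExp^4.6 as a fn of TotExp^.06")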



3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.



  Assuming ‘3’ was a typo, I will use the more promising model from number 2.
summary(model2)
## 
## Call:
## lm(formula = I(LifeExp^4.6) ~ I(TotExp^0.06), data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -736527910   46817945  -15.73   <2e-16 ***
## I(TotExp^0.06)  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
# Intercept and slope copied from the summary above
life_exp_1 <- -736527910 + 620060216*1.5
life_exp_2 <- -736527910 + 620060216*2.5

print(life_exp_1)
## [1] 193562414
print(life_exp_2)
## [1] 813622630
  These forecasts are on the scale of LifeExp^4.6, not years. Raising them to the power 1/4.6 undoes the transformation and gives life expectancies of about 63.3 and 86.5 years.
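One way to put the forecasts back on the scale of years is to invert the response transformation. The sketch below assumes the intercept and slope printed in the summary of model2 (rounded as shown there); `forecast_years` is a hypothetical helper name, not part of the original analysis.

```r
# Coefficients copied from the summary(model2) output (rounded as printed).
b0 <- -736527910
b1 <- 620060216

# Forecast LifeExp in years for a given value of TotExp^0.06.
forecast_years <- function(totexp_006) {
  life_exp_46 <- b0 + b1 * totexp_006  # prediction on the LifeExp^4.6 scale
  life_exp_46^(1 / 4.6)                # back-transform to years
}

print(forecast_years(c(1.5, 2.5)))
```

The back-transformation only works when the prediction on the LifeExp^4.6 scale is positive, which holds for both inputs here.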



4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

model4 <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = data)
summary(model4)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16


F Stat: 34.49
F Stat P Val: <2.2e-16
R^2: .3574


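  The F statistic (34.49 on 3 and 186 degrees of freedom, p-value < 2.2e-16) shows that the model fits significantly better than an intercept-only model. The p-values for all coefficients, including the interaction, are also significant. The standard errors are well below the estimates (between about 1% and 25% of each estimate), so the coefficients are estimated reasonably precisely. The residual standard error of 8.765 means predictions are typically off by almost 9 years. The R^2 value of .3574 is better than that of the untransformed simple model (.2577) but still poor: about 64% of the variability remains unexplained. This does not seem to be a promising model.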
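Because the model includes PropMD:TotExp, the effect of PropMD on LifeExp is not a single number: it equals b1 + b3 x TotExp. The sketch below uses the rounded coefficients printed in the summary above; `propmd_effect` and the TotExp values in `tot_exp_grid` are hypothetical choices for illustration, not values taken from the data.

```r
# Coefficients copied from the summary(model4) output (rounded as printed).
b_propmd <- 1.497e+03   # PropMD
b_inter  <- -6.026e-03  # PropMD:TotExp

# Change in predicted LifeExp per unit of PropMD at a given TotExp.
propmd_effect <- function(tot_exp) b_propmd + b_inter * tot_exp

# Illustrative TotExp values chosen for this sketch, not taken from the data.
tot_exp_grid <- c(100, 1000, 10000, 100000)
print(data.frame(TotExp = tot_exp_grid,
                 PropMD_effect = propmd_effect(tot_exp_grid)))
```

Since PropMD is a proportion, a "unit" change is the whole range from 0 to 1, so these effects are best read after scaling to a realistic change such as 0.001. The negative interaction means the benefit associated with PropMD shrinks as TotExp grows.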


5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

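coefficients <- coef(model4)
int <- coefficients[1]
propCo <- coefficients[2]
totCo <- coefficients[3]
interCo <- coefficients[4]  # PropMD:TotExp interaction

PropMDval <- .03
TotExpval <- 14


# Include the interaction term so the forecast uses the full model
forecast_1 <- int + propCo*PropMDval + totCo*TotExpval + interCo*PropMDval*TotExpval
print(forecast_1)
## (Intercept) 
##     107.696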
  This does not seem like a realistic life expectancy: a forecast above 100 years is implausible given that, in the US, only .027% of people live to be 100. Maybe the dataset is biased; let's check.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data %>% select(TotExp,PropMD,LifeExp) %>% filter(LifeExp > 100)
## # A tibble: 0 × 3
## # ℹ 3 variables: TotExp <dbl>, PropMD <dbl>, LifeExp <dbl>


  There are no observations where LifeExp is greater than 100, so the forecast lies outside the range of the observed data and is not trustworthy.
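A forecast like this can also be computed with predict(), which evaluates every term in the formula and so avoids leaving one out by hand. The sketch below runs on simulated data (not the WHO data); `sim`, `sim_fit`, and `new_obs` are hypothetical names, and the coefficients used to generate the data are arbitrary.

```r
set.seed(7)
# Simulated stand-in data (NOT the WHO data); column names mirror the originals.
sim <- data.frame(
  PropMD = runif(150, min = 0, max = 0.03),
  TotExp = runif(150, min = 10, max = 5000)
)
sim$LifeExp <- 60 + 500 * sim$PropMD + 0.002 * sim$TotExp -
  0.05 * sim$PropMD * sim$TotExp + rnorm(150, sd = 3)

sim_fit <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = sim)
new_obs <- data.frame(PropMD = 0.03, TotExp = 14)

# predict() evaluates every term in the formula, including PropMD:TotExp.
by_predict <- unname(predict(sim_fit, newdata = new_obs))

# The same forecast by hand, with the interaction term written out.
b <- coef(sim_fit)
by_hand <- unname(b[1] + b[2] * 0.03 + b[3] * 14 + b[4] * 0.03 * 14)
print(c(by_predict = by_predict, by_hand = by_hand))

# Flag extrapolation: is the forecast inside the observed response range?
inside_range <- by_predict >= min(sim$LifeExp) && by_predict <= max(sim$LifeExp)
print(inside_range)
```

Comparing the forecast with the observed range of the response is a simple extrapolation check, in the same spirit as the filter for LifeExp > 100 above.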