Data Description

The attached who.csv dataset contains real-world data from 2008. The variables it includes are listed in the column specification printed when the file is read:

library(readr)

who.df <- read_csv('https://raw.githubusercontent.com/amberferger/DATA605_Homework/master/AFerger_Assignment12_Data.csv')
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   LifeExp = col_double(),
##   InfantSurvival = col_double(),
##   Under5Survival = col_double(),
##   TBFree = col_double(),
##   PropMD = col_double(),
##   PropRN = col_double(),
##   PersExp = col_double(),
##   GovtExp = col_double(),
##   TotExp = col_double()
## )

Problem 1

Provide a scatterplot of \(LifeExp \sim TotExp\) and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

plot(who.df$TotExp, who.df$LifeExp, main="Total Expenditure vs Life Expectancy", xlab="Total Expenditure", ylab="Life Expectancy")

who.lm <- lm(LifeExp~TotExp, data = who.df)
summary(who.lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

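Interpretation: The F-statistic is 65.26 on 1 and 188 degrees of freedom with a p-value of 7.714e-14, so the regression is statistically significant, and the TotExp coefficient has the same very small p-value. However, \(R^2\) is only 0.2577, so TotExp explains about \(25.77\%\) of the variance in LifeExp and other factors must be at play. The residual standard error is 9.371.

Are the assumptions of linear regression met? No. The scatterplot does not follow a linear trend (it looks logarithmic), and the residuals are not centered or symmetric: their median is 3.154 rather than near 0, and the minimum (-24.764) lies much farther from 0 than the maximum (13.292).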
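
The assumptions can also be checked by plotting the residuals directly. The following is a minimal sketch, not part of the original analysis: `residual_diagnostics` is a hypothetical helper name, and the demonstration uses R's built-in `cars` data only so that the block runs without downloading anything.

```r
# Hypothetical helper (not part of the original analysis): draws a
# residuals-vs-fitted plot and a normal Q-Q plot for any lm fit.
residual_diagnostics <- function(fit) {
  res <- resid(fit)
  old_par <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(old_par))             # restore graphics settings on exit

  # Residuals should scatter evenly around 0 with no visible curve or funnel
  plot(fitted(fit), res,
       main = "Residuals vs Fitted", xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)

  # Points should follow the reference line if residuals are roughly normal
  qqnorm(res)
  qqline(res)

  invisible(res)
}

# Demonstration on R's built-in cars data so the block runs on its own
demo.lm <- lm(dist ~ speed, data = cars)
demo_res <- residual_diagnostics(demo.lm)
```

Calling `residual_diagnostics(who.lm)` would apply the same two plots to the model fitted above. A curved pattern in the first panel would point to non-linearity, and points drifting off the line in the second would point to non-normal residuals.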

Problem 2

Raise life expectancy to the 4.6 power (i.e., \(LifeExp^{4.6}\)). Raise total expenditure to the 0.06 power (nearly a log transform, \(TotExp^{0.06}\)). Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{0.06}\) and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values. Which model is “better”?

who.df$LifeExp46 <- who.df$LifeExp^(4.6)
who.df$TotExp06 <- who.df$TotExp^(0.06)

plot(who.df$TotExp06, who.df$LifeExp46, main="Total Expenditure vs Life Expectancy - Transformed", xlab="TotExp^0.06", ylab="LifeExp^4.6")

who2.lm <- lm(LifeExp46~TotExp06, data = who.df)
summary(who2.lm)
## 
## Call:
## lm(formula = LifeExp46 ~ TotExp06, data = who.df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp06     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

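Which model is better? The transformed model is much better because:

  1. The plot looks more linear.
  2. The \(R^2\) value has increased from 0.2577 to 0.7298, so the model now explains about \(73\%\) of the variance.
  3. The F-statistic has risen from 65.26 to 507.7, with a p-value below 2.2e-16.
  4. The residual standard error (90490000) is closer to 1.5 times the magnitude of the first and third residual quartiles, which is roughly what normally distributed residuals would give. The standard errors themselves cannot be compared directly between the two models, because the response is now on the \(LifeExp^{4.6}\) scale.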
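
The comparison of the standard error with the residual quartiles can be worked out numerically. This is a small sketch that copies the values from the two `summary()` outputs above; the variable names and the use of `1 / qnorm(0.75)` as the reference ratio for normal residuals are additions for illustration.

```r
# Values copied from the two summary() outputs above
se <- c(untransformed = 9.371, transformed = 90490000)
q1 <- c(untransformed = -4.778, transformed = -53978977)
q3 <- c(untransformed = 7.116, transformed = 59139231)

# For normal residuals, sd / |quartile| = 1 / qnorm(0.75), about 1.48
normal_ratio <- 1 / qnorm(0.75)

# Ratio of residual standard error to each quartile's magnitude
ratios <- rbind(se_over_q1 = se / abs(q1),
                se_over_q3 = se / abs(q3))
print(round(ratios, 2))

# Total distance of each model's two ratios from the normal reference
gap <- colSums(abs(ratios - normal_ratio))
print(round(gap, 2))
```

A smaller gap means the quartiles sit closer to where normal residuals would put them, and the transformed model has the smaller gap of the two.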

Problem 3

Using the results from 2, forecast life expectancy when \(TotExp^{0.06} = 1.5\). Then forecast life expectancy when \(TotExp^{0.06}= 2.5\).

The model predicts \(LifeExp^{4.6}\), so to recover \(LifeExp\) we raise each predicted value to the power \(\frac{1}{4.6} = \frac{5}{23}\), i.e. \(predictedValue^{\frac{5}{23}}\):

predTotExp1 <- data.frame(TotExp06 = 1.5)
predTotExp2 <- data.frame(TotExp06 = 2.5)

pred1 <- predict(who2.lm, newdata=predTotExp1)
pred2 <- predict(who2.lm, newdata=predTotExp2)

paste0('For TotExp^0.06 = 1.5, the life expectancy is: ', pred1^(5/23))
## [1] "For TotExp^0.06 = 1.5, the life expectancy is: 63.3115334469743"
paste0('For TotExp^0.06 = 2.5, the life expectancy is: ', pred2^(5/23))
## [1] "For TotExp^0.06 = 2.5, the life expectancy is: 86.5064484844719"
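
The same forecasts can be reproduced by hand from the printed coefficients, which makes the back-transformation explicit. This is a sketch under the assumption that the rounded coefficients shown in the Problem 2 summary are accurate enough; `forecast_life_exp` is a hypothetical helper name, not part of the original code.

```r
# Coefficients copied from the Problem 2 regression summary (rounded as printed)
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^0.06

# Hypothetical helper: predict LifeExp^4.6, then undo the 4.6 power
forecast_life_exp <- function(tot_exp06) {
  pred46 <- b0 + b1 * tot_exp06   # prediction on the LifeExp^4.6 scale
  pred46^(1 / 4.6)                # back-transform to years
}

life_exp <- forecast_life_exp(c(1.5, 2.5))
print(round(life_exp, 2))
```

The results should agree with the `predict()` output to about two decimal places; any small difference comes from the rounding of the printed coefficients.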

Problem 4

Build the following multiple regression model and interpret the F statistic, \(R^2\), standard error, and p-values. How good is the model?
\[LifeExp = b_0 + b_1PropMD + b_2TotExp + b_3(PropMD*TotExp) \]

who.df$interaxn_term <- who.df$PropMD * who.df$TotExp

who3.lm <- lm(LifeExp ~ PropMD + TotExp + interaxn_term, data = who.df)
summary(who3.lm)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + interaxn_term, data = who.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## interaxn_term -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

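How good is the model? The model as a whole is statistically significant (F-statistic 34.49 on 3 and 186 degrees of freedom, p-value below 2.2e-16), and every coefficient, including the interaction term, has a p-value below 0.001. Even so, it does not fit the data well: \(R^2\) is only 0.3574 (adjusted 0.3471), and the residual standard error of 8.765 is not around 1.5 times the magnitude of the first and third residual quartiles (-4.132 and 6.540). The transformed model from Problem 2, with \(R^2 = 0.7298\), fit the data better.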
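
The interaction column above was built by hand. R's formula syntax can build the same term itself, which is sketched below on the built-in `mtcars` data so the block runs on its own; the column and object names here are illustrative and not part of the original analysis.

```r
# Manual interaction column, mirroring the approach used for the model above
cars_df <- mtcars
cars_df$wt_hp <- cars_df$wt * cars_df$hp
manual.lm <- lm(mpg ~ wt + hp + wt_hp, data = cars_df)

# Formula shorthand: wt * hp expands to wt + hp + wt:hp
formula.lm <- lm(mpg ~ wt * hp, data = mtcars)

# Both fits should give the same coefficients (names differ, values do not)
same_fit <- isTRUE(all.equal(unname(coef(manual.lm)), unname(coef(formula.lm))))
print(same_fit)

# With the shorthand, newdata only needs the original predictors;
# predict() computes the interaction term internally
new_car <- data.frame(wt = 3, hp = 150)
demo_pred <- predict(formula.lm, newdata = new_car)
print(demo_pred)
```

Applied to this assignment, the equivalent call would be `lm(LifeExp ~ PropMD * TotExp, data = who.df)`, which removes the need to compute the interaction value separately when forecasting.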

Problem 5

Forecast \(LifeExp\) when \(PropMD=0.3\) and \(TotExp = 14\). Does this forecast seem realistic? Why or why not?

predPropMD <- 0.3
predTotExp3 <- 14
predInteraxn <- predPropMD * predTotExp3

predData <- data.frame(PropMD = predPropMD,
                       TotExp = predTotExp3,
                       interaxn_term = predInteraxn)


pred3 <- predict(who3.lm, newdata=predData)

pred3
##        1 
## 511.9966

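Does this forecast seem realistic? No. A life expectancy of about 512 years is not plausible. Almost all of the forecast comes from the PropMD term: its coefficient of 1.497e+03 multiplied by 0.3 contributes roughly 449 years on top of the intercept of 62.77. This suggests that PropMD = 0.3 lies far from the values the model was fitted on, so the linear fit is being extrapolated beyond where it can be trusted.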
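
To see where the implausible value comes from, the forecast can be split into the contribution of each term. This sketch uses the rounded coefficients printed in the Problem 4 summary, so the total differs slightly from the `predict()` result; the variable names are illustrative.

```r
# Coefficients copied from the Problem 4 regression summary (rounded as printed)
coefs <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
           TotExp = 7.233e-05, interaction = -6.026e-03)

# Inputs for the forecast: PropMD = 0.3, TotExp = 14
inputs <- c(intercept = 1, PropMD = 0.3, TotExp = 14, interaction = 0.3 * 14)

# Contribution of each term to the forecast, in years
contrib <- coefs * inputs
print(round(contrib, 3))

total <- sum(contrib)
print(round(total, 1))

# Share of the forecast that comes from the PropMD term alone
print(round(contrib[["PropMD"]] / total, 2))
```

Comparing 0.3 and 14 with `range(who.df$PropMD)` and `range(who.df$TotExp)` would show directly whether the forecast point falls outside the observed data.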