The attached who.csv dataset contains real-world data from 2008. The variables included appear in the column specification printed when the file is read:
library(readr)
who.df <- read_csv('https://raw.githubusercontent.com/amberferger/DATA605_Homework/master/AFerger_Assignment12_Data.csv')
## Parsed with column specification:
## cols(
## Country = col_character(),
## LifeExp = col_double(),
## InfantSurvival = col_double(),
## Under5Survival = col_double(),
## TBFree = col_double(),
## PropMD = col_double(),
## PropRN = col_double(),
## PersExp = col_double(),
## GovtExp = col_double(),
## TotExp = col_double()
## )
Provide a scatterplot of \(LifeExp \sim TotExp\) and run a simple linear regression. Do not transform the variables. Provide and interpret the F statistic, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
plot(who.df$TotExp, who.df$LifeExp, main="Total Expenditure vs Life Expectancy", xlab="Total Expenditure", ylab="Life Expectancy")
who.lm <- lm(LifeExp~TotExp, data = who.df)
summary(who.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
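Are the assumptions of linear regression met? No. The scatterplot shows that the relationship between total expenditure and life expectancy is not linear (it looks roughly logarithmic), and the residuals are skewed: their median is 3.154, and the minimum (-24.764) lies much farther from zero than the maximum (13.292). The F statistic (65.26 on 1 and 188 degrees of freedom, p-value 7.714e-14) and the coefficient p-values show that TotExp is a statistically significant predictor, but \(R^2 = 0.2577\) means the model accounts for only \(25.77\%\) of the variance in life expectancy, so other factors must be at play. The residual standard error is 9.371 years.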
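The assumption check above can also be done graphically. The following is a minimal sketch, assuming simulated stand-in data rather than the WHO file (the variables `x`, `y`, and `fit` are hypothetical); it draws the two plots most often used for this purpose, residuals against fitted values and a normal Q-Q plot.

```r
# Hypothetical stand-in data (not the WHO file): a curved, log-like relationship
set.seed(1)
x <- runif(200, min = 10, max = 1e5)
y <- 40 + 4 * log(x) + rnorm(200, sd = 3)

fit <- lm(y ~ x)   # straight-line fit to curved data
res <- resid(fit)

# Residuals vs fitted values: a visible curve or funnel shape suggests
# non-linearity or non-constant variance
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the reference line suggest
# residuals that are not normally distributed
qqnorm(res)
qqline(res)
```

To apply the same checks to the model fitted above, replace `fit` with `who.lm`.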
Raise life expectancy to the 4.6 power (i.e., \(LifeExp^{4.6}\)). Raise total expenditure to the 0.06 power (nearly a log transform, \(TotExp^{0.06}\)). Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{0.06}\) and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, \(R^2\), standard error, and p-values. Which model is “better”?
who.df$LifeExp46 <- who.df$LifeExp^(4.6)
who.df$TotExp06 <- who.df$TotExp^(0.06)
plot(who.df$TotExp06, who.df$LifeExp46, main="Total Expenditure vs Life Expectancy - Transformed", xlab="TotExp^0.06", ylab="LifeExp^4.6")
who2.lm <- lm(LifeExp46~TotExp06, data = who.df)
summary(who2.lm)
##
## Call:
## lm(formula = LifeExp46 ~ TotExp06, data = who.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
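Which model is better? The transformed model is much better because \(R^2\) rises from 0.2577 to 0.7298, so the model now explains about \(73\%\) of the variance, and the F statistic rises from 65.26 to 507.7 with a p-value below 2.2e-16. The residual standard error (90490000) is on the scale of \(LifeExp^{4.6}\), so it cannot be compared directly with the 9.371 from the first model.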
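Because the two models have different response variables, one way to compare them on equal terms is to back-transform the second model's fitted values and compute the error in years of life expectancy. The sketch below assumes simulated stand-in data (the data frame `sim` and the helper `rmse` are hypothetical, not part of the original analysis).

```r
# Hypothetical stand-in data (not the WHO file)
set.seed(42)
n <- 190
tot_exp  <- exp(runif(n, log(10), log(5e5)))
life_exp <- 40 + 4 * log(tot_exp) + rnorm(n, sd = 3)
sim <- data.frame(LifeExp = life_exp, TotExp = tot_exp)
sim$LifeExp46 <- sim$LifeExp^4.6
sim$TotExp06  <- sim$TotExp^0.06

fit_raw   <- lm(LifeExp ~ TotExp, data = sim)
fit_trans <- lm(LifeExp46 ~ TotExp06, data = sim)

# Back-transform to years; pmax() guards against negative fitted values,
# which would produce NaN under a fractional power
pred_trans <- pmax(fitted(fit_trans), 0)^(1 / 4.6)

# Root mean squared error on the original LifeExp scale
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_raw   <- rmse(sim$LifeExp, fitted(fit_raw))
rmse_trans <- rmse(sim$LifeExp, pred_trans)
c(raw = rmse_raw, transformed = rmse_trans)
```

With the real data, the same comparison would use `who.lm`, `who2.lm`, and `who.df$LifeExp` in place of the simulated objects.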
Using the results from 2, forecast life expectancy when \(TotExp^{0.06} = 1.5\). Then forecast life expectancy when \(TotExp^{0.06}= 2.5\).
Our model predicts \(LifeExp^{4.6}\), so to recover \(LifeExp\) we raise each predicted value to the power \(\frac{1}{4.6} = \frac{5}{23}\), i.e. \(predictedValue^{\frac{5}{23}}\):
predTotExp1 <- data.frame(TotExp06 = 1.5)
predTotExp2 <- data.frame(TotExp06 = 2.5)
pred1 <- predict(who2.lm, newdata=predTotExp1)
pred2 <- predict(who2.lm, newdata=predTotExp2)
paste0('For TotExp^0.06 = 1.5, the life expectancy is: ', pred1^(5/23))
## [1] "For TotExp^0.06 = 1.5, the life expectancy is: 63.3115334469743"
paste0('For TotExp^0.06 = 2.5, the life expectancy is: ', pred2^(5/23))
## [1] "For TotExp^0.06 = 2.5, the life expectancy is: 86.5064484844719"
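As a check on these two forecasts, the back-transformation can be worked by hand from the coefficients printed in the summary of the transformed model (intercept \(-736527910\), slope \(620060216\)). The figures below use those printed values, so they agree with the `predict()` results only up to rounding.

\[\widehat{LifeExp} = \left(b_0 + b_1 \cdot TotExp^{0.06}\right)^{\frac{1}{4.6}}\]

\[\left(-736527910 + 620060216 \times 1.5\right)^{\frac{1}{4.6}} = 193562414^{\frac{1}{4.6}} \approx 63.31\]

\[\left(-736527910 + 620060216 \times 2.5\right)^{\frac{1}{4.6}} = 813622630^{\frac{1}{4.6}} \approx 86.51\]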
Build the following multiple regression model and interpret the F statistic, \(R^2\), standard error, and p-values. How good is the model?
\[LifeExp = b_0 + b_1PropMD + b_2TotExp + b_3(PropMD*TotExp) \]
who.df$interaxn_term <- who.df$PropMD * who.df$TotExp
who3.lm <- lm(LifeExp ~ PropMD + TotExp + interaxn_term, data = who.df)
summary(who3.lm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + interaxn_term, data = who.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## interaxn_term -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
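How good is the model? This model does not fit the data well. The F statistic (34.49 on 3 and 186 degrees of freedom, p-value below 2.2e-16) and the coefficient p-values show that all three terms are statistically significant, but \(R^2 = 0.3574\) means the model explains only about \(36\%\) of the variance. The residual standard error is 8.765, and the first and third quartiles of the residuals (-4.132 and 6.540) are not close to 1.5 times that value, which suggests the residuals are not normally distributed. The transformed model from the previous question fit the data better (\(R^2 = 0.7298\)).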
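The interaction term does not have to be built as a separate column. In R's formula syntax, `PropMD * TotExp` expands to both main effects plus their interaction, and `predict()` then needs only the two original variables. The sketch below shows the equivalence, assuming simulated stand-in data (the data frame `sim` and the point `new_point` are hypothetical).

```r
# Hypothetical stand-in data (not the WHO file)
set.seed(7)
sim <- data.frame(PropMD = runif(100, 0, 0.03),
                  TotExp = runif(100, 10, 1e5))
sim$LifeExp <- 60 + 500 * sim$PropMD + 1e-4 * sim$TotExp + rnorm(100, sd = 5)

# Manual interaction column, as in the model above
sim$interaxn_term <- sim$PropMD * sim$TotExp
fit_manual <- lm(LifeExp ~ PropMD + TotExp + interaxn_term, data = sim)

# Formula interaction: PropMD * TotExp means PropMD + TotExp + PropMD:TotExp
fit_formula <- lm(LifeExp ~ PropMD * TotExp, data = sim)

# The two fits give the same coefficients (only the term names differ)
all.equal(unname(coef(fit_manual)), unname(coef(fit_formula)))

# Prediction needs only the two original variables
new_point <- data.frame(PropMD = 0.01, TotExp = 500)
predict(fit_formula, newdata = new_point)
```

The manual column works equally well; the formula version simply removes the risk of passing an interaction value that is inconsistent with the two inputs.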
Forecast \(LifeExp\) when \(PropMD=0.3\) and \(TotExp = 14\). Does this forecast seem realistic? Why or why not?
predPropMD <- 0.3
predTotExp3 <- 14
predInteraxn <- predPropMD * predTotExp3
predData <- data.frame(PropMD = predPropMD,
                       TotExp = predTotExp3,
                       interaxn_term = predInteraxn)
pred3 <- predict(who3.lm, newdata=predData)
pred3
## 1
## 511.9966
Does this forecast seem realistic? Why or why not? No, this forecast does not seem realistic because an individual is not going to live to be 512 years old.
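Writing the forecast out by hand shows where the value comes from. The calculation below uses the coefficients as printed in the model summary, so it matches the `predict()` result of 511.9966 only up to rounding.

\[LifeExp = 62.77 + 1497(0.3) + 7.233 \times 10^{-5}(14) - 6.026 \times 10^{-3}(0.3 \times 14)\]

\[LifeExp \approx 62.77 + 449.1 + 0.001 - 0.025 \approx 511.8\]

Almost all of the forecast comes from the \(PropMD\) term: with a coefficient of about 1497, setting \(PropMD = 0.3\) alone adds roughly 449 years to the intercept.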