Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the
For a linear regression model, there are four prerequisites:
From the scatterplot, we can see that the relationship between Life Expectancy and Total Expenditures is not linear (violating requirement 1):
plot(LifeExp ~ TotExp, data=countries)
Further, the residuals do not have constant variance across the values of Total Expenditures (violating requirement 2):
model_1 <- lm(LifeExp ~ TotExp, data=countries)
plot(resid(model_1) ~ countries$TotExp )
abline(0,0)
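The Q-Q plot checks a further prerequisite, the normality of the residuals (it does not test for heteroscedasticity). The residuals are left-skewed, with a minimum of -24.8 against a maximum of 13.3 in the model summary, so the points should bend away from the reference line.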
qqnorm(resid(model_1))
qqline(resid(model_1))
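The visual checks above can be complemented with formal tests. The following is a minimal sketch, not part of the original analysis: it uses a simulated stand-in for the `countries` data (the data frame `sim` and its generating formula are assumptions made only so the block runs on its own), applies the Shapiro-Wilk test for normality of the residuals, and computes a studentized Breusch-Pagan statistic by hand for constant variance.

```r
# Simulated stand-in for the real data (assumption: NOT the WHO dataset).
set.seed(42)
n <- 190
tot_exp  <- exp(runif(n, log(10), log(5e5)))        # right-skewed spending
life_exp <- 40 + 3.5 * log(tot_exp) + rnorm(n, sd = 4)
sim <- data.frame(LifeExp = life_exp, TotExp = tot_exp)

fit <- lm(LifeExp ~ TotExp, data = sim)
res <- resid(fit)

# Normality of residuals: a small p-value suggests non-normal residuals.
sw <- shapiro.test(res)

# Constant variance: regress the squared residuals on the predictor.
# Under homoscedasticity, n * R^2 of this auxiliary regression is
# approximately chi-squared with 1 degree of freedom.
aux     <- lm(I(res^2) ~ TotExp, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("Shapiro-Wilk p-value:", sw$p.value, "\n")
cat("Breusch-Pagan statistic:", bp_stat, " p-value:", bp_p, "\n")
```

With the real data, `sim` would be replaced by `countries` and `fit` by `model_1`.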
Therefore, we can conclude that simple linear regression is not appropriate for this dataset. We can still go ahead and fit the model as an exercise:
summary(model_1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
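The statistics the question asks about (F statistic, R^2, residual standard error, p-values) can also be pulled out of the summary object rather than read off the printout. This is a sketch under assumptions: the data frame `sim` is simulated so the block is self-contained, and the names in `stats` are illustrative.

```r
# Simulated stand-in for the real data (assumption: NOT the WHO dataset).
set.seed(1)
sim <- data.frame(TotExp = runif(190, 10, 5e5))
sim$LifeExp <- 65 + 6e-5 * sim$TotExp + rnorm(190, sd = 9)

s <- summary(lm(LifeExp ~ TotExp, data = sim))
f <- s$fstatistic                      # named vector: value, numdf, dendf

stats <- c(
  r_squared     = s$r.squared,
  adj_r_squared = s$adj.r.squared,
  resid_se      = s$sigma,             # residual standard error
  f_value       = f[["value"]],
  # The overall F-test p-value is not stored, so compute it from the F distribution.
  f_p_value     = pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE),
  slope_p_value = s$coefficients["TotExp", "Pr(>|t|)"]
)
print(signif(stats, 4))
```

In a regression with a single predictor, the overall F-test and the t-test on the slope are equivalent, which is why the two p-values in the printed summary above agree.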
Observations:
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
Raising the two columns with the mutate function, we get:
library(dplyr)  # provides %>% and mutate()
countries <- countries %>%
  mutate(LifeExpRaised = LifeExp^4.6,
         TotExpRaised = TotExp^0.06)
plot(LifeExpRaised ~ TotExpRaised, data=countries)
model_2 <- lm(LifeExpRaised ~ TotExpRaised, data=countries)
plot(resid(model_2) ~ countries$TotExpRaised )
abline(0,0)
The scatterplot shows a much more linear relationship between our ‘raised’ variables, and the residual plot shows the residuals spread more evenly around zero.
The model itself shows an improvement:
summary(model_2)
##
## Call:
## lm(formula = LifeExpRaised ~ TotExpRaised, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExpRaised 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Regarding this model:
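With the plots and the R^2 statistic (0.7298 versus 0.2577), we can conclude this is a ‘better’ model overall than the simple linear regression model. Note that the two residual standard errors are not directly comparable, because the transformed model measures error in units of LifeExp^4.6.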
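Because the two models have different response variables, one way to compare them on equal footing is to back-transform the fitted values of the power model and measure the error in years for both. The sketch below is illustrative only: it runs on simulated data (`sim` is an assumption, not the WHO dataset), and the helper `rmse` is a hypothetical name introduced here.

```r
# Simulated stand-in for the real data (assumption: NOT the WHO dataset).
set.seed(7)
n <- 190
tot_exp  <- exp(runif(n, log(10), log(5e5)))
life_exp <- pmin(pmax(35 + 3.8 * log(tot_exp) + rnorm(n, sd = 4), 30), 90)
sim <- data.frame(LifeExp = life_exp, TotExp = tot_exp)

fit_raw <- lm(LifeExp ~ TotExp, data = sim)
fit_pow <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = sim)

pred_raw <- fitted(fit_raw)
# Fitted values of the power model are on the LifeExp^4.6 scale;
# clamp at zero before taking the root so the back-transform is defined.
pred_pow <- pmax(fitted(fit_pow), 0)^(1 / 4.6)

# Root mean squared error in years, on the original scale for both models.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
errors <- c(raw = rmse(sim$LifeExp, pred_raw),
            power = rmse(sim$LifeExp, pred_pow))
print(errors)
```

With the real data, the same comparison would use `model_1`, `model_2`, and `countries$LifeExp`.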
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
(-736527910 + 620060216 * 1.5)^(1/4.6)
## [1] 63.31153
(-736527910 + 620060216 * 2.5)^(1/4.6)
## [1] 86.50645
We get estimates of 63.3 and 86.5 years, both of which appear reasonable.
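The same forecasts can be wrapped in a small function, which also makes it easy to see what raw spending each transformed value implies. This sketch hard-codes the coefficients printed in the summary of model_2; the function name `forecast_life_exp` is hypothetical.

```r
b0 <- -736527910   # intercept reported by summary(model_2)
b1 <-  620060216   # slope on TotExp^0.06 reported by summary(model_2)

# Predict on the LifeExp^4.6 scale, then invert the power transform.
forecast_life_exp <- function(tot_exp_raised) {
  (b0 + b1 * tot_exp_raised)^(1 / 4.6)
}

raised <- c(1.5, 2.5)
out <- data.frame(
  TotExpRaised = raised,
  TotExp       = raised^(1 / 0.06),          # implied raw expenditure
  LifeExp      = forecast_life_exp(raised)
)
print(out)
```

The implied raw expenditures are roughly 860 and 4.3 million, so the two forecasts sit at very different points of the spending range.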
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
We can see there are two quantitative terms and an interaction term combining them together.
multiple_model <- lm(LifeExp ~ PropMD +
TotExp +
PropMD*TotExp,
data=countries)
summary(multiple_model)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
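The F-statistic (34.49 on 3 and 186 DF) and its p-value (< 2.2e-16) show this model is a significant improvement over a model with no predictors, and all three terms are individually significant. However, R^2 is 0.3574, so the model explains only about 35.7% of the variance in life expectancy, while the ‘raised’ power model explains about 73% of the variance in its transformed response.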
This model doesn’t appear to perform as well as the last model.
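A note on the formula used for this model: in R's formula syntax, `PropMD * TotExp` already expands to both main effects plus the interaction, so listing the main effects separately is harmless but redundant. The sketch below checks this on simulated data (`sim` is an assumption, not the WHO dataset).

```r
# Simulated stand-in for the real data (assumption: NOT the WHO dataset).
set.seed(3)
sim <- data.frame(PropMD = runif(100, 0, 0.03),
                  TotExp = runif(100, 10, 5e5))
sim$LifeExp <- 60 + 1000 * sim$PropMD + 5e-5 * sim$TotExp + rnorm(100, sd = 5)

# Three ways of writing the same model.
fit_long  <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = sim)
fit_short <- lm(LifeExp ~ PropMD * TotExp, data = sim)
fit_colon <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = sim)

same_short <- isTRUE(all.equal(coef(fit_long), coef(fit_short)))
same_colon <- isTRUE(all.equal(coef(fit_long), coef(fit_colon)))
print(c(same_short = same_short, same_colon = same_colon))
print(names(coef(fit_long)))
```

The `:` operator adds only the interaction term, which is why the coefficient appears as `PropMD:TotExp` in the summary output.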
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
PropMD <- 0.03
TotExp <- 14
multiple_model$coefficients[1] +
multiple_model$coefficients[2] * PropMD +
multiple_model$coefficients[3] * TotExp +
multiple_model$coefficients[4] * PropMD * TotExp
## (Intercept)
## 107.696
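We get an estimate of 107.7 years, which is not realistic: no country has a life expectancy close to that value. The overestimate is understandable given the model's weak fit, with a residual standard error of 8.77 years and an R^2 of only about 36%. With that much unexplained variance, the model can substantially overestimate life expectancy, especially for an unusual combination of a very high PropMD and a very low TotExp.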
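Breaking the forecast into its four terms shows where the high value comes from. This sketch uses the rounded coefficients printed in the summary of multiple_model, so the total comes out near 107.68 rather than the 107.696 obtained from the unrounded coefficients; the vector names are illustrative.

```r
# Rounded coefficients copied from the summary of multiple_model.
coefs <- c(intercept = 6.277e+01, PropMD = 1.497e+03,
           TotExp = 7.233e-05, interaction = -6.026e-03)

PropMD <- 0.03
TotExp <- 14

# Contribution of each term to the predicted life expectancy, in years.
contrib <- c(
  intercept   = coefs[["intercept"]],
  PropMD      = coefs[["PropMD"]] * PropMD,
  TotExp      = coefs[["TotExp"]] * TotExp,
  interaction = coefs[["interaction"]] * PropMD * TotExp
)
print(round(contrib, 4))
total <- sum(contrib)
print(total)
```

The PropMD term alone adds about 44.9 years on top of the intercept, while the TotExp and interaction terms are negligible at this spending level.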