library(tidyverse)  # provides read_csv(), %>%, mutate() and tibble(), all used below
raw_data <- read_csv('C:\\Users\\Brian\\Desktop\\GradClasses\\Fall18\\605\\week12\\who.csv')
Provide a scatterplot of LifeExp ~ TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the \(F\) statistics, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
plot(raw_data$TotExp, raw_data$LifeExp)
lm.1 <- lm(LifeExp ~ TotExp, data=raw_data)
summary(lm.1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = raw_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
A simple linear regression was fit with LifeExp as the response and TotExp as the predictor. The F-statistic (65.26 on 1 and 188 degrees of freedom) and its p-value (7.714e-14) both indicate a statistically significant relationship between the two variables. The \(R^2\) of 0.2577 means that TotExp explains only about 26% of the variance in LifeExp, so the relationship is weak to moderate even though it is significant. The residual standard error of 9.371 means that observed life expectancies typically fall about 9.4 years from the fitted line. The standard error of the TotExp coefficient (7.795e-06) is small relative to the estimate (6.297e-05): the estimate sits about 8 standard errors from 0, which is why its p-value is so small.
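The statistics in the summary are not independent of one another. In a regression with a single predictor, the overall F-statistic equals the square of the slope's t value, and \(R^2\) can be recovered from the F-statistic and the residual degrees of freedom. A quick check, using only numbers copied from the printed output (the variable names are my own):

```r
# Values copied from the summary of lm.1 above
t_slope  <- 8.079   # t value for the TotExp coefficient
f_stat   <- 65.26   # overall F-statistic
df_resid <- 188     # residual degrees of freedom

# With one predictor, F is the square of the slope's t value
f_from_t <- t_slope^2

# With one predictor, R-squared = F / (F + residual df)
r2_from_f <- f_stat / (f_stat + df_resid)

round(c(F_from_t = f_from_t, R2_from_F = r2_from_f), 4)
```

Both values agree with the printed F-statistic and \(R^2\) up to rounding, which is why the F-test and the slope's t-test report the same p-value in this model.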
None of this matters, however, because the assumptions of linear regression are not met. The scatter plot shows fairly convincingly that the relationship between the predictor and the response is not linear. This can be verified by examining the residual plots of the regression. The conclusions drawn above should therefore not be used to make predictions. The nonlinearity needs to be addressed, for example by transforming the variables, before any conclusions can be drawn.
plot(lm.1)
Raise life expectancy to the \(4.6\) power (i.e., \(LifeExp^{4.6}\)). Raise total expenditures to the \(0.06\) power (nearly a log transform, \(TotExp^{.06}\)). Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{.06}\), and re-run the simple regression model using the transformed variables. Provide and interpret the \(F\) statistics, \(R^2\), standard error, and p-values. Which model is “better?”
transformed_data <- raw_data %>%
  mutate(TotExp = TotExp^0.06,    # columns keep their names but now hold transformed values
         LifeExp = LifeExp^4.6)
plot(transformed_data$TotExp, transformed_data$LifeExp)
lm.2 <- lm(LifeExp ~ TotExp, data=transformed_data)
summary(lm.2)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = transformed_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
The results this time are much better. The p-value and F-statistic (507.7 on 1 and 188 degrees of freedom) still indicate a strong, statistically significant relationship. This time the \(R^2\) is much higher at 0.7298, meaning that the transformed TotExp explains about 73% of the variance in the transformed LifeExp. The standard error of the TotExp coefficient (27518940) is small relative to the estimate (620060216), which sits more than 22 standard errors from 0. The residual standard error (90490000) is on the scale of \(LifeExp^{4.6}\), so it cannot be compared directly with the 9.371 from the first model. All of these values indicate a strong positive relationship between the two transformed variables. Furthermore, the scatter plot and residual plots indicate that the relationship between them is roughly linear.
As a result, the “better” model is the second, transformed model. In fact, it is the only valid model of the two and thus wins by default. That said, the evidence also suggests that this model has strong predictive ability.
plot(lm.2)
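Because the transformed columns reuse the names TotExp and LifeExp, every coefficient and prediction from lm.2 is on the transformed scale. As an alternative sketch, the same fit can be written with the powers inside the formula using `I()`, which keeps the original column names meaningful. The data frame `toy` below is synthetic and stands in for the WHO data purely for illustration; its values are made up.

```r
# Synthetic stand-in for the WHO data (made-up values, for illustration only)
set.seed(1)
toy <- data.frame(TotExp = runif(50, 10, 5000))
toy$LifeExp <- 40 + 5 * log(toy$TotExp) + rnorm(50)

# Approach used above: overwrite the columns, then fit
toy_t <- transform(toy, TotExp = TotExp^0.06, LifeExp = LifeExp^4.6)
fit_cols <- lm(LifeExp ~ TotExp, data = toy_t)

# Alternative: apply the powers inside the formula with I()
fit_formula <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = toy)

# Both approaches give the same intercept and slope
same_fit <- isTRUE(all.equal(unname(coef(fit_cols)), unname(coef(fit_formula))))
same_fit
```

With the second form, `newdata` passed to `predict()` would hold raw TotExp values rather than \(TotExp^{.06}\), although the predictions themselves would still come back on the \(LifeExp^{4.6}\) scale.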
Using the result, forecast life expectancy when \(TotExp^{.06} = 1.5\). Then forecast life expectancy when \(TotExp^{.06} = 2.5\).
test_data <- data.frame(TotExp=c(1.5, 2.5))
predict(lm.2, newdata=test_data)
## 1 2
## 193562414 813622630
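The two values printed above are on the \(LifeExp^{4.6}\) scale, because that is the response lm.2 was fit to. To express them as life expectancies they need to be raised to the power \(1/4.6\). A minimal sketch of that back-transformation, taking the printed predictions as given (the variable names are my own):

```r
# Predictions from lm.2, copied from the output above (LifeExp^4.6 scale)
pred_transformed <- c(193562414, 813622630)

# Undo the 4.6 power to return to the original LifeExp scale
pred_life_exp <- pred_transformed^(1 / 4.6)
round(pred_life_exp, 1)
```

This works out to roughly 63.3 and 86.5 years. With the fitted model available, `predict(lm.2, newdata=test_data)^(1 / 4.6)` should give the same forecasts in one step.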
Build the following multiple regression model and interpret the \(F\) Statistics, \(R^2\), standard error, and p-values. How good is the model?
\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
lm.3 <- lm(LifeExp ~ TotExp + PropMD + I(TotExp * PropMD), data=raw_data)
summary(lm.3)
##
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + I(TotExp * PropMD),
## data = raw_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## I(TotExp * PropMD) -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
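The F-statistic (34.49 on 3 and 186 degrees of freedom) and its p-value indicate that the model as a whole is statistically significant, and each coefficient, including the interaction term, has a p-value below 0.001. The \(R^2\) of 0.3574 is far lower than that of the transformed regression, although the two are not directly comparable because the responses are on different scales. Compared with the first regression (0.2577), the added terms explain roughly 10 more percentage points of the variance in LifeExp. The residual standard error (8.765) is similar to that of the first regression (9.371), and further examination would be required to determine whether the difference is significant. If it is not, this would indicate that the additional predictors have little predictive ability.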
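One way to examine whether the added terms matter is a partial F-test between the first regression and this one, since the first model is nested inside the third. The sketch below rebuilds the residual sums of squares from the rounded residual standard errors printed in the two summaries, so the result is approximate. The variable names are my own.

```r
# Values copied from the summaries of lm.1 and lm.3 above
rse_1 <- 9.371; df_1 <- 188   # model with TotExp only
rse_3 <- 8.765; df_3 <- 186   # model with TotExp, PropMD and their interaction

# Residual sum of squares = (residual standard error)^2 * residual df
rss_1 <- rse_1^2 * df_1
rss_3 <- rse_3^2 * df_3

# Partial F-statistic for the two added terms
f_partial <- ((rss_1 - rss_3) / (df_1 - df_3)) / (rss_3 / df_3)
p_partial <- pf(f_partial, df_1 - df_3, df_3, lower.tail = FALSE)
c(F = f_partial, p = p_partial)
```

With these rounded inputs the statistic comes out near 14.4 on 2 and 186 degrees of freedom, with a p-value far below 0.001, which suggests the reduction in residual error is larger than chance alone would explain. With the fitted objects available, `anova(lm.1, lm.3)` performs the same comparison without the rounding.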
Most concerning, however, are the plots below. There appears to be a clear pattern in the residuals, and the points in the Q-Q plot do not adhere to the line. This indicates that the assumptions of linear regression are not met for this model and that its results should not be used. Perhaps a transformation, as in the second regression, could help alleviate these issues.
plot(lm.3)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
test.data <- tibble(PropMD = c(0.03), TotExp = c(14))
predict(lm.3, newdata=test.data)
## 1
## 107.696
These results are concerning and should be taken with a grain of salt. A forecast life expectancy of 107.7 years is not realistic, as it is far higher than the life expectancy of any country. Ignoring the issues with the regression model for a moment, the inputs themselves may not produce good predictions. Regression predictions become unreliable when the inputs fall outside the range of the data used to fit the model. For example, if a predictor ranges from 0 to 100 in the original data, then a prediction at an input of 150 is unreliable. Technically neither of the test inputs is outside the range of the original predictors; however, they are at the extreme upper and lower ends of those ranges. In addition, they fall within the range only because the data contain a number of outliers. This is supported by the above plot, which shows a high-leverage outlier, a point that should be addressed in the original regression.
Taken together, this indicates that one should be very careful when interpreting these results. I personally would not feel comfortable relying on the accuracy of this prediction.
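To see where the 107.7 comes from, the forecast can be rebuilt by hand from the coefficients in the summary of lm.3. The sketch below uses the rounded coefficients as printed, so it will differ slightly from the value returned by `predict()`. The variable names are my own.

```r
# Rounded coefficients copied from the summary of lm.3 above
b <- c(intercept = 6.277e+01, TotExp = 7.233e-05,
       PropMD = 1.497e+03, interaction = -6.026e-03)

# Inputs from the forecast: TotExp = 14, PropMD = 0.03
x <- c(1, 14, 0.03, 14 * 0.03)

# Contribution of each term to the predicted LifeExp
contributions <- b * x
round(contributions, 3)

# Sum of the terms, close to the 107.696 returned by predict()
forecast_by_hand <- sum(contributions)
forecast_by_hand
```

The hand calculation lands near 107.68. Almost 45 of those years come from the PropMD term alone, while the TotExp and interaction terms contribute almost nothing at these inputs, so the forecast depends heavily on a PropMD value that sits at the extreme upper end of the data.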