- Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
df = read.csv("https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data605/who.csv")
lm_exp <- lm(LifeExp~TotExp, data=df)
plot(LifeExp~TotExp, data=df)
abline(lm_exp)

summary(lm_exp)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -24.764  -4.778   3.154   7.116  13.292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
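#The p-values for the F-test and for the TotExp coefficient (both about 7.7e-14) are far below 0.05, so we reject the null hypothesis that the slope is zero; the relationship is statistically significant.
#However, R-squared is only 0.2577 (adjusted 0.2537), so the model explains only about a quarter of the variation in LifeExp.
#The residual standard error is 9.371, so predictions are typically off by about 9.4 years of life expectancy, which is higher than we want.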
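The statistics in the summary are tied together by simple identities, which makes the interpretation easy to sanity-check. The sketch below uses only numbers printed in the output above (the slope t value and the residual degrees of freedom); the variable names are illustrative and not part of the original analysis. Because the printed t value is rounded, the results only approximately reproduce the printed F statistic and R-squared values.

```r
# Values copied from the printed summary(lm_exp) output
t_slope  <- 8.079    # t value for TotExp
df_resid <- 188      # residual degrees of freedom
n <- df_resid + 2    # observations: residual df plus 2 estimated coefficients

# With a single predictor, the F statistic equals the squared slope t value
f_stat <- t_slope^2

# R^2 can be recovered from F and the residual degrees of freedom
r2 <- f_stat / (f_stat + df_resid)

# Adjusted R^2 penalises for the one predictor in the model
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)

round(c(F = f_stat, R2 = r2, adj_R2 = adj_r2), 4)
```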
#Diagnostic plot
plot(lm_exp)




#From the diagnostic plots, the residuals are not centered around zero across the range of fitted values.
#The points in the normal Q-Q plot do not follow the theoretical line well.
#The residuals vs. fitted plot suggests that the constant-variance condition also fails.
#Both the Q-Q plot and the residuals vs. fitted plot indicate that the residuals are not normally distributed.
#Residual analysis - Histogram and Summary
hist(resid(lm_exp))

summary(resid(lm_exp))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -24.764  -4.778   3.154   0.000   7.116  13.292
#Since mean (0) < median (3.154), the residuals are negatively (left) skewed.
#Overall, the assumptions of simple linear regression are not met: the residuals are not normally distributed and their variance is not constant.
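The mean-versus-median comparison is a quick rule of thumb for skewness. As a complement, the sketch below computes a moment-based sample skewness; `sample_skewness` and `toy_resid` are hypothetical names, and the toy vector is made-up data used only to show the sign convention, not the actual residuals.

```r
# Sample skewness: third central moment divided by the cubed standard deviation
# (population-style moments; negative values indicate a longer left tail)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy residual-like vector with a long left tail (illustrative data only)
toy_resid <- c(-25, -5, 3, 4, 6, 7, 10)
toy_skew <- sample_skewness(toy_resid)
toy_skew < 0   # a negative value signals left skew

# With the fitted model from this section one could run, for example:
# sample_skewness(resid(lm_exp))
# shapiro.test(resid(lm_exp))   # formal normality test from the stats package
```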
- Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
LifeExp_power <- (df$LifeExp)^4.6
TotExp_power <- (df$TotExp)^0.06
df['LifeExp_power'] <- LifeExp_power
df['TotExp_power'] <- TotExp_power
lm_exp_power <- lm(LifeExp_power~TotExp_power, data=df)
plot(LifeExp_power~TotExp_power, data=df)
abline(lm_exp_power)

summary(lm_exp_power)
##
## Call:
## lm(formula = LifeExp_power ~ TotExp_power, data = df)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -308616089  -53978977   13697187   59139231  211951764
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -736527910   46817945  -15.73   <2e-16 ***
## TotExp_power  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
plot(lm_exp_power)




#Residual analysis - Histogram and Summary
hist(resid(lm_exp_power))

summary(resid(lm_exp_power))
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## -308616089  -53978977   13697187          0   59139231  211951764
#The Q-Q plot and the residuals vs. fitted plot suggest that the residuals of model 2 are closer to a normal distribution than those of model 1.
#However, since mean (0) < median (13697187), the residuals are still negatively skewed.
#The F-statistic is 507.7 and adjusted R^2 is 0.7283, so the model explains about 73% of the variation in LifeExp^4.6.
#The p-values for both the F-statistic and TotExp_power are less than 0.05.
#The residual standard error is 90490000, but because the response has been rescaled it cannot be compared directly with the 9.371 from model 1.
#Relative to the size of the TotExp_power coefficient, the residual standard error is smaller in model 2 (a rough comparison; the ratios are computed below).
#Overall, model 2 is much better than model 1.
model1_stderr_relative = 9.371 / 6.297e-05
model2_stderr_relative = 90490000 / 620060216
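Dividing the residual standard error by the slope is only a rough way to compare models whose responses are on different scales. One alternative, sketched below, is to back-transform the fitted values of the power model and compute the error on the original response scale for both models. The data here are simulated stand-ins (not the WHO data), and `sim`, `fit_raw`, `fit_power`, and `rmse` are hypothetical names introduced for illustration.

```r
# Simulated stand-in data (NOT the WHO data), generated so that the
# transformed relationship is linear and every transformed response is positive
set.seed(1)
x_power <- runif(200, min = 1.5, max = 2.5)              # plays the role of TotExp^0.06
y_power <- -7e8 + 6e8 * x_power + rnorm(200, sd = 3e7)   # plays the role of LifeExp^4.6
sim <- data.frame(x = x_power^(1 / 0.06),                # back to the raw predictor scale
                  y = y_power^(1 / 4.6))                 # back to the raw response scale

fit_raw   <- lm(y ~ x, data = sim)                       # untransformed model
fit_power <- lm(I(y^4.6) ~ I(x^0.06), data = sim)        # transformed model

# Root-mean-square error between observed and predicted values
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

rmse_raw <- rmse(sim$y, fitted(fit_raw))
# Undo the 4.6 power; this needs positive fitted values, otherwise the root is NaN
rmse_power <- rmse(sim$y, fitted(fit_power)^(1 / 4.6))

c(raw = rmse_raw, transformed = rmse_power)
```

Both errors are then in the units of the original response, so they can be compared directly. The same idea could be tried on `lm_exp` and `lm_exp_power`, provided every fitted value of the power model is positive.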
- Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
case1 = predict(lm_exp_power, data.frame(TotExp_power=1.5))
case2 = predict(lm_exp_power, data.frame(TotExp_power=2.5))
Life_exp_converted1 = case1^(1/4.6)
Life_exp_converted2 = case2^(1/4.6)
#After converting back to the original scale, the forecast life expectancy is 63.31153 years for the first case and 86.50645 years for the second case.
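The same forecasts can be reproduced by hand from the coefficients printed in the model 2 summary, which makes the back-transformation step explicit. This is a minimal sketch; `forecast_life_exp` is a hypothetical helper, not part of the original code.

```r
# Coefficients copied from the printed summary(lm_exp_power) output
b0 <- -736527910   # intercept
b1 <- 620060216    # slope on TotExp^0.06

# Forecast on the transformed scale (LifeExp^4.6), then undo the 4.6 power
forecast_life_exp <- function(tot_exp_power) {
  (b0 + b1 * tot_exp_power)^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
round(forecasts, 2)
```

The results should agree with `Life_exp_converted1` and `Life_exp_converted2` up to the rounding of the printed coefficients.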
- Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
lm_exp_multi = lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data=df)
summary(lm_exp_multi)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -27.320  -4.132   2.098   6.540  13.074
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
plot(lm_exp_multi)



## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

hist(resid(lm_exp_multi))

summary(resid(lm_exp_multi))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -27.320  -4.132   2.098   0.000   6.540  13.074
#The p-values for every coefficient and for the F-statistic are all below 0.05, so the results are statistically significant.
#We reject the null hypothesis that all coefficients are zero.
#Adjusted R^2 is 0.3471, which is higher than model 1 (0.2537) but lower than model 2 (0.7283).
#Relative to the size of the coefficients, the residual standard error is smaller than in model 1 but not smaller than in model 2; in absolute terms it is 8.765 versus 9.371 for model 1.
#According to the Q-Q plot and the residuals vs. fitted plot, the residuals are not normally distributed, similar to model 1.
#Since mean (0) < median (2.098), the residuals are still negatively skewed.
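The model formula lists the main effects and also the product `(PropMD * TotExp)`. In R's formula syntax `*` already expands to both main effects plus their interaction, which is why the summary shows exactly three predictors. The sketch below checks this with `terms()` on the formulas alone, so no data are needed; `full` and `short` are illustrative names.

```r
# Term labels R derives from the formula used in this section
full <- attr(terms(LifeExp ~ PropMD + TotExp + (PropMD * TotExp)), "term.labels")

# The shorter form: `*` expands to main effects plus the interaction
short <- attr(terms(LifeExp ~ PropMD * TotExp), "term.labels")

full
short
identical(full, short)   # both formulas describe the same set of terms
```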
- Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
case3 = predict(lm_exp_multi, data.frame(PropMD = 0.03, TotExp=14))
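#The forecast is not realistic: the predicted LifeExp is above 100 years (about 107.7), which is beyond any plausible national average life expectancy.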
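To see where the implausible value comes from, the forecast can be rebuilt term by term from the coefficients printed in the model summary. This is a sketch using the rounded printed values, so it matches `case3` only approximately; the variable names are illustrative.

```r
# Coefficients copied from the printed summary(lm_exp_multi) output (rounded)
b0       <- 6.277e+01    # intercept
b_propmd <- 1.497e+03    # PropMD
b_totexp <- 7.233e-05    # TotExp
b_inter  <- -6.026e-03   # PropMD:TotExp interaction

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast
contributions <- c(
  intercept   = b0,
  PropMD      = b_propmd * prop_md,
  TotExp      = b_totexp * tot_exp,
  interaction = b_inter * prop_md * tot_exp
)

life_exp_hat <- sum(contributions)
round(contributions, 4)
round(life_exp_hat, 2)
```

Almost all of the increase over the intercept comes from the PropMD term (about 44.9 years), while the TotExp and interaction terms are negligible at TotExp = 14.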