1, Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

who <- read.csv("https://raw.githubusercontent.com/Zchen116/assignments/master/who.csv",header=TRUE, sep=",")
head(who)
##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp
## 1 0.000572294      20      92    112
## 2 0.004614439     169    3128   3297
## 3 0.002091362     108    5184   5292
## 4 0.003500000    2589  169725 172314
## 5 0.001146162      36    1620   1656
## 6 0.002773810     503   12543  13046
summary(who)
##                 Country       LifeExp      InfantSurvival   Under5Survival  
##  Afghanistan        :  1   Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Albania            :  1   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Algeria            :  1   Median :70.00   Median :0.9785   Median :0.9745  
##  Andorra            :  1   Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##  Angola             :  1   3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##  Antigua and Barbuda:  1   Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##  (Other)            :184                                                    
##      TBFree           PropMD              PropRN             PersExp       
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883   Min.   :   3.00  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455   1st Qu.:  36.25  
##  Median :0.9992   Median :0.0010474   Median :0.0027584   Median : 199.50  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336   Mean   : 742.00  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164   3rd Qu.: 515.25  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387   Max.   :6350.00  
##                                                                            
##     GovtExp             TotExp      
##  Min.   :    10.0   Min.   :    13  
##  1st Qu.:   559.5   1st Qu.:   584  
##  Median :  5385.0   Median :  5541  
##  Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :476420.0   Max.   :482750  
## 
life <- lm(LifeExp ~ TotExp, who)
summary(life)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
plot(LifeExp ~ TotExp, who, xlab="Total Expenditures", ylab="Life Expectancy", main="LifeExp vs. TotExp")
abline(life)

plot(life$fitted.values, life$residuals, xlab="Fitted Values", ylab="Residuals",
     main="Fitted Values vs.Residuals")

qqnorm(life$residuals)
qqline(life$residuals)

R^2: The regression output gives R^2 = 0.2577, so about 25.8% of the variation in life expectancy is explained by total expenditure.

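Standard Error: The standard error of the TotExp coefficient (7.795e-06) is approximately 8x smaller than the coefficient itself (6.297e-05). The residual standard error is 9.371 on 188 degrees of freedom.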

P-value: 7.714e-14, which is far below 0.05, so we reject the null hypothesis that the slope is zero. Total expenditure is a statistically significant predictor of life expectancy.

F-Statistic: 65.26 on 1 and 188 degrees of freedom. This is large and agrees with the small p-value, indicating a relationship between the independent and dependent variables.
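
As a consistency check, the t value, F statistic, and R^2 of a simple regression can be reproduced from the printed output alone. The sketch below hard-codes the values reported by summary(life); the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(life) output
estimate <- 6.297e-05   # TotExp coefficient
std_err  <- 7.795e-06   # its standard error
f_stat   <- 65.26       # F-statistic on 1 and 188 DF
df_model <- 1           # numerator degrees of freedom
df_resid <- 188         # residual degrees of freedom

# t value: how many standard errors the coefficient lies from zero
t_value <- estimate / std_err

# With a single predictor, the F statistic equals the squared t value
f_from_t <- t_value^2

# R^2 can be recovered from F and the two degrees of freedom
r_squared <- (f_stat * df_model) / (f_stat * df_model + df_resid)

round(c(t = t_value, F = f_from_t, R2 = r_squared), 4)
```

The three numbers should agree with the printed t value, F-statistic, and Multiple R-squared up to rounding of the inputs.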

The residual plot shows that the variability is not constant, and the Q-Q plot shows that the residuals are not normally distributed, so the assumptions of simple linear regression are not met.

2, Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

LifeExp_46 <- who$LifeExp^4.6
TotExp_06 <- who$TotExp^0.06
life2 <- lm(LifeExp_46 ~ TotExp_06, who)
summary(life2)
## 
## Call:
## lm(formula = LifeExp_46 ~ TotExp_06, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_06    620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
plot(LifeExp_46 ~ TotExp_06, who, xlab="TotExp^0.06", ylab="LifeExp^4.6", main="LifeExp^4.6 vs. TotExp^0.06")

abline(life2)

plot(life2$fitted.values, life2$residuals, xlab="Fitted Values", ylab="Residuals", main="Fitted Values vs.Residuals")

qqnorm(life2$residuals)
qqline(life2$residuals)

R^2: The regression output gives R^2 = 0.7298, so about 73% of the variation in LifeExp^4.6 is explained by TotExp^0.06. This is much better than the first model (0.2577).

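Standard Error: The standard error of the TotExp_06 coefficient is approximately 22x smaller than the coefficient itself. The residual standard error (90490000) is on the LifeExp^4.6 scale, so it cannot be compared directly with the 9.371 of the first model.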

P-value: It is lower than 2.2e-16, so we reject the null hypothesis that the slope is zero. The transformed total expenditure is a significant predictor of the transformed life expectancy.

F-Statistic: 507.7 on 1 and 188 degrees of freedom. This is much larger than the 65.26 of the first model, indicating a stronger relationship between the independent and dependent variables.

The residual plot shows that the variability is more constant than in the previous model, and the Q-Q plot shows that the residuals are nearly normal. On these measures the transformed model is the better one.

3, Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

result_3 <- data.frame(TotExp_06=c(1.5,2.5))
predict(life2, result_3, interval="prediction")^(1/4.6)
##        fit      lwr      upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414
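
The fit column can be reproduced by hand from the printed coefficients of life2, which makes the back-transformation explicit: the model predicts LifeExp^4.6, so the result has to be raised to the power 1/4.6 to get years. A minimal sketch with illustrative variable names:

```r
# Coefficients copied from the summary(life2) output
b0 <- -736527910   # intercept
b1 <-  620060216   # slope for TotExp^0.06

x <- c(1.5, 2.5)   # values of TotExp^0.06 to forecast at

# The model predicts LifeExp^4.6 ...
pred_transformed <- b0 + b1 * x

# ... so undo the transformation to get life expectancy in years
pred_years <- pred_transformed^(1 / 4.6)
pred_years
```

This reproduces only the point forecasts. The lwr and upr bounds depend on the residual variance and the design, so they still require predict().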

4, Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

result_4 <- lm(LifeExp ~ PropMD + TotExp + TotExp:PropMD, who)
summary(result_4)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + TotExp:PropMD, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
plot(result_4$fitted.values, result_4$residuals, xlab="Fitted Values", ylab="Residuals",
 main="Fitted Values vs.Residuals")

qqnorm(result_4$residuals)
qqline(result_4$residuals)

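R^2: The regression output gives R^2 = 0.3574 (adjusted 0.3471), so about 36% of the variation in life expectancy is explained by PropMD, TotExp, and their interaction. This is better than the first model (0.2577) but well below the transformed model (0.7298).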

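Standard Error: The standard errors are approximately 5x (PropMD), 8x (TotExp), and 4x (PropMD:TotExp) smaller than the corresponding coefficients. The residual standard error is 8.765 on 186 degrees of freedom, slightly lower than the 9.371 of the first model.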

P-value: The overall p-value is lower than 2.2e-16, and the p-values of all three terms (2.32e-07, 9.39e-14, and 6.35e-05) are far below 0.05, so PropMD, TotExp, and their interaction are all significant predictors of life expectancy.

F-Statistic: 34.49 on 3 and 186 degrees of freedom. This is smaller than in the previous two models, but with a p-value below 2.2e-16 the model as a whole is still significant.

The residual plot shows that the variability is not constant, and the Q-Q plot shows that the residuals are not normally distributed. The model is significant, but it is not a good fit.
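
Because this model uses three predictors and the first model only one, adjusted R^2 is the fairer comparison: it penalizes each additional term. The sketch below recomputes it from the printed R^2 values; adj_r2 is an illustrative helper, and n = 190 is inferred from the residual degrees of freedom plus the number of estimated coefficients (188 + 2 and 186 + 4).

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 190  # residual DF + number of estimated coefficients

adj_simple      <- adj_r2(0.2577, n, p = 1)  # LifeExp ~ TotExp
adj_interaction <- adj_r2(0.3574, n, p = 3)  # PropMD, TotExp, PropMD:TotExp

round(c(simple = adj_simple, interaction = adj_interaction), 4)
```

Both values should match the Adjusted R-squared lines of the two summaries up to rounding, and the interaction model stays ahead of the first model even after the penalty.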

5, Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

From the coefficients of result_4 (note that the interaction coefficient is negative):

Life Expectancy (MR) = 62.8 + .000072 * Total Expenditure + 1,497 * PropMD - .006 * Total Expenditure * PropMD

result_5 <- 62.8 + .000072 * 14 + 1497 * 0.03 - .006 * 14 * 0.03
result_5
## [1] 107.7085

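Life expectancy is predicted to be about 107.7 years, which is unrealistic: it is far above the maximum of 83 observed in the data. The inputs are also an unlikely combination, since PropMD = 0.03 is close to the largest value in the data (0.0351) while TotExp = 14 is close to the smallest (13).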
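
The same forecast can be sketched with the coefficients at the precision printed by summary(result_4), keeping the negative sign of the interaction term, and then compared with the largest life expectancy in summary(who). Variable names here are illustrative. Because the inputs are rounded to four significant digits, the decimals will differ slightly from other ways of computing the forecast.

```r
# Coefficients as printed by summary(result_4)
b0 <- 6.277e+01    # intercept
b1 <- 1.497e+03    # PropMD
b2 <- 7.233e-05    # TotExp
b3 <- -6.026e-03   # PropMD:TotExp (negative in the output)

prop_md <- 0.03
tot_exp <- 14

forecast <- b0 + b1 * prop_md + b2 * tot_exp + b3 * prop_md * tot_exp

# Largest LifeExp reported by summary(who)
max_life_exp <- 83

forecast
forecast > max_life_exp   # TRUE means the forecast exceeds every observed value
```

With the fitted model object available, predict(result_4, data.frame(PropMD = 0.03, TotExp = 14)) avoids the rounding altogether.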