and run simple linear regression. Do not transform the variables. Provide and interpret the F statistic, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
First we can load it into a data frame:
who <- read.csv("who.csv", stringsAsFactors = FALSE) # read.csv() already returns a data frame
head(who) # look at the top of the df
## Country LifeExp InfantSurvival Under5Survival TBFree
## 1 Afghanistan 42 0.835 0.743 0.99769
## 2 Albania 71 0.985 0.983 0.99974
## 3 Algeria 71 0.967 0.962 0.99944
## 4 Andorra 82 0.997 0.996 0.99983
## 5 Angola 41 0.846 0.740 0.99656
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991
## PropMD PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294 20 92 112
## 2 0.001143127 0.004614439 169 3128 3297
## 3 0.001060478 0.002091362 108 5184 5292
## 4 0.003297297 0.003500000 2589 169725 172314
## 5 0.000070400 0.001146162 36 1620 1656
## 6 0.000142857 0.002773810 503 12543 13046
str(who) #look at the data stored in the df
## 'data.frame': 190 obs. of 10 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ LifeExp : int 42 71 71 82 41 73 75 69 82 80 ...
## $ InfantSurvival: num 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
## $ Under5Survival: num 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
## $ TBFree : num 0.998 1 0.999 1 0.997 ...
## $ PropMD : num 2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
## $ PropRN : num 0.000572 0.004614 0.002091 0.0035 0.001146 ...
## $ PersExp : int 20 169 108 2589 36 503 484 88 3181 3788 ...
## $ GovtExp : int 92 3128 5184 169725 1620 12543 19170 1856 187616 189354 ...
## $ TotExp : int 112 3297 5292 172314 1656 13046 19654 1944 190797 193142 ...
pairs(who[, -1], gap = 0.5, col = "orangered") # [, -1] drops the Country column
We can now run the linear regression:
fit1 <- lm(LifeExp ~ TotExp, data = who)
summary(fit1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
plot(who$TotExp, who$LifeExp, xlab = "Total Expenditures ($)" ,ylab = "Life Expectancy (yrs)", col = "steelblue")
abline(fit1, col="yellow3")
hist(resid(fit1), main = "Histogram of Residuals", xlab = "residuals")
plot(fitted(fit1), resid(fit1))
The p-value for the slope (7.71e-14) suggests a statistically significant linear association between total expenditures and life expectancy, since \(p \ll 0.05\). The R\(^2\) of 0.2577 means that about 25.77% of the variability of life expectancy about its mean is explained by the model; this corresponds to a correlation of about 0.51, a moderate-to-weak relationship. The F-statistic (65.26 on 1 and 188 degrees of freedom) tells us that adding the variable ‘total expenditures’ improves the model compared to only having an intercept; in simple regression it is the square of the slope's t value, so it carries the same p-value. The residual standard error tells us that, if the residuals are normally distributed, about 68% of them are within \(\pm 9.371\) years of the fitted line. Taken alone, these statistics suggest the predictor is informative, but they do not show whether the model's assumptions hold.
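The quantities in the summary are tied together by simple identities, which can be checked from the printed values alone. The following is a minimal sketch that re-enters the rounded numbers from the output above by hand; the variable names are illustrative and not part of the original analysis.

```r
# Values copied (rounded) from summary(fit1); names are illustrative.
t_slope  <- 8.079    # t value of the TotExp slope
f_stat   <- 65.26    # overall F-statistic
df_model <- 1        # numerator degrees of freedom
df_resid <- 188      # residual degrees of freedom

# In simple regression the F-statistic is the squared slope t value.
f_from_t <- t_slope^2

# R^2 can be recovered from F and the degrees of freedom.
r2_from_f <- (df_model * f_stat) / (df_model * f_stat + df_resid)

# Upper-tail p-value of the F-statistic.
p_from_f <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

# Share of a normal distribution within one standard deviation of its mean.
within_one_sd <- pnorm(1) - pnorm(-1)

print(c(F_from_t = f_from_t, R2_from_F = r2_from_f, within_one_sd = within_one_sd))
print(p_from_f)
```

The recovered values should agree with the printed F-statistic and R^2 up to rounding, and the last quantity is the usual "about 68%" rule used when reading the residual standard error.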
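The linear model, when plotted over the data, does not match the data very closely. Furthermore, the residual analysis shows that the residuals are strongly skewed and do not show constant variance. The summary above gives a median residual of 3.154 against a minimum of -24.764 and a maximum of 13.292, which indicates a long left tail rather than a right skew. Therefore, the assumptions of simple linear regression are not met, and the linear model is not valid in this case.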
(i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, R^2, standard error, and p-values. Which model is “better?”
le_4.6 <- who$LifeExp^4.6
te_0.06 <- who$TotExp^0.06
fit2 <- lm(le_4.6 ~ te_0.06)
summary(fit2)
##
## Call:
## lm(formula = le_4.6 ~ te_0.06)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## te_0.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
plot(who$TotExp^0.06, who$LifeExp^4.6, xlab = "Total Expenditures^0.06 ($^0.06)", ylab = "Life Expectancy^4.6 (yrs^4.6)", col = "steelblue")
abline(fit2, col="yellow3")
hist(resid(fit2), main = "Histogram of Residuals", xlab = "residuals")
plot(fitted(fit2), resid(fit2))
The p-value suggests a statistically significant linear association between total expenditures^0.06 and life expectancy^4.6, since \(p \ll 0.05\). The R\(^2\) of 0.7298 means that about 72.98% of the variability of life expectancy^4.6 about its mean is explained by the model, a moderately strong relationship. The F-statistic tells us that adding the variable ‘total expenditures^0.06’ improves the model compared to only having an intercept; 507.7 is much larger than 65.26, so the fit is better than before. The residual standard error tells us that, if the residuals are normally distributed, about 68% of them are within \(\pm 90490000\) years^4.6. Note that this value is in units of years^4.6, so it cannot be compared directly with the 9.371 years of the first model. These statistics suggest we have a useful model.
The linear model, when plotted over the transformed data, matches the data more closely. Furthermore, the residual analysis shows that the residuals are approximately normally distributed and show roughly constant variance; there is no noticeable trend. Therefore, the assumptions of simple linear regression are reasonably met and the linear model is valid in this case.
This model is better than in part 1.
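The exercise describes TotExp^0.06 as nearly a log transform. A quick way to see why is to compare the two transforms over a range of expenditures. The grid below is hypothetical (it only roughly spans the magnitudes seen in the data) and is not taken from who.csv.

```r
# Hypothetical grid of expenditures from 10 to 1,000,000 (not real data).
tot_exp <- 10^seq(1, 6, length.out = 50)

pow_transform <- tot_exp^0.06   # transform used in the exercise
log_transform <- log(tot_exp)   # natural log, for comparison

# x^0.06 equals exp(0.06 * log(x)), a smooth increasing function of log(x).
# A correlation near 1 means the two transforms are almost linearly related
# over this range, so they compress large expenditures in a similar way.
transform_cor <- cor(pow_transform, log_transform)
print(transform_cor)
```

A correlation close to 1 here would support reading the power transform as a stand-in for a log transform on this range.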
forecast life expectancy when TotExp^.06 =1.5.
Then forecast life expectancy when TotExp^.06=2.5.
\[ \begin{aligned} y &= -736527910 + 620060216x \\ y &= -736527910 + 620060216(1.5) \\ y &= 193562414 \\ le &= y^{1/4.6} \\ le &= 193562414^{1/4.6} \\ le &= 63.31153 \text{ years} \end{aligned} \]
Life expectancy is about 63.3 years when TotExp^0.06 = 1.5.
\[ \begin{aligned} y &= -736527910 + 620060216x \\ y &= -736527910 + 620060216(2.5) \\ y &= 813622630 \\ le &= y^{1/4.6} \\ le &= 813622630^{1/4.6} \\ le &= 86.50645 \text{ years} \end{aligned} \]
Life expectancy is about 86.5 years when TotExp^0.06 = 2.5.
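The two forecasts can also be reproduced in R. This is a minimal sketch that hard-codes the coefficients printed by summary(fit2); the helper name `forecast_life_exp` is illustrative and not part of the original analysis.

```r
# Coefficients copied from summary(fit2) above.
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^0.06

# Predict LifeExp^4.6, then undo the power transform to get years.
forecast_life_exp <- function(tot_exp_pow) {
  life_exp_pow <- b0 + b1 * tot_exp_pow
  life_exp_pow^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(round(forecasts, 1))
```

With the fitted object available, something like `predict(fit2, newdata = data.frame(te_0.06 = c(1.5, 2.5)))^(1/4.6)` should give the same result up to coefficient rounding.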
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp and interpret the F statistic, R^2, standard error, and p-values. How good is the model?
fit3 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who)
summary(fit3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
hist(resid(fit3), main = "Histogram of Residuals", xlab = "residuals")
plot(fitted(fit3), resid(fit3))
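The overall p-value is < 2.2e-16, far below 0.05, so the model is statistically significant, and each individual term (PropMD, TotExp and their interaction) is also significant at the 0.001 level. The F-statistic of 34.49 on 3 and 186 degrees of freedom tells us that the model with these three terms performs better than the intercept alone. It is smaller than the 65.26 of the first model because the explained variation is now averaged over three terms rather than one, not because the fit is worse. The R\(^2\) of 0.3574 means that 35.74% of the variability of life expectancy about its mean is explained by these three terms, up from 25.77% with TotExp alone but still a fairly weak fit. The residual standard error of 8.765 means that, if the residuals are normally distributed, about 68% of them will be within \(\pm\) 8.765 years.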
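Because the model with only TotExp is nested inside this one, the gain from adding PropMD and the interaction can be gauged from the two R^2 values. The sketch below uses only the rounded numbers printed in the two summaries; the variable names are illustrative.

```r
# Values copied (rounded) from summary(fit1) and summary(fit3).
n         <- 190      # number of observations
r2_simple <- 0.2577   # LifeExp ~ TotExp
r2_inter  <- 0.3574   # LifeExp ~ PropMD + TotExp + PropMD:TotExp
df_resid  <- 186      # residual degrees of freedom of the larger model
extra     <- 2        # terms added: PropMD and PropMD:TotExp

# Adjusted R^2 penalises the extra terms.
adj_r2_inter <- 1 - (1 - r2_inter) * (n - 1) / df_resid

# Partial F-statistic for the two added terms.
partial_f <- ((r2_inter - r2_simple) / extra) / ((1 - r2_inter) / df_resid)
partial_p <- pf(partial_f, extra, df_resid, lower.tail = FALSE)

print(c(adj_r2 = adj_r2_inter, partial_F = partial_f, partial_p = partial_p))
```

With the fitted objects at hand, `anova(fit1, fit3)` would perform the same comparison from the unrounded sums of squares.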
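The residual analysis shows that the residuals are strongly skewed and do not show constant variance. The summary above gives a median residual of 2.098 against a minimum of -27.320 and a maximum of 13.074, which indicates a long left tail rather than a right skew. Therefore, the linear model is not valid in this case.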
when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
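\[ \begin{aligned} LifeExp &= 6.277 \times 10^{1} + 1.497 \times 10^{3} \, PropMD + 7.233 \times 10^{-5} \, TotExp - 6.026 \times 10^{-3} \, PropMD \times TotExp \\ LifeExp &= 6.277 \times 10^{1} + 1.497 \times 10^{3}(0.03) + 7.233 \times 10^{-5}(14) - 6.026 \times 10^{-3}(0.03)(14) \\ LifeExp &\approx 107.68 \end{aligned} \]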
A forecast of 107.7 years seems very unrealistic. First, that age is such an outlier for a human being that someone reaching it is featured in national or even international news. Life expectancy is the average time a person can expect to live, and I don’t see roughly half of a population making it to 107 with modern technology. Furthermore, 3% of the population being MDs seems unreasonably high; that would be about 9.5 million doctors in the US. Total expenditures of $14 also seem unreasonably low, as that figure includes both personal and government expenditures.
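Breaking the forecast into its terms shows where the implausible value comes from. This is a minimal sketch that hard-codes the rounded coefficients printed by summary(fit3); the names `coefs` and `contributions` are illustrative.

```r
# Coefficients copied (rounded) from summary(fit3) above.
coefs <- c(intercept = 6.277e+01, prop_md = 1.497e+03,
           tot_exp = 7.233e-05, interaction = -6.026e-03)

prop_md <- 0.03   # proportion of MDs given in the question
tot_exp <- 14     # total expenditures given in the question

# Contribution of each term to the forecast, in years.
contributions <- c(
  intercept   = coefs[["intercept"]],
  prop_md     = coefs[["prop_md"]] * prop_md,
  tot_exp     = coefs[["tot_exp"]] * tot_exp,
  interaction = coefs[["interaction"]] * prop_md * tot_exp
)

print(round(contributions, 4))
forecast <- sum(contributions)
print(forecast)
```

The PropMD term alone adds about 44.9 years to the intercept, while the TotExp and interaction terms are negligible at these inputs. With the rounded coefficients the forecast is about 107.68 years; small differences from a value computed with unrounded coefficients are expected.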