data <- read.csv("who.csv")
head(data)
attach(data)
plot(TotExp, LifeExp)
data.lm <- lm(LifeExp~TotExp)
summary(data.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.76 -4.78 3.15 7.12 13.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.75337453 0.75353661 85.93 < 0.0000000000000002 ***
## TotExp 0.00006297 0.00000779 8.08 0.000000000000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.4 on 188 degrees of freedom
## Multiple R-squared: 0.258, Adjusted R-squared: 0.254
## F-statistic: 65.3 on 1 and 188 DF, p-value: 0.0000000000000771
The F-statistic adds little information here. The overall F-test compares the fitted model with an intercept-only model, and with a single predictor it is equivalent to the t-test on the slope: \(F = t^2\), and indeed \(8.08^2 \approx 65.3\).
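The link between the overall F-test and the slope's t-test in a one-predictor model can be checked numerically. The sketch below uses R's built-in `cars` data as a stand-in (an assumption, since `who.csv` does not ship with R); the same check should carry over to `data.lm`.

```r
# Stand-in example on the built-in cars data (not the WHO data).
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

t_slope <- s$coefficients["speed", "t value"]  # t statistic for the slope
f_stat  <- unname(s$fstatistic["value"])       # overall F statistic

# With one predictor, F equals the squared t statistic of the slope.
isTRUE(all.equal(f_stat, t_slope^2))
```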
The \(R^2\) value of 0.258 means that this model explains about 26% of the variability in life expectancy, which is not too bad for a single predictor.
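To make the "share of variability explained" reading concrete, \(R^2\) can be rebuilt by hand as one minus the ratio of the residual to the total sum of squares. This is a minimal sketch on the built-in `cars` data, used here only as a stand-in for the WHO data.

```r
# Stand-in example on the built-in cars data (not the WHO data).
fit <- lm(dist ~ speed, data = cars)

rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r_squared <- 1 - rss / tss

# Matches the value reported by summary().
isTRUE(all.equal(r_squared, summary(fit)$r.squared))
```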
According to our textbook, we would typically want the standard error to be “at least five to ten times smaller than the corresponding coefficient”. In this case the standard error for TotExp, 0.00000779, is about 8.08 times smaller than the coefficient, 0.00006297, which meets that guideline. The standard error for the intercept, 0.75, is 85.93 times smaller than the coefficient, 64.75.
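The ratios quoted here are simply each estimate divided by its standard error, which is also the magnitude of the t value in the summary table. A small sketch using the numbers copied from the printed output (the vector names are only illustrative):

```r
# Estimates and standard errors copied from the printed summary of data.lm.
estimate  <- c(intercept = 64.75337453, TotExp = 0.00006297)
std_error <- c(intercept = 0.75353661,  TotExp = 0.00000779)

# How many times smaller the standard error is than the coefficient.
ratio <- estimate / std_error
round(ratio, 2)
```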
The p-values for both the TotExp coefficient and the intercept are so small that they are essentially zero, indicating that both TotExp and the intercept are statistically significant in this model.
Although the model has some success at predicting life expectancy, the scatter plot alone makes it clear that the relationship is not linear. The relationship does, however, appear to be possibly exponential.
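# Transform both variables. attach(data) exposed copies of the columns, so
# assigning to data$LifeExp does not change the attached LifeExp; refer to
# the data frame explicitly so plot() and lm() use the transformed values.
data$LifeExp <- data$LifeExp^4.6
data$TotExp <- data$TotExp^0.06
plot(data$TotExp, data$LifeExp)
data.lm <- lm(LifeExp ~ TotExp, data = data)
summary(data.lm)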
##
## Call:
## lm(formula = LifeExp ~ TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.76 -4.78 3.15 7.12 13.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.75337453 0.75353661 85.93 < 0.0000000000000002 ***
## TotExp 0.00006297 0.00000779 8.08 0.000000000000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.4 on 188 degrees of freedom
## Multiple R-squared: 0.258, Adjusted R-squared: 0.254
## F-statistic: 65.3 on 1 and 188 DF, p-value: 0.0000000000000771
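One thing worth checking whenever `attach()` is combined with later assignments to the data frame: `attach()` places a copy of the columns on the search path, so changing a column of the data frame afterwards does not change the attached variable of the same name. The toy sketch below illustrates this with a hypothetical data frame `df`; it is not the WHO data.

```r
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))  # hypothetical toy data

attach(df)
df$x <- df$x^2        # modifies the data frame only
x_attached <- x       # the attached copy still holds the original values
x_in_frame <- df$x    # the data frame holds the transformed values
detach(df)

# Passing data = df makes lm() use the transformed column.
fit <- lm(y ~ x, data = df)
```

A model fitted on the bare attached names after such an assignment would reproduce the untransformed fit, so identical summaries before and after a transformation are a useful warning sign.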
Again, the F statistic is not a particularly useful measure for a model with only one predictor.
The \(R^2\) value of 0.73 means that this model explains about 73% of the variability in life expectancy, which is a much better result than we achieved in our first model and a very strong single predictor.
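In the summary printed above, the standard error for TotExp, 0.00000779, is 8.08 times smaller than the coefficient, 0.00006297, and the standard error for the intercept, 0.75, is 85.93 times smaller than the coefficient, 64.75. These figures are identical to those of the first model, so they describe a fit to the untransformed columns (the copies exposed by attach()) and should not be read as evidence about the transformed model.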
The p-values for both the TotExp coefficient and the intercept are so small that they are essentially zero, indicating that both TotExp and the intercept are statistically significant in this model.
The scatter plot in this case shows a clear linear relationship that is also evident in the linear model. This is clearly a much better model than our first attempt: it explains almost three times as much of the variability as our first model did (0.73 against 0.258).
Given a linear model of:
\[ \text{LifeExp}^{4.6} = 64.75 + 0.00006297 \times \text{TotExp}^{0.06} \]
# forecast life expectancy when TotExp^.06 =1.5
le_1.5 <- (data.lm$coefficients[1] + data.lm$coefficients[2] * 1.5)^(1/4.6)
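With the coefficients printed above, the forecast life expectancy when \(\text{TotExp}^{0.06} = 1.5\) is \(2.48\) years. This is not a plausible life expectancy: the printed coefficients match those of the first, untransformed model, so the fitted value of about 64.75 is already on the scale of years, and raising it to the power \(1/4.6\) shrinks it to \(2.48\).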
# forecast life expectancy when TotExp^.06=2.5
le_2.5 <- (data.lm$coefficients[1] + data.lm$coefficients[2] * 2.5)^(1/4.6)
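Life expectancy when \(\text{TotExp}^{0.06} = 2.5\) is also \(2.48\) years with the printed coefficients, since a slope of 0.00006297 is too small to move the fitted value away from the intercept and \(64.75^{1/4.6} \approx 2.48\).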
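The back-transformation used in both forecasts can be wrapped in a small helper. This is only a sketch: `back_transform` is a hypothetical name, and the coefficients are copied from the printed summary rather than taken from a live model object.

```r
# Hypothetical helper: invert the LifeExp^power transformation for a
# fitted value b0 + b1 * z, where z is TotExp^0.06.
back_transform <- function(b0, b1, z, power = 4.6) {
  (b0 + b1 * z)^(1 / power)
}

# Coefficients copied from the printed summary.
b0 <- 64.75337453
b1 <- 0.00006297

forecasts <- back_transform(b0, b1, z = c(1.5, 2.5))
round(forecasts, 2)
```

With a fitted model in the session, the same helper could be called as `back_transform(coef(data.lm)[1], coef(data.lm)[2], 1.5)`.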
\[ \text{LifeExp} = b_0 + b_1 \times \text{PropMD} + b_2 \times \text{TotExp} + b_3 \times \text{PropMD} \times \text{TotExp} \]
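In R's formula syntax the `*` operator already expands to both main effects plus their interaction, so a model of this form can be written either way. The sketch below compares the expanded term labels of the two spellings, using placeholder variable names `y`, `a` and `b` (no data is needed to inspect a formula).

```r
# Two spellings of the same model: main effects plus interaction.
f_long  <- y ~ a + b + (a * b)
f_short <- y ~ a * b

labels_long  <- attr(terms(f_long), "term.labels")
labels_short <- attr(terms(f_short), "term.labels")

# Both expand to the terms a, b and a:b.
identical(labels_long, labels_short)
```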
#Reload the data to reset to original values
detach(data)
data2 <- read.csv("who.csv")
attach(data2)
mult.lm <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp))
summary(mult.lm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp))
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.32 -4.13 2.10 6.54 13.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.77270326 0.79560524 78.90 < 0.0000000000000002 ***
## PropMD 1497.49395252 278.81687965 5.37 0.000000232060277 ***
## TotExp 0.00007233 0.00000898 8.05 0.000000000000094 ***
## PropMD:TotExp -0.00602569 0.00147236 -4.09 0.000063527329494 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.8 on 186 degrees of freedom
## Multiple R-squared: 0.357, Adjusted R-squared: 0.347
## F-statistic: 34.5 on 3 and 186 DF, p-value: <0.0000000000000002
hist(mult.lm$residuals)
mean(mult.lm$residuals)
## [1] -0.00000000000000079
The residuals are not normally distributed: the histogram shows a strong left skew, even though their mean is essentially zero.
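A skew seen in a histogram can be backed up with a number. The sketch below defines a moment-based sample skewness (the helper name `sample_skewness` and the example vectors are made up for illustration); negative values indicate a left skew.

```r
# Hypothetical helper: moment-based sample skewness (negative = left skew).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

left_skewed <- c(-10, -4, -1, 0, 1, 2, 3, 4, 5)  # made-up values with a long left tail
symmetric   <- c(-2, -1, 0, 1, 2)                # made-up symmetric values

sample_skewness(left_skewed)  # negative
sample_skewness(symmetric)    # zero
```

Applied to this model, the call would be `sample_skewness(residuals(mult.lm))`.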
plot(fitted(mult.lm), residuals(mult.lm), xlab="Fitted", ylab="Residuals")
abline(h=0)
There is a strong apparent pattern in the plotted residuals, indicating that the linear model is not a good fit.
qqnorm(resid(mult.lm))
qqline(resid(mult.lm))
The Q-Q plot shows a strong curve in the sample quantiles against the theoretical quantiles, which again indicates that the residuals are not normally distributed.
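A formal test can complement the visual reading of a Q-Q plot. As a sketch, the Shapiro-Wilk test from base R is run below on simulated left-skewed values (an assumption standing in for the model residuals); a small p-value argues against normality.

```r
set.seed(1)
skewed <- -rexp(200)         # simulated left-skewed values, not the model residuals

sw <- shapiro.test(skewed)   # Shapiro-Wilk normality test
sw$p.value
```

For this model the equivalent call would be `shapiro.test(residuals(mult.lm))`.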
This model is not a good fit. It has an \(R^2\) value of 0.35744, which is only about half as good as the transformed model in Problem 2.
Given a linear model of:
\[ \text{LifeExp} = 62.77 + 1497.49 \times \text{PropMD} + 0.0000723 \times \text{TotExp} - 0.00603 \times (\text{PropMD} \times \text{TotExp}) \]
# forecast life expectancy when PropMD = 0.03 and TotExp = 14
lexp <- mult.lm$coefficients[1] + mult.lm$coefficients[2] * 0.03 + mult.lm$coefficients[3] * 14 + mult.lm$coefficients[4] * 0.03 * 14
Life expectancy when \(\text{PropMD} = 0.03\) and \(\text{TotExp} = 14\) is \(107.7\).
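The forecast can be reproduced from the printed coefficients as a dot product between the coefficient vector and the row of predictor values, including the interaction term. This is a sketch with illustrative variable names; the coefficients are copied from the summary output.

```r
# Coefficients copied from the printed summary of mult.lm.
b <- c(intercept   = 62.77270326,
       PropMD      = 1497.49395252,
       TotExp      = 0.00007233,
       interaction = -0.00602569)

prop_md <- 0.03
tot_exp <- 14

# Predictor row: 1 for the intercept, the two inputs, and their product.
x_row <- c(1, prop_md, tot_exp, prop_md * tot_exp)

forecast <- sum(b * x_row)
round(forecast, 1)
```

With the fitted model in the session, `predict(mult.lm, newdata = data.frame(PropMD = 0.03, TotExp = 14))` should give the same value without any hand arithmetic.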