Linear Regression Part II
Provide a scatterplot of LifeExp~TotExp, and run a simple linear regression. Do not transform the variables. Provide and interpret the F statistic, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
data = read.csv(file = "C:\\Users\\arono\\source\\R\\Data605\\who.csv", header = TRUE)
who.lm <- lm(data$LifeExp~data$TotExp, data = data)
b<-who.lm$coefficients[1]
m<-who.lm$coefficients[2]
# Visualize the regression line
title<-sprintf(" y = %.2fx + %.2f ",m,b)
plot(data$LifeExp~data$TotExp, main=title, xlab="Total Expenditures", ylab="Life Expectancy")
abline(who.lm)
Visually it's a poor fit because the response is not homoscedastic. Total expenditures increase dramatically at the upper end, but if we removed those points the regression line would look much more reasonable, which tells me the visual interpretation can be deceiving.
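To illustrate, here is a quick exploratory sketch that refits the line after dropping the largest expenditures (the 90th-percentile cutoff is an arbitrary choice for illustration):
# Exploratory only: exclude the largest expenditures and refit.
cutoff<-quantile(data$TotExp, 0.90)
sub<-data[data$TotExp < cutoff, ]
plot(LifeExp~TotExp, data=sub, xlab="Total Expenditures", ylab="Life Expectancy")
abline(lm(LifeExp~TotExp, data=sub))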
sum_who.lm<-summary(who.lm)
sum_who.lm
##
## Call:
## lm(formula = data$LifeExp ~ data$TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## data$TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
(Intercept) represents the y intercept and data$TotExp represents the slope.
Each Estimate represents the coefficient, so the equation \(\hat{y}=0.00006297x +64.75\) is applied to all values of x.
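As a quick sanity check (using the m and b extracted above), the fitted values are just this equation applied to each observed x:
# The fitted values equal m*x + b for every observation.
head(m * data$TotExp + b)
head(who.lm$fitted.values) # same values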
Each Std. Error estimates how much its coefficient would vary across repeated samples; the exact calculation isn't shown here, but it's worth noting that Excel's Data Analysis linear regression produces the same results.
Each t value is simply the estimate divided by its standard error, so a high t value indicates the estimate sits many standard errors away from zero, making it unlikely the true coefficient is zero.
Each Pr(>|t|) maps the t value onto Student's t distribution, which resembles the normal distribution except that it accounts for the number of observations through the degrees of freedom.
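Both columns can be recomputed by hand; a quick check against the coefficient table from the summary above:
# t value = Estimate / Std. Error; the p-value is the two-tailed
# tail area of the t distribution with 188 degrees of freedom.
t_vals<-sum_who.lm$coefficients[,"Estimate"]/sum_who.lm$coefficients[,"Std. Error"]
p_vals<-2*pt(-abs(t_vals), df=who.lm$df.residual)
t_vals # matches the "t value" column
p_vals # matches the "Pr(>|t|)" column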
The coefficient is 0.0000629701 and the intercept is 64.75, which tells you the explanatory variable spans a far larger range than the response.
The residual standard error more directly measures how far the actual values fall from the predicted values, in the units of the response.
\[RSE \ = \ \sqrt{\frac{\sum_{i=1}^{n}{(y_i \ - \ \hat{y}_i)^2}}{df}}\]
sqrt(sum((data$LifeExp-who.lm$fitted.values)^2)/who.lm$df.residual)
## [1] 9.371033
Note: 190 observations minus 1 explanatory variable and minus 1 for the intercept equals 188 degrees of freedom.
R Squared represents the “variability of y explained by x”. It is the ratio of the variation explained by the model (predicted vs. the mean) to the total variation (actual vs. the mean).
The calculation is
\[R^2 \ = \frac{SSExplained}{SSTotal} \ = \ \frac{\sum_{i=1}^{n}{(\hat{y}_i \ - \ \bar{y} )^2}}{\sum_{i=1}^{n}{(y_i \ - \ \bar{y} )^2}}\]
mu<-mean(data$LifeExp)
ss_explained<-sum((who.lm$fitted-mu)^2) # predicted vs mean
ss_unexplained<-sum((data$LifeExp-who.lm$fitted)^2) # actual vs predicted
ss_explained/(ss_explained+ss_unexplained) # R squared
## [1] 0.2576922
The adjusted R2 penalizes R2 for the number of explanatory variables relative to the number of observations.
\[Adj \ R^2 \ = \ 1 - (1-R^2)\frac{n-1}{n-k-1}\]
where
n = number of observations
k = number of explanatory variables
sum_who.lm$adj.r.squared
## [1] 0.2537437
n<-length(who.lm$residuals)
k<-nrow(sum_who.lm$coefficients)-1 # subtract one to ignore the intercept
SSE<-sum(sum_who.lm$residuals^2)
SStotal<-sum((data$LifeExp-mean(data$LifeExp))^2) # total sum of squares
1-((SSE/SStotal)*(n-1)/(n-k-1))
## [1] 0.2537437
The F statistic follows the F distribution, which accounts for the number of explanatory variables; it tests whether the model explains significantly more variation than the mean alone, and it is especially useful for hypothesis testing about groups of coefficients in multiple regression.
The calculation is
\[F \ = \ \frac{ \frac{ \sum{(\hat{y}_i-\bar{y})^2}}{df_1} }{ \frac{ \sum{ (y_i-\hat{y}_i)^2}}{df_2} }\]
where df1 = k (the number of explanatory variables) and df2 = n - k - 1.
mu<-mean(data$LifeExp)
# numerator
ss_explained<-sum((who.lm$fitted-mu)^2) # predicted vs mean
msreg<-ss_explained/(length(who.lm$coefficients)-1) # mean square for the regression; only 1 explanatory variable here
ss_unexplained<-sum((data$LifeExp-who.lm$fitted)^2) # actual vs predicted
mse<-ss_unexplained/who.lm$df.residual # mean squared error
msreg/mse
## [1] 65.2642
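The reported p-value follows from this statistic; a quick check against the F distribution with (1, 188) degrees of freedom:
# Tail area of the F distribution beyond the observed statistic;
# matches the summary p-value of 7.714e-14.
pf(msreg/mse, df1=1, df2=who.lm$df.residual, lower.tail=FALSE)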
For the model to be trustworthy, the results should show:
linearity
nearly normal residuals
constant variability
We already looked at linearity and R squared, and saw the fit was OK but not great.
The two plots below show the residuals are only somewhat normal and that their variability is not constant (not homoscedastic).
hist(sum_who.lm$residuals,breaks=20)
Another way to visualize normality is through a Q-Q plot.
qqnorm(resid(sum_who.lm))
qqline(resid(sum_who.lm))
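A residuals-versus-fitted plot is another way to see the non-constant variance directly; a minimal sketch using the first model, who.lm:
# Supplementary check: the residuals fan out across the fitted values,
# which is the heteroscedasticity described above.
plot(who.lm$fitted.values, who.lm$residuals, xlab="Fitted values", ylab="Residuals", main="Residuals vs Fitted")
abline(h=0, lty=2)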
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
LifeExp4.6<-data$LifeExp^4.6
TotExp0.06<-data$TotExp^.06
who.lm <- lm(LifeExp4.6~TotExp0.06)
b<-who.lm$coefficients[1]
m<-who.lm$coefficients[2]
fitted_recalc<-(m * TotExp0.06) + b # fitted values recomputed by hand on the transformed scale
# Visualize the regression line
title<-sprintf(" y = %.2fx + %.2f ",m,b)
plot(LifeExp4.6~TotExp0.06, main=title, xlab="TotExp^0.06", ylab="LifeExp^4.6")
abline(who.lm)
The visual fit reflects the relationship much better. The problem I have is that the life expectancy values no longer carry a direct real-world meaning on this transformed scale.
summary(who.lm)
##
## Call:
## lm(formula = LifeExp4.6 ~ TotExp0.06)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp0.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
The t values are higher and the F statistic is higher, making the residuals seem less concerning.
R squared is closer to 1 and supports our visual interpretation.
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
b<-who.lm$coefficients[1]
m<-who.lm$coefficients[2]
sprintf("If Total Exp = 1.5, then Life Expectency would be %.2f", m * 1.5 + b)## [1] "If Total Exp = 1.5, then Life Expectency would be 193562413.99"
sprintf("If Total Exp = 1.5, then Life Expectency would be %.2f", m * 2.5 + b)## [1] "If Total Exp = 1.5, then Life Expectency would be 813622629.64"
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
\[LifeExp=\beta_0+\beta_1 * PropMD + \beta_2 * TotExp +\beta_3 * PropMD * TotExp\]
who.mlm <- lm(data$LifeExp ~ data$PropMD + data$TotExp + (data$PropMD * data$TotExp) , data = data)
b0<-who.mlm$coefficients[1]
b1<-who.mlm$coefficients[2]
b2<-who.mlm$coefficients[3]
b3<-who.mlm$coefficients[4]
sprintf(" y = %.2fB0 + %.2fB1 + %.2fB2 ",b0,b1,b2)## [1] " y = 62.77B0 + 1497.49B1 + 0.00B2 "
summary(who.mlm)
##
## Call:
## lm(formula = data$LifeExp ~ data$PropMD + data$TotExp + (data$PropMD *
## data$TotExp), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## data$PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## data$TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## data$PropMD:data$TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Adding the interaction term increased R squared and reduced the RSE, but with R squared at only 0.36 the model still isn't a good fit.
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
sprintf("If Total Exp = 14 and PropMD=.03, then Life Expectency would be %.2f", b1*.03 + b2 * 14 + b3*.03*14 + b0)## [1] "If Total Exp = 14 and PropMD=.03, then Life Expectency would be 107.70"
A life expectancy of 107 years is not realistic. TotExp = 14 lies far below every expenditure observed in the data, so the model is extrapolating well outside its range, and the prediction exceeds any life expectancy ever recorded.
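A quick look at the observed ranges shows how far outside the data this forecast sits:
# TotExp = 14 is below every observed expenditure, and 107 years
# exceeds every observed life expectancy, so this is pure extrapolation.
range(data$TotExp)
range(data$LifeExp)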