Linear Regression Part II






Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.



The Visual Interpretation



# Load the WHO dataset
data = read.csv(file = "C:\\Users\\arono\\source\\R\\Data605\\who.csv", header = TRUE)


# Fit simple linear regression of life expectancy on total expenditures
who.lm <- lm(data$LifeExp~data$TotExp, data = data)

b<-who.lm$coefficients[1]   # intercept
m<-who.lm$coefficients[2]   # slope

# Visualize the regression line
title<-sprintf(" y = %.2fx + %.2f ",m,b)
plot(data$LifeExp~data$TotExp, main=title, xlab="Expenditures", ylab="Expectancy")
abline(who.lm) 


Visually it's a poor fit because the response variable is not homoscedastic. Total expenditures increase greatly at the high end, but if we were to remove those points the abline would look much more reasonable, which tells me the visual interpretation can be deceiving.
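

As a rough sketch of that claim, we can refit after dropping the largest expenditures; the 90th-percentile cutoff here is my own arbitrary choice, not part of the assignment.


# Hypothetical check: drop the top 10% of expenditures and refit (cutoff is arbitrary)
trimmed <- data[data$TotExp < quantile(data$TotExp, 0.9), ]
trim.lm <- lm(LifeExp ~ TotExp, data = trimmed)
coef(trim.lm)   # compare the slope and intercept against the full fit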



The Interpretation of the Summary


sum_who.lm<-summary(who.lm)

sum_who.lm
## 
## Call:
## lm(formula = data$LifeExp ~ data$TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## data$TotExp 6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14


(Intercept) represents the y-intercept and data$TotExp represents the slope.


Each Estimate is the fitted coefficient, so the equation \(\hat{y}=0.00006297x +64.75\) applies to all values of x.
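

As a quick check, applying that equation to any observation reproduces R's fitted value:


# The fitted equation applied to the first observation matches R's fitted value
m * data$TotExp[1] + b
who.lm$fitted.values[1]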


Each Std. Error estimates how much the corresponding coefficient would vary across repeated samples. I'm not clear on the full derivation, but it's worth noting that Excel's Data Analysis linear regression produces the same results.
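

For simple regression, though, the slope's standard error has a known closed form, the RSE scaled by the spread of x, which we can check against the table above:


# SE of the slope: RSE / sqrt(sum((x - mean(x))^2))
x <- data$TotExp
sum_who.lm$sigma / sqrt(sum((x - mean(x))^2))   # should match 7.795e-06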



Each t value is simply the estimate divided by its standard error, i.e. a high t-statistic indicates the estimated coefficient sits many standard errors away from zero, evidence that the true coefficient is not zero.


Each Pr(>|t|) maps the t-statistic to Student's t distribution, which may be described as the normal distribution except that it accounts for N, the number of observations, through the degrees of freedom.
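

Both columns can be reproduced from the estimates and standard errors, using the 188 residual degrees of freedom:


est  <- sum_who.lm$coefficients[, "Estimate"]
se   <- sum_who.lm$coefficients[, "Std. Error"]
tval <- est/se                                                   # matches the t value column
2 * pt(abs(tval), df = who.lm$df.residual, lower.tail = FALSE)   # matches Pr(>|t|)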


The coefficient is 0.0000629701 and the intercept is 64.75, which gives you an idea that the numeric scale of the explanatory variable is considerably larger than that of the response variable.


The Residual standard error is a more direct measurement of how far the actual values fall from the predicted values, in the units of the response.


\[RSE \ = \ \sqrt{\frac{\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}}{df}}\]

sqrt(sum((data$LifeExp-who.lm$fitted.values)^2)/who.lm$df.residual)
## [1] 9.371033


Note 190 observations minus 1 explanatory variable minus 1 equals 188 degrees of freedom.
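

Both numbers are available directly from the fit:


nrow(data)            # 190 observations
who.lm$df.residual    # 190 - 1 explanatory variable - 1 intercept = 188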


R Squared represents the “variability of y explained by x”. It's the ratio of the variation of the predictions around the mean to the total variation, i.e. the predicted-vs-mean sum of squares over the sum of that and the actual-vs-predicted sum of squares.


The calculation is


\[R^2 \ = \frac{SSExplained}{SSTotal} \ = \ \frac{\sum_{i=1}^{n}{(\hat{y}_i \ - \ \bar{y} )^2}}{\sum_{i=1}^{n}{(y_i \ - \ \bar{y} )^2}}\]

mu<-mean(data$LifeExp)
ss_explained<-sum((who.lm$fitted-mu)^2)                   # predicted vs mean
ss_unexplained<-sum((data$LifeExp-who.lm$fitted)^2)       # actual vs predicted
ss_explained/(ss_explained+ss_unexplained)                # R squared
## [1] 0.2576922
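

This matches the value reported by summary():


sum_who.lm$r.squared   # 0.2576922, same as the manual calculation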


The adjusted R^2 is R^2 penalized for the number of explanatory variables relative to the number of observations.


\[Adj \ R^2 \ = \ 1 \ - \ (1-R^2) \ \frac{n-1}{n-k-1}\]


where

n = number of observations


k = number of explanatory variables


sum_who.lm$adj.r.squared
## [1] 0.2537437
n=length(who.lm$residuals)
k=nrow(sum_who.lm$coefficients)-1   # number of explanatory variables, ignoring the intercept

SSE=sum(sum_who.lm$residuals^2)                  # unexplained (residual) sum of squares
SST=sum((data$LifeExp-mean(data$LifeExp))^2)     # total sum of squares
1-((SSE/SST)*(n-1)/(n-k-1))
## [1] 0.2537437


The F Statistic follows the F distribution, which accounts for the number of explanatory variables through its degrees of freedom; as such it is the standard tool for hypothesis testing about the joint effect of several coefficients in multiple regression.


The calculation is

\[F \ = \ \frac{ \frac{ \sum{(\hat{y}_i-\bar{y})^2}}{df_1} }{ \frac{ \sum{ (y_i-\hat{y}_i)^2}}{df_2} }\]


mu<-mean(data$LifeExp)

# numerator: explained variance per explanatory degree of freedom (mean square regression)
ss_explained<-sum((who.lm$fitted-mu)^2)                   # predicted vs mean
msr<-ss_explained/(length(who.lm$coef)-1)                 # df1 = 1 explanatory variable


# denominator: unexplained variance per residual degree of freedom (mean square error)
ss_unexplained<-sum((data$LifeExp-who.lm$fitted)^2)       # actual vs predicted
mse<-ss_unexplained/who.lm$df.residual                    # df2 = 188
msr/mse
## [1] 65.2642
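

The reported p-value is just the upper tail of the F(1, 188) distribution at this statistic:


pf(msr/mse, df1 = 1, df2 = 188, lower.tail = FALSE)   # ~7.7e-14, matching the summary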



Assumptions of Linear Regressions



To trust your model, you want your results to show

linearity

nearly normal residuals

constant variability


We already looked at linearity and R squared, and saw the fit was OK but not great.


The next two plots show the residuals are only somewhat normal, and the residuals-vs-fitted plot after them shows their variability is not homoscedastic.


hist(sum_who.lm$residuals,breaks=20)


Another way to visualize normality is through a qq plot.


qqnorm(resid(who.lm))
qqline(resid(who.lm))
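

Finally, a residuals-vs-fitted plot (a quick sketch using the same fit) makes the non-constant variance easier to see directly:


# Residuals vs fitted values: a funnel or curve indicates non-constant variance
plot(who.lm$fitted.values, resid(who.lm),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)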




Rescaling Model



Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

LifeExp4.6<-data$LifeExp^4.6
TotExp0.06<-data$TotExp^.06

who.lm <- lm(LifeExp4.6~TotExp0.06)
b<-who.lm$coefficients[1]
m<-who.lm$coefficients[2]

# Visualize the regression line
title<-sprintf(" y = %.2fx + %.2f ",m,b)
plot(LifeExp4.6~TotExp0.06, main=title, xlab="TotExp^0.06", ylab="LifeExp^4.6")
abline(who.lm) 


The visual interpretation looks better in reflecting the relationship. The problem I have is that the transformed life-expectancy values no longer have a direct real-world meaning.


summary(who.lm)
## 
## Call:
## lm(formula = LifeExp4.6 ~ TotExp0.06)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp0.06   620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16


The t values and the F statistic are both much higher, making the residuals seem less concerning.


R squared is closer to 1 and supports our visual interpretation.


Using the results from 3, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.


b<-who.lm$coefficients[1]
m<-who.lm$coefficients[2]

sprintf("If Total Exp = 1.5, then Life Expectency would be %.2f", m * 1.5  + b)
## [1] "If Total Exp = 1.5, then Life Expectency would be 193562413.99"
sprintf("If Total Exp = 1.5, then Life Expectency would be %.2f", m * 2.5  + b)
## [1] "If Total Exp = 1.5, then Life Expectency would be 813622629.64"



Multiple Linear Regression Model




Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?


\[LifeExp=\beta_0+\beta_1 * PropMD + \beta_2 * TotExp +\beta_3 * PropMD * TotExp\]


who.mlm <- lm(data$LifeExp ~ data$PropMD + data$TotExp + (data$PropMD * data$TotExp) , data = data)

b0<-who.mlm$coefficients[1]
b1<-who.mlm$coefficients[2]
b2<-who.mlm$coefficients[3]
b3<-who.mlm$coefficients[4]

sprintf(" y = %.2fB0 + %.2fB1 + %.2fB2 ",b0,b1,b2)
## [1] " y = 62.77B0 + 1497.49B1 + 0.00B2 "
summary(who.mlm)
## 
## Call:
## lm(formula = data$LifeExp ~ data$PropMD + data$TotExp + (data$PropMD * 
##     data$TotExp), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.277e+01  7.956e-01  78.899  < 2e-16 ***
## data$PropMD              1.497e+03  2.788e+02   5.371 2.32e-07 ***
## data$TotExp              7.233e-05  8.982e-06   8.053 9.39e-14 ***
## data$PropMD:data$TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16


Adding the interaction term increased R squared and reduced the RSE, but the model still isn't a good fit.


Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?


sprintf("If Total Exp = 14 and PropMD=.03, then Life Expectency would be %.2f", b1*.03 + b2 * 14 + b3*.03*14  + b0)
## [1] "If Total Exp = 14 and PropMD=.03, then Life Expectency would be 107.70"


A life expectancy of 107 years is not realistic. The inputs PropMD = .03 and TotExp = 14 lie well outside the observed data, so the model is extrapolating far beyond its support.
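

A quick way to see the extrapolation (a sketch using the same data frame):


# Compare the requested inputs against the observed ranges
range(data$PropMD)   # .03 sits near the extreme of the observed proportions
range(data$TotExp)   # 14 appears to be far below the observed expenditures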