The response variable is the outcome we are really interested in understanding. It is also called the dependent variable, because its value should depend on other characteristics we measure on our observational units.
The explanatory variable(s) are used to predict, or explain, the variation in the outcome variable. These are referred to as the independent variables, because each predictor is assumed to be independent of the others, meaning each reflects a unique component that explains the outcome.
\(TFR_i = \beta_0 +\beta_1 * \text{life expectancy}_i + \epsilon_i\)
library(broom)
library(readr)
library(dplyr)
library(ggplot2)

# read the 2008 PRB data, using a saved column specification for the column types
prb <- read_csv(file = "https://raw.githubusercontent.com/coreysparks/data/master/PRB2008_All.csv",
                col_types = read_rds(url("https://raw.githubusercontent.com/coreysparks/r_courses/master/prbspec.rds")))

# convert the variable names to lower case
names(prb) <- tolower(names(prb))
# fit the simple linear regression of TFR on total life expectancy
fit <- lm(tfr ~ e0total, data = prb)
coef(fit)
## (Intercept) e0total
## 10.9090905 -0.1159699
The regression line is then:
\(\widehat{TFR}_i = 10.909 - 0.116 * \text{life expectancy}_i\)
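As a quick sketch of how to use this line, we can plug in a hypothetical life expectancy (70 years here, chosen purely for illustration), either by hand or with the predict() function:

# predicted TFR for a hypothetical life expectancy of 70 years
b <- coef(fit)
b[1] + b[2] * 70   # 10.909 - 0.116 * 70, about 2.79

# predict() does the same computation
predict(fit, newdata = data.frame(e0total = 70))

We can visualize the data along with this fitted line: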
ggplot(prb, aes(x = e0total, y = tfr)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
This simple model assumes:
That life expectancy affects the TFR in a linear manner, meaning that the slope relating the two is constant for all values of life expectancy.
So this model says that we can predict the TFR using the life expectancy and a random error term.
This error term is called the residual and represents the non-predictable portion of the variation in the TFR.
We expect the model to be correct on average, so the expected value (mean) of the errors is 0, \(E(\epsilon_i) = 0\) (see the quick check after this list).
We assume the error terms have a constant variance, or \(Var(\epsilon_i) = \sigma_{\epsilon}^2\)
We also assume the \(\epsilon_i\) are independent of one another and that the \(\epsilon_i\) are normally distributed.
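As a quick sanity check of the zero-mean assumption (a minimal sketch; note that OLS residuals from a model with an intercept sum to zero by construction, so this only confirms the arithmetic):

# the mean of the residuals; with an intercept in the model this is
# zero by construction, up to floating-point error
mean(resid(fit))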
The parameters are estimated by minimizing the residual sum of squares (RSS):

\(RSS = \sum_i (y_i - \hat{y}_i )^2\)

where \(\hat{y}_i = E[y|x] = \hat{\beta_0} + \hat{\beta_1} * \text{life expectancy}_i\)
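For our fitted model, the RSS can be computed directly from the residuals, or extracted with deviance(), which returns the RSS for an lm object:

# residual sum of squares, from the definition
sum(resid(fit)^2)

# deviance() returns the same quantity for an lm fit
deviance(fit)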
To estimate the regression parameters, we minimize the RSS; the Gauss-Markov theorem guarantees that, when the assumptions above hold, the resulting least-squares estimates are the best linear unbiased estimators. The solution for the slope is:
\(\hat{\beta_1} = \frac {\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\)
The top part of the equation is the sum of cross-products, or covariance, between the outcome and the predictor, and the bottom is the sum of squares, or un-scaled variance, of the predictor. The other parameter can be found once \(\hat{\beta_1}\) is estimated as:
\(\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}\)
These are the ordinary least-squares (OLS) estimates of the regression parameters, and they are a direct solution for the regression model when the conditions of normality, homoskedasticity, and independence are met.
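As a minimal sketch, we can verify these closed-form solutions by hand, first dropping the two countries with missing values (as lm() does automatically):

# complete cases for the two variables in the model
dat <- prb %>% filter(!is.na(tfr), !is.na(e0total))
x <- dat$e0total
y <- dat$tfr

# slope: sum of cross-products over the sum of squares of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# intercept: from the sample means and the slope
b0 <- mean(y) - b1 * mean(x)

c(b0, b1)   # should match coef(fit): 10.909 and -0.116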
The mean square error (MSE) estimates the residual variance:

\[\hat{\sigma}_{\epsilon}^2 =\text{MSE} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n-p}\]

where \(n\) is the number of observations and \(p\) is the number of estimated parameters. The standard error of the slope is then:

\[s.e.(\hat{\beta_1}) = \sqrt{\frac{\text{MSE}}{\sum (x_i - \bar{x})^2}}\]
The test statistic for the hypothesis that \(\beta_1 = 0\) is, generally, a t-statistic:

\[t = \frac{\hat{\beta_1}}{s.e.(\hat{\beta_1})}\]
and we can find a 95% confidence interval for \(\beta_1\) by using its assumed normal sampling distribution, as:

\[\hat{\beta_1} \pm 1.96*s.e.(\hat{\beta_1})\]
or by assuming a t-distribution as:
\[\hat{\beta_1} \pm t(1-\alpha/2; df=n-p)*s.e.(\hat{\beta_1})\]
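Continuing the hand-computation sketch from above (reusing x, y, b0, and b1), we can verify these formulas against what R will report:

n <- length(y)
p <- 2                                   # parameters estimated: beta_0 and beta_1

yhat <- b0 + b1 * x                      # fitted values
mse  <- sum((y - yhat)^2) / (n - p)      # mean square error

# standard error of the slope
se_b1 <- sqrt(mse / sum((x - mean(x))^2))

# t-statistic for H0: beta_1 = 0
b1 / se_b1

# 95% confidence interval using the t-distribution (here n - p = 205,
# matching the degrees of freedom in the summary output below)
b1 + c(-1, 1) * qt(0.975, df = n - p) * se_b1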
Revisiting our model from above, we can obtain all of these quantities by using summary() on the model fit.
summary(fit)
##
## Call:
## lm(formula = tfr ~ e0total, data = prb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2821 -0.5467 -0.0752 0.5765 2.8012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.909090 0.423423 25.76 <2e-16 ***
## e0total -0.115970 0.006161 -18.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9719 on 205 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6335, Adjusted R-squared: 0.6317
## F-statistic: 354.3 on 1 and 205 DF, p-value: < 2.2e-16
We see the estimates of the parameters, their associated standard errors, and the t-statistics for each, along with the calculated p-value of the test.
To get confidence intervals for any model, we can use the confint() function:
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 10.0742690 11.7439119
## e0total -0.1281171 -0.1038228
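As an aside, since broom was loaded at the top, we can also collect these pieces into tidy data frames; a minimal sketch:

# coefficients, standard errors, t-statistics, p-values, and CIs in one table
tidy(fit, conf.int = TRUE)

# model-level summaries, including the R-squared
glance(fit)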
Likewise, since we’ve seen the linear model before in the ANOVA and two group setting, we also get a model \(R^2\) from the summary, and we see that about 63% of the variation in the TFR is explained by life expectancy.
So, what would we conclude about our analysis in this case?
In general, I examine the t-statistic for the \(\beta_1\) parameter first; if it is not different from 0, then there is no association between the predictor and the outcome.
Secondly, I examine the model \(R^2\) to see how the model is doing in terms of explaining variation in the dependent variable. If it is low (<.1), well, that’s too bad, but we have to work with what we have; perhaps we should consider adding more predictor variables. If it’s high (>.4), well, that’s good too, and maybe I won’t add more predictor variables. Then I proceed to check the other model assumptions.
The residuals from the model can be accessed via fit$residuals. We can also calculate standardized residuals using the rstandard() function, as in the short sketch below. Typically we would use graphical methods to identify if there is non-constant variance.
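For instance, a minimal sketch pulling out both kinds of residuals:

# raw residuals (equivalent to fit$residuals)
head(resid(fit))

# standardized residuals
head(rstandard(fit))

The standard diagnostic plot of residuals against fitted values is: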
plot(fit, which=1)
In this plot, we are looking for patterns. The pattern we would like to see is the residuals varying evenly with respect to the fitted values; in this case, they are not: lower fitted values have lower variance than higher fitted values. The red line also helps reveal any trend in the residuals, and in this case it begins to indicate a positive trend for higher fitted values, suggesting non-constant variation.
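Relatedly, the scale-location plot (the third of R’s built-in lm diagnostics) plots the square root of the absolute standardized residuals against the fitted values; an upward-sloping trend there is another sign of non-constant variance:

# scale-location plot: sqrt(|standardized residuals|) vs. fitted values
plot(fit, which = 3)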
We can also apply the Breusch-Pagan test here; (from the help) “The Breusch-Pagan test fits a linear regression model to the residuals of a linear regression model”. So, this is a formal test for trend in the residual variance, where the above method was purely graphical.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fit)
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 40.597, df = 1, p-value = 1.871e-10
This supports our assertion that the variance is not constant.
We next examine our model for normality of residuals. First a graphical method:
plot(fit, which=2)
Not bad, but is it good enough? We can use the devilish Shapiro-Wilk test or the Kolmogorov-Smirnov test.
shapiro.test(resid(fit))
##
## Shapiro-Wilk normality test
##
## data: resid(fit)
## W = 0.98395, p-value = 0.01866
ks.test(resid(fit), y = pnorm)
## Warning in ks.test(resid(fit), y = pnorm): ties should not be present for
## the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(fit)
## D = 0.055826, p-value = 0.539
## alternative hypothesis: two-sided
So, S-W says the residuals are not normal, but K-S says the cumulative distribution of the residuals does not differ from a normal distribution. We’ll go with the Russians this time.
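One caveat worth noting: as called above, ks.test() compares the residuals to a standard normal (mean 0, sd 1), which is roughly appropriate here only because the residual standard error (0.97) happens to be close to 1. A sketch of the more careful call, supplying the estimated standard deviation (keeping in mind that estimating parameters from the same data makes the K-S p-value approximate):

# compare the residuals to a normal with mean 0 and the residuals' own sd
ks.test(resid(fit), "pnorm", mean = 0, sd = sd(resid(fit)))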