Dependent and Independent variables

On models…

  • We speak of regression analysis in terms of constructing “models” for the dependent variable.
  • A model is a cartoon of reality: it will undoubtedly leave out important things and oversimplify relationships.
  • In regression analysis, we build a model that tries to explain our outcome variable in terms of a combination of explanatory variables. More often than not, we will over-generalize the real relationships involved.
  • We must remember the idea of parsimony: the simpler solution is often the better one.

What the regression model does…

  • The model estimates a mathematical function of the explanatory variables that most accurately predicts the response variable.
  • There are two primary uses of regression models:
  • This function can be used in future cases, where you may want to predict a value of an outcome that has not been observed (see the sketch after this list).
  • This is very useful for demographic work.
  • Prediction is notoriously difficult and problematic, especially if the outcome you are working with is a matter of life or death.
  • The model may also be used to describe, or explain, how a phenomenon has worked in the past or currently works; in other words, models work well for describing associations.
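As a rough sketch of the prediction use, and assuming the fitted model fit and the prb data that appear later in these notes (the TFR modeled as a function of total life expectancy, e0total), predictions for hypothetical new cases can be obtained with predict():

# hypothetical new cases: countries with life expectancies of 55, 70 and 85 years
newdat <- data.frame(e0total = c(55, 70, 85))

# predicted TFR for each hypothetical case, with 95% prediction intervals
predict(fit, newdata = newdat, interval = "prediction")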

Estimating the parameters of a model

\(RSS = \sum_i (y_i - \hat{y}_i )^2\)

Where \(\hat{y}_i = E[y|x] = \hat{\beta_0} +\hat{\beta_1} * \text{life expectancy}_i\)

To estimate the regression parameters, we minimize the RSS above with respect to the parameters; solving the resulting least-squares equations gives the estimate of \(\beta_1\):

\(\beta_1 = \frac {\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\)

The numerator is the sum of cross-products, or unscaled covariance between the outcome and the predictor, and the denominator is the sum of squares, or unscaled variance, of the predictor. The other parameter can be found once \(\beta_1\) is estimated, as:

\(\beta_0 = \bar{y} - \beta_1 \bar{x}\)

These are the ordinary least-squares (OLS) estimates of the regression parameters, and they are a direct, closed-form solution for the simple linear regression model. By the Gauss-Markov theorem, they are the best linear unbiased estimates when the conditions of homoskedasticity and independence of the errors are met; normality is additionally assumed for the inference that follows.
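As a sketch of these formulas, and assuming the prb data used in the model below (tfr as the outcome, e0total as the predictor), the estimates can be computed by hand and compared with the lm() solution:

# keep only the complete cases for the two variables in the model
dat <- na.omit(prb[, c("tfr", "e0total")])
x <- dat$e0total
y <- dat$tfr

# slope: sum of cross-products over the sum of squares of the predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# intercept: from the means and the estimated slope
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)

# direct solution from lm() for comparison
coef(lm(tfr ~ e0total, data = dat))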

Inference for regression parameters

Mean square error

\[\hat{\sigma}_{\epsilon}^2 =\text{MSE} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n-p}\]

\[s.e.(\hat{\beta_1}) = \sqrt{\frac{\text{MSE}}{\sum (x_i - \bar{x})^2}}\]

The test statistic for the null hypothesis \(H_0: \beta_1 = 0\) is, generally, a t-statistic:

\[t = \frac{\hat{\beta_1}}{s.e.(\hat{\beta_1})}\]

and we can find a 95% confidence interval for \(\beta_1\) by using its assumed normal sampling distribution, as:

\[\hat{\beta_1} \pm 1.96*s.e.(\hat{\beta_1})\]

or by assuming a t-distribution as:

\[\hat{\beta_1} \pm t(1-\alpha/2; df=n-p)*s.e.(\hat{\beta_1})\]
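As a minimal sketch of these formulas, assuming the fitted model fit used in the example below (lm(tfr ~ e0total, data = prb)), the MSE, standard error, t-statistic and confidence intervals can be built up by hand:

# sample size and number of estimated parameters
n <- length(resid(fit))
p <- length(coef(fit))

# mean square error from the residual sum of squares
mse <- sum(resid(fit)^2) / (n - p)

# standard error of the slope
x     <- model.frame(fit)$e0total
se_b1 <- sqrt(mse / sum((x - mean(x))^2))

# t-statistic for H0: beta_1 = 0
b1 <- coef(fit)["e0total"]
b1 / se_b1

# 95% confidence intervals: normal-based and t-based
b1 + c(-1, 1) * 1.96 * se_b1
b1 + c(-1, 1) * qt(1 - 0.05/2, df = n - p) * se_b1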

Revisiting our model from above, we can get these quantities by using summary() on the model fit.

summary(fit)
## 
## Call:
## lm(formula = tfr ~ e0total, data = prb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2821 -0.5467 -0.0752  0.5765  2.8012 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.909090   0.423423   25.76   <2e-16 ***
## e0total     -0.115970   0.006161  -18.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9719 on 205 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.6335, Adjusted R-squared:  0.6317 
## F-statistic: 354.3 on 1 and 205 DF,  p-value: < 2.2e-16

We see the estimates of the parameters, their associated standard errors, and the t-statistics for each, along with the calculated p-value of the test.

To get confidence intervals for any model, we can use the confint() function.

confint(fit)
##                  2.5 %     97.5 %
## (Intercept) 10.0742690 11.7439119
## e0total     -0.1281171 -0.1038228

Likewise, since we’ve seen the linear model before in the ANOVA and two-group settings, we also get a model \(R^2\) from the summary, and we see that about 63% of the variation in the TFR is explained by life expectancy.
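With a single predictor, \(R^2\) is just the squared correlation between the outcome and the predictor, so (assuming the prb data) a quick check is:

# squared Pearson correlation should match the Multiple R-squared above
cor(prb$tfr, prb$e0total, use = "complete.obs")^2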

So, what would we conclude about our analysis in this case?

Evaluating the model assumptions

Typically we would use graphical methods to identify if there is non-constant variance:

plot(fit, which=1)

In this plot, we are looking for patterns. One pattern we would like to see is the residuals varying constantly with respect to the fitted values; in this case, they do not. Lower fitted values have lower residual variance than higher fitted values. The red line also lets us examine the residuals for a non-constant trend, and in this case it begins to indicate a positive trend in the residuals at higher fitted values, suggesting non-constant variation.

We can also apply the Breusch-Pagan test here, (from the help) “The Breusch-Pagan test fits a linear regression model to the residuals of a linear regression model”. So this is a formal test for a trend in the residual variance, whereas the method above was purely graphical.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fit)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 40.597, df = 1, p-value = 1.871e-10

This supports our assertion that the variance is not constant.
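To make the quoted help description concrete, here is a rough by-hand version of the same idea: regress the squared residuals on the predictor and look for a trend. This is only a sketch of the logic behind the studentized test, not a replacement for bptest().

mf <- model.frame(fit)
u2 <- resid(fit)^2                      # squared residuals from the original model

# auxiliary regression: do the squared residuals trend with the predictor?
aux <- lm(u2 ~ e0total, data = mf)

# the studentized BP statistic is n * R-squared of this auxiliary regression,
# referred to a chi-square distribution with 1 df
bp <- length(u2) * summary(aux)$r.squared
pchisq(bp, df = 1, lower.tail = FALSE)  # compare with the bptest() output above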

Normality of residuals

We next examine our model for normality of residuals. First a graphical method:

plot(fit, which=2)

Not bad, but is it good enough? We can use the devilish Shapiro-Wilk test or the Kolmogorov-Smirnov test.

shapiro.test(resid(fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit)
## W = 0.98395, p-value = 0.01866
ks.test(resid(fit), y = pnorm)
## Warning in ks.test(resid(fit), y = pnorm): ties should not be present for
## the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  resid(fit)
## D = 0.055826, p-value = 0.539
## alternative hypothesis: two-sided

So, the Shapiro-Wilk test says the residuals are not normal, but the Kolmogorov-Smirnov test says the cumulative distribution of the residuals does not differ from a normal. So, we’ll go with the Russians this time.