Last time we covered simple regression analysis, where we have one predictor variable. We saw how to describe the relationship between the predictor and the dependent variable, how to test the null hypothesis that this relationship is 0, and various diagnostic criteria for checking the assumptions of the Ordinary Least Squares (OLS) regression model.

Multiple regression analysis

Simple regression model

  • Remember from last time that the simplest regression model is one with one outcome variable and one predictor.
  • Suppose we have the Total Fertility Rate (TFR) for each country and we want to examine the relationship between the TFR and life expectancy. We could write the simple regression:

\(TFR_i = \beta_0 +\beta_1 * \text{life expectancy}_i + \epsilon_i\)

  • Where \(\beta_0\) is the model intercept (mean of y when x=0), \(\beta_1\) is the slope relating life expectancy to the TFR, and \(\epsilon_i\) is the residual (unexplained) term of the model. The residual term is the unexplained random variation for each individual country, not accounted for by the independent variables.

The Multiple regression model adds other terms to this model, but we are still well within our beloved linear model framework. For instance, if we were to include the gross domestic product of the countries in our analysis, we could re-write our model as:

\(TFR_i = \beta_0 +\beta_1 * \text{life expectancy}_i +\beta_2 * \text{GDP}_i + \epsilon_i\)

  • So we are trying to predict the ith observation’s dependent variable using two predictor variables. We interpret the effects of these variables on our outcome using the direction and magnitude of the \(\beta_1\) and \(\beta_2\) parameters.

  • The assumptions of this model are the same as for the simple linear regression model:
  • Independence of the errors
  • Constant variance (homoskedasticity)
  • Normality of the error terms
  • Linearity of the relationship

More predictors

  • We can easily incorporate more than 2 predictors in the model as other additive terms. A generalization of our model would be:

\(y_i = \beta_0 +\sum_{k=1}^{p} \beta_k * x_{ik} + \epsilon_i\)

We can also pull the \(\beta_0\) term inside the sum, treating it as the coefficient on a constant predictor \(x_{i0}=1\), and just write:

\(y_i = \sum_{k=0}^{p} \beta_k * x_{ik} + \epsilon_i\)

This works because, if we look at what the computer actually sees in the linear model design matrix, there is another column of information, a vector of 1’s corresponding to the intercept:

Matrix form of the linear model

If we have a linear model with two predictors: \(y_i = \beta_0 +\beta_1* x_{1i} + \beta_2* x_{2i}+ \epsilon_i\)

The computer arranges these values into a series of matrices:

\[\begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n\\ \end{pmatrix} = \begin{pmatrix} 1 & x_{11} &x_{21}\\ 1 & x_{12} &x_{22}\\ \vdots & \vdots & \vdots\\ 1 & x_{1n} &x_{2n}\\ \end{pmatrix} * \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \end{pmatrix} + \begin{pmatrix} \epsilon_1\\ \epsilon_2\\ \vdots \\ \epsilon_n\\ \end{pmatrix}\]
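In R, the model.matrix() function shows this design matrix directly, including the leading column of 1’s. Here is a minimal sketch using a small made-up data frame (the names y, x1 and x2 are purely for illustration):

# toy data, purely to illustrate the design matrix
d <- data.frame(y = c(2, 4, 3, 5), x1 = c(1, 2, 3, 4), x2 = c(10, 20, 30, 40))

# the matrix R actually builds: an intercept column of 1's, then x1 and x2
model.matrix(y ~ x1 + x2, data = d)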

This equation can be solved for the \(\beta\)’s using the ordinary least squares estimator (which, by the Gauss-Markov theorem, is the best linear unbiased estimator):

\[\mathbf{\beta} = (\mathbf{X'X})^{-1} \mathbf{X'Y}\] where the ' indicates a matrix transpose, and the -1 indicates a matrix inverse. This is just a more compact notation of what we saw last time, where we solved for

\(\beta_1 = \frac {\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\)

In the matrix equation above, the \(\mathbf{X'X}\) term plays the role of the denominator and is referred to as the sums of squares matrix; the \(\mathbf{X'Y}\) term is the numerator from the simple regression equation and is referred to as the cross-products matrix. The matrix inverse plays the role of division, so what we saw last time generalizes to any number of x’s.
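To see the connection, consider a single mean-centered predictor, so that \(\mathbf{X}\) is just the column of \(x_i - \bar{x}\) values (the intercept column is orthogonal to it, since the centered x’s sum to zero, so it does not affect the slope). The matrix formula then collapses to the familiar simple regression slope:

\[\hat{\beta}_1 = (\mathbf{X'X})^{-1} \mathbf{X'Y} = \frac{\sum (x_i - \bar{x}) y_i}{\sum (x_i - \bar{x})^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\]

where the last step uses the fact that \(\sum (x_i - \bar{x})\bar{y} = 0\).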

We can see the math in action as:

library(broom)
library(readr)
library(dplyr)
library(ggplot2)
prb<-read_csv(file = "https://raw.githubusercontent.com/coreysparks/data/master/PRB2008_All.csv", col_types = read_rds(url("https://raw.githubusercontent.com/coreysparks/r_courses/master/prbspec.rds")))
names(prb)<-tolower(names(prb))

# keep the three analysis variables and complete cases only
prb2<-prb[ c("tfr", "e0total","gnippppercapitausdollars" )]
prb2<- prb2[complete.cases(prb2),]

# outcome vector and design matrix (with a leading column of 1's)
y<-as.matrix(prb2[,"tfr"])
x<-as.matrix(cbind(1, prb2[ c("e0total","gnippppercapitausdollars" )]))

# beta = (X'X)^-1 X'Y
solve(t(x)%*%x)%*%(t(x)%*%y)
##                                    tfr
## 1                         1.075895e+01
## e0total                  -1.124095e-01
## gnippppercapitausdollars -8.066120e-06

Which matches the easy way:

fit<-lm(tfr~e0total+gnippppercapitausdollars, data=prb)
coef(fit)
##              (Intercept)                  e0total gnippppercapitausdollars 
##             1.075895e+01            -1.124095e-01            -8.066120e-06

The regression line is then:

\(TFR_i = 10.759 - 0.112 * \text{life expectancy}_i - 8.1\times 10^{-6} * \text{GDP}_i\)

  • So we observe a negative relationship between the TFR and life expectancy (as seen in the negative \(\beta_1\) parameter), as well as a negative relationship between TFR and GDP per capita.

We can visualize these relationships in their partial forms using the car library:

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
avPlots(fit,terms =~.)

Interpretation of the regression terms

  • Just as in the simple regression analysis, we construct hypothesis tests that each of our \(\beta\)’s = 0 using the t-test formed by dividing \(\beta\) / s.e.\((\beta)\).
  • We can create an ANOVA table showing the sources of variation, and calculate an F-test of the null hypothesis that all of the \(\beta\)’s = 0 (see the sketch after this list for pulling these tests out in R).
  • For the multiple regression model, we talk of the effects of each predictor as partial effects, meaning the effect of that particular variable, given all of the other predictors in the model, with the other predictors held constant (usually at their means).
  • This allows us to interpret the effect of each predictor, but we must avoid the temptation to say “this predictor is more important than that other predictor”
  • This is a problem if the covariates are not on the same scale
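For example, one way to pull these tests out of the fit object from above is with the broom package we loaded earlier (a sketch; output not shown):

tidy(fit)    # per-coefficient estimates, standard errors, and t-tests of each beta = 0
glance(fit)  # model-level summaries, including the overall F-statistic
anova(fit)   # ANOVA table showing the sources of variation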

Standardized Coefficients

  • If we have predictors measured on different scales, say GDP per capita and life expectancy, the variables are measured in different units, and the ranges of their values can differ by several orders of magnitude
  • E.g. the range for
  • GDP per capita = $290 to $64,440
  • Life expectancy = 33 to 82 years

For predictor variables with large ranges (and large variances), we often see values of regression coefficients that are VERY small - e.g. in the model above, the coefficient for GDP is \(-8.0661203\times 10^{-6}\). That’s because this is the effect of a $1 change in GDP per capita on the TFR, which is tiny. We could divide the GDP by some scaling value such as 1000, and we would get:

fit<-lm(tfr~e0total+I(gnippppercapitausdollars/1000), data=prb)
coef(fit)
##                      (Intercept)                          e0total 
##                      10.75894930                      -0.11240949 
## I(gnippppercapitausdollars/1000) 
##                      -0.00806612

and see that a $1000 increase in GDP per capita leads to a decrease in the TFR of about 0.008, which is still small, but the scale is a little easier to comprehend.

A more attractive method is to scale all variables in the analysis to a common scale prior to analysis. An easy way to do this is to z-score the variables prior to analysis. This is where we subtract the mean of each variable from each value and divide by the standard deviation.

\(x_{zi} = \frac{x_i - \bar{x}}{s_x}\)

When we z-score the variables, they are now on the scale of their standard deviations, which changes the interpretation. The \(\beta\) now represents the change in y resulting from a 1 standard deviation change in x. For the GDP example above, the \(\beta\) would reflect the change in TFR associated with a one standard deviation (roughly $13,500) increase in GDP per capita.
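As a quick sketch of this formula, we can z-score life expectancy by hand using the complete-case prb2 data from above; this is exactly what scale() will do for us:

# manual z-score of life expectancy; equivalent to scale(prb2$e0total)
z_e0 <- (prb2$e0total - mean(prb2$e0total)) / sd(prb2$e0total)
round(c(mean = mean(z_e0), sd = sd(z_e0)), 2)  # mean 0, sd 1 after standardizing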

In R, the scale() function can do this internally in a model statement. Here goes:

fitz<-lm(tfr~scale(e0total)+scale(gnippppercapitausdollars), data=prb)
summary(fitz)
## 
## Call:
## lm(formula = tfr ~ scale(e0total) + scale(gnippppercapitausdollars), 
##     data = prb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2097 -0.5243  0.0184  0.6143  2.7535 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      3.03667    0.07289  41.664   <2e-16 ***
## scale(e0total)                  -1.23544    0.09490 -13.018   <2e-16 ***
## scale(gnippppercapitausdollars) -0.10903    0.09495  -1.148    0.252    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9661 on 173 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.6498, Adjusted R-squared:  0.6457 
## F-statistic: 160.5 on 2 and 173 DF,  p-value: < 2.2e-16

So the coefficient for \(GDP_z\) is now -0.109, which says that a 1 standard deviation increase in GDP per capita leads to a decrease in the TFR of 0.109. This is equivalent to a change in GDP per capita of about \(1.35\times 10^{4}\) dollars.

Likewise, a 1 standard deviation change in life expectancy leads to a 1.235 decrease in the TFR, which is a much larger effect. This corresponds to a change in life expectancy of about 11 years.

This alleviates the problem of differential scaling among variables, but caution must be used:

  • A standard deviation increase in one variable is not the same as that for another variable
  • What about binary variables?
  • In general try to make your interpretations in terms of the original units, not standard units, but these are useful for comparing relative strengths of association between predictors on different scales.

Check assumptions

As with any model, we still need to check the assumptions:

plot(fit)

#normality check
ks.test(rstudent(fit), y = pnorm)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  rstudent(fit)
## D = 0.053334, p-value = 0.6986
## alternative hypothesis: two-sided
#heteroskedasticity check
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fit)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 35.204, df = 2, p-value = 2.267e-08

So we have a heteroskedasticity problem. This means that the errors do not have constant variance. This is bad because we use the errors to make our tests for our \(\beta\)’s. In matrix terms, when we have OLS estimates of \(\beta\), we can find the variance-covariance matrix of the \(\beta\)’s as:

\(s^2(\beta) = \text{MSE} (\mathbf{X'X})^{-1}\)
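As a quick sanity check (a sketch, assuming the fit object from above), we can reproduce what R’s vcov() reports using this formula:

X <- model.matrix(fit)        # the design matrix lm() used
mse <- summary(fit)$sigma^2   # the mean squared error
all.equal(vcov(fit), mse * solve(t(X) %*% X))  # should be TRUE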

When \(\text{MSE}\) is not constant, meaning that it changes with the values of X, then the standard errors are incorrect. So, instead of assuming the \(\text{MSE}\) is constant, we can use the empirical patterns in the residuals and construct heteroskedasticity-consistent standard errors, or robust standard errors. The econometrician Halbert White was one of a series of folks to derive estimates of the standard errors for the linear model that are consistent in the presence of unequal residual variances; these are commonly called White’s standard errors. We can use some R functions to get these: in the lmtest library we will use the coeftest() function, and in the car library we will use the hccm() function to generate heteroskedasticity-corrected covariance matrices for the \(\beta\)’s.

 coeftest(fit, vcov. = hccm(fit, type = "hc0"))
## 
## t test of coefficients:
## 
##                                    Estimate Std. Error t value  Pr(>|t|)
## (Intercept)                      10.7589493  0.8782440 12.2505 < 2.2e-16
## e0total                          -0.1124095  0.0134593 -8.3518 2.096e-14
## I(gnippppercapitausdollars/1000) -0.0080661  0.0066871 -1.2062    0.2294
##                                     
## (Intercept)                      ***
## e0total                          ***
## I(gnippppercapitausdollars/1000)    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare this to the output from:

summary(fit)
## 
## Call:
## lm(formula = tfr ~ e0total + I(gnippppercapitausdollars/1000), 
##     data = prb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2097 -0.5243  0.0184  0.6143  2.7535 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      10.758949   0.538315  19.986   <2e-16 ***
## e0total                          -0.112409   0.008635 -13.018   <2e-16 ***
## I(gnippppercapitausdollars/1000) -0.008066   0.007024  -1.148    0.252    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9661 on 173 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.6498, Adjusted R-squared:  0.6457 
## F-statistic: 160.5 on 2 and 173 DF,  p-value: < 2.2e-16

We see the t-statistics are much smaller for the e0total effect and the intercept. This is what correcting for heteroskedasticity does for you: before, the tests were wrong, and now they are better. This is important because these are the tests of your hypotheses, so they had better be right.

Use of qualitative predictors

  • We can easily incorporate categorical variables into the model
  • Say we have a variable indicating Hispanic ethnicity, so, if the person reports a Yes to the question “Do you consider yourself to be Hispanic?”, they are coded as \(x_{his}\)=1, otherwise \(x_{his}\)=0

  • This is so-called dummy variable coding

Say we want to study the effect of race on income, so our dependent variable is wages.

  • We have a factor variable for race with 4 levels: White, Black, Asian, Other
  • We are interested in how individuals of nonwhite races compare to whites in terms of income, so we say that our analysis will be in “reference” to whites.

To code this, we need to construct 3 dummy variables:

  • \(x_{black}\): if the person reports their race as black, they are coded as \(x_{black}\)=1, otherwise \(x_{black}\)=0
  • \(x_{asian}\): if the person reports their race as asian, they are coded as \(x_{asian}\)=1, otherwise \(x_{asian}\)=0
  • \(x_{other}\): if the person reports their race as other, they are coded as \(x_{other}\)=1, otherwise \(x_{other}\)=0
  • We do not create a dummy variable for whites, because if an individual reports their race as white, they are accounted for by having 0’s for each of these 3 other variables.

  • This is called reference group coding using dummy variables; get used to doing this a lot. A sketch of how R builds these dummies is shown below.
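Here is a minimal sketch of how R constructs these dummies from a factor, using a small made-up race variable; relevel() sets the reference category:

# hypothetical data, just to show reference group (dummy) coding
race <- factor(c("white", "black", "asian", "other", "white"))
race <- relevel(race, ref = "white")  # make white the reference group
model.matrix(~ race)                  # intercept plus 3 dummy columns; whites get 0's in all 3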

Larger data example

library(haven)

ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

# keep adults in the labor force with positive wages,
# then recode wages, sex, race/ethnicity, and education
newpums<-ipums%>%
  filter(labforce==2, age>=18, incwage>0)%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
         sexrecode=ifelse(sex==1, "male", "female"))%>%
  mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic", 
                          .$hispan ==0 & .$race==1 ~"0nh_white", # "0" prefix sorts first, so this is the reference level
                         .$hispan ==0 & .$race==2 ~"nh_black",
                         .$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
                         .$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
                          .$hispan==9 ~ "missing"))%>%
  mutate(edurec = case_when(.$educd %in% c(0:61)~"nohs", 
                            .$educd %in% c(62:64)~"hs",
                            .$educd %in% c(65:100)~"somecoll",
                            .$educd %in% c(101:116)~"collgrad", 
                            .$educd ==999 ~ "missing"))

#newpums$race_eth<-as.factor(newpums$race_eth)
#newpums$race_eth<-relevel(newpums$race_eth, ref = "nh_white")


ifit<-lm(log(mywage)~scale(age)+scale(I(age^2))+sexrecode+race_eth+edurec, data=newpums)
summary(ifit)
## 
## Call:
## lm(formula = log(mywage) ~ scale(age) + scale(I(age^2)) + sexrecode + 
##     race_eth + edurec, data = newpums)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.994 -0.401  0.143  0.582  4.396 
## 
## Coefficients:
##                   Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      10.563474   0.005396 1957.539  < 2e-16 ***
## scale(age)        2.314694   0.015384  150.457  < 2e-16 ***
## scale(I(age^2))  -2.068041   0.015378 -134.480  < 2e-16 ***
## sexrecodemale     0.424539   0.005265   80.632  < 2e-16 ***
## race_ethhispanic -0.081778   0.008162  -10.020  < 2e-16 ***
## race_ethnh_asian -0.057485   0.011581   -4.964 6.92e-07 ***
## race_ethnh_black -0.202619   0.009154  -22.135  < 2e-16 ***
## race_ethnh_other -0.140497   0.016757   -8.384  < 2e-16 ***
## edurechs         -0.742082   0.006994 -106.106  < 2e-16 ***
## edurecnohs       -1.020649   0.010893  -93.699  < 2e-16 ***
## edurecsomecoll   -0.574522   0.006464  -88.886  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9529 on 132471 degrees of freedom
## Multiple R-squared:  0.3103, Adjusted R-squared:  0.3103 
## F-statistic:  5961 on 10 and 132471 DF,  p-value: < 2.2e-16
bptest(ifit)
## 
##  studentized Breusch-Pagan test
## 
## data:  ifit
## BP = 1692.4, df = 10, p-value < 2.2e-16
coeftest(ifit, vcov. = hccm(ifit, type = "hc0"))
## 
## t test of coefficients:
## 
##                    Estimate Std. Error   t value  Pr(>|t|)    
## (Intercept)      10.5634737  0.0055195 1913.8579 < 2.2e-16 ***
## scale(age)        2.3146944  0.0192379  120.3197 < 2.2e-16 ***
## scale(I(age^2))  -2.0680410  0.0196780 -105.0941 < 2.2e-16 ***
## sexrecodemale     0.4245392  0.0052816   80.3809 < 2.2e-16 ***
## race_ethhispanic -0.0817778  0.0080128  -10.2059 < 2.2e-16 ***
## race_ethnh_asian -0.0574849  0.0121487   -4.7318 2.228e-06 ***
## race_ethnh_black -0.2026188  0.0093721  -21.6193 < 2.2e-16 ***
## race_ethnh_other -0.1404975  0.0177192   -7.9291 2.224e-15 ***
## edurechs         -0.7420823  0.0069592 -106.6336 < 2.2e-16 ***
## edurecnohs       -1.0206493  0.0114874  -88.8497 < 2.2e-16 ***
## edurecsomecoll   -0.5745217  0.0064579  -88.9648 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(sjPlot)
sjp.lm(ifit, sort.est = F, title = "Regression effects from IPUMS data- Wage as outcome", show.summary = T)
## Warning: 'sjstats::get_model_pval' is deprecated.
## Use 'p_value' instead.
## See help("Deprecated")