Last time we covered simple regression analysis, where we have a single predictor variable. We saw how to describe the relationship between the predictor and the dependent variable, how to test the null hypothesis that this relationship is 0, and various diagnostic criteria for checking the assumptions of the Ordinary Least Squares (OLS) regression model:
\(TFR_i = \beta_0 +\beta_1 * \text{life expectancy}_i + \epsilon_i\)
The multiple regression model adds more terms to this model, but we are still well within our beloved linear model framework. For instance, if we were to include the gross domestic product (GDP) per capita of the countries in our analysis, we could rewrite our model as:
\(TFR_i = \beta_0 +\beta_1 * \text{life expectancy}_i +\beta_2 * \text{GDP}_i + \epsilon_i\)
so we are trying to predict the ith observation’s dependent variable using two predictor variables. We interpret the effects of these variables on our outcome using the direction and magnitude of the \(\beta_1\) and \(\beta_2\) parameters.
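For instance (with made-up numbers), if \(\beta_1 = -0.10\) and \(\beta_2 = -0.00001\), then a one-year increase in life expectancy, holding GDP constant, is associated with a 0.10 decrease in the predicted TFR, and a $1 increase in GDP, holding life expectancy constant, is associated with a 0.00001 decrease.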
Linearity of the relationship
In general, with p predictors, the model is:
\(y_i = \beta_0 +\sum_{k=1}^p \beta_k * x_{ik} + \epsilon_i\)
We can also pull the \(\beta_0\) term inside the sum, and just write:
\(y_i = \sum_{k=0}^p \beta_k * x_{ik} + \epsilon_i\), where \(x_{i0} = 1\) for every observation.
This works because, if we look at what the computer actually sees in the linear model design matrix, there is another column of information, a vector of 1’s, added as the first column of the matrix on the right hand side:
If we have a linear model with two predictors: \(y_i = \beta_0 +\beta_1* x_{1i} + \beta_2* x_{2i}+ \epsilon_i\)
The computer arranges these values into a series of matrices:
\[\begin{pmatrix} y_1\\ y_2\\ \vdots \\ y_n\\ \end{pmatrix} = \begin{pmatrix} 1 & x_{11} &x_{21}\\ 1 & x_{12} &x_{22}\\ \vdots & \vdots & \vdots\\ 1 & x_{1n} &x_{2n}\\ \end{pmatrix} * \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \end{pmatrix} + \begin{pmatrix} \epsilon_1\\ \epsilon_2\\ \vdots \\ \epsilon_n\\ \end{pmatrix}\]
This equation can be solved for the \(\beta\)’s using the ordinary least squares estimator (the estimator that the Gauss-Markov theorem shows to be best linear unbiased):
\[\mathbf{\beta} = (\mathbf{X'X})^{-1} \mathbf{X'Y}\] where the ' indicates a matrix transpose, and the -1 indicates a matrix inverse. This is just a more compact notation of what we saw last time, where we solved for
\(\beta_1 = \frac {\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\)
In the matrix equation above, the \(\mathbf{X'X}\) term plays the role of the denominator and is referred to as the sums of squares matrix, while the \(\mathbf{X'Y}\) term plays the role of the numerator from the simple regression equation and is referred to as the cross-products matrix. Multiplying by the matrix inverse is the analogue of dividing, so what we saw last time generalizes to any number of predictors.
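To make these matrix pieces concrete, here is a minimal sketch using R’s built-in mtcars data (not the PRB data we load below): model.matrix() shows the design matrix with its column of 1’s, and crossprod() forms the \(\mathbf{X'X}\) and \(\mathbf{X'Y}\) matrices.
X <- model.matrix(mpg ~ wt + hp, data = mtcars)  # design matrix, including the column of 1's
head(X)
y <- mtcars$mpg
XtX <- crossprod(X)       # X'X, the sums of squares matrix
XtY <- crossprod(X, y)    # X'Y, the cross-products matrix
solve(XtX, XtY)           # matches coef(lm(mpg ~ wt + hp, data = mtcars))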
We can see the math in action as:
library(broom)
library(readr)
library(dplyr)
library(ggplot2)
prb<-read_csv(file = "https://raw.githubusercontent.com/coreysparks/data/master/PRB2008_All.csv", col_types = read_rds(url("https://raw.githubusercontent.com/coreysparks/r_courses/master/prbspec.rds")))
names(prb)<-tolower(names(prb))
#prb2<-prb%>%na.omit(tfr, e0total,gnippppercapitausdollars )
prb2<-prb[ c("tfr", "e0total","gnippppercapitausdollars" )]
prb2<- prb2[complete.cases(prb2),]
# outcome vector and design matrix (note the column of 1's for the intercept)
y<-as.matrix(prb2[,"tfr"])
x<-as.matrix(cbind(1, prb2[ c("e0total","gnippppercapitausdollars" )]))
# (X'X)^-1 X'Y
solve(t(x)%*%x)%*%(t(x)%*%y)
## tfr
## 1 1.075895e+01
## e0total -1.124095e-01
## gnippppercapitausdollars -8.066120e-06
Which matches the easy way:
fit<-lm(tfr~e0total+gnippppercapitausdollars, data=prb)
coef(fit)
## (Intercept) e0total gnippppercapitausdollars
## 1.075895e+01 -1.124095e-01 -8.066120e-06
The fitted regression line is then:
\(TFR_i = 10.759 - 0.112 * \text{life expectancy}_i - 8.1\times 10^{-6} * \text{GDP per capita}_i\)
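For example, for a hypothetical country with a life expectancy of 70 years and a GDP per capita of $5,000 (made-up values), the line predicts a TFR of about 2.85 (10.759 - 0.1124*70 - 0.0000081*5000); predict() does the same arithmetic:
predict(fit, newdata = data.frame(e0total = 70, gnippppercapitausdollars = 5000))
# about 2.85 children per woman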
We can visualize these relationships in their partial forms using added-variable plots from the car library:
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
avPlots(fit,terms =~.)
For predictor variables with large ranges (and large variances), we often see values of regression coefficients that are VERY small - e.g. in the model above, the coefficient for GDP is \(-8.07\times 10^{-6}\). That’s because this is the effect of a $1 change in GDP per capita on the TFR, which is tiny. We could divide the GDP by some scaling value such as 1000, and we would get:
fit<-lm(tfr~e0total+I(gnippppercapitausdollars/1000), data=prb)
coef(fit)
## (Intercept) e0total
## 10.75894930 -0.11240949
## I(gnippppercapitausdollars/1000)
## -0.00806612
and see that a $1000 increase in GDP leads to a reduction of the TFR of about 0.008, which is still small, but the scale is a little easier to comprehend.
A more attractive method is to put all variables in the analysis on a common scale before fitting the model. An easy way to do this is to z-score the variables: subtract the mean of each variable from each value and divide by the standard deviation.
\(x_{zi} = \frac{x_i - \bar{x}}{s_x}\)
When we z-score the variables, they are on the scale of their standard deviations, which changes the interpretation: each \(\beta\) now represents the change in the outcome associated with a 1 standard deviation change in that x. For the GDP example above, the \(\beta\) would reflect the change in TFR associated with a one standard deviation change in GDP per capita.
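As a quick numeric illustration of the z-score formula (using an arbitrary made-up vector):
v <- c(2, 4, 6, 8, 10)        # made-up example values
z <- (v - mean(v)) / sd(v)    # the z-score formula above
z
mean(z); sd(z)                # mean 0 and sd 1 by construction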
In R, the scale() function can do this internally in a model statement. Here goes:
fitz<-lm(tfr~scale(e0total)+scale(gnippppercapitausdollars), data=prb)
summary(fitz)
##
## Call:
## lm(formula = tfr ~ scale(e0total) + scale(gnippppercapitausdollars),
## data = prb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2097 -0.5243 0.0184 0.6143 2.7535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.03667 0.07289 41.664 <2e-16 ***
## scale(e0total) -1.23544 0.09490 -13.018 <2e-16 ***
## scale(gnippppercapitausdollars) -0.10903 0.09495 -1.148 0.252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9661 on 173 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.6498, Adjusted R-squared: 0.6457
## F-statistic: 160.5 on 2 and 173 DF, p-value: < 2.2e-16
so the coefficient for \(GDP_z\), now -0.109, says that a 1 standard deviation increase in GDP leads to a decrease in the TFR of 0.109. This is equivalent to a change in GDP per capita of about \(1.35\times 10^{4}\) dollars.
Meanwhile, a 1 standard deviation change in life expectancy leads to a -1.235 change in TFR, which is a much larger effect. This corresponds to a change in life expectancy of about 11 years.
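Where do these one standard deviation amounts come from? They are just the standard deviations of the predictors; a quick check (computed here with na.rm = TRUE on the full prb data, so the values should be close to those quoted above):
sd(prb$gnippppercapitausdollars, na.rm = TRUE)  # roughly 13,500 dollars
sd(prb$e0total, na.rm = TRUE)                   # roughly 11 years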
This alleviates the problem of differential scaling among variables, but caution must be used when interpreting the coefficients, since they are no longer in the natural units of the variables.
As with any model, we still need to check the assumptions:
plot(fit)
#normality check
ks.test(rstudent(fit), y = pnorm)
##
## One-sample Kolmogorov-Smirnov test
##
## data: rstudent(fit)
## D = 0.053334, p-value = 0.6986
## alternative hypothesis: two-sided
#heteroskedasticity check
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fit)
##
## studentized Breusch-Pagan test
##
## data: fit
## BP = 35.204, df = 2, p-value = 2.267e-08
So we have a heteroskedasticity problem. This means that the errors do not have constant variance. This is a problem because we use the error variance to construct the tests for our \(\beta\)’s. In matrix terms, when we have OLS estimates of \(\beta\), we can find the variance-covariance matrix of the \(\beta\)’s as:
\(s^2(\beta) = \text{MSE} (\mathbf{X'X})^{-1}\)
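As a quick check of this formula (a sketch using the fit object for the scaled-GDP model above), vcov() should reproduce \(\text{MSE} (\mathbf{X'X})^{-1}\):
X <- model.matrix(fit)          # design matrix from the fitted model
MSE <- summary(fit)$sigma^2     # mean squared error
all.equal(vcov(fit), MSE * solve(crossprod(X)), check.attributes = FALSE)  # should be TRUE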
When \(\text{MSE}\) is not constant, meaning that it changes with values of X, then the standard errors are incorrect. So, instead of assuming the \(\text{MSE}\) is constant, we can use the empirical patterns in the residuals to construct heteroskedasticity-consistent standard errors, or robust standard errors. The econometrician Halbert White was one of a series of researchers to derive estimates of the standard errors for the linear model that are consistent in the presence of unequal residual variances; these are commonly called White’s standard errors. We can use some R functions to get these: in the lmtest library we will use the coeftest() function, and in the car library we will use the hccm() function to generate heteroskedasticity-corrected covariance matrices for the \(\beta\)’s.
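Concretely, the HC0 (White) correction keeps the \((\mathbf{X'X})^{-1}\) pieces but replaces the constant \(\text{MSE}\) with the squared residuals, giving the familiar sandwich form (stated here from the standard derivation; this is what type = "hc0" below is meant to compute):
\[\widehat{\text{Var}}(\hat{\beta}) = (\mathbf{X'X})^{-1} \mathbf{X'} \hat{\Omega} \mathbf{X} (\mathbf{X'X})^{-1}, \quad \hat{\Omega} = \text{diag}(e_1^2, \dots, e_n^2)\]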
coeftest(fit, vcov. = hccm(fit, type = "hc0"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.7589493 0.8782440 12.2505 < 2.2e-16
## e0total -0.1124095 0.0134593 -8.3518 2.096e-14
## I(gnippppercapitausdollars/1000) -0.0080661 0.0066871 -1.2062 0.2294
##
## (Intercept) ***
## e0total ***
## I(gnippppercapitausdollars/1000)
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare this to the output from:
summary(fit)
##
## Call:
## lm(formula = tfr ~ e0total + I(gnippppercapitausdollars/1000),
## data = prb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2097 -0.5243 0.0184 0.6143 2.7535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.758949 0.538315 19.986 <2e-16 ***
## e0total -0.112409 0.008635 -13.018 <2e-16 ***
## I(gnippppercapitausdollars/1000) -0.008066 0.007024 -1.148 0.252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9661 on 173 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.6498, Adjusted R-squared: 0.6457
## F-statistic: 160.5 on 2 and 173 DF, p-value: < 2.2e-16
We see that the t-statistics for the e0total effect and the intercept are much smaller, because the robust standard errors are larger. This is what correcting for heteroskedasticity does for you: before, the tests were based on incorrect standard errors, and now they are more trustworthy. This is important because these are the tests of your hypotheses, so they had better be right.
Often our predictors are categorical rather than continuous. Say we have a variable indicating Hispanic ethnicity: if the person reports a Yes to the question “Do you consider yourself to be Hispanic?”, they are coded as \(x_{his}=1\), otherwise \(x_{his}=0\).
This is so-called dummy variable coding.
Say we want to study the effect of race on income, so our dependent variable is wages.

- We have a factor variable for race with 4 levels: White, Black, Asian, Other
- We are interested in how individuals of nonwhite races compare to whites in terms of incomes, so we say that our analysis will be in “reference” to whites.

To code this, we need to construct 3 dummy variables:

- \(x_{black}\): if the person reports their race as Black, they are coded as \(x_{black}=1\), otherwise \(x_{black}=0\)
- \(x_{asian}\): if the person reports their race as Asian, they are coded as \(x_{asian}=1\), otherwise \(x_{asian}=0\)
- \(x_{other}\): if the person reports their race as Other, they are coded as \(x_{other}=1\), otherwise \(x_{other}=0\)

We do not create a dummy variable for whites, because if the individual reports their race as White, they are accounted for by having 0’s for each of these 3 other variables.
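R builds this coding automatically whenever a factor (or character) variable appears in a model formula. A minimal sketch with a made-up race factor (hypothetical data, just to show the dummy columns):
race <- factor(c("white", "black", "asian", "other"),
               levels = c("white", "black", "asian", "other"))  # "white" is the reference level
model.matrix(~ race)  # an intercept column plus dummies raceblack, raceasian, raceother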
library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
newpums<-ipums%>%
filter(labforce==2, age>=18, incwage>0)%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
sexrecode=ifelse(sex==1, "male", "female"))%>%
mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic",
.$hispan ==0 & .$race==1 ~"0nh_white",
.$hispan ==0 & .$race==2 ~"nh_black",
.$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
.$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
.$hispan==9 ~ "missing"))%>%
mutate(edurec = case_when(.$educd %in% c(0:61)~"nohs",
.$educd %in% c(62:64)~"hs",
.$educd %in% c(65:100)~"somecoll",
.$educd %in% c(101:116)~"collgrad",
.$educd ==999 ~ "missing"))
# because the reference category is named "0nh_white", it sorts first alphabetically and
# becomes the reference level automatically, so an explicit relevel() is not needed:
#newpums$race_eth<-as.factor(newpums$race_eth)
#newpums$race_eth<-relevel(newpums$race_eth, ref = "nh_white")
ifit<-lm(log(mywage)~scale(age)+scale(I(age^2))+sexrecode+race_eth+edurec, data=newpums)
summary(ifit)
##
## Call:
## lm(formula = log(mywage) ~ scale(age) + scale(I(age^2)) + sexrecode +
## race_eth + edurec, data = newpums)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.994 -0.401 0.143 0.582 4.396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.563474 0.005396 1957.539 < 2e-16 ***
## scale(age) 2.314694 0.015384 150.457 < 2e-16 ***
## scale(I(age^2)) -2.068041 0.015378 -134.480 < 2e-16 ***
## sexrecodemale 0.424539 0.005265 80.632 < 2e-16 ***
## race_ethhispanic -0.081778 0.008162 -10.020 < 2e-16 ***
## race_ethnh_asian -0.057485 0.011581 -4.964 6.92e-07 ***
## race_ethnh_black -0.202619 0.009154 -22.135 < 2e-16 ***
## race_ethnh_other -0.140497 0.016757 -8.384 < 2e-16 ***
## edurechs -0.742082 0.006994 -106.106 < 2e-16 ***
## edurecnohs -1.020649 0.010893 -93.699 < 2e-16 ***
## edurecsomecoll -0.574522 0.006464 -88.886 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9529 on 132471 degrees of freedom
## Multiple R-squared: 0.3103, Adjusted R-squared: 0.3103
## F-statistic: 5961 on 10 and 132471 DF, p-value: < 2.2e-16
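Because the outcome is log(wage), each dummy coefficient is a difference in log wages relative to the reference category (non-Hispanic whites, coded 0nh_white above); exponentiating gives an approximate percent difference. A quick sketch using the ifit object:
# approximate percent differences in wages relative to non-Hispanic whites
round(100 * (exp(coef(ifit)[grep("race_eth", names(coef(ifit)))]) - 1), 1)
# e.g. exp(-0.203) - 1 is roughly an 18% lower wage for non-Hispanic Black workers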
bptest(ifit)
##
## studentized Breusch-Pagan test
##
## data: ifit
## BP = 1692.4, df = 10, p-value < 2.2e-16
coeftest(ifit, vcov. = hccm(ifit, type = "hc0"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5634737 0.0055195 1913.8579 < 2.2e-16 ***
## scale(age) 2.3146944 0.0192379 120.3197 < 2.2e-16 ***
## scale(I(age^2)) -2.0680410 0.0196780 -105.0941 < 2.2e-16 ***
## sexrecodemale 0.4245392 0.0052816 80.3809 < 2.2e-16 ***
## race_ethhispanic -0.0817778 0.0080128 -10.2059 < 2.2e-16 ***
## race_ethnh_asian -0.0574849 0.0121487 -4.7318 2.228e-06 ***
## race_ethnh_black -0.2026188 0.0093721 -21.6193 < 2.2e-16 ***
## race_ethnh_other -0.1404975 0.0177192 -7.9291 2.224e-15 ***
## edurechs -0.7420823 0.0069592 -106.6336 < 2.2e-16 ***
## edurecnohs -1.0206493 0.0114874 -88.8497 < 2.2e-16 ***
## edurecsomecoll -0.5745217 0.0064579 -88.9648 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(sjPlot)
sjp.lm(ifit, sort.est = F, title = "Regression effects from IPUMS data- Wage as outcome", show.summary = T)
## Warning: 'sjstats::get_model_pval' is deprecated.
## Use 'p_value' instead.
## See help("Deprecated")