Dr Eitan Tzelgov
We are estimating a model based on fitting a line that minimizes the squared errors!
This means that the linear regression model finds the line that minimizes the (squared) distance between it and the observations
It looks something like this:
\[ y_i = \alpha+x_1\beta_i + x_2\beta_i + x_3\beta_i + \varepsilon_i \]
The intercept (also known as constant), gives us the predicted value for the dependent variable when all the independent variables equal zero.
The \(\beta\)’s tell us the marginal effect/slope associated with each variable (what would one-unit change in each independent variable mean in terms of change in values of the dependent variable?)
Inference regarding significance is based on the \(t\) value (greater than 2 in absolute value?), on the \(p\)-value (smaller than 0.05?) etc.
The \(p\)-value is the risk we are willing to make in rejecting the null-hypothesis by mistake
Therefore, smaller \(p\)-values indicate higher confidence that we are making the correct decision in rejecting the null.
One way to evaluate model fit is to think about how close the line falls relative to the data
In other words, how small the residual (errors) are
To measure model fit, we sum the squared errors \(SS_{res}=\sum e^2_i\)
And compare this sum to the variation in the dependent variable \(SS_{tot}=\sum (y_i-\bar{y})^2\)
Like this: \(R^2=1-\frac{SS_{res}}{SS_{tot}}\)
Higher values (up to 1) indicate that the model is explaining more of the variation in the dependent variable
But \(R^2\) are at least weakly increasing with more predictors/variables
So far, we’ve looked at regression w/ 2 continuous variables
Sometimes, our independent variable can be nominal
For example: gender differences; differences between Home v. EU /RoW students; religious vs. non-religious; conservative vs. liberals etc.
When testing for difference between groups, we use a t-test
This gives us a \(t-value\), and a corresponding \(p-value\) (larger t-values correspond to smaller p values)
The null hypothesis is that there are no differences between groups
if the \(p-value\) is “very small”, we reject this null
There is a difference in the tenure of male and female leaders
Independent variable: gender
Dependent variable: tenure in months
## 'data.frame': 497 obs. of 6 variables:
## $ country: Factor w/ 12 levels "AUS","AUT","BEL",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ name : Factor w/ 473 levels "(death)","(party merger)",..: 13 110 316 242 457 196 331 96 180 211 ...
## $ party : Factor w/ 82 levels "","ÖVP","ALP",..: 15 15 15 15 15 15 15 15 36 36 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 2 2 2 2 2 ...
## $ age : num 60 49 49 47 36 47 62 39 46 64 ...
## $ length : num 20 114 170 78 50 25 25 NA 156 56 ...
## # A tibble: 2 x 2
## gender mean_tenure
## <fct> <dbl>
## 1 Female 44.1
## 2 Male 68.4
##
## Two Sample t-test
##
## data: length by gender
## t = -2.3897, df = 428, p-value = 0.0173
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -44.271149 -4.311499
## sample estimates:
## mean in group Female mean in group Male
## 44.11765 68.40897
##
## Call:
## lm(formula = length ~ gender, data = leaders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.41 -43.41 -20.41 16.81 473.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.118 9.543 4.623 5.02e-06 ***
## genderMale 24.291 10.165 2.390 0.0173 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.15 on 428 degrees of freedom
## (67 observations deleted due to missingness)
## Multiple R-squared: 0.01317, Adjusted R-squared: 0.01086
## F-statistic: 5.711 on 1 and 428 DF, p-value: 0.0173
\(R\) lets us run models with nominal variables
But we need to make sure they are factor variables, for example:
## [1] "factor"
The coefficient attached to this variable gives you the predicted value of the dependent variable for observations with this value
The reference category is subsumed within the intercept
##
## Call:
## lm(formula = length ~ gender, data = leaders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.41 -43.41 -20.41 16.81 473.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.118 9.543 4.623 5.02e-06 ***
## genderMale 24.291 10.165 2.390 0.0173 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.15 on 428 degrees of freedom
## (67 observations deleted due to missingness)
## Multiple R-squared: 0.01317, Adjusted R-squared: 0.01086
## F-statistic: 5.711 on 1 and 428 DF, p-value: 0.0173
As some of you have probably noted, sometimes we have nominal variables with more than two values
Examples: country, party
Here, we pick a reference category
All other values become dummy variables
##
## Call:
## lm(formula = length ~ country, data = leaders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.24 -40.22 -20.06 19.87 487.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.1951 10.6826 5.354 1.42e-07 ***
## countryAUT 8.3674 16.1348 0.519 0.6043
## countryBEL -1.3156 13.0571 -0.101 0.9198
## countryCAN 20.1382 17.5804 1.145 0.2527
## countryDEU -2.2259 13.6418 -0.163 0.8705
## countryESP 23.9185 14.8477 1.611 0.1080
## countryGBR 16.6310 17.8198 0.933 0.3512
## countryHUN -18.2951 24.1247 -0.758 0.4487
## countryISR 15.1049 18.6563 0.810 0.4186
## countryITA 0.6549 18.6563 0.035 0.9720
## countryNOR 26.0402 14.3478 1.815 0.0703 .
## countryROU 11.2166 19.7318 0.568 0.5700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.4 on 418 degrees of freedom
## (67 observations deleted due to missingness)
## Multiple R-squared: 0.02917, Adjusted R-squared: 0.003618
## F-statistic: 1.142 on 11 and 418 DF, p-value: 0.3269
##
## Call:
## lm(formula = length ~ country, data = leaders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.24 -40.22 -20.06 19.87 487.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.826 14.263 5.176 3.53e-07 ***
## countryAUS -16.631 17.820 -0.933 0.351
## countryAUT -8.264 18.699 -0.442 0.659
## countryBEL -17.947 16.118 -1.113 0.266
## countryCAN 3.507 19.959 0.176 0.861
## countryDEU -18.857 16.595 -1.136 0.256
## countryESP 7.288 17.600 0.414 0.679
## countryHUN -34.926 25.910 -1.348 0.178
## countryISR -1.526 20.913 -0.073 0.942
## countryITA -15.976 20.913 -0.764 0.445
## countryNOR 9.409 17.180 0.548 0.584
## countryROU -5.414 21.878 -0.247 0.805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.4 on 418 degrees of freedom
## (67 observations deleted due to missingness)
## Multiple R-squared: 0.02917, Adjusted R-squared: 0.003618
## F-statistic: 1.142 on 11 and 418 DF, p-value: 0.3269
Sometimes we want to make conditional hypotheses
Examples:
“The more x, the more y, but only under condition z”
“The positive effect of x on y will be greater the higher z”
“the effect of economic development on levels of democracy is positive, but only when the country is not a producer of oil”
The effect of one independent variable on the dependent variable depends on the level of another independent variable
Make sure you understand this
## [1] "integer"
##
## Call:
## lm(formula = POLITY ~ GDP90LGN + OPEC + GDP90LGN * OPEC, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9633 -3.4897 0.5645 3.8950 14.0723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.649 3.358 -4.661 1.16e-05 ***
## GDP90LGN 5.687 1.027 5.537 3.37e-07 ***
## OPECOPEC member 15.853 10.985 1.443 0.153
## GDP90LGN:OPECOPEC member -7.431 3.108 -2.391 0.019 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.556 on 85 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3876, Adjusted R-squared: 0.366
## F-statistic: 17.94 on 3 and 85 DF, p-value: 4.168e-09
\(y = a + \beta_1\times x_1 + \beta_2\times x_2 + \beta_3\times x_1 \times x_2\),
or, in our case:
POLITY = a + \(\beta_1\times\) GDP +\(\beta_2\times\) OPEC+ \(\beta_3\times\) GDP\(\times\) OPEC
\(y = a + \beta_1\times x_1\)
or,
POLITY = a + \(\beta_1\times\) GDP
Let’s use the model estimates and make a ‘prediction’ about a specific country:
polity_interact<-summary(lm(POLITY~GDP90LGN+OPEC+GDP90LGN*OPEC, data=dat))
tbl_regression(polity_interact, intercept = T)| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| (Intercept) | -16 | -22, -9.0 | <0.001 |
| GDP90LGN | 5.7 | 3.6, 7.7 | <0.001 |
| OPEC | |||
| Not an OPEC member | — | — | |
| OPEC member | 16 | -6.0, 38 | 0.2 |
| GDP90LGN * OPEC | |||
| GDP90LGN * OPEC member | -7.4 | -14, -1.3 | 0.019 |
| 1 CI = Confidence Interval | |||
\(\hat{y}=-16 + 5.7 \times 4 + 16 \times 1 + (-7.4 \times 4 \times 1)\)
\(\hat{y}=-6.8\)
Now, let’s Suppose the same Nambia is not an OPEC member:
\(\hat{y}=-16 + 5.7\times 4\)
Now, \(\hat{y}=6.8\)
AKA, the “Oil Curse”
if not, google Trump↩︎