Olesya Volchenko and Anna Shirokanova
April 3, 2023
Definition: Linear regression is a statistical technique where a continuous outcome is regressed on predictors, assuming the relationships among them are linear.
Depending on the outcome type and the ‘link function’ (the shape of relationship), regressions can have various names.
Compare: https://stats.idre.ucla.edu/other/dae/
This course covers linear regression only.
image source: http://www.math.com/school/subject2/lessons/S2U4L2GL.html#sm1
Linear regression predicts a continuous (metric) dependent variable (= ‘the outcome’).
Predictor variables can be either categorical or continuous.
Multiple linear regression can include several predictors.
The results generalise to the population if the sample is representative.
The relationship between Y (the outcome) and the Xs (predictors) is described by the linear regression equation y = ax + b, where a is the slope and b is the intercept.
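As a minimal sketch of how the equation y = ax + b is estimated in R, the code below fits a line to simulated data. The data, the seed, and the "true" values (intercept 2, slope 0.5) are assumptions made up for illustration; they do not come from the course materials.

```r
# Simulated data: the true intercept (2) and slope (0.5) are arbitrary choices
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)                    # estimates b (intercept) and a (slope)
b <- coef(fit)[["(Intercept)"]]
a <- coef(fit)[["x"]]

# Predicted outcome for a new observation with x = 4, i.e. a * 4 + b
pred <- predict(fit, newdata = data.frame(x = 4))
```

The estimates a and b should land close to the values used in the simulation, and the prediction is simply the fitted line evaluated at x = 4.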
| Relationship | Third variable |
|---|---|
| The larger the foot size of a kid, the more clever s/he is | ?????? |
| The taller the person, the shorter the hair of that person | ?????? |
| People using the Internet daily in Africa are happier | ?????? |
| Ice-cream sales are positively related to the number of people drowning | ?????? |
| People who attend opera are healthier | ?????? |
| Relationship | Third variable |
|---|---|
| The larger the foot size of a kid, the more clever s/he is | Age |
| The taller the person, the shorter the hair of that person | Gender |
| People using the Internet daily in Africa are happier | Income |
| Ice-cream sales are positively related to the number of people drowning | Season |
| People who attend opera are healthier | Income / Status |
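The ice-cream example can be sketched with simulated data. Everything here is an assumption for illustration: the variable names, the effect sizes, and the use of temperature as a stand-in for season. In the simulation, ice-cream sales have no direct effect on drownings; both depend only on temperature.

```r
set.seed(42)
n <- 200
temperature <- runif(n, min = 0, max = 30)                 # the "third variable" (season)
ice_cream   <- 50 + 5 * temperature + rnorm(n, sd = 10)
drownings   <- 1 + 0.2 * temperature + rnorm(n, sd = 1)    # no direct effect of ice cream

naive    <- lm(drownings ~ ice_cream)                # omits the third variable
adjusted <- lm(drownings ~ ice_cream + temperature)  # controls for it

# Coefficient of ice_cream (estimate, SE, t, p) in each model
coef(summary(naive))["ice_cream", ]
coef(summary(adjusted))["ice_cream", ]
```

In the naive model the ice-cream coefficient should look clearly positive, while in the adjusted model it should shrink towards zero, because temperature accounts for the association.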
Example data: variables x, x1 and y, n = 50, normally distributed.
| y | x | x1 |
|---|---|---|
| 114.3889 | 3.093205 | 3 |
| 112.2937 | 3.002092 | 4 |
| 117.7774 | 5.005030 | 3 |
| 126.0127 | 6.619925 | 7 |
| 119.7704 | 5.422178 | 4 |
| 119.3044 | 6.090456 | 1 |
## y x x1
## Min. :110.9 Min. :3.002 Min. : 1.0
## 1st Qu.:117.7 1st Qu.:4.365 1st Qu.: 3.0
## Median :120.0 Median :4.988 Median : 5.5
## Mean :120.6 Mean :5.061 Mean : 5.5
## 3rd Qu.:123.5 3rd Qu.:5.567 3rd Qu.: 8.0
## Max. :138.1 Max. :9.026 Max. :10.0
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 8.7521, df = 48, p-value = 1.647e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6469066 0.8720898
## sample estimates:
## cor
## 0.7840706
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8575 -2.4754 0.0283 2.2571 5.2295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.2367 2.0289 50.882 < 2e-16 ***
## x 3.4357 0.3926 8.752 1.65e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.91 on 48 degrees of freedom
## Multiple R-squared: 0.6148, Adjusted R-squared: 0.6067
## F-statistic: 76.6 on 1 and 48 DF, p-value: 1.647e-11
##
## Call:
## lm(formula = y ~ x + x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.80025 -0.56990 0.02968 0.54171 2.30021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.60458 0.63377 157.16 <2e-16 ***
## x 3.06165 0.11958 25.60 <2e-16 ***
## x1 1.00462 0.04581 21.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8774 on 47 degrees of freedom
## Multiple R-squared: 0.9657, Adjusted R-squared: 0.9642
## F-statistic: 661.8 on 2 and 47 DF, p-value: < 2.2e-16
| Predictors | Model 1 (y ~ x): Estimates | CI | p | Model 2 (y ~ x + x1): Estimates | CI | p |
|---|---|---|---|---|---|---|
| (Intercept) | 103.24 | 99.16 – 107.32 | <0.001 | 99.60 | 98.33 – 100.88 | <0.001 |
| x | 3.44 | 2.65 – 4.23 | <0.001 | 3.06 | 2.82 – 3.30 | <0.001 |
| x1 | | | | 1.00 | 0.91 – 1.10 | <0.001 |
| Observations | 50 | | | 50 | | |
| R2 / R2 adjusted | 0.615 / 0.607 | | | 0.966 / 0.964 | | |
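The comparison of the two models (y ~ x versus y ~ x + x1) can be sketched as follows. The original data set is not available here, so the code simulates data of a similar shape; the data-generating values (100, 3, 1, and the noise level) are assumptions chosen to roughly resemble the reported estimates, and the numbers it produces will not match the output above exactly.

```r
set.seed(123)
n  <- 50
x  <- rnorm(n, mean = 5, sd = 1)
x1 <- sample(1:10, size = n, replace = TRUE)
y  <- 100 + 3 * x + 1 * x1 + rnorm(n, sd = 1)   # assumed data-generating process
data <- data.frame(y, x, x1)

m1 <- lm(y ~ x, data = data)        # simple regression
m2 <- lm(y ~ x + x1, data = data)   # multiple regression

# Share of variance explained by each model
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
r2

# F-test: does adding x1 improve the fit beyond chance?
cmp <- anova(m1, m2)
cmp
```

R-squared never decreases when a predictor is added to a nested model, so the F-test (or the adjusted R-squared) is the more informative check of whether x1 is worth including.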
Dummy variables
A dummy variable is dichotomised (coded 0/1).
Dummies show the differences in intercepts between groups, relative to a ‘reference group’.
If class(var) == "factor" in R, lm() recodes the variable into dummies by default.
Example: 200 respondents, 100 females and 100 males
## age salary sex
## Min. :15.00 Min. : 46.52 F:100
## 1st Qu.:26.00 1st Qu.: 83.59 M:100
## Median :30.00 Median : 96.21
## Mean :30.01 Mean : 94.96
## 3rd Qu.:33.25 3rd Qu.:105.22
## Max. :46.00 Max. :138.11
Outcome: salary

| Predictors | Estimates | p |
|---|---|---|
| (Intercept) | 0.65 | 0.181 |
| age | 2.97 | <0.001 |
| sex [M] | 10.21 | <0.001 |
| Observations | 200 | |
| R2 / R2 adjusted | 0.995 / 0.995 | |
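A model of this kind can be sketched with simulated data. The original salary data are not available, so the values below (age effect 3, male premium 10, noise level) are assumptions chosen to resemble the table, not the authors' data.

```r
set.seed(7)
n   <- 200
sex <- factor(rep(c("F", "M"), each = n / 2))   # "F" comes first, so it is the reference level
age <- round(rnorm(n, mean = 30, sd = 5))
salary <- 3 * age + 10 * (sex == "M") + rnorm(n, sd = 1)

fit <- lm(salary ~ age + sex)   # lm() turns the factor into a dummy named "sexM"
coef(fit)
```

The coefficient labelled sexM is the difference in intercepts between males and the reference group (females) at the same age; the age slope is shared by both groups.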
## educ dummy1 dummy2
## 1 tertiary 0 0
## 2 secondary 1 0
## 3 primary 0 1
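The dummy coding shown above can be reproduced in base R with model.matrix(). This is a small sketch; the three-row vector is a toy example, and the level order is set explicitly so that "tertiary" is the reference group, as in the output above.

```r
educ <- factor(c("tertiary", "secondary", "primary"),
               levels = c("tertiary", "secondary", "primary"))

# Design matrix: an intercept column plus one dummy per non-reference level
X <- model.matrix(~ educ)
X

# Changing the reference group changes which level gets no dummy
educ_ref_primary <- relevel(educ, ref = "primary")
levels(educ_ref_primary)
```

A factor with k levels yields k - 1 dummies; the omitted level is the reference group against which the others are compared.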
Artwork by @allison_horst
x <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)       # constant spread around the line
y2 <- rnorm(n = 100, mean = x, sd = 0.4 * x)  # spread grows with x
par(mfrow = c(1, 2))                          # two panels side by side
plot(x, y1, pch = 16); abline(lm(y1 ~ x), col = "red")
plot(x, y2, pch = 16); abline(lm(y2 ~ x), col = "red")

In the right-hand panel the data points ‘fan out’ as x grows. This pattern is called heteroscedasticity (non-constant residual variance), and it often suggests that a third variable comes into play at high values of X.
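Beyond eyeballing the plot, the ‘fanning out’ pattern can be checked numerically. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R; the helper name bp_test and the seed are assumptions introduced for illustration, and this is one common check rather than the only option.

```r
set.seed(2023)
x  <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)        # constant spread
y2 <- rnorm(n = 100, mean = x, sd = 0.4 * x)   # spread grows with x

# Hypothetical helper: regress squared residuals on x;
# n * R^2 is approximately chi-squared with 1 df under constant variance
bp_test <- function(y, x) {
  res2 <- residuals(lm(y ~ x))^2
  stat <- length(y) * summary(lm(res2 ~ x))$r.squared
  c(statistic = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

bp_y1 <- bp_test(y1, x)
bp_y2 <- bp_test(y2, x)
bp_y1
bp_y2
```

A small p-value suggests the residual variance changes with x, which is what one would typically expect for y2 and not for y1.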
In experiments, we can randomize and control conditions.
This is not possible in observational studies. Therefore, we need to control for a set of variables (usually socio-demographics) to account for those uncontrolled differences and to tell whether the predictor of interest is indeed related to the outcome.
Example: happiness and Internet use
Source: Blavatskyy, P. (2021). Obesity of politicians and corruption in post‐Soviet countries. Economics of Transition and Institutional Change, 29(2), 343-356.
Tables
Outcome: salary

| Predictors | Estimates | p |
|---|---|---|
| (Intercept) | 0.65 | 0.181 |
| age | 2.97 | <0.001 |
| sex [M] | 10.21 | <0.001 |
| Observations | 200 | |
| R2 / R2 adjusted | 0.995 / 0.995 | |
Tables: Examples
Source: Valenzuela, S., Park, N., & Kee, K. F. (2009). Is there social capital in a social network site?: Facebook use and college students’ life satisfaction, trust, and participation. Journal of computer-mediated communication, 14(4), 875-901.
Tables: Examples
Source: Goidel, K., Gaddie, K., & Ehrl, M. (2017). Watching the news and support for democracy: Why media systems matter. Social Science Quarterly, 98(3), 836-855.
Tables: Examples
Source: Baker, L. A., Cahalin, L. P., Gerst, K., & Burr, J. A. (2005). Productive activities and subjective well-being among older adults: The influence of number of activities and time commitment. Social Indicators Research, 73(3), 431-458.
Equations
For example,
Source: Evans, P., & Rauch, J. E. (1999). Bureaucracy and growth: A cross-national analysis of the effects of "Weberian" state structures on economic growth. American Sociological Review, 748-765.
Multiple linear regression in R (Sheffield U): https://www.sheffield.ac.uk/polopoly_fs/1.536483!/file/MASH_multiple_regression_R.pdf
Correlation and Regression course (DataCamp): https://learn.datacamp.com/courses/correlation-and-regression
Intermediate Regression in R (Chapters 1-3, DataCamp): https://learn.datacamp.com/courses/intermediate-regression-in-r
Linear Regression and Modeling (Coursera, Duke U): https://www.coursera.org/learn/linear-regression-model
Data Science: Linear Regression (Edx, Harvard U): https://www.edx.org/course/data-science-linear-regression