Lecture 6: Regression

Olesya Volchenko and Anna Shirokanova

April 3, 2023

What do we already know?

Regression

Definition: Linear regression is a statistical technique where a continuous outcome is regressed on predictors, assuming the relationships among them are linear.

Depending on the outcome type and the ‘link function’ (the shape of the relationship), regressions go by various names.

Compare: https://stats.idre.ucla.edu/other/dae/
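For instance, R fits several of these variants through one modelling interface. A minimal sketch (the data frame df and its variables are hypothetical):

lm(income ~ age + educ, data = df)                         # linear regression: continuous outcome
glm(voted ~ age + educ, family = binomial, data = df)      # logistic regression: binary outcome
glm(n_children ~ age + educ, family = poisson, data = df)  # Poisson regression: count outcome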

This course covers linear regression only.

Regression and other methods that we have covered

Regression modelling

image source: http://www.math.com/school/subject2/lessons/S2U4L2GL.html#sm1

Ordinary least squares (OLS)

Statistical inference in regression

Linear regression serves to predict a continuous (metric) dependent variable (‘the outcome’)

Predictor variables can be both categorical and continuous

Multiple linear regression can include several predictors

The results generalise to the population if the sample is representative.

The relationship between Y (the outcome) and the Xs (the predictors) is described by the equation of linear regression, in its simplest form the equation of a straight line: y = ax + b, where a is the slope and b is the intercept

Why do we need regression?

Omitted variable effect (1)

Relationship | Third variable
The larger the foot size of a kid, the more clever s/he is | ??????
The taller the person, the shorter the hair of that person | ??????
People using the Internet daily in Africa are happier | ??????
Ice-cream sales are positively related to the number of people drowning | ??????
People who attend opera are healthier | ??????

Omitted variable effect (2)

Relationship | Third variable
The larger the foot size of a kid, the more clever s/he is | Age
The taller the person, the shorter the hair of that person | Gender
People using the Internet daily in Africa are happier | Income
Ice-cream sales are positively related to the number of people drowning | Season
People who attend opera are healthier | Income / Status

Toy data example

There are three variables, x, x1 and y, each with n = 50 observations, normally distributed

y x x1
114.3889 3.093205 3
112.2937 3.002092 4
117.7774 5.005030 3
126.0127 6.619925 7
119.7704 5.422178 4
119.3044 6.090456 1
##        y               x               x1      
##  Min.   :110.9   Min.   :3.002   Min.   : 1.0  
##  1st Qu.:117.7   1st Qu.:4.365   1st Qu.: 3.0  
##  Median :120.0   Median :4.988   Median : 5.5  
##  Mean   :120.6   Mean   :5.061   Mean   : 5.5  
##  3rd Qu.:123.5   3rd Qu.:5.567   3rd Qu.: 8.0  
##  Max.   :138.1   Max.   :9.026   Max.   :10.0
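The slides do not show how this toy data set was generated; the following is a plausible reconstruction consistent with the summaries above (the seed and exact parameters are assumptions):

set.seed(123)                                   # hypothetical seed
x  <- rnorm(50, mean = 5, sd = 1)               # continuous predictor
x1 <- sample(rep(1:10, each = 5))               # second predictor: values 1-10, shuffled
y  <- 100 + 3 * x + 1 * x1 + rnorm(50, sd = 1)  # outcome built from both predictors plus noise
data <- data.frame(y, x, x1)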

Visualise their relationship

plot(x, y, pch = 16)

Use correlation to estimate the strength of their relationship

cor.test(x, y, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 8.7521, df = 48, p-value = 1.647e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6469066 0.8720898
## sample estimates:
##       cor 
## 0.7840706

Estimate the model

model1 <- lm(y ~ x, data = data)
summary(model1)
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8575 -2.4754  0.0283  2.2571  5.2295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 103.2367     2.0289  50.882  < 2e-16 ***
## x             3.4357     0.3926   8.752 1.65e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.91 on 48 degrees of freedom
## Multiple R-squared:  0.6148, Adjusted R-squared:  0.6067 
## F-statistic:  76.6 on 1 and 48 DF,  p-value: 1.647e-11
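A few accessor functions extract the key quantities from the fitted model; a short sketch using model1 from above:

coef(model1)                                        # intercept and slope estimates
confint(model1)                                     # 95% confidence intervals for the coefficients
predict(model1, newdata = data.frame(x = c(4, 6)))  # predicted y at chosen values of x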

What are the conclusions?

Plot
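The slide shows the fitted regression line drawn over the scatterplot; code along these lines reproduces it:

plot(x, y, pch = 16)         # scatterplot of the toy data
abline(model1, col = "red")  # add the fitted regression line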

Any questions so far?

Residuals
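Residuals are the differences between the observed and the fitted values of y; in R they can be inspected directly (a sketch using model1):

resid(model1)        # residuals: observed y minus fitted y
fitted(model1)       # fitted values lying on the regression line
hist(resid(model1))  # residuals should look roughly symmetric around 0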

Coefficient of determination (R²)
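R² is the share of the variance of y that the model explains. It can be recomputed by hand from the residuals; compare the result with Multiple R-squared = 0.6148 above:

ss_res <- sum(resid(model1)^2)  # residual sum of squares
ss_tot <- sum((y - mean(y))^2)  # total sum of squares
1 - ss_res / ss_tot             # R-squared: share of explained variance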

Several predictors

Let’s add another predictor to the model

model2 <- lm(y ~ x + x1, data = data)
summary(model2)
## 
## Call:
## lm(formula = y ~ x + x1, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80025 -0.56990  0.02968  0.54171  2.30021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.60458    0.63377  157.16   <2e-16 ***
## x            3.06165    0.11958   25.60   <2e-16 ***
## x1           1.00462    0.04581   21.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8774 on 47 degrees of freedom
## Multiple R-squared:  0.9657, Adjusted R-squared:  0.9642 
## F-statistic: 661.8 on 2 and 47 DF,  p-value: < 2.2e-16
sjPlot::tab_model(model1, model2)
Predictors | y (model 1): Estimates (CI), p | y (model 2): Estimates (CI), p
(Intercept) | 103.24 (99.16 – 107.32), p < 0.001 | 99.60 (98.33 – 100.88), p < 0.001
x | 3.44 (2.65 – 4.23), p < 0.001 | 3.06 (2.82 – 3.30), p < 0.001
x1 | – | 1.00 (0.91 – 1.10), p < 0.001
Observations | 50 | 50
R² / R² adjusted | 0.615 / 0.607 | 0.966 / 0.964
# run the following code in R if you'd like to see what a regression with 2 numeric predictors looks like
#library(car)
#scatter3d(x = x, y = y, z = x1)
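Since model1 is nested in model2, the two can also be compared with a formal F-test; a short sketch:

anova(model1, model2)  # does adding x1 significantly reduce the residual error?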

How many predictors can we add to one regression model?

Questions?

What should I do if I’d like to add a categorical predictor?

Dummy variables

Example: salary of males and females

200 people: 100 males and 100 females

##       age            salary       sex    
##  Min.   :15.00   Min.   : 46.52   F:100  
##  1st Qu.:26.00   1st Qu.: 83.59   M:100  
##  Median :30.00   Median : 96.21          
##  Mean   :30.01   Mean   : 94.96          
##  3rd Qu.:33.25   3rd Qu.:105.22          
##  Max.   :46.00   Max.   :138.11

Example: exploratory plots

Example: the model

model2 <- lm(salary ~ age + sex, data = genderdata)
tab_model(model2, show.ci = F)
Outcome: salary
Predictors | Estimates | p
(Intercept) | 0.65 | 0.181
age | 2.97 | <0.001
sex [M] | 10.21 | <0.001
Observations | 200
R² / R² adjusted | 0.995 / 0.995

Example: plot
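The slide plots salary against age with the two sexes marked separately; a sketch along these lines reproduces it, using the estimates from the table above:

plot(genderdata$age, genderdata$salary, pch = 16,
     col = ifelse(genderdata$sex == "M", "blue", "red"))
abline(a = 0.65, b = 2.97, col = "red")           # fitted line for females (reference category)
abline(a = 0.65 + 10.21, b = 2.97, col = "blue")  # fitted line for males: intercept shifted by the dummy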

What if there are more than 2 categories? (e.g. educational attainment)

##        educ dummy1 dummy2
## 1  tertiary      0      0
## 2 secondary      1      0
## 3   primary      0      1
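R builds such dummies automatically for factor variables; model.matrix() shows the coding it will use. A sketch that reproduces the table above (the explicit levels argument makes tertiary the reference category, as on the slide):

educ <- factor(c("tertiary", "secondary", "primary"),
               levels = c("tertiary", "secondary", "primary"))
model.matrix(~ educ)  # columns educsecondary and educprimary correspond to dummy1 and dummy2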

Any questions on dummy variables?

Another cool example

Artwork by @allison_horst

Linear Regression Assumptions

Homoskedasticity/Heteroskedasticity

x <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)     # constant error variance: homoskedastic
y2 <- rnorm(n = 100, mean = x, sd = 0.4*x)  # error variance grows with x: heteroskedastic
par(mfrow = c(1, 2))                        # show the two plots side by side
plot(x, y1, pch = 16); abline(lm(y1 ~ x), col = "red")
plot(x, y2, pch = 16); abline(lm(y2 ~ x), col = "red")

The right-hand panel, where the data points ‘fan out’ as X grows, shows heteroskedasticity: the error variance is not constant, which suggests a third variable comes into play at high values of X.
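The visual check can be complemented with a formal test; one common option (not shown on the slide) is the Breusch-Pagan test from the lmtest package:

# install.packages("lmtest")  # if the package is not installed
library(lmtest)
bptest(lm(y2 ~ x))  # low p-value: evidence of heteroskedasticity
bptest(lm(y1 ~ x))  # the constant-variance case should not be flagged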

The general idea of control variables

In experiments we can randomize and control conditions.

This is not possible in observational studies. Therefore, we need to control for a set of variables (usually socio-demographics) to take these uncontrolled differences into account and to tell whether the predictor of interest is indeed related to the outcome.

Example: happiness and Internet use
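In R, control variables are simply added to the model formula; a hypothetical sketch (the data set and variable names are illustrative, not from a real survey):

# happiness regressed on Internet use, controlling for socio-demographics
lm(happiness ~ internet_use + age + gender + income + education, data = survey)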

Why do we need control variables? Example

Source: Blavatskyy, P. (2021). Obesity of politicians and corruption in post‐Soviet countries. Economics of Transition and Institutional Change, 29(2), 343-356.

How to report regression modelling?

Tables

tab_model(model2, show.ci = F)
Outcome: salary
Predictors | Estimates | p
(Intercept) | 0.65 | 0.181
age | 2.97 | <0.001
sex [M] | 10.21 | <0.001
Observations | 200
R² / R² adjusted | 0.995 / 0.995

How to report regression modelling?

Tables: Examples

Source: Valenzuela, S., Park, N., & Kee, K. F. (2009). Is there social capital in a social network site?: Facebook use and college students’ life satisfaction, trust, and participation. Journal of Computer-Mediated Communication, 14(4), 875-901.

How to report regression modelling?

Tables: Examples

Source: Goidel, K., Gaddie, K., & Ehrl, M. (2017). Watching the news and support for democracy: Why media systems matter. Social Science Quarterly, 98(3), 836-855.

How to report regression modelling?

Tables: Examples

Source: Baker, L. A., Cahalin, L. P., Gerst, K., & Burr, J. A. (2005). Productive activities and subjective well-being among older adults: The influence of number of activities and time commitment. Social Indicators Research, 73(3), 431-458.

What is important when you are reading regression table

How to report regression modelling?

Equations
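One way to report a model as an equation, using the salary model estimated earlier (male stands for the dummy sex [M]):

predicted salary = 0.65 + 2.97 × age + 10.21 × male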

For example,

Source: Evans, P., & Rauch, J. E. (1999). Bureaucracy and growth: A cross-national analysis of the effects of “Weberian” state structures on economic growth. American Sociological Review, 748-765.

Summary: nuts and bolts of regression modelling

What’s next?

All models are wrong, but some are useful. (George E. P. Box)

Any questions now?

Image source: http://www.ninandrews.com/blog/2018/1/11/where-am-