Regression: Not Only a Mystery Thriller

Olesya Volchenko and Anna Shirokanova

May 13, 2021

What do we already know?

Why do we need regression?

Omitted variable effect (1)

Relationship                                                          Third variable
The larger the foot size of a kid, the cleverer s/he is               ?
The taller the person, the shorter the hair of that person            ?
The larger the school class, the better the average grades            ?
People using the Internet daily in Africa are happier                 ?
Ice-cream sales are positively related to the number of drownings     ?

Omitted variable effect (2)

Relationship                                                          Third variable
The larger the foot size of a kid, the cleverer s/he is               Age
The taller the person, the shorter the hair of that person            Gender
The larger the school class, the better the average grades            School size / equipment
People using the Internet daily in Africa are happier                 Income
Ice-cream sales are positively related to the number of drownings     Season

Regression modelling

Toy data example

We have two normally distributed variables, x and y, with n = 50 observations

y x
0.0949945 -0.2807451
-1.4442674 -0.5776430
-2.4506750 -1.1273309
-2.2369508 -0.9033092
2.1592886 0.4880545
-6.8133458 -1.6500107
##        y                x          
##  Min.   :-8.146   Min.   :-2.7091  
##  1st Qu.:-3.083   1st Qu.:-0.9292  
##  Median :-1.367   Median :-0.3983  
##  Mean   :-1.138   Mean   :-0.3787  
##  3rd Qu.: 1.265   3rd Qu.: 0.2604  
##  Max.   : 7.922   Max.   : 2.4979
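The generating code is not shown on the slides; a minimal sketch that would produce data of this shape (the seed is hypothetical; the slope of 3 and noise sd of 1 are assumptions read back from the model fit below):

set.seed(2021)            # hypothetical seed for reproducibility
x <- rnorm(50)            # 50 draws from a standard normal
y <- 3 * x + rnorm(50)    # linear in x plus standard-normal noise
toydata <- data.frame(y, x)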

Visualise their relationship

plot(x, y, pch = 16)  # scatterplot of y against x; pch = 16 gives filled points

Use correlation to estimate the strength of their relationship

cor.test(x, y, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 21.127, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9134637 0.9715847
## sample estimates:
##       cor 
## 0.9502104
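If only the point estimate is needed, cor() returns it without the test machinery:

cor(x, y)  # same estimate as above, but no test or confidence interval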

Estimate the model

model1 <- lm(y ~ x)  # simple linear regression of y on x
summary(model1)      # coefficients, tests, and fit statistics
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39265 -0.72426  0.04457  0.70943  1.95334 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.009131   0.153084   -0.06    0.953    
## x            2.980894   0.141096   21.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 48 degrees of freedom
## Multiple R-squared:  0.9029, Adjusted R-squared:  0.9009 
## F-statistic: 446.3 on 1 and 48 DF,  p-value: < 2.2e-16

What are the conclusions? The slope is about 2.98 (p < .001), so a one-unit increase in x is associated with an increase of roughly 3 units in y; the intercept is indistinguishable from zero; and the model explains about 90% of the variance in y (R² = 0.90).

Plot
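The fitted line can be added to the scatterplot directly from the model object:

plot(x, y, pch = 16)
abline(model1, col = "red")  # abline() accepts an lm object and draws the fitted line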

Coefficient of determination R²
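R² can be recovered by hand from the model's residuals, as a check on summary(model1):

ss_res <- sum(residuals(model1)^2)  # residual sum of squares
ss_tot <- sum((y - mean(y))^2)      # total sum of squares
1 - ss_res / ss_tot                 # R-squared, matches Multiple R-squared above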

Other ways to evaluate model fit: adjusted R², the residual standard error, the overall F-test, and information criteria such as AIC and BIC.
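In R these are one-liners on the fitted object:

summary(model1)$adj.r.squared  # adjusted R-squared
AIC(model1); BIC(model1)       # information criteria: lower is better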

Several predictors
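With several predictors the formula simply gains terms; a schematic call (the data frame and variable names here are hypothetical):

# each coefficient is the effect of one predictor, holding the others constant
model_multi <- lm(y ~ x1 + x2 + x3, data = mydata)
summary(model_multi)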

Dummy variables

Example: salary of males and females

A sample of 200 males and females, with age, salary, and sex recorded

##       age            salary           sex           
##  Min.   :17.00   Min.   : 49.63   Length:200        
##  1st Qu.:26.75   1st Qu.: 84.08   Class :character  
##  Median :30.00   Median : 94.42   Mode  :character  
##  Mean   :30.09   Mean   : 95.50                     
##  3rd Qu.:34.25   3rd Qu.:106.66                     
##  Max.   :46.00   Max.   :146.75

Example: exploratory plots
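A quick exploratory look with base graphics (a sketch; the original slide figures may differ):

par(mfrow = c(1, 2))
boxplot(salary ~ sex, data = genderdata)           # salary by sex
plot(genderdata$age, genderdata$salary, pch = 16)  # salary by age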

Example: the model

library(sjPlot)  # provides tab_model()
model2 <- lm(salary ~ age + sex, data = genderdata)
tab_model(model2, show.ci = FALSE)
                   salary
Predictors         Estimates   p
(Intercept)        -0.29       0.583
age                 3.02       <0.001
sex [M]             9.85       <0.001
Observations        200
R² / R² adjusted    0.994 / 0.994

Holding age constant, men earn on average 9.85 units more than women; each extra year of age adds about 3 units of salary.

Example: plot
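One way to draw the scatterplot with the two parallel fitted lines implied by model2 (a sketch, not the original figure; colours are arbitrary):

cf <- coef(model2)
plot(genderdata$age, genderdata$salary, pch = 16,
     col = ifelse(genderdata$sex == "M", "steelblue", "tomato"))
abline(a = cf["(Intercept)"], b = cf["age"], col = "tomato")                   # females (reference)
abline(a = cf["(Intercept)"] + cf["sexM"], b = cf["age"], col = "steelblue")   # males: shifted up by 9.85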

What if there are more than 2 categories? (e.g. educational attainment)

##        educ dummy1 dummy2
## 1  tertiary      0      0
## 2 secondary      1      0
## 3   primary      0      1
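R builds such dummies automatically for factors; model.matrix() shows the coding it will use. A sketch reproducing the table above, with tertiary set as the reference category:

educ <- factor(c("tertiary", "secondary", "primary"),
               levels = c("tertiary", "secondary", "primary"))  # tertiary = reference
model.matrix(~ educ)  # one dummy per non-reference level; the reference row is all zeros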

Another cool example

Artwork by @allison_horst

Interaction effect

Model Comparison

model3 <- lm(salary ~ age + sex, data = genderdata2)  # additive model
model4 <- lm(salary ~ age * sex, data = genderdata2)  # age * sex expands to age + sex + age:sex
tab_model(model3, model4, show.ci = FALSE)
                   salary (model3)         salary (model4)
Predictors         Estimates   p           Estimates   p
(Intercept)        9.62        0.164       100.67      <0.001
age                3.97        <0.001      0.98        <0.001
sex [M]            81.04       <0.001      -96.41      <0.001
age * sex [M]                              5.89        <0.001
Observations       200                     200
R² / R² adjusted   0.904 / 0.903           0.990 / 0.990
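A formal way to test whether the interaction improves the fit is the nested-model F-test:

anova(model3, model4)  # F-test comparing the additive and interaction models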

What the Interaction Effect Model Looks Like
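Reading the model4 estimates off the table above, the fitted line differs by sex:

Females (reference group): salary = 100.67 + 0.98 * age
Males: salary = (100.67 - 96.41) + (0.98 + 5.89) * age = 4.26 + 6.87 * age

The interaction coefficient is exactly the difference in age slopes between the two groups.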

What happens if we ignore the interaction effect? model3 forces a single age slope on both sexes, so it misstates both groups' slopes as well as the sex gap, and its fit is visibly worse (R² 0.904 vs 0.990).

Can we identify a possible interaction effect visually?

Sometimes, yes. Heteroskedasticity may indicate a moderation/interaction effect. Let’s recall the assumptions first.

Linear Regression Assumptions

Linearity: the relationship between X and Y is linear
Independence: the errors are independent of one another
Homoskedasticity: the errors have constant variance
Normality: the errors are normally distributed
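Base R's plot method on a fitted lm object draws the standard diagnostic plots for checking these assumptions:

par(mfrow = c(2, 2))
plot(model1)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage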

Homoskedasticity/Heteroskedasticity

x <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)       # constant error variance: homoskedastic
y2 <- rnorm(n = 100, mean = x, sd = 0.4 * x)  # error variance grows with x: heteroskedastic
par(mfrow = c(1, 2))                          # two panels side by side
plot(x, y1, pch = 16); abline(lm(y1 ~ x), col = "red")
plot(x, y2, pch = 16); abline(lm(y2 ~ x), col = "red")

The right-hand panel, where the data points ‘fan out’ at high values of X, shows heteroskedasticity: the error variance is not constant, which may signal a third variable that comes into play at high values of X
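A formal check is the Breusch-Pagan test; the lmtest package (an assumption here, not used on the slides) provides it:

library(lmtest)     # install.packages("lmtest") if needed
bptest(lm(y1 ~ x))  # large p-value expected: no evidence of heteroskedasticity
bptest(lm(y2 ~ x))  # small p-value expected: heteroskedasticity detected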

Anscombe’s Quartet

Feature                          Value
Mean of x                        9.0
Variance of x                    10.0
Mean of y                        7.5
Variance of y                    3.75
Correlation between x and y      0.816
Fitted regression line           y = 3 + 0.5x

All four datasets in the quartet share these summary statistics, yet their scatterplots look completely different: always plot your data before modelling.
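The quartet ships with base R as the anscombe data frame, so the identical statistics and the very different scatterplots are easy to verify:

data(anscombe)
sapply(anscombe, mean)              # every x has mean 9, every y mean ~7.5
cor(anscombe$x1, anscombe$y1)       # ~0.816; the same for the other three pairs
coef(lm(y1 ~ x1, data = anscombe))  # intercept ~3, slope ~0.5
par(mfrow = c(2, 2))
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]], pch = 16)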

Summary: nuts and bolts of regression modelling

What’s next?