Регрессия: начало

Олеся Волченко

22 ноября 2022

Что мы уже умеем

Что мы уже умеем

Регрессия и другие тесты

Зачем нужна регрессия?

Эффект неучтенной переменной

Взаимосвязь третья переменная
Чем больше размер ноги у ребенка, тем он/она умнее ?
Чем меньше рост человека, тем длиннее у него волосы ?
Успеваемость детей выше в больших классах ?
Люди, использующие интернет, более счастливы ?
Продажи мороженого положительно связаны с числом утонувших ?

Регрессионное моделирование

image source: http://www.math.com/school/subject2/lessons/S2U4L2GL.html#sm1

Метод наименьших квадратов (Ordinary Least Squares)

Метод наименьших квадратов (Ordinary Least Squares)

Пример на сгенерированных данных

Есть 2 переменные: x и y

x <- rnorm(50, 0, 1)
y <- 3 * x + rnorm(50, 0, 1)
data <- data.frame(x, y)

library(knitr)
kable(data)
x y
-0.0420556 0.5375396
0.8294886 2.8540532
-0.0365314 0.7394153
0.2105915 1.8580477
0.9946852 2.3642316
-1.6090263 -2.7957574
-1.8319644 -5.2320124
1.1489460 4.5646913
-2.0052268 -6.2681064
-0.4576069 -1.9061540
-0.8430764 -3.3892007
0.2194096 1.3838598
0.6534667 1.4168551
0.9266658 3.5620675
0.2123688 1.4899677
-1.1216443 -3.5516881
0.8906316 3.0148303
1.0072476 3.7730458
-0.4796909 -2.4513453
0.0556535 0.0458745
0.1889695 -0.2651105
0.1163043 -0.0669968
-2.0867750 -7.6099401
1.4688240 4.6039039
-0.4476705 -1.9104429
0.1508349 -0.1324566
1.3299648 3.8582357
-1.4597276 -3.7786285
2.0213237 5.6173115
0.3902761 1.9510453
-0.8059297 -0.5299832
-0.9580052 -2.4226523
0.6459511 2.1698346
2.0494385 5.7930219
0.2644132 2.2680020
1.1738013 1.7261510
0.1894638 0.6644543
0.2548628 0.5268850
0.3071534 0.1429770
0.1029325 0.0201419
1.8215032 5.3972471
0.7463561 1.5809223
-0.3374921 -1.8074850
-0.6843752 -3.0694714
0.6212784 2.0992369
0.3754429 1.3152758
1.6351216 3.9681053
0.3276417 2.3506046
0.7083033 0.3979142
0.2424356 0.0150273

Визуализируем взаимосвязь

plot(x, y, pch = 16)

Используем корреляцию, чтобы установить, как связаны эти переменные

cor.test(x, y, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 23.256, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9274307 0.9762872
## sample estimates:
##       cor 
## 0.9583755

Оценим модель

model1 <- lm(y ~ x, data = data)
summary(model1)
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71322 -0.60132 -0.08079  0.61095  1.90246 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.006924   0.125357   0.055    0.956    
## x           2.924218   0.125741  23.256   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8716 on 48 degrees of freedom
## Multiple R-squared:  0.9185, Adjusted R-squared:  0.9168 
## F-statistic: 540.8 on 1 and 48 DF,  p-value: < 2.2e-16

Какие выводы можно сделать?

График

Коэффициент детерминации R2

Другие способы оценки качества модели

Остатки

Dummy-переменные / фиктивные переменные

Dummy-переменные:

Пример: данные о зарплате молодых специалистов

Данные о зарплате и возрасте 20 мужчин и женщин

age salary sex
36 117.07733 M
39 127.09086 M
34 112.68538 M
29 98.48969 M
25 86.51680 M
34 110.50185 M
30 99.26856 M
36 118.09416 M
31 102.80794 M
31 101.94750 M
36 108.88374 F
23 67.45167 F
36 108.70234 F
22 65.09488 F
31 91.36838 F
20 57.64829 F
28 83.64096 F
46 137.19560 F
29 85.96395 F
32 93.91930 F

Пример: эксплораторные графики

par(mfrow = c(1, 2))
plot(genderdata$age, genderdata$salary, pch = 16, xlab = "age", ylab = "salary")
boxplot(genderdata$salary ~ genderdata$sex)

Пример: модель

model2 <- lm(salary ~ age + sex, data = genderdata)
summary(model2)
## 
## Call:
## lm(formula = salary ~ age + sex, data = genderdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.49009 -0.79342 -0.09051  0.63942  1.78846 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.80055    1.27777  -1.409    0.177    
## age          3.02929    0.04065  74.524  < 2e-16 ***
## sexM        10.79666    0.48940  22.061 5.99e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.076 on 17 degrees of freedom
## Multiple R-squared:  0.9975, Adjusted R-squared:  0.9972 
## F-statistic:  3435 on 2 and 17 DF,  p-value: < 2.2e-16
library(sjPlot)
tab_model(model2, show.ci = F)
  salary
Predictors Estimates p
(Intercept) -1.80 0.177
age 3.03 <0.001
sex [M] 10.80 <0.001
Observations 20
R2 / R2 adjusted 0.998 / 0.997

Пример: График

Что делать, если категорий в предикторе больше, чем 2? (например, образование)

##        educ dummy1 dummy1.1
## 1  tertiary      0        0
## 2 secondary      1        0
## 3   primary      0        1

Реальный пример

Зависимая переменная

Независимые переменные

data <- read.spss("/Users/olesyavolchenko/Yandex.Disk.localized/datafiles/ESS/ESS7e02_2.sav", to.data.frame = T, use.value.labels = T)
data1 <- data[which(data$cntry == "United Kingdom"), ] 
data1$trust <- as.numeric(data1$ppltrst) + as.numeric(data1$pplfair) + as.numeric(data1$pplhlp)
data1$agea <- as.numeric(as.character(data1$agea))
data1$eduyrs <- as.numeric(as.character(data1$eduyrs))
data1$hinctnta <- as.numeric(data1$hinctnta)
library(table1)
tab1 <- table1(~ trust + agea + eduyrs + hinctnta + marsts, data = data1)

Overall
(N=2264)
trust
Mean (SD) 20.1 (5.22)
Median [Min, Max] 21.0 [3.00, 33.0]
Missing 14 (0.6%)
agea
Mean (SD) 52.2 (18.4)
Median [Min, Max] 52.0 [15.0, 94.0]
Missing 21 (0.9%)
eduyrs
Mean (SD) 13.6 (3.75)
Median [Min, Max] 13.0 [0, 47.0]
Missing 14 (0.6%)
hinctnta
Mean (SD) 4.95 (3.00)
Median [Min, Max] 5.00 [1.00, 10.0]
Missing 363 (16.0%)
marsts
Legally married 119 (5.3%)
In a legally registered civil union 0 (0%)
Legally separated 0 (0%)
Legally divorced/civil union dissolved 280 (12.4%)
Widowed/civil partner died 278 (12.3%)
None of these (NEVER married or in legally registered civil union) 621 (27.4%)
Missing 966 (42.7%)

Распределение зависимой переменной

## 
##  Shapiro-Wilk normality test
## 
## data:  data1$trust
## W = 0.97308, p-value < 2.2e-16

Модель

model1 <- lm(trust ~ gndr + agea + eduyrs + hinctnta + marsts, data = data1)
summary(model1)
## 
## Call:
## lm(formula = trust ~ gndr + agea + eduyrs + hinctnta + marsts, 
##     data = data1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.4422  -3.1892   0.6636   3.7162  15.6552 
## 
## Coefficients:
##                                                                          Estimate
## (Intercept)                                                              13.65197
## gndrFemale                                                                0.21721
## agea                                                                      0.04946
## eduyrs                                                                    0.20883
## hinctnta                                                                  0.15818
## marstsLegally divorced/civil union dissolved                             -0.60940
## marstsWidowed/civil partner died                                          0.58870
## marstsNone of these (NEVER married or in legally registered civil union)  0.18298
##                                                                          Std. Error
## (Intercept)                                                                 1.20466
## gndrFemale                                                                  0.34950
## agea                                                                        0.01280
## eduyrs                                                                      0.04971
## hinctnta                                                                    0.06588
## marstsLegally divorced/civil union dissolved                                0.65411
## marstsWidowed/civil partner died                                            0.70527
## marstsNone of these (NEVER married or in legally registered civil union)    0.63993
##                                                                          t value
## (Intercept)                                                               11.333
## gndrFemale                                                                 0.621
## agea                                                                       3.865
## eduyrs                                                                     4.201
## hinctnta                                                                   2.401
## marstsLegally divorced/civil union dissolved                              -0.932
## marstsWidowed/civil partner died                                           0.835
## marstsNone of these (NEVER married or in legally registered civil union)   0.286
##                                                                          Pr(>|t|)
## (Intercept)                                                               < 2e-16
## gndrFemale                                                               0.534407
## agea                                                                     0.000118
## eduyrs                                                                   2.88e-05
## hinctnta                                                                 0.016517
## marstsLegally divorced/civil union dissolved                             0.351730
## marstsWidowed/civil partner died                                         0.404072
## marstsNone of these (NEVER married or in legally registered civil union) 0.774978
##                                                                             
## (Intercept)                                                              ***
## gndrFemale                                                                  
## agea                                                                     ***
## eduyrs                                                                   ***
## hinctnta                                                                 *  
## marstsLegally divorced/civil union dissolved                                
## marstsWidowed/civil partner died                                            
## marstsNone of these (NEVER married or in legally registered civil union)    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.438 on 1061 degrees of freedom
##   (1195 пропущенных наблюдений удалены)
## Multiple R-squared:  0.04898,    Adjusted R-squared:  0.0427 
## F-statistic: 7.806 on 7 and 1061 DF,  p-value: 3.082e-09

Таблица

tab_model(model1, show.ci = F)
  trust
Predictors Estimates p
(Intercept) 13.65 <0.001
gndr [Female] 0.22 0.534
agea 0.05 <0.001
eduyrs 0.21 <0.001
hinctnta 0.16 0.017
marsts [Legally
divorced/civil union
dissolved]
-0.61 0.352
marsts [Widowed/civil
partner died]
0.59 0.404
marsts [None of these
(NEVER married or in
legally registered civil
union)]
0.18 0.775
Observations 1069
R2 / R2 adjusted 0.049 / 0.043

Поменять опорную категорию в семейном положении

data1$marsts1 <- relevel(data1$marsts, ref = "Widowed/civil partner died")
model1 <- lm(trust ~ gndr + agea + eduyrs + hinctnta + marsts1, data = data1)
summary(model1)
## 
## Call:
## lm(formula = trust ~ gndr + agea + eduyrs + hinctnta + marsts1, 
##     data = data1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.4422  -3.1892   0.6636   3.7162  15.6552 
## 
## Coefficients:
##                                                                           Estimate
## (Intercept)                                                               14.24066
## gndrFemale                                                                 0.21721
## agea                                                                       0.04946
## eduyrs                                                                     0.20883
## hinctnta                                                                   0.15818
## marsts1Legally married                                                    -0.58870
## marsts1Legally divorced/civil union dissolved                             -1.19810
## marsts1None of these (NEVER married or in legally registered civil union) -0.40571
##                                                                           Std. Error
## (Intercept)                                                                  1.28284
## gndrFemale                                                                   0.34950
## agea                                                                         0.01280
## eduyrs                                                                       0.04971
## hinctnta                                                                     0.06588
## marsts1Legally married                                                       0.70527
## marsts1Legally divorced/civil union dissolved                                0.53831
## marsts1None of these (NEVER married or in legally registered civil union)    0.62926
##                                                                           t value
## (Intercept)                                                                11.101
## gndrFemale                                                                  0.621
## agea                                                                        3.865
## eduyrs                                                                      4.201
## hinctnta                                                                    2.401
## marsts1Legally married                                                     -0.835
## marsts1Legally divorced/civil union dissolved                              -2.226
## marsts1None of these (NEVER married or in legally registered civil union)  -0.645
##                                                                           Pr(>|t|)
## (Intercept)                                                                < 2e-16
## gndrFemale                                                                0.534407
## agea                                                                      0.000118
## eduyrs                                                                    2.88e-05
## hinctnta                                                                  0.016517
## marsts1Legally married                                                    0.404072
## marsts1Legally divorced/civil union dissolved                             0.026247
## marsts1None of these (NEVER married or in legally registered civil union) 0.519234
##                                                                              
## (Intercept)                                                               ***
## gndrFemale                                                                   
## agea                                                                      ***
## eduyrs                                                                    ***
## hinctnta                                                                  *  
## marsts1Legally married                                                       
## marsts1Legally divorced/civil union dissolved                             *  
## marsts1None of these (NEVER married or in legally registered civil union)    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.438 on 1061 degrees of freedom
##   (1195 пропущенных наблюдений удалены)
## Multiple R-squared:  0.04898,    Adjusted R-squared:  0.0427 
## F-statistic: 7.806 on 7 and 1061 DF,  p-value: 3.082e-09
model1scaled <- lm(scale(trust) ~ gndr + scale(as.numeric(agea)) + scale(as.numeric(eduyrs)) + 
                     scale(as.numeric(hinctnta)) + marsts1, data = data1)
summary(model1scaled)
## 
## Call:
## lm(formula = scale(trust) ~ gndr + scale(as.numeric(agea)) + 
##     scale(as.numeric(eduyrs)) + scale(as.numeric(hinctnta)) + 
##     marsts1, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5323 -0.6108  0.1271  0.7118  2.9985 
## 
## Coefficients:
##                                                                           Estimate
## (Intercept)                                                                0.05856
## gndrFemale                                                                 0.04160
## scale(as.numeric(agea))                                                    0.17419
## scale(as.numeric(eduyrs))                                                  0.15000
## scale(as.numeric(hinctnta))                                                0.09075
## marsts1Legally married                                                    -0.11276
## marsts1Legally divorced/civil union dissolved                             -0.22948
## marsts1None of these (NEVER married or in legally registered civil union) -0.07771
##                                                                           Std. Error
## (Intercept)                                                                  0.10017
## gndrFemale                                                                   0.06694
## scale(as.numeric(agea))                                                      0.04507
## scale(as.numeric(eduyrs))                                                    0.03571
## scale(as.numeric(hinctnta))                                                  0.03779
## marsts1Legally married                                                       0.13508
## marsts1Legally divorced/civil union dissolved                                0.10311
## marsts1None of these (NEVER married or in legally registered civil union)    0.12053
##                                                                           t value
## (Intercept)                                                                 0.585
## gndrFemale                                                                  0.621
## scale(as.numeric(agea))                                                     3.865
## scale(as.numeric(eduyrs))                                                   4.201
## scale(as.numeric(hinctnta))                                                 2.401
## marsts1Legally married                                                     -0.835
## marsts1Legally divorced/civil union dissolved                              -2.226
## marsts1None of these (NEVER married or in legally registered civil union)  -0.645
##                                                                           Pr(>|t|)
## (Intercept)                                                               0.558979
## gndrFemale                                                                0.534407
## scale(as.numeric(agea))                                                   0.000118
## scale(as.numeric(eduyrs))                                                 2.88e-05
## scale(as.numeric(hinctnta))                                               0.016517
## marsts1Legally married                                                    0.404072
## marsts1Legally divorced/civil union dissolved                             0.026247
## marsts1None of these (NEVER married or in legally registered civil union) 0.519234
##                                                                              
## (Intercept)                                                                  
## gndrFemale                                                                   
## scale(as.numeric(agea))                                                   ***
## scale(as.numeric(eduyrs))                                                 ***
## scale(as.numeric(hinctnta))                                               *  
## marsts1Legally married                                                       
## marsts1Legally divorced/civil union dissolved                             *  
## marsts1None of these (NEVER married or in legally registered civil union)    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.042 on 1061 degrees of freedom
##   (1195 пропущенных наблюдений удалены)
## Multiple R-squared:  0.04898,    Adjusted R-squared:  0.0427 
## F-statistic: 7.806 on 7 and 1061 DF,  p-value: 3.082e-09
tab_model(model1, model1scaled, show.ci = F)
  trust scale(trust)
Predictors Estimates p Estimates p
(Intercept) 14.24 <0.001 0.06 0.559
gndr [Female] 0.22 0.534 0.04 0.534
agea 0.05 <0.001
eduyrs 0.21 <0.001
hinctnta 0.16 0.017
marsts1 [Legally married] -0.59 0.404 -0.11 0.404
marsts1 [Legally
divorced/civil union
dissolved]
-1.20 0.026 -0.23 0.026
marsts1 [None of these
(NEVER married or in
legally registered civil
union)]
-0.41 0.519 -0.08 0.519
agea 0.17 <0.001
eduyrs 0.15 <0.001
hinctnta 0.09 0.017
Observations 1069 1069
R2 / R2 adjusted 0.049 / 0.043 0.049 / 0.043

Обобщение: основные составляющие регрессионного моделирования

Что нужно?

Что мы получим?

Что дальше?

Пример с драконами

Artwork by @allison_horst

Задание

Используя базу данных ESS round 5, на примере одной страны проанализируйте, как связаны идеальный возраст выхода на пенсию (agertr, зависимая переменная) и следующие независимые переменные: возраст (agea), пол (gndr) и семейное положение (marsts)

  1. Визуализировать распределения каждой из переменных
  2. Оценить регрессионную модель со всеми указанными предикторами
  3. Получить стандартизованные коэффициенты
  4. Представить модели в виде таблицы
  5. Представить результаты работы в формате html, полученном при помощи файла .rmd