L5P3: Regressão romântica

Pergunta: Nesta atividade queremos saber que fatores nos dados de speed dating têm efeito relevante na chance do casal ter um match?

Para isso, utilizaremos a Regressão Logística, pois permite analisar variáveis dependentes binomiais (0 ou 1). Essas variáveis categóricas podem ser reduzidas para duas categorias (positivo/negativo; presente/ausente; 0/1).

odds do evento

\(\frac{p(evento)}{1-p(evento)} = e^{b_0 + b_1.X_1+...+b_n.X_n}\)

Primeiramente, colocamos em distaque a probabilidade do evento \(p(evento) = \frac{1}{1 + e^-{(b_0 + b_1.X_1+...+b_n.X_n)}}\)

Estudando essa função, concluimos que ela varia de 0 a 1. Então, utilizando essa função podemos prever o fenomeno ocorre no presente dado de speed dating. Abaixo temos o código R para responde essa pergunta da atividade L5P3: Regressão romântica

Dados speed dating

dados_romanticos = read_csv(("https://raw.githubusercontent.com/nazareno/ciencia-de-dados-1/master/5-regressao/speed-dating/speed-dating2.csv"), 
  col_types = cols(
     .default = col_double(),
     field = col_character(),
     from = col_character(),
     career = col_character(),
     attr3_s = col_logical(),
     sinc3_s = col_logical(),
     intel3_s = col_logical(),
     fun3_s = col_logical(),
     amb3_s = col_logical(),
     dec = col_character()
))%>% 
     mutate(dec = as.factor(dec))

glimpse(dados_romanticos)
## Rows: 4,918
## Columns: 44
## $ iid      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ gender   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ order    <dbl> 4, 3, 10, 5, 7, 6, 1, 2, 8, 9, 10, 9, 6, 1, 3, 2, 7, 8, 4, 5,~
## $ pid      <dbl> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 11, 12, 13, 14, 15, 1~
## $ int_corr <dbl> 0.14, 0.54, 0.16, 0.61, 0.21, 0.25, 0.34, 0.50, 0.28, -0.36, ~
## $ samerace <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1~
## $ age_o    <dbl> 27, 22, 22, 23, 24, 25, 30, 27, 28, 24, 27, 22, 22, 23, 24, 2~
## $ age      <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 24, 24, 24, 24, 24, 2~
## $ field    <chr> "Law", "Law", "Law", "Law", "Law", "Law", "Law", "Law", "Law"~
## $ race     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ from     <chr> "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chica~
## $ career   <chr> "lawyer", "lawyer", "lawyer", "lawyer", "lawyer", "lawyer", "~
## $ sports   <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3~
## $ tvsports <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 8~
## $ exercise <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~
## $ dining   <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10,~
## $ museums  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5~
## $ art      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5~
## $ hiking   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 8~
## $ gaming   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4~
## $ clubbing <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5~
## $ reading  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 10, 10, 10, 10, 10, 10, 10, 10,~
## $ tv       <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8~
## $ theater  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 7~
## $ movies   <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 8, 8, ~
## $ concerts <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 7, 7, 7, 7, 7, 7, 7, ~
## $ music    <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5~
## $ shopping <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 8~
## $ yoga     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 7~
## $ attr     <dbl> 6, 7, 5, 7, 5, 4, 7, 4, 7, 5, 5, 8, 5, 7, 6, 8, 7, 5, 7, 6, 7~
## $ sinc     <dbl> 9, 8, 8, 6, 6, 9, 6, 9, 6, 6, 7, 5, 8, 9, 8, 7, 5, 8, 6, 7, 9~
## $ intel    <dbl> 7, 7, 9, 8, 7, 7, 7, 7, 8, 6, 8, 6, 9, 7, 7, 8, 9, 7, 8, 8, 1~
## $ fun      <dbl> 7, 8, 8, 7, 7, 4, 4, 6, 9, 8, 4, 6, 6, 6, 9, 3, 6, 5, 9, 7, 7~
## $ amb      <dbl> 6, 5, 5, 6, 6, 6, 6, 5, 8, 10, 6, 9, 3, 5, 7, 6, 7, 9, 4, 9, ~
## $ shar     <dbl> 5, 6, 7, 8, 6, 4, 7, 6, 8, 8, 3, 6, 4, 7, 8, 2, 9, 5, 5, 8, 9~
## $ like     <dbl> 7, 7, 7, 7, 6, 6, 6, 6, 7, 6, 6, 7, 6, 7, 8, 6, 8, 5, 5, 8, 8~
## $ prob     <dbl> 6, 5, NA, 6, 6, 5, 5, 7, 7, 6, 4, 3, 7, 8, 6, 5, 7, 6, 6, 7, ~
## $ match_es <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, N~
## $ attr3_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ sinc3_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ intel3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ fun3_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ amb3_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ dec      <fct> yes, yes, yes, yes, yes, no, yes, no, yes, yes, no, no, no, y~

Regressão Logística foi utilizada para analisar a associação entre variáveis dec que diz se houve match entre os dois participantes do encontro, age que diz a idade do participante 1, age_o que diz a idade do participante 2, sinc quão sincero o participante 1 achou do participante 2 e attr quão atraente o participante 1 achou do participante 2. O formato da função resultante odds: \(p(dec) = \frac{1}{1 + e^-{(0.013 + 0.993.age + 0.992.age_o + 1.03.sinc + 1.96.attr)}}\).

Portanto, ser atraente (attr) aumenta 1.96 de chance com ICs 95% [1.86;2.05], sincero (sinc) aumenta 1.03 de chance com ICs 95% [0.98;1.07], idade (age) do participante 1 aumenta 0.993 de chance com ICs 95% [0.97;1.01] e idade (age_o) do participante 2 aumenta 0.992 de chance com ICs 95% [0.97;1.01]. McFadden = 19% explica a variação da probabilidade.

model = glm(dec ~ age + age_o + sinc + attr, 
                  data = dados_romanticos, 
                  family = "binomial")

tidy(model, conf.int = TRUE, exponentiate = TRUE) %>% select(-statistic, -p.value)
## # A tibble: 5 x 5
##   term        estimate std.error conf.low conf.high
##   <chr>          <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   0.0138    0.407   0.00617    0.0305
## 2 age           0.993     0.0105  0.972      1.01  
## 3 age_o         0.992     0.0103  0.972      1.01  
## 4 sinc          1.03      0.0221  0.986      1.07  
## 5 attr          1.96      0.0246  1.87       2.05
glance(model)
## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1         6362.    4661 -2545. 5100. 5132.    5090.        4657  4662
pR2(model)
## fitting null model for pseudo-r2
##           llh       llhNull            G2      McFadden          r2ML 
## -2545.1299317 -3180.7969512  1271.3340389     0.1998452     0.2386799 
##          r2CU 
##     0.3205876
model %>% 
  augment(type.predict = "response")  
## # A tibble: 4,662 x 12
##    .rownames dec     age age_o  sinc  attr .fitted .resid .std.resid     .hat
##    <chr>     <fct> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>      <dbl>    <dbl>
##  1 1         yes      21    27     9     6   0.406  1.34       1.34  0.00141 
##  2 2         yes      21    22     8     7   0.575  1.05       1.05  0.00124 
##  3 3         yes      21    22     8     5   0.261  1.64       1.64  0.00130 
##  4 4         yes      21    23     6     7   0.559  1.08       1.08  0.00129 
##  5 5         yes      21    24     6     5   0.247  1.67       1.67  0.000959
##  6 6         no       21    25     9     4   0.154 -0.577     -0.578 0.00131 
##  7 7         yes      21    30     6     7   0.544  1.10       1.10  0.00183 
##  8 8         no       21    27     9     4   0.151 -0.573     -0.573 0.00132 
##  9 9         yes      21    28     6     7   0.549  1.10       1.10  0.00141 
## 10 10        yes      21    24     6     5   0.247  1.67       1.67  0.000959
## # ... with 4,652 more rows, and 2 more variables: .sigma <dbl>, .cooksd <dbl>