Intro to Logistic Regression and OkCupid Data

Albert Y. Kim
Wednesday 2015/03/9

Binary Outcome Variables

Let out outcomes be binary.

i.e. for \( i=1, \ldots, n \) observations

\( y_i = 1 \) if some condition holds
\( y_i = 0 \) if some condition does not hold

We're interested in \( p_i = \mbox{Pr}(y_i = 1) \).

Logistic Regression

Logistic regression is preferred over linear regression here because you might end up with fitted probabilities \( \widehat{p}_i = \widehat{\mbox{Pr}}(y_i = 1) \) that are either

less than 0
greater than 1

So we use the not the first model, but the second:

\[ \begin{eqnarray} p_i &=& \beta_1 X_{i1} + \ldots + \beta_k X_{ik}\\ \mbox{logit}(p_i)=\log\left(\frac{p_i}{1-p_i}\right) &=& \beta_1 X_{i1} + \ldots + \beta_k X_{ik} \end{eqnarray} \]

OkCupid Data

Result of a Python script that scraped the OkCupid website. We consider 59K users who were

members on 2012/06/26
within 25 miles of SF
online in the last year
have at least one photo

Their public profiles were pulled on 2012/06/30. i.e. only data that’s visible to the public

OkCupid Data

essay0- My self summary
essay1- What I’m doing with my life
essay2- I’m really good at
essay3- The first thing people usually notice about me
essay4- Favorite books, movies, show, music, and food

OkCupid Data

essay5- The six things I could never do without
essay6- I spend a lot of time thinking about
essay7- On a typical Friday night I am
essay8- The most private thing I am willing to admit
essay9- You should message me if…

Questions

Knowing nothing about a user, what is your best guess of the probability that the user is female?
Is height predictive of a user's sex?
Is the use of the word “wine” in a user's essay questions predictive of …

Acknowledgements

Thanks to Christian Rudder from OkCupid and OkTrends for agreeing to the data's use.

Journal of Statistics Education paper can be found here.