Albert Y. Kim
Wednesday 2015/03/9
Let out outcomes be binary.
i.e. for \( i=1, \ldots, n \) observations
We're interested in \( p_i = \mbox{Pr}(y_i = 1) \).
Logistic regression is preferred over linear regression here because you might end up with fitted probabilities \( \widehat{p}_i = \widehat{\mbox{Pr}}(y_i = 1) \) that are either
So we use the not the first model, but the second:
\[ \begin{eqnarray} p_i &=& \beta_1 X_{i1} + \ldots + \beta_k X_{ik}\\ \mbox{logit}(p_i)=\log\left(\frac{p_i}{1-p_i}\right) &=& \beta_1 X_{i1} + \ldots + \beta_k X_{ik} \end{eqnarray} \]
Result of a Python script that scraped the OkCupid website. We consider 59K users who were
Their public profiles were pulled on 2012/06/30. i.e. only data that’s visible to the public
Thanks to Christian Rudder from OkCupid and OkTrends for agreeing to the data's use.
Journal of Statistics Education paper can be found here.