Albert Y. Kim
Friday 2015/02/13
Nate Silver's 538 blog was formerly hosted by the New York Times, where he made many predictions about the last federal election. He recently moved it to ESPN, where it does data-centric journalism.
Say we have binary outcome variables, i.e. for \( i=1, \ldots, n \) observations, \( y_i \in \{0, 1\} \).
We are interested in the probability \( p_i = \mbox{Pr}(y_i = 1) \).
Logistic regression is preferred over standard linear regression in such situations because with the latter you might end up with fitted probabilities \( \widehat{p}_i = \widehat{\mbox{Pr}}(y_i = 1) \) that are either less than 0 or greater than 1.
So we use not the first model, but the second:
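A minimal sketch of the problem with the linear model, using a made-up toy data set (the values of `x` and `y` and the helper `fit_simple_ols` are illustrative assumptions, not part of the OkCupid data). It fits an ordinary least-squares line to a binary outcome and flags fitted values that fall outside the interval from 0 to 1.

```python
# Toy data (made up for illustration): one predictor x and a binary outcome y.
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 0, 1, 1, 1]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_simple_ols(x, y)

# Fitted "probabilities" from the linear model at each observed x.
fitted = [intercept + slope * xi for xi in x]

for xi, p_hat in zip(x, fitted):
    flag = "" if 0 <= p_hat <= 1 else "  <- not a valid probability"
    print(f"x={xi}: fitted p = {p_hat:.3f}{flag}")
```

With this toy data, the fitted values at the smallest and largest `x` fall outside the valid range for a probability.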
\[ \begin{eqnarray} p_i &=& \beta_1 X_{i1} + \ldots + \beta_k X_{ik}\\ \mbox{logit}(p_i)=\log\left(\frac{p_i}{1-p_i}\right) &=& \beta_1 X_{i1} + \ldots + \beta_k X_{ik} \end{eqnarray} \]
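To illustrate why the logit model avoids invalid probabilities, here is a small sketch of the logit function and its inverse. The coefficients in `beta` and the rows in `rows` are hypothetical values chosen only for illustration; the first column of each row is set to 1 so that \( \beta_1 \) acts as an intercept.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Map a linear predictor eta (any real number) back to a probability."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients and predictor values, for illustration only.
beta = [-1.0, 0.8]
rows = [[1, -10], [1, 0], [1, 10]]  # first column plays the role of an intercept

probs = []
for row in rows:
    # Linear predictor: beta_1 * X_i1 + ... + beta_k * X_ik
    eta = sum(b * xij for b, xij in zip(beta, row))
    p = inv_logit(eta)
    probs.append(p)
    print(f"linear predictor = {eta:6.2f} -> p = {p:.4f}")
```

However extreme the linear predictor becomes, inverting the logit gives a value strictly between 0 and 1, which is the property the first model lacks.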
The data come from a Python script that scraped the OkCupid website. We consider a sample of n=5995 of the approximately 59K users who were
Their public profiles were pulled on 2012/06/30, i.e. only data that’s visible to the public.
Thanks to Christian Rudder from OkCupid and OkTrends for agreeing to the data's use.