For this lab we’ll be investigating profile data from the free online dating website OkCupid. The original dataset consists of the public profiles of a randomly chosen sample of 10% of the 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. Using a Python script, data were scraped from users’ public profiles on June 30, 2012; any non-publicly facing information, such as messaging, was not accessible. The resulting data set consists of 5995 rows, one for each user.
Variables include typical user information (such as sex, sexual orientation, age, and ethnicity) and lifestyle variables (such as diet, drinking habits, and smoking habits). Text responses to the 10 essay questions posed to all OkCupid users are included as well:
| Variable | Question |
|---|---|
| essay0 | My self summary |
| essay1 | What I’m doing with my life |
| essay2 | I’m really good at |
| essay3 | The first thing people usually notice about me |
| essay4 | Favorite books, movies, shows, music, and food |
| essay5 | The six things I could never do without |
| essay6 | I spend a lot of time thinking about |
| essay7 | On a typical Friday night I am |
| essay8 | The most private thing I am willing to admit |
| essay9 | You should message me if… |
Analysis of similar data has received much press of late, including Amy Webb’s TED talk *How I Hacked Online Dating* and Wired magazine’s *How a Math Genius Hacked OkCupid to Find True Love*. Before we continue, we note that even though this data consists of publicly facing material, one should proceed with caution before scraping and using data in a fashion similar to ours, as the Computer Fraud and Abuse Act (CFAA) makes it a federal crime to access a computer without authorization from the owner. In our case, permission was given by the owners of the data.
This lab provides us an opportunity to fit a predictive model for sex using logistic regression. Specifically, we will be using information on users’ profiles to predict if the user is female. At the time, OkCupid only allowed for two possible sex choices (male or female), thus making sex a binary categorical variable; OkCupid has since relaxed these categorizations to allow for a broader range of choices. Mathematically speaking, the outcome variable of interest is
\[ Y_i = \left\{ \begin{array}{cl} 1 & \mbox{if the user is female}\\ 0 & \mbox{if the user is male} \end{array} \right. \]
Before we can start this lab, we need to install and load the mosaic package for statistical education. We need this for our histogram plots below.
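If you have not installed the mosaic package before, install it once with:

install.packages("mosaic")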
Furthermore, we load each user’s profile in profiles and their corresponding responses to the 10 essay questions in essays:
library(mosaic)
download.file("http://people.reed.edu/~albkim/MATH141/etc/OkCupid.RData", destfile = "OkCupid.RData")
load("OkCupid.RData")
This data set is very rich and complex and necessitates many of the exploratory data analysis tools we’ve seen earlier.
In the “Environment” panel of RStudio, click on profiles. A spreadsheet of the data should pop up. What variables are numerical? What variables are categorical?
Perform an exploratory data analysis of the variables using the various commands we’ve learned so far: hist(), plot(), table(), barplot(), etc. Pick two categorical and two numerical variables. Recall that for barplot() and mosaicplot(), you need to create a frequency/contingency table first. For example, to display a barplot() of a categorical variable x, use barplot(table(x)); to display a mosaicplot() of two categorical variables x and y, use mosaicplot(table(x, y)).
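For instance, a few commands of this form (the variable names age, drinks, and smokes are used here for illustration, assuming they appear in profiles):

hist(profiles$age)                                   # one numerical variable
barplot(table(profiles$drinks))                      # one categorical variable
mosaicplot(table(profiles$drinks, profiles$smokes))  # two categorical variables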
What proportion of users are female?
The essay responses are quite long and do not fit on one screen. Type essays[6,] in R to view the contents of the sixth user’s responses, for example.
Create a mosaicplot of the relationship between sex and orientation. What can you say about the dating pool for San Francisco OkCupid users? What is the single largest dating demographic group in San Francisco?
We compare the distributions of male and female heights using histograms. While we could plot two separate histograms without regard to the scale of the x-axes, we instead use the histogram() function from the mosaic package with:

- the formula ~ height | sex, i.e. plot heights conditional on sex
- layout=c(1,2), which stacks the two histograms vertically
- width=1, which sets the histogram bin width to one inch

The histogram() function automatically matches the \(x\)-axis scales for both plots.

histogram(~height | sex, width=1, layout=c(1,2), xlab="Height (in)", data=profiles)
We observe that some of the heights are nonsensical, including a height of 95 inches (equaling 7’11’’). We deem heights between 55 and 80 inches to be reasonable and remove the rest. While there is potential bias in discarding users with what we deem non-reasonable heights, only 15 of the 5995 users would be discarded, so the effect is not substantial. We keep only those users with heights between 55 and 80 inches using the subset() command, where & is the AND operator, i.e. we keep only those users with profiles$height >= 55 and profiles$height <= 80. Note that we must subset the essays data set first: once profiles has been subsetted, profiles$height no longer refers to the original rows of essays.
essays <- subset(essays, profiles$height >= 55 & profiles$height <= 80)
profiles <- subset(profiles, profiles$height >= 55 & profiles$height <= 80)
histogram(~height | sex, width=1, layout=c(1,2), xlab="Height (in)", data=profiles)
We now turn our attention to the essay questions. The function profile.has.word() searches all 10 essays for each user and returns TRUE if the user used the particular word. Note that spaces are treated as part of the text, so a search for the word “wine” would return TRUE for an essay containing “wine” as a standalone word, but FALSE for one containing it only inside a longer word such as “winery”. The function is also case-insensitive. For example, let’s view the first user’s essay responses and search them for two strings:
essays[1, ]
profile.has.word(essays[1,], word="i like a few lame reality shows")
profile.has.word(essays[1,], word="statistics")
We can also apply profile.has.word() to all users at once. For example, we search for the use of the word “wine” and assign the result to a new variable in profiles called has.wine:
profiles$has.wine <- profile.has.word(essays, word="wine")
table(profiles$has.wine)
barplot(table(profiles$has.wine))
We display the relationship between the two categorical variables profiles$has.wine and profiles$sex using a mosaicplot.
table(profiles$sex, profiles$has.wine)
mosaicplot(table(profiles$sex, profiles$has.wine), xlab="sex", ylab="Word Use",
main="Sex vs Word Use")
What is your interpretation of the mosaicplot?
Repeat the above exercise with a different word.
In order to reinforce the concepts of logistic regression, we initially keep things simple and consider only one predictor variable at a time in our regression models; in this case, height. We first demonstrate that linear regression is an inappropriate tool when the outcome variable is a binary categorical variable. To see this, we create a scatterplot of the outcome variable is.female over height:
plot(profiles$height, profiles$is.female, xlab="height", ylab="is female")
We observe this plot is not very informative since there is a large degree of overlap among the points. For example, how many female users are 70 inches tall? So, as in a previous lab, we add a little jitter() to the points, i.e. a little random noise, to better distinguish them:
plot(jitter(profiles$height), jitter(profiles$is.female), xlab="height", ylab="is female")
For example, we see there aren’t as many females with height 60 inches as there are with height 68 inches. We now fit a linear regression as in the previous lab and plot the regression line in red.
m1 <- lm(profiles$is.female ~ profiles$height)
summary(m1)
plot(jitter(profiles$height), jitter(profiles$is.female), xlab="height", ylab="is female")
abline(m1, col="red")
We observe that linear regression yields fitted probabilities greater than 1 for heights less than about 60 inches and fitted probabilities less than 0 for heights over about 78 inches, which do not make sense as probabilities.
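We can confirm this directly by inspecting the range of the fitted values from the linear model, which the plot suggests falls outside the interval [0, 1]:

# The smallest and largest fitted values from the linear regression
range(fitted(m1))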
Logistic regression is a tool better suited to cases where the outcome variable is binary, as in ours. Our goal is once again to predict users’ sex using their height. This may seem silly since we already know each user’s sex; however, we can fit the model pretending we don’t know it, and then verify how good our predictions are against the truth.
Unlike for linear regression, to run logistic regression we use the glm() command (for generalized linear models) with the argument family=binomial. Just like in linear regression, however, the formula takes the form y ~ x1 + x2 + ... where y is the (in this case binary) outcome variable and x1, x2, … are the predictor variables. We fit the model and generate the results:
gender_height <- glm(profiles$is.female ~ profiles$height, family=binomial)
summary(gender_height)
We observe that the coefficient for height is -0.65281. The interpretation of this coefficient is not the same as in linear regression; it is a little more complicated, and we leave it for a more advanced statistics class. However, we note that the coefficient is negative, indicating that height is inversely related to the probability that a user is female. The fitted model takes the form
\[ \begin{aligned} \log_{e} \left(\frac{p_i}{1-p_i}\right) &= \beta_0 + \beta_1 \mbox{height}_{i} \end{aligned} \]
In order to obtain the fitted probability \(\widehat{p}_i\) of a particular user being female, we use the inverse logit function: \[ \widehat{p}_i = \frac{1}{1+ \exp\left(-\left( b_0 + b_1 \mbox{height}_{i} \right)\right)} \]
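As a quick sanity check that this formula matches the model’s output, we can compute the fitted probability by hand for a hypothetical user who is 65 inches tall:

# Extract the estimated coefficients: b[1] is the intercept b0,
# b[2] is the height coefficient b1
b <- coef(gender_height)
# Inverse logit: fitted probability of being female at height 65 inches
1 / (1 + exp(-(b[1] + b[2] * 65)))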
We can obtain fitted probabilities \(\widehat{p}_i\) for all users at once by applying the fitted() command to the model output, instead of computing them all manually. The histogram of fitted probabilities looks as follows:
profiles$p.hat <- fitted(gender_height)
hist(profiles$p.hat, main="Fitted Probabilities", xlab=expression(hat(p)[i]))
For each user, however, we only have a fitted probability \(\widehat{p}_i\) and not a “yes/no” prediction as to whether or not they are female. For this we need a “prediction threshold”: if the fitted probability \(\widehat{p}_i\) exceeds the threshold, we predict that the user is female; otherwise we predict male. We’re going to use a threshold of 50%, which we indicate on the histogram using a red line:
threshold <- 0.5
hist(profiles$p.hat, main="Fitted Probabilities", xlab=expression(hat(p)[i]))
abline(v=threshold, col="red")
We chose this value somewhat arbitrarily, because it segments the \(\widehat{p}_i\) into two chunks. We did not choose it because 50% of people are female; recall that in this data set, the proportion of users who are female is not close to 50%. We can then create the contingency table comparing our predictions to the truth:
profiles$predicted.female <- profiles$p.hat >= threshold
profiles$predicted.female[1:10]
table(truth=profiles$is.female, prediction=profiles$predicted.female)
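One simple summary of this table is the overall proportion of users whose sex we predicted correctly:

# Proportion of users whose predicted sex matches their actual sex
mean(profiles$is.female == profiles$predicted.female)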
Fit a logistic regression model using only has.wine as a predictor variable. Interpret the coefficient and compare the prediction performance using a prediction threshold of 50%. Did our prediction performance improve?

Fit a logistic regression model using both height and has.wine as predictor variables. Interpret the coefficients and compare the prediction performance using a prediction threshold of 50%. Did our prediction performance improve?

We thank OkCupid president and co-founder Christian Rudder for agreeing to our use of this dataset (under the condition that the dataset remains public) and Everett Wetchler for providing the data; the original dataset can be found at https://github.com/rudeboybert/JSE_OkCupid.