For this lab we’ll be investigating profile data from the free online dating website OkCupid. The original dataset consists of the public profiles of a randomly chosen sample of 10% of the 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. Using a Python script, data were scraped from users’ public profiles on June 30, 2012; any non-publicly facing information, such as messaging, was not accessible. The resulting data set consists of 5995 rows, one for each user.
Variables include typical user information (such as sex, sexual orientation, age, and ethnicity) and lifestyle variables (such as diet, drinking habits, and smoking habits). Text responses to the 10 essay questions posed to all OkCupid users are included as well:
| Variable | Question |
|---|---|
| essay0 | My self summary |
| essay1 | What I’m doing with my life |
| essay2 | I’m really good at |
| essay3 | The first thing people usually notice about me |
| essay4 | Favorite books, movies, shows, music, and food |
| essay5 | The six things I could never do without |
| essay6 | I spend a lot of time thinking about |
| essay7 | On a typical Friday night I am |
| essay8 | The most private thing I am willing to admit |
| essay9 | You should message me if… |
Analysis of similar data has received much press of late, including Amy Webb’s TED talk *How I Hacked Online Dating* and Wired magazine’s *How a Math Genius Hacked OkCupid to Find True Love*. Before we continue, we note that even though this data consists of publicly facing material, one should proceed with caution before scraping and using data in a fashion similar to ours, as the Computer Fraud and Abuse Act (CFAA) makes it a federal crime to access a computer without authorization from the owner. In our case, permission was given by the owners of the data.
This lab provides us an opportunity to fit a predictive model for sex using logistic regression. Specifically, we will be using information on users’ profiles to predict if the user is female. At the time, OkCupid only allowed for two possible sex choices (male or female), thus making sex a binary categorical variable; OkCupid has since relaxed these categorizations to allow for a broader range of choices. Mathematically speaking, the outcome variable of interest is
\[ Y_i = \left\{ \begin{array}{cl} 1 & \mbox{if the user is female}\\ 0 & \mbox{if the user is male} \end{array} \right. \]
Before we can start this lab, we need to install and load the mosaic package for statistical education. We need this for our histogram plots below.
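If you have not installed the mosaic package before, install it once with:

install.packages("mosaic")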
Furthermore, we load each user’s profile in profiles and their corresponding responses to the 10 essay questions in essays:
library(mosaic)
download.file("http://people.reed.edu/~albkim/MATH141/etc/OkCupid.RData", destfile = "OkCupid.RData")
load("OkCupid.RData")
This data set is very rich and complex and necessitates many of the exploratory data analysis tools we’ve seen earlier.
In the “Environment” panel of RStudio, click on profiles. A spreadsheet of the data should pop up. What variables are numerical? What variables are categorical?
Perform an exploratory data analysis of the variables using the various commands we’ve learned so far: hist(), plot(), table(), barplot(), etc. Pick two categorical and two numerical variables. Recall that for barplot() and mosaicplot(), you need to create a frequency/contingency table first. For example, to display a barplot() of a categorical variable x, use barplot(table(x)); to display a mosaicplot() of two categorical variables x and y, use mosaicplot(table(x, y)).
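For instance, a few commands of this form (the variable names age, drinks, and smokes are used here for illustration, assuming they appear in profiles):

hist(profiles$age)                                   # one numerical variable
barplot(table(profiles$drinks))                      # one categorical variable
mosaicplot(table(profiles$drinks, profiles$smokes))  # two categorical variables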
What proportion of users are female?
The essay responses are quite long and do not fit on one screen. Type essays[6,] in R to view the contents of the sixth user’s responses, for example.
Create a mosaicplot of the relationship between sex and orientation. What can you say about the dating pool for San Francisco OkCupid users? What is the single largest dating demographic group in San Francisco?
We compare the distributions of male and female heights using histograms. While we could plot two separate histograms without regard to the scale of the x-axes, we instead use the histogram() function from the mosaic package with:

- the formula ~ height | sex, i.e. plot heights conditional on sex
- layout=c(1,2), which stacks the two histograms vertically
- width=1, which sets the histogram bin width to one inch

The histogram() function automatically matches the \(x\)-axis scales for both plots.

histogram(~height | sex, width=1, layout=c(1,2), xlab="Height (in)", data=profiles)
We observe that some of the heights are nonsensical, including a height of 95 inches (equaling 7’11’’). We deem heights between 55 and 80 inches to be reasonable and remove the rest. While there is potential bias in discarding users with what we deem non-reasonable heights, only 15 of the 5995 users would be discarded, so the effect is not substantial. We keep only those users with heights between 55 and 80 inches using the subset() command, where & is the AND operator, i.e. we keep only those users with profiles$height >= 55 and profiles$height <= 80. Note that we must subset the essays data set first: once profiles has been subsetted, profiles$height no longer refers to the original rows of essays.
essays <- subset(essays, profiles$height >= 55 & profiles$height <= 80)
profiles <- subset(profiles, profiles$height >= 55 & profiles$height <= 80)
histogram(~height | sex, width=1, layout=c(1,2), xlab="Height (in)", data=profiles)
We now turn our attention to the essay questions. The function profile.has.word() searches all 10 essays for each user and returns TRUE if the user used the particular word. Note that spaces are treated as part of the text, so a search for the word “wine” would return TRUE for an essay containing “wine” as a standalone word, but FALSE for one containing it only inside a longer word such as “winery”. The function is also case-insensitive. For example, let’s view the first user’s essay responses and search them for two strings:
essays[1, ]
profile.has.word(essays[1,], word="i like a few lame reality shows")
profile.has.word(essays[1,], word="statistics")
We can also apply profile.has.word() to all users at once. For example, we search for the use of the word “wine” and assign the result to a new variable in profiles called has.wine:
profiles$has.wine <- profile.has.word(essays, word="wine")
table(profiles$has.wine)
barplot(table(profiles$has.wine))
We display the relationship between the two categorical variables profiles$has.wine and profiles$sex using a mosaicplot.
table(profiles$sex, profiles$has.wine)
mosaicplot(table(profiles$sex, profiles$has.wine), xlab="sex", ylab="Word Use",
main="Sex vs Word Use")
What is your interpretation of the mosaicplot?
Repeat the above exercise with a different word.
In order to reinforce the concepts of logistic regression, we initially keep things simple and consider only one predictor variable at a time in our regression models; in this case, height. We first demonstrate that linear regression is an inappropriate tool when the outcome variable is a binary categorical variable. To see this, we create a scatterplot of the outcome variable is.female over height:
plot(profiles$height, profiles$is.female, xlab="height", ylab="is female")
We observe this plot is not very informative since there is a large degree of overlap among the points. For example, how many female users are 70 inches tall? So, as in a previous lab, we add a little jitter() to the points, i.e. a little random noise, to better distinguish them:
plot(jitter(profiles$height), jitter(profiles$is.female), xlab="height", ylab="is female")
For example, we see there aren’t as many females with height 60 inches as there are with height 68 inches. We now fit a linear regression as in the previous lab and plot the regression line in red.
m1 <- lm(profiles$is.female ~ profiles$height)
summary(m1)
plot(jitter(profiles$height), jitter(profiles$is.female), xlab="height", ylab="is female")
abline(m1, col="red")
We observe that linear regression yields fitted probabilities greater than 1 for heights less than about 60 inches and fitted probabilities less than 0 for heights over about 78 inches, which do not make sense as probabilities.
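We can confirm this directly by inspecting the range of the fitted values from the linear model, which the plot suggests falls outside the interval [0, 1]:

# The smallest and largest fitted values from the linear regression
range(fitted(m1))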
Logistic regression is a tool better suited to cases where the outcome variable is binary, as in ours. Our goal is once again to predict users’ sex using their height. This may seem silly since we already know each user’s sex; however, we can fit the model pretending we don’t know it, and then verify how good our predictions are against the truth.
Unlike for linear regression, to run logistic regression we use the glm() command (for generalized linear models) with the argument family=binomial. Just like in linear regression, however, the formula takes the form y ~ x1 + x2 + ... where y is the (in this case binary) outcome variable and x1, x2, … are the predictor variables. We fit the model and generate the results:
gender_height <- glm(profiles$is.female ~ profiles$height, family=binomial)
summary(gender_height)
We observe that the coefficient for height is -0.65281. The interpretation of this coefficient is not the same as in linear regression; it is a little more complicated, and we leave it for a more advanced statistics class. However, we note that the coefficient is negative, indicating that height is inversely related to the probability that a user is female. The fitted model takes the form
\[ \begin{aligned} \log_{e} \left(\frac{p_i}{1-p_i}\right) &= \beta_0 + \beta_1 \mbox{height}_{i} \end{aligned} \]
In order to obtain the fitted probability \(\widehat{p}_i\) of a particular user being female, we use the inverse logit function: \[ \widehat{p}_i = \frac{1}{1+ \exp\left(-\left( b_0 + b_1 \mbox{height}_{i} \right)\right)} \]
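As a quick sanity check that this formula matches the model’s output, we can compute the fitted probability by hand for a hypothetical user who is 65 inches tall:

# Extract the estimated coefficients: b[1] is the intercept b0,
# b[2] is the height coefficient b1
b <- coef(gender_height)
# Inverse logit: fitted probability of being female at height 65 inches
1 / (1 + exp(-(b[1] + b[2] * 65)))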
We can obtain fitted probabilities \(\widehat{p}_i\) for all users at once by applying the fitted() command to the model output, instead of computing them all manually. The histogram of fitted probabilities looks as follows:
profiles$p.hat <- fitted(gender_height)
hist(profiles$p.hat, main="Fitted Probabilities", xlab=expression(hat(p)[i]))
For each user, however, we only have a fitted probability \(\widehat{p}_i\) and not a “yes/no” prediction as to whether or not they are female. For this we need a “prediction threshold”: if the fitted probability \(\widehat{p}_i\) exceeds the threshold, we predict that the user is female; otherwise we predict male. We’re going to use a threshold of 50%, which we indicate on the histogram using a red line:
threshold <- 0.5
hist(profiles$p.hat, main="Fitted Probabilities", xlab=expression(hat(p)[i]))
abline(v=threshold, col="red")
We chose this value somewhat arbitrarily, because it segments the \(\widehat{p}_i\) into two chunks. We did not choose it because 50% of people are female; recall that in this data set, the proportion of users who are female is not close to 50%. We can then create the contingency table comparing our predictions to the truth:
profiles$predicted.female <- profiles$p.hat >= threshold
profiles$predicted.female[1:10]
table(truth=profiles$is.female, prediction=profiles$predicted.female)
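One simple summary of this table is the overall proportion of users whose sex we predicted correctly:

# Proportion of users whose predicted sex matches their actual sex
mean(profiles$is.female == profiles$predicted.female)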
Fit a logistic regression model using only has.wine as a predictor variable. Interpret the coefficient and compare the prediction performance using a prediction threshold of 50%. Did our prediction performance improve?

Fit a logistic regression model using both height and has.wine as predictor variables. Interpret the coefficients and compare the prediction performance using a prediction threshold of 50%. Did our prediction performance improve?

We thank OkCupid president and co-founder Christian Rudder for agreeing to our use of this dataset (under the condition that the dataset remains public) and Everett Wetchler for providing the data; the original dataset can be found at https://github.com/rudeboybert/JSE_OkCupid.