For this lab we’ll be investigating profile data from the free online dating website OkCupid. The original dataset consists of the public profiles of a randomly chosen sample of 10% of the 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. Using a Python script, data was scraped from users’ public profiles on June 30, 2012; any non-publicly facing information, such as messaging, was not accessible. The resulting datasets each consist of 5995 rows, one for each user.
Variables include typical user information (such as sex, sexual orientation, age, and ethnicity) and lifestyle variables (such as diet, drinking habits, and smoking habits). A list of all variables and their descriptions can be found here. Furthermore, text responses to the 10 essay questions posed to all OkCupid users are included as well:
| Variable | Question |
|---|---|
| essay0 | My self summary |
| essay1 | What I’m doing with my life |
| essay2 | I’m really good at |
| essay3 | The first thing people usually notice about me |
| essay4 | Favorite books, movies, shows, music, and food |
| essay5 | The six things I could never do without |
| essay6 | I spend a lot of time thinking about |
| essay7 | On a typical Friday night I am |
| essay8 | The most private thing I am willing to admit |
| essay9 | You should message me if… |
Analysis of similar data has received much press of late, including Amy Webb’s TED talk “How I Hacked Online Dating” and Wired magazine’s “How a Math Genius Hacked OkCupid to Find True Love”. Before proceeding, we note that even though this data consists of publicly facing material, one should proceed with caution before scraping and using data in a fashion similar to ours, as the Computer Fraud and Abuse Act (CFAA) makes it a federal crime to access a computer without authorization from the owner. In our case, permission was given by the owners of the data.
This lab provides us an opportunity to fit a predictive model for sex using logistic regression. Specifically, we will be using information on users’ profiles to predict if the user is female. At the time, OkCupid only allowed for two possible sex choices (male or female), thus making sex a binary categorical variable (OkCupid has since relaxed these categorizations to allow for a broader range of choices). Mathematically speaking, the outcome variable of interest is
\[ Y_i = \left\{ \begin{array}{cl} 1 & \mbox{if the user is female}\\ 0 & \mbox{if the user is male} \end{array} \right. \]
Before we begin, we load the mosaic package,
library(mosaic)
and then the profiles and essays datasets:
load("OkCupid.RData")
This dataset is rich and complex, and it calls for many of the exploratory data analysis tools we’ve seen earlier.
In the “Environment” panel of RStudio, click on profiles. A spreadsheet of the data should pop up. Which variables are numerical? Which variables are categorical?
Explore the dataset visually using histogram(), xyplot(), etc.; in particular, the commands we saw in Lab 3. Pick any two categorical and two numerical variables. Recall that for barplots and mosaicplots, you need to create a frequency/contingency table first using the tally() command:

- For a single categorical variable x, use barchart(tally(~x, data=profiles)).
- For two categorical variables x and y, use mosaicplot(tally(x~y, data=profiles)).

What proportion of users are female? What are some possible explanations for this result?
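For instance, one quick way to answer the proportion question (a sketch using tally()’s format argument):
tally(~sex, data=profiles, format="proportion")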
The essay responses are quite long and do not fit on one screen. Type essays[6,] in R to view the contents of the sixth user’s essays, for example.
Create a mosaicplot of the relationship between sex and orientation.
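One possible command, following the same pattern as above:
mosaicplot(tally(sex~orientation, data=profiles), xlab="Sex", ylab="Orientation",
           main="Sex vs Orientation")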
What can you say about the dating pool for San Francisco OkCupid users? What is the single largest dating demographic group in San Francisco?
We compare the distributions of male and female heights using histograms. While we could plot two separate histograms without regard to the scale of the axes, we use the following command to:
- Split the histogram by sex via the formula ~ height | sex; in other words, plot heights conditional on sex.
- Stack the two histograms one on top of the other via layout=c(1,2), so that the x and y axes match.
- Use bins of width one inch via width=1.

histogram(~height | sex, width=1, layout=c(1,2), xlab="Height (in)",
          data=profiles)
We observe that some of the heights are nonsensical, including a height of 95 inches (equaling 7’11’’). We deem heights between 55 and 80 inches to be reasonable and remove the rest. While there is potential bias in discarding users with what we deem non-reasonable heights, only 15 of the 5995 users would be discarded, so the effect would not be substantial. We therefore keep only those users with heights between 55 and 80 inches using the filter() command and the & AND operator, i.e. keep only those users with height >= 55 and height <= 80. We assign this subset of users to a new data frame profiles.subset.
profiles.subset <- filter(profiles, height >= 55 & height <= 80)
histogram(~height | sex, width=1, layout=c(1,2), xlab="Height (in)",
data=profiles.subset)
What phenomenon do you think explains the unusual spike for males at 72 inches? Similarly, what phenomenon do you think explains the anomalous spike for females at 64 inches?
If I told you someone was 6’4’’, what would you predict their sex to be?
Similarly, what if I told you someone was 5’1’’?
For about which height do we start seeing a higher proportion of males than females? You can think of this as a “point of indifference”, i.e. a height at which we have a similar proportion of males and females.
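One way to explore this numerically (a sketch; it assumes sex is coded "f"/"m" in the data):
# Proportion of each sex at every height, conditioned on height
prop.by.height <- tally(sex ~ height, data=profiles.subset, format="proportion")
prop.by.height["m", ]  # proportion male at each height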
Make a similarly split histogram of age by sex. What do you observe?
We now turn our attention to the essay questions. The function profile.has.word() below searches all 10 essays for each user and returns TRUE if the user used the particular word. Note the function is case-insensitive. For example, let’s view only the first user’s essay responses and search for a few terms:
essays[1, ]
profile.has.word(essays[1,], word="i like a few lame reality shows")
profile.has.word(essays[1,], word="statistics")
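While profile.has.word() is provided for you, a minimal sketch of how such a function might be implemented (an assumption, not the lab’s actual definition) is:
profile.has.word <- function(essays, word) {
  # Collapse each user's 10 essay responses into a single string,
  # then search it case-insensitively for the given word or phrase
  all.text <- apply(essays, 1, paste, collapse=" ")
  grepl(word, all.text, ignore.case=TRUE)
}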
We can use profile.has.word() on all users at once as well. For example, we search for the use of the word “wine” and assign the result to a new variable in profiles called has.wine:
profiles$has.wine <- profile.has.word(essays, word="wine")
tally(~has.wine, data=profiles)
barchart(tally(~has.wine, data=profiles), horizontal=FALSE)
We now display the relationship between the two categorical variables profiles$has.wine and profiles$sex using a mosaicplot.
tally(sex~has.wine, data=profiles)
mosaicplot(tally(sex~has.wine, data=profiles), xlab="Sex", ylab="Word Use",
main="Sex vs Word Use")
What is your interpretation of the mosaicplot?
Repeat the above exercise with a different word.
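For example, with the word “travel” (the word here is just an illustration; any word of interest works):
profiles$has.travel <- profile.has.word(essays, word="travel")
mosaicplot(tally(sex~has.travel, data=profiles), xlab="Sex", ylab="Word Use",
           main="Sex vs Word Use")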
In order to reinforce the concepts of logistic regression, we initially keep things simple and consider only one predictor variable: height. We begin by demonstrating that linear regression is an inappropriate tool when the outcome variable is binary categorical. First, we create a scatterplot of the outcome variable is.female vs height:
xyplot(is.female~height, data=profiles, xlab="height", ylab="is female")
We observe this plot is not very informative since there is a large degree of overlap of the points. For example, how many users are female with height 70 inches? So we add a little jitter() to the points, i.e. a little random noise, to better distinguish the points:
xyplot(jitter(is.female)~jitter(height), data=profiles, xlab="height", ylab="is female")
For example, we see there aren’t as many females with height 60 inches as there are with height 68 inches. We now fit a linear regression as in the previous lab and plot the regression.
m1 <- lm(is.female ~ height, data=profiles)
summary(m1)
plotModel(m1)
We observe that linear regression yields fitted probabilities greater than 1 for heights less than 61 inches and fitted probabilities less than 0 for heights over 73 inches. These fitted probabilities do not make sense. Linear regression is not an appropriate tool to use in this case.
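We can confirm this quickly by inspecting the range of the fitted values:
range(fitted(m1))  # extends below 0 and above 1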
Logistic regression is a tool better suited for binary outcome variables, as in our case. Our goal is once again to predict/fit users’ sex using their height. This may seem silly, as we already know each user’s sex; however, we can fit the model pretending we don’t know each user’s sex, and then verify how good our predictions are against the truth.
Unlike for linear regression, to run logistic regression we use the glm() command (for generalized linear models) with the argument family=binomial. Just like linear regression, however, the formula takes the form y ~ x1 + x2 + ... where y is the outcome variable and x1, x2, … are the predictor variables. We fit the model using glm() and report the results:
gender_height <- glm(is.female ~ height, family=binomial, data=profiles)
summary(gender_height)
We observe that the coefficient for height is -0.65281. The interpretation of this coefficient is not the same as in linear regression; it is a little more complicated, and we leave it for a more advanced statistics class. However, we note that the coefficient is negative, indicating that height is inversely related to the probability that a user is female. The regression line is
\[ \begin{aligned} \log_{e} \left(\frac{\widehat{p}_i }{1-\widehat{p}_i }\right) &= b_0 + b_1 \mbox{height}_{i} \end{aligned} \]
In order to obtain the fitted probability \(\widehat{p}_i\) of a particular user being female, we use the inverse logit function:
\[ \begin{aligned} \widehat{p}_i &= \frac{1}{1+ \exp\left(-\left( b_0 + b_1 \mbox{height}_{i} \right)\right)} \end{aligned} \]
The first user in the dataset’s height is 64 inches. What is the fitted probability \(\widehat{p}\) that this user is female?
What is the fitted probability that a user is female if their height is 75 inches?
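One way to compute these fitted probabilities by hand in R, using the coefficients from the model (shown here for a 64-inch user; the same pattern works for 75 inches):
b <- coef(gender_height)
1 / (1 + exp(-(b[1] + b[2] * 64)))  # fitted probability for height 64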
We can compute fitted probabilities \(\widehat{p}_i\) for all 5995 users by applying the fitted() command to the model output, instead of computing them all manually. We then plot the histogram of fitted probabilities. We observe a wide range of fitted probabilities that a user is female:
profiles$p.hat <- fitted(gender_height)
histogram(~p.hat, data=profiles, xlab="p-hat", type="count")
For each user, however, we only have a fitted probability \(\widehat{p}_i\). If we want a “yes/no” prediction as to whether or not a user is female, we need a “prediction threshold”: if the fitted probability \(\widehat{p}_i\) exceeds this threshold, we predict that the user is female. We’re going to use a threshold of 50%. So imagine we draw a vertical line in the histogram at \(\widehat{p}=0.5\): we predict all users to the right of the line to be female and all users to the left to be male.
It is important to note we are not restricted to using a threshold of 50%. Any value between 0% and 100% can be used and we chose 50% somewhat arbitrarily. We did not choose this value because 50% of people are female; recall in this dataset, the proportion of users who are female is not close to 50%.
How did our predictions fare? We can evaluate our model and prediction threshold’s performance using a contingency table comparing our predictions to the truth:
threshold <- 0.5
# For each user, see if p-hat is above threshold
profiles$predicted.female <- profiles$p.hat >= threshold
profiles$predicted.female[1:10]
tally(is.female~predicted.female, data=profiles)
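From this table we can also compute the overall proportion of correct predictions (a quick sketch; it relies on is.female being coded 0/1, so that TRUE/FALSE compare correctly):
mean(profiles$is.female == profiles$predicted.female)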
We thank OkCupid president and co-founder Christian Rudder for agreeing to our use of this dataset (under the condition that the dataset remains public) and Everett Wetchler for providing the data; the original dataset can be found at https://github.com/rudeboybert/JSE_OkCupid.