Summary

This document explores several relationships in the OkCupid data recently published on CRAN. The results cover the demographics of OkCupid users (sex, age, income, religiosity, drinking, and ethnicity) and, as a specific analysis, the correlation of drinking habits with religiosity, income, sex and age.

Some key observations:
* “Most” OkCupid users are white, single, childless, like pets, and are in their late 20’s.
* 95% of users are less than 52 years old.
* Men outnumber women 3:2 overall, with strong age-dependencies.
* A majority of OkCupid users identify as religiously agnostic.
* Heavier drinking habits correlate strongly with higher incomes.
* Women’s average income is about 75% that of men’s.

Getting the Data

The OkCupid data is published on CRAN as a package for R users. The data set consists of profile data for 59,946 San Francisco users of OkCupid (a free online dating website) from June 2012. The data are described in the paper: Albert Y. Kim, Adriana Escobedo-Land (2015). OkCupid Profile Data for Introductory Statistics and Data Science Courses. Journal of Statistics Education, 23(2).

The raw data is loaded as a library

## load data
library(okcupiddata)

and it consists of these data fields (detailed descriptions of which can be found in the reference above).

library(dplyr)
## variable names
profiles %>% names
 [1] "age"         "body_type"   "diet"        "drinks"      "drugs"      
 [6] "education"   "ethnicity"   "height"      "income"      "job"        
[11] "last_online" "location"    "offspring"   "orientation" "pets"       
[16] "religion"    "sex"         "sign"        "smokes"      "speaks"     
[21] "status"      "essay0"     

The dataset is very rich, with abundant opportunity to explore.

The data comprise:
* 59946 observations.
* 22 variables.

This quick tour first looks at a few individual trends in female and male behavior (in the first section below), and then focuses on the correlation of drinking habits with some of these trends in the second section.

OkCupid “Modal Hybrid User” (MHU)?

We can do an interesting thought-experiment by asking “what is the hybrid of the most frequent value (i.e. mode) of each user-variable?” As shorthand, call this composite the “Modal Hybrid User” or MHU.

It’s easily computed using a helper function which, when passed a column name and a data frame, returns the most frequently occurring value (the mode) of that column.

    max_freq <- function(col_name, df = profiles) {
        ## grabs the value of the max freq occurrence in a dataframe column
        ## inputs: col_name: the name of a column and df: a dataframe
        ## output: the max frequency occurrence in the selected column
        ##         more than one value may be returned; if none is found, NA is returned
        
        ## bail out if the column does not exist
        if (!(col_name %in% names(df))) {
            return(NA)
        }
        
        ## compute frequency table (NA's are dropped by default).
        ## [[ ]] selects the column by exact name, whereas matches()
        ## treats the name as a regex and can pick up several columns
        col_table <- table(df[[col_name]])
        
        ## choose maximum frequency
        if (length(col_table) >= 1) {
            names(col_table)[col_table == max(col_table)]
        } else {
            NA
        }
        
    }

The MHU of the entire data set is male. To compare the sexes, I computed MHU_male and MHU_female by filtering the profiles for each sex. To make the result more informative, whenever the most frequent value of a category was “other”, those entries were filtered out as well.
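The per-sex computation described above can be sketched as follows. This is a minimal illustration on an invented toy data frame, not the code used for the table below: `toy`, `col_mode` and `mhu` are hypothetical names, and `col_mode` is a compact stand-in for `max_freq` that also drops the value “other”.

```r
## toy stand-in for the profiles data (values are invented for illustration)
toy <- data.frame(
    sex    = c("m", "m", "m", "f", "f", "f"),
    drinks = c("socially", "socially", "often", "socially", "rarely", "socially"),
    job    = c("other", "other", "computer", "student", "student", "other"),
    stringsAsFactors = FALSE
)

## most frequent value of a vector, ignoring NA's and the values in `drop`
col_mode <- function(x, drop = "other") {
    x <- x[!is.na(x) & !(x %in% drop)]
    if (length(x) == 0) return(NA_character_)
    tab <- table(x)
    ## ties are broken by taking the first maximum
    names(tab)[which.max(tab)]
}

## one MHU per sex: apply col_mode to every column of each subset
mhu <- sapply(split(toy, toy$sex), function(d) sapply(d, col_mode))
print(mhu)
```

The result is a character matrix with one row per variable and one column per sex, which is the same layout as the MHU table.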

| Factor | MHU_male | MHU_female |
|---|---|---|
| age | 26 | 27 |
| body_type | athletic | average |
| diet | mostly anything | mostly anything |
| drinks | socially | socially |
| drugs | never | never |
| education | graduated from college/university | graduated from college/university |
| ethnicity | white | white |
| height | 70 | 64 |
| income | 20000 | 20000 |
| job | computer / hardware / software | student |
| last_online | 2012-06-30 11:55:00 | 2012-06-30 16:35:00 |
| location | san francisco, california | san francisco, california |
| offspring | doesn’t have kids | doesn’t have kids |
| orientation | straight | straight |
| pets | likes dogs and likes cats | likes dogs and likes cats |
| religion | agnosticism and laughing about it | agnosticism |
| sex | m | f |
| sign | virgo but it doesn’t matter | gemini and it’s fun to think about |
| smokes | no | no |
| speaks | english | english |
| status | single | single |

The above outcomes translate into a coherent story about the male and female MHUs.

The male MHU on OkCupid is:
a white, straight, single, male, who is 26 years old. He graduated from college/university and works in the computer / hardware / software industry. His body type is athletic and he is 5' 10" tall. He eats mostly anything, drinks socially, but never takes drugs. He lives in San Francisco, doesn’t have kids and likes dogs and cats. His astrological sign is virgo but it doesn’t matter and when asked about religion responds agnosticism and laughing about it. He speaks only english.

The female MHU on OkCupid is:
a white, straight, single, female, who is 27 years old. She graduated from college/university and is a student. Her body type is average and she is 5' 4" tall. She eats mostly anything, drinks socially, but never takes drugs. She lives in San Francisco, doesn’t have kids and likes dogs and cats. Her astrological sign is gemini and it’s fun to think about, and when asked about religion responds agnosticism. She speaks only english.

This suggests some interesting similarities and differences between male and female users:
* men and women express similar social drinking and eating habits.
* they are childless, single, but like pets.
* they are predominantly not religious.
* men most often describe their body type as athletic, whereas women most often describe theirs as average.
* women enjoy thinking about their astrological signs more than men do.

We’ll explore some of these below.

Ethnicity of OkCupid users

Ethnicity as reported (a sample is shown below) is too complex for direct analysis.

[1] "middle eastern, pacific islander, other"                                         
[2] "asian, black, native american, indian"                                           
[3] "asian, native american, other"                                                   
[4] "middle eastern, pacific islander"                                                
[5] "asian, indian, white, other"                                                     
[6] "pacific islander"                                                                
[7] "asian, native american, indian, pacific islander, hispanic / latin, white, other"

Indeed, since there are 218 unique categories, some reduction is needed.

To meet speed and simplicity goals, I stripped off everything except the first descriptor. This is a gross oversimplification of ethnicity in an ever-more-diverse world, but it’s a reasonable first approach. It results in tangible categories which can be compared to existing data.
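As a rough sketch of this reduction, the first descriptor can be kept by deleting everything from the first comma onward. The vector below reuses a few of the raw strings shown above; the names `ethnicity` and `first_ethnicity` are illustrative, and the exact code used for the analysis may differ.

```r
## a few of the raw ethnicity strings shown above, plus a missing value
ethnicity <- c(
    "middle eastern, pacific islander, other",
    "asian, black, native american, indian",
    "pacific islander",
    NA
)

## keep only the text before the first comma; NA stays NA
first_ethnicity <- sub(",.*$", "", ethnicity)
print(first_ethnicity)

## number of distinct categories left after the reduction
print(length(unique(na.omit(first_ethnicity))))
```

Note that this makes the result depend on the order of the descriptors in each profile, which is part of the oversimplification mentioned above.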

A majority of OkCupid users are found to be white, with small differences between the male and female populations of other ethnicities. The numbers above do not reflect the demographics of San Francisco’s population as a whole, which is 48.5% White, 33% Asian, 15% Hispanic, and 6% Black.

Religious affiliation of OkCupid Users

The religion data contains statements of both ‘affiliation’ and ‘devoutness.’ For example, here is a sample of the data.

set.seed(8675309)
sample(unique(profiles$religion), 4)
[1] "christianity and very serious about it"
[2] "judaism but not too serious about it"  
[3] "hinduism and laughing about it"        
[4] "buddhism and very serious about it"    

There is an affiliation with a base religion and then a statement of how serious (devout) one is about it.

For a macro view of demographics, let’s first strip off the devoutness and focus on affiliation (so, for instance, whether someone typed “catholicism and somewhat serious about it”, or “catholicism and very serious about it”, they are affiliated with “catholicism”).

The data are cleaned by filtering NA’s and then grouped and counted as above. In this case I chose to eliminate the affiliation “other” since it is not specific enough to facilitate comparison.

## get affiliation (strip devoutness modifiers) using gsub and simple regex
## note: [A-Za-z ] is used because the range [A-z] also matches [ \ ] ^ _ and `
cleaned <- cleaned %>% mutate(religious_affil = gsub(" [A-Za-z ]*", "", religion))
cleaned <- cleaned %>% filter(religious_affil != "other")

Using the familiar bar chart, some trends are obvious.

The proportion of male users reporting an affiliation of “atheism” or “agnosticism” is over 50%, with women being slightly lower. Buddhism is about the same for both sexes, and of the major religions women outnumber men in all but Islam.

Drinking Habits of OkCupid Users

Drinking data has just six categories so this analysis didn’t require any processing beyond normal cleaning of NA’s.

profiles$drinks %>% unique 
[1] "socially"    "often"       "not at all"  "rarely"      NA           
[6] "very often"  "desperately"

After cleaning the NA’s from the data, we can use the now familiar bar chart to show there is little difference between men and women in drinking habits.
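Because men outnumber women in the data, a fair comparison uses proportions within each sex rather than raw counts. The following is a minimal base-R sketch of that normalization on invented toy vectors (`sex`, `drinks`, `counts` and `props` here are illustrative and do not touch the real `profiles` data).

```r
## invented toy data standing in for profiles$sex and profiles$drinks
sex    <- c("m", "m", "m", "m", "f", "f", "f", NA)
drinks <- c("socially", "often", "socially", NA,
            "socially", "rarely", "socially", "often")

## drop rows where either value is missing
keep <- !is.na(sex) & !is.na(drinks)

## counts by sex and habit, then row-wise proportions (each sex sums to 1)
counts <- table(sex = sex[keep], drinks = drinks[keep])
props  <- prop.table(counts, margin = 1)
print(round(props, 2))
```

With `margin = 1` each row (sex) is normalized separately, so the bars for men and women can be compared directly even though the group sizes differ.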

Clearly the large majority of OkCupid users are social drinkers, with men having a slightly greater tendency to drink “often” than women.

However, recall that over 50% of OkCupid users are in a very narrow age range. Below we’ll explore how this trend changes when other factors are taken into account.

Drinking habits with age

The machinery above is easily adapted to exploring the relationship of drinking and age.

In this case trends emerge which were masked in the basic analysis above.

A pronounced tendency toward lighter drinking in older age is apparent.

These findings are consistent with, for instance, results published by Annie Britton, Yoav Ben-Shlomo, Michaela Benzeval, Diana Kuh and Steven Bell, who report also that drinking tends to decrease with increasing age.

One apparent divergence between men and women is in the “drinks often” category above age 55. In this category there are 106 male drinkers and 58 female drinkers. Together these numbers imply the ratio

\[ r = \frac{\text{"drinks often" males over 50}}{\text{"drinks often" females over 50}} = 1.83 \]

but with a relative uncertainty from random error (so-called “counting statistics”) of

\[ \sigma = \sqrt{\frac{1}{\text{"drinks often" males over 50}} + \frac{1}{\text{"drinks often" females over 50}}} = 16 \% \]

While this is a substantial error margin, the finding appears to be just on the border of \(3 \sigma\) confidence levels, suggesting more detailed analysis is needed.
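The arithmetic behind these two numbers can be checked with a few lines of R. This sketch uses only the counts quoted above (106 and 58) and assumes independent Poisson counting errors, so that the relative errors of the two counts add in quadrature; the variable names are illustrative.

```r
## counts quoted in the text
n_male   <- 106
n_female <- 58

## ratio of male to female "drinks often" users
r <- n_male / n_female

## relative uncertainty of the ratio, assuming independent Poisson counts
rel_sigma <- sqrt(1 / n_male + 1 / n_female)

## corresponding absolute uncertainty on r
abs_sigma <- r * rel_sigma

cat(sprintf("r = %.2f +/- %.2f (relative uncertainty %.0f%%)\n",
            r, abs_sigma, 100 * rel_sigma))
```

This reproduces the ratio of 1.83 and the 16% relative uncertainty given above.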

We get a better look at some of the more subtle trends in the data using a semi-log plot. For this the age data has been bucketed into groups of ten years.
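One simple way to form ten-year buckets is to round each age down to the nearest multiple of ten. The sketch below uses invented ages, and the bucket boundaries (10-19, 20-29, and so on) are an assumption, since the exact grouping used for the plot is not stated.

```r
## invented ages for illustration
age <- c(18, 21, 29, 30, 45, 59, 69)

## lower edge of each ten-year bucket (assumed boundaries: 10-19, 20-29, ...)
age_bucket <- 10 * floor(age / 10)

## number of users per bucket
print(table(age_bucket))
```

Counting (or computing proportions) per bucket instead of per year smooths out the noise at older ages, where there are few users.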

Some observations:
* social drinking is the most frequent category at all ages.
* abstinence increases by about a factor of three between age 21 and age 70.
* heavy drinking decreases by a factor of about five between ages 21 and 70.

Drinking Habits and Income

This analysis shows some of the most interesting trends. Since we have seen that the drinking habits and incomes of men and women differ, let’s look at the data normalized to both sex and income.

The most obvious feature of the graph above is that social drinking, the largest component of the spectrum, peaks in the middle of the income range and decreases toward the edges.

Some other observations:
* rare or no drinking decreases with increasing income.
* drinking often tends to be lower in the “middle” of the graph.
* heavy drinking increases with income.

We can get a better look with the graph below.

Some observations:
* The decrease in abstinence is apparent.
* Desperate drinking increases with income for both men and women.

Drinking and Religion

The religion data contain statements of both ‘affiliation’ and (what I call) ‘devoutness’. For example:

head(unique(profiles$religion), 5)
[1] "agnosticism and very serious about it"   
[2] "agnosticism but not too serious about it"
[3] NA                                        
[4] "atheism"                                 
[5] "christianity"                            

For a first analysis, let’s just strip off the devoutness descriptors to focus on affiliation (so, for instance, whether someone typed “catholicism and somewhat serious about it”, or “catholicism and very serious about it”, they would have an affiliation of catholicism).

Drinking habits are characterized by self-described ratings of “not at all”, … “socially”, …“desperately”.

The data are cleaned by filtering NA’s and then grouped and counted.

    ## clean data (as_tibble replaces the deprecated as_data_frame)
    cleaned <- filter(profiles, !is.na(drinks), !is.na(religion), !is.na(sex)) %>% as_tibble()
    
    ## get affiliation (strip devoutness modifiers) using gsub and simple regex
    ## note: [A-Za-z ] is used because the range [A-z] also matches [ \ ] ^ _ and `
    cleaned$religious_affil <- gsub(' [A-Za-z ]*', '', cleaned$religion) %>% as.factor()

Some patterns reveal themselves.
* Social drinking is by far the largest category.
* Atheists and agnostics are the heaviest drinkers.
* Islamic males also have a higher frequency of heavy drinking.
* Jewish people have the lowest rate of abstinence.

Some Conclusions

Generally the OkCupid data reflects many trends in the general population, such as drinking habits and income disparities. Other trends, such as ethnicity, diverge from the more general demographics.

The data present an interesting opportunity for more analysis. It would be interesting to:
* Dig deeper on the discrepancies of income based on ethnicity and age.
* Develop a cleaner statistical analysis of drinking habits with age.
* See what modeling opportunities the data present.

Thanks to the authors for publishing this interesting data set!