Hey there! Welcome to another R project of mine. This time, I’m going to be playing around with some bar charts using OK Cupid profile data that I found on Reddit (from ~ 2012). OK Cupid is a website/app for online dating, much like Tinder or Bumble, but with more detailed profiles.

Last week, somewhere in NYC, a dude named Chris Morgan made news by being a rude fella at a breakfast spot called Bagel Boss. He went on a rant about not being able to find a date on online sites because he’s five feet tall. Because this is 2019, someone filmed it. A quick google will give you 10 minutes of enjoyment.

As a fellow short dude (5’9 on Tinder) and a wannabe data scientist, I saw opporunity for analysis. Ideally, I want to have data from Hinge or Bumble on matches (yes or no) so I could cross-reference and run classification algorithms such as Naive Bayes or Logisitic Regression. Unfortunately, I only have profile data from OK Cupid so I still had fun but not as much as I wanted.

That being said, let’s enjoy some visuals. The data I received seemed clean at first, but ended up being quite messed up;, hence a bunch of filtering that takes us from 60k observations to about 9.5k usable observations. I eventually reduce that further to 3.8k but that was more intentional. I attribute the key reason behind the data being crappy is that OK Cupid records a profile on the first button click, before the full account is activated and usable. People who stop halfway are still included despite being useless.

And with that…

Chris “Bagel Boss” Morgan, the 5 ft fighter

Chris “Bagel Boss” Morgan, the 5 ft fighter

So here is where I’m going to import and clean that data locally: OKC is the raw data, OKCH is the data being made useful and OKCP is the data filtered down a little more just for the last visualization, since I wanted to get rid of irrelevant columns. A big challenge with this data was that everything, even height, was stored as string. Converting this to numeric values gave me a pretty big issue. Still working on it, but we can still do bar charts with strings.

Both the cleaning and visualization I do in this project are from the tidyverse package, specifically dplyr and ggplot2.

I included a head() function that is supposed to show the first 5 rows of a dataset. It’s messy here and that’s the point.

OKC<-read.csv(file ="profiles OKC.csv",header= TRUE, stringsAsFactors = FALSE)

OKC<-na.omit(OKC)

OKCH <- OKC %>%
 filter(!(drinks %in% "")) %>%
 filter(!(drugs %in% "")) %>%
 filter(!(education %in% 
    "")) %>%
 filter(!(job %in% "")) %>%
 filter(!(offspring %in% "")) %>%
 filter(!(orientation %in% 
    "")) %>%
 filter(!(pets %in% "")) %>%
 filter(!(religion %in% "")) %>%
 filter(!(sex %in% 
    "")) %>%
 filter(!(sign %in% "")) %>%
 filter(!(smokes %in% "")) %>%
 filter(!(status %in% 
    "")) %>%
  filter(!(height %in% 
             ""))

OKCP <- OKCH %>%
 filter(!(body_type %in% "")) %>%
 filter(!(diet %in% "")) %>%
 filter(!(pets %in% 
    c("likes dogs and dislikes cats", "dislikes dogs and has cats", 
    "has dogs and dislikes cats", "dislikes dogs and likes cats")))

head(OKCP)
##   age      body_type              diet   drinks     drugs
## 1  22 a little extra strictly anything socially     never
## 2  35        average      mostly other    often sometimes
## 3  31        average   mostly anything socially     never
## 4  29           thin   mostly anything socially     never
## 5  33       athletic   mostly anything socially     never
## 6  30            fit   mostly anything socially     never
##                           education               ethnicity height
## 1     working on college/university            asian, white     75
## 2             working on space camp                   white     70
## 3 graduated from college/university                   white     65
## 4     working on college/university hispanic / latin, white     62
## 5    graduated from masters program                   white     72
## 6 graduated from college/university                   white     69
##                            job      last_online
## 1               transportation 2012-06-28-20-30
## 2         hospitality / travel 2012-06-29-21-41
## 3  artistic / musical / writer 2012-06-29-12-30
## 4                        other 2012-06-29-08-55
## 5 science / tech / engineering 2012-06-27-21-41
## 6       executive / management 2012-06-28-15-22
##                          location
## 1 south san francisco, california
## 2             oakland, california
## 3       san francisco, california
## 4         san leandro, california
## 5       san francisco, california
## 6       san francisco, california
##                                      offspring orientation
## 1 doesn&rsquo;t have kids, but might want them    straight
## 2 doesn&rsquo;t have kids, but might want them    straight
## 3      doesn&rsquo;t have kids, but wants them    straight
## 4      doesn&rsquo;t have kids, but wants them    straight
## 5                      doesn&rsquo;t have kids    straight
## 6 doesn&rsquo;t have kids, but might want them    straight
##                        pets                                  religion sex
## 1 likes dogs and likes cats     agnosticism and very serious about it   m
## 2 likes dogs and likes cats  agnosticism but not too serious about it   m
## 3 likes dogs and likes cats                              christianity   f
## 4   likes dogs and has cats                               catholicism   f
## 5 likes dogs and likes cats  catholicism but not too serious about it   m
## 6 likes dogs and likes cats agnosticism and somewhat serious about it   m
##                                       sign    smokes
## 1                                   gemini sometimes
## 2                                   cancer        no
## 3                              sagittarius        no
## 4                                   taurus        no
## 5 pisces and it&rsquo;s fun to think about        no
## 6  sagittarius but it doesn&rsquo;t matter        no
##                                                  speaks status
## 1                                               english single
## 2 english (fluently), spanish (poorly), french (poorly) single
## 3                               english, spanish (okay) single
## 4                                               english single
## 5                                    english (fluently) single
## 6                             english, spanish (poorly) single

Great, now to some actual fun stuff.

Below is the first bar chart looking at the height (inches) breakdown across sexes on OK Cupid. Black bars are female, yellow are male. The red line indicates five feet tall, or Chris Morgan. We see that he really is on the low end of the spectrum for male heights. The analysis would be so much stronger if we could tie heights to success rates of matching. That way we would know if and how much women cared about height as a factor (and vice versa).

The blue line represents me, coming in at a staggering five feet and eight inches on a good day. I don’t have much to brag about either, but hey, I have a great personality.

I included the actual code here to show what it looks like, I hide the inputs later.

ggplot(OKCH) +
 aes(x = height, fill = sex) +
 geom_bar() +
 geom_vline(xintercept = 11, color = "red", size = 1)+
 geom_vline(xintercept = 19, color = "blue", size = 1)+
 scale_fill_viridis_d(option = "inferno") +
 labs(title = "OK Cupid Height vs Sex", subtitle = "Just how relatively short was this dude?") +
 theme_minimal()

Next, we move on to body types. Dad bods are supposedly in, so let’s look at the distribution. What is somewhat interesting about this question is the subjectivity. There is more opinion on body type than there is for height. I’m probably “athletic” or “used up,” but I put jacked because why not. I assume Chris Morgan is average since he looks like a regular dude. Apparently, however, he has signed on to fight other viral celebrities like Anton Dodson (or w.e his name is). Maybe Morgan can throw hands and is actually athletic.

The obvious bias that comes with someone filling out their body type is the same reason why I don’t want to just go around asking folk if height matters in attraction. People lie or fudge the truth, and I want biases out of my analysis.

For this plot, female data is navy blue and male data is gold.

The last visual, and a bit of a fun one, is “Pets”. I assume Chris Morgan hates cats and dogs because he is such an angry dude. I unfortunately grew up with five cats but I really love dogs. A good friend of mine, Kayla, got me into volunteering at an animal shelter where I worked with pitbulls. Everyone should spend some time with shelter pups. So damn cute.

I mentioned additional filtering I did on the data set to drop observations. I did this mostly so the visual here would read cleanly since there were a bunch of categories. The overall trend, however, is that people love pets, with dogs being a decent amount more popular than cats.

If Morgan gets a dog, it might add a few more inches to his situation.

For this plot, female data is salmon and male data is blue.

That’s all I have today. Let me know what you think and thanks for reading!