Hey there! Welcome to another R project of mine. This time, I’m going to be playing around with some bar charts using OK Cupid profile data that I found on Reddit (from ~ 2012). OK Cupid is a website/app for online dating, much like Tinder or Bumble, but with more detailed profiles.
Last week, somewhere in NYC, a dude named Chris Morgan made news by being a rude fella at a breakfast spot called Bagel Boss. He went on a rant about not being able to find a date on online sites because he’s five feet tall. Because this is 2019, someone filmed it. A quick google will give you 10 minutes of enjoyment.
As a fellow short dude (5’9 on Tinder) and a wannabe data scientist, I saw opporunity for analysis. Ideally, I want to have data from Hinge or Bumble on matches (yes or no) so I could cross-reference and run classification algorithms such as Naive Bayes or Logisitic Regression. Unfortunately, I only have profile data from OK Cupid so I still had fun but not as much as I wanted.
That being said, let’s enjoy some visuals. The data I received seemed clean at first, but ended up being quite messed up;, hence a bunch of filtering that takes us from 60k observations to about 9.5k usable observations. I eventually reduce that further to 3.8k but that was more intentional. I attribute the key reason behind the data being crappy is that OK Cupid records a profile on the first button click, before the full account is activated and usable. People who stop halfway are still included despite being useless.
And with that…
Chris “Bagel Boss” Morgan, the 5 ft fighter
So here is where I’m going to import and clean that data locally: OKC is the raw data, OKCH is the data being made useful and OKCP is the data filtered down a little more just for the last visualization, since I wanted to get rid of irrelevant columns. A big challenge with this data was that everything, even height, was stored as string. Converting this to numeric values gave me a pretty big issue. Still working on it, but we can still do bar charts with strings.
Both the cleaning and visualization I do in this project are from the tidyverse package, specifically dplyr and ggplot2.
I included a head() function that is supposed to show the first 5 rows of a dataset. It’s messy here and that’s the point.
OKC<-read.csv(file ="profiles OKC.csv",header= TRUE, stringsAsFactors = FALSE)
OKC<-na.omit(OKC)
OKCH <- OKC %>%
filter(!(drinks %in% "")) %>%
filter(!(drugs %in% "")) %>%
filter(!(education %in%
"")) %>%
filter(!(job %in% "")) %>%
filter(!(offspring %in% "")) %>%
filter(!(orientation %in%
"")) %>%
filter(!(pets %in% "")) %>%
filter(!(religion %in% "")) %>%
filter(!(sex %in%
"")) %>%
filter(!(sign %in% "")) %>%
filter(!(smokes %in% "")) %>%
filter(!(status %in%
"")) %>%
filter(!(height %in%
""))
OKCP <- OKCH %>%
filter(!(body_type %in% "")) %>%
filter(!(diet %in% "")) %>%
filter(!(pets %in%
c("likes dogs and dislikes cats", "dislikes dogs and has cats",
"has dogs and dislikes cats", "dislikes dogs and likes cats")))
head(OKCP)
## age body_type diet drinks drugs
## 1 22 a little extra strictly anything socially never
## 2 35 average mostly other often sometimes
## 3 31 average mostly anything socially never
## 4 29 thin mostly anything socially never
## 5 33 athletic mostly anything socially never
## 6 30 fit mostly anything socially never
## education ethnicity height
## 1 working on college/university asian, white 75
## 2 working on space camp white 70
## 3 graduated from college/university white 65
## 4 working on college/university hispanic / latin, white 62
## 5 graduated from masters program white 72
## 6 graduated from college/university white 69
## job last_online
## 1 transportation 2012-06-28-20-30
## 2 hospitality / travel 2012-06-29-21-41
## 3 artistic / musical / writer 2012-06-29-12-30
## 4 other 2012-06-29-08-55
## 5 science / tech / engineering 2012-06-27-21-41
## 6 executive / management 2012-06-28-15-22
## location
## 1 south san francisco, california
## 2 oakland, california
## 3 san francisco, california
## 4 san leandro, california
## 5 san francisco, california
## 6 san francisco, california
## offspring orientation
## 1 doesn’t have kids, but might want them straight
## 2 doesn’t have kids, but might want them straight
## 3 doesn’t have kids, but wants them straight
## 4 doesn’t have kids, but wants them straight
## 5 doesn’t have kids straight
## 6 doesn’t have kids, but might want them straight
## pets religion sex
## 1 likes dogs and likes cats agnosticism and very serious about it m
## 2 likes dogs and likes cats agnosticism but not too serious about it m
## 3 likes dogs and likes cats christianity f
## 4 likes dogs and has cats catholicism f
## 5 likes dogs and likes cats catholicism but not too serious about it m
## 6 likes dogs and likes cats agnosticism and somewhat serious about it m
## sign smokes
## 1 gemini sometimes
## 2 cancer no
## 3 sagittarius no
## 4 taurus no
## 5 pisces and it’s fun to think about no
## 6 sagittarius but it doesn’t matter no
## speaks status
## 1 english single
## 2 english (fluently), spanish (poorly), french (poorly) single
## 3 english, spanish (okay) single
## 4 english single
## 5 english (fluently) single
## 6 english, spanish (poorly) single
Great, now to some actual fun stuff.
Below is the first bar chart looking at the height (inches) breakdown across sexes on OK Cupid. Black bars are female, yellow are male. The red line indicates five feet tall, or Chris Morgan. We see that he really is on the low end of the spectrum for male heights. The analysis would be so much stronger if we could tie heights to success rates of matching. That way we would know if and how much women cared about height as a factor (and vice versa).
The blue line represents me, coming in at a staggering five feet and eight inches on a good day. I don’t have much to brag about either, but hey, I have a great personality.
I included the actual code here to show what it looks like, I hide the inputs later.
ggplot(OKCH) +
aes(x = height, fill = sex) +
geom_bar() +
geom_vline(xintercept = 11, color = "red", size = 1)+
geom_vline(xintercept = 19, color = "blue", size = 1)+
scale_fill_viridis_d(option = "inferno") +
labs(title = "OK Cupid Height vs Sex", subtitle = "Just how relatively short was this dude?") +
theme_minimal()
Next, we move on to body types. Dad bods are supposedly in, so let’s look at the distribution. What is somewhat interesting about this question is the subjectivity. There is more opinion on body type than there is for height. I’m probably “athletic” or “used up,” but I put jacked because why not. I assume Chris Morgan is average since he looks like a regular dude. Apparently, however, he has signed on to fight other viral celebrities like Anton Dodson (or w.e his name is). Maybe Morgan can throw hands and is actually athletic.
The obvious bias that comes with someone filling out their body type is the same reason why I don’t want to just go around asking folk if height matters in attraction. People lie or fudge the truth, and I want biases out of my analysis.
For this plot, female data is navy blue and male data is gold.
The last visual, and a bit of a fun one, is “Pets”. I assume Chris Morgan hates cats and dogs because he is such an angry dude. I unfortunately grew up with five cats but I really love dogs. A good friend of mine, Kayla, got me into volunteering at an animal shelter where I worked with pitbulls. Everyone should spend some time with shelter pups. So damn cute.
I mentioned additional filtering I did on the data set to drop observations. I did this mostly so the visual here would read cleanly since there were a bunch of categories. The overall trend, however, is that people love pets, with dogs being a decent amount more popular than cats.
If Morgan gets a dog, it might add a few more inches to his situation.
For this plot, female data is salmon and male data is blue.
That’s all I have today. Let me know what you think and thanks for reading!