Hey there! Welcome to another R project of mine. This time, I played around with my Instagram, Tinder, and Hinge data.
The hardest part of these projects is finding interesting data. Fortunately, you can request your data from most apps these days. Instagram and Tinder were very good about sending my data to me in JSON files. Hinge was absolutely horrible: I requested my data five times before giving up and manually going through my app to transcribe the data myself. If anyone from Hinge corporate reads this, fix your data access process.
For those who don’t know: Instagram is an app for posting photos/videos. Tinder is an app for finding “love” based on swiping right (yes) or left (no) on pictures of potential partners. Hinge is a more advanced Tinder that allows for filtering based on characteristics such as education, height, race, etc.
The three things I did with my data were:
- Instagram: use an anti-join to find who I follow that doesn't follow me back
- Tinder: look at my usage stats and match rate over time
- Hinge: build an alluvial chart to figure out my "type"
The overall process for each mini-project is to import the data, clean it, then apply whatever tool I want. Generally, 90% of the work is importing and cleaning.
The main packages used for this project are tidyverse, rjson, alluvial, and lubridate.
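For reference, here's the setup chunk I run before anything else (assuming the packages are already installed):
# loading the packages used throughout this post
library(tidyverse)   # dplyr, ggplot2, stringr, tidyr, etc.
library(rjson)       # fromJSON() for the IG and Tinder exports
library(alluvial)    # alluvial() for the Hinge chart
library(lubridate)   # ymd() and mdy() for date parsing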
And with that…
Not my Tinder but still pretty funny
So, here is where I import the data and store it locally in an object called "connections".
The way this data is structured from IG, it's actually a list of six lists, two of which contain my followers and who I'm following. To further complicate things, the names of the followers and followings are stored as column names. So what I need to do is extract the column names from these two lists and store them in two custom lists: "follower_names" and "following_names." There's also extra text prepended to these values that I trim off. Lastly, I convert both lists into data frames so I can use an anti-join in the next step.
An anti-join returns the rows of the first data frame that have no match in the second. I use it to find who I'm following that doesn't follow me back. Yes, I know there are apps for this. This is more fun and keeps my data private (kinda).
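As a toy example of the idea (made-up handles, not my real data):
# toy demo: anti_join() keeps rows of the first table with no match in the second
me_following <- data.frame(name = c("ann", "bob", "cat"))
my_followers <- data.frame(name = c("ann", "cat", "dan"))
anti_join(me_following, my_followers, by = "name")
##   name
## 1  bob
Here "bob" is the only account followed that doesn't follow back, so it's the only row returned.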
After the anti-join, we can look at a few records in the object "assholes" and spot-check them against my app. Spoiler: these folks didn't follow me. I ended up unfollowing 100 or so accounts. The head() function shows the first six rows of "assholes" by default.
# import data from JSON file
connections <- fromJSON(file = "connections.json")
# pulling the follower and following lists and converting each to a data frame
followers <- as.data.frame(connections[3])
following <- as.data.frame(connections[4])
# cleaning: the names arrive as column names with "followers."/"following." prefixes,
# so grab the column names and strip the prefixes (fixed() treats the dot literally)
follower_names <- colnames(followers)
follower_names <- str_replace_all(follower_names, fixed("followers."), "")
follower_names <- as.data.frame(follower_names)
following_names <- colnames(following)
following_names <- str_replace_all(following_names, fixed("following."), "")
following_names <- as.data.frame(following_names)
# using an anti-join to find what names are on the following list, but not the follower list
assholes <- anti_join(following_names, follower_names, by = c("following_names" = "follower_names"))
## Warning: Column `following_names`/`follower_names` joining factors with
## different levels, coercing to character vector
# looking at the first six names
head(assholes)
## following_names
## 1 keshavreddy
## 2 njdotcom
## 3 kingmaryiv
## 4 ndbarstool
## 5 tori_wilton
## 6 lexzworld
The downside of Tinder data is how little descriptive information it contains. There is an "about me" section where users can put anything, such as height, college, etc., but that free text can't easily be pulled apart and grouped without some contextual text mining, which I'm too lazy to do. The best thing to do with Tinder data is look at usage stats and success rates over time.
The data importing and cleanup are very similar to the IG part above. I imported a massive list of lists from a JSON file and selected one list from it: "usage". Inside "usage" are several fields I work with; in general, each field is just a count on a specific date. What makes this a little annoying is that the date information is stored as column names, which I deal with the same way as in the IG part. After extracting and cleaning the data, we can finally do some analysis.
I stored matches, lefts (passes), and rights (likes), along with the dates, in a data frame: "tinder". First I had some fun and created ratios. pull_rate is my matches divided by my likes (3%), selectiveness is my rights divided by the total girls I've seen (47%), and days is just the timespan of the data (103 days). Without a benchmark of more users, I have no idea how I'm actually doing, so send me your data and let's see if I'm better than you.
Next I made a plot of the general trend of my total activity over time, using geom_smooth to keep the graph easy to digest. Overall my activity fell, then picked up at the end. Why did my activity start so high? T-Mobile gave away Tinder Premium, which has unlimited swipes, when I started using Tinder; normally Tinder caps right swipes at 200 or something per day. That promotion fell off sometime in mid-February. Additionally, the black vertical lines mark a period when I wasn't using the app as much because I met a girl on Hinge I liked. Covid kind of ended that (sad Kevin).
# import data from JSON file; the file is 2.3 MB of lists within lists
raw_data <- fromJSON(file = "data.json")
# pulling just the usage stats from the giant list
usage <- raw_data[8]$Usage
# the counts come in as strings, so convert each field to numeric
# pulling right swipes, or likes
rights <- as.numeric(usage$swipes_likes)
# pulling left swipes, or passes
lefts <- as.numeric(usage$swipes_passes)
# pulling matches
matches <- as.numeric(usage$matches)
# the dates are stored as element names, which as.data.frame() prefixes with "X"
date <- colnames(as.data.frame(usage[[1]]))
# stripping the "X" prefix
date <- str_replace_all(date, "X", "")
# converting date from string to date
date <- ymd(date)
# creating a data frame, named "tinder", with the dates, matches, passes, and likes
tinder <- data.frame(date, matches, lefts, rights)
# creating some ratios, such as matches divided by likes
tinder %>% summarize(pull_rate = sum(matches) / sum(rights),
                     selectiveness = sum(rights) / sum(rights + lefts),
                     days = n())
## pull_rate selectiveness days
## 1 0.02815013 0.4734209 103
# creating a visualization of total swipe activity over time
ggplot(tinder, aes(x = date, y = rights + lefts)) +
  geom_smooth() +
  geom_vline(xintercept = tinder$date[81]) +
  geom_vline(xintercept = tinder$date[92]) +
  ggtitle("KLove's Tinder", subtitle = "It was nice while it lasted")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This was the most fun, mostly because I experimented with a new type of graph: an alluvial chart. An alluvial chart passes frequencies of observations through several categories, so you can see which combinations are popular. I use the chart here to find out my "type" for girls I'd like to go on a date with.
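If you've never seen one, here's the idea on some made-up data (not my real numbers):
# toy demo of an alluvial chart: each row is a combination of categories plus a frequency
demo <- data.frame(Height = c("Shorter", "My Height", "Shorter"),
                   Drinks = c("Yes", "Yes", "No"),
                   n = c(10, 5, 3))
alluvial(demo[, 1:2], freq = demo$n)
The ribbon widths are proportional to the frequencies, so the most common combinations jump out.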
The data cleaning here was super easy since I manually transcribed this data from the app into a CSV file. The only strange thing is that the alluvial chart looks weird with one-off observations. For instance, I have the heights of all my matches, but putting every distinct height on the chart looks smushed and ugly. To fix that, I put heights into two buckets: shorter and my height. (I don't like taller girls and they don't like me.) In essence, as the number of buckets in a category goes down, the graph gets cleaner.
When applying the alluvial chart, you can see how this can get really messy but also pretty cool. I decided to highlight the girls who were my height, but I could highlight pretty much anything I want. I experimented with marking liberal girls blue, conservative red, and moderate purple, but that still looked messy.
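For the curious, the political coloring looked something like this (a sketch from memory, not the exact code I ran), swapped into the col and border arguments of the alluvial() call further down:
# sketch: coloring the ribbons by politics instead of height
ifelse(try22$Political == "Liberal", "blue",
ifelse(try22$Political == "Conservative", "red",
ifelse(try22$Political == "Moderate", "purple", "gray")))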
I obviously have a type though: short girls who drink.
I also included a head() call here to give an idea of the data I had.
Note: Alluvial charts do not do well with NA data, so I had to rename some NAs as “unknown.”
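In code, the renaming looked something like this (a sketch, run after the CSV is loaded, assuming the text columns come in as characters rather than factors; replace_na() is from tidyr):
# sketch: swap NAs for "Unknown" in every character column
hinge_data <- hinge_data %>%
  mutate(across(where(is.character), ~ replace_na(.x, "Unknown")))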
#loading in the data
hinge_data <- read.csv("Manual Hinge Data.csv")
#Previewing data
head(hinge_data)
## Index Date Name Age Height.ft Height..in. Town State
## 1 1 1/7/2020 Kristen 23 5'4 64 Morganville NJ
## 2 2 1/8/2020 Kara R. 26 5'3 63 Morganville NJ
## 3 3 1/8/2020 Jaime B. 23 5'4 64 Freehold NJ
## 4 4 1/9/2012 Brianna 26 5'0 60 Freehold NJ
## 5 5 1/10/2020 Michelle K. 23 5'6 66 Ocean NJ
## 6 6 1/10/2020 Mary 24 5'7 67 West Freehold NJ
## School Race Political Religion Drink Smoke
## 1 The College of New Jersey White Unknown Catholic Sometimes No
## 2 Drexel University White Conservative Unknown Yes Unknown
## 3 Middlesex County College White Unknown Unknown Yes No
## 4 Monmouth University White Unknown Unknown Sometimes No
## 5 <NA> White Conservative Christian Sometimes No
## 6 Seton Hall White Unknown Unknown Yes No
## Drugs Insta Job Tattoo Children Hot.
## 1 No No Student No No 7
## 2 Unknown Yes Nurse No No 9
## 3 Sometimes No Transporter Yes Want 9
## 4 Unknown Yes Social Worker No No 7
## 5 No No <NA> No Want 8
## 6 Unknown Yes Accountant No No 5
#Converting the string date data to actual date data, even though I don't use it
hinge_data$Date <- mdy(hinge_data$Date)
#making a bucket for age
hinge_data$Age_bucket <- ifelse(hinge_data$Age < 27, "Younger",
                         ifelse(hinge_data$Age > 27, "Older", "Same Age"))
#making a bucket for height
hinge_data$Height_bucket <- ifelse(hinge_data$Height..in. < 65, "Shorter",
                            ifelse(hinge_data$Height..in. > 68, "Taller", "My Height"))
#making a bucket for state
hinge_data$State_bucket <- ifelse(hinge_data$State == "NJ", "NJ", "Not NJ")
#making a bucket for drink, I don't like sober girls
hinge_data$Drink_bucket <- ifelse(hinge_data$Drink == "Yes", "Yes",
                           ifelse(hinge_data$Drink == "Sometimes", "Sometimes", "Unknown"))
#selecting only the columns I might want for the alluvial chart
hinge_cut <- select(hinge_data, c(Age_bucket, Height_bucket, State_bucket, Race, Drink_bucket, Tattoo, Political, Religion))
#adding a "count" to each row that can be summed later
hinge_cut$Freq <- 1
#preview of the raw counts
summary(hinge_cut)
## Age_bucket Height_bucket State_bucket Race
## Length:153 Length:153 Length:153 Asian : 6
## Class :character Class :character Class :character Hispanic : 8
## Mode :character Mode :character Mode :character Indian : 3
## Middle Eastern: 1
## White :135
##
## Drink_bucket Tattoo Political Religion Freq
## Length:153 No :137 Conservative: 12 Catholic :33 Min. :1
## Class :character Yes: 16 Liberal : 14 Christian:11 1st Qu.:1
## Mode :character Moderate : 15 Jewish : 7 Median :1
## Unknown :112 Spiritual: 8 Mean :1
## Unknown :94 3rd Qu.:1
## Max. :1
#creating a summary table "try22" with the count for each combination of categories
hinge_cut %>% group_by(Height_bucket, Age_bucket, Religion, Drink_bucket, Political, State_bucket) %>%
  summarise(n = sum(Freq)) -> try22
#creation of the actual alluvial chart, which includes some minor edits for readability
alluvial(try22[, 1:6], freq = try22$n,
         col = ifelse(try22$Height_bucket == "My Height", "light green", "gray"),
         border = ifelse(try22$Height_bucket == "My Height", "light green", "gray"),
         hide = try22$n < 1, cex.axis = .75, cex = .7, cw = .3,
         axis_labels = c("Height?", "Age?", "Religion?", "Alcohol?", "Politics?", "State?"))
So, this was me having fun with my personal data. Thanks!