Hey there! Welcome to another R project of mine. This time, I played around with my Instagram, Tinder, and Hinge data.

The hardest part of doing these projects is finding interesting data. Fortunately, you can request your data from most apps these days. Instagram and Tinder were very good about sending my data to me as JSON files. Hinge was absolutely horrible: I requested my data five times before resorting to manually going through my app and transcribing the data myself. If anyone from Hinge corporate reads this, fix your data access process.

For those who don’t know: Instagram is an app for posting photos/videos. Tinder is an app for finding “love” based on swiping right (yes) or left (no) on pictures of potential partners. Hinge is a more advanced Tinder that allows for filtering based on characteristics such as education, height, race, etc.

The three things I did with my data were:

  1. Instagram: Find out who I follow that doesn’t follow me back
  2. Tinder: Look at my performance over time (swipes, success, etc)
  3. Hinge: Create an alluvial chart to look at my preferences in girls

The overall process for each mini-project is to import the data, clean it, then apply whatever tool I want. Generally, 90% of the work is importing and cleaning.

The main packages used for this project are tidyverse, rjson, alluvial, and lubridate.

And with that…

Not my Tinder but still pretty funny

  1. Instagram

So, here is where I import the data and adjust it locally, starting with an object called “connections”.

IG structures this data as a list of six lists, two of which hold my followers and the accounts I’m following. To further complicate things, the names of the followers and followings are stored as the column names. So what I need to do is extract the column names from these two lists and store them in two objects: “follower_names” and “following_names.” There’s also extra prefix text attached to these values that I trim off. Lastly, I convert both into data frames so I can use an anti-join in the next step.

An anti-join returns the rows of the first data frame that have no match in the second. I use this to find the accounts I follow that don’t follow me back. Yes, I know there are apps for this. This is more fun and keeps my data private (kinda).
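
To make that concrete, here is a tiny sketch with made-up handles (none of these come from my real data). It assumes only that dplyr is installed; anti_join() keeps the rows of the first data frame whose key has no match in the second:

```r
library(dplyr)

# Hypothetical handles, invented for illustration
following <- data.frame(handle = c("alice", "bob", "carol"), stringsAsFactors = FALSE)
followers <- data.frame(handle = c("bob", "dave"), stringsAsFactors = FALSE)

# Rows of `following` with no matching handle in `followers`
not_following_back <- anti_join(following, followers, by = "handle")
not_following_back$handle
```

Note that the order of the two data frames matters: swapping them would instead list followers I don’t follow back.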

After the anti-join, we can take a look at a few records in the object “assholes” and check them against my app for accuracy. Spoiler: these folks didn’t follow me. I ended up cleaning out 100 or so followers. The head() function shows the first six records, or observations, from “assholes”.

# load the packages used throughout this post
library(tidyverse)
library(rjson)
library(lubridate)
library(alluvial)
# import data from JSON file
connections <- fromJSON(file = "connections.json")
# pulling followers and following lists (elements 3 and 4), converting each to a data frame
followers <- as.data.frame(connections[3])
following <- as.data.frame(connections[4])
# the handles are stored as column names with a "followers." or "following." prefix
# strip that prefix to cut IG handles down to their shortest form
follower_names <- colnames(followers)
follower_names <- str_remove(follower_names, "^followers\\.")
follower_names <- as.data.frame(follower_names)
following_names <- colnames(following)
following_names <- str_remove(following_names, "^following\\.")
following_names <- as.data.frame(following_names)
# using an anti-join to find what names are on the following list, but not the follower list
assholes <- anti_join(following_names, follower_names, by = c("following_names" = "follower_names"))
## Warning: Column `following_names`/`follower_names` joining factors with
## different levels, coercing to character vector
# looking at the first 6 names (head() returns six rows by default)
head(assholes)
##   following_names
## 1     keshavreddy
## 2        njdotcom
## 3      kingmaryiv
## 4      ndbarstool
## 5     tori_wilton
## 6       lexzworld
  2. Tinder

The downside of Tinder is how little descriptive data it provides. There is an “about me” section where users can put anything, such as height, college, etc. However, that section can’t be easily pulled and grouped without contextual text mining, which I’m too lazy to do. The best thing to do with Tinder data is to look at usage stats and success rates over time.

The data import and cleanup is very similar to the IG part above. I imported a massive list of lists from a JSON file, then selected one list from it: “usage”. Inside “usage” are several fields I work with. In general, each field is just a set of counts, one per date. What makes this a little annoying is that the date information is stored as column names, which we deal with in a similar manner to the IG part. After extracting and cleaning the data, and realizing I can make plots without using the cleaned date field, we can finally do some analysis.
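
Here is a rough sketch of that reshaping idea on a made-up list. The field name, dates, and counts are all invented, and I use base as.Date() here rather than lubridate; the point is only that counts keyed by date can be unpacked into a regular data frame:

```r
# Hypothetical stand-in for one usage field: daily counts keyed by date
swipes_likes <- list("2020-01-07" = 12, "2020-01-08" = 30, "2020-01-09" = 7)

# The dates live in the names, the counts in the values
daily <- data.frame(
  date  = as.Date(names(swipes_likes)),
  likes = as.numeric(swipes_likes)
)
daily
```

Reading names() directly avoids the “X” prefix that as.data.frame() adds to column names starting with a digit.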

I stored matches, lefts (dislikes), and rights (likes) in a data frame: “tinder”. First I had some fun and created a few ratios. pull_rate is my matches divided by my likes (3%), selectiveness is my rights divided by the total number of girls I’ve seen (47%), and days is the number of days covered by the data (103 days). Without a benchmark from more users, I have no idea how I’m actually doing. So send me your data and let’s see if I’m better than you.

Next I made a plot of the general trend of my total activity over time. I used geom_smooth to keep the graph easy to digest. Overall my activity fell, then picked up at the end. Why did my activity start so high? When I started using Tinder, T-Mobile was giving away Tinder premium, which has unlimited swipes. Normally Tinder caps right swipes at 200 or something per day. That promotion fell off sometime in mid-February. Additionally, the two black vertical lines mark a period where I wasn’t using the app as much because I met a girl on Hinge I liked. Covid kind of ended that (sad Kevin).

# import data from JSON file, the file is 2.3 MB of lists within lists
raw_data <- fromJSON(file = "data.json")
# pulling just the usage stats (element 8, named "Usage") from the giant list
usage <- raw_data[[8]]
# converting each field of daily counts into a plain numeric vector
# pulling right swipes, or likes
rights <- as.numeric(usage$swipes_likes)
# pulling left swipes, or passes
lefts <- as.numeric(usage$swipes_passes)
# pulling matches
matches <- as.numeric(usage$matches)
# pulling the dates, which are stored as the column names
date <- colnames(as.data.frame(usage[[1]]))
# as.data.frame() prefixes names that start with a digit with "X", so strip it
date <- str_remove(date, "^X")
# converting date from string to date
date <- ymd(date)
# creating a data frame, named "tinder", with matches, passes and likes
tinder <- data.frame(matches, lefts, rights)

# creating some ratios, such as matches divided by likes
tinder %>% summarize(pull_rate = sum(matches) / sum(rights),
                     selectiveness = sum(rights) / sum(rights + lefts),
                     days = n())
##    pull_rate selectiveness days
## 1 0.02815013     0.4734209  103
# creating a visualization of total swipes (likes + passes) over time
ggplot(tinder, aes(x = date, y = rights + lefts)) +
  geom_smooth() +
  # vertical lines marking the start and end of the low-activity period
  geom_vline(aes(xintercept = date[81])) +
  geom_vline(aes(xintercept = date[92])) +
  ggtitle("KLove's Tinder", subtitle = "It was nice while it lasted")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  3. Hinge

This was the most fun, mostly because I experimented with a new type of graph: an alluvial chart. An alluvial chart passes frequencies of observations through several categories, so one can see which combinations are popular. I use the chart here to find out what my type is among girls I’d like to go on a date with.
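
If you have never seen one, here is a minimal sketch using R’s built-in Titanic dataset instead of my Hinge data. It assumes the alluvial package is installed (the plot is skipped otherwise), and the orange highlight for survivors is just an arbitrary choice for illustration:

```r
# Built-in Titanic data, flattened to one row per combination with a Freq column
tit <- as.data.frame(Titanic, stringsAsFactors = FALSE)

# Draw only if the alluvial package is available
if (requireNamespace("alluvial", quietly = TRUE)) {
  alluvial::alluvial(
    tit[, 1:4],                # the categorical axes: Class, Sex, Age, Survived
    freq = tit$Freq,           # how many passengers fall in each combination
    col = ifelse(tit$Survived == "Yes", "orange", "gray"),
    border = ifelse(tit$Survived == "Yes", "orange", "gray")
  )
}
```

Each band is one combination of categories, and its thickness is the count for that combination.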

The data cleaning here was super easy since I manually transcribed this data from the app into a CSV file. The only strange thing is that the alluvial chart looks weird when I have one-off observations. For instance, I have the heights for all my matches. Putting every individual height on the chart just looks smushed and ugly. So to fix that I put heights into two buckets: shorter and my height. I don’t like taller girls and they don’t like me. In essence, as the number of buckets in a category goes down, the graph looks cleaner.
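
As a sketch of the bucketing idea, here is an alternative to nested ifelse() calls using base R’s cut(). The heights are made up, and it assumes heights are recorded in whole inches, with the same cutoffs as my code (under 65 inches, 65 to 68, over 68):

```r
# Made-up heights in whole inches
height_in <- c(60, 64, 65, 68, 70)

# (-Inf, 64] -> Shorter, (64, 68] -> My Height, (68, Inf) -> Taller
height_bucket <- cut(height_in,
                     breaks = c(-Inf, 64, 68, Inf),
                     labels = c("Shorter", "My Height", "Taller"))
table(height_bucket)
```

cut() returns a factor with the labels in the order given, which also keeps the buckets in a sensible order on a chart axis.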

When applying the alluvial chart you can see how this can get really messy but also pretty cool. I decided to highlight the girls who were my height, but I can highlight pretty much anything I want. I experimented with marking liberal girls blue, conservative red, and moderate purple, but that still looked messy.

I obviously have a type though: short girls who drink.

I also included a head() call here to give an idea of the data I had.

Note: Alluvial charts do not do well with NA data, so I had to rename some NAs as “unknown.”
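
A minimal sketch of one way to do that rename in R, on made-up values. It assumes the column is stored as character; a factor column would need as.character() first, since “Unknown” would not be one of its levels:

```r
# Made-up answers with missing values, stored as character
religion <- c("Catholic", NA, "Jewish", NA)

# Swap each NA for an explicit "Unknown" label
religion[is.na(religion)] <- "Unknown"
religion
```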

#loading in the data
hinge_data<-read.csv("Manual Hinge Data.csv")
#Previewing data
head(hinge_data)
##   Index      Date        Name Age Height.ft Height..in.          Town State
## 1     1  1/7/2020     Kristen  23       5'4          64   Morganville    NJ
## 2     2  1/8/2020     Kara R.  26       5'3          63   Morganville    NJ
## 3     3  1/8/2020    Jaime B.  23       5'4          64      Freehold    NJ
## 4     4  1/9/2012     Brianna  26       5'0          60      Freehold    NJ
## 5     5 1/10/2020 Michelle K.  23       5'6          66         Ocean    NJ
## 6     6 1/10/2020        Mary  24       5'7          67 West Freehold    NJ
##                      School  Race    Political  Religion     Drink   Smoke
## 1 The College of New Jersey White      Unknown  Catholic Sometimes      No
## 2         Drexel University White Conservative   Unknown       Yes Unknown
## 3  Middlesex County College White      Unknown   Unknown       Yes      No
## 4       Monmouth University White      Unknown   Unknown Sometimes      No
## 5                      <NA> White Conservative Christian Sometimes      No
## 6                Seton Hall White      Unknown   Unknown       Yes      No
##       Drugs Insta           Job Tattoo Children Hot.
## 1        No    No       Student     No       No    7
## 2   Unknown   Yes         Nurse     No       No    9
## 3 Sometimes    No   Transporter    Yes     Want    9
## 4   Unknown   Yes Social Worker     No       No    7
## 5        No    No          <NA>     No     Want    8
## 6   Unknown   Yes    Accountant     No       No    5
#Converting the string date data to actual date data, even though I don't use it
hinge_data$Date <- mdy(hinge_data$Date)

#making a bucket for age  
hinge_data$Age_bucket <- ifelse(hinge_data$Age <27, "Younger",
                       ifelse(hinge_data$Age > 27, "Older", "Same Age"))
#making a bucket for height 
hinge_data$Height_bucket <-ifelse(hinge_data$Height..in. <65, "Shorter",
                          ifelse(hinge_data$Height..in. > 68, "Taller", "My Height"))
#making a bucket for state 
hinge_data$State_bucket <-ifelse(hinge_data$State =="NJ", "NJ", "Not NJ")
#making a bucket for drink, I don't like sober girls
hinge_data$Drink_bucket <-ifelse(hinge_data$Drink =="Yes", "Yes",
                                 ifelse(hinge_data$Drink =="Sometimes", "Sometimes","Unknown"))
#selecting only the columns I might want for the alluvial chart
hinge_cut <- select(hinge_data, c(Age_bucket,Height_bucket,State_bucket,Race,Drink_bucket,Tattoo, Political, Religion))
#adding a "count" to each row that can be summed later
hinge_cut$Freq <- 1
#preview of the raw counts
summary(hinge_cut)     
##   Age_bucket        Height_bucket      State_bucket                   Race    
##  Length:153         Length:153         Length:153         Asian         :  6  
##  Class :character   Class :character   Class :character   Hispanic      :  8  
##  Mode  :character   Mode  :character   Mode  :character   Indian        :  3  
##                                                           Middle Eastern:  1  
##                                                           White         :135  
##                                                                               
##  Drink_bucket       Tattoo           Political        Religion       Freq  
##  Length:153         No :137   Conservative: 12   Catholic :33   Min.   :1  
##  Class :character   Yes: 16   Liberal     : 14   Christian:11   1st Qu.:1  
##  Mode  :character             Moderate    : 15   Jewish   : 7   Median :1  
##                               Unknown     :112   Spiritual: 8   Mean   :1  
##                                                  Unknown  :94   3rd Qu.:1  
##                                                                 Max.   :1
#creating sub variable "try22" that has the counts by variable
hinge_cut %>% group_by(Height_bucket, Age_bucket, Religion, Drink_bucket, Political,State_bucket) %>%
  summarise(n = sum(Freq)) -> try22
#creation of the actual alluvial chart, which includes some minor edits for readability
alluvial(try22[,1:6], freq=try22$n,
         col = ifelse(try22$Height_bucket == "My Height", "light green", "gray"),
                  border = ifelse(try22$Height_bucket == "My Height", "light green", "gray"),
                  hide=try22$n < 1, cex.axis = .75, cex = .7,cw = .3,
                  axis_labels = c("Height?", "Age?","Religion?", "Alcohol?", "Politics?", "State?"))

So, this was me having fun with my personal data. Thanks!