Summary

It has been a little more than a month since I joined a dating website called Okcupid. It is a very pupular dating website which uses complex mathmatical algorithm to match people based on their personalities and perferences.

This is the first time I have used such dating werbsite, and it has been an interesting experience where I can find all different kind of people who I do not normally meet in real life. It is fascinating that some of them actually visited my profile and even liked / messaged me.

As a data enthusiast, it only way to make sense of this whole dynamic is through data analysis. Luckily, there are so many data points available through person’s own personal profile that I can use to understand not only what kind of people like me but also what kind of people I’m attracted to.

Key questions I want to understand

  • What kind of people who visit and like my profile
  • What kind of people I find them attractive
  • Out of all the people visited my profile, who will eventually like it and even send or response to my messages. Can I even predict it?

Type of data available from personal’s profile for this analysis

  • Personal information : age, status, location, height, race, body type, education level, etc.
  • Matching % : one of the core advertising feature from Okcupid. It gives you a matching socre against another personl based on mathematical algorithm.
  • Self description : Your self-descriped / written answers to the common questions provided by Okcupid like “What I’m doing with my life”, “I’m really good at”, “Favorite books, movies, shows, music, and food”, “You should message me if”, ect.
  • Personality : It shows if a personal is more or less in some personality traits like “Conservative”, “Playful”, “Polite”, “Compassionate” against other users on the website.

About Me

Here is my basic profile so it gives a perspective and baseline on the analysis.

  • Male
  • 36 years old
  • Asian
  • Live in Manhattan, NY
  • Post Grad
  • 5’10"
  • Average build
  • Attractiveness. (Rating my own attractive is very subjective and people usually think they are better looking than they actually are which is called Illusory superiority Bias. So here I have to painfully admit that as a 36 years old Asian American, I am probably a 2.5, but a 3.5 among Asian users, scale from 1 to 5).

Process Data

library(dplyr)
library(ggplot2)
library(ggmap)
library(leaflet)
library(tidyr)
library(ggmosaic)
library(lubridate)
library(rpart)
library(rpart.plot)
library(randomForest)
library(DT)
library(caTools)
dat <- read.csv("data_analysis.csv")

Perform data imputation and manipulation before analysis

dat$Date <- as.Date(dat$Date, format = "%m/%d/%Y")
dat$Like <- factor(dat$Like)
dat$Height_cm[dat$Height_cm == 213.36] <- NA
dat$Height_cm[is.na(dat$Height_cm)] <- mean(dat$Height_cm, na.rm = TRUE)
dat$Body_Type <- factor(dat$Body_Type, levels(dat$Body_Type)[c(1,8,5,3,2,4,6,7)])
dat$Ethnicity[dat$Ethnicity %in% c("Pacific Islander", "Indian")] <- "Asian"
dat$Ethnicity <- factor(dat$Ethnicity)
dat$Education[dat$Education == "Dropped out of University"] <- "High school"
dat$Education[dat$Education == "Working on Post grad"] <- "Post grad"
dat$Education[dat$Education == "Working on University"] <- "University"
dat$Education[dat$Education == "Working on Space camp"] <- ""
dat$Education <- factor(dat$Education)
dat$Education <- factor(dat$Education, levels(dat$Education)[c(1,2,4,3)])

Analysis - Part I : Who Visit My Profile

Visiting Trend

I got about 4 vistors per day. But as we can see that a lot of my visits came from the first couple of days since I joined. There is significant drop off after it. The averge went from 4.2 to 3, a 30% reduction. Of course, there is a pay fearure that allow people to boot their visibility as I used on May-5 which brought my visits back to initial run rate.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.194   5.000  16.000

Average without boosting

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   3.000   2.966   4.000   8.000

Calendar Heatmap

It looks like more people visited me during Sunday and Saturday which is a little surpising to me since I thought people might use it more often during the week.

Age Distribution

I set my matching age range from 23 years old to 38 years old (I’m 36) which can probably explain the left-skewed distribution. However, it is interesting to see the overall distribution a multimodal distribution where one peak is around age 33 and one is around 38.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   30.00   33.00   32.97   37.00   40.00

So I looked into what might be the underlying factors by breaking the age distribution into different ethnicity groups. I can now clearly see the two peaks are coming from different ethnicity populations. Asian visits have peaks around 32 years old, while black/hispanic is around 34, and white is around 37.

Visits by Ethnicity and Education Level

More than 40% of my visits are Asian which shows that people might still perfer their own race. Also 40% of visits have postgraduation degree. It shows that not only race, people also also are looking for someone who has similar education background. It is one of the cognitive biases that we perfer people who are similar to us.

Location

Most of the visits came from Manhattan with Brooklyn being second.

Top Personality

When I setup my profile, Okcupid asked me a lot of behavioral and personality questions. One of the outputs from these assessments is to compare your personality against other users. Some of top personaliies from my visits are wholesome, conservative, ambitious, etc. On the other hand, Sex.driven has the lowest score. I am not sure if I should be happy or not about this result.

Match Score

Match score is the key core competency from Okcupid since it prides itself as a analytics driven dating website that can find the best matches based on its algorithm. And the way people show up from the search is also heavily driven by the match scores.

The overall match score from my visits are mostly above 50% with average around 66%. The match score distribution seems equally accross different backgrounds.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   61.50   69.00   66.42   77.00   89.00

Analysis - Part II : What Kind of People I’m Attracted to

I give every visits a rating from 1 to 5 based on their profile pictures. 1 is the least attractive and 5 is the most attractive. This exercise is to help me understand what kind of people I’m attracted to, and do I have perference toward certain race, age, or body type?

Overall Attractiveness Rating

The average rating I gave all of my vistors is 2.7. The overall distribtuion is right-skewed which is either because I have high standard or people who are very attractive don’t usually visit me. I believe the later one might make more sense.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.669   3.500   5.000

Attractiveness against Ethnicity

Based on the boxplot, it seems like I do have different perference among different ethnicity groups. The attractiveness score for black is noticable lower and white is higher.

So I ran a simple linear regression using ethnicity as explanatory variable and attractivenss as response variable. The p-values for coefficient of EthnicityBlack is less than 0.05 which means I do find black people less attractive. Same conclusion from ANOVA analysis.

lm.att.eth <- lm(Attractiveness ~ Ethnicity, data = dat)
summary(lm.att.eth)
## 
## Call:
## lm(formula = Attractiveness ~ Ethnicity, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10000 -0.72464  0.07692  0.90000  2.55556 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.72464    0.13325  20.448   <2e-16 ***
## EthnicityBlack            -0.80156    0.25470  -3.147    0.002 ** 
## EthnicityHispanic / Latin -0.28019    0.39227  -0.714    0.476    
## EthnicityOther            -0.01035    0.43906  -0.024    0.981    
## EthnicityWhite             0.37536    0.21996   1.706    0.090 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.107 on 146 degrees of freedom
## Multiple R-squared:  0.1121, Adjusted R-squared:  0.08776 
## F-statistic: 4.607 on 4 and 146 DF,  p-value: 0.001563
anova(lm.att.eth)
## Analysis of Variance Table
## 
## Response: Attractiveness
##            Df  Sum Sq Mean Sq F value   Pr(>F)   
## Ethnicity   4  22.579  5.6447  4.6075 0.001563 **
## Residuals 146 178.865  1.2251                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Attractiveness against Age

There is no strong correlation between age and attractiveness. The linear trend seems downward but it is not significant as we can see from the p-value from regression analysis.

## 
## Call:
## lm(formula = Attractiveness ~ Age, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9183 -0.8088  0.2381  0.8007  2.4258 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.70036    0.78241   4.729 5.18e-06 ***
## Age         -0.03128    0.02356  -1.328    0.186    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.156 on 149 degrees of freedom
## Multiple R-squared:  0.0117, Adjusted R-squared:  0.005064 
## F-statistic: 1.764 on 1 and 149 DF,  p-value: 0.1862
## Analysis of Variance Table
## 
## Response: Attractiveness
##            Df  Sum Sq Mean Sq F value Pr(>F)
## Age         1   2.356  2.3563  1.7635 0.1862
## Residuals 149 199.087  1.3362

Attractiveness against Body Type

I also seem to have perference against heavier women. Based on result from the linear regression and ANOVA analysis, “A little extra” build, Full figured, and Overweight are rated less attractive with statistical significance.

lm.att.body <- lm(Attractiveness ~ Body_Type, data = dat)
summary(lm.att.body)
## 
## Call:
## lm(formula = Attractiveness ~ Body_Type, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1212 -0.7768  0.0625  0.8788  2.2381 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.76190    0.16857  16.385   <2e-16 ***
## Body_TypeThin                    0.17560    0.32094   0.547   0.5851    
## Body_TypeFit                     0.35931    0.25412   1.414   0.1596    
## Body_TypeAverage build           0.02976    0.27954   0.106   0.9154    
## Body_Type"A little extra" build -1.09524    0.47678  -2.297   0.0231 *  
## Body_TypeCurvy                  -0.56190    0.29679  -1.893   0.0603 .  
## Body_TypeFull figured           -1.04762    0.44599  -2.349   0.0202 *  
## Body_TypeOverweight             -1.42857    0.65286  -2.188   0.0303 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.092 on 143 degrees of freedom
## Multiple R-squared:  0.1528, Adjusted R-squared:  0.1114 
## F-statistic: 3.685 on 7 and 143 DF,  p-value: 0.001084
anova(lm.att.body)
## Analysis of Variance Table
## 
## Response: Attractiveness
##            Df  Sum Sq Mean Sq F value   Pr(>F)   
## Body_Type   7  30.785  4.3979  3.6851 0.001084 **
## Residuals 143 170.659  1.1934                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis - Part III : Who Likes Me After Visited My Profile

Visisting profile is only the very first step in the dating process. Having people like me after the visit is important because it gives me an indication that they are intesteing in me. So it is my turn now to send the people who I find attractive the messages. Of course, most of the messages sent out will probably not get any response back. But nevertheless, knowing who likes you definitely is a great advantage that can save you a lot of time and energy from sending the messages to the people who have zero interesting in you. I believe it is also premium feature to be able to see who like you.

Likes by Date

In the first analysis, I analyized how many visits I have by date. But more important question is out of these visits, how many of them like me.

About 60% of people visited my profile eventually liked me as well. So with this rate, it seems like my profile is okay. So to have more likes is simply to have more visits. That’s why another pay feature is I can boost my visibility.

Likes by Age, Matching score, Ethnicity, and Attractivenss

Now I want to analyze are there any specific groups of people who are more likely to give me likes? And based on the charts and statistical analysis below, it seems Attractiveness and Ethnicity(being white) are the only factors that have the significant difference.

Attractiveness makes sense since more attractive people probably are getting a lot of messages already without having to initiate the connection. They can just wait for people to make first move. On the other hand, for less attractive people, they will have to be more aggressive.

## 
##  Welch Two Sample t-test
## 
## data:  dat$Age[dat$Like == 1] and dat$Age[dat$Like == 0]
## t = -0.58821, df = 109.64, p-value = 0.5576
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.7779954  0.9641281
## sample estimates:
## mean of x mean of y 
##  32.81720  33.22414

## 
##  Welch Two Sample t-test
## 
## data:  dat$Match[dat$Like == 1] and dat$Match[dat$Like == 0]
## t = 0.23783, df = 125.1, p-value = 0.8124
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.959706  6.314545
## sample estimates:
## mean of x mean of y 
##  66.67742  66.00000

## 
## Call:
## glm(formula = Like ~ Attractiveness, family = "binomial", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6357  -1.3425   0.7802   1.0208   1.2953  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      1.3600     0.4488   3.030  0.00244 **
## Attractiveness  -0.3266     0.1501  -2.175  0.02959 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 201.14  on 150  degrees of freedom
## Residual deviance: 196.24  on 149  degrees of freedom
## AIC: 200.24
## 
## Number of Fisher Scoring iterations: 4

## 
## Call:
## glm(formula = Like ~ Ethnicity, family = "binomial", data = dat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.734  -1.215   0.802   1.141   1.302  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                0.08701    0.24100   0.361   0.7181  
## EthnicityBlack             0.72392    0.48850   1.482   0.1384  
## EthnicityHispanic / Latin  1.16575    0.83722   1.392   0.1638  
## EthnicityOther            -0.37469    0.80088  -0.468   0.6399  
## EthnicityWhite             0.88239    0.42834   2.060   0.0394 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 201.14  on 150  degrees of freedom
## Residual deviance: 193.77  on 146  degrees of freedom
## AIC: 203.77
## 
## Number of Fisher Scoring iterations: 4

Can I Predict Who Will Like Me after Visited My Profile?

Based on above analysis, it is very intesting to see if I can somehow predict or at least have an estimated possibility on whether or not this person will like me after visit.

Logistic Regression

set.seed(2)
spl = sample.split(dat$Like, SplitRatio = 0.7)
train = subset(dat, spl == TRUE)
test = subset(dat, spl == FALSE)

First model for logistic regression I included as many variables as I think might make sense.

With this model, variables like Attractiveness, Ethnicity, and Language_2nd all have p-value less than 0.05. However, when I look at the prediction results against training and test sets, this model is definitely overfitting. And the accuracy of the test set is even less than the average of 60%.

## 
## Call:
## glm(formula = Like ~ Attractiveness + Age + Match + Height_cm + 
##     Body_Type + Ethnicity + Language_2nd + Education + Age_min + 
##     Age_max + Wholesome + Conservative + Ambitious + Old.fashioned + 
##     Playful + Love.driven + Artsy + Spiritual + Independent + 
##     Polite + Sex.driven, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2067  -0.6886   0.2302   0.7167   2.1769  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                       -4.06647    8.97916  -0.453  0.65064   
## Attractiveness                    -0.99823    0.37520  -2.661  0.00780 **
## Age                               -0.07353    0.13129  -0.560  0.57546   
## Match                             -0.02162    0.01870  -1.156  0.24762   
## Height_cm                          0.08019    0.05415   1.481  0.13864   
## Body_TypeThin                     -1.56656    1.41100  -1.110  0.26689   
## Body_TypeFit                       0.44464    1.02205   0.435  0.66353   
## Body_TypeAverage build            -0.33465    1.12797  -0.297  0.76671   
## Body_Type"A little extra" build    1.20622    1.80293   0.669  0.50347   
## Body_TypeCurvy                    -0.65461    1.21023  -0.541  0.58858   
## Body_TypeFull figured             -2.09721    1.60385  -1.308  0.19100   
## Body_TypeOverweight               16.98851 4145.39449   0.004  0.99673   
## EthnicityBlack                     1.10937    1.10463   1.004  0.31524   
## EthnicityHispanic / Latin          2.87433    1.76138   1.632  0.10271   
## EthnicityOther                    -0.05254    2.31973  -0.023  0.98193   
## EthnicityWhite                     3.32522    1.20359   2.763  0.00573 **
## Language_2ndAfrikaans             18.69106 6522.63944   0.003  0.99771   
## Language_2ndChinese                2.41905    1.16853   2.070  0.03844 * 
## Language_2ndFrench                 0.24612    1.78147   0.138  0.89012   
## Language_2ndGerman                13.36437 6522.63927   0.002  0.99837   
## Language_2ndHindi                 19.63780 6522.63874   0.003  0.99760   
## Language_2ndJapanese              -1.46447    1.57212  -0.932  0.35158   
## Language_2ndKorean                 0.18027    1.82372   0.099  0.92126   
## Language_2ndRussian              -32.61555 4350.57556  -0.007  0.99402   
## Language_2ndSpanish               -0.84618    1.23552  -0.685  0.49342   
## Language_2ndTagalog               12.40203 6522.63911   0.002  0.99848   
## Language_2ndVietnamese           -17.27336 6522.63873  -0.003  0.99789   
## EducationHigh school              11.88991 3076.32164   0.004  0.99692   
## EducationUniversity                0.70463    0.96807   0.728  0.46669   
## EducationPost grad                 0.61863    1.04920   0.590  0.55545   
## Age_min                           -0.02655    0.13786  -0.193  0.84728   
## Age_max                           -0.07039    0.10091  -0.698  0.48545   
## Wholesome                         -0.91960    0.89248  -1.030  0.30283   
## Conservative                       0.99339    0.89796   1.106  0.26861   
## Ambitious                          0.94307    0.93104   1.013  0.31110   
## Old.fashioned                      1.28857    0.85712   1.503  0.13274   
## Playful                           -1.20998    1.11809  -1.082  0.27917   
## Love.driven                        1.90457    1.17358   1.623  0.10462   
## Artsy                             -0.81847    1.35199  -0.605  0.54493   
## Spiritual                          0.39037    0.77317   0.505  0.61363   
## Independent                       -0.38964    1.52416  -0.256  0.79823   
## Polite                            -0.81106    1.30019  -0.624  0.53276   
## Sex.driven                         1.49828    1.06614   1.405  0.15992   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141.466  on 105  degrees of freedom
## Residual deviance:  89.366  on  63  degrees of freedom
## AIC: 175.37
## 
## Number of Fisher Scoring iterations: 17

Accuracy for test set

##    
##     FALSE TRUE
##   0     7   10
##   1    11   17
## [1] 0.5333333

Accuracy for traing set

##    
##     FALSE TRUE
##   0    28   13
##   1    10   55
## [1] 0.7830189

So now I reduce the number of variables. The accuracy for training set went down to 65% and test set is now 53% which we can see that the overfitting has been reduced but the accuracy for test set is worse than the baseline.

## 
## Call:
## glm(formula = Like ~ Attractiveness + Age + Ethnicity, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9177  -1.1378   0.6785   0.9528   1.4366  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                4.30225    1.93465   2.224   0.0262 * 
## Attractiveness            -0.59527    0.21873  -2.722   0.0065 **
## Age                       -0.07852    0.05438  -1.444   0.1487   
## EthnicityBlack            -0.02418    0.59740  -0.040   0.9677   
## EthnicityHispanic / Latin  1.53015    1.16025   1.319   0.1872   
## EthnicityOther            -0.34555    1.09381  -0.316   0.7521   
## EthnicityWhite             1.19164    0.56665   2.103   0.0355 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141.47  on 105  degrees of freedom
## Residual deviance: 127.23  on  99  degrees of freedom
## AIC: 141.23
## 
## Number of Fisher Scoring iterations: 4

Accuracy for test set

##    
##     FALSE TRUE
##   0     3   14
##   1    10   18
## [1] 0.4666667

Accuracy for training set

##    
##     FALSE TRUE
##   0    16   25
##   1     9   56
## [1] 0.6792453

Decision Tree

Now I’m using decision tree to fit my model. The first decision tree includes many variables. As we can see that variables Attractiveness, Age, Body_Type, Spiritual, and Education at different levels have the impact on whether or like the person will leave a like.

However, this model still overfitted and the prediction result is worse than benchmark.

Accuracy on training set

##    
##     FALSE TRUE
##   0    20   21
##   1     6   59
## [1] 0.745283

Accuracy on test set

##    
##     FALSE TRUE
##   0     3   14
##   1     9   19
## [1] 0.4888889

Now I fit the second decision tree model with less variables. However, just like logistic regression, even I was able to reduce the overfitting on the model, I still do not have a good prediction model. (accuracy on test set is 53% whihc is less than 62% as benchmark)

##    
##     FALSE TRUE
##   0    21   20
##   1    14   51
## [1] 0.6792453
##    
##     FALSE TRUE
##   0     5   12
##   1     9   19
## [1] 0.5333333

Random Forest

Last modeling I tried is Random Forest. It is more complicated but every popular model in machine learning.

Based on the chart below, March Score, Body Type, Age, Age_max, Height, Age_min, and Attractiveness are important compared to others with big drop off. However, based on the prediction result, there is no way I can predict based on the data that I have.

Accuracy of training set

##    m1.rf.train
##      0  1
##   0  3 38
##   1 18 47
## [1] 0.5471698

Accuracy of test set

##    m1.rf.test
##      0  1
##   0  4 13
##   1  8 20
## [1] 0.4888889

Next Steps

  • Using Self-Description Text Data - There is a huge part of data I haven’t used throughout the analysis which is the self-description data. I can use text mining technique to find out what are the most used words from my visits and likes. Then I can make those words as additional explanatory variables when I fit the models.
  • Cluster Analysis - For now I tend to analyze my visits and likes based on single variable. However, it will be fun to see if I can group my visits and likes into seperate demographic groups.
  • Message Response - As I mentioned before, once I got likes, the next step will be send out the messages to the people who I like. So it will also be intesting to see out of all people who I send the message out to, who will response to my message. However, sadly at this point, I do not get enough reponses to do any meaningful analysis.

Additional Analysis

Attractiveness By Week of the day

dat$week <- weekdays(dat$Date)
dat$week <- factor(dat$week, c("Sunday",
                             "Monday",
                             "Tuesday",
                             "Wednesday",
                             "Thursday",
                             "Friday",
                             "Saturday"))
ggplot(dat, aes(x = week, y = Attractiveness)) +
  geom_boxplot() +
  geom_point(position = position_jitter(width = 0.2))