It has been a little more than a month since I joined a dating website called Okcupid. It is a very pupular dating website which uses complex mathmatical algorithm to match people based on their personalities and perferences.
This is the first time I have used such dating werbsite, and it has been an interesting experience where I can find all different kind of people who I do not normally meet in real life. It is fascinating that some of them actually visited my profile and even liked / messaged me.
As a data enthusiast, it only way to make sense of this whole dynamic is through data analysis. Luckily, there are so many data points available through person’s own personal profile that I can use to understand not only what kind of people like me but also what kind of people I’m attracted to.
Here is my basic profile so it gives a perspective and baseline on the analysis.
library(dplyr)
library(ggplot2)
library(ggmap)
library(leaflet)
library(tidyr)
library(ggmosaic)
library(lubridate)
library(rpart)
library(rpart.plot)
library(randomForest)
library(DT)
library(caTools)
dat <- read.csv("data_analysis.csv")
Perform data imputation and manipulation before analysis
dat$Date <- as.Date(dat$Date, format = "%m/%d/%Y")
dat$Like <- factor(dat$Like)
dat$Height_cm[dat$Height_cm == 213.36] <- NA
dat$Height_cm[is.na(dat$Height_cm)] <- mean(dat$Height_cm, na.rm = TRUE)
dat$Body_Type <- factor(dat$Body_Type, levels(dat$Body_Type)[c(1,8,5,3,2,4,6,7)])
dat$Ethnicity[dat$Ethnicity %in% c("Pacific Islander", "Indian")] <- "Asian"
dat$Ethnicity <- factor(dat$Ethnicity)
dat$Education[dat$Education == "Dropped out of University"] <- "High school"
dat$Education[dat$Education == "Working on Post grad"] <- "Post grad"
dat$Education[dat$Education == "Working on University"] <- "University"
dat$Education[dat$Education == "Working on Space camp"] <- ""
dat$Education <- factor(dat$Education)
dat$Education <- factor(dat$Education, levels(dat$Education)[c(1,2,4,3)])
I got about 4 vistors per day. But as we can see that a lot of my visits came from the first couple of days since I joined. There is significant drop off after it. The averge went from 4.2 to 3, a 30% reduction. Of course, there is a pay fearure that allow people to boot their visibility as I used on May-5 which brought my visits back to initial run rate.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.194 5.000 16.000
Average without boosting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 3.000 2.966 4.000 8.000
It looks like more people visited me during Sunday and Saturday which is a little surpising to me since I thought people might use it more often during the week.
I set my matching age range from 23 years old to 38 years old (I’m 36) which can probably explain the left-skewed distribution. However, it is interesting to see the overall distribution a multimodal distribution where one peak is around age 33 and one is around 38.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 30.00 33.00 32.97 37.00 40.00
So I looked into what might be the underlying factors by breaking the age distribution into different ethnicity groups. I can now clearly see the two peaks are coming from different ethnicity populations. Asian visits have peaks around 32 years old, while black/hispanic is around 34, and white is around 37.
More than 40% of my visits are Asian which shows that people might still perfer their own race. Also 40% of visits have postgraduation degree. It shows that not only race, people also also are looking for someone who has similar education background. It is one of the cognitive biases that we perfer people who are similar to us.
Most of the visits came from Manhattan with Brooklyn being second.
When I setup my profile, Okcupid asked me a lot of behavioral and personality questions. One of the outputs from these assessments is to compare your personality against other users. Some of top personaliies from my visits are wholesome, conservative, ambitious, etc. On the other hand, Sex.driven has the lowest score. I am not sure if I should be happy or not about this result.
Match score is the key core competency from Okcupid since it prides itself as a analytics driven dating website that can find the best matches based on its algorithm. And the way people show up from the search is also heavily driven by the match scores.
The overall match score from my visits are mostly above 50% with average around 66%. The match score distribution seems equally accross different backgrounds.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 61.50 69.00 66.42 77.00 89.00
I give every visits a rating from 1 to 5 based on their profile pictures. 1 is the least attractive and 5 is the most attractive. This exercise is to help me understand what kind of people I’m attracted to, and do I have perference toward certain race, age, or body type?
The average rating I gave all of my vistors is 2.7. The overall distribtuion is right-skewed which is either because I have high standard or people who are very attractive don’t usually visit me. I believe the later one might make more sense.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.669 3.500 5.000
Based on the boxplot, it seems like I do have different perference among different ethnicity groups. The attractiveness score for black is noticable lower and white is higher.
So I ran a simple linear regression using ethnicity as explanatory variable and attractivenss as response variable. The p-values for coefficient of EthnicityBlack is less than 0.05 which means I do find black people less attractive. Same conclusion from ANOVA analysis.
lm.att.eth <- lm(Attractiveness ~ Ethnicity, data = dat)
summary(lm.att.eth)
##
## Call:
## lm(formula = Attractiveness ~ Ethnicity, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10000 -0.72464 0.07692 0.90000 2.55556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.72464 0.13325 20.448 <2e-16 ***
## EthnicityBlack -0.80156 0.25470 -3.147 0.002 **
## EthnicityHispanic / Latin -0.28019 0.39227 -0.714 0.476
## EthnicityOther -0.01035 0.43906 -0.024 0.981
## EthnicityWhite 0.37536 0.21996 1.706 0.090 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.107 on 146 degrees of freedom
## Multiple R-squared: 0.1121, Adjusted R-squared: 0.08776
## F-statistic: 4.607 on 4 and 146 DF, p-value: 0.001563
anova(lm.att.eth)
## Analysis of Variance Table
##
## Response: Attractiveness
## Df Sum Sq Mean Sq F value Pr(>F)
## Ethnicity 4 22.579 5.6447 4.6075 0.001563 **
## Residuals 146 178.865 1.2251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is no strong correlation between age and attractiveness. The linear trend seems downward but it is not significant as we can see from the p-value from regression analysis.
##
## Call:
## lm(formula = Attractiveness ~ Age, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9183 -0.8088 0.2381 0.8007 2.4258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.70036 0.78241 4.729 5.18e-06 ***
## Age -0.03128 0.02356 -1.328 0.186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.156 on 149 degrees of freedom
## Multiple R-squared: 0.0117, Adjusted R-squared: 0.005064
## F-statistic: 1.764 on 1 and 149 DF, p-value: 0.1862
## Analysis of Variance Table
##
## Response: Attractiveness
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 2.356 2.3563 1.7635 0.1862
## Residuals 149 199.087 1.3362
I also seem to have perference against heavier women. Based on result from the linear regression and ANOVA analysis, “A little extra” build, Full figured, and Overweight are rated less attractive with statistical significance.
lm.att.body <- lm(Attractiveness ~ Body_Type, data = dat)
summary(lm.att.body)
##
## Call:
## lm(formula = Attractiveness ~ Body_Type, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1212 -0.7768 0.0625 0.8788 2.2381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.76190 0.16857 16.385 <2e-16 ***
## Body_TypeThin 0.17560 0.32094 0.547 0.5851
## Body_TypeFit 0.35931 0.25412 1.414 0.1596
## Body_TypeAverage build 0.02976 0.27954 0.106 0.9154
## Body_Type"A little extra" build -1.09524 0.47678 -2.297 0.0231 *
## Body_TypeCurvy -0.56190 0.29679 -1.893 0.0603 .
## Body_TypeFull figured -1.04762 0.44599 -2.349 0.0202 *
## Body_TypeOverweight -1.42857 0.65286 -2.188 0.0303 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.092 on 143 degrees of freedom
## Multiple R-squared: 0.1528, Adjusted R-squared: 0.1114
## F-statistic: 3.685 on 7 and 143 DF, p-value: 0.001084
anova(lm.att.body)
## Analysis of Variance Table
##
## Response: Attractiveness
## Df Sum Sq Mean Sq F value Pr(>F)
## Body_Type 7 30.785 4.3979 3.6851 0.001084 **
## Residuals 143 170.659 1.1934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Visisting profile is only the very first step in the dating process. Having people like me after the visit is important because it gives me an indication that they are intesteing in me. So it is my turn now to send the people who I find attractive the messages. Of course, most of the messages sent out will probably not get any response back. But nevertheless, knowing who likes you definitely is a great advantage that can save you a lot of time and energy from sending the messages to the people who have zero interesting in you. I believe it is also premium feature to be able to see who like you.
In the first analysis, I analyized how many visits I have by date. But more important question is out of these visits, how many of them like me.
About 60% of people visited my profile eventually liked me as well. So with this rate, it seems like my profile is okay. So to have more likes is simply to have more visits. That’s why another pay feature is I can boost my visibility.
Now I want to analyze are there any specific groups of people who are more likely to give me likes? And based on the charts and statistical analysis below, it seems Attractiveness and Ethnicity(being white) are the only factors that have the significant difference.
Attractiveness makes sense since more attractive people probably are getting a lot of messages already without having to initiate the connection. They can just wait for people to make first move. On the other hand, for less attractive people, they will have to be more aggressive.
##
## Welch Two Sample t-test
##
## data: dat$Age[dat$Like == 1] and dat$Age[dat$Like == 0]
## t = -0.58821, df = 109.64, p-value = 0.5576
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.7779954 0.9641281
## sample estimates:
## mean of x mean of y
## 32.81720 33.22414
##
## Welch Two Sample t-test
##
## data: dat$Match[dat$Like == 1] and dat$Match[dat$Like == 0]
## t = 0.23783, df = 125.1, p-value = 0.8124
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.959706 6.314545
## sample estimates:
## mean of x mean of y
## 66.67742 66.00000
##
## Call:
## glm(formula = Like ~ Attractiveness, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6357 -1.3425 0.7802 1.0208 1.2953
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3600 0.4488 3.030 0.00244 **
## Attractiveness -0.3266 0.1501 -2.175 0.02959 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 201.14 on 150 degrees of freedom
## Residual deviance: 196.24 on 149 degrees of freedom
## AIC: 200.24
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = Like ~ Ethnicity, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.734 -1.215 0.802 1.141 1.302
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.08701 0.24100 0.361 0.7181
## EthnicityBlack 0.72392 0.48850 1.482 0.1384
## EthnicityHispanic / Latin 1.16575 0.83722 1.392 0.1638
## EthnicityOther -0.37469 0.80088 -0.468 0.6399
## EthnicityWhite 0.88239 0.42834 2.060 0.0394 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 201.14 on 150 degrees of freedom
## Residual deviance: 193.77 on 146 degrees of freedom
## AIC: 203.77
##
## Number of Fisher Scoring iterations: 4
Based on above analysis, it is very intesting to see if I can somehow predict or at least have an estimated possibility on whether or not this person will like me after visit.
Logistic Regression
set.seed(2)
spl = sample.split(dat$Like, SplitRatio = 0.7)
train = subset(dat, spl == TRUE)
test = subset(dat, spl == FALSE)
First model for logistic regression I included as many variables as I think might make sense.
With this model, variables like Attractiveness, Ethnicity, and Language_2nd all have p-value less than 0.05. However, when I look at the prediction results against training and test sets, this model is definitely overfitting. And the accuracy of the test set is even less than the average of 60%.
##
## Call:
## glm(formula = Like ~ Attractiveness + Age + Match + Height_cm +
## Body_Type + Ethnicity + Language_2nd + Education + Age_min +
## Age_max + Wholesome + Conservative + Ambitious + Old.fashioned +
## Playful + Love.driven + Artsy + Spiritual + Independent +
## Polite + Sex.driven, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2067 -0.6886 0.2302 0.7167 2.1769
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.06647 8.97916 -0.453 0.65064
## Attractiveness -0.99823 0.37520 -2.661 0.00780 **
## Age -0.07353 0.13129 -0.560 0.57546
## Match -0.02162 0.01870 -1.156 0.24762
## Height_cm 0.08019 0.05415 1.481 0.13864
## Body_TypeThin -1.56656 1.41100 -1.110 0.26689
## Body_TypeFit 0.44464 1.02205 0.435 0.66353
## Body_TypeAverage build -0.33465 1.12797 -0.297 0.76671
## Body_Type"A little extra" build 1.20622 1.80293 0.669 0.50347
## Body_TypeCurvy -0.65461 1.21023 -0.541 0.58858
## Body_TypeFull figured -2.09721 1.60385 -1.308 0.19100
## Body_TypeOverweight 16.98851 4145.39449 0.004 0.99673
## EthnicityBlack 1.10937 1.10463 1.004 0.31524
## EthnicityHispanic / Latin 2.87433 1.76138 1.632 0.10271
## EthnicityOther -0.05254 2.31973 -0.023 0.98193
## EthnicityWhite 3.32522 1.20359 2.763 0.00573 **
## Language_2ndAfrikaans 18.69106 6522.63944 0.003 0.99771
## Language_2ndChinese 2.41905 1.16853 2.070 0.03844 *
## Language_2ndFrench 0.24612 1.78147 0.138 0.89012
## Language_2ndGerman 13.36437 6522.63927 0.002 0.99837
## Language_2ndHindi 19.63780 6522.63874 0.003 0.99760
## Language_2ndJapanese -1.46447 1.57212 -0.932 0.35158
## Language_2ndKorean 0.18027 1.82372 0.099 0.92126
## Language_2ndRussian -32.61555 4350.57556 -0.007 0.99402
## Language_2ndSpanish -0.84618 1.23552 -0.685 0.49342
## Language_2ndTagalog 12.40203 6522.63911 0.002 0.99848
## Language_2ndVietnamese -17.27336 6522.63873 -0.003 0.99789
## EducationHigh school 11.88991 3076.32164 0.004 0.99692
## EducationUniversity 0.70463 0.96807 0.728 0.46669
## EducationPost grad 0.61863 1.04920 0.590 0.55545
## Age_min -0.02655 0.13786 -0.193 0.84728
## Age_max -0.07039 0.10091 -0.698 0.48545
## Wholesome -0.91960 0.89248 -1.030 0.30283
## Conservative 0.99339 0.89796 1.106 0.26861
## Ambitious 0.94307 0.93104 1.013 0.31110
## Old.fashioned 1.28857 0.85712 1.503 0.13274
## Playful -1.20998 1.11809 -1.082 0.27917
## Love.driven 1.90457 1.17358 1.623 0.10462
## Artsy -0.81847 1.35199 -0.605 0.54493
## Spiritual 0.39037 0.77317 0.505 0.61363
## Independent -0.38964 1.52416 -0.256 0.79823
## Polite -0.81106 1.30019 -0.624 0.53276
## Sex.driven 1.49828 1.06614 1.405 0.15992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 141.466 on 105 degrees of freedom
## Residual deviance: 89.366 on 63 degrees of freedom
## AIC: 175.37
##
## Number of Fisher Scoring iterations: 17
Accuracy for test set
##
## FALSE TRUE
## 0 7 10
## 1 11 17
## [1] 0.5333333
Accuracy for traing set
##
## FALSE TRUE
## 0 28 13
## 1 10 55
## [1] 0.7830189
So now I reduce the number of variables. The accuracy for training set went down to 65% and test set is now 53% which we can see that the overfitting has been reduced but the accuracy for test set is worse than the baseline.
##
## Call:
## glm(formula = Like ~ Attractiveness + Age + Ethnicity, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9177 -1.1378 0.6785 0.9528 1.4366
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.30225 1.93465 2.224 0.0262 *
## Attractiveness -0.59527 0.21873 -2.722 0.0065 **
## Age -0.07852 0.05438 -1.444 0.1487
## EthnicityBlack -0.02418 0.59740 -0.040 0.9677
## EthnicityHispanic / Latin 1.53015 1.16025 1.319 0.1872
## EthnicityOther -0.34555 1.09381 -0.316 0.7521
## EthnicityWhite 1.19164 0.56665 2.103 0.0355 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 141.47 on 105 degrees of freedom
## Residual deviance: 127.23 on 99 degrees of freedom
## AIC: 141.23
##
## Number of Fisher Scoring iterations: 4
Accuracy for test set
##
## FALSE TRUE
## 0 3 14
## 1 10 18
## [1] 0.4666667
Accuracy for training set
##
## FALSE TRUE
## 0 16 25
## 1 9 56
## [1] 0.6792453
Decision Tree
Now I’m using decision tree to fit my model. The first decision tree includes many variables. As we can see that variables Attractiveness, Age, Body_Type, Spiritual, and Education at different levels have the impact on whether or like the person will leave a like.
However, this model still overfitted and the prediction result is worse than benchmark.
Accuracy on training set
##
## FALSE TRUE
## 0 20 21
## 1 6 59
## [1] 0.745283
Accuracy on test set
##
## FALSE TRUE
## 0 3 14
## 1 9 19
## [1] 0.4888889
Now I fit the second decision tree model with less variables. However, just like logistic regression, even I was able to reduce the overfitting on the model, I still do not have a good prediction model. (accuracy on test set is 53% whihc is less than 62% as benchmark)
##
## FALSE TRUE
## 0 21 20
## 1 14 51
## [1] 0.6792453
##
## FALSE TRUE
## 0 5 12
## 1 9 19
## [1] 0.5333333
Random Forest
Last modeling I tried is Random Forest. It is more complicated but every popular model in machine learning.
Based on the chart below, March Score, Body Type, Age, Age_max, Height, Age_min, and Attractiveness are important compared to others with big drop off. However, based on the prediction result, there is no way I can predict based on the data that I have.
Accuracy of training set
## m1.rf.train
## 0 1
## 0 3 38
## 1 18 47
## [1] 0.5471698
Accuracy of test set
## m1.rf.test
## 0 1
## 0 4 13
## 1 8 20
## [1] 0.4888889
dat$week <- weekdays(dat$Date)
dat$week <- factor(dat$week, c("Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday"))
ggplot(dat, aes(x = week, y = Attractiveness)) +
geom_boxplot() +
geom_point(position = position_jitter(width = 0.2))