This analysis explores the topic of speed dating. It aims to provide a basic understanding of people's behavior at such events, including the attributes most relevant for successful dating. The data set was gathered from speed-dating events held between 2002 and 2004. The research was carried out by Columbia Business School professors Ray Fisman and Sheena Iyengar as the basis for their paper "Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment". Whilst the data set itself is quite large (8378 records), it is worth mentioning that it has a high likelihood of being biased, as all participants were Columbia Business School students; any conclusions from this project may therefore not generalize.
The data set can be obtained from the following link:
https://www.kaggle.com/annavictoria/speed-dating-experiment/downloads/Speed%20Dating%20Data.csv.zip
Metadata can be found here:
The data set includes ratings of speed-dating partners on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests.
The speed-dates lasted four minutes each. Participants met roughly 10-20 partners at each event, and 21 events were recorded in total.
In addition, a questionnaire provided key personal data: demographics, dating habits, self-perception on the six attributes mentioned above, beliefs about what others find valuable in a mate, lifestyle information, etc.
Each row in the data set corresponds to a single four-minute date between participant 'iid' and their partner (identified by the corresponding 'pid').
Packages used (inferred from the code below):
library(dplyr)   # data manipulation
library(broom)   # tidy model output
library(GGally)  # ggpairs
library(car)     # regression diagnostics
Because waves 6-9 used a different methodology for rating attributes (see metadata), they are excluded from our data set:
# Keep only waves 1-5 and 10-21 (waves 6-9 used a different rating scale)
sdd <- sdd %>%
  filter(wave > 9 | wave < 6)
# Remove thousands separators so income and tuition can be treated as numeric
sdd <- sdd %>%
  mutate(income  = as.numeric(gsub(",", "", income)),
         tuition = as.numeric(gsub(",", "", tuition)))
In rating attributes, participants were asked to distribute 100 points across six variables. The numbers did not always sum to 100, so to correct for these mistakes we normalized the variables attr1_1 to shar1_1. These variables capture which attributes are important in another person, as rated by the speed-dating participant.
# Inspect the rescaled attractiveness ratings: each score divided by the
# participant's own total, expressed out of 100
sdd %>%
  mutate(sum1_1 = attr1_1 + sinc1_1 + intel1_1 + fun1_1 + amb1_1 + shar1_1) %>%
  do(tidy(.$attr1_1 / (.$sum1_1 / 100)))
Next, we used mutate to introduce the new, normalized variables:
sdd <- sdd %>%
  mutate(sum1_1 = attr1_1 + sinc1_1 + intel1_1 + fun1_1 + amb1_1 + shar1_1) %>%
  mutate(attr1_1n  = attr1_1  / (sum1_1 / 100),  # each rescaled so the six scores sum to 100
         sinc1_1n  = sinc1_1  / (sum1_1 / 100),
         intel1_1n = intel1_1 / (sum1_1 / 100),
         fun1_1n   = fun1_1   / (sum1_1 / 100),
         amb1_1n   = amb1_1   / (sum1_1 / 100),
         shar1_1n  = shar1_1  / (sum1_1 / 100))
One problem with the data set is bias: it was gathered at Columbia Business School, so the conclusions of the analysis rest on this specific sample and are not universal. Another problem is the large number of null values, which makes the analysis more difficult and the results less accurate.
About the survey data
The surveys in which participants reflected on themselves and their partners were administered at three different points in time: at sign-up (before the event), the day after the event, and a few weeks after the event.
We first took a quick look into some of the general data we have.
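These figures are simply the dimensions of the cleaned data frame; for example:
nrow(sdd)  # number of rows
ncol(sdd)  # number of variables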
The data set contains 6816 rows and 201 variables.
One exploration that showed a relationship involved the position of a date within an event: the further into the event a date takes place, the lower the chance of a match, until the last four dates, where the chance of a match rises again, perhaps because participants become anxious to secure a date.
We also looked at the number of dates there were per event. As you can see, waves 6-9 are excluded because of their different data structure.
Next we looked at the demographics of the people in our data set, specifically age, gender, and race, to give a general idea of who the participants are.
We found that the gender distribution was almost perfectly balanced: 49.9% male and 50.1% female (0.4993 / 0.5007). We also took a look at the average age and its distribution; the average age is 26.3.
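A minimal sketch of how these summaries can be computed, assuming the data set's coding of gender (0 = female, 1 = male):
sdd %>%
  distinct(iid, gender, age) %>%  # one row per participant
  summarise(prop_male   = mean(gender == 1, na.rm = TRUE),
            prop_female = mean(gender == 0, na.rm = TRUE),
            mean_age    = mean(age, na.rm = TRUE))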
And our age distribution:
Out of curiosity we also looked at the racial distribution of the data set. Later we found that the variable race had a negligible impact on the outcome of a date.
We also wanted to know more about the background of the people in our data set, so we looked at their field of study and their interests.
We also wanted to know what the people in our data set are interested in, because we suspected that shared interests could be an important variable.
## Average interest ratings:
##   sports:   6.41   tvsports: 4.54   exercise: 6.16   dining:  7.81
##   museums:  6.95   art:      6.71   hiking:   5.70   gaming:  3.85
##   clubbing: 5.70   tv:       5.29   theater:  6.80   movies:  7.94
##   concerts: 6.91   shopping: 5.65   yoga:     4.35
We wanted to see what our candidates said they were looking for, what they thought others were looking for, and how they rated the people they actually chose. This lets us show that what people think they want differs from what they actually go for.
For the regression analysis we first chose a derived variable, 'positive responses' (the percentage of partners who said yes to seeing the participant again after the four-minute date), as the dependent variable, since this seems a good indicator of how successful candidates were at dating. To do this we created a variable 'positive_responses' by counting how many times 'dec_o = 1' each 'iid' received and dividing by the total number of dates they went on. This analysis may shed light on the attributes people look for in a dating partner, which could be useful for speed-dating agencies or for research on dating in modern society.
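The derivation code is not shown in the report; a minimal dplyr sketch, assuming the cleaned frame 'sdd' (the frame name 'positive' is illustrative):
positive <- sdd %>%
  group_by(iid) %>%
  summarise(y = sum(dec_o == 1, na.rm = TRUE) / n())  # share of partners who said yes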
We also created a series of variables unique to each 'iid' as the independent variables. The most important of these are their average attribute scores, created by averaging the scores that partners gave each 'iid' on each of seven attributes (for example, each got a score of 0 to 100 based on their attractiveness).
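A minimal sketch of the averaging step, using the partner-side rating columns (attr_o, sinc_o, ...) from the data set; the name 'gsdd' matches the data frame in the model output below, but the construction itself is an assumption:
gsdd <- sdd %>%
  group_by(iid) %>%
  summarise(attr  = mean(attr_o,  na.rm = TRUE),
            sinc  = mean(sinc_o,  na.rm = TRUE),
            intel = mean(intel_o, na.rm = TRUE),
            fun   = mean(fun_o,   na.rm = TRUE),
            amb   = mean(amb_o,   na.rm = TRUE),
            shar  = mean(shar_o,  na.rm = TRUE)) %>%
  left_join(positive, by = "iid")  # attach the dependent variable from above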
All variables were also run through GGally::ggpairs to check for other potential IVs we had overlooked (income was removed to save processing time).
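A sketch of the call, assuming 'gsdd' contains a column named 'income':
library(GGally)
ggpairs(gsdd %>% select(-income))  # income dropped to save processing time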
The next step was to build and experiment with numerous linear regression models, first looking at each independent variable alone and then building up combinations of these variables. We ultimately found six models that showed a significant p-value for the linear regression.
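The summaries below come from fitting linear models with lm(); a sketch of the first and third models, with the formula names (fm1, fm3) and the data frame (gsdd) taken from the output:
fm1 <- y ~ attr
fm3 <- y ~ attr + fun
model_1 <- lm(fm1, data = gsdd)
model_3 <- lm(fm3, data = gsdd)
summary(model_1)
summary(model_3)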
##
## Call:
## lm(formula = fm1, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48459 -0.08640 -0.00720 0.09401 0.48513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.547853 0.035219 -15.56 <2e-16 ***
## attr 0.157051 0.005597 28.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1414 on 447 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.637
## F-statistic: 787.3 on 1 and 447 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = fm3, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38416 -0.09119 -0.00766 0.08506 0.46093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.706788 0.040965 -17.253 < 2e-16 ***
## attr 0.122397 0.007393 16.555 < 2e-16 ***
## fun 0.058245 0.008601 6.772 4.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1348 on 446 degrees of freedom
## Multiple R-squared: 0.6716, Adjusted R-squared: 0.6701
## F-statistic: 456.1 on 2 and 446 DF, p-value: < 2.2e-16
Next we ran ANOVA tests between our nested models to see whether they were significantly different, which they appeared to be.
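Each comparison uses anova() on a pair of nested fits; a sketch for the first comparison, assuming the model objects above (model_2 is an assumed name for the attr + shar model):
model_2 <- lm(y ~ attr + shar, data = gsdd)
anova(model_1, model_2)  # F-test: does adding shar significantly reduce the RSS?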
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 446 8.1807 1 0.76139 41.51 3.045e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + fun
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 446 8.1085 1 0.83367 45.855 4.029e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + fun + shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 445 7.9263 2 1.0159 28.517 2.224e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: y ~ attr + fun + shar
## Model 2: y ~ attr * fun * shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 445 7.9263
## 2 441 7.4838 4 0.44242 6.5176 4.207e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: y ~ attr * fun * shar
## Model 2: y ~ attr + fun:shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 441 7.4838
## 2 446 7.8499 -5 -0.36602 4.3137 0.0007659 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To validate our models we used 10-fold cross-validation.
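The cross-validation helper itself is not shown in the report; below is a minimal sketch of how a 10-fold R-squared could be computed, assuming 'gsdd' holds the per-participant data with the response stored as 'y' and no missing values:
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(gsdd)))  # random fold assignment
cv_r2 <- function(formula, data, folds) {
  preds <- rep(NA_real_, nrow(data))
  for (k in 1:10) {
    fit <- lm(formula, data = data[folds != k, ])          # train on nine folds
    preds[folds == k] <- predict(fit, data[folds == k, ])  # predict the held-out fold
  }
  1 - sum((data$y - preds)^2) / sum((data$y - mean(data$y))^2)
}
cv_r2(y ~ attr, gsdd, folds)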
## 10-fold R-squared: 0.621652   formula: y ~ attr
## 10-fold R-squared: 0.6512411  formula: y ~ attr + shar
## 10-fold R-squared: 0.6514702  formula: y ~ attr + fun
## 10-fold R-squared: 0.6577879  formula: y ~ attr + fun + shar
## 10-fold R-squared: 0.6697246  formula: y ~ attr * fun * shar
## 10-fold R-squared: 0.6624008  formula: y ~ attr + fun:shar
We also created a regression tree to verify the variable importance:
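The tree-fitting code is not shown; one way to build such a tree, assuming the rpart package and the same 'gsdd' frame:
library(rpart)
tree <- rpart(y ~ attr + sinc + intel + fun + amb + shar, data = gsdd)
tree$variable.importance  # predictors ranked by their contribution to the splits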
After looking at our validation we decided to go for model 6, for two reasons. Although it does not have the highest R-squared, we feel it is the most explanatory model: model 5, despite its higher R-squared, contains too many interaction terms. We also consider the R-squared of model 6 to be sufficiently high.
Checking our first assumption, "a linear relationship between the IVs and the DV exists":
Whilst there does seem to be some kind of limit that leaves no residuals in the bottom-left corner of the graph (i.e. the residuals are not totally randomly distributed), overall there is no clear relationship between the residuals and the predicted values (e.g. a linear, exponential, or sinusoidal pattern), which suggests that the IVs are, in general, linearly related to the DV.
Next we checked if our residuals were distributed normally.
Apart from a few points at the start that fall below the theoretical line and a few at the end that fall above it, the residuals follow the theoretical quantiles closely, suggesting they are approximately normally distributed.
Whilst this assumption is clearly violated to some extent (no two lines can easily capture all the points), this is mostly due to there being no values in the bottom right of the graph. The assumption of homoscedasticity therefore does not fully hold, but the violation is not severe enough to undermine the entire linear regression model.
We also decided to test homoscedasticity using functions from the 'car' package to double-check these results.
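A minimal sketch, assuming the chosen model object is named model_6:
library(car)
spreadLevelPlot(model_6)  # plots spread vs. level and suggests a power transformation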
##
## Suggested power transformation: 0.9310697
As the spread-level plot above shows, the red line is not horizontal (a perfectly horizontal line would fully justify the homoscedasticity assumption), though its slope is quite shallow, suggesting the issue is not too serious. The suggested power transformation of 0.93 is also close to 1, indicating only a mild departure.
After this we ran an ncvTest on model 6
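Assuming the model object is named model_6, this is a one-liner from car:
ncvTest(model_6)  # non-constant variance score test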
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 6.183993 Df = 1 p = 0.0128911
The ncvTest also shows a fairly low p-value (p = 0.013), suggesting the homoscedasticity assumption has been violated, though only marginally. Whilst we must be clear that the current linear regression model violates assumption 3, we chose to continue with the results until the model can be improved.
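Next we tested the independence of the residuals with a Durbin-Watson test; a sketch, again assuming model_6:
durbinWatsonTest(model_6)  # tests for autocorrelation in the residuals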
## lag Autocorrelation D-W Statistic p-value
## 1 0.1196148 1.755918 0.004
## Alternative hypothesis: rho != 0
Unfortunately the linear regression model does not pass the test of independence (with a D-W statistic of 1.76 and a p-value < 0.05), meaning we cannot treat the residuals as independent. One reason for this could be that the data set is biased, as all subjects were students at Columbia University rather than a random sample.
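We then checked for multicollinearity using variance inflation factors; a sketch, model_6 assumed:
vif(model_6)  # values greater than 5 would indicate problematic multicollinearity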
## attr fun:shar
## 2.059498 2.059498
As neither VIF statistic is greater than 5, there appears to be no strong multicollinearity between the predictors, which justifies keeping both terms in the model.
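We also tested for outliers; a sketch using car, model_6 assumed:
outlierTest(model_6)  # Bonferroni-adjusted test of the largest studentized residual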
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 140 3.465379 0.00058084 0.2608
The largest studentized residual (3.47) has a Bonferroni-adjusted p-value well above 0.05, so no observation qualifies as a significant outlier. There is thus no good reason to consider removing any of the observations.
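Finally we inspected influential observations; one way to produce such a plot with car (model_6 assumed):
influencePlot(model_6)  # bubble size proportional to Cook's distance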
As you can see, there are some points of influence, but we chose to keep them, as the model is robust enough without removing them.
In conclusion, we created several linear regression models to explain which attribute factors affect how many positive responses candidates received after going on a date. We selected the linear regression model that had one of the highest R-squared values (after K-fold cross-validation) but that was also explainable and helpful for understanding the problem (i.e. with a minimum of interaction effects). The final R-squared for our model (model 6) is 0.66; it includes attractiveness ('attr', averaged from partners' 'attr_o' ratings) and the interaction between how fun the candidate was rated ('fun') and the candidate's and partner's shared interests ('shar'). Overall the model shows that attractiveness is the most important variable for explaining positive responses, though the fun:shar interaction is also significant. These insights could be used to build understanding of how to pair people at a speed-dating agency, or for research into dating in modern society.
Stepwise regression is widely criticized by statisticians (see the criticism below). Nevertheless, we took a quick look at a stepwise regression to see whether it would arrive at a model similar to ours.
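The search below is a forward selection starting from the intercept-only model; a sketch of the call, with the data frame name 'ngsdd' and the candidate variables taken from the output:
null_model <- lm(y ~ 1, data = ngsdd)
step(null_model,
     scope = ~ attr + sinc + intel + fun + amb + shar + male + age + goal +
               date + gout + inc + cint + exph + iid,
     direction = "forward")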
## Start: AIC=-644.08
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + attr 1 7.3489 4.7415 -849.88
## + fun 1 5.2604 6.8300 -768.86
## + shar 1 4.4567 7.6337 -744.16
## + amb 1 1.6008 10.4896 -673.61
## + intel 1 1.5756 10.5147 -673.08
## + sinc 1 1.5232 10.5672 -671.97
## + male 1 0.4530 11.6374 -650.56
## + date 1 0.2947 11.7957 -647.56
## + gout 1 0.2065 11.8839 -645.90
## + age 1 0.1770 11.9134 -645.35
## + goal 1 0.1152 11.9752 -644.20
## <none> 12.0904 -644.08
## + inc 1 0.0176 12.0728 -642.40
## + cint 1 0.0052 12.0852 -642.17
## + iid 1 0.0013 12.0891 -642.10
## + exph 1 0.0005 12.0899 -642.09
##
## Step: AIC=-849.88
## y ~ attr
##
## Df Sum of Sq RSS AIC
## + fun 1 0.36839 4.3731 -865.84
## + shar 1 0.20926 4.5323 -857.90
## + sinc 1 0.08534 4.6562 -851.92
## + intel 1 0.06992 4.6716 -851.18
## <none> 4.7415 -849.88
## + amb 1 0.04039 4.7011 -849.78
## + inc 1 0.03837 4.7031 -849.69
## + age 1 0.03397 4.7075 -849.48
## + goal 1 0.02046 4.7211 -848.84
## + gout 1 0.02028 4.7212 -848.83
## + exph 1 0.01784 4.7237 -848.72
## + male 1 0.01169 4.7298 -848.43
## + date 1 0.00633 4.7352 -848.18
## + iid 1 0.00273 4.7388 -848.01
## + cint 1 0.00088 4.7406 -847.92
##
## Step: AIC=-865.84
## y ~ attr + fun
##
## Df Sum of Sq RSS AIC
## + goal 1 0.042571 4.3306 -866.01
## <none> 4.3731 -865.84
## + age 1 0.029251 4.3439 -865.33
## + inc 1 0.028543 4.3446 -865.29
## + gout 1 0.021939 4.3512 -864.95
## + intel 1 0.020323 4.3528 -864.87
## + shar 1 0.017080 4.3560 -864.71
## + sinc 1 0.013913 4.3592 -864.55
## + exph 1 0.013786 4.3593 -864.54
## + male 1 0.010973 4.3622 -864.40
## + date 1 0.004039 4.3691 -864.04
## + amb 1 0.003222 4.3699 -864.00
## + iid 1 0.001729 4.3714 -863.93
## + cint 1 0.000020 4.3731 -863.84
##
## Step: AIC=-866.01
## y ~ attr + fun + goal
##
## Df Sum of Sq RSS AIC
## <none> 4.3306 -866.01
## + inc 1 0.037741 4.2928 -865.95
## + intel 1 0.030046 4.3005 -865.56
## + gout 1 0.023290 4.3073 -865.21
## + sinc 1 0.022663 4.3079 -865.17
## + age 1 0.019077 4.3115 -864.99
## + shar 1 0.016683 4.3139 -864.87
## + exph 1 0.014120 4.3164 -864.74
## + male 1 0.008219 4.3223 -864.43
## + date 1 0.005294 4.3253 -864.28
## + amb 1 0.004235 4.3263 -864.23
## + iid 1 0.001596 4.3290 -864.09
## + cint 1 0.000448 4.3301 -864.03
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## y ~ 1
##
## Final Model:
## y ~ attr + fun + goal
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 221 12.090388 -644.0792
## 2 + attr 1 7.34887647 220 4.741512 -849.8833
## 3 + fun 1 0.36838754 219 4.373124 -865.8383
## 4 + goal 1 0.04257121 218 4.330553 -866.0100
##
## Call:
## lm(formula = y ~ attr + fun + goal, data = ngsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39987 -0.09493 -0.00725 0.07484 0.44372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.680014 0.064670 -10.515 < 2e-16 ***
## attr 0.123588 0.011383 10.857 < 2e-16 ***
## fun 0.056796 0.012810 4.434 1.47e-05 ***
## goal -0.009072 0.006197 -1.464 0.145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1409 on 218 degrees of freedom
## Multiple R-squared: 0.6418, Adjusted R-squared: 0.6369
## F-statistic: 130.2 on 3 and 218 DF, p-value: < 2.2e-16