1. Introduction

1.1. Background

This analysis explores speed dating. It aims to provide a basic understanding of people's behavior at such events, including which attributes matter most for successful dating. The data set was gathered from speed-dating events held between 2002 and 2004. The research was carried out by Columbia Business School professors Ray Fisman and Sheena Iyengar as the basis for their paper "Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment". While the data set itself is quite large (8378 records), it is worth mentioning that it is likely biased: all participants were Columbia Business School students, so conclusions from this project may not generalize.

1.2. Data description

The data set can be obtained from the following link:

https://www.kaggle.com/annavictoria/speed-dating-experiment/downloads/Speed%20Dating%20Data.csv.zip

Metadata can be found here:

https://www.kaggle.com/annavictoria/speed-dating-experiment/downloads/Speed%20Dating%20Data%20Key.doc

The data set includes ratings of speed-dating partners based on six attributes:

  • Attractiveness (attr_o)
  • Sincerity (sinc_o)
  • Intelligence (intel_o)
  • Fun (fun_o)
  • Ambition (amb_o)
  • Shared Interests (shar_o)

Each speed date lasted four minutes. Participants met roughly 10-20 partners per event, and 21 events were recorded in total.

In addition, a questionnaire provided key personal data: demographics, dating habits, self-perception on the six attributes above, beliefs about what others find valuable in a mate, lifestyle information, etc.

Each row in the data set corresponds to a single four-minute date in which participant 'iid' met the partner identified by the corresponding 'pid'.

Packages used (loaded as needed; a setup sketch follows this list):

  • dplyr
  • ggplot2
  • rpart
  • rattle
  • rpart.plot
  • RColorBrewer
  • foreign
  • broom
  • corrgram
  • effects
  • car
  • caret
  • GGally
  • gridExtra
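
As a starting point, a minimal setup sketch. The file name matches the Kaggle download above; the encoding argument is an assumption (the CSV contains some non-UTF-8 text fields), so adjust if needed:

library(dplyr)
library(ggplot2)
library(broom)

# Read the raw data; encoding is a guess for the non-UTF-8 text columns
sdd <- read.csv("Speed Dating Data.csv",
                fileEncoding = "latin1", stringsAsFactors = FALSE)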

2. Data Cleaning

2.1. Excluding events 6-9 due to different methodology

Because events 6-9 used a different methodology for rating attributes (see metadata), they are excluded from our data set. While cleaning, we also convert the comma-formatted income and tuition columns to numeric:

# Keep only waves 1-5 and 10-21 (waves 6-9 used a different rating scale)
sdd <- sdd %>%
  filter(wave > 9 | wave < 6)

# income and tuition are read in as strings with thousands separators;
# strip the commas and convert to numeric
sdd <- sdd %>%
  mutate(income  = as.numeric(gsub(",", "", income)),
         tuition = as.numeric(gsub(",", "", tuition)))

2.2. Normalizing attributes

In rating attributes, participants were asked to distribute 100 points across six variables. Sometimes the numbers did not add up to 100. To correct for these mistakes, the values were normalized for the variables attr1_1 to shar1_1. These variables record which attributes the speed-dating participant finds important in a partner.

# Preview the normalization for attr1_1: rescale the allocation so that each
# respondent's six scores sum to 100 (tidy() just shows the result as a data frame)
sdd %>%
  mutate(sum1_1 = attr1_1 + sinc1_1 + intel1_1 + fun1_1 + amb1_1 + shar1_1) %>%
  do(tidy(.$attr1_1 / (.$sum1_1 / 100)))

Next, we used mutate to add a normalized version of each of the six variables:

sdd <- sdd %>%
  mutate(sum1_1 = attr1_1 + sinc1_1 + intel1_1 + fun1_1 + amb1_1 + shar1_1) %>%
  mutate(attr1_1n  = attr1_1  / (sum1_1 / 100),   # rescale so the six scores sum to 100
         sinc1_1n  = sinc1_1  / (sum1_1 / 100),
         intel1_1n = intel1_1 / (sum1_1 / 100),
         fun1_1n   = fun1_1   / (sum1_1 / 100),
         amb1_1n   = amb1_1   / (sum1_1 / 100),
         shar1_1n  = shar1_1  / (sum1_1 / 100))
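
As a quick sanity check (a sketch building on the chain above), the normalized scores should now sum to 100 for every respondent with complete data:

sdd %>%
  mutate(check1_1 = attr1_1n + sinc1_1n + intel1_1n + fun1_1n + amb1_1n + shar1_1n) %>%
  summarise(min_sum = min(check1_1, na.rm = TRUE),
            max_sum = max(check1_1, na.rm = TRUE))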

3. Descriptive Analysis

One problem with the data set is bias: it was collected at Columbia Business School, so the conclusions of the analysis are based on this specific sample and are not universal. Another problem is the large number of missing values, which makes the analysis more difficult and the results less accurate.

About the survey data

The surveys in which participants reflected on themselves and their partners were administered at three points in time:

  1. Prior to the date.
  2. A follow-up survey taken the day after the date.
  3. A second follow-up survey taken 3-4 weeks after the matches were sent out.

3.1. General Data

We first took a quick look at some of the general data.

Data set numbers

After excluding waves 6-9, the data set has 6816 rows and 201 variables.
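
These figures can be read straight off the cleaned data frame:

dim(sdd)   # rows and columns: 6816 x 201 after the cleaning above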

Match rate by date number

One relationship we found was with the date's position within an event: the further into the event, the lower the chance of a match becomes, until the last four dates, where the match rate rises again, perhaps because people become anxious to secure their dates.
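A sketch of how this relationship can be visualized, assuming 'order' records the position of the date within the event and 'match' flags mutual yeses (both columns are documented in the metadata key):

sdd %>%
  group_by(order) %>%
  summarise(match_rate = mean(match, na.rm = TRUE)) %>%
  ggplot(aes(x = order, y = match_rate)) +
  geom_col() +
  labs(x = "date number within the event", y = "match rate")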

Number of dates per event

We looked at the number of dates there were per event. As you can see, waves 6-9 were taken out because of their different data structure.
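
The counts behind this view are one line of dplyr:

sdd %>% count(wave)   # number of recorded dates per event (wave)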

3.2. Demographic Data

Next we looked at the demographics of the people in our data set. We decided to examine age, gender, and race to give a general idea of what kind of people are in the sample.

We found that the male/female split was essentially even: 0.4992664 / 0.5007336.
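
A sketch of the computation, assuming gender is coded 0/1 as documented in the metadata key:

sdd %>%
  distinct(iid, gender) %>%   # one row per participant
  count(gender) %>%
  mutate(prop = n / sum(n))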

Age

We took a look at the average age and the distribution of age, and found an average age of about 26.28 years.

And our age distribution:
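
Both numbers come from the participant-level 'age' column; a sketch:

people <- sdd %>% distinct(iid, .keep_all = TRUE)   # one row per participant

mean(people$age, na.rm = TRUE)   # average age

ggplot(people, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  labs(x = "age", y = "number of participants")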

Race

Out of curiosity we took a look at the racial distribution of the data set. Later we found that the race variable had a negligible impact on the outcome of a date.
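
A sketch of the tabulation, reusing the participant-level people frame from the age sketch:

people %>%
  count(race) %>%     # the numeric race codes are mapped to labels in the metadata key
  mutate(prop = n / sum(n))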

3.3. Background

We also wanted to know more about the backgrounds of the people in our data set, so we looked at field of study and interests.

Field of study

Interests

We also wanted to know what the people in our data set are interested in, because we suspected that shared interests could be an important variable. The mean interest ratings were:

##     sports tvsports exercise   dining  museums     art   hiking   gaming
## 1 6.406408 4.540492 6.159745 7.808514 6.954761 6.71062 5.704835 3.848858
##   clubbing       tv  theater   movies concerts shopping     yoga
## 1 5.695194 5.294571 6.801988 7.940967 6.909374 5.654257 4.349748
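
These means can be reproduced along these lines (a sketch, assuming the interest columns sit next to each other from sports to yoga, and again reusing the people frame):

people %>%
  summarise(across(sports:yoga, ~ mean(.x, na.rm = TRUE)))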

3.4. Comparing importance of attributes according to surveys

We wanted to see what our candidates said they were looking for and what they thought others were looking for, and then how they actually rated the people they chose. The aim was to show that what people think they want differs from what they actually go for.
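
One way to put stated and revealed preferences side by side (a sketch; 'dec' is the participant's own yes/no decision and attr to shar are the ratings they gave that partner, per the metadata key):

# Stated importance: the normalized 100-point allocations from the survey
people %>%
  summarise(across(c(attr1_1n, sinc1_1n, intel1_1n, fun1_1n, amb1_1n, shar1_1n),
                   ~ mean(.x, na.rm = TRUE)))

# Revealed preferences: mean ratings given to partners they said yes to
sdd %>%
  filter(dec == 1) %>%
  summarise(across(c(attr, sinc, intel, fun, amb, shar), ~ mean(.x, na.rm = TRUE)))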

4. Predictive Analysis

For the regression analysis we first derived a dependent variable, 'positive responses': the percentage of partners who said yes to seeing the participant again after their four-minute date. This seems a good indicator of how successful candidates were at dating. We created the variable 'positive_responses' by counting how many times each 'iid' received 'dec_o = 1' and dividing by the total number of dates they went on. This analysis may shed some light on which attributes people look for in a dating partner, which could be useful for speed-dating agencies or for research on dating in modern society.

We also created a series of variables unique to each 'iid' as the independent variables. The most important of these are average attribute scores, created by averaging the scores that partners gave each 'iid' on each of the six attributes (for example, every participant received an attractiveness rating on each date, and these ratings were averaged).
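
A sketch of how such a participant-level data frame can be built; the name gsdd matches the lm() calls below, while the model object names (m1 to m6 later) and the exact set of averaged columns are our assumptions:

gsdd <- sdd %>%
  group_by(iid) %>%
  summarise(y    = mean(dec_o,  na.rm = TRUE),   # share of partners who said yes
            attr = mean(attr_o, na.rm = TRUE),   # mean attractiveness rating received
            fun  = mean(fun_o,  na.rm = TRUE),
            shar = mean(shar_o, na.rm = TRUE))
# sinc, intel, amb and the other per-iid variables are built the same way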

All variables were also run through GGally::ggpairs to see if there were other potential IVs we had overlooked (income was removed to save processing time).
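
A sketch of the call, assuming income was also carried into gsdd:

GGally::ggpairs(gsdd %>% select(-income))   # income dropped to save processing time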

4.1. Building the models

The next step was to build and experiment with numerous linear regression models, first looking at each independent variable alone and then at combinations of these variables. We ultimately found six models that showed significant p-values.
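
The six formulas are named fm1 to fm6, as in the summaries below; a sketch of the fits (the object names m1 to m6 are our assumption):

fm1 <- y ~ attr
fm2 <- y ~ attr + shar
fm3 <- y ~ attr + fun
fm4 <- y ~ attr + fun + shar
fm5 <- y ~ attr * fun * shar   # main effects plus all interactions
fm6 <- y ~ attr + fun:shar     # attractiveness plus a fun-by-shared-interests term

m1 <- lm(fm1, data = gsdd); m2 <- lm(fm2, data = gsdd)
m3 <- lm(fm3, data = gsdd); m4 <- lm(fm4, data = gsdd)
m5 <- lm(fm5, data = gsdd); m6 <- lm(fm6, data = gsdd)
summary(m1)   # and likewise for m2 to m6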

Model 1: y ~ Attractiveness

## 
## Call:
## lm(formula = fm1, data = gsdd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48459 -0.08640 -0.00720  0.09401  0.48513 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.547853   0.035219  -15.56   <2e-16 ***
## attr         0.157051   0.005597   28.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1414 on 447 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.637 
## F-statistic: 787.3 on 1 and 447 DF,  p-value: < 2.2e-16

Model 2: y ~ Attractiveness + Shared interests

## 
## Call:
## lm(formula = fm2, data = gsdd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46077 -0.08450 -0.00428  0.08880  0.45175 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.666995   0.038461 -17.342  < 2e-16 ***
## attr         0.126427   0.007164  17.649  < 2e-16 ***
## shar         0.056663   0.008795   6.443 3.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1354 on 446 degrees of freedom
## Multiple R-squared:  0.6687, Adjusted R-squared:  0.6672 
## F-statistic: 450.1 on 2 and 446 DF,  p-value: < 2.2e-16

Model 3: y ~ Attractiveness + Fun

## 
## Call:
## lm(formula = fm3, data = gsdd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38416 -0.09119 -0.00766  0.08506  0.46093 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.706788   0.040965 -17.253  < 2e-16 ***
## attr         0.122397   0.007393  16.555  < 2e-16 ***
## fun          0.058245   0.008601   6.772 4.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1348 on 446 degrees of freedom
## Multiple R-squared:  0.6716, Adjusted R-squared:  0.6701 
## F-statistic: 456.1 on 2 and 446 DF,  p-value: < 2.2e-16

Model 4: y ~ Attractiveness + Shared interests + Fun

## 
## Call:
## lm(formula = fm4, data = gsdd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39479 -0.08417 -0.00393  0.09095  0.45076 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.726006   0.040991 -17.712  < 2e-16 ***
## attr         0.115435   0.007635  15.120  < 2e-16 ***
## fun          0.039242   0.010382   3.780 0.000178 ***
## shar         0.033801   0.010568   3.198 0.001481 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1335 on 445 degrees of freedom
## Multiple R-squared:  0.679,  Adjusted R-squared:  0.6768 
## F-statistic: 313.8 on 3 and 445 DF,  p-value: < 2.2e-16

Model 5: y ~ Attractiveness * Shared interests * Fun

## 
## Call:
## lm(formula = fm5, data = gsdd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40341 -0.07962 -0.00837  0.08123  0.46743 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.660435   0.646451   2.569 0.010541 *  
## attr          -0.407726   0.137494  -2.965 0.003187 ** 
## fun           -0.298734   0.112897  -2.646 0.008434 ** 
## shar          -0.289769   0.133071  -2.178 0.029969 *  
## attr:fun       0.073937   0.021306   3.470 0.000571 ***
## attr:shar      0.075859   0.026118   2.904 0.003864 ** 
## fun:shar       0.042748   0.020802   2.055 0.040469 *  
## attr:fun:shar -0.010243   0.003736  -2.742 0.006363 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1303 on 441 degrees of freedom
## Multiple R-squared:  0.6969, Adjusted R-squared:  0.6921 
## F-statistic: 144.9 on 7 and 441 DF,  p-value: < 2.2e-16

Model 6: y ~ Attractiveness + Shared interests:Fun

## 
## Call:
## lm(formula = fm6, data = gsdd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39249 -0.08549 -0.00247  0.08594  0.45351 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5190200  0.0332373 -15.616  < 2e-16 ***
## attr         0.1144801  0.0075343  15.195  < 2e-16 ***
## fun:shar     0.0065750  0.0008346   7.878 2.57e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1327 on 446 degrees of freedom
## Multiple R-squared:  0.6821, Adjusted R-squared:  0.6807 
## F-statistic: 478.5 on 2 and 446 DF,  p-value: < 2.2e-16

4.2. ANOVA testing of the models

Next we ran ANOVA tests between our nested models to see whether they differed significantly, which they did.
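
The comparisons below come from anova() on pairs of nested fits; a sketch using the model objects defined earlier:

anova(m1, m2)   # does adding shar improve on attr alone?
anova(m1, m3)   # does adding fun improve on attr alone?
anova(m1, m4)   # does adding both improve on attr alone?
anova(m4, m5)   # do the interaction terms add anything?
anova(m5, m6)   # is the reduced interaction model significantly different?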

ANOVA model 1, model 2

## Analysis of Variance Table
## 
## Model 1: y ~ attr
## Model 2: y ~ attr + shar
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    447 8.9421                                 
## 2    446 8.1807  1   0.76139 41.51 3.045e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA model 1, model 3

## Analysis of Variance Table
## 
## Model 1: y ~ attr
## Model 2: y ~ attr + fun
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    447 8.9421                                  
## 2    446 8.1085  1   0.83367 45.855 4.029e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA model 1, model 4

## Analysis of Variance Table
## 
## Model 1: y ~ attr
## Model 2: y ~ attr + fun + shar
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    447 8.9421                                  
## 2    445 7.9263  2    1.0159 28.517 2.224e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA model 4, model 5

## Analysis of Variance Table
## 
## Model 1: y ~ attr + fun + shar
## Model 2: y ~ attr * fun * shar
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    445 7.9263                                  
## 2    441 7.4838  4   0.44242 6.5176 4.207e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA model 5, model 6

## Analysis of Variance Table
## 
## Model 1: y ~ attr * fun * shar
## Model 2: y ~ attr + fun:shar
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    441 7.4838                                  
## 2    446 7.8499 -5  -0.36602 4.3137 0.0007659 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.3. Model validation: k-fold cross-validation

To validate our models we used 10-fold cross-validation.
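
The printout below comes from a small custom loop; an equivalent check can be run with caret (which is in the package list), a sketch:

library(caret)

set.seed(1)   # arbitrary seed, for reproducibility
ctrl <- trainControl(method = "cv", number = 10)
cv_fit <- train(fm6, data = na.omit(gsdd), method = "lm", trControl = ctrl)
cv_fit$results$Rsquared   # cross-validated R-squared for model 6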

## 10 -folded r square:  0.621652  formula:  y ~ attr 
## 10 -folded r square:  0.6512411  formula:  y ~ attr + shar 
## 10 -folded r square:  0.6514702  formula:  y ~ attr + fun 
## 10 -folded r square:  0.6577879  formula:  y ~ attr + fun + shar 
## 10 -folded r square:  0.6697246  formula:  y ~ attr*fun*shar 
## 10 -folded r square:  0.6624008  formula:  y ~ attr + fun:shar

We also created a regression tree to check variable importance:
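
A sketch of such a tree, using rpart and rpart.plot from the package list (the choice of predictors mirrors the models above):

library(rpart)
library(rpart.plot)

tree <- rpart(y ~ attr + fun + shar, data = gsdd)
rpart.plot(tree)             # draw the fitted tree
tree$variable.importance     # importance score per splitting variable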

After looking at our validation results we decided to go with model 6, for two reasons. First, although it does not have the highest R-squared, we feel it is the most explanatory: model 5, which scores slightly higher, contains too many interaction terms. Second, the R-squared of model 6 is still sufficiently high.

5. Diagnostics

5.1. Linear relation

Checking our first assumption, "a linear relationship between the IVs and the DV exists":
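
The residuals-versus-fitted plot discussed below can be drawn directly from the model object; a sketch, using the m6 object from the model-fitting sketch:

plot(m6, which = 1)   # residuals versus fitted values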

Whilst there does seem to be some kind of limit resulting in no residuals in the bottom-left corner of the graph (i.e. the residuals are not totally randomly distributed), overall there is no clear relationship between the residuals and the predicted values (e.g. a linear, exponential, or sinusoidal pattern), which suggests that the IVs are generally linearly related to the DV.

5.2. Normal distribution

Next we checked whether our residuals were normally distributed.
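
A sketch of the Q-Q plot:

plot(m6, which = 2)   # normal Q-Q plot of the standardized residuals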

Apart from a few points at the lower end that fall below the theoretical line and a few at the upper end that fall above it, the residuals follow a normal distribution very closely.

5.3. Homoscedasticity of residuals

Whilst this assumption is clearly violated to some extent (no two lines can easily capture all the points), this is mostly due to there being no values in the bottom right of the graph. Although this means the homoscedasticity assumption does not fully hold, the violation is not severe enough to undermine the entire linear regression model.

We also decided to test homoscedasticity using functions from the 'car' package to double-check the results.
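
The spread-level plot and the suggested power transformation below come from car::spreadLevelPlot; a sketch of the call:

library(car)
spreadLevelPlot(m6)   # also prints a suggested power transformation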

## 
## Suggested power transformation:  0.9310697

As the spread-level plot above shows, the red line is not horizontal (which would mean the homoscedasticity assumption is fully justified), though the slope is quite shallow, suggesting the issue is not too serious.

After this we ran an ncvTest (also from 'car') on model 6:

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 6.183993    Df = 1     p = 0.0128911

The NCV test also shows a fairly low p-value, suggesting the homoscedasticity assumption has been violated in this case, though only marginally. While the current linear regression model therefore does violate assumption 3, we chose to continue with the results until the model can be improved.

5.4. Independence of errors
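
The table below comes from car's Durbin-Watson test; a sketch of the call:

durbinWatsonTest(m6)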

##  lag Autocorrelation D-W Statistic p-value
##    1       0.1196148      1.755918   0.004
##  Alternative hypothesis: rho != 0

Unfortunately the linear regression model does not pass the test of independence (with a statistic of 1.76 and a p-value < 0.05), meaning the residuals are autocorrelated and the observations cannot all be treated as independent. One reason for this could be that the sample is biased: all subjects were Columbia students rather than a random sample.

5.5. Multicollinearity
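
The statistics below are variance inflation factors from car::vif; a sketch of the call:

vif(m6)   # values above ~5 would signal problematic collinearity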

##     attr fun:shar 
## 2.059498 2.059498

As none of the variance inflation factors are > 5, there appears to be no strong multicollinearity between the predictors, justifying keeping all of them in the model.

5.6. Outliers
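
The result below is car's Bonferroni outlier test on the studentized residuals; a sketch of the call:

outlierTest(m6)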

## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferonni p
## 140 3.465379         0.00058084       0.2608

The test finds no studentized residual with a Bonferroni-adjusted p < 0.05 (the largest, about 3.47, is not significant after adjustment). There is thus no good reason to consider removing any of the observations.

5.7. Influential points
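
One way to visualize influential points is car's influence plot; a sketch (the original figure may have been produced differently):

influencePlot(m6)   # circle area scales with Cook's distance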

As you can see, we have some points of influence, but we chose to disregard them, as the model is robust enough without removing them.

6. Conclusion

In conclusion, we created several linear regression models to try to explain the attribute factors that affect how many positive responses candidates got after going on a date. We selected the linear regression model that had one of the highest R-squared values (after k-fold cross-validation) while remaining explainable and helpful for understanding the problem (i.e. with a minimum of interaction effects). The final R-squared for our model (model 6) is 0.66; its explanatory variables are attractiveness (attr) and the interaction between how fun the candidate was rated (fun) and the candidate's and partner's shared interests (shar). Overall the model shows that attractiveness is the most important variable for explaining positive responses, though the interaction of fun and shared interests is also significant. These insights could be used either to build understanding of how to pair people at a speed-dating agency, or for research into dating in modern society.

7. Appendix

7.1. Quick look at stepwise regression

Stepwise regression is heavily criticized by statisticians (read the criticism below). Still, we wanted to take a quick look at whether a stepwise procedure would produce a model similar to ours.
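
The trace below is consistent with forward selection from an intercept-only model; a sketch of how it can be produced, assuming ngsdd is a complete-cases version of gsdd (its name matches the lm() call in the output):

# Forward stepwise selection, starting from the intercept-only model
null_model <- lm(y ~ 1, data = ngsdd)
step_model <- step(null_model,
                   scope = y ~ attr + sinc + intel + fun + amb + shar +
                           male + age + goal + date + gout + inc + cint + iid + exph,
                   direction = "forward")
summary(step_model)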

## Start:  AIC=-644.08
## y ~ 1
## 
##         Df Sum of Sq     RSS     AIC
## + attr   1    7.3489  4.7415 -849.88
## + fun    1    5.2604  6.8300 -768.86
## + shar   1    4.4567  7.6337 -744.16
## + amb    1    1.6008 10.4896 -673.61
## + intel  1    1.5756 10.5147 -673.08
## + sinc   1    1.5232 10.5672 -671.97
## + male   1    0.4530 11.6374 -650.56
## + date   1    0.2947 11.7957 -647.56
## + gout   1    0.2065 11.8839 -645.90
## + age    1    0.1770 11.9134 -645.35
## + goal   1    0.1152 11.9752 -644.20
## <none>               12.0904 -644.08
## + inc    1    0.0176 12.0728 -642.40
## + cint   1    0.0052 12.0852 -642.17
## + iid    1    0.0013 12.0891 -642.10
## + exph   1    0.0005 12.0899 -642.09
## 
## Step:  AIC=-849.88
## y ~ attr
## 
##         Df Sum of Sq    RSS     AIC
## + fun    1   0.36839 4.3731 -865.84
## + shar   1   0.20926 4.5323 -857.90
## + sinc   1   0.08534 4.6562 -851.92
## + intel  1   0.06992 4.6716 -851.18
## <none>               4.7415 -849.88
## + amb    1   0.04039 4.7011 -849.78
## + inc    1   0.03837 4.7031 -849.69
## + age    1   0.03397 4.7075 -849.48
## + goal   1   0.02046 4.7211 -848.84
## + gout   1   0.02028 4.7212 -848.83
## + exph   1   0.01784 4.7237 -848.72
## + male   1   0.01169 4.7298 -848.43
## + date   1   0.00633 4.7352 -848.18
## + iid    1   0.00273 4.7388 -848.01
## + cint   1   0.00088 4.7406 -847.92
## 
## Step:  AIC=-865.84
## y ~ attr + fun
## 
##         Df Sum of Sq    RSS     AIC
## + goal   1  0.042571 4.3306 -866.01
## <none>               4.3731 -865.84
## + age    1  0.029251 4.3439 -865.33
## + inc    1  0.028543 4.3446 -865.29
## + gout   1  0.021939 4.3512 -864.95
## + intel  1  0.020323 4.3528 -864.87
## + shar   1  0.017080 4.3560 -864.71
## + sinc   1  0.013913 4.3592 -864.55
## + exph   1  0.013786 4.3593 -864.54
## + male   1  0.010973 4.3622 -864.40
## + date   1  0.004039 4.3691 -864.04
## + amb    1  0.003222 4.3699 -864.00
## + iid    1  0.001729 4.3714 -863.93
## + cint   1  0.000020 4.3731 -863.84
## 
## Step:  AIC=-866.01
## y ~ attr + fun + goal
## 
##         Df Sum of Sq    RSS     AIC
## <none>               4.3306 -866.01
## + inc    1  0.037741 4.2928 -865.95
## + intel  1  0.030046 4.3005 -865.56
## + gout   1  0.023290 4.3073 -865.21
## + sinc   1  0.022663 4.3079 -865.17
## + age    1  0.019077 4.3115 -864.99
## + shar   1  0.016683 4.3139 -864.87
## + exph   1  0.014120 4.3164 -864.74
## + male   1  0.008219 4.3223 -864.43
## + date   1  0.005294 4.3253 -864.28
## + amb    1  0.004235 4.3263 -864.23
## + iid    1  0.001596 4.3290 -864.09
## + cint   1  0.000448 4.3301 -864.03
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## y ~ 1
## 
## Final Model:
## y ~ attr + fun + goal
## 
## 
##     Step Df   Deviance Resid. Df Resid. Dev       AIC
## 1                            221  12.090388 -644.0792
## 2 + attr  1 7.34887647       220   4.741512 -849.8833
## 3  + fun  1 0.36838754       219   4.373124 -865.8383
## 4 + goal  1 0.04257121       218   4.330553 -866.0100
## 
## Call:
## lm(formula = y ~ attr + fun + goal, data = ngsdd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39987 -0.09493 -0.00725  0.07484  0.44372 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.680014   0.064670 -10.515  < 2e-16 ***
## attr         0.123588   0.011383  10.857  < 2e-16 ***
## fun          0.056796   0.012810   4.434 1.47e-05 ***
## goal        -0.009072   0.006197  -1.464    0.145    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1409 on 218 degrees of freedom
## Multiple R-squared:  0.6418, Adjusted R-squared:  0.6369 
## F-statistic: 130.2 on 3 and 218 DF,  p-value: < 2.2e-16