This analysis explores the topic of speed dating. It aims to provide a basic understanding of people's behaviour at such events, including the attributes most relevant to successful dating. The data set was gathered from speed-dating events held between 2002 and 2004. The research was carried out by Columbia Business School professors Ray Fisman and Sheena Iyengar as the basis for their paper “Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment”. Whilst the data set itself is quite large (8378 records), it is worth mentioning that it has a high likelihood of being biased, as all participants were Columbia Business School students, and thus any conclusions from this project may not necessarily be generalizable.
The data set can be obtained from the following link:
https://www.kaggle.com/annavictoria/speed-dating-experiment/downloads/Speed%20Dating%20Data.csv.zip
Metadata can be found here:
The data set includes ratings of speed-dating partners on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests.
The speed dates lasted four minutes each, and participants met roughly 10-20 partners at a single event. 21 events were recorded in total.
In addition, a questionnaire provided key personal data: demographics, dating habits, self-perception on the six attributes mentioned above, beliefs about what others find valuable in a mate, lifestyle information, etc.
Each row in the data set corresponds to a single four-minute date in which participant ‘iid’ met the partner identified by the corresponding ‘pid’.
Packages used:
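Based on the functions called throughout the analysis, the setup is assumed to be roughly as follows (the CSV file name is taken from the Kaggle download above; caret is only needed for the cross-validation later on):
library(dplyr)      # filter, mutate, group_by, summarise
library(ggplot2)    # qplot/ggplot graphics
library(broom)      # tidy()
library(GGally)     # ggpairs()
library(corrgram)   # corrgram()
library(gridExtra)  # grid.arrange()
library(car)        # spreadLevelPlot(), ncvTest(), durbinWatsonTest(), vif(), outlierTest()
library(rpart)      # regression trees
library(rattle)     # fancyRpartPlot()
library(caret)      # createFolds() for cross-validation
#load the data (assumes the Kaggle CSV has been extracted to the working directory; adjust the path if needed)
sdd <- read.csv("Speed Dating Data.csv")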
Because events 6-9 use a different methodology for rating attributes (see metadata), they are excluded from our data set:
sdd<-sdd %>%
filter(wave>9 | wave <6)
#income and tuition are stored as text with thousands separators; strip the commas and convert to numeric
sdd = sdd %>%
mutate(income=as.numeric(gsub(",","",income))) %>%
mutate(tuition=as.numeric(gsub(",","",tuition)))
When rating attributes, participants were asked to distribute 100 points across six variables. Sometimes the numbers did not add up to 100. To correct for these mistakes, the values were normalized for the variables attr1_1 to shar1_1. These variables capture how important each attribute is in a potential partner, as rated by the speed-dating participant.
sdd %>%
mutate(sum1_1=attr1_1+sinc1_1+intel1_1+fun1_1+amb1_1+shar1_1) %>%
do(tidy(.$attr1_1/(.$sum1_1/100)))
## # A tibble: 6,816 × 1
## x
## <dbl>
## 1 15
## 2 15
## 3 15
## 4 15
## 5 15
## 6 15
## 7 15
## 8 15
## 9 15
## 10 15
## # ... with 6,806 more rows
Next, we used mutate to introduce new, normalized variables:
sdd <- sdd %>%
mutate(sum1_1=attr1_1+sinc1_1+intel1_1+fun1_1+amb1_1+shar1_1) %>%
mutate(attr1_1n=attr1_1/(sum1_1/100)) %>%
mutate(sinc1_1n=sinc1_1/(sum1_1/100)) %>%
mutate(intel1_1n=intel1_1/(sum1_1/100)) %>%
mutate(fun1_1n=fun1_1/(sum1_1/100)) %>%
mutate(amb1_1n=amb1_1/(sum1_1/100)) %>%
mutate(shar1_1n=shar1_1/(sum1_1/100))
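As a quick sanity check (output not shown), the six normalised columns should now sum to roughly 100 for every respondent who answered all six items:
sdd %>%
mutate(total1_1n = attr1_1n + sinc1_1n + intel1_1n + fun1_1n + amb1_1n + shar1_1n) %>%
summarise(min_total = min(total1_1n, na.rm = TRUE), max_total = max(total1_1n, na.rm = TRUE))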
One problem with the data set is that it is biased: it was collected at Columbia Business School, so the conclusions of the analysis apply to this specific sample rather than being universal. Another problem is the large number of missing values, which makes the analysis more difficult and the results less accurate.
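To get a feel for how widespread the missing values are, the number of NAs per column can be inspected (a quick check, output not shown):
#number of missing values per column, largest first (top 10 only)
na_counts <- sort(colSums(is.na(sdd)), decreasing = TRUE)
head(na_counts, 10)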
About the survey data
The surveys, in which participants reflected on themselves and their partners, were administered at three different points in time:
We first took a quick look into some of the general data we have.
#row of dataset
nrow(sdd)
## [1] 6816
#number of variables
ncol(sdd)
## [1] 201
#number of matches and non-matches
number_of_match=sdd%>%group_by(match)%>%dplyr::summarise(number=n())
prop.table(number_of_match$number)
## [1] 0.8350939 0.1649061
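The two unlabelled proportions above correspond to match = 0 and match = 1 respectively; a labelled version of the same summary (sketch):
sdd %>%
dplyr::count(match) %>%
mutate(prop = n / sum(n))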
by_order<-sdd %>%
dplyr::group_by(order) %>%
dplyr::summarize(average=mean(match, na.rm=TRUE))
qplot(data=by_order, x=order, y=average, xlab="Order of Meeting", ylab="Match") + geom_smooth()
qplot(data=sdd, wave, binwidth=1, colour=I("white"), xlab="Event", ylab="Number of Dates", main="Number of Dates per Event")
Next we looked at the demographic profile of the people in our data set. We decided to look into age, gender and race.
mean(sdd$age,na.rm = TRUE)
## [1] 26.28174
qplot(data=sdd,x=age,geom="histogram",colour=I("white"))
number_of_gender=sdd%>%group_by(gender)%>%dplyr::summarise(number=n())
prop.table(number_of_gender$number)
## [1] 0.4992664 0.5007336
qplot(data=sdd,race,colour=I("white"),binwidth=1)
number_of_race=sdd%>%select(race)%>%na.omit()%>%group_by(race)%>%summarise(number=n())
prop.table(number_of_race$number)
## [1] 0.04557561 0.56022492 0.08419651 0.24400710 0.06599586
We also wanted to know more about the background of the people in our dataset.
qplot(data=sdd,field_cd,colour=I("white"),binwidth=1)
sdd%>%select(field)%>%na.omit()%>%group_by(field)%>%summarise(n())
## # A tibble: 215 × 2
## field `n()`
## <fctr> <int>
## 1 58
## 2 Acting 22
## 3 African-American Studies/History 15
## 4 American Studies 9
## 5 anthropology 10
## 6 Anthropology 9
## 7 Anthropology/Education 14
## 8 Applied Maths/Econs 16
## 9 Applied Physiology & Nutrition 18
## 10 Architecture 10
## # ... with 205 more rows
sdd%>%select(sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,tv,theater,movies,concerts,shopping,yoga)%>%na.omit()%>%summarise_each(funs(mean))
## sports tvsports exercise dining museums art hiking gaming
## 1 6.406408 4.540492 6.159745 7.808514 6.954761 6.71062 5.704835 3.848858
## clubbing tv theater movies concerts shopping yoga
## 1 5.695194 5.294571 6.801988 7.940967 6.909374 5.654257 4.349748
#join information of both partners: for dates where the participant said yes (dec==1),
#attach the partner's own record of the same date so their ratings can be compared
decd <- sdd %>% filter(dec==1)
join <- inner_join(x=decd, y=sdd, by=c("pid" = "iid"))
other <- join %>% filter(pid.y == iid)
#transform data from tabular format to long format
ldata <- rbind(
data.frame(measure="look for", feat="attr", rating=sdd$attr1_1, gender=sdd$gender),
data.frame(measure="look for", feat="sinc", rating=sdd$sinc1_1, gender=sdd$gender),
data.frame(measure="look for", feat="shar", rating=sdd$shar1_1, gender=sdd$gender),
data.frame(measure="look for", feat="intel", rating=sdd$intel1_1, gender=sdd$gender),
data.frame(measure="look for", feat="fun", rating=sdd$fun1_1, gender=sdd$gender),
data.frame(measure="look for", feat="amb", rating=sdd$amb1_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="attr", rating=sdd$attr4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="sinc", rating=sdd$sinc4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="shar", rating=sdd$shar4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="intel", rating=sdd$intel4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="fun", rating=sdd$fun4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="amb", rating=sdd$amb4_1, gender=sdd$gender),
data.frame(measure="i rate my dec", feat="attr", rating=other$attr_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="sinc", rating=other$sinc_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="shar", rating=other$shar_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="intel", rating=other$intel_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="fun", rating=other$fun_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="amb", rating=other$amb_o.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="attr", rating=other$attr5_1.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="sinc", rating=other$sinc5_1.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="intel", rating=other$intel5_1.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="fun", rating=other$fun5_1.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="amb", rating=other$amb5_1.y, gender=other$gender.x)
)
#Calculate zScore to compensate for different measurements
stats <- ldata %>% group_by(measure) %>% summarise(avg = mean(rating, na.rm=T), sdv = sd(rating, na.rm=T))
zldata <- rbind(
ldata %>% filter(measure=="look for") %>% mutate(rating=(rating-stats$avg[1])/stats$sdv[1]),
ldata %>% filter(measure=="fellows look for") %>% mutate(rating=(rating-stats$avg[2])/stats$sdv[2]),
ldata %>% filter(measure=="i rate my dec") %>% mutate(rating=(rating-stats$avg[3])/stats$sdv[3]),
ldata %>% filter(measure=="they rate my dec") %>% mutate(rating=(rating-stats$avg[4])/stats$sdv[4])
)
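The block above relies on the rows of ‘stats’ appearing in the same order as the hard-coded indices 1-4. An equivalent, order-independent way to standardise within each measure (a sketch, producing the same result as zldata) is to join the group statistics back in by measure:
zldata_alt <- ldata %>%
left_join(stats, by = "measure") %>%
mutate(rating = (rating - avg) / sdv) %>%
dplyr::select(-avg, -sdv)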
#
ggplot(subset(zldata, rating<3 & rating >-3), aes(y=rating,x=feat, colour=measure), na.rm=T) +
geom_boxplot(outlier.size=0, fill = "white", alpha=.8) +
facet_grid(gender ~ .)
For the regression analysis we decided to first look at a derived variable, ‘positive responses’ (the percentage of partners who said yes to seeing the participant again after the four-minute date), as the dependent variable, since this seems a good indicator of how successful candidates were at dating. To do this we created a variable ‘positive_responses’ by summing how many ‘dec_o = 1’ each ‘iid’ received and dividing this by the total number of dates they went on.
We also created a series of variables unique to each ‘iid’ to serve as the independent variables. The most important of these are their average attribute scores, created by averaging the ratings that partners gave each ‘iid’ on each of the six attributes (for example, each participant received an average attractiveness score), along with the average correlation of interests:
gsdd <- sdd %>% group_by(iid, age, gender, income, exphappy, goal, go_out, date) %>%
summarise(y = mean(dec_o, na.rm = T),
attr = mean(attr_o, na.rm = T),
sinc = mean(sinc_o, na.rm = T),
intel = mean(intel_o, na.rm = T),
fun = mean(fun_o, na.rm = T),
amb = mean(amb_o, na.rm = T),
shar = mean(shar_o, na.rm = T),
cint = mean(int_corr, na.rm = T)) %>%
rename(exph = exphappy, inc = income, male=gender, gout = go_out)
The next step was to build and experiment with numerous linear regression models, first looking at each independent variable alone and then building up combinations of these variables. We ultimately found several models that showed significant p-values for the linear regression:
All variables were also run through GGally::ggpairs to see whether there were other potential IVs we had overlooked (income was removed to save processing time).
corrgram(gsdd, lower.panel=panel.cor,
upper.panel=panel.pie, text.panel=panel.txt,
main="Correlation of interesting variables")
ggpairs(gsdd)
fm1 = "y ~ attr"
model_1 = lm(data=gsdd, fm1)
summary(model_1)
##
## Call:
## lm(formula = fm1, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48459 -0.08640 -0.00720 0.09401 0.48513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.547853 0.035219 -15.56 <2e-16 ***
## attr 0.157051 0.005597 28.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1414 on 447 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.637
## F-statistic: 787.3 on 1 and 447 DF, p-value: < 2.2e-16
fm2 = "y ~ attr + shar"
model_2 = lm(data=gsdd, fm2)
summary(model_2)
##
## Call:
## lm(formula = fm2, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46077 -0.08450 -0.00428 0.08880 0.45175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.666995 0.038461 -17.342 < 2e-16 ***
## attr 0.126427 0.007164 17.649 < 2e-16 ***
## shar 0.056663 0.008795 6.443 3.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1354 on 446 degrees of freedom
## Multiple R-squared: 0.6687, Adjusted R-squared: 0.6672
## F-statistic: 450.1 on 2 and 446 DF, p-value: < 2.2e-16
fm3 = "y ~ attr + fun"
model_3 = lm(data=gsdd, fm3)
summary(model_3)
##
## Call:
## lm(formula = fm3, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38416 -0.09119 -0.00766 0.08506 0.46093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.706788 0.040965 -17.253 < 2e-16 ***
## attr 0.122397 0.007393 16.555 < 2e-16 ***
## fun 0.058245 0.008601 6.772 4.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1348 on 446 degrees of freedom
## Multiple R-squared: 0.6716, Adjusted R-squared: 0.6701
## F-statistic: 456.1 on 2 and 446 DF, p-value: < 2.2e-16
fm4 = "y ~ attr + fun + shar"
model_4 = lm(data=gsdd, fm4)
summary(model_4)
##
## Call:
## lm(formula = fm4, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39479 -0.08417 -0.00393 0.09095 0.45076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.726006 0.040991 -17.712 < 2e-16 ***
## attr 0.115435 0.007635 15.120 < 2e-16 ***
## fun 0.039242 0.010382 3.780 0.000178 ***
## shar 0.033801 0.010568 3.198 0.001481 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1335 on 445 degrees of freedom
## Multiple R-squared: 0.679, Adjusted R-squared: 0.6768
## F-statistic: 313.8 on 3 and 445 DF, p-value: < 2.2e-16
fm5 = "y ~ attr*fun*shar"
model_5 = lm(data=gsdd, fm5)
summary(model_5)
##
## Call:
## lm(formula = fm5, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40341 -0.07962 -0.00837 0.08123 0.46743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.660435 0.646451 2.569 0.010541 *
## attr -0.407726 0.137494 -2.965 0.003187 **
## fun -0.298734 0.112897 -2.646 0.008434 **
## shar -0.289769 0.133071 -2.178 0.029969 *
## attr:fun 0.073937 0.021306 3.470 0.000571 ***
## attr:shar 0.075859 0.026118 2.904 0.003864 **
## fun:shar 0.042748 0.020802 2.055 0.040469 *
## attr:fun:shar -0.010243 0.003736 -2.742 0.006363 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1303 on 441 degrees of freedom
## Multiple R-squared: 0.6969, Adjusted R-squared: 0.6921
## F-statistic: 144.9 on 7 and 441 DF, p-value: < 2.2e-16
fm6 = "y ~ attr + fun:shar"
model_6 = lm(data=gsdd, fm6)
summary(model_6)
##
## Call:
## lm(formula = fm6, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39249 -0.08549 -0.00247 0.08594 0.45351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5190200 0.0332373 -15.616 < 2e-16 ***
## attr 0.1144801 0.0075343 15.195 < 2e-16 ***
## fun:shar 0.0065750 0.0008346 7.878 2.57e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1327 on 446 degrees of freedom
## Multiple R-squared: 0.6821, Adjusted R-squared: 0.6807
## F-statistic: 478.5 on 2 and 446 DF, p-value: < 2.2e-16
fm7 = "y ~ attr:fun + fun:shar + attr:shar"
model_7 = lm(data=gsdd, fm7)
summary(model_7)
##
## Call:
## lm(formula = fm7, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39422 -0.08562 -0.00354 0.08259 0.46371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.184185 0.021851 -8.429 4.88e-16 ***
## attr:fun 0.010466 0.001419 7.375 8.10e-13 ***
## fun:shar -0.003923 0.001410 -2.782 0.00564 **
## attr:shar 0.009403 0.001829 5.141 4.11e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.131 on 445 degrees of freedom
## Multiple R-squared: 0.6906, Adjusted R-squared: 0.6885
## F-statistic: 331 on 3 and 445 DF, p-value: < 2.2e-16
fm8 = "y ~ attr:fun + attr:shar + shar:sinc"
model_8 = lm(data=gsdd, fm8)
summary(model_8)
##
## Call:
## lm(formula = fm8, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39512 -0.08233 -0.00157 0.08388 0.47759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.139444 0.030981 -4.501 8.65e-06 ***
## attr:fun 0.008380 0.001470 5.700 2.18e-08 ***
## attr:shar 0.010729 0.002130 5.036 6.90e-07 ***
## shar:sinc -0.003697 0.001350 -2.738 0.00644 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1311 on 445 degrees of freedom
## Multiple R-squared: 0.6904, Adjusted R-squared: 0.6883
## F-statistic: 330.8 on 3 and 445 DF, p-value: < 2.2e-16
# Next we ran ANOVA tests between our nested models to see whether they were significantly different, which they appeared to be.
anova(model_1,model_2)
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 446 8.1807 1 0.76139 41.51 3.045e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_1,model_3)
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + fun
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 446 8.1085 1 0.83367 45.855 4.029e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_1,model_4)
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + fun + shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 445 7.9263 2 1.0159 28.517 2.224e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_4,model_5)
## Analysis of Variance Table
##
## Model 1: y ~ attr + fun + shar
## Model 2: y ~ attr * fun * shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 445 7.9263
## 2 441 7.4838 4 0.44242 6.5176 4.207e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_5,model_6)
## Analysis of Variance Table
##
## Model 1: y ~ attr * fun * shar
## Model 2: y ~ attr + fun:shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 441 7.4838
## 2 446 7.8499 -5 -0.36602 4.3137 0.0007659 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_5,model_7)
## Analysis of Variance Table
##
## Model 1: y ~ attr * fun * shar
## Model 2: y ~ attr:fun + fun:shar + attr:shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 441 7.4838
## 2 445 7.6405 -4 -0.15669 2.3083 0.05725 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fms <- c(fm1, fm2, fm3, fm4, fm5, fm6, fm7, fm8)
## 10 -folded r square: 0.6151375 formula: y ~ attr
## 10 -folded r square: 0.6495073 formula: y ~ attr + shar
## 10 -folded r square: 0.6530901 formula: y ~ attr + fun
## 10 -folded r square: 0.6595179 formula: y ~ attr + fun + shar
## 10 -folded r square: 0.6745532 formula: y ~ attr*fun*shar
## 10 -folded r square: 0.6648757 formula: y ~ attr + fun:shar
## 10 -folded r square: 0.6709955 formula: y ~ attr:fun + fun:shar + attr:shar
## 10 -folded r square: 0.6711298 formula: y ~ attr:fun + attr:shar + shar:sinc
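The loop that produced these cross-validated R-squared values is not shown; one way they could be computed, assuming caret::createFolds for the fold assignment (as in the commented-out 5-fold code below), is sketched here:
#sketch: out-of-fold R-squared for each formula in fms
cv_rsq <- function(fm, data, k = 10) {
  d <- as.data.frame(data)
  d <- na.omit(d[, all.vars(as.formula(fm)), drop = FALSE])
  folds <- caret::createFolds(d$y, k = k, list = TRUE)
  preds <- rep(NA_real_, nrow(d))
  for (f in folds) {
    fit <- lm(as.formula(fm), data = d[-f, ])
    preds[f] <- predict(fit, newdata = d[f, ])
  }
  1 - sum((d$y - preds)^2) / sum((d$y - mean(d$y))^2)
}
for (fm in fms) {
  cat(10, "-folded r square:", cv_rsq(fm, gsdd), "formula:", fm, "\n")
}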
##5-FOLD VALIDATION
#
# yy = iid_table$iid
# flds <- createFolds(yy, k = 5, list = TRUE, returnTrain = FALSE)
# flds
#
# training_no5 <- rbind(iid_table[ flds[[1]], ],iid_table[ flds[[2]], ],iid_table[ flds[[3]], ],iid_table[ flds[[4]], ])
# training_no4 <- rbind(iid_table[ flds[[1]], ],iid_table[ flds[[2]], ],iid_table[ flds[[3]], ],iid_table[ flds[[5]], ])
# training_no3 <- rbind(iid_table[ flds[[1]], ],iid_table[ flds[[2]], ],iid_table[ flds[[5]], ],iid_table[ flds[[4]], ])
# training_no2 <- rbind(iid_table[ flds[[1]], ],iid_table[ flds[[5]], ],iid_table[ flds[[3]], ],iid_table[ flds[[4]], ])
# training_no1 <- rbind(iid_table[ flds[[5]], ],iid_table[ flds[[2]], ],iid_table[ flds[[3]], ],iid_table[ flds[[4]], ])
#
# testing_5 <- iid_table[ flds[[5]], ]
# testing_4 <- iid_table[ flds[[4]], ]
# testing_3 <- iid_table[ flds[[3]], ]
# testing_2 <- iid_table[ flds[[2]], ]
# testing_1 <- iid_table[ flds[[1]], ]
#
# model_4 = lm(data=iid_table, y ~ attr_scores + fun + shar_scores)
# summary(model_4)
#
# model_4_no5 = lm(data=training_no5, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no5)
# Rsqtraining_no5 <- summary(model_4_no5)$r.squared
# Rsqtraining_no5
#
# model_4_no4 = lm(data=training_no4, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no4)
# Rsqtraining_no4 <- summary(model_4_no4)$r.squared
# Rsqtraining_no4
#
# model_4_no3 = lm(data=training_no3, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no3)
# Rsqtraining_no3 <- summary(model_4_no3)$r.squared
# Rsqtraining_no3
#
# model_4_no2 = lm(data=training_no2, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no2)
# Rsqtraining_no2 <- summary(model_4_no2)$r.squared
# Rsqtraining_no2
#
# model_4_no1 = lm(data=training_no1, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no1)
# Rsqtraining_no1 <- summary(model_4_no1)$r.squared
# Rsqtraining_no1
# testing
#
# ybar_test5 = mean(testing_5$y)
# yhat_test5 = predict(model_4_no5, newdata = testing_5)
# SSE_5 = (yhat_test5 - testing_5$y)^2 %>% sum()
# SST_5 = (ybar_test5 - testing_5$y)^2 %>% sum()
# Rsqtesting_5 = 1-(SSE_5/SST_5)
# Rsqtesting_5
#
# ybar_test4 = mean(testing_4$y)
# yhat_test4 = predict(model_4_no4, newdata = testing_4)
# SSE_4 = (yhat_test4 - testing_4$y)^2 %>% sum()
# SST_4 = (ybar_test4 - testing_4$y)^2 %>% sum()
# Rsqtesting_4 = 1-(SSE_4/SST_4)
# Rsqtesting_4
#
# ybar_test3 = mean(testing_3$y)
# yhat_test3 = predict(model_4_no3, newdata = testing_3)
# SSE_3 = (yhat_test3 - testing_3$y)^2 %>% sum()
# SST_3 = (ybar_test3 - testing_3$y)^2 %>% sum()
# Rsqtesting_3 = 1-(SSE_3/SST_3)
# Rsqtesting_3
#
# ybar_test2 = mean(testing_2$y)
# yhat_test2 = predict(model_4_no2, newdata = testing_2)
# SSE_2 = (yhat_test2 - testing_2$y)^2 %>% sum()
# SST_2 = (ybar_test2 - testing_2$y)^2 %>% sum()
# Rsqtesting_2 = 1-(SSE_2/SST_2)
# Rsqtesting_2
#
# ybar_test1 = mean(testing_1$y)
# yhat_test1 = predict(model_4_no1, newdata = testing_1)
# SSE_1 = (yhat_test1 - testing_1$y)^2 %>% sum()
# SST_1 = (ybar_test1 - testing_1$y)^2 %>% sum()
# Rsqtesting_1 = 1-(SSE_1/SST_1)
# Rsqtesting_1
#
#
# Final_Rsq_for_model_4 <- mean(Rsqtesting_1, Rsqtesting_2, Rsqtesting_3, Rsqtesting_4, Rsqtesting_5)
# Final_Rsq_for_model_4
## Final Rsquare value: 0.6697428
# Using a regression tree for the same dependent variable
reg_tree_1<-rpart(data=gsdd, y ~ .)
fancyRpartPlot(reg_tree_1)
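# To see how much each split contributes, the tree's complexity-parameter table can also be inspected (sketch):
printcp(reg_tree_1)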
## Assumption 1: A linear relationship between the IVs and the DV exists
qplot(predict(model_4), rstandard(model_4), geom="point") + geom_hline(yintercept=0, colour=I("blue"), alpha=I(0.5))
qplot(predict(model_5), rstandard(model_5), geom="point") + geom_hline(yintercept=0, colour=I("blue"), alpha=I(0.5))
# Whilst there does seem to be some kind of limit resulting in there being no residuals in the bottom left corner of the graph (i.e. the residuals are not totally randomly distributed), overall there is no clear relationship between the residuals and the predicted values (e.g. a linear, exponential, or sinusoidal pattern), which suggests that the IVs are generally linearly related to the DV.
## A2: Residuals are distributed normally
q1 = qplot(rstandard(model_4), geom="blank") +
geom_histogram(aes(y=..density..), colour=I("gray"), binwidth=0.5)+
stat_function(fun=dnorm, args=list(mean=0, sd=1),
colour=I("red"), alpha=I(0.5))
q2 = qplot(sample=rstandard(model_4)) +
geom_abline(slope=1,intercept=0)
grid.arrange(q1, q2, nrow=1)
qq1 = qplot(rstandard(model_5), geom="blank") +
geom_histogram(aes(y=..density..), colour=I("gray"), binwidth=0.5)+
stat_function(fun=dnorm, args=list(mean=0, sd=1),
colour=I("red"), alpha=I(0.5))
qq2 = qplot(sample=rstandard(model_5)) +
geom_abline(slope=1,intercept=0)
grid.arrange(qq1, qq2, nrow=1)
# Although a few points early on fall below the theoretical line and a few points at the end fall above it, the residuals generally seem to follow a very clear normal distribution.
## A3: Homoscedasticity of residuals
qplot(predict(model_4), rstandard(model_4), geom="point") + geom_hline(yintercept=0) +
geom_hline(yintercept=2, colour = I("red"), alpha=I(0.5)) +
geom_hline(yintercept=-2, colour = I("red"), alpha=I(0.5))
qplot(predict(model_5), rstandard(model_5), geom="point") + geom_hline(yintercept=0) +
geom_hline(yintercept=2, colour = I("red"), alpha=I(0.5)) +
geom_hline(yintercept=-2, colour = I("red"), alpha=I(0.5))
# Whilst this assumption is clearly violated to some extent (there are no two lines that can easily capture all points), this is mostly due to there being no values in the bottom right of the graph. So although the assumption of homoscedasticity does not fully hold, the violation is not severe enough to undermine the entire linear regression model.
# We also decided to test homoscedasticity using functions from the 'car' package to double-check the results
spreadLevelPlot(model_4)
## Warning in spreadLevelPlot.lm(model_4): 12 negative fitted values removed
##
## Suggested power transformation: 0.9404072
spreadLevelPlot(model_5)
## Warning in spreadLevelPlot.lm(model_5): 1 negative fitted value removed
##
## Suggested power transformation: 0.8267515
# As the spreadLevelPlot above shows, the red line is not horizontal (which would indicate that the homoscedasticity assumption is fully justified), though the slope is quite gentle, suggesting the issue is not too serious.
#ncvTest(model_4)
ncvTest(model_5)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 10.66148 Df = 1 p = 0.001093899
# The NCV test also shows a fairly low p-value, suggesting the homoscedasticity assumption has been violated in this case, though only marginally. Whilst we must be clear that the current linear regression model violates assumption 3, we choose to continue with the results regardless until the model can be improved.
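# One possible refinement, following the power transformation suggested by spreadLevelPlot above, would be to refit the interaction model on a transformed response and re-run the test (a sketch, not part of the original analysis):
gsdd_t <- gsdd %>% ungroup() %>% mutate(y_t = y^0.8267515)
model_5t <- lm(y_t ~ attr*fun*shar, data = gsdd_t)
ncvTest(model_5t)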
## A4: Independence of errors (residuals)
durbinWatsonTest(model_4)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1165655 1.762584 0.014
## Alternative hypothesis: rho != 0
durbinWatsonTest(model_5)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1439494 1.705347 0.002
## Alternative hypothesis: rho != 0
# Unfortunately the linear regression model does not pass the test of independence (with a statistic of 1.76 and a p-value < 0.05), meaning that we cannot treat the errors as fully independent. One reason for this could be that the data set is biased, as all subjects were students at Columbia University rather than a random sample.
## A5: Multicollinearity
vif(model_4)
## attr fun shar
## 2.089705 2.854899 2.656638
vif(model_5)
## attr fun shar attr:fun attr:shar
## 711.3610 354.3357 442.0814 1894.4840 2240.3416
## fun:shar attr:fun:shar
## 1326.8614 3720.7749
# For model_4, none of the VIF statistics are >5, so there is no strong multicollinearity between the predictors, justifying keeping all of them in the model. The very large VIFs for model_5 are expected, since its interaction terms are constructed directly from the main effects.
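# Centring the predictors before forming the interactions is a common way to reduce this inflation (sketch):
gsdd_c <- gsdd %>% ungroup() %>%
mutate(attr_c = attr - mean(attr, na.rm = TRUE),
fun_c = fun - mean(fun, na.rm = TRUE),
shar_c = shar - mean(shar, na.rm = TRUE))
model_5c <- lm(y ~ attr_c*fun_c*shar_c, data = gsdd_c)
vif(model_5c)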
## Outliers
outlierTest(model_4)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 140 3.424749 0.00067253 0.30197
outlierTest(model_5)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 140 3.647307 0.00029674 0.13324
# There appear to be no outliers more than three standard deviations away from the mean. There is thus no good reason to consider removing any of the observations.
## Influential points
influencePlot(model_4, id.method = "identify")
# 399
dfbetaPlots(model_4, id.method = "identify")
#253
#147
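# The flagged rows can be inspected directly before deciding whether any action is needed (sketch; indices as reported by the plots above):
gsdd %>% ungroup() %>% slice(c(147, 253, 399))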
# stepwise regression is hated by statisticians - read criticism below
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
ngsdd <- na.omit(gsdd)
null= lm(data=ngsdd, y ~ 1) # empty model
full = lm(data=ngsdd, y ~ .) # full model with all variables '.'=all
## define starting point and ending point
step_sdd = stepAIC(null, scope=list(lower=null, upper=full), direction = "forward")
## Start: AIC=-644.08
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + attr 1 7.3489 4.7415 -849.88
## + fun 1 5.2604 6.8300 -768.86
## + shar 1 4.4567 7.6337 -744.16
## + amb 1 1.6008 10.4896 -673.61
## + intel 1 1.5756 10.5147 -673.08
## + sinc 1 1.5232 10.5672 -671.97
## + male 1 0.4530 11.6374 -650.56
## + date 1 0.2947 11.7957 -647.56
## + gout 1 0.2065 11.8839 -645.90
## + age 1 0.1770 11.9134 -645.35
## + goal 1 0.1152 11.9752 -644.20
## <none> 12.0904 -644.08
## + inc 1 0.0176 12.0728 -642.40
## + cint 1 0.0052 12.0852 -642.17
## + iid 1 0.0013 12.0891 -642.10
## + exph 1 0.0005 12.0899 -642.09
##
## Step: AIC=-849.88
## y ~ attr
##
## Df Sum of Sq RSS AIC
## + fun 1 0.36839 4.3731 -865.84
## + shar 1 0.20926 4.5323 -857.90
## + sinc 1 0.08534 4.6562 -851.92
## + intel 1 0.06992 4.6716 -851.18
## <none> 4.7415 -849.88
## + amb 1 0.04039 4.7011 -849.78
## + inc 1 0.03837 4.7031 -849.69
## + age 1 0.03397 4.7075 -849.48
## + goal 1 0.02046 4.7211 -848.84
## + gout 1 0.02028 4.7212 -848.83
## + exph 1 0.01784 4.7237 -848.72
## + male 1 0.01169 4.7298 -848.43
## + date 1 0.00633 4.7352 -848.18
## + iid 1 0.00273 4.7388 -848.01
## + cint 1 0.00088 4.7406 -847.92
##
## Step: AIC=-865.84
## y ~ attr + fun
##
## Df Sum of Sq RSS AIC
## + goal 1 0.042571 4.3306 -866.01
## <none> 4.3731 -865.84
## + age 1 0.029251 4.3439 -865.33
## + inc 1 0.028543 4.3446 -865.29
## + gout 1 0.021939 4.3512 -864.95
## + intel 1 0.020323 4.3528 -864.87
## + shar 1 0.017080 4.3560 -864.71
## + sinc 1 0.013913 4.3592 -864.55
## + exph 1 0.013786 4.3593 -864.54
## + male 1 0.010973 4.3622 -864.40
## + date 1 0.004039 4.3691 -864.04
## + amb 1 0.003222 4.3699 -864.00
## + iid 1 0.001729 4.3714 -863.93
## + cint 1 0.000020 4.3731 -863.84
##
## Step: AIC=-866.01
## y ~ attr + fun + goal
##
## Df Sum of Sq RSS AIC
## <none> 4.3306 -866.01
## + inc 1 0.037741 4.2928 -865.95
## + intel 1 0.030046 4.3005 -865.56
## + gout 1 0.023290 4.3073 -865.21
## + sinc 1 0.022663 4.3079 -865.17
## + age 1 0.019077 4.3115 -864.99
## + shar 1 0.016683 4.3139 -864.87
## + exph 1 0.014120 4.3164 -864.74
## + male 1 0.008219 4.3223 -864.43
## + date 1 0.005294 4.3253 -864.28
## + amb 1 0.004235 4.3263 -864.23
## + iid 1 0.001596 4.3290 -864.09
## + cint 1 0.000448 4.3301 -864.03
## the following shows the stepwise path that produced the selected model
step_sdd$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## y ~ 1
##
## Final Model:
## y ~ attr + fun + goal
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 221 12.090388 -644.0792
## 2 + attr 1 7.34887647 220 4.741512 -849.8833
## 3 + fun 1 0.36838754 219 4.373124 -865.8383
## 4 + goal 1 0.04257121 218 4.330553 -866.0100
summary(step_sdd)
##
## Call:
## lm(formula = y ~ attr + fun + goal, data = ngsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39987 -0.09493 -0.00725 0.07484 0.44372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.680014 0.064670 -10.515 < 2e-16 ***
## attr 0.123588 0.011383 10.857 < 2e-16 ***
## fun 0.056796 0.012810 4.434 1.47e-05 ***
## goal -0.009072 0.006197 -1.464 0.145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1409 on 218 degrees of freedom
## Multiple R-squared: 0.6418, Adjusted R-squared: 0.6369
## F-statistic: 130.2 on 3 and 218 DF, p-value: < 2.2e-16