dd <- read.csv("/home/rstudioshared/shared_files/data/Speed Dating Data.csv")
#or
dd <- read.csv("Data_Science_Data/speed-dating-experiment/Speed Dating Data.csv")
library(dplyr); library(ggplot2)
dd <- dd %>% filter(!is.na(pid), !is.na(iid), !is.na(attr))
dd <- dd %>% mutate(pid = as.factor(pid), iid=as.factor(iid))
mean(dd$dec)
length(unique(dd$pid))
length(unique(dd$iid))
Above that we see that our data includes information the dates of 551 potential daters and that people chose to date roughly 43% of their potential matches. Let’s see what else we can figure out!
We can look at the success rate for individuals (how often people chose to date them) with the code:
dd %>% group_by(pid) %>% summarize(dec_rate = mean(dec)) %>%
ggplot(aes(pid, dec_rate)) + geom_point() +
theme(axis.text.x = element_blank())
But that’s just a mess of points! This will tell us more if we order people in by success rate.
dd %>% group_by(pid) %>% summarize(dec_rate = mean(dec)) %>%
ggplot(aes(reorder(pid, dec_rate), dec_rate)) + geom_point()+
theme(axis.text.x = element_blank())
We can look at how choosiness varied by among st the people making the decisions the same way:
dd %>% group_by(iid) %>% summarize(dec_rate = mean(dec)) %>%
ggplot(aes(reorder(iid, dec_rate), dec_rate)) + geom_point()+
theme(axis.text.x = element_blank())
First, let’s use a regular-old logistic regression model to predict success rates based on individuals
library(caTools)
m <- glm(dec ~ pid, data=dd, family="binomial")
and use this model to make predictions for individual dates:
dd$pred <- predict(m, data=dd, type="response")
Now, let’s take a look at how the predictions for individuals compare to their actual results (the red dots are the predictions):
dd %>% group_by(pid) %>% summarize(dec_rate = mean(dec), n=n(), pred=mean(pred)) %>%
ggplot(aes(reorder(pid, dec_rate), dec_rate, size=n)) + geom_point()+
geom_point(aes(reorder(pid, dec_rate), pred, color="red", size=0.1))
The “predictions” simply tell us what happened! There is no new information here (at least not in this graph). But maybe some daters simply had good luck. Even if all the daters were equally likely to get dates, we’d see results vary to some degree. To get a better sense of this, we can use a mixed effects models. In this model we find a normal distribution of “date-worthiness” based on the data and our predictions are based on Bayes Theorem – combining our prior distribution with each dater’s observed results.
library(lme4)
m <- glmer(dec ~ (1|pid), data=dd, family="binomial")
summary(m)
Take a long look at the summary. The intercept is -0.33548…
exp(-0.33548)
… which tells us that a “typical” speed dater has 0.715 odds of getting a date
0.715/(1+0.715)
This amounts to about a 42% chance.
The summary also tells us that the standard deviation in date-getting (across individuals) is 1.004…
exp(1.004)
… which means that a suitor who is one standard deviation above average has 2.73 times better odds of getting a date…
exp(1.004)*0.715
… or put another way, a +1 SD (standard deviation) dater is about 2:1 to get a date.
Now, let’s make predictions:
dd$pred <- predict(m, data=dd, type="response")
dd %>% group_by(pid) %>% summarize(dec_rate = mean(dec), pred=mean(pred)) %>%
ggplot(aes(reorder(pid, dec_rate), dec_rate)) + geom_point()+
geom_point(aes(reorder(pid, dec_rate), pred, color="red"))
Now, how do the predictions of our model (red dots) compare to the actual results? What happened?
Let’s add attractiveness to our model:
m.attr <- glmer(dec ~ attr+ (1|pid), data=dd, family="binomial")
summary(m.attr)
Looking again at the model summary, you’ll see that the intercept has changed dramatically. Why? How can we interpret this new intercept (and how does that differ from our interpretation of the intercept from our last model?)
The summary also shows that the standard deviation in date-getting among individuals has decreased (from 1.004 in our previous model to 0.5545). Why has this number decreased? How does our interpretation of this number differ between models?
Let’s predict something odd. This model allows us to predict how successful each data would-have-been had they been rated at different levels of attractiveness.
Let’s give everyone an attractiveness of 8 and see how they do (relative to their actual results):
dd$pred.8.attr <- predict(m.attr, newdata=dd %>% mutate(attr=8), type="response")
dd %>% group_by(pid) %>% summarize(dec_rate = mean(dec), pred=mean(pred.8.attr)) %>%
ggplot(aes(reorder(pid, dec_rate), dec_rate)) + geom_point()+
geom_point(aes(reorder(pid, dec_rate), pred, color="red"))
We could, alternatively, given everyone an attractiveness of 5:
dd$pred.5.attr <- predict(m.attr, newdata=dd %>% mutate(attr=5), type="response")
dd %>% group_by(pid) %>% summarize(dec_rate = mean(dec), pred=mean(pred.5.attr)) %>%
ggplot(aes(reorder(pid, dec_rate), dec_rate)) + geom_point()+
geom_point(aes(reorder(pid, dec_rate), pred, color="red"))
Why is there less variance in the predictions when we set everyone’s attractiveness ratings to either 8 or 5?
Decision Makers
We could also add the decision maker to the model.
m <- glmer(dec ~ (1|pid) + (1|iid), data=dd, family="binomial")
summary(m)
Looking at the summary, does the identity of the partner or the identity of the decision maker have more influence on whether a date occurs? (In other words, is date-worthiness or choosiness more important?)
Now, let’s add attractiveness back into the mix and add the same question…
m.attr <- glmer(dec ~ (1|pid) + (1|iid)+attr, data=dd, family="binomial")
summary(m.attr)
Has the relative importance of decision-maker and partner changed? If so, why?
Maybe decision makers respond to attractivness in different ways, with some placing much more importance than others. We can calculate individual “slopes” which tell us how much each decision maker valued attractiveness as follows:
m.attr.slopes <- glmer(dec ~ (attr|iid), data=dd, family="binomial")
summary(m.attr.slopes)
To look at the intercepts and attractiveness slopes for each decision maker we can write:
ranef(m.attr.slopes)
We can calculate the mean intercepts and slopes as follows:
apply(ranef(m.attr.slopes)$iid, 2, mean)
How would you interpret these numbers?
We can also calculate the standard deviations in these numbers (across decision makers):
apply(ranef(m.attr.slopes)$iid, 2, sd)
For each decision maker, let’s calculate the probability that they would date someoone with they assigned an attractiveness of 5:
logodds5 <- ranef(m.attr.slopes)$iid[,1] + ranef(m.attr.slopes)$iid[,2]*5
odds5 <- exp(logodds5)
probs5 <- odds5/(odds5+1)
Let’s do the same for an attractiveness of 8 and then put this all in a data.frame along with decision maker id numbers:
logodds8 <- ranef(m.attr.slopes)$iid[,1] + ranef(m.attr.slopes)$iid[,2]*8
odds8 <- exp(logodds8)
probs8 <- odds8/(odds8+1)
iids <- unique(dd$iid)
probs <- data.frame(iid = rep(iids,2), attr = as.factor(c(rep(5, 551), rep(8, 551))), prob=c(probs5, probs8))
Finally, let’s make a slope graph!
ggplot(data = probs, aes(x = attr, y = prob, group = iid)) +
geom_line(aes(color = iid, alpha = 0.5), size = 0.3) +
theme(legend.position = "none")
In this graph, each line represents one decision maker and each person’s line shows our estimated probablity that they’d date someone whom they assigned attractiveness ratings of 5 and 8, respectively.
According to our model, would some peoople rather date someone whom they find less attactive? If so, how would such people show up on our graph?
Challenges:
Try creating your own models and making your own counter-factuals predictions.
Try creating a slope graph for intelligence.