This report presents regression models built from data describing 5,000 four-minute speed dates involving 310 young American adults. The original data were collected by Columbia Business School professors. Further information, as well as the data itself, can be found in this report's repository.
The response variable is the variable we are interested in measuring and drawing conclusions about.
A predictor variable is a variable used in the regression to predict the response variable.
Our response variable will be "like"; we want to study how well the predictor variables can predict its behavior and how they impact it.
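For readers less familiar with R, every model in this report uses the formula interface: the response variable goes on the left of the ~ and the predictor variables on the right. A minimal sketch using mtcars, a built-in R dataset unrelated to this study, purely to illustrate the syntax:
# Response (mpg) on the left of "~", predictors (wt, hp) on the right
toy_model <- lm(mpg ~ wt + hp, data = mtcars)
summary(toy_model)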
data <- read_csv(here("data/speed-dating.csv"),
progress = FALSE,
col_types = cols(.default = col_integer(),
int_corr = col_double(),
field = col_character(),
from = col_character(),
career = col_character(),
attr = col_double(),
samerace = col_character(),
sinc = col_double(),
intel = col_double(),
fun = col_double(),
amb = col_double(),
shar = col_double(),
like = col_double(),
prob = col_double(),
match_es = col_double(),
attr3_s = col_double(),
sinc3_s = col_double(),
intel3_s = col_double(),
fun3_s = col_double(),
amb3_s = col_double())) %>%
mutate(from = as.numeric(factor(from)),
gender = as.numeric(factor(gender)),
samerace = as.numeric(factor(samerace)))
data %>%
glimpse()
## Observations: 4,918
## Variables: 43
## $ iid <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ order <int> 4, 3, 10, 5, 7, 6, 1, 2, 8, 9, 10, 9, 6, 1, 3, 2, 7, ...
## $ pid <int> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 11, 12, 13, 1...
## $ int_corr <dbl> 0.14, 0.54, 0.16, 0.61, 0.21, 0.25, 0.34, 0.50, 0.28,...
## $ samerace <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2,...
## $ age_o <int> 27, 22, 22, 23, 24, 25, 30, 27, 28, 24, 27, 22, 22, 2...
## $ age <int> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 24, 24, 24, 2...
## $ field <chr> "Law", "Law", "Law", "Law", "Law", "Law", "Law", "Law...
## $ race <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ from <dbl> 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 2, 2, 2, 2, 2...
## $ career <chr> "lawyer", "lawyer", "lawyer", "lawyer", "lawyer", "la...
## $ sports <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ tvsports <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ exercise <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7,...
## $ dining <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10,...
## $ museums <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8,...
## $ art <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6,...
## $ hiking <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ gaming <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ clubbing <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8,...
## $ reading <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 10, 10, 10, 10, 10, 10,...
## $ tv <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ theater <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9,...
## $ movies <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8...
## $ concerts <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 7, 7, 7, 7, 7...
## $ music <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8,...
## $ shopping <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ yoga <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ attr <dbl> 6, 7, 5, 7, 5, 4, 7, 4, 7, 5, 5, 8, 5, 7, 6, 8, 7, 5,...
## $ sinc <dbl> 9, 8, 8, 6, 6, 9, 6, 9, 6, 6, 7, 5, 8, 9, 8, 7, 5, 8,...
## $ intel <dbl> 7, 7, 9, 8, 7, 7, 7, 7, 8, 6, 8, 6, 9, 7, 7, 8, 9, 7,...
## $ fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, 9, 8, 4, 6, 6, 6, 9, 3, 6, 5,...
## $ amb <dbl> 6, 5, 5, 6, 6, 6, 6, 5, 8, 10, 6, 9, 3, 5, 7, 6, 7, 9...
## $ shar <dbl> 5, 6, 7, 8, 6, 4, 7, 6, 8, 8, 3, 6, 4, 7, 8, 2, 9, 5,...
## $ like <dbl> 7, 7, 7, 7, 6, 6, 6, 6, 7, 6, 6, 7, 6, 7, 8, 6, 8, 5,...
## $ prob <dbl> 6, 5, NA, 6, 6, 5, 5, 7, 7, 6, 4, 3, 7, 8, 6, 5, 7, 6...
## $ match_es <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ attr3_s <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ sinc3_s <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ intel3_s <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ fun3_s <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ amb3_s <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
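Several columns contain missing values (note the NAs above), so it is worth quantifying them before modeling. A quick sketch counting the NAs in each column, in the same dplyr style used throughout the report:
## Count missing values per column
data %>%
  summarise_all(funs(sum(is.na(.)))) %>%
  glimpse()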
data %>%
na.omit(race) %>%
ggplot(aes(race, ..prop..)) +
geom_bar(color = "black",
fill = "grey") +
labs(x= "Participant Race",
y = "Relative Frequency")
data %>%
na.omit(intel) %>%
ggplot(aes(intel, ..prop..)) +
geom_bar(color = "black",
fill = "grey") +
labs(x= "Intelligence (intel)",
y = "Relative Frequency")
data %>%
na.omit(attr) %>%
ggplot(aes(attr, ..prop..)) +
geom_bar(color = "black",
fill = "grey") +
labs(x= "Attractiveness (attr)",
y = "Relative Frequency")
data %>%
na.omit(amb) %>%
ggplot(aes(amb, ..prop..)) +
geom_bar(color = "black",
fill = "grey") +
labs(x= "Ambition (amb)",
y = "Relative Frequency")
data %>%
na.omit(sinc) %>%
ggplot(aes(sinc, ..prop..)) +
geom_bar(color = "black",
fill = "grey") +
labs(x= "Sincerity (sinc)",
y = "Relative Frequency")
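The five chunks above are nearly identical, so they could be collapsed into a small helper function. A sketch (plot_relative_freq is a hypothetical helper, not part of the original analysis):
# Hypothetical helper to avoid repeating the same bar-chart chunk
plot_relative_freq <- function(df, var_name, x_label) {
  df <- df[!is.na(df[[var_name]]), ]  # drop rows where the variable is missing
  ggplot(df, aes_string(var_name, "..prop..")) +
    geom_bar(color = "black", fill = "grey") +
    labs(x = x_label, y = "Relative Frequency")
}

plot_relative_freq(data, "sinc", "Sincerity (sinc)")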
require(GGally)
data %>%
select(like,fun,amb,attr,
sinc,intel,shar,prob,
fun3_s,amb3_s,attr3_s,
sinc3_s,intel3_s,samerace,
gender,from) %>%
na.omit() %>%
ggcorr(palette = "RdBu", label = TRUE,
hjust = 0.75, label_size = 3, nbreaks = 5) +
ggtitle("Correlation plot for employed variables")
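The same correlations can also be read off numerically, which helps when two cells of the plot look alike. A sketch restricted to the numeric candidate predictors and the response:
data %>%
  select(like, fun, amb, attr, sinc, intel, shar, prob) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)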
data %>%
select(like,fun,attr,sinc,
amb,intel,shar) %>%
na.omit() %>%
ggpairs(upper = list(continuous = "density"),
lower = list(continuous = wrap("cor", size=5)),
axisLabels = 'show',progress = F) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Regarding the relationship with the response variable “like”:
data %>%
na.omit(fun, like) %>%
ggplot(aes(fun, like)) +
stat_density2d(aes(fill = ..level..),
geom = "polygon")
data %>%
na.omit(attr, like) %>%
ggplot(aes(attr, like)) +
stat_density2d(aes(fill = ..level..),
geom = "polygon")
data %>%
na.omit(intel, like) %>%
ggplot(aes(intel, like)) +
stat_density2d(aes(fill = ..level..),
geom = "polygon")
data %>% # Keep only promising predictor variables and the response variable
select(fun, prob, order, amb,
attr, sinc, shar,
intel, like, gender, samerace) %>%
na.omit() -> data # remove NAs
data %>% ## Put the numeric predictor variables on the same scale
mutate_at(.vars = vars(fun, prob, order, attr,
sinc, shar, intel),
.funs = funs(as.numeric(scale(.)))) -> data_scaled
data_scaled %>%
glimpse()
## Observations: 4,101
## Variables: 11
## $ fun <dbl> 0.3673163, 0.8696052, 0.3673163, 0.3673163, -1.139550...
## $ prob <dbl> 0.44246160, -0.01644963, 0.44246160, 0.44246160, -0.0...
## $ order <dbl> -0.90758631, -1.08582650, -0.72934613, -0.37286577, -...
## $ amb <dbl> 6, 5, 6, 6, 6, 6, 5, 8, 10, 6, 9, 3, 5, 7, 6, 7, 9, 4...
## $ attr <dbl> -0.0256121, 0.4905314, 0.4905314, -0.5417556, -1.0578...
## $ sinc <dbl> 1.09093408, 0.53812046, -0.56750679, -0.56750679, 1.0...
## $ shar <dbl> -0.1395212, 0.3235922, 1.2498190, 0.3235922, -0.60263...
## $ intel <dbl> -0.1541995, -0.1541995, 0.4728427, -0.1541995, -0.154...
## $ like <dbl> 7, 7, 7, 6, 6, 6, 6, 7, 6, 6, 7, 6, 7, 8, 6, 8, 5, 5,...
## $ gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ samerace <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2,...
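Before splitting, a quick sanity check that the standardization worked: each rescaled column should now have mean approximately 0 and standard deviation 1 (note from the glimpse above that amb was not included in mutate_at, so it remains on its original scale). A sketch:
data_scaled %>%
  summarise_at(vars(fun, prob, order, attr, sinc, shar, intel),
               funs(mean, sd)) %>%
  glimpse()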
set.seed(101) # We set the seed for reproducibility
## Adding surrogate key to dataframe
data_scaled$id <- 1:nrow(data_scaled)
data_scaled %>%
dplyr::sample_frac(.8) -> training
training %>%
glimpse()
## Observations: 3,281
## Variables: 12
## $ fun <dbl> 0.3673163, 0.3673163, -0.1349725, 0.8696052, -0.63726...
## $ prob <dbl> 0.90137283, 2.27810651, -0.01644963, 0.90137283, 0.90...
## $ order <dbl> 0.16185478, -0.55110595, -0.90758631, 0.87481550, -0....
## $ amb <dbl> 8, 10, 6, 7, 4, 7, 6, 2, 9, 6, 6, 8, 7, 6, 8, 6, 8, 9...
## $ attr <dbl> 1.5228184, -1.5740426, -0.0256121, 0.4905314, -2.0901...
## $ sinc <dbl> 0.53812046, 1.64374770, -0.01469317, 0.53812046, -2.2...
## $ shar <dbl> 1.2498190, -0.1395212, -0.1395212, 1.2498190, -1.5288...
## $ intel <dbl> 1.0998849, -0.1541995, -0.7812416, 0.4728427, -2.0353...
## $ like <dbl> 8, 4, 6, 8, 1, 4, 6, 3, 9, 8, 6, 7, 8, 5, 6, 5, 7, 9,...
## $ gender <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 1,...
## $ samerace <dbl> 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2,...
## $ id <int> 1527, 180, 2909, 2696, 1024, 1230, 2396, 1366, 2546, ...
dplyr::anti_join(data_scaled,
training,
by = 'id') -> testing
testing %>%
glimpse()
## Observations: 820
## Variables: 12
## $ fun <dbl> -0.1349725, -0.1349725, -1.6418391, 0.3673163, 0.8696...
## $ prob <dbl> 0.90137283, 1.36028406, -0.01644963, 0.90137283, -1.8...
## $ order <dbl> -0.55110595, -1.44230686, -1.26406668, -1.44230686, -...
## $ amb <dbl> 3, 5, 6, 9, 6, 7, 8, 5, 4, 3, 2, 8, 7, 7, 6, 6, 1, 10...
## $ attr <dbl> -0.5417556, 0.4905314, 1.0066749, 1.0066749, -1.05789...
## $ sinc <dbl> 0.53812046, 1.09093408, -0.01469317, -0.01469317, -0....
## $ shar <dbl> -0.6026347, 0.7867056, -1.5288615, 0.7867056, 0.78670...
## $ intel <dbl> 1.0998849, -0.1541995, 0.4728427, 1.0998849, 0.472842...
## $ like <dbl> 6, 7, 6, 8, 4, 8, 7, 6, 7, 6, 9, 5, 7, 6, 8, 9, 1, 10...
## $ gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,...
## $ samerace <dbl> 1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1,...
## $ id <int> 12, 13, 15, 27, 30, 31, 32, 39, 42, 50, 51, 62, 96, 9...
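Before fitting the model, it is worth confirming that the split is an exact partition of the data, i.e., every row ends up in exactly one of the two sets. A minimal sketch:
# The two sets should cover all rows and share no ids
nrow(training) + nrow(testing) == nrow(data_scaled)
length(intersect(training$id, testing$id)) == 0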
Our evaluation of the regression can be divided as follows:
Our response variable is "like" (how much participant 1 liked participant 2).
In this model we include the main predictor variables (chosen based on the correlations seen above and some tweaking with the predictors that is not shown here). We will also try to observe the effect of the interactions between intelligence and sincerity and between intelligence and ambition. For the sake of simplicity we will use the usual rule of thumb of alpha = 0.05 (95% confidence intervals).
mod <- lm(like ~ fun + attr + shar + sinc + prob + intel + amb + sinc * intel + amb * intel,
data = training)
glance(mod)
tidy(mod,
conf.int = TRUE,
conf.level = 0.95)
tidy(mod,
conf.int = TRUE,
conf.level = 0.95) %>%
filter(term != "(Intercept)") %>%
ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high)) +
geom_errorbar(size = 0.8, width= 0.4) +
geom_point(color = "red", size = 2) +
geom_hline(yintercept = 0, colour = "darkred") +
labs(x = "Predictor variable",
y = "Estimated value (95% confidence)")
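To read the plot above programmatically, we can keep only the terms whose interval does not cross zero; a sketch reusing the same tidy() output:
tidy(mod, conf.int = TRUE, conf.level = 0.95) %>%
  filter(term != "(Intercept)",
         conf.low > 0 | conf.high < 0)  # keep terms whose C.I. excludes zero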
Let's keep the residuals in a separate data frame.
mod.res <- resid(mod)
std.resid <- rstandard(mod)
like <- training$like
resid_data <- data.frame(mod.res,std.resid,like,
stringsAsFactors=FALSE)
resid_data %>%
sample_n(10)
resid_data %>%
ggplot(aes(like, mod.res)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0,
color = "red") +
labs(x = "Response Variable (like)", y = "Residuals") +
ggtitle("Residual Plot")
The model does not seem to fit the data all that well.
mod %>%
ggplot(aes(.fitted, .resid)) +
geom_point() +
stat_smooth(method="loess") +
geom_hline(yintercept=0, col="red", linetype="dashed") +
xlab("Fitted values") + ylab("Residuals") +
ggtitle("Residual vs Fitted Plot")
The data doesn’t seem to demand a non-linear regression.
y <- quantile(resid_data$std.resid[!is.na(resid_data$std.resid)], c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
resid_data %>%
ggplot(aes(sample=std.resid)) +
stat_qq(shape=1, size=3) + # open circles
labs(title="Normal Q-Q", # plot title
x="Theoretical Quantiles", # x-axis label
y="Standardized Residuals") + # y-axis label
geom_abline(slope = slope,
color = "red",
size = 0.8,
intercept = int,
linetype="dashed") # dashed reference line
This suggests that a better combination of predictors could be found, although the current one still explains a sizable portion of the variability in the data.
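Beyond the visual check, a formal normality test on the standardized residuals can complement the Q-Q plot (R's shapiro.test accepts between 3 and 5,000 observations, so the training set fits). A sketch:
# Shapiro-Wilk normality test on the standardized residuals
shapiro.test(na.omit(resid_data$std.resid))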
mod %>%
ggplot(aes(.fitted,
sqrt(abs(.stdresid)))) +
geom_point(na.rm=TRUE) +
stat_smooth(method="loess",
na.rm = TRUE) +
labs(title = "Scale-Location",
x= "Fitted Value",
y = expression(sqrt("|Standardized residuals|")))
This confirms the results of the Q-Q plot.
mod %>%
ggplot(aes(.hat, .stdresid)) +
geom_point(aes(size=.cooksd), na.rm=TRUE) +
stat_smooth(method="loess", na.rm=TRUE) +
xlab("Leverage")+ylab("Standardized Residuals") +
ggtitle("Residual vs Leverage Plot") +
scale_size_continuous("Cook's Distance", range=c(1,5)) +
theme(legend.position="bottom")
There are no outliers/extreme values acting as influential cases, i.e., observations that would have an undue influence on the regression line.
mod %>%
ggplot(aes(.hat, .cooksd)) +
geom_point(na.rm=TRUE) +
stat_smooth(method="loess", na.rm=TRUE) +
xlab("Leverage hii")+ylab("Cook's Distance") +
ggtitle("Cook's dist vs Leverage hii/(1-hii)") +
geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")
This plot confirms the results of the Residuals vs. Leverage plot.
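That visual impression can be backed with numbers by counting observations whose Cook's distance exceeds the common 4/n rule of thumb (the threshold is a convention, not part of the original analysis). A sketch:
cooks <- cooks.distance(mod)
sum(cooks > 4 / length(cooks), na.rm = TRUE)  # observations above the 4/n threshold
max(cooks, na.rm = TRUE)                      # largest Cook's distance observed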
predictions <- mod %>% predict(testing)
data.frame( R2 = caret::R2(predictions, testing$like),
RMSE = caret::RMSE(predictions, testing$like),
MAE = caret::MAE(predictions, testing$like),
ERR = caret::RMSE(predictions, testing$like)/
mean(testing$like))
Now let's discuss the results obtained on the test data (the most meaningful ones).
While not completely bad, our model certainly leaves a lot of room for improvement.
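One way to put these numbers in perspective is to compare them against a trivial baseline that always predicts the training-set mean of "like"; any useful model should beat it by a clear margin. A sketch:
# Baseline: always predict the mean of "like" seen in training
baseline_pred <- rep(mean(training$like), nrow(testing))
data.frame(RMSE_model    = caret::RMSE(predictions, testing$like),
           RMSE_baseline = caret::RMSE(baseline_pred, testing$like))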
Let's now gather all the information our model provided and give a definitive answer.
Intelligence has a positive impact on (like): the smarter P1 thinks P2 is, the higher (like) will be.
At the 95% confidence level we have evidence that the effect of intelligence on (like) is significant (its C.I. lies entirely above 0).
Among the significant predictors, Intelligence has one of the smaller effects. If P2's perceived intelligence score goes up by one unit (one standard deviation, since the predictors were standardized), the like score is estimated to increase by roughly [0.10, 0.36], which is not that much, to be honest. Intelligence still has its value, but you should be more worried about whether P1 finds you funny (fun) or attractive (attr), as both have effects of higher magnitude.
To be honest, there is a lot of uncertainty around Intelligence, as [0.10, 0.36] is a wide interval. The uncertainty becomes even more apparent because, at the 95% confidence level, we have no evidence of a significant difference between the magnitude of Intelligence's effect and Sincerity's (sinc), nor between Intelligence's effect and Probability's (prob): Intelligence's C.I. encloses Sincerity's and Probability's.
Intelligence has an interaction with Sincerity and with Ambition; at the 95% confidence level we cannot be sure whether the effect of either interaction is positive or negative, as both C.I.s cross 0.
At the 95% confidence level, neither (Intelligence * Sincerity) nor (Intelligence * Ambition) is statistically significant.
In practical terms, neither (Intelligence * Sincerity) nor (Intelligence * Ambition) is relevant.
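For reference, the intervals these conclusions rely on can be pulled straight from the tidy() output; a sketch:
tidy(mod, conf.int = TRUE, conf.level = 0.95) %>%
  filter(term %in% c("intel", "sinc", "prob") | grepl(":", term)) %>%
  select(term, estimate, conf.low, conf.high)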