This analysis explores the topic of speed dating. It aims to provide a basic understanding of people's behaviour at such events, including the attributes most relevant to successful dating. The data set was gathered from speed-dating events held between 2002 and 2004. The research was carried out by Columbia Business School professors Ray Fisman and Sheena Iyengar as the basis for their paper “Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment”. Whilst the data set itself is quite large (8378 records), it is worth mentioning that it has a high likelihood of being biased, as all participants were Columbia Business School students, and thus any conclusions from this project may not necessarily be generalizable.
The data set can be obtained from the following link:
https://www.kaggle.com/annavictoria/speed-dating-experiment/downloads/Speed%20Dating%20Data.csv.zip
Metadata can be found here:
The data set includes ratings of speed-dating partners on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests.
The speed dates lasted four minutes each, and participants met roughly 10-20 partners at a single event. 21 events were recorded in total.
In addition, a questionnaire provided key personal data: demographics, dating habits, self-perception on the six attributes mentioned above, beliefs about what others find valuable in a mate, lifestyle information, etc.
Each row in the data set corresponds to a single four-minute date in which participant ‘iid’ met the partner identified by the corresponding ‘pid’.
Packages used:
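Based on the functions called throughout the analysis, the setup is assumed to be roughly as follows (the CSV file name is taken from the Kaggle download above; caret is only needed for the cross-validation later on):
library(dplyr)      # filter, mutate, group_by, summarise
library(ggplot2)    # qplot/ggplot graphics
library(broom)      # tidy()
library(GGally)     # ggpairs()
library(corrgram)   # corrgram()
library(gridExtra)  # grid.arrange()
library(car)        # spreadLevelPlot(), ncvTest(), durbinWatsonTest(), vif(), outlierTest()
library(rpart)      # regression trees
library(rattle)     # fancyRpartPlot()
library(caret)      # createFolds() for cross-validation
#load the data (assumes the Kaggle CSV has been extracted to the working directory; adjust the path if needed)
sdd <- read.csv("Speed Dating Data.csv")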
Because events 6-9 use a different methodology for rating attributes (see metadata), they are excluded from our data set:
sdd<-sdd %>%
filter(wave>9 | wave <6)
#income and tuition are stored as text with thousands separators; strip the commas and convert to numeric
sdd = sdd %>%
mutate(income=as.numeric(gsub(",","",income))) %>%
mutate(tuition=as.numeric(gsub(",","",tuition)))
When rating attributes, participants were asked to distribute 100 points across six variables. Sometimes the numbers did not add up to 100. To correct for these mistakes, the values were normalized for the variables attr1_1 to shar1_1. These variables capture how important each attribute is in a potential partner, as rated by the speed-dating participant.
sdd %>%
mutate(sum1_1=attr1_1+sinc1_1+intel1_1+fun1_1+amb1_1+shar1_1) %>%
do(tidy(.$attr1_1/(.$sum1_1/100)))
## # A tibble: 6,816 × 1
## x
## <dbl>
## 1 15
## 2 15
## 3 15
## 4 15
## 5 15
## 6 15
## 7 15
## 8 15
## 9 15
## 10 15
## # ... with 6,806 more rows
Next, we used mutate to introduce new, normalized variables:
sdd <- sdd %>%
mutate(sum1_1=attr1_1+sinc1_1+intel1_1+fun1_1+amb1_1+shar1_1) %>%
mutate(attr1_1n=attr1_1/(sum1_1/100)) %>%
mutate(sinc1_1n=sinc1_1/(sum1_1/100)) %>%
mutate(intel1_1n=intel1_1/(sum1_1/100)) %>%
mutate(fun1_1n=fun1_1/(sum1_1/100)) %>%
mutate(amb1_1n=amb1_1/(sum1_1/100)) %>%
mutate(shar1_1n=shar1_1/(sum1_1/100))
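As a quick sanity check (output not shown), the six normalised columns should now sum to roughly 100 for every respondent who answered all six items:
sdd %>%
mutate(total1_1n = attr1_1n + sinc1_1n + intel1_1n + fun1_1n + amb1_1n + shar1_1n) %>%
summarise(min_total = min(total1_1n, na.rm = TRUE), max_total = max(total1_1n, na.rm = TRUE))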
One problem with the data set is that it is biased: it was collected at Columbia Business School, so the conclusions of the analysis apply to this specific sample rather than being universal. Another problem is the large number of missing values, which makes the analysis more difficult and the results less accurate.
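To get a feel for how widespread the missing values are, the number of NAs per column can be inspected (a quick check, output not shown):
#number of missing values per column, largest first (top 10 only)
na_counts <- sort(colSums(is.na(sdd)), decreasing = TRUE)
head(na_counts, 10)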
About the survey data
The surveys, in which participants reflected on themselves and their partners, were administered at three different points in time:
We first took a quick look into some of the general data we have.
#row of dataset
nrow(sdd)
## [1] 6816
#number of variables
ncol(sdd)
## [1] 201
#number of matches and non-matches
number_of_match=sdd%>%group_by(match)%>%dplyr::summarise(number=n())
prop.table(number_of_match$number)
## [1] 0.8350939 0.1649061
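The two unlabelled proportions above correspond to match = 0 and match = 1 respectively; a labelled version of the same summary (sketch):
sdd %>%
dplyr::count(match) %>%
mutate(prop = n / sum(n))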
by_order<-sdd %>%
dplyr::group_by(order) %>%
dplyr::summarize(average=mean(match, na.rm=TRUE))
qplot(data=by_order, x=order, y=average, xlab="Order of Meeting", ylab="Match") + geom_smooth()
qplot(data=sdd, wave, binwidth=1, colour=I("white"), xlab="Event", ylab="Number of Dates", main="Number of Dates per Event")
Next we looked at the demographic profile of the people in our data set. We decided to look into age, gender and race.
mean(sdd$age,na.rm = TRUE)
## [1] 26.28174
qplot(data=sdd,x=age,geom="histogram",colour=I("white"))
number_of_gender=sdd%>%group_by(gender)%>%dplyr::summarise(number=n())
prop.table(number_of_gender$number)
## [1] 0.4992664 0.5007336
qplot(data=sdd,race,colour=I("white"),binwidth=1)
number_of_race=sdd%>%select(race)%>%na.omit()%>%group_by(race)%>%summarise(number=n())
prop.table(number_of_race$number)
## [1] 0.04557561 0.56022492 0.08419651 0.24400710 0.06599586
We also wanted to know more about the background of the people in our dataset.
qplot(data=sdd,field_cd,colour=I("white"),binwidth=1)
sdd%>%select(field)%>%na.omit()%>%group_by(field)%>%summarise(n())
## # A tibble: 215 × 2
## field `n()`
## <fctr> <int>
## 1 58
## 2 Acting 22
## 3 African-American Studies/History 15
## 4 American Studies 9
## 5 anthropology 10
## 6 Anthropology 9
## 7 Anthropology/Education 14
## 8 Applied Maths/Econs 16
## 9 Applied Physiology & Nutrition 18
## 10 Architecture 10
## # ... with 205 more rows
sdd%>%select(sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,tv,theater,movies,concerts,shopping,yoga)%>%na.omit()%>%summarise_each(funs(mean))
## sports tvsports exercise dining museums art hiking gaming
## 1 6.406408 4.540492 6.159745 7.808514 6.954761 6.71062 5.704835 3.848858
## clubbing tv theater movies concerts shopping yoga
## 1 5.695194 5.294571 6.801988 7.940967 6.909374 5.654257 4.349748
#join information of both partners: for dates where the participant said yes (dec==1),
#attach the partner's own record of the same date so their ratings can be compared
decd <- sdd %>% filter(dec==1)
join <- inner_join(x=decd, y=sdd, by=c("pid" = "iid"))
other <- join %>% filter(pid.y == iid)
#transform data from tabular format to long format
ldata <- rbind(
data.frame(measure="look for", feat="attr", rating=sdd$attr1_1, gender=sdd$gender),
data.frame(measure="look for", feat="sinc", rating=sdd$sinc1_1, gender=sdd$gender),
data.frame(measure="look for", feat="shar", rating=sdd$shar1_1, gender=sdd$gender),
data.frame(measure="look for", feat="intel", rating=sdd$intel1_1, gender=sdd$gender),
data.frame(measure="look for", feat="fun", rating=sdd$fun1_1, gender=sdd$gender),
data.frame(measure="look for", feat="amb", rating=sdd$amb1_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="attr", rating=sdd$attr4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="sinc", rating=sdd$sinc4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="shar", rating=sdd$shar4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="intel", rating=sdd$intel4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="fun", rating=sdd$fun4_1, gender=sdd$gender),
data.frame(measure="fellows look for", feat="amb", rating=sdd$amb4_1, gender=sdd$gender),
data.frame(measure="i rate my dec", feat="attr", rating=other$attr_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="sinc", rating=other$sinc_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="shar", rating=other$shar_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="intel", rating=other$intel_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="fun", rating=other$fun_o.y, gender=other$gender.x),
data.frame(measure="i rate my dec", feat="amb", rating=other$amb_o.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="attr", rating=other$attr5_1.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="sinc", rating=other$sinc5_1.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="intel", rating=other$intel5_1.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="fun", rating=other$fun5_1.y, gender=other$gender.x),
data.frame(measure="they rate my dec", feat="amb", rating=other$amb5_1.y, gender=other$gender.x)
)
#Calculate zScore to compensate for different measurements
stats <- ldata %>% group_by(measure) %>% summarise(avg = mean(rating, na.rm=T), sdv = sd(rating, na.rm=T))
zldata <- rbind(
ldata %>% filter(measure=="look for") %>% mutate(rating=(rating-stats$avg[1])/stats$sdv[1]),
ldata %>% filter(measure=="fellows look for") %>% mutate(rating=(rating-stats$avg[2])/stats$sdv[2]),
ldata %>% filter(measure=="i rate my dec") %>% mutate(rating=(rating-stats$avg[3])/stats$sdv[3]),
ldata %>% filter(measure=="they rate my dec") %>% mutate(rating=(rating-stats$avg[4])/stats$sdv[4])
)
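The block above relies on the rows of ‘stats’ appearing in the same order as the hard-coded indices 1-4. An equivalent, order-independent way to standardise within each measure (a sketch, producing the same result as zldata) is to join the group statistics back in by measure:
zldata_alt <- ldata %>%
left_join(stats, by = "measure") %>%
mutate(rating = (rating - avg) / sdv) %>%
dplyr::select(-avg, -sdv)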
#
ggplot(subset(zldata, rating<3 & rating >-3), aes(y=rating,x=feat, colour=measure), na.rm=T) +
geom_boxplot(outlier.size=0, fill = "white", alpha=.8) +
facet_grid(gender ~ .)
For the regression analysis we decided to first look at a derived variable, ‘positive responses’ (the percentage of partners who said yes to seeing the participant again after the four-minute date), as the dependent variable, since this seems a good indicator of how successful candidates were at dating. To do this we created a variable ‘positive_responses’ by summing how many ‘dec_o = 1’ each ‘iid’ received and dividing this by the total number of dates they went on.
We also created a series of variables unique to each ‘iid’ to serve as the independent variables. The most important of these are their average attribute scores, created by averaging the ratings that partners gave each ‘iid’ on each of the six attributes (for example, each participant received an average attractiveness score), along with the average correlation of interests:
gsdd <- sdd %>% group_by(iid, age, gender, income, exphappy, goal, go_out, date) %>%
summarise(y = mean(dec_o, na.rm = T),
attr = mean(attr_o, na.rm = T),
sinc = mean(sinc_o, na.rm = T),
intel = mean(intel_o, na.rm = T),
fun = mean(fun_o, na.rm = T),
amb = mean(amb_o, na.rm = T),
shar = mean(shar_o, na.rm = T),
cint = mean(int_corr, na.rm = T)) %>%
rename(exph = exphappy, inc = income, male=gender, gout = go_out)
The next step was to build and experiment with numerous linear regression models, first looking at each independent variable alone and then building up combinations of these variables. We ultimately found several models that showed significant p-values for the linear regression:
All variables were also run through GGally::ggpairs to see whether there were other potential IVs we had overlooked (income was removed to save processing time).
corrgram(gsdd, lower.panel=panel.cor,
upper.panel=panel.pie, text.panel=panel.txt,
main="Correlation of interesting variables")
ggpairs(gsdd)
fm1 = "y ~ attr"
model_1 = lm(data=gsdd, fm1)
summary(model_1)
##
## Call:
## lm(formula = fm1, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48459 -0.08640 -0.00720 0.09401 0.48513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.547853 0.035219 -15.56 <2e-16 ***
## attr 0.157051 0.005597 28.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1414 on 447 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.637
## F-statistic: 787.3 on 1 and 447 DF, p-value: < 2.2e-16
fm2 = "y ~ attr + shar"
model_2 = lm(data=gsdd, fm2)
summary(model_2)
##
## Call:
## lm(formula = fm2, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46077 -0.08450 -0.00428 0.08880 0.45175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.666995 0.038461 -17.342 < 2e-16 ***
## attr 0.126427 0.007164 17.649 < 2e-16 ***
## shar 0.056663 0.008795 6.443 3.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1354 on 446 degrees of freedom
## Multiple R-squared: 0.6687, Adjusted R-squared: 0.6672
## F-statistic: 450.1 on 2 and 446 DF, p-value: < 2.2e-16
fm3 = "y ~ attr + fun"
model_3 = lm(data=gsdd, fm3)
summary(model_3)
##
## Call:
## lm(formula = fm3, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38416 -0.09119 -0.00766 0.08506 0.46093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.706788 0.040965 -17.253 < 2e-16 ***
## attr 0.122397 0.007393 16.555 < 2e-16 ***
## fun 0.058245 0.008601 6.772 4.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1348 on 446 degrees of freedom
## Multiple R-squared: 0.6716, Adjusted R-squared: 0.6701
## F-statistic: 456.1 on 2 and 446 DF, p-value: < 2.2e-16
fm4 = "y ~ attr + fun + shar"
model_4 = lm(data=gsdd, fm4)
summary(model_4)
##
## Call:
## lm(formula = fm4, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39479 -0.08417 -0.00393 0.09095 0.45076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.726006 0.040991 -17.712 < 2e-16 ***
## attr 0.115435 0.007635 15.120 < 2e-16 ***
## fun 0.039242 0.010382 3.780 0.000178 ***
## shar 0.033801 0.010568 3.198 0.001481 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1335 on 445 degrees of freedom
## Multiple R-squared: 0.679, Adjusted R-squared: 0.6768
## F-statistic: 313.8 on 3 and 445 DF, p-value: < 2.2e-16
fm5 = "y ~ attr*fun*shar"
model_5 = lm(data=gsdd, fm5)
summary(model_5)
##
## Call:
## lm(formula = fm5, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40341 -0.07962 -0.00837 0.08123 0.46743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.660435 0.646451 2.569 0.010541 *
## attr -0.407726 0.137494 -2.965 0.003187 **
## fun -0.298734 0.112897 -2.646 0.008434 **
## shar -0.289769 0.133071 -2.178 0.029969 *
## attr:fun 0.073937 0.021306 3.470 0.000571 ***
## attr:shar 0.075859 0.026118 2.904 0.003864 **
## fun:shar 0.042748 0.020802 2.055 0.040469 *
## attr:fun:shar -0.010243 0.003736 -2.742 0.006363 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1303 on 441 degrees of freedom
## Multiple R-squared: 0.6969, Adjusted R-squared: 0.6921
## F-statistic: 144.9 on 7 and 441 DF, p-value: < 2.2e-16
fm6 = "y ~ attr + fun:shar"
model_6 = lm(data=gsdd, fm6)
summary(model_6)
##
## Call:
## lm(formula = fm6, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39249 -0.08549 -0.00247 0.08594 0.45351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5190200 0.0332373 -15.616 < 2e-16 ***
## attr 0.1144801 0.0075343 15.195 < 2e-16 ***
## fun:shar 0.0065750 0.0008346 7.878 2.57e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1327 on 446 degrees of freedom
## Multiple R-squared: 0.6821, Adjusted R-squared: 0.6807
## F-statistic: 478.5 on 2 and 446 DF, p-value: < 2.2e-16
fm7 = "y ~ attr:fun + fun:shar + attr:shar"
model_7 = lm(data=gsdd, fm7)
summary(model_7)
##
## Call:
## lm(formula = fm7, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39422 -0.08562 -0.00354 0.08259 0.46371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.184185 0.021851 -8.429 4.88e-16 ***
## attr:fun 0.010466 0.001419 7.375 8.10e-13 ***
## fun:shar -0.003923 0.001410 -2.782 0.00564 **
## attr:shar 0.009403 0.001829 5.141 4.11e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.131 on 445 degrees of freedom
## Multiple R-squared: 0.6906, Adjusted R-squared: 0.6885
## F-statistic: 331 on 3 and 445 DF, p-value: < 2.2e-16
fm8 = "y ~ attr:fun + attr:shar + shar:sinc"
model_8 = lm(data=gsdd, fm8)
summary(model_8)
##
## Call:
## lm(formula = fm8, data = gsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39512 -0.08233 -0.00157 0.08388 0.47759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.139444 0.030981 -4.501 8.65e-06 ***
## attr:fun 0.008380 0.001470 5.700 2.18e-08 ***
## attr:shar 0.010729 0.002130 5.036 6.90e-07 ***
## shar:sinc -0.003697 0.001350 -2.738 0.00644 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1311 on 445 degrees of freedom
## Multiple R-squared: 0.6904, Adjusted R-squared: 0.6883
## F-statistic: 330.8 on 3 and 445 DF, p-value: < 2.2e-16
# Next we ran ANOVA tests between our nested models to see whether they were significantly different, which they appeared to be.
anova(model_1,model_2)
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 446 8.1807 1 0.76139 41.51 3.045e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_1,model_3)
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + fun
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 446 8.1085 1 0.83367 45.855 4.029e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_1,model_4)
## Analysis of Variance Table
##
## Model 1: y ~ attr
## Model 2: y ~ attr + fun + shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 447 8.9421
## 2 445 7.9263 2 1.0159 28.517 2.224e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_4,model_5)
## Analysis of Variance Table
##
## Model 1: y ~ attr + fun + shar
## Model 2: y ~ attr * fun * shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 445 7.9263
## 2 441 7.4838 4 0.44242 6.5176 4.207e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_5,model_6)
## Analysis of Variance Table
##
## Model 1: y ~ attr * fun * shar
## Model 2: y ~ attr + fun:shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 441 7.4838
## 2 446 7.8499 -5 -0.36602 4.3137 0.0007659 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_5,model_7)
## Analysis of Variance Table
##
## Model 1: y ~ attr * fun * shar
## Model 2: y ~ attr:fun + fun:shar + attr:shar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 441 7.4838
## 2 445 7.6405 -4 -0.15669 2.3083 0.05725 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fms <- c(fm1, fm2, fm3, fm4, fm5, fm6, fm7, fm8)
## 10 -folded r square: 0.6151375 formula: y ~ attr
## 10 -folded r square: 0.6495073 formula: y ~ attr + shar
## 10 -folded r square: 0.6530901 formula: y ~ attr + fun
## 10 -folded r square: 0.6595179 formula: y ~ attr + fun + shar
## 10 -folded r square: 0.6745532 formula: y ~ attr*fun*shar
## 10 -folded r square: 0.6648757 formula: y ~ attr + fun:shar
## 10 -folded r square: 0.6709955 formula: y ~ attr:fun + fun:shar + attr:shar
## 10 -folded r square: 0.6711298 formula: y ~ attr:fun + attr:shar + shar:sinc
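The loop that produced these cross-validated R-squared values is not shown; one way they could be computed, assuming caret::createFolds for the fold assignment (as in the commented-out 5-fold code below), is sketched here:
#sketch: out-of-fold R-squared for each formula in fms
cv_rsq <- function(fm, data, k = 10) {
  d <- as.data.frame(data)
  d <- na.omit(d[, all.vars(as.formula(fm)), drop = FALSE])
  folds <- caret::createFolds(d$y, k = k, list = TRUE)
  preds <- rep(NA_real_, nrow(d))
  for (f in folds) {
    fit <- lm(as.formula(fm), data = d[-f, ])
    preds[f] <- predict(fit, newdata = d[f, ])
  }
  1 - sum((d$y - preds)^2) / sum((d$y - mean(d$y))^2)
}
for (fm in fms) {
  cat(10, "-folded r square:", cv_rsq(fm, gsdd), "formula:", fm, "\n")
}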
##5-FOLD VALIDATION
#
# yy = iid_table$iid
# flds <- createFolds(yy, k = 5, list = TRUE, returnTrain = FALSE)
# flds
#
# training_no5 <- rbind(iid_table[ flds[[1]], ],iid_table[ flds[[2]], ],iid_table[ flds[[3]], ],iid_table[ flds[[4]], ])
# training_no4 <- rbind(iid_table[ flds[[1]], ],iid_table[ flds[[2]], ],iid_table[ flds[[3]], ],iid_table[ flds[[5]], ])
# training_no3 <- rbind(iid_table[ flds[[1]], ],iid_table[ flds[[2]], ],iid_table[ flds[[5]], ],iid_table[ flds[[4]], ])
# training_no2 <- rbind(iid_table[ flds[[1]], ],iid_table[ flds[[5]], ],iid_table[ flds[[3]], ],iid_table[ flds[[4]], ])
# training_no1 <- rbind(iid_table[ flds[[5]], ],iid_table[ flds[[2]], ],iid_table[ flds[[3]], ],iid_table[ flds[[4]], ])
#
# testing_5 <- iid_table[ flds[[5]], ]
# testing_4 <- iid_table[ flds[[4]], ]
# testing_3 <- iid_table[ flds[[3]], ]
# testing_2 <- iid_table[ flds[[2]], ]
# testing_1 <- iid_table[ flds[[1]], ]
#
# model_4 = lm(data=iid_table, y ~ attr_scores + fun + shar_scores)
# summary(model_4)
#
# model_4_no5 = lm(data=training_no5, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no5)
# Rsqtraining_no5 <- summary(model_4_no5)$r.squared
# Rsqtraining_no5
#
# model_4_no4 = lm(data=training_no4, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no4)
# Rsqtraining_no4 <- summary(model_4_no4)$r.squared
# Rsqtraining_no4
#
# model_4_no3 = lm(data=training_no3, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no3)
# Rsqtraining_no3 <- summary(model_4_no3)$r.squared
# Rsqtraining_no3
#
# model_4_no2 = lm(data=training_no2, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no2)
# Rsqtraining_no2 <- summary(model_4_no2)$r.squared
# Rsqtraining_no2
#
# model_4_no1 = lm(data=training_no1, y ~ attr_scores + fun + shar_scores)
# summary(model_4_no1)
# Rsqtraining_no1 <- summary(model_4_no1)$r.squared
# Rsqtraining_no1
# testing
#
# ybar_test5 = mean(testing_5$y)
# yhat_test5 = predict(model_4_no5, newdata = testing_5)
# SSE_5 = (yhat_test5 - testing_5$y)^2 %>% sum()
# SST_5 = (ybar_test5 - testing_5$y)^2 %>% sum()
# Rsqtesting_5 = 1-(SSE_5/SST_5)
# Rsqtesting_5
#
# ybar_test4 = mean(testing_4$y)
# yhat_test4 = predict(model_4_no4, newdata = testing_4)
# SSE_4 = (yhat_test4 - testing_4$y)^2 %>% sum()
# SST_4 = (ybar_test4 - testing_4$y)^2 %>% sum()
# Rsqtesting_4 = 1-(SSE_4/SST_4)
# Rsqtesting_4
#
# ybar_test3 = mean(testing_3$y)
# yhat_test3 = predict(model_4_no3, newdata = testing_3)
# SSE_3 = (yhat_test3 - testing_3$y)^2 %>% sum()
# SST_3 = (ybar_test3 - testing_3$y)^2 %>% sum()
# Rsqtesting_3 = 1-(SSE_3/SST_3)
# Rsqtesting_3
#
# ybar_test2 = mean(testing_2$y)
# yhat_test2 = predict(model_4_no2, newdata = testing_2)
# SSE_2 = (yhat_test2 - testing_2$y)^2 %>% sum()
# SST_2 = (ybar_test2 - testing_2$y)^2 %>% sum()
# Rsqtesting_2 = 1-(SSE_2/SST_2)
# Rsqtesting_2
#
# ybar_test1 = mean(testing_1$y)
# yhat_test1 = predict(model_4_no1, newdata = testing_1)
# SSE_1 = (yhat_test1 - testing_1$y)^2 %>% sum()
# SST_1 = (ybar_test1 - testing_1$y)^2 %>% sum()
# Rsqtesting_1 = 1-(SSE_1/SST_1)
# Rsqtesting_1
#
#
# Final_Rsq_for_model_4 <- mean(Rsqtesting_1, Rsqtesting_2, Rsqtesting_3, Rsqtesting_4, Rsqtesting_5)
# Final_Rsq_for_model_4
## Final Rsquare value: 0.6697428
# Using a regression tree for the same dependent variable
reg_tree_1<-rpart(data=gsdd, y ~ .)
fancyRpartPlot(reg_tree_1)
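# To see how much each split contributes, the tree's complexity-parameter table can also be inspected (sketch):
printcp(reg_tree_1)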
## Assumption 1: A linear relationship between the IVs and the DV exists
qplot(predict(model_4), rstandard(model_4), geom="point") + geom_hline(yintercept=0, colour=I("blue"), alpha=I(0.5))
qplot(predict(model_5), rstandard(model_5), geom="point") + geom_hline(yintercept=0, colour=I("blue"), alpha=I(0.5))
# Whilst there does seem to be some kind of limit resulting in there being no residuals in the bottom left corner of the graph (i.e. the residuals are not totally randomly distributed), overall there is no clear relationship between the residuals and the predicted values (e.g. a linear, exponential, or sinusoidal pattern), which suggests that the IVs are generally linearly related to the DV.
## A2: Residuals are distributed normally
q1 = qplot(rstandard(model_4), geom="blank") +
geom_histogram(aes(y=..density..), colour=I("gray"), binwidth=0.5)+
stat_function(fun=dnorm, args=list(mean=0, sd=1),
colour=I("red"), alpha=I(0.5))
q2 = qplot(sample=rstandard(model_4)) +
geom_abline(slope=1,intercept=0)
grid.arrange(q1, q2, nrow=1)
qq1 = qplot(rstandard(model_5), geom="blank") +
geom_histogram(aes(y=..density..), colour=I("gray"), binwidth=0.5)+
stat_function(fun=dnorm, args=list(mean=0, sd=1),
colour=I("red"), alpha=I(0.5))
qq2 = qplot(sample=rstandard(model_5)) +
geom_abline(slope=1,intercept=0)
grid.arrange(qq1, qq2, nrow=1)
# Although a few points early on fall below the theoretical line and a few points at the end fall above it, the residuals generally seem to follow a very clear normal distribution.
## A3: Homoscedasticity of residuals
qplot(predict(model_4), rstandard(model_4), geom="point") + geom_hline(yintercept=0) +
geom_hline(yintercept=2, colour = I("red"), alpha=I(0.5)) +
geom_hline(yintercept=-2, colour = I("red"), alpha=I(0.5))
qplot(predict(model_5), rstandard(model_5), geom="point") + geom_hline(yintercept=0) +
geom_hline(yintercept=2, colour = I("red"), alpha=I(0.5)) +
geom_hline(yintercept=-2, colour = I("red"), alpha=I(0.5))
# Whilst this assumption is clearly violated to some extent (there are no two lines that can easily capture all points), this is mostly due to there being no values in the bottom right of the graph. So although the assumption of homoscedasticity does not fully hold, the violation is not severe enough to undermine the entire linear regression model.
# We also decided to test homoscedasticity using functions from the 'car' package to double-check the results
spreadLevelPlot(model_4)
## Warning in spreadLevelPlot.lm(model_4): 12 negative fitted values removed
##
## Suggested power transformation: 0.9404072
spreadLevelPlot(model_5)
## Warning in spreadLevelPlot.lm(model_5): 1 negative fitted value removed
##
## Suggested power transformation: 0.8267515
# As the spreadLevelPlot above shows, the red line is not horizontal (which would indicate that the homoscedasticity assumption is fully justified), though the slope is quite gentle, suggesting the issue is not too serious.
#ncvTest(model_4)
ncvTest(model_5)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 10.66148 Df = 1 p = 0.001093899
# The NCV test also shows a fairly low p-value, suggesting the homoscedasticity assumption has been violated in this case, though only marginally. Whilst we must be clear that the current linear regression model violates assumption 3, we choose to continue with the results regardless until the model can be improved.
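# One possible refinement, following the power transformation suggested by spreadLevelPlot above, would be to refit the interaction model on a transformed response and re-run the test (a sketch, not part of the original analysis):
gsdd_t <- gsdd %>% ungroup() %>% mutate(y_t = y^0.8267515)
model_5t <- lm(y_t ~ attr*fun*shar, data = gsdd_t)
ncvTest(model_5t)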
## A4: Independence of errors (residuals)
durbinWatsonTest(model_4)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1165655 1.762584 0.014
## Alternative hypothesis: rho != 0
durbinWatsonTest(model_5)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1439494 1.705347 0.002
## Alternative hypothesis: rho != 0
# Unfortunately the linear regression model does not pass the test of independence (with a statistic of 1.76 and a p-value < 0.05), meaning that we cannot treat the errors as fully independent. One reason for this could be that the data set is biased, as all subjects were students at Columbia University rather than a random sample.
## A5: Multicollinearity
vif(model_4)
## attr fun shar
## 2.089705 2.854899 2.656638
vif(model_5)
## attr fun shar attr:fun attr:shar
## 711.3610 354.3357 442.0814 1894.4840 2240.3416
## fun:shar attr:fun:shar
## 1326.8614 3720.7749
# For model_4, none of the VIF statistics are >5, so there is no strong multicollinearity between the predictors, justifying keeping all of them in the model. The very large VIFs for model_5 are expected, since its interaction terms are constructed directly from the main effects.
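# Centring the predictors before forming the interactions is a common way to reduce this inflation (sketch):
gsdd_c <- gsdd %>% ungroup() %>%
mutate(attr_c = attr - mean(attr, na.rm = TRUE),
fun_c = fun - mean(fun, na.rm = TRUE),
shar_c = shar - mean(shar, na.rm = TRUE))
model_5c <- lm(y ~ attr_c*fun_c*shar_c, data = gsdd_c)
vif(model_5c)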
## Outliers
outlierTest(model_4)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 140 3.424749 0.00067253 0.30197
outlierTest(model_5)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 140 3.647307 0.00029674 0.13324
# There appear to be no outliers more than three standard deviations away from the mean. There is thus no good reason to consider removing any of the observations.
## Influential points
influencePlot(model_4, id.method = "identify")
# 399
dfbetaPlots(model_4, id.method = "identify")
#253
#147
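# The flagged rows can be inspected directly before deciding whether any action is needed (sketch; indices as reported by the plots above):
gsdd %>% ungroup() %>% slice(c(147, 253, 399))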
# stepwise regression is hated by statisticians - read criticism below
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
ngsdd <- na.omit(gsdd)
null= lm(data=ngsdd, y ~ 1) # empty model
full = lm(data=ngsdd, y ~ .) # full model with all variables '.'=all
## define starting point and ending point
step_sdd = stepAIC(null, scope=list(lower=null, upper=full), direction = "forward")
## Start: AIC=-644.08
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + attr 1 7.3489 4.7415 -849.88
## + fun 1 5.2604 6.8300 -768.86
## + shar 1 4.4567 7.6337 -744.16
## + amb 1 1.6008 10.4896 -673.61
## + intel 1 1.5756 10.5147 -673.08
## + sinc 1 1.5232 10.5672 -671.97
## + male 1 0.4530 11.6374 -650.56
## + date 1 0.2947 11.7957 -647.56
## + gout 1 0.2065 11.8839 -645.90
## + age 1 0.1770 11.9134 -645.35
## + goal 1 0.1152 11.9752 -644.20
## <none> 12.0904 -644.08
## + inc 1 0.0176 12.0728 -642.40
## + cint 1 0.0052 12.0852 -642.17
## + iid 1 0.0013 12.0891 -642.10
## + exph 1 0.0005 12.0899 -642.09
##
## Step: AIC=-849.88
## y ~ attr
##
## Df Sum of Sq RSS AIC
## + fun 1 0.36839 4.3731 -865.84
## + shar 1 0.20926 4.5323 -857.90
## + sinc 1 0.08534 4.6562 -851.92
## + intel 1 0.06992 4.6716 -851.18
## <none> 4.7415 -849.88
## + amb 1 0.04039 4.7011 -849.78
## + inc 1 0.03837 4.7031 -849.69
## + age 1 0.03397 4.7075 -849.48
## + goal 1 0.02046 4.7211 -848.84
## + gout 1 0.02028 4.7212 -848.83
## + exph 1 0.01784 4.7237 -848.72
## + male 1 0.01169 4.7298 -848.43
## + date 1 0.00633 4.7352 -848.18
## + iid 1 0.00273 4.7388 -848.01
## + cint 1 0.00088 4.7406 -847.92
##
## Step: AIC=-865.84
## y ~ attr + fun
##
## Df Sum of Sq RSS AIC
## + goal 1 0.042571 4.3306 -866.01
## <none> 4.3731 -865.84
## + age 1 0.029251 4.3439 -865.33
## + inc 1 0.028543 4.3446 -865.29
## + gout 1 0.021939 4.3512 -864.95
## + intel 1 0.020323 4.3528 -864.87
## + shar 1 0.017080 4.3560 -864.71
## + sinc 1 0.013913 4.3592 -864.55
## + exph 1 0.013786 4.3593 -864.54
## + male 1 0.010973 4.3622 -864.40
## + date 1 0.004039 4.3691 -864.04
## + amb 1 0.003222 4.3699 -864.00
## + iid 1 0.001729 4.3714 -863.93
## + cint 1 0.000020 4.3731 -863.84
##
## Step: AIC=-866.01
## y ~ attr + fun + goal
##
## Df Sum of Sq RSS AIC
## <none> 4.3306 -866.01
## + inc 1 0.037741 4.2928 -865.95
## + intel 1 0.030046 4.3005 -865.56
## + gout 1 0.023290 4.3073 -865.21
## + sinc 1 0.022663 4.3079 -865.17
## + age 1 0.019077 4.3115 -864.99
## + shar 1 0.016683 4.3139 -864.87
## + exph 1 0.014120 4.3164 -864.74
## + male 1 0.008219 4.3223 -864.43
## + date 1 0.005294 4.3253 -864.28
## + amb 1 0.004235 4.3263 -864.23
## + iid 1 0.001596 4.3290 -864.09
## + cint 1 0.000448 4.3301 -864.03
## the following shows the stepwise path that produced the selected model
step_sdd$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## y ~ 1
##
## Final Model:
## y ~ attr + fun + goal
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 221 12.090388 -644.0792
## 2 + attr 1 7.34887647 220 4.741512 -849.8833
## 3 + fun 1 0.36838754 219 4.373124 -865.8383
## 4 + goal 1 0.04257121 218 4.330553 -866.0100
summary(step_sdd)
##
## Call:
## lm(formula = y ~ attr + fun + goal, data = ngsdd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39987 -0.09493 -0.00725 0.07484 0.44372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.680014 0.064670 -10.515 < 2e-16 ***
## attr 0.123588 0.011383 10.857 < 2e-16 ***
## fun 0.056796 0.012810 4.434 1.47e-05 ***
## goal -0.009072 0.006197 -1.464 0.145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1409 on 218 degrees of freedom
## Multiple R-squared: 0.6418, Adjusted R-squared: 0.6369
## F-statistic: 130.2 on 3 and 218 DF, p-value: < 2.2e-16