In this document the Exploratory Factor Analysis of Speed Dating Kaggle data is continued.
library(psych)
require(dplyr)
# loading the same data
dating <-read.csv("Speed Dating Data.csv")
dating1<- dating[c("imprace","imprelig", "date", "go_out", "sports",
"tvsports", "exercise", "dining" , "museums", "art",
"hiking", "gaming", "clubbing",
"reading", "tv", "theater", "movies", "concerts",
"music", "shopping", "yoga", "exphappy" , "attr1_1",
"sinc1_1", "intel1_1", "fun1_1", "amb1_1",
"shar1_1", "attr2_1", "sinc2_1", "intel2_1",
"fun2_1", "amb2_1", "shar2_1", "attr3_1", "sinc3_1",
"intel3_1", "fun3_1", "amb3_1")]
dating1 <- as.data.frame(dating1)
dating12 <- na.omit(dating1)
As it was shown in the EFA Part 1, the Speed Dating data is awful for this kind of analysis/ Still, for the purpose of learning, let’s bring it to some conclusion. In this part, I’m going to assess the convergence of question in the determined factors and try to fit a regression model with those factors solely.
For further analysis I am refering the model with the 6 factors model that assumes there is a correlation between factors (‘oblimin’ rotation). Let’s revise it’s loadings again
fa6 <- fa(dating1, nfactors=6, rotate="oblimin", fm="ml")
print(fa6$loadings, cutoff = 0, digits = 2)
##
## Loadings:
## ML5 ML4 ML6 ML1 ML2 ML3
## imprace -0.01 -0.06 0.19 0.04 -0.03 0.01
## imprelig -0.12 -0.15 0.17 0.22 -0.08 -0.04
## date -0.03 -0.10 -0.28 0.21 0.08 0.00
## go_out 0.11 0.00 -0.33 0.04 0.07 -0.05
## sports -0.14 0.16 0.24 -0.17 0.10 -0.03
## tvsports -0.11 0.09 0.23 -0.01 0.12 -0.15
## exercise -0.06 -0.07 0.29 -0.05 0.05 0.05
## dining 0.34 -0.05 0.35 0.03 -0.03 0.07
## museums 0.93 0.02 -0.03 -0.02 -0.04 0.00
## art 0.93 0.02 -0.01 -0.05 0.02 -0.02
## hiking 0.21 0.05 -0.02 0.01 0.11 0.04
## gaming -0.05 0.19 0.16 -0.09 0.04 -0.01
## clubbing 0.11 -0.01 0.23 -0.01 0.05 0.05
## reading 0.29 -0.01 -0.03 0.05 -0.12 0.03
## tv -0.01 -0.07 0.22 0.18 0.08 -0.02
## theater 0.56 -0.10 0.08 0.16 -0.01 0.05
## movies 0.34 -0.01 0.10 0.15 -0.04 0.01
## concerts 0.45 -0.01 0.09 0.05 0.06 -0.04
## music 0.32 0.00 0.20 0.06 0.09 0.01
## shopping 0.18 -0.18 0.39 0.10 0.05 0.05
## yoga 0.28 -0.05 0.16 0.11 0.06 -0.02
## exphappy 0.06 0.22 0.17 -0.12 0.01 0.02
## attr1_1 -0.01 -0.07 0.02 -0.93 0.17 -0.04
## sinc1_1 0.01 0.01 -0.25 0.49 0.24 -0.03
## intel1_1 0.02 -0.02 -0.03 0.07 -0.98 -0.01
## fun1_1 -0.08 0.05 0.11 0.20 0.17 0.28
## amb1_1 -0.01 -0.03 0.28 0.56 0.12 0.03
## shar1_1 0.03 0.11 -0.11 0.49 0.18 -0.17
## attr2_1 -0.01 -0.95 0.00 -0.07 -0.03 -0.17
## sinc2_1 0.00 0.65 -0.08 0.05 0.13 -0.13
## intel2_1 -0.02 0.65 0.10 -0.07 -0.22 -0.10
## fun2_1 0.00 0.03 -0.02 0.00 0.01 1.00
## amb2_1 -0.05 0.60 0.03 -0.05 0.04 -0.23
## shar2_1 0.11 0.41 -0.04 0.25 0.11 -0.12
## attr3_1 0.04 0.09 0.53 -0.17 -0.03 0.03
## sinc3_1 0.07 -0.05 0.18 0.23 0.09 -0.02
## intel3_1 -0.03 0.14 0.40 -0.04 -0.16 -0.04
## fun3_1 0.03 -0.03 0.64 0.05 0.15 -0.07
## amb3_1 -0.02 0.03 0.58 0.01 0.01 -0.06
##
## ML5 ML4 ML6 ML1 ML2 ML3
## SS loadings 2.93 2.55 2.39 2.13 1.36 1.28
## Proportion Var 0.08 0.07 0.06 0.05 0.03 0.03
## Cumulative Var 0.08 0.14 0.20 0.26 0.29 0.32
In the end of Part 1 I’ve determined the following labels for factors and the constituenting variables of the latter:
ML5 as “Cultural interests”: variables museums, art, theater, and concerts.
ML4 as “Personal characteristics_2”: variables attr2_1, sinc2_1, intel2_1, shar2_1, and amb2_1.
ML6 as “Personal characteristics_3”: variables amb3_1, fun3_1, intel3_1, and attr3_1.
ML2 as “Intelligence_1”: variable intel1_1.
ML1 as “Personal characteristics_1”: variables attr1_1, sinc1_1, amb1_1, and shar1_1.
ML3 as “Sense of humor_2”:variable fun2_1.
The mentioned variabes were chosen as representing the factors by their loadings’ value (with the threshold of 0.4). The remaining variables from the dataset which weren’t included can be dropped. It’s also advisable to get rid of “lonely” variables, i.e., the ones like in ML3 and ML2 - as a factor should be represented with at least 3 variables.
Let’s reconstruct the FA model dropping unnecessary variables and reducing the quantity of factors to 4.
dating0 <- dating1 %>% select(museums, art, concerts, theater, attr2_1, sinc2_1, intel2_1, shar2_1, amb2_1, amb3_1, fun3_1, intel3_1, attr3_1, attr1_1, sinc1_1, amb1_1)
fa4 <- fa(dating0, nfactors=4, rotate="oblimin", fm="ml")
print(fa4$loadings, cutoff = 0, digits = 2)
##
## Loadings:
## ML1 ML3 ML4 ML2
## museums 0.01 0.92 -0.01 -0.01
## art 0.01 0.94 0.02 0.04
## concerts -0.01 0.43 -0.01 -0.02
## theater -0.10 0.55 0.02 -0.17
## attr2_1 -0.98 -0.01 0.00 0.05
## sinc2_1 0.65 0.00 -0.08 0.04
## intel2_1 0.60 -0.01 0.05 0.04
## shar2_1 0.40 0.08 0.00 -0.19
## amb2_1 0.56 -0.06 0.11 0.10
## amb3_1 -0.01 -0.02 0.65 -0.02
## fun3_1 -0.04 0.05 0.57 0.01
## intel3_1 0.08 -0.03 0.48 -0.03
## attr3_1 0.07 0.05 0.57 0.14
## attr1_1 -0.04 -0.01 0.04 0.98
## sinc1_1 0.05 -0.01 -0.29 -0.39
## amb1_1 -0.03 -0.02 0.35 -0.54
##
## ML1 ML3 ML4 ML2
## SS loadings 2.25 2.24 1.53 1.50
## Proportion Var 0.14 0.14 0.10 0.09
## Cumulative Var 0.14 0.28 0.38 0.47
fa.diagram(fa4)
Nice, the diagram looks better now, even the correlation value between ML1 and ML2 appeared. Let’s look at the TLI and RMSEA of this model.
fa4$TLI
## [1] 0.7010984
fa4$RMSEA[1]
## RMSEA
## 0.1215271
Wow, TLI (the upper value) actually increased pretty much - it was about 0.27 in the previous model. Yet, the model is still of not perfect fit. RMSEA value decreased a little (it was 0.13), but is also higher than 0.08? incdicating a bad fit of the factor model.
Anyways, let’s proceed.
To assess how well the scales of fa4 model are fitted, internal consistency is measured. It can be done with the Cronbach’s alpha measure: a value bigger that 0.7 indicates good reliability.
ML3 <- as.data.frame(dating0[c("museums", "art", "theater", "concerts")])
alpha(ML3, keys = T)
##
## Reliability analysis
## Call: alpha(x = ML3, keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.81 0.81 0.81 0.52 4.4 0.0034 6.8 1.7 0.47
##
## lower alpha upper 95% confidence boundaries
## 0.81 0.81 0.82
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## museums 0.71 0.71 0.62 0.45 2.4 0.0055 0.0047 0.41
## art 0.71 0.71 0.63 0.45 2.4 0.0055 0.0081 0.41
## theater 0.78 0.78 0.79 0.55 3.6 0.0043 0.0727 0.40
## concerts 0.84 0.85 0.83 0.65 5.5 0.0031 0.0343 0.55
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## museums 8299 0.87 0.87 0.88 0.75 7.0 2.1
## art 8299 0.87 0.87 0.88 0.74 6.7 2.3
## theater 8299 0.78 0.78 0.64 0.59 6.8 2.2
## concerts 8299 0.68 0.69 0.49 0.46 6.8 2.2
The Cronbach’s alpha value > 0.7, so we can conclude that ML3 is reliable and we should keep all the variables assigned to it. Let’s see what other factors bring:
ML1 <- as.data.frame(dating0[c("attr2_1", "amb2_1", "sinc2_1", "intel2_1", "shar2_1")])
alpha(ML1, check.keys = T, keys = T)
##
## Reliability analysis
## Call: alpha(x = ML1, keys = T, check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.7 0.7 0.8 0.32 2.3 0.0029 24 6.3 0.25
##
## lower alpha upper 95% confidence boundaries
## 0.69 0.7 0.71
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## attr2_1- 0.43 0.43 0.38 0.16 0.75 0.0100 0.0088 0.19
## amb2_1 0.67 0.69 0.75 0.36 2.23 0.0033 0.0559 0.35
## sinc2_1 0.64 0.65 0.72 0.32 1.88 0.0036 0.0587 0.36
## intel2_1 0.66 0.67 0.74 0.34 2.06 0.0035 0.0549 0.33
## shar2_1 0.70 0.73 0.76 0.40 2.70 0.0032 0.0399 0.39
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## attr2_1- 8299 0.98 0.95 1.01 0.91 70 16.2
## amb2_1 8289 0.59 0.60 0.48 0.42 12 6.9
## sinc2_1 8299 0.67 0.67 0.59 0.52 13 7.0
## intel2_1 8299 0.62 0.63 0.54 0.47 14 6.3
## shar2_1 8289 0.49 0.52 0.39 0.32 12 6.2
The Cronbach’s alpha value > 0.7. ML1 is reliable and we should keep all the variables assigned to it.
ML2 <- as.data.frame(dating0[c("attr1_1", "amb1_1", "sinc1_1")])
alpha(ML2, check.keys = T, keys = T)
##
## Reliability analysis
## Call: alpha(x = ML2, keys = T, check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.57 0.57 0.57 0.3 1.3 0.0061 35 6.6 0.44
##
## lower alpha upper 95% confidence boundaries
## 0.56 0.57 0.58
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## attr1_1- 0.01 0.011 0.0053 0.0053 0.011 0.0214 NA 0.0053
## amb1_1 0.54 0.610 0.4391 0.4391 1.566 0.0082 NA 0.4391
## sinc1_1 0.54 0.639 0.4700 0.4700 1.774 0.0075 NA 0.4700
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## attr1_1- 8299 0.93 0.87 0.80 0.64 77 12.6
## amb1_1 8279 0.61 0.67 0.45 0.35 11 6.1
## sinc1_1 8299 0.63 0.66 0.42 0.34 17 7.0
The Cronbach’s alpha value is lower than minimally acceptable 0.65. That means ML2 is not reliable and we should reconsider the variables for this factor.
ML4 <- as.data.frame(dating0[c("attr3_1", "amb3_1", "fun3_1", "intel3_1")])
alpha(ML4, keys = T)
##
## Reliability analysis
## Call: alpha(x = ML4, keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.66 0.67 0.62 0.34 2.1 0.0059 7.7 1 0.34
##
## lower alpha upper 95% confidence boundaries
## 0.65 0.66 0.67
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## attr3_1 0.56 0.58 0.48 0.31 1.4 0.0077 0.0022 0.33
## amb3_1 0.62 0.62 0.53 0.36 1.7 0.0069 0.0081 0.37
## fun3_1 0.57 0.60 0.50 0.33 1.5 0.0078 0.0011 0.33
## intel3_1 0.62 0.63 0.54 0.36 1.7 0.0071 0.0048 0.35
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## attr3_1 8273 0.72 0.74 0.62 0.49 7.1 1.4
## amb3_1 8273 0.75 0.70 0.53 0.43 7.6 1.8
## fun3_1 8273 0.74 0.72 0.58 0.47 7.7 1.6
## intel3_1 8273 0.62 0.69 0.52 0.42 8.4 1.1
##
## Non missing response frequency for each item
## 2 3 4 5 6 7 8 9 10 miss
## attr3_1 0.00 0.02 0.03 0.08 0.13 0.35 0.27 0.09 0.03 0.01
## amb3_1 0.01 0.02 0.03 0.08 0.09 0.20 0.25 0.19 0.14 0.01
## fun3_1 0.01 0.01 0.01 0.04 0.12 0.21 0.27 0.22 0.11 0.01
## intel3_1 0.00 0.00 0.00 0.01 0.03 0.14 0.35 0.32 0.16 0.01
The Cronbach’s alpha value > 0.65 which is minimally acceptable. We conclude that ML4 is reliable and we should keep all the variables assigned to it.
Thus, among four factors there is only ML2 that raises suspicions in terms of its reliability.
Now, let’s save the factors’ scores in order to use them for regression modeling
fa4_wscores <- fa(dating0, nfactors=4, rotate="oblimin", fm="ml", scores = T)
fascores <-as.data.frame(fa4_wscores$scores)
datingfa<-cbind(dating0,fascores)
To the regression itself - I start with ML2 and ML3 as the independent variables predicting int_corr from the original dataset. The latter stands for the correlation between participant’s and partner’s ratings of interests in Time 1, taking up the range of numeric values from 0 to 1. The independent variable is chosen due to…well, there are not many (very few, actually) numeric variables in the initial dating data to choose from. Plus the correlation between participant’s and partner’s ratings of interests is distributed very nicely (normally), have a look:
require(ggplot2)
ggplot(dating, aes(x = int_corr)) +
labs(title = "Distribution of Outcome Values", subtitle = "Histogram") +
xlab("Correlation Between Participant’s & Partner’s Interests' Ratings") +
ylab("") +
geom_histogram(binwidth = 0.1, fill = "brown", col= "black", alpha = 0.95) +
theme_bw()
lm_data <- cbind(datingfa, dating$int_corr)
options(scipen = 999)
linear <- lm(dating$int_corr~ML3+ML2, data = lm_data)
summary(linear)
##
## Call:
## lm(formula = dating$int_corr ~ ML3 + ML2, data = lm_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09656 -0.21014 0.01954 0.22874 0.71755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.196034 0.003301 59.384 < 0.0000000000000002 ***
## ML3 0.052441 0.003515 14.921 < 0.0000000000000002 ***
## ML2 -0.013762 0.003514 -3.916 0.0000908 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2984 on 8171 degrees of freedom
## (204 observations deleted due to missingness)
## Multiple R-squared: 0.03082, Adjusted R-squared: 0.03059
## F-statistic: 129.9 on 2 and 8171 DF, p-value: < 0.00000000000000022
The summary output of the model shows that it has some right to exist - its overall p-value is much lower than 0.01. However, a very low R-squared value suggests that it is not the best model that can be chosen to predict the correlation between participant’s and partner’s ratings of interests in Time 1. It explains about 3% of variation in the outcome variable. Yet, the coefficients of factors as predictors are statistically signifficant.
The relationship between the factors and the outcome can be interpreted as follows:
the more points a person scores on ML3 (which stands for degree of interest in theater, museums, concerts, and art on the scale from 1 to 10), the more correlated participant’s and partner’s ratings of interests are. Which seems pretty obvious.
the more points a person scores on ML2 (which stands for degree of importance in such attributes as attractive, ambitious, and sincere of a possible date partner on the scale from 1 to 10), the less correlated participant’s and partner’s ratings of interests are. That’s kinda curious.
!BONUS: regression model diagnostics
plot(linear)
The diagnostics shows that the model is quite fine. The distribution of residual values in the outcome is approximately normal - on the normal probability plot residuals follow almost a straight line and the residuals are spread equally along the ranges of predictors (3rd plot). Though sight deviations from normality, , i.e., outliers, are present. Yetthose are not critical - there are no leverage values, i.e., ones falling below the red dashed line on the Cook’s distance plot (the last one).
That’s all :)