Intro

In this document the Exploratory Factor Analysis of Speed Dating Kaggle data is continued.

library(psych)
require(dplyr)

# loading the same data
dating <-read.csv("Speed Dating Data.csv")
dating1<- dating[c("imprace","imprelig", "date", "go_out", "sports", 
                   "tvsports", "exercise",  "dining" , "museums",  "art",  
                   "hiking", "gaming",  "clubbing",  
                   "reading", "tv",  "theater", "movies",  "concerts",   
                   "music",   "shopping",   "yoga", "exphappy" , "attr1_1",
                   "sinc1_1",   "intel1_1", "fun1_1",   "amb1_1",   
                   "shar1_1", "attr2_1", "sinc2_1",   "intel2_1",   
                   "fun2_1",   "amb2_1",   "shar2_1",   "attr3_1",   "sinc3_1",
                   "intel3_1",   "fun3_1",   "amb3_1")]
dating1 <- as.data.frame(dating1)
dating12 <- na.omit(dating1) 

As it was shown in the EFA Part 1, the Speed Dating data is awful for this kind of analysis/ Still, for the purpose of learning, let’s bring it to some conclusion. In this part, I’m going to assess the convergence of question in the determined factors and try to fit a regression model with those factors solely.

6 Factors Model

For further analysis I am refering the model with the 6 factors model that assumes there is a correlation between factors (‘oblimin’ rotation). Let’s revise it’s loadings again

fa6 <- fa(dating1, nfactors=6, rotate="oblimin", fm="ml") 
print(fa6$loadings, cutoff = 0, digits = 2)
## 
## Loadings:
##          ML5   ML4   ML6   ML1   ML2   ML3  
## imprace  -0.01 -0.06  0.19  0.04 -0.03  0.01
## imprelig -0.12 -0.15  0.17  0.22 -0.08 -0.04
## date     -0.03 -0.10 -0.28  0.21  0.08  0.00
## go_out    0.11  0.00 -0.33  0.04  0.07 -0.05
## sports   -0.14  0.16  0.24 -0.17  0.10 -0.03
## tvsports -0.11  0.09  0.23 -0.01  0.12 -0.15
## exercise -0.06 -0.07  0.29 -0.05  0.05  0.05
## dining    0.34 -0.05  0.35  0.03 -0.03  0.07
## museums   0.93  0.02 -0.03 -0.02 -0.04  0.00
## art       0.93  0.02 -0.01 -0.05  0.02 -0.02
## hiking    0.21  0.05 -0.02  0.01  0.11  0.04
## gaming   -0.05  0.19  0.16 -0.09  0.04 -0.01
## clubbing  0.11 -0.01  0.23 -0.01  0.05  0.05
## reading   0.29 -0.01 -0.03  0.05 -0.12  0.03
## tv       -0.01 -0.07  0.22  0.18  0.08 -0.02
## theater   0.56 -0.10  0.08  0.16 -0.01  0.05
## movies    0.34 -0.01  0.10  0.15 -0.04  0.01
## concerts  0.45 -0.01  0.09  0.05  0.06 -0.04
## music     0.32  0.00  0.20  0.06  0.09  0.01
## shopping  0.18 -0.18  0.39  0.10  0.05  0.05
## yoga      0.28 -0.05  0.16  0.11  0.06 -0.02
## exphappy  0.06  0.22  0.17 -0.12  0.01  0.02
## attr1_1  -0.01 -0.07  0.02 -0.93  0.17 -0.04
## sinc1_1   0.01  0.01 -0.25  0.49  0.24 -0.03
## intel1_1  0.02 -0.02 -0.03  0.07 -0.98 -0.01
## fun1_1   -0.08  0.05  0.11  0.20  0.17  0.28
## amb1_1   -0.01 -0.03  0.28  0.56  0.12  0.03
## shar1_1   0.03  0.11 -0.11  0.49  0.18 -0.17
## attr2_1  -0.01 -0.95  0.00 -0.07 -0.03 -0.17
## sinc2_1   0.00  0.65 -0.08  0.05  0.13 -0.13
## intel2_1 -0.02  0.65  0.10 -0.07 -0.22 -0.10
## fun2_1    0.00  0.03 -0.02  0.00  0.01  1.00
## amb2_1   -0.05  0.60  0.03 -0.05  0.04 -0.23
## shar2_1   0.11  0.41 -0.04  0.25  0.11 -0.12
## attr3_1   0.04  0.09  0.53 -0.17 -0.03  0.03
## sinc3_1   0.07 -0.05  0.18  0.23  0.09 -0.02
## intel3_1 -0.03  0.14  0.40 -0.04 -0.16 -0.04
## fun3_1    0.03 -0.03  0.64  0.05  0.15 -0.07
## amb3_1   -0.02  0.03  0.58  0.01  0.01 -0.06
## 
##                 ML5  ML4  ML6  ML1  ML2  ML3
## SS loadings    2.93 2.55 2.39 2.13 1.36 1.28
## Proportion Var 0.08 0.07 0.06 0.05 0.03 0.03
## Cumulative Var 0.08 0.14 0.20 0.26 0.29 0.32

In the end of Part 1 I’ve determined the following labels for factors and the constituenting variables of the latter:

  • ML5 as “Cultural interests”: variables museums, art, theater, and concerts.

  • ML4 as “Personal characteristics_2”: variables attr2_1, sinc2_1, intel2_1, shar2_1, and amb2_1.

  • ML6 as “Personal characteristics_3”: variables amb3_1, fun3_1, intel3_1, and attr3_1.

  • ML2 as “Intelligence_1”: variable intel1_1.

  • ML1 as “Personal characteristics_1”: variables attr1_1, sinc1_1, amb1_1, and shar1_1.

  • ML3 as “Sense of humor_2”:variable fun2_1.

The mentioned variabes were chosen as representing the factors by their loadings’ value (with the threshold of 0.4). The remaining variables from the dataset which weren’t included can be dropped. It’s also advisable to get rid of “lonely” variables, i.e., the ones like in ML3 and ML2 - as a factor should be represented with at least 3 variables.

Let’s reconstruct the FA model dropping unnecessary variables and reducing the quantity of factors to 4.

dating0 <- dating1 %>% select(museums, art, concerts, theater, attr2_1, sinc2_1, intel2_1, shar2_1, amb2_1, amb3_1, fun3_1, intel3_1, attr3_1, attr1_1, sinc1_1, amb1_1)

fa4 <- fa(dating0, nfactors=4, rotate="oblimin", fm="ml") 
print(fa4$loadings, cutoff = 0, digits = 2)
## 
## Loadings:
##          ML1   ML3   ML4   ML2  
## museums   0.01  0.92 -0.01 -0.01
## art       0.01  0.94  0.02  0.04
## concerts -0.01  0.43 -0.01 -0.02
## theater  -0.10  0.55  0.02 -0.17
## attr2_1  -0.98 -0.01  0.00  0.05
## sinc2_1   0.65  0.00 -0.08  0.04
## intel2_1  0.60 -0.01  0.05  0.04
## shar2_1   0.40  0.08  0.00 -0.19
## amb2_1    0.56 -0.06  0.11  0.10
## amb3_1   -0.01 -0.02  0.65 -0.02
## fun3_1   -0.04  0.05  0.57  0.01
## intel3_1  0.08 -0.03  0.48 -0.03
## attr3_1   0.07  0.05  0.57  0.14
## attr1_1  -0.04 -0.01  0.04  0.98
## sinc1_1   0.05 -0.01 -0.29 -0.39
## amb1_1   -0.03 -0.02  0.35 -0.54
## 
##                 ML1  ML3  ML4  ML2
## SS loadings    2.25 2.24 1.53 1.50
## Proportion Var 0.14 0.14 0.10 0.09
## Cumulative Var 0.14 0.28 0.38 0.47
fa.diagram(fa4)

Nice, the diagram looks better now, even the correlation value between ML1 and ML2 appeared. Let’s look at the TLI and RMSEA of this model.

fa4$TLI
## [1] 0.7010984
fa4$RMSEA[1]
##     RMSEA 
## 0.1215271

Wow, TLI (the upper value) actually increased pretty much - it was about 0.27 in the previous model. Yet, the model is still of not perfect fit. RMSEA value decreased a little (it was 0.13), but is also higher than 0.08? incdicating a bad fit of the factor model.

Anyways, let’s proceed.

Measuring Convergence

To assess how well the scales of fa4 model are fitted, internal consistency is measured. It can be done with the Cronbach’s alpha measure: a value bigger that 0.7 indicates good reliability.

  • ML3
ML3 <- as.data.frame(dating0[c("museums", "art", "theater", "concerts")])
alpha(ML3, keys = T)
## 
## Reliability analysis   
## Call: alpha(x = ML3, keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean  sd median_r
##       0.81      0.81    0.81      0.52 4.4 0.0034  6.8 1.7     0.47
## 
##  lower alpha upper     95% confidence boundaries
## 0.81 0.81 0.82 
## 
##  Reliability if an item is dropped:
##          raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## museums       0.71      0.71    0.62      0.45 2.4   0.0055 0.0047  0.41
## art           0.71      0.71    0.63      0.45 2.4   0.0055 0.0081  0.41
## theater       0.78      0.78    0.79      0.55 3.6   0.0043 0.0727  0.40
## concerts      0.84      0.85    0.83      0.65 5.5   0.0031 0.0343  0.55
## 
##  Item statistics 
##             n raw.r std.r r.cor r.drop mean  sd
## museums  8299  0.87  0.87  0.88   0.75  7.0 2.1
## art      8299  0.87  0.87  0.88   0.74  6.7 2.3
## theater  8299  0.78  0.78  0.64   0.59  6.8 2.2
## concerts 8299  0.68  0.69  0.49   0.46  6.8 2.2

The Cronbach’s alpha value > 0.7, so we can conclude that ML3 is reliable and we should keep all the variables assigned to it. Let’s see what other factors bring:

  • ML1
ML1 <- as.data.frame(dating0[c("attr2_1", "amb2_1", "sinc2_1", "intel2_1", "shar2_1")])
alpha(ML1, check.keys = T, keys = T)
## 
## Reliability analysis   
## Call: alpha(x = ML1, keys = T, check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean  sd median_r
##        0.7       0.7     0.8      0.32 2.3 0.0029   24 6.3     0.25
## 
##  lower alpha upper     95% confidence boundaries
## 0.69 0.7 0.71 
## 
##  Reliability if an item is dropped:
##          raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
## attr2_1-      0.43      0.43    0.38      0.16 0.75   0.0100 0.0088  0.19
## amb2_1        0.67      0.69    0.75      0.36 2.23   0.0033 0.0559  0.35
## sinc2_1       0.64      0.65    0.72      0.32 1.88   0.0036 0.0587  0.36
## intel2_1      0.66      0.67    0.74      0.34 2.06   0.0035 0.0549  0.33
## shar2_1       0.70      0.73    0.76      0.40 2.70   0.0032 0.0399  0.39
## 
##  Item statistics 
##             n raw.r std.r r.cor r.drop mean   sd
## attr2_1- 8299  0.98  0.95  1.01   0.91   70 16.2
## amb2_1   8289  0.59  0.60  0.48   0.42   12  6.9
## sinc2_1  8299  0.67  0.67  0.59   0.52   13  7.0
## intel2_1 8299  0.62  0.63  0.54   0.47   14  6.3
## shar2_1  8289  0.49  0.52  0.39   0.32   12  6.2

The Cronbach’s alpha value > 0.7. ML1 is reliable and we should keep all the variables assigned to it.

  • ML2
ML2 <- as.data.frame(dating0[c("attr1_1", "amb1_1", "sinc1_1")])
alpha(ML2, check.keys = T, keys = T)
## 
## Reliability analysis   
## Call: alpha(x = ML2, keys = T, check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean  sd median_r
##       0.57      0.57    0.57       0.3 1.3 0.0061   35 6.6     0.44
## 
##  lower alpha upper     95% confidence boundaries
## 0.56 0.57 0.58 
## 
##  Reliability if an item is dropped:
##          raw_alpha std.alpha G6(smc) average_r   S/N alpha se var.r  med.r
## attr1_1-      0.01     0.011  0.0053    0.0053 0.011   0.0214    NA 0.0053
## amb1_1        0.54     0.610  0.4391    0.4391 1.566   0.0082    NA 0.4391
## sinc1_1       0.54     0.639  0.4700    0.4700 1.774   0.0075    NA 0.4700
## 
##  Item statistics 
##             n raw.r std.r r.cor r.drop mean   sd
## attr1_1- 8299  0.93  0.87  0.80   0.64   77 12.6
## amb1_1   8279  0.61  0.67  0.45   0.35   11  6.1
## sinc1_1  8299  0.63  0.66  0.42   0.34   17  7.0

The Cronbach’s alpha value is lower than minimally acceptable 0.65. That means ML2 is not reliable and we should reconsider the variables for this factor.

  • ML4
ML4 <- as.data.frame(dating0[c("attr3_1", "amb3_1", "fun3_1", "intel3_1")])
alpha(ML4, keys = T)
## 
## Reliability analysis   
## Call: alpha(x = ML4, keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean sd median_r
##       0.66      0.67    0.62      0.34 2.1 0.0059  7.7  1     0.34
## 
##  lower alpha upper     95% confidence boundaries
## 0.65 0.66 0.67 
## 
##  Reliability if an item is dropped:
##          raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## attr3_1       0.56      0.58    0.48      0.31 1.4   0.0077 0.0022  0.33
## amb3_1        0.62      0.62    0.53      0.36 1.7   0.0069 0.0081  0.37
## fun3_1        0.57      0.60    0.50      0.33 1.5   0.0078 0.0011  0.33
## intel3_1      0.62      0.63    0.54      0.36 1.7   0.0071 0.0048  0.35
## 
##  Item statistics 
##             n raw.r std.r r.cor r.drop mean  sd
## attr3_1  8273  0.72  0.74  0.62   0.49  7.1 1.4
## amb3_1   8273  0.75  0.70  0.53   0.43  7.6 1.8
## fun3_1   8273  0.74  0.72  0.58   0.47  7.7 1.6
## intel3_1 8273  0.62  0.69  0.52   0.42  8.4 1.1
## 
## Non missing response frequency for each item
##             2    3    4    5    6    7    8    9   10 miss
## attr3_1  0.00 0.02 0.03 0.08 0.13 0.35 0.27 0.09 0.03 0.01
## amb3_1   0.01 0.02 0.03 0.08 0.09 0.20 0.25 0.19 0.14 0.01
## fun3_1   0.01 0.01 0.01 0.04 0.12 0.21 0.27 0.22 0.11 0.01
## intel3_1 0.00 0.00 0.00 0.01 0.03 0.14 0.35 0.32 0.16 0.01

The Cronbach’s alpha value > 0.65 which is minimally acceptable. We conclude that ML4 is reliable and we should keep all the variables assigned to it.

Thus, among four factors there is only ML2 that raises suspicions in terms of its reliability.

Regression Model

Now, let’s save the factors’ scores in order to use them for regression modeling

fa4_wscores <- fa(dating0, nfactors=4, rotate="oblimin", fm="ml", scores = T)

fascores <-as.data.frame(fa4_wscores$scores)
datingfa<-cbind(dating0,fascores)

To the regression itself - I start with ML2 and ML3 as the independent variables predicting int_corr from the original dataset. The latter stands for the correlation between participant’s and partner’s ratings of interests in Time 1, taking up the range of numeric values from 0 to 1. The independent variable is chosen due to…well, there are not many (very few, actually) numeric variables in the initial dating data to choose from. Plus the correlation between participant’s and partner’s ratings of interests is distributed very nicely (normally), have a look:

require(ggplot2)
ggplot(dating, aes(x = int_corr)) +
  labs(title = "Distribution of Outcome Values", subtitle = "Histogram") +
  xlab("Correlation Between Participant’s & Partner’s Interests' Ratings") + 
  ylab("") +
  geom_histogram(binwidth = 0.1, fill = "brown", col= "black", alpha = 0.95) +
  theme_bw()

lm_data <- cbind(datingfa, dating$int_corr)

options(scipen = 999)
linear <- lm(dating$int_corr~ML3+ML2, data = lm_data)
summary(linear)
## 
## Call:
## lm(formula = dating$int_corr ~ ML3 + ML2, data = lm_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09656 -0.21014  0.01954  0.22874  0.71755 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  0.196034   0.003301  59.384 < 0.0000000000000002 ***
## ML3          0.052441   0.003515  14.921 < 0.0000000000000002 ***
## ML2         -0.013762   0.003514  -3.916            0.0000908 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2984 on 8171 degrees of freedom
##   (204 observations deleted due to missingness)
## Multiple R-squared:  0.03082,    Adjusted R-squared:  0.03059 
## F-statistic: 129.9 on 2 and 8171 DF,  p-value: < 0.00000000000000022

The summary output of the model shows that it has some right to exist - its overall p-value is much lower than 0.01. However, a very low R-squared value suggests that it is not the best model that can be chosen to predict the correlation between participant’s and partner’s ratings of interests in Time 1. It explains about 3% of variation in the outcome variable. Yet, the coefficients of factors as predictors are statistically signifficant.

The relationship between the factors and the outcome can be interpreted as follows:

  • the more points a person scores on ML3 (which stands for degree of interest in theater, museums, concerts, and art on the scale from 1 to 10), the more correlated participant’s and partner’s ratings of interests are. Which seems pretty obvious.

  • the more points a person scores on ML2 (which stands for degree of importance in such attributes as attractive, ambitious, and sincere of a possible date partner on the scale from 1 to 10), the less correlated participant’s and partner’s ratings of interests are. That’s kinda curious.

!BONUS: regression model diagnostics

plot(linear)

The diagnostics shows that the model is quite fine. The distribution of residual values in the outcome is approximately normal - on the normal probability plot residuals follow almost a straight line and the residuals are spread equally along the ranges of predictors (3rd plot). Though sight deviations from normality, , i.e., outliers, are present. Yetthose are not critical - there are no leverage values, i.e., ones falling below the red dashed line on the Cook’s distance plot (the last one).

That’s all :)