include_graphics(genderImagePath)
We are going to perform a factor analysis on a dataset of answers to the Rosenberg Self-Esteem Scale carryed out in 2014, then check the results and compare them for three different gender identities stated in the questionnaire:
There are 10 questions:
All the variables are Likert items.
We start by plotting the variables
dataNormalities <- data %>% dplyr::select(-gender)
dataNormalities <- dataNormalities %>% pivot_longer(cols=everything() ,names_to = "VAR", values_to = "vals")
ggplot(dataNormalities,aes(x=vals))+geom_bar(fill="steelblue")+facet_wrap(~VAR) + theme_void()
It looks like the variables are generally normal , except for Q2 and Q4 that are more left-skewed than most , and Q10.
Let’s check with actual statistical tests:
lshap <- data %>% dplyr::select(-gender) %>% lapply(lillie.test)
lres <- sapply(lshap, `[`, c("statistic","p.value"))
lres <- t(lres)
datatable(lres)
The test outputs a strong rejection of the null hypothesis that the distributions are normal.
We’ll go on using methods that don’t assume normality of the distributions of our variables.
where:
R = \(r_{ij}\) is the correlation matrix, U = \(u_{ij}\) is the partial covariance matrix
corrMatrix <- data %>% dplyr::select(-gender) %>% cor(method = ("spearman"))
corrplot(corrMatrix,type="upper",order="hclust")
psych::KMO(corrMatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corrMatrix)
## Overall MSA = 0.92
## MSA for each item =
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
## 0.92 0.91 0.95 0.95 0.95 0.92 0.92 0.97 0.89 0.90
We expect an index of at least 0.6 to carry on the analysis , so in this case we are satisfied,as the 0.92 Measure of Sample Adequacy suggests that it is worthwhile to analyze this correlation matrix.
We do this with a parallel analysis scree plot
data %>% dplyr::select(-gender) %>% fa.parallel(fm="pa")
## Parallel analysis suggests that the number of factors = 4 and the number of components = 2
We choose 2 factors.
res_PA<- data %>% dplyr::select(-gender) %>% fa(nfactors = 2,n.obs=nrow(data),rotate="none", fm = 'pa')
res_PA
## Factor Analysis using method = pa
## Call: fa(r = ., nfactors = 2, n.obs = nrow(data), rotate = "none",
## fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## Q1 0.75 0.33 0.67 0.33 1.4
## Q2 0.68 0.37 0.60 0.40 1.5
## Q3 -0.75 0.13 0.58 0.42 1.1
## Q4 0.61 0.26 0.43 0.57 1.3
## Q5 -0.71 0.10 0.51 0.49 1.0
## Q6 0.78 0.11 0.62 0.38 1.0
## Q7 0.76 0.08 0.58 0.42 1.0
## Q8 -0.55 0.21 0.35 0.65 1.3
## Q9 -0.72 0.36 0.65 0.35 1.5
## Q10 -0.77 0.33 0.71 0.29 1.3
##
## PA1 PA2
## SS loadings 5.06 0.64
## Proportion Var 0.51 0.06
## Cumulative Var 0.51 0.57
## Proportion Explained 0.89 0.11
## Cumulative Proportion 0.89 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 45 and the objective function was 5.31 with Chi Square of 254553.3
## The degrees of freedom for the model are 26 and the objective function was 0.22
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.04
##
## The harmonic number of observations is 47974 with the empirical chi square 3674.65 with prob < 0
## The total number of observations was 47974 with Likelihood Chi Square = 10568.66 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.928
## RMSEA index = 0.092 and the 90 % confidence intervals are 0.09 0.093
## BIC = 10288.42
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.96 0.79
## Multiple R square of scores with factors 0.93 0.63
## Minimum correlation of possible factor scores 0.85 0.25
We can see that:
the first factor explains most of the variables, and we have items like Q2,Q4,Q8,Q9,Q10 that have cross-loading across factors, this is reflected in the complexities as seen below.
they are still kind of high , with mean item complexity = 1.3 .
we can consider as acceptable the communalities from 0.2 on , since we are considering an acceptable factor loading as 0.4. This is because , approximately:
In our case they are already much higher than that.
Uniqueness , as a result , will be \(1 - communality\).
cor(res_PA$scores)
## PA1 PA2
## PA1 1.000000000 -0.005227977
## PA2 -0.005227977 1.000000000
No high correlation between factors is found. In general , if the correlations are >0.85 the factors are not distinct from each other, thus they overlap , leading to multicollinearity, so they should be combined (Brown,2015).
fa.diagram(res_PA)
As we stated before , all the variables are explained by the first factor only , so we’ll need to dive deeper and look for a rotation.
The rotation of choice is oblimin (Fabrigar & Wegener, 2012):
as we leave room for variables to be correlated , ’cause we know that with those 10 items we are trying to explain just one construct.
res_PA_Oblimin<- data %>% dplyr::select(-gender) %>% fa(nfactors = 2,n.obs=nrow(data),rotate="oblimin", fm = 'pa')
res_PA_Oblimin
## Factor Analysis using method = pa
## Call: fa(r = ., nfactors = 2, n.obs = nrow(data), rotate = "oblimin",
## fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## Q1 0.00 0.82 0.67 0.33 1.0
## Q2 0.08 0.82 0.60 0.40 1.0
## Q3 0.58 -0.23 0.58 0.42 1.3
## Q4 -0.02 0.65 0.43 0.57 1.0
## Q5 0.52 -0.25 0.51 0.49 1.4
## Q6 -0.30 0.55 0.62 0.38 1.6
## Q7 -0.32 0.51 0.58 0.42 1.7
## Q8 0.57 -0.03 0.35 0.65 1.0
## Q9 0.86 0.08 0.65 0.35 1.0
## Q10 0.84 0.00 0.71 0.29 1.0
##
## PA1 PA2
## SS loadings 2.91 2.78
## Proportion Var 0.29 0.28
## Cumulative Var 0.29 0.57
## Proportion Explained 0.51 0.49
## Cumulative Proportion 0.51 1.00
##
## With factor correlations of
## PA1 PA2
## PA1 1.00 -0.68
## PA2 -0.68 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 45 and the objective function was 5.31 with Chi Square of 254553.3
## The degrees of freedom for the model are 26 and the objective function was 0.22
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.04
##
## The harmonic number of observations is 47974 with the empirical chi square 3674.65 with prob < 0
## The total number of observations was 47974 with Likelihood Chi Square = 10568.66 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.928
## RMSEA index = 0.092 and the 90 % confidence intervals are 0.09 0.093
## BIC = 10288.42
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.94 0.94
## Multiple R square of scores with factors 0.89 0.87
## Minimum correlation of possible factor scores 0.77 0.75
As we can see from the results:
res_PA_Oblimin$score.cor
## [,1] [,2]
## [1,] 1.0000000 -0.6994701
## [2,] -0.6994701 1.0000000
These correlations reflect some interesting things : * the two factors are explaining two opposite things. * we have a clue that the two factors are explaining the same construct as expected - even though of course we still can’t be sure of the construct we’re measuring , that’s validity’s job -.
res_PA_Oblimin<- data %>% dplyr::select(-c(gender,Q3,Q5,Q6,Q7)) %>% fa(nfactors = 2,n.obs=nrow(mydata),rotate="oblimin", fm ='pa')
res_PA_Oblimin
## Factor Analysis using method = pa
## Call: fa(r = ., nfactors = 2, n.obs = nrow(mydata), rotate = "oblimin",
## fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## Q1 -0.04 0.80 0.69 0.31 1
## Q2 0.05 0.83 0.64 0.36 1
## Q4 -0.05 0.63 0.44 0.56 1
## Q8 0.54 -0.05 0.33 0.67 1
## Q9 0.86 0.04 0.70 0.30 1
## Q10 0.84 -0.04 0.75 0.25 1
##
## PA1 PA2
## SS loadings 1.78 1.76
## Proportion Var 0.30 0.29
## Cumulative Var 0.30 0.59
## Proportion Explained 0.50 0.50
## Cumulative Proportion 0.50 1.00
##
## With factor correlations of
## PA1 PA2
## PA1 1.00 -0.63
## PA2 -0.63 1.00
##
## Mean item complexity = 1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 15 and the objective function was 2.42 with Chi Square of 115976.2
## The degrees of freedom for the model are 4 and the objective function was 0
##
## The root mean square of the residuals (RMSR) is 0.01
## The df corrected root mean square of the residuals is 0.01
##
## The harmonic number of observations is 47974 with the empirical chi square 69.01 with prob < 3.7e-14
## The total number of observations was 47974 with Likelihood Chi Square = 194.76 with prob < 5e-41
##
## Tucker Lewis Index of factoring reliability = 0.994
## RMSEA index = 0.032 and the 90 % confidence intervals are 0.028 0.035
## BIC = 151.65
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.93 0.92
## Multiple R square of scores with factors 0.86 0.84
## Minimum correlation of possible factor scores 0.72 0.68
Now we can see that the mean item complexity is 1 .
The loadings lead us to do this separation:
Q1,Q2,Q4 belong to PA2
Q8,Q9,Q10 belong to PA1
Let’s carve an interpretation out of these question groups:
The communalities are very good as most exceed the value of 0.6 , so they exceed the usual researchers’ threshold of 0.5, even though we have Q4 and Q8 that are respectively 0.4 and 0.3, but they respect out fixed minimum of 0.2 anyway.
Lastly,let’s check the factor correlations:
cor(res_PA_Oblimin$scores)
## PA1 PA2
## PA1 1.0000000 -0.7133108
## PA2 -0.7133108 1.0000000
The correlations are still negative and significant as we expected , so what we told before is still valid.
Let’s see the diagram:
fa.diagram(res_PA_Oblimin)
This diagram accounts for everything we’ve said before.
So, in conclusion, going into the actual analysis of the self-esteem differences through gender , we will use these results.
To do that , we are going to compute Cronbach’s alpha, as it indicates the internal consistency reliability.
Let’s start with the first factor , the one that accounts for the negative attitude towards self.
positiveSelf = c('Q8','Q9','Q10')
negativeSelf = c('Q1','Q2','Q4')
relPositive = psych::alpha(data[,positiveSelf])
relNegative = psych::alpha(data[,negativeSelf])
print(relPositive, digits = 3 )
##
## Reliability analysis
## Call: psych::alpha(x = data[, positiveSelf])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.796 0.795 0.744 0.564 3.87 0.00161 2.68 0.861 0.493
##
## lower alpha upper 95% confidence boundaries
## 0.793 0.796 0.8
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Q8 0.838 0.840 0.724 0.724 5.24 0.00147 NA 0.724
## Q9 0.658 0.660 0.493 0.493 1.94 0.00310 NA 0.493
## Q10 0.643 0.643 0.474 0.474 1.80 0.00326 NA 0.474
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Q8 47974 0.767 0.779 0.569 0.521 2.68 0.969
## Q9 47974 0.872 0.870 0.798 0.701 2.77 1.007
## Q10 47974 0.887 0.878 0.812 0.711 2.57 1.084
##
## Non missing response frequency for each item
## 0 1 2 3 4 miss
## Q8 0.004 0.138 0.240 0.405 0.212 0
## Q9 0.005 0.139 0.197 0.393 0.265 0
## Q10 0.004 0.217 0.219 0.325 0.235 0
Cronbach’s alpha = 0.79
print(relNegative, digits = 3 )
##
## Reliability analysis
## Call: psych::alpha(x = data[, negativeSelf])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.804 0.804 0.739 0.578 4.11 0.00155 3 0.715 0.546
##
## lower alpha upper 95% confidence boundaries
## 0.801 0.804 0.807
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Q1 0.690 0.690 0.527 0.527 2.22 0.00283 NA 0.527
## Q2 0.706 0.706 0.546 0.546 2.40 0.00268 NA 0.546
## Q4 0.795 0.796 0.661 0.661 3.90 0.00187 NA 0.661
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Q1 47974 0.873 0.868 0.777 0.690 3.00 0.876
## Q2 47974 0.858 0.860 0.761 0.677 3.09 0.824
## Q4 47974 0.812 0.815 0.649 0.589 2.91 0.829
##
## Non missing response frequency for each item
## 0 1 2 3 4 miss
## Q1 0.002 0.063 0.180 0.438 0.317 0
## Q2 0.007 0.043 0.131 0.495 0.324 0
## Q4 0.005 0.051 0.213 0.494 0.237 0
Cronbach’s alpha = 0.80
We have good values as they are both greater than 0.7.
dataGenders <- cbind(res_PA_Oblimin$scores,data)
dataFemales <- dataGenders %>% dplyr::filter(gender == 2) %>% dplyr::select(c(PA2,PA1))
dataOther <- dataGenders %>% dplyr::filter(gender == 3) %>% dplyr::select(c(PA2,PA1))
dataMales <- dataGenders %>% dplyr::filter(gender == 1) %>% dplyr::select(c(PA2,PA1))
# mean(dataFemales$PA2)
We remark the meanings of the factors:
PA2: Evenness with others PA1: Self Disheartening
The mean of factors’scores for females are:
The mean of factors’scores for ‘other genders’ are:
The mean of factors’scores for males are:
While females have more of an inferiority complex towards the world compared to males, they also tend to dishearten more . So , the data reflects a well-spread stereotype: at least on the surface , men are psychologically “stronger” than men.
include_graphics(cureImagePath)
The interpretation is the same as before: While females have more of an inferiority complex towards the world compared to ‘other genders’, they also tend to dishearten more .
Basically males don’t suffer from inferiority complex towards the world as much as other genders. Males also tend to dishearten less .
include_graphics(mentalPath)
We’ll perform some Welch t-tests , as neither the sample sizes , nor the variances are equal for the different gender samples.
t.test(dataFemales$PA2,dataOther$PA2,alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dataFemales$PA2 and dataOther$PA2
## t = 9.4844, df = 543.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3385766 0.5154575
## sample estimates:
## mean of x mean of y
## -0.08497743 -0.51199447
t.test(dataFemales$PA1,dataOther$PA1,alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dataFemales$PA1 and dataOther$PA1
## t = -9.4941, df = 547.31, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4594393 -0.3019161
## sample estimates:
## mean of x mean of y
## 0.07753153 0.45820926
t.test(dataFemales$PA2,dataMales$PA2,alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dataFemales$PA2 and dataMales$PA2
## t = -28.149, df = 38264, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2578549 -0.2242835
## sample estimates:
## mean of x mean of y
## -0.08497743 0.15609178
t.test(dataFemales$PA1,dataMales$PA1,alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dataFemales$PA1 and dataMales$PA1
## t = 24.647, df = 36990, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1997876 0.2343080
## sample estimates:
## mean of x mean of y
## 0.07753153 -0.13951625
t.test(dataOther$PA2,dataMales$PA2,alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dataOther$PA2 and dataMales$PA2
## t = -14.78, df = 551.91, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7568762 -0.5792964
## sample estimates:
## mean of x mean of y
## -0.5119945 0.1560918
t.test(dataOther$PA1,dataMales$PA1,alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: dataOther$PA1 and dataMales$PA1
## t = 14.814, df = 561.21, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5184721 0.6769789
## sample estimates:
## mean of x mean of y
## 0.4582093 -0.1395163
We reject the null hypothesis for all our t-tests: the differences in means are statically significant for each one of our comparisons.
A work by Antonio Taurisano