library(pacman)
p_load(dplyr, psych, lavaan)
BESDS <- function(r){
PC = 0.5 * abs(r) + 0.5
return(PC)}
OmegaU <- function(x){
require(lavaan)
IN <- inspect(x, 'std')$lambda #maybe ^2? I don't think so due to path tracing rules
GL <- sum(IN); DT <- attributes(IN)$dimnames[[1]]
FL <- x@call; DF <- eval(FL$data); DD <- DF[DT]
CM <- ((sum(cor(DD))) - length(DD))/2
RE <- GL/CM
return(RE)}
The latter function is to compute \(\omega\) from a single factor lavaan model.
N79 <- read.csv("NLS79SES.csv")
N97 <- read.csv("NLS97SES.csv")
Gregory Clark takes a latent variable approach to social status in order to account for measurement error. There are a few assumptions of the model, like invariance between generations, that have, to my knowledge, not been looked at. Another assumption that is critical for even using a latent variable in the first place is local independence. If manifest variables used to indicate some sort of latent social status are residually correlated, then there may be problems using a latent variable. This may be what's expected for a few reasons. Namely, that these variables are often causally related. For instance, a prestigious occupation can cause high income, or if we use modern measurements, low income can cause welfare use, bankruptcy, or giving up the pursuit of education, which, itself, may cause higher incomes (in a sense). So perhaps a latent variable model will be inappropriate; the do-calculus is certainly uncertain on this question. If it turns out a latent variable model is not appropriate because, say, the measures are not unidimensional enough, it may be better to use PCA for a summary, but this loses the measurement error reducing benefit, even if it leads to variance reduction. If that is the case, other means of correcting for measurement error, both in variables and in sumscores, should be considered.
I will test if some social status measures yield a (1) fitting latent variable model with (2) adequate local independence in the NLSY 1979 and 1997 cohorts. For the NLSY '79, I will be using (and am open to suggestions)
Because the variables changed, for the NLSY '97, I will be using
I do not know why the '97 stopped collecting data on welfare receipt in 2009. I doubt it was because it became too common or something like that.
N79 <- N79 %>% rename(
W10 = G0260300, #total welfare received, last year
W12 = G0276000,
W14 = G0292700,
W16 = G0313100,
WST10 = T3045300, #wages/salary/tips, last year
WST12 = T3977400,
WST14 = T4915800,
WST16 = T5619500,
UNE10 = T3109000, #unemployment weeks, last year
UNE12 = T4113500,
UNE14 = T5024900,
UNE16 = T5772900,
POV10 = T3108000, #poverty status, last year
POV12 = T4112500,
POV14 = T5022800,
POV16 = T5770900,
EDU = T5177500, #highest degree received, latest year
WEA12 = T4045802, #net wealth, last year
WEA16 = T5684500,
RES12 = T4049500, #market value of residential properties
RES16 = T5687800,
BANKR = T5732100, #number of bankruptcies, latest year
ENTRE = X0045900, #number of businesses ever owned
AFQT = R0618200 #AFQT percentile score
) %>% mutate(
WT = W10 + W12 + W14 + W16,
INC = WST10 + WST12 + WST14 + WST16,
UNE = UNE10 + UNE12 + UNE14 + UNE16,
POV = POV10 + POV12 + POV14 + POV16,
WEA = WEA12 + WEA16,
RES = RES12 + RES16)
describe(N79)
SESM79 <- '
SES =~ WT + INC + UNE + POV + EDU + WEA + RES + BANKR + ENTRE'
FIT79 <- sem(SESM79, data = N79, std.lv = T, std.ov = T)
fitMeasures(FIT79)
## npar fmin chisq df
## 18.000 0.202 5137.367 27.000
## pvalue baseline.chisq baseline.df baseline.pvalue
## 0.000 51327.367 36.000 0.000
## cfi tli nnfi rfi
## 0.900 0.867 0.867 0.867
## nfi pnfi ifi rni
## 0.900 0.675 0.900 0.900
## logl unrestricted.logl aic bic
## -138906.388 -136337.704 277848.776 277982.844
## ntotal bic2 rmsea rmsea.ci.lower
## 12686.000 277925.642 0.122 0.119
## rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
## 0.125 0.000 0.078 0.078
## srmr srmr_bentler srmr_bentler_nomean crmr
## 0.078 0.078 0.078 0.087
## crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
## 0.087 0.078 0.078 100.054
## cn_01 gfi agfi pgfi
## 116.968 0.916 0.860 0.549
## mfi ecvi
## 0.818 0.408
resid(FIT79, "cor")
## $type
## [1] "cor.bollen"
##
## $cov
## WT INC UNE POV EDU WEA RES BANKR ENTRE
## WT 0.000
## INC -0.197 0.000
## UNE 0.040 -0.138 0.000
## POV 0.026 0.011 0.017 0.000
## EDU -0.019 -0.007 -0.011 -0.012 0.000
## WEA -0.119 0.399 -0.101 0.001 -0.002 0.000
## RES -0.014 -0.004 -0.007 -0.001 0.012 -0.005 0.000
## BANKR -0.010 -0.055 0.008 -0.005 0.037 -0.110 0.005 0.000
## ENTRE -0.033 0.067 -0.022 -0.002 0.002 0.079 -0.003 0.025 0.000
mean(abs(resid(FIT79, "cor")$cov))
## [1] 0.03962961
OmegaU(FIT79)
## [1] 0.4778986
Vars79 <- c("WT", "INC", "UNE", "POV", "EDU", "WEA", "RES", "BANKR", "ENTRE", "AFQT"); Vars79 <- N79[Vars79]
FC <- cor(Vars79); FC; FCA <- FC[,'AFQT']; #FCA[1:length(FCA)-1]
## WT INC UNE POV EDU WEA
## WT 1.00000000 -0.09747381 0.1722029 0.23860283 0.13920528 -0.05644752
## INC -0.09747381 1.00000000 0.1295101 0.44078113 0.31229458 0.52473548
## UNE 0.17220294 0.12951011 1.0000000 0.58988039 0.41486146 0.06735660
## POV 0.23860283 0.44078113 0.5898804 1.00000000 0.67131737 0.27110915
## EDU 0.13920528 0.31229458 0.4148615 0.67131737 1.00000000 0.19844436
## WEA -0.05644752 0.52473548 0.0673566 0.27110915 0.19844436 1.00000000
## RES 0.19791321 0.42578495 0.5651300 0.91814629 0.69494552 0.26437182
## BANKR 0.10595509 0.17987660 0.3213307 0.49751425 0.41053100 0.03691875
## ENTRE 0.01675670 0.16728454 0.1116439 0.21296126 0.16120409 0.14241164
## AFQT -0.14592429 0.25752004 -0.1098700 -0.04457269 -0.02041383 0.23641176
## RES BANKR ENTRE AFQT
## WT 0.19791321 0.10595509 0.01675670 -0.14592429
## INC 0.42578495 0.17987660 0.16728454 0.25752004
## UNE 0.56512995 0.32133073 0.11164389 -0.10987001
## POV 0.91814629 0.49751425 0.21296126 -0.04457269
## EDU 0.69494552 0.41053100 0.16120409 -0.02041383
## WEA 0.26437182 0.03691875 0.14241164 0.23641176
## RES 1.00000000 0.50618267 0.21147473 -0.04137285
## BANKR 0.50618267 1.00000000 0.14210627 -0.06036245
## ENTRE 0.21147473 0.14210627 1.00000000 0.07561672
## AFQT -0.04137285 -0.06036245 0.07561672 1.00000000
MCA <- mean(abs(FCA[1:length(FCA)-1])); MCA; BESDS(MCA)
## [1] 0.1102294
## [1] 0.5551147
\(\omega\) seems low and the average residual correlation is very small, but the average masks potentially important variation within the matrix. For example, welfare usage is residually negatively correlated with income and wealth in addition to a positive correlation with unemployment and poverty. Income and unemployment are residually negatively correlated, and income is residually correlated with greater education and wealth, as well as a lower number of bankruptcies and a greater number of previously owned businesses. Unemployment is residually negatively related to wealth, and wealth is residually negatively related to bankruptcy and positively related to number of businesses previously owned. What "latent socioeconomic status" explains in these variables amounts to only 1 - (0.399/0.525) = 24% of the correlation between income and wealth. Using a latent socioeconomic status variable model or variable may end up being problematic. To clarify something, the reason these small deviations (compared to, say, cognitive data, where correlation matrices are much stronger) may be concerning is because the latent variable doesn't explain much anyway:
mean(inspect(FIT79, 'std')$lambda^2)
## [1] 0.3728846
Notably, we would not expect much difference with PCA (components are sumscores, not latent variables):
IN <- inspect(FIT79, 'std')$lambda; DT <- attributes(IN)$dimnames[[1]]; FL <- FIT79@call; DF <- eval(FL$data); DD <- DF[DT]
fa(DD, method = "pca")
## Factor Analysis using method = minres
## Call: fa(r = DD, method = "pca")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 h2 u2 com
## WT 0.19 0.035 0.965 1
## INC 0.45 0.200 0.800 1
## UNE 0.57 0.322 0.678 1
## POV 0.97 0.936 0.064 1
## EDU 0.72 0.514 0.486 1
## WEA 0.30 0.090 0.910 1
## RES 0.96 0.924 0.076 1
## BANKR 0.52 0.267 0.733 1
## ENTRE 0.24 0.057 0.943 1
##
## MR1
## SS loadings 3.34
## Proportion Var 0.37
##
## Mean item complexity = 1
## Test of the hypothesis that 1 factor is sufficient.
##
## The degrees of freedom for the null model are 36 and the objective function was 4.05 with Chi Square of 51307.81
## The degrees of freedom for the model are 27 and the objective function was 0.42
##
## The root mean square of the residuals (RMSR) is 0.09
## The df corrected root mean square of the residuals is 0.1
##
## The harmonic number of observations is 12686 with the empirical chi square 6762 with prob < 0
## The total number of observations was 12686 with Likelihood Chi Square = 5351.73 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.862
## RMSEA index = 0.125 and the 90 % confidence intervals are 0.122 0.128
## BIC = 5096.62
## Fit based upon off diagonal values = 0.94
## Measures of factor score adequacy
## MR1
## Correlation of (regression) scores with factors 0.99
## Multiple R square of scores with factors 0.97
## Minimum correlation of possible factor scores 0.95
At least in this dataset and with these indicators, there is not likely to be a general factor of socioeconomic status. Perhap using only (1) income; (2) education (3) occupational score, as Clark practically does, would change things. I may attempt that later. Using more indicators should not make for a worse model, excepting the penalty on fit that accrues normally for single-factor models and may necessitate, perhaps, moving to a higher-order (or some other variety of) model.
It is interesting and worth noting that attained wealth correlates more with the AFQT score than the education measure, which may not be highest degree attained (it's listed as highest grade; I will check this later; this is preliminary). It does not correlate more than income though. Attainment correlations contrast with parental SES-child IQ correlations, which are low for wealth and comparable quite high for parental education. The differences between attained and parental SES measure correlations are interesting for cross-racial comparisons, because the latter tend to be lower for lower-performing groups due to regression to the mean (Jensen discussed this in The g Factor), although, validating that perspective, the former are usually almost the same. Because, earlier, I found something like a 0.5 correlation with a measure someone else sent me, I think I'm probably using the wrong one for education. If that's the case, this looks like an almost Sowellian pattern of correlations, where you're much less likely to be poor if you just graduate high school.
N97 <- N97 %>% rename(
POV11 = T6656800, #poverty ratio - higher = less poverty
POV13 = T8129200,
POV15 = U0009000,
POV17 = U1845600,
INC11 = T7545600, #total wages/income/salary
INC13 = T8976700,
INC15 = U0956900,
INC17 = U2857200,
RES = U2865500, #residential property value
BUS = Z0515400, #businesses reported
BPV = Z0515800, #market value of businesses/partnerships
BANKR = Z0525200, #number of bankruptcies
UNE12 = Z9076003, #unemployment months
UNE13 = Z9076004,
UNE14 = Z9076005,
UNE15 = Z9076006,
UNE16 = Z9076007,
UNE17 = Z9076008,
UNE18 = Z9076009,
EDU = Z9083900, #highest degree attained
NETW = Z9141400, #net worth
ASVAB = R9829600 #ASVAB math-verbal percentile score
) %>% mutate(
POV = POV11 + POV13 + POV15 + POV17,
INC = INC11 + INC13 + INC15 + INC17,
UNE = UNE12 + UNE13 + UNE14 + UNE15 + UNE16 + UNE17 + UNE18)
describe(N97)
SESM97 <- '
SES =~ POV + INC + UNE + RES + BUS + BPV + BANKR + EDU + NETW'
FIT97 <- sem(SESM97, data = N97, std.lv = T, std.ov = T)
fitMeasures(FIT97)
## npar fmin chisq df
## 18.000 0.087 1563.725 27.000
## pvalue baseline.chisq baseline.df baseline.pvalue
## 0.000 16395.889 36.000 0.000
## cfi tli nnfi rfi
## 0.906 0.875 0.875 0.873
## nfi pnfi ifi rni
## 0.905 0.678 0.906 0.906
## logl unrestricted.logl aic bic
## -107309.112 -106527.249 214654.223 214782.081
## ntotal bic2 rmsea rmsea.ci.lower
## 8984.000 214724.880 0.080 0.076
## rmsea.ci.upper rmsea.pvalue rmr rmr_nomean
## 0.083 0.000 0.045 0.045
## srmr srmr_bentler srmr_bentler_nomean crmr
## 0.045 0.045 0.045 0.050
## crmr_nomean srmr_mplus srmr_mplus_nomean cn_05
## 0.050 0.045 0.045 231.461
## cn_01 gfi agfi pgfi
## 270.814 0.966 0.943 0.580
## mfi ecvi
## 0.918 0.178
resid(FIT97, "cor")
## $type
## [1] "cor.bollen"
##
## $cov
## POV INC UNE RES BUS BPV BANKR EDU NETW
## POV 0.000
## INC 0.010 0.000
## UNE -0.001 -0.030 0.000
## RES -0.037 -0.028 0.205 0.000
## BUS 0.017 -0.036 0.012 0.022 0.000
## BPV -0.004 -0.001 -0.007 0.000 0.105 0.000
## BANKR -0.005 -0.007 0.053 0.035 0.023 -0.004 0.000
## EDU 0.006 0.012 -0.067 0.027 -0.007 -0.013 -0.014 0.000
## NETW -0.009 0.001 -0.054 0.115 0.028 0.017 -0.018 -0.060 0.000
mean(abs(resid(FIT97, "cor")$cov))
## [1] 0.02689205
OmegaU(FIT97)
## [1] 0.5928947
\(\omega\) is higher but, based on some previous simulations, probably not high enough to justify a general factor of socioeconomic status in this dataset, as in the first one, but there is evidence for more, meaningful local independence among these measures. Deviations are mostly obvious, like a connection between residence value and net worth and between the number of businesses and the value of one's businesses and partnerships. These sorts of residual relationships hardly threaten the validity of a latent SES factor, but the lack of general dimensionality does. A better strategy is some form of dimension reduction or accounting for reliability by using a sumscore made from many measures (composite reliabilities are trivial to compute).
Vars97 <- c("POV", "INC", "UNE", "RES", "BUS", "BPV", "BANKR", "EDU", "NETW", "ASVAB"); Vars97 <- N97[Vars97]
FC <- cor(Vars97); FC; FCA <- FC[,'ASVAB']; #FCA[1:length(FCA)-1]
## POV INC UNE RES BUS BPV
## POV 1.00000000 0.72778674 0.362191428 0.43142582 0.15027002 0.027041565
## INC 0.72778674 1.00000000 0.319943226 0.42353802 0.09344531 0.028270877
## UNE 0.36219143 0.31994323 1.000000000 0.43312291 0.07738759 0.007664619
## RES 0.43142582 0.42353802 0.433122912 1.00000000 0.10651576 0.019113095
## BUS 0.15027002 0.09344531 0.077387593 0.10651576 1.00000000 0.110074367
## BPV 0.02704156 0.02827088 0.007664619 0.01911309 0.11007437 1.000000000
## BANKR 0.01544829 0.01236054 0.063230431 0.04799598 0.02624845 -0.003545517
## EDU 0.43304105 0.42331110 0.141316102 0.29541903 0.06932037 0.004493055
## NETW 0.40867473 0.40429284 0.150069400 0.37773597 0.10264323 0.034471934
## ASVAB 0.33908412 0.32236639 0.040989239 0.20230461 0.08541391 0.019569815
## BANKR EDU NETW ASVAB
## POV 0.015448291 0.433041052 0.408674729 0.33908412
## INC 0.012360541 0.423311097 0.404292838 0.32236639
## UNE 0.063230431 0.141316102 0.150069400 0.04098924
## RES 0.047995975 0.295419031 0.377735969 0.20230461
## BUS 0.026248447 0.069320365 0.102643225 0.08541391
## BPV -0.003545517 0.004493055 0.034471934 0.01956981
## BANKR 1.000000000 -0.002409366 -0.006535394 -0.01149917
## EDU -0.002409366 1.000000000 0.179889439 0.47341150
## NETW -0.006535394 0.179889439 1.000000000 0.16839567
## ASVAB -0.011499169 0.473411502 0.168395669 1.00000000
MCA <- mean(abs(FCA[1:length(FCA)-1])); MCA; BESDS(MCA)
## [1] 0.1847816
## [1] 0.5923908
Again, the education measure in the first dataset may have been wrong. Even if that changed things, it's hardly plausible that the failure of this model would be overturned. Regarding the total amount this second model explains:
mean(inspect(FIT97, 'std')$lambda^2)
## [1] 0.2680762
And the result is still not very different from PCA:
IN <- inspect(FIT97, 'std')$lambda; DT <- attributes(IN)$dimnames[[1]]; FL <- FIT97@call; DF <- eval(FL$data); DD <- DF[DT]
fa(DD, method = "pca")
## Factor Analysis using method = minres
## Call: fa(r = DD, method = "pca")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 h2 u2 com
## POV 0.84 0.7107 0.29 1
## INC 0.81 0.6518 0.35 1
## UNE 0.44 0.1944 0.81 1
## RES 0.60 0.3657 0.63 1
## BUS 0.16 0.0265 0.97 1
## BPV 0.04 0.0016 1.00 1
## BANKR 0.03 0.0011 1.00 1
## EDU 0.48 0.2282 0.77 1
## NETW 0.49 0.2406 0.76 1
##
## MR1
## SS loadings 2.42
## Proportion Var 0.27
##
## Mean item complexity = 1
## Test of the hypothesis that 1 factor is sufficient.
##
## The degrees of freedom for the null model are 36 and the objective function was 1.83 with Chi Square of 16387.07
## The degrees of freedom for the model are 27 and the objective function was 0.19
##
## The root mean square of the residuals (RMSR) is 0.05
## The df corrected root mean square of the residuals is 0.05
##
## The harmonic number of observations is 8984 with the empirical chi square 1461.22 with prob < 1.9e-291
## The total number of observations was 8984 with Likelihood Chi Square = 1687.96 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.865
## RMSEA index = 0.083 and the 90 % confidence intervals are 0.079 0.086
## BIC = 1442.17
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## MR1
## Correlation of (regression) scores with factors 0.92
## Multiple R square of scores with factors 0.85
## Minimum correlation of possible factor scores 0.70
I don't know what the reasons are to believe in a latent socioeconomic status variable. If it is real, it seems to not be very general. Because higher-order loadings are consistent between good single-factor, STCU, higher-order, and bifactor models, mathematically, it seems unlikely that much will change this. The use of latent variables seems dubious. The use of measures weighted based on the loadings from them seems even worse: why make a composite from something with a weight of 0.03 among, generously, a dozen variables? Latent variable modeling may generate results consistent with low mobility because it is a bad (and indeed, perhaps inappropriate) means of summarizing measures. Even the first principal component does not seem usable as a summary of latent socioeconomic status, trait, summary, or otherwise. Unless someone really wants to argue that what socioeconomic status measures have in common is all (or a reasonable approximation of) what's reliable for them (for all I know, this could be true), a probably far better strategy would be to use
\[X_t = \frac{\sum_{i=1}^kX_i-K}{\sqrt{\sum_{i=1}^k \sum_{j=1}^k (r_{ij})}}\]
with reliability equal to
\[\frac{\sum_{i=1}^k \sum_{j=1}^k r_{ij} - K + \sum_{i=1}^k r_i}{\sum_{i=1}^k \sum_{j=1}^k r_{ij}}\]
If anyone recalls the sources for these, do tell.
Maybe this is alright and, say, the amount of reliable variance in socioeconomic status is just what the measures (how are these selected?) have in common, but I've never seen anyone justify this and I feel it should be broken down into multiple subdomains instead of one broad domain in any case. The theoretical implications of latent variable models versus other means of dimensionality reduction should be reckoned with. If the result is fine, using a sumscore should generate comparable results. This does not justify the result, it's just a necessary corollary. I would be surprised if the total variance to be explained was like the amount observed in psychology.