The Future of Factor Analysis for Social Researchers, on Rpubs

Here we look at an example of factor-analyzing data collected on a 4-point ordinal scale. We proceed in three stages, comparing the conventional factor-analysis approach, an approach using the graded response model, and an approach using multidimensional IRT. For the background to this discussion, please also have a look at this post (http://kosugitti.hatenadiary.jp/entry/2012/12/19/144228).

Please have the following packages ready.

library(psych)
library(ltm)
## Loading required package: MASS
## Loading required package: msm
## Loading required package: mvtnorm
## Loading required package: polycor
## Loading required package: sfsmisc
## Attaching package: 'polycor'
## The following object(s) are masked from 'package:psych':
## 
## polyserial
## Attaching package: 'ltm'
## The following object(s) are masked from 'package:psych':
## 
## factor.scores
library(mirt)
## Loading required package: stats4
## Loading required package: lattice
## Attaching package: 'mirt'
## The following object(s) are masked from 'package:ltm':
## 
## Science

As sample data we use the Science data set from the ltm package: seven items, each answered on a 4-point scale.

data(Science)
summary(ltm::Science)
##               Comfort               Environment                 Work    
##  strongly disagree:  5   strongly disagree: 29   strongly disagree: 33  
##  disagree         : 32   disagree         : 90   disagree         : 98  
##  agree            :266   agree            :145   agree            :206  
##  strongly agree   : 89   strongly agree   :128   strongly agree   : 55  
##                Future                Technology               Industry  
##  strongly disagree: 14   strongly disagree: 18   strongly disagree: 10  
##  disagree         : 72   disagree         : 91   disagree         : 47  
##  agree            :210   agree            :157   agree            :173  
##  strongly agree   : 96   strongly agree   :126   strongly agree   :162  
##               Benefit   
##  strongly disagree: 21  
##  disagree         :100  
##  agree            :193  
##  strongly agree   : 78

Traditionally, even 4-point data like this was (somewhat forcibly) treated as interval-level and analyzed accordingly. That is, the correlations were Pearson product-moment correlations, and the factor analysis was based on them. Let's try that first. We start by recoding the data as if it were interval-level.

Science.n <- data.frame(data.matrix(ltm::Science))
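
Just to make clear what data.matrix does here: each factor is replaced by its integer codes, so the responses become 1 (strongly disagree) through 4 (strongly agree), following the level order shown above. A quick check:

# Inspect the recoded data: every column should now be an integer from 1 to 4
head(Science.n)
str(Science.n)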

Let's compare the product-moment correlations with the polychoric correlations. The polychoric correlations come from the hetcor function in the polycor package (loaded along with ltm), which computes the appropriate type of correlation for each pair of variables according to their measurement types.

print(cor(Science.n), digits = 2)
##             Comfort Environment   Work Future Technology Industry Benefit
## Comfort       1.000       0.077  0.151  0.284      0.071    0.135   0.334
## Environment   0.077       1.000 -0.071 -0.030      0.387    0.328  -0.032
## Work          0.151      -0.071  1.000  0.401     -0.089   -0.019   0.167
## Future        0.284      -0.030  0.401  1.000     -0.028    0.062   0.313
## Technology    0.071       0.387 -0.089 -0.028      1.000    0.345  -0.012
## Industry      0.135       0.328 -0.019  0.062      0.345    1.000   0.086
## Benefit       0.334      -0.032  0.167  0.313     -0.012    0.086   1.000
result.het <- hetcor(ltm::Science)
print(result.het$correlations, digits = 2)
##             Comfort Environment    Work Future Technology Industry Benefit
## Comfort       1.000       0.099  0.2012  0.346      0.090   0.1857   0.408
## Environment   0.099       1.000 -0.0828 -0.028      0.464   0.4108  -0.037
## Work          0.201      -0.083  1.0000  0.479     -0.104  -0.0076   0.209
## Future        0.346      -0.028  0.4786  1.000     -0.036   0.1027   0.377
## Technology    0.090       0.464 -0.1039 -0.036      1.000   0.4348  -0.014
## Industry      0.186       0.411 -0.0076  0.103      0.435   1.0000   0.118
## Benefit       0.408      -0.037  0.2086  0.377     -0.014   0.1180   1.000

Looking at the results, the polychoric correlations are larger. Put the other way around, forcing ordinal variables to be treated as interval-level and computing product-moment correlations means the correlations are underestimated, because an inappropriate model has been applied.
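
Incidentally, if you would rather stay within psych, it also provides a polychoric function that estimates the polychoric correlation matrix (and the category thresholds) directly from the numeric codes. A minimal sketch, assuming your version of psych includes it:

# Alternative: polychoric correlations via psych (expects numeric category codes)
poly.res <- polychoric(Science.n)
round(poly.res$rho, 2)  # polychoric correlation matrix
poly.res$tau            # estimated category thresholds for each item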

Ordinary factor analysis, then, treats the data as continuous, performs an eigendecomposition, decides on the number of factors, and so on. For deciding the number of factors, parallel analysis with the fa.parallel function in the psych package is a good option.

fa.parallel(Science.n)

(Scree plot from the parallel analysis on the raw numeric data.)

## Parallel analysis suggests that the number of factors =  3  and the number of components =  2
fa.parallel(result.het$correlations, n.obs = 392)

(Scree plot from the parallel analysis on the polychoric correlation matrix.)

## Parallel analysis suggests that the number of factors =  3  and the number of components =  2

In this case, admittedly, the conclusion is the same either way.

Let's also run the factor analysis both ways: the classical method treating the variables as continuous, and the method using polychoric correlations. First the classical method.

fa.result.n <- fa(Science.n, nfactors = 3, fm = "ml", rotate = "promax")
print(fa.result.n, sort = TRUE, digits = 3)
## Factor Analysis using method =  ml
## Call: fa(r = Science.n, nfactors = 3, rotate = "promax", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##             item    ML2    ML1    ML3    h2    u2
## Technology     5  0.639 -0.021 -0.046 0.403 0.597
## Environment    2  0.617  0.002 -0.080 0.372 0.628
## Industry       6  0.545  0.040  0.058 0.314 0.686
## Future         4  0.018  0.793 -0.010 0.620 0.380
## Work           3 -0.082  0.548 -0.067 0.273 0.727
## Benefit        7 -0.076 -0.044  0.799 0.587 0.413
## Comfort        1  0.119  0.170  0.341 0.236 0.764
## 
##                         ML2   ML1   ML3
## SS loadings           1.103 0.952 0.751
## Proportion Var        0.158 0.136 0.107
## Cumulative Var        0.158 0.294 0.401
## Proportion Explained  0.393 0.339 0.268
## Cumulative Proportion 0.393 0.732 1.000
## 
##  With factor correlations of 
##        ML2    ML1   ML3
## ML2  1.000 -0.008 0.154
## ML1 -0.008  1.000 0.560
## ML3  0.154  0.560 1.000
## 
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  21  and the objective function was  0.825 with Chi Square of  320
## The degrees of freedom for the model are 3  and the objective function was  0.001 
## 
## The root mean square of the residuals (RMSR) is  0.002 
## The df corrected root mean square of the residuals is  0.009 
## The number of observations was  392  with Chi Square =  0.222  with prob <  0.974 
## 
## Tucker Lewis Index of factoring reliability =  1.065
## RMSEA index =  0  and the 90 % confidence intervals are  NA NA
## BIC =  -17.69
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                  ML2   ML1   ML3
## Correlation of scores with factors             0.798 0.836 0.813
## Multiple R square of scores with factors       0.636 0.699 0.661
## Minimum correlation of possible factor scores  0.273 0.399 0.322

Now the version based on polychoric correlations.

fa.result <- fa(result.het$correlations, nfactors = 3, n.obs = 392, fm = "ml", 
    rotate = "promax")
print(fa.result, sort = TRUE, digits = 3)
## Factor Analysis using method =  ml
## Call: fa(r = result.het$correlations, nfactors = 3, n.obs = 392, rotate = "promax", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##             item    ML2    ML3    ML1    h2    u2
## Technology     5  0.700 -0.049 -0.047 0.485 0.515
## Environment    2  0.674 -0.009 -0.089 0.443 0.557
## Industry       6  0.630  0.065  0.055 0.421 0.579
## Future         4  0.016  0.832 -0.009 0.686 0.314
## Work           3 -0.090  0.633 -0.090 0.351 0.649
## Benefit        7 -0.075 -0.050  0.895 0.736 0.264
## Comfort        1  0.143  0.213  0.352 0.293 0.707
## 
##                         ML2   ML3   ML1
## SS loadings           1.364 1.132 0.919
## Proportion Var        0.195 0.162 0.131
## Cumulative Var        0.195 0.357 0.488
## Proportion Explained  0.399 0.331 0.269
## Cumulative Proportion 0.399 0.731 1.000
## 
##  With factor correlations of 
##       ML2   ML3   ML1
## ML2 1.000 0.025 0.160
## ML3 0.025 1.000 0.573
## ML1 0.160 0.573 1.000
## 
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  21  and the objective function was  1.249 with Chi Square of  484.3
## The degrees of freedom for the model are 3  and the objective function was  0.001 
## 
## The root mean square of the residuals (RMSR) is  0.002 
## The df corrected root mean square of the residuals is  0.009 
## The number of observations was  392  with Chi Square =  0.287  with prob <  0.963 
## 
## Tucker Lewis Index of factoring reliability =  1.041
## RMSEA index =  0  and the 90 % confidence intervals are  NA NA
## BIC =  -17.63
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                  ML2   ML3   ML1
## Correlation of scores with factors             0.845 0.872 0.881
## Multiple R square of scores with factors       0.714 0.761 0.776
## Minimum correlation of possible factor scores  0.428 0.522 0.553

Again the results do not differ greatly, but the latter yields somewhat larger loadings; in the former they were slightly "shrunk".
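
If you want to see this "shrinkage" directly, keep in mind that the factor labels need not correspond across the two solutions, so check the correspondence first (for example with psych's factor.congruence) and then compare the loadings. A minimal sketch that sidesteps the labeling issue by comparing each item's largest absolute loading:

# Which factor in one solution corresponds to which in the other?
factor.congruence(fa.result.n$loadings, fa.result$loadings)

# Difference in each item's largest absolute loading (polychoric minus Pearson);
# positive values indicate the polychoric solution loads more strongly
round(apply(abs(unclass(fa.result$loadings)), 1, max) -
      apply(abs(unclass(fa.result.n$loadings)), 1, max), 3)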

The latter is how categorical factor analysis is done, although having to compute the polychoric correlations separately with hetcor first is a bit awkward. One natural route, then, is to take this factor-analysis result and move on to IRT (the graded response model, GRM). Recently the psych package gained a function called irt.fa that runs the factor analysis from polychoric correlations directly. The variables must be passed as numeric (not as factors), and it is used as follows.

result.irtFa <- irt.fa(Science.n, nfactors = 3, fm = "ml")
## Loading required package: GPArotation

(Plots produced by irt.fa, one for each of the three factors.)

result.irtFa$fa
## Factor Analysis using method =  ml
## Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##               ML2   ML3   ML1   h2   u2
## Comfort      0.17  0.24  0.33 0.28 0.72
## Environment  0.67 -0.02 -0.06 0.44 0.56
## Work        -0.09  0.61 -0.06 0.35 0.65
## Future       0.02  0.82  0.02 0.70 0.30
## Technology   0.69 -0.05 -0.02 0.49 0.51
## Industry     0.63  0.07  0.07 0.42 0.58
## Benefit     -0.02 -0.01  0.90 0.80 0.20
## 
##                        ML2  ML3  ML1
## SS loadings           1.37 1.14 0.96
## Proportion Var        0.20 0.16 0.14
## Cumulative Var        0.20 0.36 0.50
## Proportion Explained  0.39 0.33 0.28
## Cumulative Proportion 0.39 0.72 1.00
## 
##  With factor correlations of 
##      ML2  ML3  ML1
## ML2 1.00 0.00 0.06
## ML3 0.00 1.00 0.49
## ML1 0.06 0.49 1.00
## 
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  21  and the objective function was  1.25 with Chi Square of  484.1
## The degrees of freedom for the model are 3  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  0.01 
## The number of observations was  392  with Chi Square =  0.28  with prob <  0.96 
## 
## Tucker Lewis Index of factoring reliability =  1.041
## RMSEA index =  0  and the 90 % confidence intervals are  NA NA
## BIC =  -17.64
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                 ML2  ML3  ML1
## Correlation of scores with factors             0.84 0.87 0.90
## Multiple R square of scores with factors       0.71 0.76 0.81
## Minimum correlation of possible factor scores  0.43 0.52 0.63

To see the full set of results, call print with the short = FALSE option.

print(result.irtFa, short = FALSE)
## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = Science.n, nfactors = 3, fm = "ml")
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##                -3   -2   -1    0    1     2     3
## Environment  0.27 0.41 0.49 0.47 0.36  0.22  0.11
## Technology   0.33 0.46 0.52 0.49 0.38  0.22  0.11
## Industry     0.31 0.39 0.41 0.36 0.26  0.16  0.08
## Test Info    0.91 1.26 1.41 1.32 1.00  0.60  0.30
## SEM          1.05 0.89 0.84 0.87 1.00  1.29  1.82
## Reliability -0.10 0.21 0.29 0.24 0.00 -0.66 -2.31
## 
##  Factor =  2 
##                -3   -2   -1     0     1     2     3
## Work         0.20 0.29 0.35  0.35  0.30  0.23  0.15
## Future       0.69 0.72 0.65  0.51  0.58  0.41  0.14
## Test Info    0.89 1.01 1.00  0.86  0.89  0.64  0.29
## SEM          1.06 0.99 1.00  1.08  1.06  1.25  1.85
## Reliability -0.12 0.01 0.00 -0.17 -0.13 -0.57 -2.44
## 
##  Factor =  3 
##                -3    -2   -1     0     1    2     3
## Comfort      0.08  0.09 0.09  0.09  0.08 0.07  0.06
## Benefit      0.78  0.65 1.04  0.43  0.53 1.03  0.37
## Test Info    0.86  0.74 1.13  0.51  0.61 1.10  0.43
## SEM          1.08  1.16 0.94  1.40  1.28 0.95  1.53
## Reliability -0.17 -0.35 0.12 -0.95 -0.63 0.09 -1.33
## 
## Item discrimination and location for factor  ML2 
##             discrimination location.1 location.2 location.3
## Comfort               0.17      -2.27      -1.33       0.76
## Environment           0.89      -1.94      -0.69       0.60
## Work                 -0.09      -1.38      -0.43       1.08
## Future                0.02      -1.80      -0.77       0.69
## Technology            0.97      -2.34      -0.82       0.64
## Industry              0.82      -2.52      -1.36       0.28
## Benefit              -0.02      -1.61      -0.50       0.85
## 
## Item discrimination and location for factor  ML3 
##             discrimination location.1 location.2 location.3
## Comfort               0.25      -2.30      -1.35       0.77
## Environment          -0.02      -1.45      -0.51       0.45
## Work                  0.76      -1.73      -0.54       1.36
## Future                1.45      -3.18      -1.37       1.22
## Technology           -0.05      -1.69      -0.59       0.46
## Industry              0.07      -1.96      -1.06       0.22
## Benefit              -0.01      -1.61      -0.50       0.85
## 
## Item discrimination and location for factor  ML1 
##             discrimination location.1 location.2 location.3
## Comfort               0.35      -2.37      -1.39       0.79
## Environment          -0.06      -1.45      -0.52       0.45
## Work                 -0.06      -1.38      -0.43       1.08
## Future                0.02      -1.80      -0.77       0.69
## Technology           -0.02      -1.69      -0.59       0.46
## Industry              0.07      -1.96      -1.06       0.22
## Benefit               2.03      -3.65      -1.13       1.92
## 
##  These parameters were based on the following factor analysis
## Factor Analysis using method =  ml
## Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##               ML2   ML3   ML1   h2   u2
## Comfort      0.17  0.24  0.33 0.28 0.72
## Environment  0.67 -0.02 -0.06 0.44 0.56
## Work        -0.09  0.61 -0.06 0.35 0.65
## Future       0.02  0.82  0.02 0.70 0.30
## Technology   0.69 -0.05 -0.02 0.49 0.51
## Industry     0.63  0.07  0.07 0.42 0.58
## Benefit     -0.02 -0.01  0.90 0.80 0.20
## 
##                        ML2  ML3  ML1
## SS loadings           1.37 1.14 0.96
## Proportion Var        0.20 0.16 0.14
## Cumulative Var        0.20 0.36 0.50
## Proportion Explained  0.39 0.33 0.28
## Cumulative Proportion 0.39 0.72 1.00
## 
##  With factor correlations of 
##      ML2  ML3  ML1
## ML2 1.00 0.00 0.06
## ML3 0.00 1.00 0.49
## ML1 0.06 0.49 1.00
## 
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  21  and the objective function was  1.25 with Chi Square of  484.1
## The degrees of freedom for the model are 3  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  0.01 
## The number of observations was  392  with Chi Square =  0.28  with prob <  0.96 
## 
## Tucker Lewis Index of factoring reliability =  1.041
## RMSEA index =  0  and the 90 % confidence intervals are  NA NA
## BIC =  -17.64
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                 ML2  ML3  ML1
## Correlation of scores with factors             0.84 0.87 0.90
## Multiple R square of scores with factors       0.71 0.76 0.81
## Minimum correlation of possible factor scores  0.43 0.52 0.63

There appear to be various other things it can do, but some of the function's behavior is still a little shaky.
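
For example, the object returned by irt.fa has a plot method; assuming your version of psych supports the type argument for it, something like the following should draw the item and test curves:

# Sketch, assuming the plot method for irt.fa objects accepts these types
plot(result.irtFa, type = "ICC")   # item characteristic curves
plot(result.irtFa, type = "test")  # test information curves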

To fit an IRT (GRM) model, it may be more reliable to use the dedicated ltm package, which has been around longer. Since the first factor here consists of Technology, Environment, and Industry, the second of Future and Work, and the third of Benefit and Comfort, we can use the grm function in the ltm package as follows:

grm(ltm::Science[c(5, 2, 6)])  #Factor 1
## 
## Call:
## grm(data = ltm::Science[c(5, 2, 6)])
## 
## Coefficients:
##              Extrmt1  Extrmt2  Extrmt3  Dscrmn
## Technology    -2.386   -0.855    0.624   1.751
## Environment   -2.112   -0.772    0.618   1.626
## Industry      -3.005   -1.589    0.301   1.522
## 
## Log.Lik: -1312
grm(ltm::Science[c(4, 3)])  #Factor 2
## 
## Call:
## grm(data = ltm::Science[c(4, 3)])
## 
## Coefficients:
##         Extrmt1  Extrmt2  Extrmt3  Dscrmn
## Future   -2.538   -1.063    0.942   1.817
## Work     -1.982   -0.610    1.547   1.685
## 
## Log.Lik: -856.8
grm(ltm::Science[c(7, 1)])  #Factor 3
## 
## Call:
## grm(data = ltm::Science[c(7, 1)])
## 
## Coefficients:
##          Extrmt1  Extrmt2  Extrmt3  Dscrmn
## Benefit   -1.879   -0.570    0.975   3.037
## Comfort   -4.851   -2.632    1.461   0.994
## 
## Log.Lik: -776.2

This fits a separate graded response model for each factor.
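
These fits can be explored a little further: ltm provides a plot method for the category response curves and a factor.scores method for ability estimates. A minimal sketch for the first factor (the object name grm.f1 is just for illustration):

# Refit the first factor, keeping the fitted object this time
grm.f1 <- grm(ltm::Science[c(5, 2, 6)])
plot(grm.f1)                           # category response curves for each item
head(factor.scores(grm.f1)$score.dat)  # ability estimates per response pattern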

The limitation is that this yields no inter-factor correlations, since IRT (GRM) is fundamentally a unidimensional model. The mirt function in the mirt package extends it to a multidimensional model. Pass it the numeric variables:

result.mirt <- mirt(Science.n, 3, itemtype = "graded", rotate = "promax")
summary(result.mirt)
## 
## Rotation:  promax 
## 
## Rotated factor loadings: 
## 
##                F_1    F_2    F_3    h2
## Comfort      0.293 -0.218  0.347 0.421
## Environment -0.113 -0.665 -0.039 0.429
## Work        -0.148  0.048  0.705 0.390
## Future       0.049  0.038  0.747 0.591
## Technology   0.077 -0.679 -0.141 0.449
## Industry    -0.069 -0.680  0.107 0.487
## Benefit      1.000  0.068  0.011 0.996
## 
## Rotated SS loadings:  1.133 1.422 1.207 
## 
## Factor correlations: 
## 
##        F_1    F_2    F_3
## F_1  1.000 -0.145  0.569
## F_2 -0.145  1.000 -0.212
## F_3  0.569 -0.212  1.000

and we obtain the result above.
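
From here you can also extract the estimated item parameters and person-level factor scores from the mirt fit. A minimal sketch (the exact output layout depends on your version of mirt):

# Item parameters (slopes and category intercepts) of the graded response model
coef(result.mirt)

# EAP factor scores on the three dimensions, one row per respondent
head(fscores(result.mirt))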

So, as you can see, the environment for doing properly categorical factor analysis is falling into place. I hope the standard practice changes accordingly.

As an aside, if you want to do this even more rigorously, there is also software called M-plus.