社会調査士のためのこれからの因子分析

ここでは，4件法の順序尺度水準で得られたデータを因子分析する例についてお話します。従来通りの因子分析の方法，段階反応モデルを使った方法，多次元IRTを使う方法，の三段階にわけて比較検討しながら進めてまいります。このお話の背景については，こちら(http://rpubs.com/kosugitti/3274)もご一読いただければ幸いです。 ※途中，エラーが出たりRそのもののアウトプットと異なる場所がありますが，おそらくknitrあたりの問題で，実際はちゃんと結果が出てたりします。ご自分の環境でやってみてください。 ※当方，R2.15.2,Rstudio0.97.246，MacOSXMountaiLion環境です。

今回は次のパッケージをご用意ください。

library(psych)
library(ltm)

## Loading required package: MASS

## Loading required package: msm

## Loading required package: mvtnorm

## Loading required package: polycor

## Loading required package: sfsmisc

## Attaching package: 'polycor'

## The following object(s) are masked from 'package:psych':
## 
## polyserial

## Attaching package: 'ltm'

## The following object(s) are masked from 'package:psych':
## 
## factor.scores

library(mirt)

## Loading required package: stats4

## Loading required package: lattice

## Attaching package: 'mirt'

## The following object(s) are masked from 'package:ltm':
## 
## Science

サンプルデータはltmパッケージにあるScienceデータを使います。4件法で7つの項目があります。

data(Science)
summary(Science)

##     Comfort          Work          Future        Benefit    
##  Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:3.00   1st Qu.:2.00   1st Qu.:3.00   1st Qu.:2.00  
##  Median :3.00   Median :3.00   Median :3.00   Median :3.00  
##  Mean   :3.12   Mean   :2.72   Mean   :2.99   Mean   :2.84  
##  3rd Qu.:3.00   3rd Qu.:3.00   3rd Qu.:3.00   3rd Qu.:3.00  
##  Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.00

従来はこうした4件法であっても，（無理矢理）間隔尺度水準とみなして分析していたわけです。すなわち，相関係数の出し方がピアソンの積率相関係数で，それに基づく因子分析だった。ちょっとやってみます。まず間隔尺度水準に置き換えます。

Science.n <- data.frame(data.matrix(Science))

積率相関係数とポリコリック相関係数を比較します。ポリコリック相関係数は，ltmパッケージが読み込むpolycorパッケージにあるhetcor関数で，変数の型にあった適切な相関係数を出してくれる関数。

print(cor(Science.n), digit = 2)

##         Comfort Work Future Benefit
## Comfort    1.00 0.15   0.28    0.33
## Work       0.15 1.00   0.40    0.17
## Future     0.28 0.40   1.00    0.31
## Benefit    0.33 0.17   0.31    1.00

result.het <- hetcor(Science)
print(result.het$correlations, digit = 2)

##         Comfort Work Future Benefit
## Comfort    1.00 0.15   0.28    0.33
## Work       0.15 1.00   0.40    0.17
## Future     0.28 0.40   1.00    0.31
## Benefit    0.33 0.17   0.31    1.00

結果をみると，ポリコリック相関係数のほうが数値が大きい。逆に言うと，順序尺度水準の変数を無理矢理間隔尺度水準とみなして積率相関係数を出すことは，不適切なモデル適用によって値が過小評価されちゃうことでもあるわけです。

で，普通の因子分析というのは，連続変数とみなしたデータに対して固有値分解し，因子数を決めたりするわけですね。因子数の決定には，psychパッケージのfa.parallel関数による並行分析がいいかも。

fa.parallel(Science.n)

plot of chunk unnamed-chunk-5

## Parallel analysis suggests that the number of factors =  2  and the number of components =  1

fa.parallel(result.het$correlations, n.obs = 392)

plot of chunk unnamed-chunk-5

## Parallel analysis suggests that the number of factors =  2  and the number of components =  1

まあ結果は変わらないんですけどね。

因子分析も連続変数と見なした古典的方法と，ポリコリック相関係数をつかった方法，両方で見てみましょう。まずは古典的方法から。

fa.result.n <- fa(Science.n, nfactors = 3, fm = "ml", rotate = "promax")

## In fa, too many factors requested for this number of variables to use SMC
## for communality estimates, 1s are used instead

print(fa.result.n, sort = T, digit = 3)

## Factor Analysis using method =  ml
## Call: fa(r = Science.n, nfactors = 3, rotate = "promax", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##         item    ML1   ML3   ML2    h2    u2
## Future     3  0.844 0.047 0.013 0.771 0.229
## Benefit    4  0.012 0.587 0.001 0.352 0.648
## Comfort    1 -0.010 0.569 0.000 0.317 0.683
## Work       2  0.004 0.001 0.800 0.644 0.356
## 
##                         ML1   ML3   ML2
## SS loadings           0.743 0.694 0.648
## Proportion Var        0.186 0.173 0.162
## Cumulative Var        0.186 0.359 0.521
## Proportion Explained  0.356 0.333 0.311
## Cumulative Proportion 0.356 0.689 1.000
## 
##  With factor correlations of 
##       ML1   ML3   ML2
## ML1 1.000 0.549 0.555
## ML3 0.549 1.000 0.339
## ML2 0.555 0.339 1.000
## 
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  6  and the objective function was  0.442 with Chi Square of  171.7
## The degrees of freedom for the model are -3  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The number of observations was  392  with Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.036
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                  ML1   ML3   ML2
## Correlation of scores with factors             0.886 0.761 0.825
## Multiple R square of scores with factors       0.785 0.580 0.681
## Minimum correlation of possible factor scores  0.570 0.159 0.362

ポリコリック相関係数を使った方。

fa.result <- fa(result.het$correlations, nfactors = 3, n.obs = 392, fm = "ml", 
    rotate = "promax")

## In fa, too many factors requested for this number of variables to use SMC
## for communality estimates, 1s are used instead

print(fa.result, sort = T, digit = 3)

## Factor Analysis using method =  ml
## Call: fa(r = result.het$correlations, nfactors = 3, n.obs = 392, rotate = "promax", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##         item    ML1   ML3   ML2    h2    u2
## Future     3  0.844 0.047 0.013 0.771 0.229
## Benefit    4  0.012 0.587 0.001 0.352 0.648
## Comfort    1 -0.010 0.569 0.000 0.317 0.683
## Work       2  0.004 0.001 0.800 0.644 0.356
## 
##                         ML1   ML3   ML2
## SS loadings           0.743 0.694 0.648
## Proportion Var        0.186 0.173 0.162
## Cumulative Var        0.186 0.359 0.521
## Proportion Explained  0.356 0.333 0.311
## Cumulative Proportion 0.356 0.689 1.000
## 
##  With factor correlations of 
##       ML1   ML3   ML2
## ML1 1.000 0.549 0.555
## ML3 0.549 1.000 0.339
## ML2 0.555 0.339 1.000
## 
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  6  and the objective function was  0.442 with Chi Square of  171.7
## The degrees of freedom for the model are -3  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The number of observations was  392  with Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.036
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                  ML1   ML3   ML2
## Correlation of scores with factors             0.886 0.761 0.825
## Multiple R square of scores with factors       0.785 0.580 0.681
## Minimum correlation of possible factor scores  0.570 0.159 0.362

これまた結果は大きく変わらないのですが，後者の方がやや大きい負荷量が算出されている。前者の方がやや「目減り」していたわけです。

後者がカテゴリカル因子分析のやり方ですが，いったんhetcor関数でポリコリック相関係数を算出させるあたりがちょっといやですね。ということで，ルートとしては，この因子分析の結果を受けてIRT(GRM)ということが考えられます。最近，psychパッケージにirt.faというポリコリック相関係数から因子分析できる関数が追加されました。変数は数値型（Factor型ではない）で渡す必要がありますが，次のようにします。

result.irtFa <- irt.fa(Science.n, nfactors = 3, fm = "ml")

## In fa, too many factors requested for this number of variables to use SMC
## for communality estimates, 1s are used instead

## Loading required package: GPArotation

plot of chunk unnamed-chunk-8

result.irtFa$fa

## Factor Analysis using method =  ml
## Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          ML2   ML1   ML3   h2   u2
## Comfort 0.64 -0.05  0.03 0.39 0.61
## Work    0.00  0.01  0.73 0.54 0.46
## Future  0.00  0.80  0.01 0.64 0.36
## Benefit 0.63  0.06 -0.03 0.43 0.57
## 
##                        ML2  ML1  ML3
## SS loadings           0.81 0.65 0.54
## Proportion Var        0.20 0.16 0.13
## Cumulative Var        0.20 0.37 0.50
## Proportion Explained  0.41 0.32 0.27
## Cumulative Proportion 0.41 0.73 1.00
## 
##  With factor correlations of 
##      ML2  ML1  ML3
## ML2 1.00 0.70 0.43
## ML1 0.70 1.00 0.81
## ML3 0.43 0.81 1.00
## 
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  6  and the objective function was  0.65 with Chi Square of  253.2
## The degrees of freedom for the model are -3  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The number of observations was  392  with Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.024
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                 ML2  ML1  ML3
## Correlation of scores with factors             0.81 0.86 0.81
## Multiple R square of scores with factors       0.65 0.74 0.65
## Minimum correlation of possible factor scores  0.30 0.48 0.31

結果の全体像をみるには，print関数でshort=Fオプションをつけましょう。

print(result.irtFa, short = FALSE)

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = Science.n, nfactors = 3, fm = "ml")
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##                -3    -2    -1     0     1     2     3
## Comfort      0.33  0.37  0.35  0.31  0.26  0.19  0.11
## Benefit      0.24  0.33  0.38  0.37  0.32  0.23  0.14
## Test Info    0.57  0.70  0.73  0.68  0.58  0.41  0.25
## SEM          1.33  1.19  1.17  1.21  1.32  1.56  2.02
## Reliability -0.76 -0.42 -0.36 -0.46 -0.74 -1.42 -3.07
## 
##  Factor =  2

## Error: 'x' must be an array of at least two dimensions

まあ他にも色々できるようですが，まだちょっと関数の挙動が怪しかったりします。

IRT(GRM)をやるには，以前からある専門のltmパッケージ使った方が確実かも。今回第一因子はTechnology,Environment,Industryからなり，第二因子はFuture,Work，第三因子はBenefit,Comfortからなるわけですから，ltmパッケージのgrm関数を使って，

grm(Science[c(5, 2, 6)])  #Factor 1

## Error: undefined columns selected

grm(Science[c(4, 3)])  #Factor 2

## 
## Call:
## grm(data = Science[c(4, 3)])
## 
## Coefficients:
##          Extrmt1  Extrmt2  Extrmt3  Dscrmn
## Benefit   -2.118   -0.652    1.097   2.079
## Future    -3.533   -1.436    1.276   1.076
## 
## Log.Lik: -872.9

grm(Science[c(7, 1)])  #Factor 3

## Error: undefined columns selected

とすることができます。

もっとも，これだと因子間相関がでないですね。IRT(GRM)は基本的に単因子モデルですから。これを多次元モデルにするのがmirtパッケージのmirt関数です。数値型の変数を渡して，

result.mirt <- mirt(Science.n, 3, itemtype = "graded", rotate = "promax")
summary(result.mirt)

## 
## Rotation:  promax 
## 
## Rotated factor loadings: 
## 
##            F_1    F_2   F_3    h2
## Comfort  0.027  0.012 0.567 0.338
## Work     0.001 -0.787 0.009 0.627
## Future   0.496 -0.127 0.184 0.543
## Benefit -0.020 -0.012 0.680 0.450
## 
## Rotated SS loadings:  0.247 0.635 0.817 
## 
## Factor correlations: 
## 
##        F_1    F_2    F_3
## F_1  1.000 -0.762  0.713
## F_2 -0.762  1.000 -0.435
## F_3  0.713 -0.435  1.000

となります（警告が出るのでちょっと推定がうまくいってないかもです）とまあ，このようにかなりカテゴリカルな因子分析ができる環境が整ってきているわけです。 ltmは安定した答えを出してくれますが，psychパッケージやmirtパッケージはまだちょっと不十分かな，とおもうところがなくはありません。その辺は今後に期待ということで。

余談ですが，もっとちゃんとしたいという人は，M-plusというソフトがありますよ。