Path Analysis

今日は パス解析 です。

パス解析のパス，は通路という意味があるように，X→Y→Zという一連の連なった関係をみる分析方法です。

今回はサンプルデータを使います。Moodleにあるサンプルデータをプロジェクトフォルダにダウンロードして，次のように読み込んでみてください。

sports <- read.csv("sports.csv",head=T)
summary(sports)

##      Run50M          Run500M          Run1500M         Kensui     
##  Min.   : 4.444   Min.   : 87.33   Min.   :211.2   Min.   :16.00  
##  1st Qu.: 7.286   1st Qu.:114.11   1st Qu.:232.2   1st Qu.:27.00  
##  Median : 7.999   Median :120.09   Median :239.1   Median :30.00  
##  Mean   : 7.970   Mean   :120.23   Mean   :239.6   Mean   :30.12  
##  3rd Qu.: 8.673   3rd Qu.:126.59   3rd Qu.:246.6   3rd Qu.:33.00  
##  Max.   :11.330   Max.   :162.59   Max.   :269.8   Max.   :45.00  
##      Entoh          Udetate         Hanpuku         Kiritsu     
##  Min.   :15.00   Min.   : 2.00   Min.   : 1.00   Min.   :1.774  
##  1st Qu.:27.00   1st Qu.:28.00   1st Qu.:22.00   1st Qu.:4.257  
##  Median :30.00   Median :34.00   Median :29.00   Median :4.861  
##  Mean   :30.23   Mean   :34.12   Mean   :28.72   Mean   :4.894  
##  3rd Qu.:33.00   3rd Qu.:41.00   3rd Qu.:35.00   3rd Qu.:5.571  
##  Max.   :44.00   Max.   :58.00   Max.   :53.00   Max.   :7.765  
##     Habatobi           Hit            Homerun          Choda      
##  Min.   : 7.590   Min.   : 62.00   Min.   :14.00   Min.   : 6.00  
##  1st Qu.: 9.260   1st Qu.: 93.00   1st Qu.:19.00   1st Qu.:29.00  
##  Median : 9.910   Median :100.00   Median :20.00   Median :35.00  
##  Mean   : 9.898   Mean   : 99.98   Mean   :20.14   Mean   :35.25  
##  3rd Qu.:10.605   3rd Qu.:107.00   3rd Qu.:22.00   3rd Qu.:42.00  
##  Max.   :12.900   Max.   :130.00   Max.   :28.00   Max.   :77.00

野球選手の50m走，500m走，1500m走，懸垂，遠投，腕立て伏せ，反復横とび，起立（素早く立ち上がる），幅跳び，ヒットの本数，ホームラン数，長打数，の仮想データです。

単位に左右されたくないので，標準化しておきます。

sports.z <- as.data.frame(scale(sports))

ストーリー(仮説)を考える

パス解析をするにはまず，ストーリーを考える必要があります。ここでは，

1.50mを素早く走り抜ける力が，500m走の記録の根本である 2.500mが速ければ，1500mも速い 3.1500m走が速い人は,総じてヒットの数が多い

という筋を考えます。この筋は仮説や理路とも呼ばれますが，この順に影響していくということを考えるのです。

回帰分析によるパス解析

XがYの原因になっている，という説明をするには，回帰分析を使うのでした。まず1を考えます。

result.1 <- lm(Run500M~Run50M,data=sports.z)
summary(result.1)

## 
## Call:
## lm(formula = Run500M ~ Run50M, data = sports.z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8889 -0.4655 -0.0078  0.4076  2.7656 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.553e-16  3.054e-02    0.00        1    
## Run50M       7.311e-01  3.057e-02   23.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6829 on 498 degrees of freedom
## Multiple R-squared:  0.5346, Adjusted R-squared:  0.5336 
## F-statistic: 571.9 on 1 and 498 DF,  p-value: < 2.2e-16

統計的に有意な関係が認められました。R^2も中程度です。では次に2.のモデルを考えます。

result.2 <- lm(Run1500M~Run500M,data=sports.z)
summary(result.2)

## 
## Call:
## lm(formula = Run1500M ~ Run500M, data = sports.z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.89108 -0.62879 -0.00504  0.58566  2.14236 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.101e-15  3.593e-02    0.00        1    
## Run500M     5.966e-01  3.596e-02   16.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8034 on 498 degrees of freedom
## Multiple R-squared:  0.3559, Adjusted R-squared:  0.3546 
## F-statistic: 275.2 on 1 and 498 DF,  p-value: < 2.2e-16

これもまあなんとかなったようです。最後に仮説3を考えます。

result.3 <- lm(Hit~Run1500M,data=sports.z)
summary(result.3)

## 
## Call:
## lm(formula = Hit ~ Run1500M, data = sports.z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8102 -0.7004 -0.0257  0.6936  2.9846 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 4.646e-16  4.454e-02   0.000   1.0000  
## Run1500M    1.001e-01  4.459e-02   2.245   0.0252 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.996 on 498 degrees of freedom
## Multiple R-squared:  0.01002,    Adjusted R-squared:  0.008028 
## F-statistic: 5.038 on 1 and 498 DF,  p-value: 0.02523

有意なパスはありましたが，R^2がずいぶん小さくなってしまいました。ともあれ，こうしてそれぞれの理路をチェックすることができました。

ここで，直接効果 と 間接効果 という用語を説明します。直接効果とはそのものズバリ，直接的な関係にある効果のことで，50m走は直接500m走に影響しているのでした（影響は標準化解で0.7311です)。

間接効果とは例えば，50m走の結果が500m走を介して，1500m走の記録に影響する大きさです。500→1500の影響は0.5966でしたから，間接効果はこの二つを掛け合わせて，

0.7311 * 0.5966

## [1] 0.4361743

であることがわかります。

SEMによるモデリング

さて，こうした(重)回帰分析の繰り返しによるパス解析は，少なくとも二つの問題があります。それは

同じデータを何度も重複して使っていること
適合度が部分部分でしかわからない（全体的にどうなのかわからない）こと

です。一つのデータを水増しすることなく，一度の分析で全体的な適合度がわかればいいですね。そのために，SEMを使うことができます。パス解析もまた，SEMの下位モデルのひとつなのです。

library(lavaan)

## This is lavaan 0.5-17
## lavaan is BETA software! Please report any bugs.

path.1 <- '
Run500M ~ Run50M
Run1500M ~ Run500M
Hit ~ Run1500M
'
fit.1 <- sem(path.1,data=sports)
summary(fit.1,standardized=T,fit.measures=T)

## lavaan (0.5-17) converged normally after  24 iterations
## 
##   Number of observations                           500
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               71.448
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              678.815
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.898
##   Tucker-Lewis Index (TLI)                       0.797
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -6013.949
##   Loglikelihood unrestricted model (H1)      -5978.225
## 
##   Number of free parameters                          6
##   Akaike (AIC)                               12039.898
##   Bayesian (BIC)                             12065.186
##   Sample-size adjusted Bayesian (BIC)        12046.141
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.214
##   90 Percent Confidence Interval          0.172  0.258
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.066
## 
## Parameter estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
##                    Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
## Regressions:
##   Run500M ~
##     Run50M            6.803    0.284   23.963    0.000    6.803    0.731
##   Run1500M ~
##     Run500M           0.627    0.038   16.622    0.000    0.627    0.597
##   Hit ~
##     Run1500M          0.094    0.042    2.249    0.025    0.094    0.100
## 
## Variances:
##     Run500M          45.234    2.861                     45.234    0.465
##     Run1500M         69.132    4.372                     69.132    0.644
##     Hit              93.785    5.931                     93.785    0.990

このやり方で，上で述べた問題点がいずれも解決されたことになります。

潜在変数を用いたモデリング

SEMでは因子のような，目に見えない変数＝潜在変数を想定したモデリングが可能なのでした。そこで，50,500,1500m走のデータはすべて脚力に関係していると考え，脚力という潜在変数を用いたモデリングをしてみます。

path.2 <- '
Ashi =~ Run50M + Run500M + Run1500M
Hit ~ Ashi
'
fit.2 <- sem(path.2,data=sports)
summary(fit.2,standardized=T,fit.measures=T)

## lavaan (0.5-17) converged normally after  28 iterations
## 
##   Number of observations                           500
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                0.240
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.887
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic              678.815
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.008
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -5978.345
##   Loglikelihood unrestricted model (H1)      -5978.225
## 
##   Number of free parameters                          8
##   Akaike (AIC)                               11972.689
##   Bayesian (BIC)                             12006.406
##   Sample-size adjusted Bayesian (BIC)        11981.014
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.042
##   P-value RMSEA <= 0.05                          0.965
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.004
## 
## Parameter estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
##                    Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
## Latent variables:
##   Ashi =~
##     Run50M            1.000                               0.933    0.881
##     Run500M           8.769    0.468   18.755    0.000    8.182    0.830
##     Run1500M          7.991    0.475   16.825    0.000    7.455    0.720
## 
## Regressions:
##   Hit ~
##     Ashi              1.266    0.498    2.540    0.011    1.181    0.121
## 
## Variances:
##     Run50M            0.252    0.038                      0.252    0.224
##     Run500M          30.247    3.306                     30.247    0.311
##     Run1500M         51.749    3.970                     51.749    0.482
##     Hit              93.339    5.916                     93.339    0.985
##     Ashi              0.871    0.078                      1.000    1.000

また，野球選手の能力を安打数だけではなく，他の二つも含めてみたいと思います。

path.3 <- '
Ashi =~ Run50M + Run500M + Run1500M
Perform =~ Hit + Homerun + Choda
Perform ~ Ashi
'
fit.3 <- sem(path.3,data=sports)
summary(fit.3,standardized=T,fit.measures=T)

## lavaan (0.5-17) converged normally after  80 iterations
## 
##   Number of observations                           500
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                5.463
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.707
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic             1279.444
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.004
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -8585.191
##   Loglikelihood unrestricted model (H1)      -8582.459
## 
##   Number of free parameters                         13
##   Akaike (AIC)                               17196.382
##   Bayesian (BIC)                             17251.172
##   Sample-size adjusted Bayesian (BIC)        17209.909
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.040
##   P-value RMSEA <= 0.05                          0.984
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.018
## 
## Parameter estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
##                    Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all
## Latent variables:
##   Ashi =~
##     Run50M            1.000                               0.935    0.883
##     Run500M           8.737    0.466   18.765    0.000    8.169    0.829
##     Run1500M          7.964    0.474   16.812    0.000    7.446    0.719
##   Perform =~
##     Hit               1.000                               7.242    0.744
##     Homerun           0.189    0.012   15.128    0.000    1.369    0.700
##     Choda             1.267    0.080   15.842    0.000    9.174    0.919
## 
## Regressions:
##   Perform ~
##     Ashi              1.167    0.397    2.940    0.003    0.151    0.151
## 
## Variances:
##     Run50M            0.248    0.038                      0.248    0.221
##     Run500M          30.455    3.305                     30.455    0.313
##     Run1500M         51.891    3.974                     51.891    0.483
##     Hit              42.290    3.674                     42.290    0.446
##     Homerun           1.949    0.153                      1.949    0.510
##     Choda            15.471    4.148                     15.471    0.155
##     Ashi              0.874    0.078                      1.000    1.000
##     Perform          51.253    5.794                      0.977    0.977

このように，潜在変数をもちいてより複雑なモデルを組むことが可能です。ちなみに，脚力やパフォーマンスを説明するモデル(=~で書くところ)は 測定方程式 ，潜在変数同士の関係を説明するモデルは 構造方程式 と読んで区別されることがあります。

　本日の課題

サンプルデータを使って，潜在変数を含まないパス解析モデル，潜在変数を含んだモデルをそれぞれ提案してください。適合度は高い程よい。