主成分分析 (Principal Components Analysis)

主成分分析是一種通過降維技術把多個變數化成少數幾個主成分的方法。這些主成分能夠反映原始變數的絕大部分資訊，它們通常表示為原始變數的線性組合。

主成分分析

利用主成分迴歸方法建立銷售量 y 與 4 個變數的迴歸方程。可以透過 princomp 函式進行分析，

# the princomp prototype
# x: data as data.frame
# cor:
# |- True: 使用樣本的相關矩陣(Correlation Matrix)做主成分分析
# |- False: 使用樣本的共變異數矩陣(Covariance Matrix)做主成分分析
# covmat: 協方差陣
princomp(x, cor=False, scores=True, covmat=NuLL, ...)

cons.pr <- princomp(~x1+x2+x3+x4,data=cons[1:10,1:4], cor=TRUE)

# it is also simplified as the following script
#cons.pr <- princomp(cons[1:10,1:4], cor=TRUE)

# 列出主成分對應於原始變數 x1, x2, x3, x4 的係數
summary(cons.pr)

## Importance of components:
##                           Comp.1      Comp.2     Comp.3       Comp.4
## Standard deviation     1.9859037 0.199906992 0.11218966 0.0603085506
## Proportion of Variance 0.9859534 0.009990701 0.00314663 0.0009092803
## Cumulative Proportion  0.9859534 0.995944090 0.99909072 1.0000000000

顯示主成分分析或因數分析中 loadings 的內容，在主成分分析中，該內容便是主成分對應的各變數的組成，及正交矩陣。

loadings(cons.pr)

## 
## Loadings:
##    Comp.1 Comp.2 Comp.3 Comp.4
## x1 -0.502 -0.237  0.579  0.598
## x2 -0.500  0.493 -0.610  0.367
## x3 -0.498 -0.707 -0.368 -0.342
## x4 -0.501  0.449  0.396 -0.626
## 
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00

由於第一主成分的方差貢獻已達到 98.6%，因此其餘主成分可以除去，已達到降維目的，故公式為

\[z_1^{*}=-0.502*x1^{*}-0.5*x2^{*}-0.498*x3^{*}-0.501*x4^{*}\]

由各項主成分看出 x1 為主要因數，即收入因素。

預測主成分的值。

pred = predict(cons.pr)
pred

##         Comp.1      Comp.2      Comp.3       Comp.4
## 1   2.74300221  0.21717213  0.04664610  0.092431543
## 2   2.26912529  0.15180493  0.05701818 -0.094266513
## 3   1.66536864  0.11683915 -0.03013748  0.045923725
## 4   1.55844739 -0.27262184  0.10173123 -0.064513216
## 5   0.53961155 -0.04080947 -0.12051901 -0.011795871
## 6  -0.04403012 -0.33989164 -0.09182299 -0.003917754
## 7  -0.96230840 -0.21126360 -0.04524573  0.064368428
## 8  -2.22802489  0.22379838 -0.19645929 -0.020173666
## 9  -2.65380615  0.17110117  0.08143164 -0.067009629
## 10 -2.88738552 -0.01612921  0.19735735  0.058952953

畫出碎石圖

# the screeplot prototype
# x: data calculated by princomp
# npcs: 主成分個數
# type: 繪圖樣式
screeplot(x, npcs=min(10,length(x$dev)), type=c("barplot","lines"), main=deparse(substitute(x)), ...)

screeplot(cons.pr, type="lines")

畫出第 1 及第 2 主成分樣本的散布圖

畫出關於主成分的散點圖和原座標在主成分下的方向

biplot(cons.pr, choices = 1:2)

做主成分迴歸

cons$z1 = pred[,1]

# the lm prototype
# lm(formula = y~z1, data = data)
lm.sol = lm(y~z1, data=cons)

summary(lm.sol)

## 
## Call:
## lm(formula = y ~ z1, data = cons)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72237 -0.20946  0.05154  0.21032  0.81856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.03000    0.16615   84.44 4.32e-13 ***
## z1          -2.06119    0.08367  -24.64 7.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5254 on 8 degrees of freedom
## Multiple R-squared:  0.987,  Adjusted R-squared:  0.9854 
## F-statistic: 606.9 on 1 and 8 DF,  p-value: 7.873e-09

由上可看出，迴歸係數與迴歸方程均通過檢驗，且效果顯著，可以得到回應變數與主成分的關係，即

\[y = 14.03 - 2.06*z_1^{*}\]

轉換成原變數的迴歸方程

因得到回應變數與主成分的迴歸方程並不容易使用，故須轉換成原變數的迴歸方程，而基本公式推導如下:

\[ y=\beta_0^* + \beta_1^*Z_1^* \\ Z_1^*=a_{11}X_1^*+a_{12}X_2^*+a_{13}X_3^*+a_{14}X_4^* \\ =\frac{a_11(x_1-\bar{x_1})}{\sqrt{s_{11}}} + \frac{a_12(x_2-\bar{x_2})}{\sqrt{s_{12}}} + ... \\ \\ \therefore \\ \\ \beta_0 = \beta_0^* - \beta_1^*(\frac{a_{11}\bar{x_1}}{\sqrt{s_{11}}} + \frac{a_{12}\bar{x_2}}{\sqrt{s_{12}}} + ...)\\ \beta_i = \frac{\beta_1^*a_{1i}}{\sqrt{S_{ii}}} \]

# 取得迴歸係數
beta = coef(lm.sol)

# 取得主成分對應的特徵向量
feat = loadings(cons.pr)

# 取得資料中心的均值
x.bar = cons.pr$center

# 取得資料的標準差
x.sd = cons.pr$scale

# 求係數, beta1 ~ beta4, 可參考上述公式
coeff = (beta[2] * feat[,1]) / x.sd

# 求 beta0 係數 (即斜率), 可參考上述公式
beta0 <- beta[1] - sum(x.bar*coeff)

# 合併結果
c(beta0, coeff)

##  (Intercept)           x1           x2           x3           x4 
## -23.77771861   0.02992643   0.13365158   0.08361156   0.16965187

由上結果建立的迴歸方程為

\[y=-23.78+0.03*x1+0.13*x2+0.08*x3+0.17*x4\]

主成分分析 (Principal Components Analysis)

JianKai Wang (https://sophia.ddns.net/)

2018年3月13日

原始資料

主成分分析