1. 회귀분석

1.1 회귀분석 개요

회귀분석은 변수들 간의 관계를 설명하는 통계적 모형을 생성하는 기법

회귀분석의 종류.
  • 단순회귀 - 독립변수가 1개 ex) 단순회귀분석
    현실 세계에서 요인과 요인의 1대1 관계는 흔하지 않지만, 모델의 간명성 때문에 종종 사용함.
  • 다중회귀 - 독립변수가 2개 이상
  • 이항회귀 - 종속변수가 2개 ex) 로지스틱 이항 회귀분석
  • 다항회귀 - 종속변수가 3개 이상 ex) 로지스틱 다항 회귀분석

1.2 선형회귀분석

  • 참값과 예측값의 차로 정의되는 오차를 최소화하는 회귀 방정식을 도출
  • 오차는 잔차라고도 한다.
  • SST(Total Sum of Squares)=SSR(Sum of Squared Residuals)+SSE(Sum of Sqaured Errors)
  • 회귀식의 설명력은 결정계수 (R^2)를 사용한다. 결정계수=SSR/SST
  • 결정계수가 높다고 해서 항상 좋은 것은 아니다. (다중공선성 문제 발생 가능)
  • 잔차의 독립성 : 더빈-왓슨 / 다중공선성 - VIF(2.5보다 크면 위험)

1.2.1 데이터 준비

autopart=read.csv("data/autoparts.csv") # 자동차 부품에 대한 데이터 

1.2.2 데이터 확인

head(autopart) # 6개의 레코드만 출력 
##       prod_no fix_time a_speed b_speed separation s_separation rate_terms
## 1 90784-76001     85.5   0.611   1.715      242.0        657.6         95
## 2 90784-76001     86.2   0.606   1.708      244.7        657.1         95
## 3 90784-76001     86.0   0.609   1.715      242.7        657.5         95
## 4 90784-76001     86.1   0.610   1.718      241.9        657.3         95
## 5 90784-76001     86.1   0.603   1.704      242.5        657.3         95
## 6 90784-76001     86.3   0.606   1.707      244.5        656.9         95
##    mpa load_time highpressure_time c_thickness
## 1 78.2      18.1                58        24.7
## 2 77.9      18.2                58        22.5
## 3 78.0      18.1                82        24.1
## 4 78.2      18.1                74        25.1
## 5 77.9      18.2                56        24.5
## 6 77.9      18.0                78        22.9

* 데이터 분리

데이터에서 특정 조건을 만족하는 데이터만 추출하기.

특정 컬럼만 추출

autopart1=autopart[autopart$prod_no=="90784-76001", c(2:11)]
head(autopart1)
##   fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 1     85.5   0.611   1.715      242.0        657.6         95 78.2
## 2     86.2   0.606   1.708      244.7        657.1         95 77.9
## 3     86.0   0.609   1.715      242.7        657.5         95 78.0
## 4     86.1   0.610   1.718      241.9        657.3         95 78.2
## 5     86.1   0.603   1.704      242.5        657.3         95 77.9
## 6     86.3   0.606   1.707      244.5        656.9         95 77.9
##   load_time highpressure_time c_thickness
## 1      18.1                58        24.7
## 2      18.2                58        22.5
## 3      18.1                82        24.1
## 4      18.1                74        25.1
## 5      18.2                56        24.5
## 6      18.0                78        22.9

전체 컬럼 추출

autopartfull=autopart[autopart$prod_no=="90784-76001",]
head(autopartfull)
##       prod_no fix_time a_speed b_speed separation s_separation rate_terms
## 1 90784-76001     85.5   0.611   1.715      242.0        657.6         95
## 2 90784-76001     86.2   0.606   1.708      244.7        657.1         95
## 3 90784-76001     86.0   0.609   1.715      242.7        657.5         95
## 4 90784-76001     86.1   0.610   1.718      241.9        657.3         95
## 5 90784-76001     86.1   0.603   1.704      242.5        657.3         95
## 6 90784-76001     86.3   0.606   1.707      244.5        656.9         95
##    mpa load_time highpressure_time c_thickness
## 1 78.2      18.1                58        24.7
## 2 77.9      18.2                58        22.5
## 3 78.0      18.1                82        24.1
## 4 78.2      18.1                74        25.1
## 5 77.9      18.2                56        24.5
## 6 77.9      18.0                78        22.9

1.2.3 결측치 확인

sum(is.na(autopart))
## [1] 0

결측치가 없음을 확인.

1.2.4 complete.cases() 이용하여 결측치가 포함된 행만 나타내기

autopart[!complete.cases(autopart),] ## NA가 포함된 행만 보여줌
##  [1] prod_no           fix_time          a_speed          
##  [4] b_speed           separation        s_separation     
##  [7] rate_terms        mpa               load_time        
## [10] highpressure_time c_thickness      
## <0 rows> (or 0-length row.names)
library("reshape2")
head(french_fries)
##    time treatment subject rep potato buttery grassy rancid painty
## 61    1         1       3   1    2.9     0.0    0.0    0.0    5.5
## 25    1         1       3   2   14.0     0.0    0.0    1.1    0.0
## 62    1         1      10   1   11.0     6.4    0.0    0.0    0.0
## 26    1         1      10   2    9.9     5.9    2.9    2.2    0.0
## 63    1         1      15   1    1.2     0.1    0.0    1.1    5.1
## 27    1         1      15   2    8.8     3.0    3.6    1.5    2.3
french_fries[!complete.cases(french_fries),]
##     time treatment subject rep potato buttery grassy rancid painty
## 315    5         3      15   1     NA      NA     NA     NA     NA
## 455    7         2      79   1    7.3      NA    0.0    0.7      0
## 515    8         1      79   1   10.5      NA    0.0    0.5      0
## 520    8         2      16   1    4.5      NA    1.4    6.7      0
## 563    8         2      79   2    5.7       0    1.4    2.3     NA

1.2.5 summary 함수를 이용한 기초 통계량 확인

summary 함수를 이용하여 기초 통계량을 확인해 볼 수 있음.

summary(autopart1)
##     fix_time         a_speed          b_speed        separation   
##  Min.   :  1.00   Min.   :0.4570   Min.   :1.240   Min.   :141.6  
##  1st Qu.: 81.00   1st Qu.:0.5980   1st Qu.:1.597   1st Qu.:185.9  
##  Median : 82.10   Median :0.6090   Median :1.640   Median :190.7  
##  Mean   : 83.14   Mean   :0.6189   Mean   :1.644   Mean   :214.5  
##  3rd Qu.: 85.40   3rd Qu.:0.6520   3rd Qu.:1.676   3rd Qu.:248.7  
##  Max.   :148.60   Max.   :0.8080   Max.   :2.528   Max.   :294.5  
##   s_separation     rate_terms         mpa          load_time    
##  Min.   :623.3   Min.   :76.00   Min.   :24.80   Min.   : 0.00  
##  1st Qu.:651.6   1st Qu.:81.00   1st Qu.:75.30   1st Qu.:18.10  
##  Median :710.3   Median :85.00   Median :76.60   Median :19.20  
##  Mean   :685.9   Mean   :84.53   Mean   :74.21   Mean   :18.68  
##  3rd Qu.:713.6   3rd Qu.:87.00   3rd Qu.:78.10   3rd Qu.:19.20  
##  Max.   :747.3   Max.   :97.00   Max.   :82.10   Max.   :22.30  
##  highpressure_time   c_thickness     
##  Min.   :   37.00   Min.   :   0.30  
##  1st Qu.:   60.00   1st Qu.:  21.80  
##  Median :   67.00   Median :  23.80  
##  Mean   :   96.36   Mean   :  27.44  
##  3rd Qu.:   72.00   3rd Qu.:  25.40  
##  Max.   :65534.00   Max.   :6553.40

확인한 결과 몇 개의 변수에서 이상치(outlier)가 존재함을 확인. highpressure_time, c_thickness

1.2.6 boxplot을 이용한 분포 시각화

boxplot(autopart1)

boxplot(autopart1$separation)

1.2.7 이상치 제거

autopart2=autopart1[autopart1$c_thickness<1000,]
dim(autopart2) # 레코드 수 확인
## [1] 21767    10
boxplot(autopart2$c_thickness) # 다시 확인

hist(autopart2$c_thickness, breaks=50) #histogram으로 분포 확인 

1.2.8 corrplot 패키지를 이용한 변수들간의 상관관계 확인

# corrplot 패키지 불러오기 

library("corrplot")
## corrplot 0.84 loaded
# numeric 변수만 선택

autopartcor=autopart[c(-1,-7,-10)]

# corrplot 그리기 

cor=cor(autopartcor)
corrplot(cor,method = "shade", addCoef.col = "black")

1.2.9 회귀 모형 생성

예시를 위해 내장 women 데이터 사용

attach(women)
lm=lm(weight~height)
summary(lm)
## 
## Call:
## lm(formula = weight ~ height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

1.2.10 모델을 이용한 예측

new.data=data.frame(height=c(75,76))
predict(lm,newdata = new.data) # interval 지정하지 않으면 점 추정 
##        1        2 
## 171.2333 174.6833

1.2.11 신뢰구간 예측

predict(lm,newdata=new.data,interval = "confidence") # interval="confidence"를 주면 구간 추정
##        fit      lwr      upr
## 1 171.2333 169.0885 173.3781
## 2 174.6833 172.3565 177.0102

1.3 다중선형회귀분석

독립변수가 2개 이상인 회귀분석

1.3.1 데이터 준비

  • autopart 데이터 이용
autopart=read.csv("data/autoparts.csv")
autopart1=autopart[autopart$prod_no=="90784-76001", c(2:11)]
autopart2=autopart1[autopart1$c_thickness<1000,]

1.3.2 데이터 확인

head(autopart2)
##   fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 1     85.5   0.611   1.715      242.0        657.6         95 78.2
## 2     86.2   0.606   1.708      244.7        657.1         95 77.9
## 3     86.0   0.609   1.715      242.7        657.5         95 78.0
## 4     86.1   0.610   1.718      241.9        657.3         95 78.2
## 5     86.1   0.603   1.704      242.5        657.3         95 77.9
## 6     86.3   0.606   1.707      244.5        656.9         95 77.9
##   load_time highpressure_time c_thickness
## 1      18.1                58        24.7
## 2      18.2                58        22.5
## 3      18.1                82        24.1
## 4      18.1                74        25.1
## 5      18.2                56        24.5
## 6      18.0                78        22.9
summary(autopart2)
##     fix_time         a_speed          b_speed        separation   
##  Min.   :  1.00   Min.   :0.4570   Min.   :1.240   Min.   :141.6  
##  1st Qu.: 81.00   1st Qu.:0.5980   1st Qu.:1.597   1st Qu.:185.9  
##  Median : 82.10   Median :0.6090   Median :1.640   Median :190.7  
##  Mean   : 83.14   Mean   :0.6189   Mean   :1.644   Mean   :214.5  
##  3rd Qu.: 85.40   3rd Qu.:0.6520   3rd Qu.:1.676   3rd Qu.:248.7  
##  Max.   :148.60   Max.   :0.8080   Max.   :2.528   Max.   :294.5  
##   s_separation     rate_terms         mpa          load_time    
##  Min.   :623.3   Min.   :76.00   Min.   :24.80   Min.   : 0.00  
##  1st Qu.:651.6   1st Qu.:81.00   1st Qu.:75.30   1st Qu.:18.10  
##  Median :710.3   Median :85.00   Median :76.60   Median :19.20  
##  Mean   :685.9   Mean   :84.54   Mean   :74.21   Mean   :18.68  
##  3rd Qu.:713.6   3rd Qu.:87.00   3rd Qu.:78.10   3rd Qu.:19.20  
##  Max.   :747.3   Max.   :97.00   Max.   :82.10   Max.   :22.30  
##  highpressure_time   c_thickness   
##  Min.   :   37.00   Min.   : 0.30  
##  1st Qu.:   60.00   1st Qu.:21.80  
##  Median :   67.00   Median :23.80  
##  Mean   :   96.38   Mean   :23.84  
##  3rd Qu.:   72.00   3rd Qu.:25.40  
##  Max.   :65534.00   Max.   :65.60

1.3.3 모델 생성 및 검정

leg=lm(c_thickness~fix_time+a_speed, data=autopart2)
leg2=lm(c_thickness~., data=autopart2) # full model 
summary(leg2)
## 
## Call:
## lm(formula = c_thickness ~ ., data = autopart2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.8428  -0.6105  -0.0214   0.5606  29.6508 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        7.146e+02  3.367e+00  212.225  < 2e-16 ***
## fix_time           6.010e-02  5.331e-03   11.273  < 2e-16 ***
## a_speed           -1.738e+01  4.223e-01  -41.152  < 2e-16 ***
## b_speed            1.952e+00  1.516e-01   12.876  < 2e-16 ***
## separation        -7.592e-01  3.635e-03 -208.873  < 2e-16 ***
## s_separation      -7.468e-01  3.673e-03 -203.317  < 2e-16 ***
## rate_terms         1.133e-02  3.597e-03    3.151  0.00163 ** 
## mpa               -1.520e-01  1.458e-03 -104.253  < 2e-16 ***
## load_time         -1.523e-01  8.381e-03  -18.171  < 2e-16 ***
## highpressure_time -2.174e-05  8.738e-06   -2.488  0.01284 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.796 on 21757 degrees of freedom
## Multiple R-squared:  0.7841, Adjusted R-squared:  0.784 
## F-statistic:  8782 on 9 and 21757 DF,  p-value: < 2.2e-16

모형의 유의성 : f-statistic / p-value 역시 0.05 수준에서 유의하다.

계수의 유의성 : 계수들은 유의 수준 0.05 수준에서 모두 유의하다.

모형의 설명력 : Adjusted R-squared : 0.784정도로 78% 정도 설명 가능하다.

1.3.4 잔차 분석

par(mfrow=c(2,2))
plot(leg2)

1번 그림
왼쪽영역이 오른쪽 영역보다 몰려있고, 두 영역의 모습이 서로 달라 등분산을 이루지 않음

  • 잔차는 등분산성을 만족하여야 한다.
  • 잔차에 일정한 패턴이 있다면, 모델 수정이 필요하다.

2번 그림(Q-Q plot)

  • 잔차의 정규성
  • 직선이여야 좋은 형태

3번 그림 (표준화 잔차)

  • 기울기가 0인 직선이 이상적

4번 그림

  • 왼쪽 가운데 몰려있어야 이상적

1.3.5 예측

new.data=data.frame(fix_time=c(86.1,86.1),a_speed=c(0.610,0.603),b_speed=c(1.718,1.704),separation=c(241.9,242.5),s_separation=c(657.3,657.3),rate_terms=c(95,95),mpa=c(78.2,77.9),load_time=c(18.1,18.2),highpressure_time=c(74,56))
predict(leg2,newdata=new.data)
##        1        2 
## 24.42798 24.09751

1.3.6 모형의 선택

  • swiss 데이터 불러오기 및 모델 생성
head(swiss) # swiss 데이터 가져오기 
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
dim(swiss) # 데이터 개수 확인 
## [1] 47  6
lm=lm(Fertility~.,data=swiss)
summary(lm)
## 
## Call:
## lm(formula = Fertility ~ ., data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
par(mfrow=c(2,2))
plot(lm)

  • 전진선택법
step(lm,direction = "forward")
## Start:  AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education + 
##     Catholic + Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture       Examination         Education  
##          66.9152           -0.1721           -0.2580           -0.8709  
##         Catholic  Infant.Mortality  
##           0.1041            1.0770
  • 후진제거법
step(lm,direction = "backward")
## Start:  AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## - Examination       1     53.03 2158.1 189.86
## <none>                          2105.0 190.69
## - Agriculture       1    307.72 2412.8 195.10
## - Infant.Mortality  1    408.75 2513.8 197.03
## - Catholic          1    447.71 2552.8 197.75
## - Education         1   1162.56 3267.6 209.36
## 
## Step:  AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education          Catholic  
##          62.1013           -0.1546           -0.9803            0.1247  
## Infant.Mortality  
##           1.0784
  • stepwise
step(lm,direction = "both")
## Start:  AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## - Examination       1     53.03 2158.1 189.86
## <none>                          2105.0 190.69
## - Agriculture       1    307.72 2412.8 195.10
## - Infant.Mortality  1    408.75 2513.8 197.03
## - Catholic          1    447.71 2552.8 197.75
## - Education         1   1162.56 3267.6 209.36
## 
## Step:  AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## + Examination       1     53.03 2105.0 190.69
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture         Education          Catholic  
##          62.1013           -0.1546           -0.9803            0.1247  
## Infant.Mortality  
##           1.0784