1. 회귀분석
1.1 회귀분석 개요
회귀분석은 변수들 간의 관계를 설명하는 통계적 모형을 생성하는 기법
회귀분석의 종류.
- 단순회귀 - 독립변수가 1개 ex) 단순회귀분석
현실 세계에서 요인과 요인의 1대1 관계는 흔하지 않지만, 모델의 간명성 때문에 종종 사용함. - 다중회귀 - 독립변수가 2개 이상
- 이항회귀 - 종속변수가 2개 ex) 로지스틱 이항 회귀분석
- 다항회귀 - 종속변수가 3개 이상 ex) 로지스틱 다항 회귀분석
1.2 선형회귀분석
- 참값과 예측값의 차로 정의되는 오차를 최소화하는 회귀 방정식을 도출
- 오차는 잔차라고도 한다.
- SST(Total Sum of Squares)=SSR(Sum of Squared Residuals)+SSE(Sum of Sqaured Errors)
- 회귀식의 설명력은 결정계수 (R^2)를 사용한다. 결정계수=SSR/SST
- 결정계수가 높다고 해서 항상 좋은 것은 아니다. (다중공선성 문제 발생 가능)
- 잔차의 독립성 : 더빈-왓슨 / 다중공선성 - VIF(2.5보다 크면 위험)
1.2.1 데이터 준비
autopart=read.csv("data/autoparts.csv") # 자동차 부품에 대한 데이터 1.2.2 데이터 확인
head(autopart) # 6개의 레코드만 출력 ## prod_no fix_time a_speed b_speed separation s_separation rate_terms
## 1 90784-76001 85.5 0.611 1.715 242.0 657.6 95
## 2 90784-76001 86.2 0.606 1.708 244.7 657.1 95
## 3 90784-76001 86.0 0.609 1.715 242.7 657.5 95
## 4 90784-76001 86.1 0.610 1.718 241.9 657.3 95
## 5 90784-76001 86.1 0.603 1.704 242.5 657.3 95
## 6 90784-76001 86.3 0.606 1.707 244.5 656.9 95
## mpa load_time highpressure_time c_thickness
## 1 78.2 18.1 58 24.7
## 2 77.9 18.2 58 22.5
## 3 78.0 18.1 82 24.1
## 4 78.2 18.1 74 25.1
## 5 77.9 18.2 56 24.5
## 6 77.9 18.0 78 22.9
* 데이터 분리
데이터에서 특정 조건을 만족하는 데이터만 추출하기.
특정 컬럼만 추출
autopart1=autopart[autopart$prod_no=="90784-76001", c(2:11)]
head(autopart1)## fix_time a_speed b_speed separation s_separation rate_terms mpa
## 1 85.5 0.611 1.715 242.0 657.6 95 78.2
## 2 86.2 0.606 1.708 244.7 657.1 95 77.9
## 3 86.0 0.609 1.715 242.7 657.5 95 78.0
## 4 86.1 0.610 1.718 241.9 657.3 95 78.2
## 5 86.1 0.603 1.704 242.5 657.3 95 77.9
## 6 86.3 0.606 1.707 244.5 656.9 95 77.9
## load_time highpressure_time c_thickness
## 1 18.1 58 24.7
## 2 18.2 58 22.5
## 3 18.1 82 24.1
## 4 18.1 74 25.1
## 5 18.2 56 24.5
## 6 18.0 78 22.9
전체 컬럼 추출
autopartfull=autopart[autopart$prod_no=="90784-76001",]
head(autopartfull)## prod_no fix_time a_speed b_speed separation s_separation rate_terms
## 1 90784-76001 85.5 0.611 1.715 242.0 657.6 95
## 2 90784-76001 86.2 0.606 1.708 244.7 657.1 95
## 3 90784-76001 86.0 0.609 1.715 242.7 657.5 95
## 4 90784-76001 86.1 0.610 1.718 241.9 657.3 95
## 5 90784-76001 86.1 0.603 1.704 242.5 657.3 95
## 6 90784-76001 86.3 0.606 1.707 244.5 656.9 95
## mpa load_time highpressure_time c_thickness
## 1 78.2 18.1 58 24.7
## 2 77.9 18.2 58 22.5
## 3 78.0 18.1 82 24.1
## 4 78.2 18.1 74 25.1
## 5 77.9 18.2 56 24.5
## 6 77.9 18.0 78 22.9
1.2.3 결측치 확인
sum(is.na(autopart))## [1] 0
결측치가 없음을 확인.
1.2.4 complete.cases() 이용하여 결측치가 포함된 행만 나타내기
autopart[!complete.cases(autopart),] ## NA가 포함된 행만 보여줌## [1] prod_no fix_time a_speed
## [4] b_speed separation s_separation
## [7] rate_terms mpa load_time
## [10] highpressure_time c_thickness
## <0 rows> (or 0-length row.names)
library("reshape2")
head(french_fries)## time treatment subject rep potato buttery grassy rancid painty
## 61 1 1 3 1 2.9 0.0 0.0 0.0 5.5
## 25 1 1 3 2 14.0 0.0 0.0 1.1 0.0
## 62 1 1 10 1 11.0 6.4 0.0 0.0 0.0
## 26 1 1 10 2 9.9 5.9 2.9 2.2 0.0
## 63 1 1 15 1 1.2 0.1 0.0 1.1 5.1
## 27 1 1 15 2 8.8 3.0 3.6 1.5 2.3
french_fries[!complete.cases(french_fries),]## time treatment subject rep potato buttery grassy rancid painty
## 315 5 3 15 1 NA NA NA NA NA
## 455 7 2 79 1 7.3 NA 0.0 0.7 0
## 515 8 1 79 1 10.5 NA 0.0 0.5 0
## 520 8 2 16 1 4.5 NA 1.4 6.7 0
## 563 8 2 79 2 5.7 0 1.4 2.3 NA
1.2.5 summary 함수를 이용한 기초 통계량 확인
summary 함수를 이용하여 기초 통계량을 확인해 볼 수 있음.
summary(autopart1)## fix_time a_speed b_speed separation
## Min. : 1.00 Min. :0.4570 Min. :1.240 Min. :141.6
## 1st Qu.: 81.00 1st Qu.:0.5980 1st Qu.:1.597 1st Qu.:185.9
## Median : 82.10 Median :0.6090 Median :1.640 Median :190.7
## Mean : 83.14 Mean :0.6189 Mean :1.644 Mean :214.5
## 3rd Qu.: 85.40 3rd Qu.:0.6520 3rd Qu.:1.676 3rd Qu.:248.7
## Max. :148.60 Max. :0.8080 Max. :2.528 Max. :294.5
## s_separation rate_terms mpa load_time
## Min. :623.3 Min. :76.00 Min. :24.80 Min. : 0.00
## 1st Qu.:651.6 1st Qu.:81.00 1st Qu.:75.30 1st Qu.:18.10
## Median :710.3 Median :85.00 Median :76.60 Median :19.20
## Mean :685.9 Mean :84.53 Mean :74.21 Mean :18.68
## 3rd Qu.:713.6 3rd Qu.:87.00 3rd Qu.:78.10 3rd Qu.:19.20
## Max. :747.3 Max. :97.00 Max. :82.10 Max. :22.30
## highpressure_time c_thickness
## Min. : 37.00 Min. : 0.30
## 1st Qu.: 60.00 1st Qu.: 21.80
## Median : 67.00 Median : 23.80
## Mean : 96.36 Mean : 27.44
## 3rd Qu.: 72.00 3rd Qu.: 25.40
## Max. :65534.00 Max. :6553.40
확인한 결과 몇 개의 변수에서 이상치(outlier)가 존재함을 확인. highpressure_time, c_thickness
1.2.6 boxplot을 이용한 분포 시각화
boxplot(autopart1)boxplot(autopart1$separation)1.2.7 이상치 제거
autopart2=autopart1[autopart1$c_thickness<1000,]
dim(autopart2) # 레코드 수 확인## [1] 21767 10
boxplot(autopart2$c_thickness) # 다시 확인hist(autopart2$c_thickness, breaks=50) #histogram으로 분포 확인 1.2.8 corrplot 패키지를 이용한 변수들간의 상관관계 확인
# corrplot 패키지 불러오기
library("corrplot")## corrplot 0.84 loaded
# numeric 변수만 선택
autopartcor=autopart[c(-1,-7,-10)]
# corrplot 그리기
cor=cor(autopartcor)
corrplot(cor,method = "shade", addCoef.col = "black")1.2.9 회귀 모형 생성
예시를 위해 내장 women 데이터 사용
attach(women)
lm=lm(weight~height)
summary(lm)##
## Call:
## lm(formula = weight ~ height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
1.2.10 모델을 이용한 예측
new.data=data.frame(height=c(75,76))
predict(lm,newdata = new.data) # interval 지정하지 않으면 점 추정 ## 1 2
## 171.2333 174.6833
1.2.11 신뢰구간 예측
predict(lm,newdata=new.data,interval = "confidence") # interval="confidence"를 주면 구간 추정## fit lwr upr
## 1 171.2333 169.0885 173.3781
## 2 174.6833 172.3565 177.0102
1.3 다중선형회귀분석
독립변수가 2개 이상인 회귀분석
1.3.1 데이터 준비
- autopart 데이터 이용
autopart=read.csv("data/autoparts.csv")
autopart1=autopart[autopart$prod_no=="90784-76001", c(2:11)]
autopart2=autopart1[autopart1$c_thickness<1000,]1.3.2 데이터 확인
head(autopart2)## fix_time a_speed b_speed separation s_separation rate_terms mpa
## 1 85.5 0.611 1.715 242.0 657.6 95 78.2
## 2 86.2 0.606 1.708 244.7 657.1 95 77.9
## 3 86.0 0.609 1.715 242.7 657.5 95 78.0
## 4 86.1 0.610 1.718 241.9 657.3 95 78.2
## 5 86.1 0.603 1.704 242.5 657.3 95 77.9
## 6 86.3 0.606 1.707 244.5 656.9 95 77.9
## load_time highpressure_time c_thickness
## 1 18.1 58 24.7
## 2 18.2 58 22.5
## 3 18.1 82 24.1
## 4 18.1 74 25.1
## 5 18.2 56 24.5
## 6 18.0 78 22.9
summary(autopart2)## fix_time a_speed b_speed separation
## Min. : 1.00 Min. :0.4570 Min. :1.240 Min. :141.6
## 1st Qu.: 81.00 1st Qu.:0.5980 1st Qu.:1.597 1st Qu.:185.9
## Median : 82.10 Median :0.6090 Median :1.640 Median :190.7
## Mean : 83.14 Mean :0.6189 Mean :1.644 Mean :214.5
## 3rd Qu.: 85.40 3rd Qu.:0.6520 3rd Qu.:1.676 3rd Qu.:248.7
## Max. :148.60 Max. :0.8080 Max. :2.528 Max. :294.5
## s_separation rate_terms mpa load_time
## Min. :623.3 Min. :76.00 Min. :24.80 Min. : 0.00
## 1st Qu.:651.6 1st Qu.:81.00 1st Qu.:75.30 1st Qu.:18.10
## Median :710.3 Median :85.00 Median :76.60 Median :19.20
## Mean :685.9 Mean :84.54 Mean :74.21 Mean :18.68
## 3rd Qu.:713.6 3rd Qu.:87.00 3rd Qu.:78.10 3rd Qu.:19.20
## Max. :747.3 Max. :97.00 Max. :82.10 Max. :22.30
## highpressure_time c_thickness
## Min. : 37.00 Min. : 0.30
## 1st Qu.: 60.00 1st Qu.:21.80
## Median : 67.00 Median :23.80
## Mean : 96.38 Mean :23.84
## 3rd Qu.: 72.00 3rd Qu.:25.40
## Max. :65534.00 Max. :65.60
1.3.3 모델 생성 및 검정
leg=lm(c_thickness~fix_time+a_speed, data=autopart2)
leg2=lm(c_thickness~., data=autopart2) # full model
summary(leg2)##
## Call:
## lm(formula = c_thickness ~ ., data = autopart2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.8428 -0.6105 -0.0214 0.5606 29.6508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.146e+02 3.367e+00 212.225 < 2e-16 ***
## fix_time 6.010e-02 5.331e-03 11.273 < 2e-16 ***
## a_speed -1.738e+01 4.223e-01 -41.152 < 2e-16 ***
## b_speed 1.952e+00 1.516e-01 12.876 < 2e-16 ***
## separation -7.592e-01 3.635e-03 -208.873 < 2e-16 ***
## s_separation -7.468e-01 3.673e-03 -203.317 < 2e-16 ***
## rate_terms 1.133e-02 3.597e-03 3.151 0.00163 **
## mpa -1.520e-01 1.458e-03 -104.253 < 2e-16 ***
## load_time -1.523e-01 8.381e-03 -18.171 < 2e-16 ***
## highpressure_time -2.174e-05 8.738e-06 -2.488 0.01284 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.796 on 21757 degrees of freedom
## Multiple R-squared: 0.7841, Adjusted R-squared: 0.784
## F-statistic: 8782 on 9 and 21757 DF, p-value: < 2.2e-16
모형의 유의성 : f-statistic / p-value 역시 0.05 수준에서 유의하다.
계수의 유의성 : 계수들은 유의 수준 0.05 수준에서 모두 유의하다.
모형의 설명력 : Adjusted R-squared : 0.784정도로 78% 정도 설명 가능하다.
1.3.4 잔차 분석
par(mfrow=c(2,2))
plot(leg2)1번 그림
왼쪽영역이 오른쪽 영역보다 몰려있고, 두 영역의 모습이 서로 달라 등분산을 이루지 않음
- 잔차는 등분산성을 만족하여야 한다.
- 잔차에 일정한 패턴이 있다면, 모델 수정이 필요하다.
2번 그림(Q-Q plot)
- 잔차의 정규성
- 직선이여야 좋은 형태
3번 그림 (표준화 잔차)
- 기울기가 0인 직선이 이상적
4번 그림
- 왼쪽 가운데 몰려있어야 이상적
1.3.5 예측
new.data=data.frame(fix_time=c(86.1,86.1),a_speed=c(0.610,0.603),b_speed=c(1.718,1.704),separation=c(241.9,242.5),s_separation=c(657.3,657.3),rate_terms=c(95,95),mpa=c(78.2,77.9),load_time=c(18.1,18.2),highpressure_time=c(74,56))
predict(leg2,newdata=new.data)## 1 2
## 24.42798 24.09751
1.3.6 모형의 선택
- swiss 데이터 불러오기 및 모델 생성
head(swiss) # swiss 데이터 가져오기 ## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
dim(swiss) # 데이터 개수 확인 ## [1] 47 6
lm=lm(Fertility~.,data=swiss)
summary(lm)##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
par(mfrow=c(2,2))
plot(lm)- 전진선택법
step(lm,direction = "forward")## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic + Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Examination Education
## 66.9152 -0.1721 -0.2580 -0.8709
## Catholic Infant.Mortality
## 0.1041 1.0770
- 후진제거법
step(lm,direction = "backward")## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Df Sum of Sq RSS AIC
## - Examination 1 53.03 2158.1 189.86
## <none> 2105.0 190.69
## - Agriculture 1 307.72 2412.8 195.10
## - Infant.Mortality 1 408.75 2513.8 197.03
## - Catholic 1 447.71 2552.8 197.75
## - Education 1 1162.56 3267.6 209.36
##
## Step: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.1013 -0.1546 -0.9803 0.1247
## Infant.Mortality
## 1.0784
- stepwise
step(lm,direction = "both")## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Df Sum of Sq RSS AIC
## - Examination 1 53.03 2158.1 189.86
## <none> 2105.0 190.69
## - Agriculture 1 307.72 2412.8 195.10
## - Infant.Mortality 1 408.75 2513.8 197.03
## - Catholic 1 447.71 2552.8 197.75
## - Education 1 1162.56 3267.6 209.36
##
## Step: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## + Examination 1 53.03 2105.0 190.69
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.1013 -0.1546 -0.9803 0.1247
## Infant.Mortality
## 1.0784