1 회귀분석
- 회귀분석은 변수들 간의 관계를 설명하는 통계적 모형을 생성하는 기법
1.1 변수들의 관계
회귀분석은 변수들 간의 관계를 설명하는 통계적 모형을 생성하는 기법. 결정적 모형 : 오차를 허용하지 않는 모형 통계적 모형 : 변수들 간의 관계에 오차를 허용하는 모형
독립변수(=입력변수)와 종속변수(=결과변수)가 있다. 독립변수는 영향을 주는 변수 종속변수는 영향을 받는 변수
1.2회귀분석의 종류 및 주요개념
- 단순 회귀분석 : 독립변수가 하나인 모형으로, 간명성이 우수하다
- 다중 회귀분석 : 독립변수가 여러개
- 이항 회귀분석 : 종속변수 2개, 대표적으로 로지스틱 이항 회귀분석
- 다항 회귀분석 : 종속변수가 3개 이상
로지스틱 다항 회귀분석
연속형 범주인 Y가 2개 이상 있다면 동시에 분석할 수 있는 방법은 없다.(적어도 다중회귀 에서는, 구조 방정식에서는 예외)
X가 범주형일 때는 X를 0,1,2 등 과 같이 치환해서 분석이 가능하다.
1.3 선형회귀분석
- 참값과 예측값의 차로 정의되는 오차를 최소화하는 회귀 방정식을 도출
- 오차는 잔차라고도 한다.
- SST(Total Sum of Squares)=SSR(Sum of Squared Residuals)+SSE(Sum of Sqaured Errors)
- 회귀식의 설명력은 결정계수 (R^2)를 사용한다. 결정계수=SSR/SST
- 결정계수가 높다고 해서 항상 좋은 것은 아니다. (다중공선성 문제 발생 가능)
- 잔차의 독립성 : 더빈-왓슨 / 다중공선성 - VIF(2.5보다 크면 위험)
1.3.1 데이터준비
autoparts <- read.csv("autoparts.csv", header = T)
dim(autoparts)
## [1] 34139 11
1.3.2 데이터 확인
이상치 여부
is.na(autoparts)
sum(is.na(autoparts))
비어있는 값이 있는지 is.na 와는 반대로 TRUE로 출력
complete.cases(autoparts)
autoparts[!complete.cases(autoparts), ]
head(autoparts)
head(autoparts, n = 10)
head(autoparts, n = 30000)
1.3.3 이상치를 제거하기
- box-plot
autoparts1 <- autoparts[autoparts$prod_no=="90784-76001", ]
boxplot(autoparts1$c_thickness)
dim(autoparts1)
## [1] 21779 11
summary(autoparts1)
## prod_no fix_time a_speed b_speed
## 45231-3B400 : 0 Min. : 1.00 Min. :0.4570 Min. :1.240
## 45231-3B610 : 0 1st Qu.: 81.00 1st Qu.:0.5980 1st Qu.:1.597
## 45231-3B641 : 0 Median : 82.10 Median :0.6090 Median :1.640
## 45231-3B660 : 0 Mean : 83.14 Mean :0.6189 Mean :1.644
## 45231-P3B750: 0 3rd Qu.: 85.40 3rd Qu.:0.6520 3rd Qu.:1.676
## 90784-76001 :21779 Max. :148.60 Max. :0.8080 Max. :2.528
## separation s_separation rate_terms mpa
## Min. :141.6 Min. :623.3 Min. :76.00 Min. :24.80
## 1st Qu.:185.9 1st Qu.:651.6 1st Qu.:81.00 1st Qu.:75.30
## Median :190.7 Median :710.3 Median :85.00 Median :76.60
## Mean :214.5 Mean :685.9 Mean :84.53 Mean :74.21
## 3rd Qu.:248.7 3rd Qu.:713.6 3rd Qu.:87.00 3rd Qu.:78.10
## Max. :294.5 Max. :747.3 Max. :97.00 Max. :82.10
## load_time highpressure_time c_thickness
## Min. : 0.00 Min. : 37.00 Min. : 0.30
## 1st Qu.:18.10 1st Qu.: 60.00 1st Qu.: 21.80
## Median :19.20 Median : 67.00 Median : 23.80
## Mean :18.68 Mean : 96.36 Mean : 27.44
## 3rd Qu.:19.20 3rd Qu.: 72.00 3rd Qu.: 25.40
## Max. :22.30 Max. :65534.00 Max. :6553.40
boxplot(autoparts1)
- 기초 통계량 확인
- out-lier를 확인하기위해 꼭 해야함
autoparts2 <- autoparts1[autoparts1$c_thickness < 1000, ] # 이상치 제거
dim(autoparts2)
## [1] 21767 11
boxplot(autoparts2$c_thickness)
- Histogram
hist(autoparts2$c_thickness)
hist(autoparts2$c_thickness,breaks = 50) # breaks 옵션을 통해 구간 수를 직접 설정할 수도 있다.
- corrplot 패키지를 이용한 변수들간의 상관관계 확인
# corrplot 패키지 불러오기
library("corrplot")
## corrplot 0.84 loaded
autopartcor <- autoparts2[c(-1,-7,-10)]
# corrplot 그리기
cor=cor(autopartcor)
corrplot(cor,method = "shade", addCoef.col = "black")
1.3.4 모형 만들기
이미 전처리가 되어있는 내장된 women data 사용
women
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
summary(women)
## height weight
## Min. :58.0 Min. :115.0
## 1st Qu.:61.5 1st Qu.:124.5
## Median :65.0 Median :135.0
## Mean :65.0 Mean :136.7
## 3rd Qu.:68.5 3rd Qu.:148.0
## Max. :72.0 Max. :164.0
m <- lm(weight ~ height, data = women)
m
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
summary(m)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
- 회귀 계수와 intercept 의 p-value가 매우 작기 때문에 유의하다. p-value: 1.091e-14 값은 모델 전체에 대한 유의확률
1.3.5 모델을 이용한 데이터 예측
new.data <- data.frame(height=75) # 예측할 새로운 데이터명 지정
predict(m, newdata = new.data) # predict 함수를 이용해 weight를 예측
## 1
## 171.2333
new.data <- data.frame(height=c(75,76))
predict(m, newdata = new.data)
## 1 2
## 171.2333 174.6833
1.3.6 신뢰구간
# 구간 추정을 하는 함수 interval = "confidence"
new.data <- data.frame(height=c(75,76))
predict(m, newdata = new.data, interval = "confidence")
## fit lwr upr
## 1 171.2333 169.0885 173.3781
## 2 174.6833 172.3565 177.0102
m <- lm(c_thickness ~ fix_time, data = autoparts2) # fix_time이 c_thickness에 미치는 영향을 보기 위해 단순회귀 모형
new.data <- data.frame(fix_time = 86.1)
predict(m,newdata = new.data)
## 1
## 23.62006
1.4 다중 선형 회귀분석
autoparts <- read.csv("autoparts.csv", header = TRUE)
autoparts1 <- autoparts[autoparts$prod_no=="90784-76001", c(2:11)]
autoparts2 <- autoparts1[autoparts1$c_thickness < 1000, ]
head(autoparts2)
## fix_time a_speed b_speed separation s_separation rate_terms mpa
## 1 85.5 0.611 1.715 242.0 657.6 95 78.2
## 2 86.2 0.606 1.708 244.7 657.1 95 77.9
## 3 86.0 0.609 1.715 242.7 657.5 95 78.0
## 4 86.1 0.610 1.718 241.9 657.3 95 78.2
## 5 86.1 0.603 1.704 242.5 657.3 95 77.9
## 6 86.3 0.606 1.707 244.5 656.9 95 77.9
## load_time highpressure_time c_thickness
## 1 18.1 58 24.7
## 2 18.2 58 22.5
## 3 18.1 82 24.1
## 4 18.1 74 25.1
## 5 18.2 56 24.5
## 6 18.0 78 22.9
head(autoparts2)
## fix_time a_speed b_speed separation s_separation rate_terms mpa
## 1 85.5 0.611 1.715 242.0 657.6 95 78.2
## 2 86.2 0.606 1.708 244.7 657.1 95 77.9
## 3 86.0 0.609 1.715 242.7 657.5 95 78.0
## 4 86.1 0.610 1.718 241.9 657.3 95 78.2
## 5 86.1 0.603 1.704 242.5 657.3 95 77.9
## 6 86.3 0.606 1.707 244.5 656.9 95 77.9
## load_time highpressure_time c_thickness
## 1 18.1 58 24.7
## 2 18.2 58 22.5
## 3 18.1 82 24.1
## 4 18.1 74 25.1
## 5 18.2 56 24.5
## 6 18.0 78 22.9
summary(autoparts2)
## fix_time a_speed b_speed separation
## Min. : 1.00 Min. :0.4570 Min. :1.240 Min. :141.6
## 1st Qu.: 81.00 1st Qu.:0.5980 1st Qu.:1.597 1st Qu.:185.9
## Median : 82.10 Median :0.6090 Median :1.640 Median :190.7
## Mean : 83.14 Mean :0.6189 Mean :1.644 Mean :214.5
## 3rd Qu.: 85.40 3rd Qu.:0.6520 3rd Qu.:1.676 3rd Qu.:248.7
## Max. :148.60 Max. :0.8080 Max. :2.528 Max. :294.5
## s_separation rate_terms mpa load_time
## Min. :623.3 Min. :76.00 Min. :24.80 Min. : 0.00
## 1st Qu.:651.6 1st Qu.:81.00 1st Qu.:75.30 1st Qu.:18.10
## Median :710.3 Median :85.00 Median :76.60 Median :19.20
## Mean :685.9 Mean :84.54 Mean :74.21 Mean :18.68
## 3rd Qu.:713.6 3rd Qu.:87.00 3rd Qu.:78.10 3rd Qu.:19.20
## Max. :747.3 Max. :97.00 Max. :82.10 Max. :22.30
## highpressure_time c_thickness
## Min. : 37.00 Min. : 0.30
## 1st Qu.: 60.00 1st Qu.:21.80
## Median : 67.00 Median :23.80
## Mean : 96.38 Mean :23.84
## 3rd Qu.: 72.00 3rd Qu.:25.40
## Max. :65534.00 Max. :65.60
m <- lm(c_thickness ~ fix_time + a_speed, data = autoparts2) # c_thickness에 두 변수가 어떤 영향을 미치는지 알아보겠다.
m2 <- lm(c_thickness ~ ., data = autoparts2) # c_thickness를 제외한 모든 변수를 다 넣어 보겠다.
1.4.1 잔차분석
par(mfrow=c(2,2))
plot(m)
1번 : 두 영역의 모습이 서로 달라 등분산을 이루지 않음
- 잔차는 등분산성을 만족하여야 한다.
- 잔차에 패턴이 있다면 모델 수정이 필요하다.
2번 : Q-Q plot
- 직선일 때 잔차의 정규성 만족
3번 : 표준화 잔차
- 기울기가 0인 직선이 이상적
4번 : cook’s distance
- 왼쪽 가운데 몰려있어야 이상적
1.4.2 예측
#1) 새로운 1개의 데이터로 1개의 종속변수를 예측해 본다
# 데이터의 여러 독립변수를 데이터 프레임으로 구성
new.data <- data.frame(fix_time = 86.1, a_speed = 0.610, b_speed = 1.718,
separation = 241.9, s_separation = 657.3, rate_terms = 95, mpa = 78.2,
load_time = 18.1, highpressuer_time = 74)
predict(m, newdata = new.data) # 제품의 두께(C_thickness)출력
## 1
## 23.44666
#2) 새로운 복수의 데이터로 복수의 종속변수를 예측해 본다
#데이터의 여러 독립변수를 벡터로 구성한 뒤, 데이터프레임으로 구성
new.data <- data.frame(fix_time=c(86.1,86.1), a_speed = c(0.610, 0.603),
b_speed = c(1.718, 1.704), separation = c(241.9, 242.5),
s_separation = c(657.3, 657.3), rate_terms = c(95, 95),
mpa = c(78.2, 77.9),
load_time = c(18.1, 18.2), highpressure_time = c(74, 56))
predict(m, newdata = new.data)
## 1 2
## 23.44666 23.57274
#3) 단순선형 회귀모형과 마찬가지로 신뢰구간을 구할 수 있다.
predict(m, newdata = new.data, interval = "confidence")
## fit lwr upr
## 1 23.44666 23.37760 23.51571
## 2 23.57274 23.50448 23.64101
1.4.3 최적 변수 찾기
- 모형의 선택
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
dim(swiss)
## [1] 47 6
m <- lm(Fertility ~ ., data = swiss)
summary(m)
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
- 전진선택법 -AIC가 낮을 수록 개선된 모델
step(m, direction = "forward") # 전진선택법 선택
## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic + Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Examination Education
## 66.9152 -0.1721 -0.2580 -0.8709
## Catholic Infant.Mortality
## 0.1041 1.0770
m <- lm(Fertility ~ Agriculture, data = swiss) # 하나의 변수만 선택
step(m, direction = "forward")
## Start: AIC=234.09
## Fertility ~ Agriculture
##
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture
## 60.3044 0.1942
- 후진제거법
step(m, direction = "backward") # 후진제거법 선택
## Start: AIC=234.09
## Fertility ~ Agriculture
##
## Df Sum of Sq RSS AIC
## <none> 6283.1 234.09
## - Agriculture 1 894.84 7178.0 238.34
##
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture
## 60.3044 0.1942
m <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss) # 하나의 변수만 선택
step(m, direction = "backward")
## Start: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.1013 -0.1546 -0.9803 0.1247
## Infant.Mortality
## 1.0784
- both
step(m, direction = "both")
## Start: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.1013 -0.1546 -0.9803 0.1247
## Infant.Mortality
## 1.0784
연습문제
- 1
autoparts <- read.csv("autoparts.csv", header = TRUE)
autoparts1 <- autoparts[autoparts$prod_no=="90784-76001", c(2:11)]
autoparts2 <- autoparts1[autoparts1$c_thickness < 1000, ]
m <- lm(c_thickness ~., data = autoparts2)
step(m, direction = "forward")
## Start: AIC=25500.35
## c_thickness ~ fix_time + a_speed + b_speed + separation + s_separation +
## rate_terms + mpa + load_time + highpressure_time
##
## Call:
## lm(formula = c_thickness ~ fix_time + a_speed + b_speed + separation +
## s_separation + rate_terms + mpa + load_time + highpressure_time,
## data = autoparts2)
##
## Coefficients:
## (Intercept) fix_time a_speed
## 7.146e+02 6.010e-02 -1.738e+01
## b_speed separation s_separation
## 1.952e+00 -7.592e-01 -7.468e-01
## rate_terms mpa load_time
## 1.133e-02 -1.520e-01 -1.523e-01
## highpressure_time
## -2.174e-05
step(m, direction = "backward")
## Start: AIC=25500.35
## c_thickness ~ fix_time + a_speed + b_speed + separation + s_separation +
## rate_terms + mpa + load_time + highpressure_time
##
## Df Sum of Sq RSS AIC
## <none> 70175 25500
## - highpressure_time 1 20 70195 25505
## - rate_terms 1 32 70207 25508
## - fix_time 1 410 70585 25625
## - b_speed 1 535 70710 25664
## - load_time 1 1065 71240 25826
## - a_speed 1 5462 75637 27130
## - mpa 1 35055 105230 34318
## - s_separation 1 133330 203505 48674
## - separation 1 140717 210892 49450
##
## Call:
## lm(formula = c_thickness ~ fix_time + a_speed + b_speed + separation +
## s_separation + rate_terms + mpa + load_time + highpressure_time,
## data = autoparts2)
##
## Coefficients:
## (Intercept) fix_time a_speed
## 7.146e+02 6.010e-02 -1.738e+01
## b_speed separation s_separation
## 1.952e+00 -7.592e-01 -7.468e-01
## rate_terms mpa load_time
## 1.133e-02 -1.520e-01 -1.523e-01
## highpressure_time
## -2.174e-05
step(m, direction = "both")
## Start: AIC=25500.35
## c_thickness ~ fix_time + a_speed + b_speed + separation + s_separation +
## rate_terms + mpa + load_time + highpressure_time
##
## Df Sum of Sq RSS AIC
## <none> 70175 25500
## - highpressure_time 1 20 70195 25505
## - rate_terms 1 32 70207 25508
## - fix_time 1 410 70585 25625
## - b_speed 1 535 70710 25664
## - load_time 1 1065 71240 25826
## - a_speed 1 5462 75637 27130
## - mpa 1 35055 105230 34318
## - s_separation 1 133330 203505 48674
## - separation 1 140717 210892 49450
##
## Call:
## lm(formula = c_thickness ~ fix_time + a_speed + b_speed + separation +
## s_separation + rate_terms + mpa + load_time + highpressure_time,
## data = autoparts2)
##
## Coefficients:
## (Intercept) fix_time a_speed
## 7.146e+02 6.010e-02 -1.738e+01
## b_speed separation s_separation
## 1.952e+00 -7.592e-01 -7.468e-01
## rate_terms mpa load_time
## 1.133e-02 -1.520e-01 -1.523e-01
## highpressure_time
## -2.174e-05
- 2
A <- data.frame(name = c("A","B","C","D","E","F","G","H","I","J"),
grade = c(90,75,77,83,65,80,83,70,87,79),
IQ = c(140, 125, 120, 135, 105, 123, 132, 115, 128, 131),
school = c(2,1,1,2,0,3,3,1,4,2),
game = c(1,3,0,3,4,1,4,1,0,2),
TV = c(0,3,4,2,4,1,1,3,0,3))
A
## name grade IQ school game TV
## 1 A 90 140 2 1 0
## 2 B 75 125 1 3 3
## 3 C 77 120 1 0 4
## 4 D 83 135 2 3 2
## 5 E 65 105 0 4 4
## 6 F 80 123 3 1 1
## 7 G 83 132 3 4 1
## 8 H 70 115 1 1 3
## 9 I 87 128 4 0 0
## 10 J 79 131 2 2 3
lm <- lm(grade ~ ., data = A)
lm
##
## Call:
## lm(formula = grade ~ ., data = A)
##
## Coefficients:
## (Intercept) nameB nameC nameD nameE
## 90 -15 -13 -7 -25
## nameF nameG nameH nameI nameJ
## -10 -7 -20 -3 -11
## IQ school game TV
## NA NA NA NA
predict(lm, new.data = data.frame(x=5))
## 1 2 3 4 5 6 7 8 9 10
## 90 75 77 83 65 80 83 70 87 79
lm3 <- lm(grade ~ IQ + school + game + TV, data = A)
- 꽃받침의 길이를 예측
x <- iris
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
sepalL <- x$Sepal.Length
sepalW <- x$Sepal.Width
petalL <- x$Petal.Length
petalW <- x$Petal.Width
m <- lm(formula = sepalL~sepalW + petalL + petalW, data=x)
m
##
## Call:
## lm(formula = sepalL ~ sepalW + petalL + petalW, data = x)
##
## Coefficients:
## (Intercept) sepalW petalL petalW
## 1.8560 0.6508 0.7091 -0.5565