rm(list = ls())
library(tidyverse)
ceosal <- readr::read_csv("ceosal1.csv",
col_names = TRUE)

Regression Analysis

Model
- The basic regression model is as follows.
\[ y_i = \beta_0 + \beta_1 x_i + e_i \]
The variable \(y\) to the left of the \(=\) sign is the dependent variable (also called the explained variable).
The variable \(x\) to the right of the \(=\) sign is the independent variable (also called the explanatory variable).
\(e\) is the error term, the part left unexplained.
With a single \(x\) the model is a simple regression; with several \(x\) variables it is a multiple regression.
Choose \(\beta_0 , \beta_1\) so that the sum of squared errors below is as small as possible.
\[ \sum_{i=1}^{n} e_i^2 \]
This method is called OLS (ordinary least squares).
You can think of OLS as a kind of gun: a general-purpose weapon you will reach for again and again.
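As a minimal sketch with simulated (hypothetical) vectors x and y, the OLS estimates have a closed form that lm() reproduces exactly:

# hypothetical data, for illustration only
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# closed-form OLS: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))  # lm() gives the same estimates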
Simple Regression

- Data on the salaries of company CEOs.
str(ceosal)

spc_tbl_ [209 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ salary : num [1:209] 1095 1001 1122 578 1368 ...
$ pcsalary: num [1:209] 20 32 9 -9 7 5 10 7 16 5 ...
$ sales : num [1:209] 27595 9958 6126 16246 21783 ...
$ roe : num [1:209] 14.1 10.9 23.5 5.9 13.8 20 16.4 16.3 10.5 26.3 ...
$ pcroe : num [1:209] 106.4 -30.6 -16.3 -25.7 -3 ...
$ ros : num [1:209] 191 13 14 -21 56 55 62 44 37 37 ...
$ indus : num [1:209] 1 1 1 1 1 1 1 1 1 1 ...
$ finance : num [1:209] 0 0 0 0 0 0 0 0 0 0 ...
$ consprod: num [1:209] 0 0 0 0 0 0 0 0 0 0 ...
$ utility : num [1:209] 0 0 0 0 0 0 0 0 0 0 ...
$ lsalary : num [1:209] 7 6.91 7.02 6.36 7.22 ...
$ lsales : num [1:209] 10.23 9.21 8.72 9.7 9.99 ...
- attr(*, "spec")=
.. cols(
.. salary = col_double(),
.. pcsalary = col_double(),
.. sales = col_double(),
.. roe = col_double(),
.. pcroe = col_double(),
.. ros = col_double(),
.. indus = col_double(),
.. finance = col_double(),
.. consprod = col_double(),
.. utility = col_double(),
.. lsalary = col_double(),
.. lsales = col_double()
.. )
- attr(*, "problems")=<externalptr>
- Can the return on equity (roe) explain salary? If so, by how much?
\[ salary_i = a + b \times roe_i + e_i \]
The null hypothesis \(H_0\) is \[ H_0 : b = 0 \]
Draw a scatter plot with roe on the horizontal axis and salary on the vertical axis; color the points differently by indus.
ceosal |>
ggplot(mapping = aes(x = roe, y = salary, color = as.factor(indus))) +
geom_point() +
labs(
color = "OrRd"  # sets the legend title; to apply the OrRd palette instead, add scale_color_brewer(palette = "OrRd")
)

ceosal |>
ggplot(mapping = aes(x = roe, y = salary)) +
geom_point() +
geom_smooth(method = "lm",
se = TRUE)

- Estimate the model coefficients \(a, b\) by OLS.
lm.roe <- lm(salary ~ roe, data = ceosal)
summary(lm.roe)
Call:
lm(formula = salary ~ roe, data = ceosal)
Residuals:
Min 1Q Median 3Q Max
-1160.2 -526.0 -254.0 138.8 13499.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 963.19 213.24 4.517 1.05e-05 ***
roe 18.50 11.12 1.663 0.0978 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1367 on 207 degrees of freedom
Multiple R-squared: 0.01319, Adjusted R-squared: 0.008421
F-statistic: 2.767 on 1 and 207 DF, p-value: 0.09777
roe does not explain salary well; we cannot reject the null hypothesis.
Type I error (false positive): concluding the effect is real when it is not.
Significance level 5%, rejection region roughly \(|t| > 2\) (1.96 exactly), confidence level 95%.
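For instance, the t statistic in the summary above can be reproduced by hand from the fitted lm.roe object (a sketch using base R only):

# t statistic = estimate / standard error; under H0: b = 0,
# reject at the 5% level when |t| exceeds the critical value
est <- coef(summary(lm.roe))["roe", ]
t_stat <- unname(est["Estimate"] / est["Std. Error"])
crit <- qt(0.975, df.residual(lm.roe))  # about 1.96 in large samples
c(t = t_stat, critical = crit)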
library(stargazer)
stargazer(lm.roe,
type = "text",
keep.stat = c("n", "rsq"))
========================================
Dependent variable:
---------------------------
salary
----------------------------------------
roe 18.501*
(11.123)
Constant 963.191***
(213.240)
----------------------------------------
Observations 209
R2 0.013
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
Note that stargazer's star convention differs from summary(): in this table * means p < 0.1, so roe earns a star even though summary() flagged it only with a dot.
Instead of roe, try another explanatory variable, for example sales (total sales).
Plot the relationship between sales and salary.
ceosal |>
ggplot(mapping = aes(x = sales, y = salary)) +
geom_point()

- Restrict the plot to firms with sales < 25000 and salary < 5000. Try adding geom_rug().
# An alternative: keep all rows but clip the axes
# (points outside the limits are dropped by scale_*_continuous, with a warning)
#ceosal |>
# ggplot(mapping = aes(x = sales, y = salary)) +
# geom_point() +
# scale_x_continuous('sales', limits=c(0,25000)) +
# scale_y_continuous('salary', limits=c(0,5000))

ceosal |>
filter(sales < 25000 & salary < 5000) |>
ggplot(mapping = aes(x = sales, y = salary)) +
geom_point() +
geom_rug()

- Scatter plot of log(sales) against log(salary)
ceosal |>
ggplot(mapping = aes(x = lsales, y = lsalary)) +
geom_point() +
geom_rug() +
geom_smooth(method = "lm")

- Change the regression model as follows.
\[ \log(salary_i) = a + b \log (sales_i) + e_i \]
- Estimate the above regression model by OLS. In this log-log form, \(b\) is an elasticity: the percent change in salary associated with a 1% change in sales.
lm.sales <- lm(lsalary ~ lsales, data = ceosal)
stargazer(lm.sales,
type = "text",
keep.stat = c("n", "rsq"))
========================================
Dependent variable:
---------------------------
lsalary
----------------------------------------
lsales 0.257***
(0.035)
Constant 4.822***
(0.288)
----------------------------------------
Observations 209
R2 0.211
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
- The estimated elasticity is 0.257, so a 1% increase in sales is associated with roughly a 0.257% higher salary.
- Up to this point we have done simple regression with a single \(x\).
Multiple Regression

- Multiple regression is a regression model with two or more \(x\) variables.
rm(list = ls())
gpa <- read.csv("gpa1.csv",
header = TRUE,
sep = "\t")
attach(gpa)  # not strictly needed below, since every lm() call passes data = gpa explicitly
str(gpa)

'data.frame': 141 obs. of 29 variables:
$ age : int 21 21 20 19 20 20 22 22 22 19 ...
$ soph : int 0 0 0 1 0 0 0 0 0 1 ...
$ junior : int 0 0 1 0 1 0 0 0 0 0 ...
$ senior : int 1 1 0 0 0 1 0 0 0 0 ...
$ senior5 : int 0 0 0 0 0 0 1 1 1 0 ...
$ male : int 0 0 0 1 0 1 0 0 0 0 ...
$ campus : int 0 0 0 1 0 1 0 0 0 0 ...
$ business: int 1 1 1 1 1 1 1 0 0 1 ...
$ engineer: int 0 0 0 0 0 0 0 0 0 0 ...
$ colGPA : num 3 3.4 3 3.5 3.6 ...
$ hsGPA : num 3 3.2 3.6 3.5 3.9 ...
$ ACT : int 21 24 26 27 28 25 25 22 21 27 ...
$ job19 : int 0 0 1 1 0 0 0 1 1 1 ...
$ job20 : int 1 1 0 0 1 0 0 0 0 0 ...
$ drive : int 1 1 0 0 0 0 0 1 1 0 ...
$ bike : int 0 0 0 0 1 0 1 0 0 0 ...
$ walk : int 0 0 1 1 0 1 0 0 0 1 ...
$ voluntr : int 0 0 0 0 0 0 0 0 0 0 ...
$ PC : int 0 0 0 0 0 0 0 1 0 1 ...
$ greek : int 0 0 0 0 0 0 1 0 0 0 ...
$ car : int 1 1 1 0 1 1 1 0 1 0 ...
$ siblings: int 1 0 1 1 1 1 1 1 1 1 ...
$ bgfriend: int 0 1 0 0 1 0 0 1 1 0 ...
$ clubs : int 0 1 1 0 0 0 1 0 1 1 ...
$ skipped : num 2 0 0 0 0 0 0 3 2 0.5 ...
$ alcohol : num 1 1 1 0 1.5 0 2 3 2.5 0.75 ...
$ gradMI : int 1 1 1 0 1 0 1 1 1 1 ...
$ fathcoll: int 0 1 1 0 1 1 0 1 1 0 ...
$ mothcoll: int 0 1 1 0 0 0 1 1 1 1 ...
- Scatter plot of hsGPA against colGPA, with different colors by male.
gpa |>
ggplot(mapping = aes(x = hsGPA, y = colGPA, color = as.factor(male))) +
geom_jitter(width = 0.01) +
labs(
color = "1 if male"
)

- Points in different colors by male
- A single trend line for the whole sample
gpa |>
ggplot(mapping = aes(x = hsGPA, y = colGPA)) +
# map color inside geom_jitter only, so geom_smooth fits one overall line
geom_jitter(mapping = aes(color = as.factor(male)), width = 0.01) +
labs(
color = "1 if male"
) +
geom_smooth(method = "lm")

\[ colGPA_i = a + b \times hsGPA_i + e_i \]
lm.hsGPA <- lm(colGPA ~ hsGPA, data = gpa)
stargazer::stargazer(lm.hsGPA,
type = "text",
keep.stat = c("n", "rsq"))
========================================
Dependent variable:
---------------------------
colGPA
----------------------------------------
hsGPA 0.482***
(0.090)
Constant 1.415***
(0.307)
----------------------------------------
Observations 141
R2 0.172
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
lm.ACT <- lm(colGPA ~ hsGPA + ACT, data = gpa)
stargazer::stargazer(lm.hsGPA,
lm.ACT,
type = "text",
keep.stat = c("n", "rsq"))
=========================================
Dependent variable:
----------------------------
colGPA
(1) (2)
-----------------------------------------
hsGPA 0.482*** 0.453***
(0.090) (0.096)
ACT 0.009
(0.011)
Constant 1.415*** 1.286***
(0.307) (0.341)
-----------------------------------------
Observations 141 141
R2 0.172 0.176
=========================================
Note: *p<0.1; **p<0.05; ***p<0.01
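Since model (1) is nested in model (2), the contribution of ACT can also be judged with an F test; a minimal sketch using base R's anova():

# F test of H0: the coefficient on ACT is zero, given hsGPA is in the model
anova(lm.hsGPA, lm.ACT)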
- Add roughly 2 to 4 control variables
- and explain why you chose them.
lm.campus <- lm(colGPA ~ hsGPA + ACT + campus + voluntr, data = gpa)
stargazer::stargazer(lm.hsGPA,
lm.ACT,
lm.campus,
type = "text",
keep.stat = c("n", "rsq"))
==========================================
Dependent variable:
-----------------------------
colGPA
(1) (2) (3)
------------------------------------------
hsGPA 0.482*** 0.453*** 0.457***
(0.090) (0.096) (0.096)
ACT 0.009 0.009
(0.011) (0.011)
campus -0.068
(0.077)
voluntr -0.025
(0.070)
Constant 1.415*** 1.286*** 1.294***
(0.307) (0.341) (0.342)
------------------------------------------
Observations 141 141 141
R2 0.172 0.176 0.182
==========================================
Note: *p<0.1; **p<0.05; ***p<0.01
lm.job19 <- lm(colGPA ~ hsGPA + ACT + job19, data = gpa)
stargazer::stargazer(lm.hsGPA,
lm.ACT,
lm.job19,
type = "text",
keep.stat = c("n", "rsq"))
==========================================
Dependent variable:
-----------------------------
colGPA
(1) (2) (3)
------------------------------------------
hsGPA 0.482*** 0.453*** 0.455***
(0.090) (0.096) (0.096)
ACT 0.009 0.010
(0.011) (0.011)
job19 -0.009
(0.059)
Constant 1.415*** 1.286*** 1.284***
(0.307) (0.341) (0.342)
------------------------------------------
Observations 141 141 141
R2 0.172 0.176 0.177
==========================================
Note: *p<0.1; **p<0.05; ***p<0.01
Machine Learning with Regression Models

The goal of machine learning is prediction.
Build a model to predict house prices.
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i \]
\(y\): house price
\(x\): characteristics of the neighborhood or of the house
In regression analysis we spoke of estimating the model; here we say we train it.
library(tidyverse)
library(MASS)  # note: MASS masks dplyr::select()
library(caret)
data("Boston")
head(Boston)

crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
- Split the data into train data and test data.
set.seed(12345)
trainIndex <- caret::createDataPartition(Boston$medv,
p = 0.8,
list = FALSE)
trainData <- Boston[trainIndex, ]
testData <- Boston[-trainIndex, ]
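A quick sanity check that the partition is roughly 80/20 (no new assumptions):

# createDataPartition() samples within outcome strata, so the split is approximate
nrow(trainData)
nrow(testData)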
head(trainData)

crim zn indus chas nox rm age dis rad tax ptratio black lstat
2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15
medv
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
8 27.1
- Train a model that predicts medv from all the features.
lm.all <- lm( medv ~ ., data = trainData)
summary(lm.all)
Call:
lm(formula = medv ~ ., data = trainData)
Residuals:
Min 1Q Median 3Q Max
-15.0843 -2.7781 -0.5494 1.6943 25.0247
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.765e+01 5.760e+00 6.537 1.95e-10 ***
crim -8.311e-02 4.796e-02 -1.733 0.083904 .
zn 3.717e-02 1.599e-02 2.325 0.020607 *
indus -2.191e-04 6.845e-02 -0.003 0.997448
chas 2.864e+00 1.002e+00 2.858 0.004492 **
nox -1.683e+01 4.422e+00 -3.805 0.000165 ***
rm 3.335e+00 4.688e-01 7.113 5.40e-12 ***
age 8.175e-03 1.470e-02 0.556 0.578403
dis -1.270e+00 2.259e-01 -5.623 3.57e-08 ***
rad 2.971e-01 7.681e-02 3.868 0.000129 ***
tax -1.063e-02 4.269e-03 -2.490 0.013172 *
ptratio -9.414e-01 1.508e-01 -6.244 1.11e-09 ***
black 9.257e-03 3.073e-03 3.012 0.002759 **
lstat -5.688e-01 5.614e-02 -10.132 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.851 on 393 degrees of freedom
Multiple R-squared: 0.709, Adjusted R-squared: 0.6993
F-statistic: 73.65 on 13 and 393 DF, p-value: < 2.2e-16
- Use the trained model to make predictions.
predictions <- predict(lm.all, newdata = testData)

- actual: the actual house price medv; predicted: the model's predicted price
- Put the two variables into a new data frame
results <- data.frame(
actual = testData$medv,
predicted = predictions
)

- Visualizing this relationship:
results |>
ggplot(aes(x = actual, y = predicted )) +
geom_point() +
geom_abline(intercept = 0, slope = 1,
color = "red", linetype = "dashed") +
labs(
x = "Actual price", y = "Predicted price"
)

Evaluate the model's performance, goodness of fit, and specification.
root mean-squared error
\[ RMSE = \sqrt {\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 } \]
rmse <- sqrt(mean((results$actual - results$predicted)^2))
cat("RMSE = ", rmse)RMSE = 4.422605
- Selection of features
- to avoid overfitting
- Three approaches come to mind:
- forward (adding variables)
- backward (removing variables)
- both (a mixture)
- Alternatively, regularization can be used.
Should we put in all the feature variables?
Feature selection

Information criteria:
- AIC (Akaike information criterion)
- BIC (Bayesian information criterion)

Approaches:
- forward: add variables step by step.
- backward: remove variables step by step.
- both: a combination of the two.
# stepwise selection by AIC, starting from the full model and moving in both directions
step_model <- stepAIC(lm(medv ~ . , data = trainData),
direction = "both",
trace = FALSE)
# rebuild the selected model explicitly from the surviving features
model_selection <- names(coef(step_model))[-1]
temp <- paste(model_selection, collapse = " + ")
formula <- as.formula(paste("medv ~ ", temp))
lm_selection <- lm(formula, data = trainData)
summary(lm_selection)
Call:
lm(formula = formula, data = trainData)
Residuals:
Min 1Q Median 3Q Max
-15.2178 -2.7989 -0.5557 1.7558 25.3234
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.378566 5.716804 6.538 1.93e-10 ***
crim -0.082235 0.047764 -1.722 0.08591 .
zn 0.036341 0.015794 2.301 0.02192 *
chas 2.912554 0.992300 2.935 0.00353 **
nox -16.171441 4.093708 -3.950 9.24e-05 ***
rm 3.384598 0.457399 7.400 8.28e-13 ***
dis -1.306578 0.210369 -6.211 1.34e-09 ***
rad 0.294726 0.074322 3.966 8.69e-05 ***
tax -0.010639 0.003911 -2.720 0.00681 **
ptratio -0.934321 0.148194 -6.305 7.73e-10 ***
black 0.009396 0.003054 3.077 0.00224 **
lstat -0.557737 0.051951 -10.736 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.84 on 395 degrees of freedom
Multiple R-squared: 0.7087, Adjusted R-squared: 0.7006
F-statistic: 87.38 on 11 and 395 DF, p-value: < 2.2e-16
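To compare the full and selected models on the information criteria directly, base R's AIC() and BIC() accept several fitted models at once; a minimal sketch:

# lower values indicate a better trade-off between fit and complexity
AIC(lm.all, lm_selection)
BIC(lm.all, lm_selection)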
- Regularization: lasso, ridge (penalized regression)
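As a hedged sketch of the regularization route (this assumes the glmnet package, which these notes do not load elsewhere): alpha = 1 gives the lasso and alpha = 0 gives ridge regression.

library(glmnet)  # assumed installed; not used elsewhere in these notes

# glmnet expects a numeric feature matrix and a response vector
X <- as.matrix(trainData[, setdiff(names(trainData), "medv")])
y <- trainData$medv

cv_fit <- cv.glmnet(X, y, alpha = 1)  # penalty strength chosen by cross-validation
coef(cv_fit, s = "lambda.min")        # lasso shrinks some coefficients exactly to zero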