저자 책 웹페이지: https://dataninja.me/ipds-kr/

원 문제:

https://goo.gl/hmyTre 혹은 https://archive.ics.uci.edu/ml/datasets.html 에서 고차원 분류분석 데이터를 찾아서 로지스틱 분류분석을 실행하고, 결과를 슬라이드 10여 장 내외로 요약하라.

UCI 보다는 캐글에 있는 다음 자료를 분석해 보자. https://www.kaggle.com/ludobenistant/hr-analytics

R 환경 준비

일단은 필수패키지인 tidyverse, 그리고 머신러닝을 위한 몇가지 패키지를 로드하자. (로딩 메시지를 감추기 위해 suppressMessages() 명령을 사용.)

# install.packages("tidyverse")
suppressMessages(library(tidyverse))

# install.packages(c("ROCR", "MASS", "glmnet", "randomForest", "gbm", "rpart", "boot"))
suppressMessages(library(gridExtra))
suppressMessages(library(ROCR))
suppressMessages(library(MASS))
suppressMessages(library(glmnet))
## Warning: package 'glmnet' was built under R version 3.4.2
suppressMessages(library(randomForest))
suppressMessages(library(gbm))
suppressMessages(library(rpart))
suppressMessages(library(boot))

책에서 기술한대로 이항 오차 함수, 그리고 panel.cor 함수를 정의하자:

binomial_deviance <- function(y_obs, yhat){
  epsilon = 0.0001
  yhat = ifelse(yhat < epsilon, epsilon, yhat)
  yhat = ifelse(yhat > 1-epsilon, 1-epsilon, yhat)
  a = ifelse(y_obs==0, 0, y_obs * log(y_obs/yhat))
  b = ifelse(y_obs==1, 0, (1-y_obs) * log((1-y_obs)/(1-yhat)))
  return(2*sum(a + b))
}

# exmaple(pairs) 에서 따옴
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}

자료를 human-resources-analytics.zip 파일로 다운받은 후 다음처럼 R에 읽어들인다:

df <- read_csv("human-resources-analytics.zip")
## Parsed with column specification:
## cols(
##   satisfaction_level = col_double(),
##   last_evaluation = col_double(),
##   number_project = col_integer(),
##   average_montly_hours = col_integer(),
##   time_spend_company = col_integer(),
##   Work_accident = col_integer(),
##   left = col_integer(),
##   promotion_last_5years = col_integer(),
##   sales = col_character(),
##   salary = col_character()
## )
glimpse(df)
## Observations: 14,999
## Variables: 10
## $ satisfaction_level    <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10...
## $ last_evaluation       <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77...
## $ number_project        <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2...
## $ average_montly_hours  <int> 157, 262, 272, 223, 159, 153, 247, 259, ...
## $ time_spend_company    <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3...
## $ Work_accident         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ left                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sales                 <chr> "sales", "sales", "sales", "sales", "sal...
## $ salary                <chr> "low", "medium", "medium", "low", "low",...

분석의 목적은 다른 변수를 이용하여 left 여부를 예측하는 것이다.

각 변수들의 요약통계량을 살펴보자:

summary(df)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##  promotion_last_5years    sales              salary         
##  Min.   :0.00000       Length:14999       Length:14999      
##  1st Qu.:0.00000       Class :character   Class :character  
##  Median :0.00000       Mode  :character   Mode  :character  
##  Mean   :0.02127                                            
##  3rd Qu.:0.00000                                            
##  Max.   :1.00000

범주형 변수들의 도수분포는 다음과 같다:

table(df$left)
## 
##     0     1 
## 11428  3571
table(df$sales)
## 
##  accounting          hr          IT  management   marketing product_mng 
##         767         739        1227         630         858         902 
##       RandD       sales     support   technical 
##         787        4140        2229        2720
table(df$salary)
## 
##   high    low medium 
##   1237   7316   6446

수량형 변수들간의 관계는 산점도 행렬로 살펴볼 수 있다:

set.seed(2017)
df %>% 
  dplyr::select(-sales, -salary) %>% 
  sample_n(500) %>% 
  pairs(lower.panel=function(x,y){ points(x,y); abline(0, 1, col='red')},
    upper.panel = panel.cor)

(select 함수가 MASS 라이브러리에 재정의 된 관계로 dplyr::select()로 표기했다. ) 반응변수와 큰 상관관계가 있는 변수는 satisfaction_level 임을 알 수 있고, 설명변수중 past_evaluation, number_projects, average_monthly_hours 간에 비교적 높은 상관관계가 있음을 알 수 있다.

(혹시 시간이 걸리더라도 좀 더 고급진 산점도행렬을 얻고자 한다면 다음처럼 GGally::ggparis() 함수를 사용하자)

# install.packages("GGally")
# suppressMessages(library(GGally))
# set.seed(2017); df %>% sample_n(1000)  %>% GGally::ggpairs()

훈련, 검증, 테스트셋의 구분

모형행렬은 반응변수 left 를 제외한 모든 변수들을 model.matrix() 에 입력해주면 얻을 수 있다:

x <- model.matrix( ~ . - left, data=df)
glimpse(x)
##  num [1:14999, 1:19] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:14999] "1" "2" "3" "4" ...
##   ..$ : chr [1:19] "(Intercept)" "satisfaction_level" "last_evaluation" "number_project" ...
##  - attr(*, "assign")= int [1:19] 0 1 2 3 4 5 6 7 8 8 ...
##  - attr(*, "contrasts")=List of 2
##   ..$ sales : chr "contr.treatment"
##   ..$ salary: chr "contr.treatment"
colnames(x)
##  [1] "(Intercept)"           "satisfaction_level"   
##  [3] "last_evaluation"       "number_project"       
##  [5] "average_montly_hours"  "time_spend_company"   
##  [7] "Work_accident"         "promotion_last_5years"
##  [9] "saleshr"               "salesIT"              
## [11] "salesmanagement"       "salesmarketing"       
## [13] "salesproduct_mng"      "salesRandD"           
## [15] "salessales"            "salessupport"         
## [17] "salestechnical"        "salarylow"            
## [19] "salarymedium"
dim(x)
## [1] 14999    19

모형의 차원은 \(p=19\) 임을 알 수 있다.

원 데이터를 6:4:4 비율로 훈련, 검증, 테스트셋으로 나누도록 하자. (재현 가능성을 위해 set.seed()를 사용했다.)

set.seed(2017)
n <- nrow(df)
idx <- 1:n
training_idx <- sample(idx, n * .60)
idx <- setdiff(idx, training_idx)
validate_idx = sample(idx, n * .20)
test_idx <- setdiff(idx, validate_idx)
length(training_idx)
## [1] 8999
length(validate_idx)
## [1] 2999
length(test_idx)
## [1] 3001
training <- df[training_idx,]
validation <- df[validate_idx,]
test <- df[test_idx,]

A. 로지스틱 회귀분석

df_glm_full <- glm(left ~ ., data=training, family=binomial)
summary(df_glm_full)
## 
## Call:
## glm(formula = left ~ ., family = binomial, data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1949  -0.6781  -0.4119  -0.1144   3.0619  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.3207933  0.2458962  -5.371 7.82e-08 ***
## satisfaction_level    -4.0265898  0.1250099 -32.210  < 2e-16 ***
## last_evaluation        0.7373843  0.1911870   3.857 0.000115 ***
## number_project        -0.3409404  0.0274920 -12.401  < 2e-16 ***
## average_montly_hours   0.0047920  0.0006639   7.217 5.30e-13 ***
## time_spend_company     0.2535567  0.0198581  12.768  < 2e-16 ***
## Work_accident         -1.4459747  0.1117518 -12.939  < 2e-16 ***
## promotion_last_5years -1.5679693  0.3529061  -4.443 8.87e-06 ***
## saleshr                0.0462927  0.1712222   0.270 0.786879    
## salesIT               -0.1770899  0.1544876  -1.146 0.251669    
## salesmanagement       -0.6100011  0.2096773  -2.909 0.003623 ** 
## salesmarketing        -0.0031379  0.1678177  -0.019 0.985082    
## salesproduct_mng      -0.2147629  0.1636240  -1.313 0.189338    
## salesRandD            -0.6806671  0.1858707  -3.662 0.000250 ***
## salessales            -0.0982475  0.1306772  -0.752 0.452151    
## salessupport          -0.0494077  0.1395616  -0.354 0.723324    
## salestechnical         0.0328907  0.1357538   0.242 0.808562    
## salarylow              1.8403821  0.1628753  11.299  < 2e-16 ***
## salarymedium           1.3799875  0.1639047   8.419  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9921.2  on 8998  degrees of freedom
## Residual deviance: 7845.3  on 8980  degrees of freedom
## AIC: 7883.3
## 
## Number of Fisher Scoring iterations: 5

로지스틱 모형의 예측 정확도 지표는 다음처럼 계산하고 시각화할 수 있다:

y_obs <- validation$left
yhat_lm <- predict(df_glm_full, newdata=validation, type='response')
ggplot(data.frame(y_obs, yhat_lm),
             aes(yhat_lm, fill=factor(y_obs))) +
  geom_density(alpha=.5)