Statistical Analysis-rxLogit

Sample Code


The data used here is one of the Sample Data files in the Revolution folder: a csv file named "mortDefaultSmall2009".

The data set has 10000 rows and 6 variables (columns): 'creditScore', 'houseAge', 'yearsEmploy', 'ccDebt', 'year', and 'default'.

Only the 2009 file is used here, but sample data is available for the ten years from 2000 to 2009.

The files for the other years will be used in later parts, such as the one on combining data.

1. Importing the data


# (1-1) Specify the data location. The file is one of the sample data sets
# created automatically when Revolution R is installed: the
# 'mortDefaultSmall2009' csv file in the SampleData folder.

text_mort <- "C:/Revolution/R-Enterprise-6.1/R-2.14.2/library/RevoScaleR/SampleData/mortDefaultSmall2009.csv"

# (1-2) Import the data with rxImport

data_mort <- rxImport(inData = text_mort, overwrite = TRUE)
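To confirm that an import matches the description above (six columns, one row per loan), a quick shape check helps. The sketch below is only a stand-in: it uses base R's read.csv on a small temporary file instead of rxImport, and the three data rows are made up. Only the column names come from this document.

```r
# Hypothetical rows for illustration; only the header matches the real file.
csv_text <- c(
  "creditScore,houseAge,yearsEmploy,ccDebt,year,default",
  "691,16,9,6725,2009,0",
  "743,18,3,3080,2009,0",
  "650,25,1,11400,2009,1"
)

# Write the text to a temporary csv file and read it back with base R.
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)
demo_mort <- read.csv(tmp)

# Check the shape, the column names, and the values taken by 'default'.
dim(demo_mort)
names(demo_mort)
table(demo_mort$default)
```

With the real file, the same dim() and names() calls on data_mort should report 10000 rows and the six columns listed above.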

2. Logistic regression with rxLogit


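# (1-3) Logistic regression with rxLogit

# Comparing rxLogit and glm

# Logistic regression with rxLogit ('default' is the dependent variable and
# takes only the values 0 and 1; the other variables, except 'year', are
# the independent variables)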

mort_logit <- rxLogit(default ~ creditScore + houseAge + yearsEmploy + ccDebt, 
    data = data_mort)


# Summary of the rxLogit fit

summary(mort_logit)

## Call:
## rxLogit(formula = default ~ creditScore + houseAge + yearsEmploy + 
##     ccDebt, data = data_mort)
## 
## Logistic Regression Results for: default ~ creditScore + houseAge
##     + yearsEmploy + ccDebt
## Data: data_mort
## Dependent variable(s): default
## Total independent variables: 5 
## Number of valid observations: 10000
## Number of missing observations: 0 
## -2*LogLikelihood: 1318.647 (Residual deviance on 9995 degrees of freedom)
##  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.55e+00   1.14e+00   -4.87  1.1e-06 ***
## creditScore -8.73e-03   1.57e-03   -5.56  2.6e-08 ***
## houseAge     3.20e-02   1.02e-02    3.12   0.0018 ** 
## yearsEmploy -2.67e-01   4.00e-02   -6.66  2.7e-11 ***
## ccDebt       1.22e-03   5.65e-05   21.55  2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Condition number of final variance-covariance matrix: 2.103 
## Number of iterations: 8
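The estimates above are on the log-odds scale. As a rough aid to reading them, the sketch below copies the rounded estimates from the printed summary, converts them to odds ratios with exp(), and evaluates the fitted probability for one borrower. The borrower's values are invented for illustration, and because the inputs are rounded the results are approximate.

```r
# Rounded estimates copied from the summary output above.
est <- c("(Intercept)" = -5.55,
         creditScore   = -8.73e-03,
         houseAge      =  3.20e-02,
         yearsEmploy   = -2.67e-01,
         ccDebt        =  1.22e-03)

# Odds ratio per one-unit increase in each predictor (intercept dropped).
odds_ratio <- exp(est[-1])
print(round(odds_ratio, 4))

# Odds ratio for a 1000-unit increase in credit card debt.
or_ccDebt_1000 <- exp(1000 * est[["ccDebt"]])

# Hypothetical borrower (values are assumptions, not taken from the data).
borrower <- c("(Intercept)" = 1, creditScore = 700, houseAge = 20,
              yearsEmploy = 5, ccDebt = 5000)

# Linear predictor and fitted default probability via the logistic function.
eta <- sum(est * borrower[names(est)])
p_default <- plogis(eta)
print(p_default)
```

A negative estimate (creditScore, yearsEmploy) gives an odds ratio below 1, so higher values go with lower odds of default; a positive estimate (houseAge, ccDebt) gives an odds ratio above 1.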

3. Logistic regression with glm


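# Logistic regression with glm (family = binomial must be set to get a
# logistic regression; otherwise glm uses its default gaussian family and
# fits an ordinary multiple linear regression.)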

mort_glm <- glm(default ~ creditScore + houseAge + yearsEmploy + ccDebt, data = data_mort, 
    family = binomial())
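The effect of the family argument can be checked on a small simulated data set. The sketch below is illustrative only: the data and the names demo, fit_default, fit_binomial and fit_lm are assumptions, not part of the mortgage example.

```r
# Simulated binary outcome with one predictor (hypothetical data).
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
demo <- data.frame(x = x, y = y)

# Without 'family', glm falls back to gaussian and matches lm.
fit_default  <- glm(y ~ x, data = demo)
fit_lm       <- lm(y ~ x, data = demo)

# With family = binomial(), glm fits a logistic regression (logit link).
fit_binomial <- glm(y ~ x, data = demo, family = binomial())

family(fit_default)$family
family(fit_binomial)$link
all.equal(coef(fit_default), coef(fit_lm))

# Fitted values of the logistic model are probabilities strictly in (0, 1).
range(fitted(fit_binomial))
```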

# Summary of the logistic regression fitted with glm()

summary(mort_glm)

## 
## Call:
## glm(formula = default ~ creditScore + houseAge + yearsEmploy + 
##     ccDebt, family = binomial(), data = data_mort)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.027  -0.134  -0.057  -0.024   3.961  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.55e+00   1.14e+00   -4.87  1.1e-06 ***
## creditScore -8.73e-03   1.57e-03   -5.56  2.6e-08 ***
## houseAge     3.20e-02   1.02e-02    3.12   0.0018 ** 
## yearsEmploy -2.67e-01   4.00e-02   -6.66  2.7e-11 ***
## ccDebt       1.22e-03   5.65e-05   21.55  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2227.3  on 9999  degrees of freedom
## Residual deviance: 1318.6  on 9995  degrees of freedom
## AIC: 1329
## 
## Number of Fisher Scoring iterations: 8
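The deviance lines in this summary can be turned into an overall test of the model. The sketch below uses only the numbers printed above and assumes ungrouped 0/1 responses, in which case the residual deviance equals -2 times the log-likelihood (the same value rxLogit reports as -2*LogLikelihood).

```r
# Values copied from the glm summary above.
null_dev  <- 2227.3; null_df  <- 9999
resid_dev <- 1318.6; resid_df <- 9995

# Likelihood ratio test of the four predictors against the intercept-only model.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared (valid here because the response is binary).
mcfadden_r2 <- 1 - resid_dev / null_dev

# AIC = -2 * logLik + 2 * (number of coefficients, including the intercept).
n_coef <- 5
aic <- resid_dev + 2 * n_coef

print(c(lr_stat = lr_stat, lr_df = lr_df, mcfadden_r2 = mcfadden_r2, aic = aic))
print(p_value)
```

The recomputed AIC rounds to the 1329 shown in the summary, which is a useful consistency check between the two printouts.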

4. Comparing rxLogit and glm

system.time(mort_logit <- rxLogit(default ~ creditScore + houseAge + yearsEmploy + 
    ccDebt, data = data_mort))

## Rows Read: 10000, Total Rows Processed: 10000, Total Chunk Time: 0.004 seconds 
## 
## Starting values (iteration 1) time: 0.015 secs.
## Rows Read: 10000, Total Rows Processed: 10000, Total Chunk Time: 0.006 seconds 
## 
## Iteration 2 time: 0.019 secs.
## Rows Read: 10000, Total Rows Processed: 10000, Total Chunk Time: 0.007 seconds 
## 
## Iteration 3 time: 0.015 secs.
## Rows Read: 10000, Total Rows Processed: 10000, Total Chunk Time: 0.006 seconds 
## 
## Iteration 4 time: 0.015 secs.
## Rows Read: 10000, Total Rows Processed: 10000, Total Chunk Time: 0.006 seconds 
## 
## Iteration 5 time: 0.014 secs.
## Rows Read: 10000, Total Rows Processed: 10000, Total Chunk Time: 0.007 seconds 
## 
## Iteration 6 time: 0.015 secs.
## Rows Read: 10000, Total Rows Processed: 10000, Total Chunk Time: 0.005 seconds 
## 
## Iteration 7 time: 0.014 secs.
## Rows Read: 10000, Total Rows Processed: 10000, Total Chunk Time: 0.006 seconds 
## 
## Iteration 8 time: 0.016 secs.
## 
## Elapsed computation time: 0.125 secs.

##    user  system elapsed 
##    0.14    0.00    0.14

system.time(mort_glm <- glm(default ~ creditScore + houseAge + yearsEmploy + 
    ccDebt, data = data_mort, family = binomial()))

##    user  system elapsed 
##    0.50    0.01    0.34

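Comparing rxLogit and glm shows that the two give the same results: the coefficient estimates, standard errors, and z values agree, and the residual deviance (1318.6 on 9995 degrees of freedom) is identical.

The run times differ, however. Although the example data set is small, rxLogit was faster, with an elapsed time of 0.14 seconds against 0.34 seconds for glm.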
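Two different functions can return the same estimates because both maximize the same likelihood. The sketch below illustrates this with base R only: a small hand-written Newton-Raphson (IRLS) fit is compared with glm on simulated data, and both are timed with system.time. The function irls_logit and the simulated data are hypothetical; this is not rxLogit's actual implementation, and timings on such a small problem vary from run to run.

```r
# Hypothetical helper: logistic regression by Newton-Raphson / IRLS.
irls_logit <- function(X, y, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    p <- plogis(drop(X %*% beta))   # current fitted probabilities
    w <- p * (1 - p)                # IRLS weights
    # Newton step: solve (X' W X) delta = X' (y - p)
    delta <- solve(crossprod(X, w * X), crossprod(X, y - p))
    beta <- beta + drop(delta)
    if (max(abs(delta)) < tol) break
  }
  names(beta) <- colnames(X)
  beta
}

# Simulated data (assumed values, unrelated to the mortgage file).
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 - 0.5 * x2))
X  <- cbind("(Intercept)" = 1, x1 = x1, x2 = x2)

# Fit with both approaches and record the run times.
time_glm  <- system.time(fit_glm <- glm(y ~ x1 + x2, family = binomial()))
time_irls <- system.time(beta_irls <- irls_logit(X, y))

# The estimates should agree up to numerical tolerance.
same_fit <- all.equal(unname(coef(fit_glm)), unname(beta_irls), tolerance = 1e-6)
print(same_fit)
print(rbind(glm = time_glm[1:3], irls = time_irls[1:3]))
```

For a fairer timing comparison, the fits can be repeated several times and the elapsed times averaged, since a single run of a fraction of a second is dominated by noise.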

Appendix

rxLogit

Usage

rxLogit(formula, data, pweights = NULL, fweights = NULL, cube = FALSE,
        cubePredictions = FALSE, rowSelection = NULL, transforms = NULL,
        transformObjects = NULL, transformFunc = NULL, transformVars = NULL,
        transformPackages = NULL, transformEnvir = NULL,
        dropFirst = FALSE, covCoef = FALSE, covData = FALSE,
        covariance = FALSE, initialValues = NULL,
        blocksPerRead = rxGetOption("blocksPerRead"), maxIterations = 25,
        coeffTolerance = 1e-06, gradientTolerance = 1e-06,
        objectiveFunctionTolerance = 1e-08,
        reportProgress = rxGetOption("reportProgress"), verbose = 0,
        computeContext = rxGetOption("computeContext"), ...)

Hankuk University of Foreign Studies, Dept. of Statistics, Daewoo Choi Lab, Yeeseul Han
(한국외국어대학교 통계학과 최대우 연구실 한이슬)
e-mail : han.lolove17@gmail.com