Notes 11-22

hn

November 22, 2017

Get some Data

library(RevoScaleR)
# Note that this file is an xdf file,
# so I can just create a data source for it.
myData <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall.xdf")
rxGetInfo(myData, getVarInfo = TRUE)
## File name: C:\Program Files\Microsoft\ML Server\R_SERVER\library\RevoScaleR\SampleData\mortDefaultSmall.xdf 
## Number of observations: 1e+05 
## Number of variables: 6 
## Number of blocks: 10 
## Compression type: zlib 
## Variable information: 
## Var 1: creditScore, Type: integer, Low/High: (470, 925)
## Var 2: houseAge, Type: integer, Low/High: (0, 40)
## Var 3: yearsEmploy, Type: integer, Low/High: (0, 14)
## Var 4: ccDebt, Type: integer, Low/High: (0, 14094)
## Var 5: year, Type: integer, Low/High: (2000, 2009)
## Var 6: default, Type: integer, Low/High: (0, 1)
rxOptions(reportProgress = 0)
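
The path above is passed straight to the RevoScaleR functions, and the summary output later in these notes labels it an RxXdfData data source. Here is a minimal sketch of building that data source explicitly, assuming RevoScaleR is installed. The names has_revo, xdfPath and mortDS are illustrative and do not come from the notes.

```r
# Sketch only: runs the RevoScaleR part when the package is available.
has_revo <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_revo) {
  library(RevoScaleR)

  # Same sample file as above
  xdfPath <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall.xdf")

  # Wrap the path in an explicit xdf data source object
  mortDS <- RxXdfData(xdfPath)

  # Variable metadata, comparable to rxGetInfo(..., getVarInfo = TRUE)
  print(names(rxGetVarInfo(mortDS)))
}
```

Passing mortDS instead of the bare path should work in the same calls (for example as the data argument of rxSummary).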

Explore the Data

Which variables might be related to default?

What about credit card debt? How would we explore this? Use rxHistogram()?

Histogram Results

rxHistogram(~ccDebt|default,data = myData)

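This is hard to read: only 471 of the 100,000 observations are defaults, so the counts in the default = 1 panel are tiny next to those in the default = 0 panel.

Try setting histType to "Percent".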

Results

rxHistogram(~ccDebt|default,data = myData,
  histType = "Percent")  

Much clearer!
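
To see why percentages help, here is a small base R sketch on simulated data (not the mortgage file). The group sizes and rough means and standard deviations are borrowed from the summary output later in these notes. It assumes the percentages are computed within each group, which the clearer plot suggests but the notes do not state.

```r
set.seed(1)

# Simulated stand-ins for ccDebt in the two groups (illustrative only)
debt0 <- rnorm(99529, mean = 4986, sd = 1972)  # default = 0
debt1 <- rnorm(471,   mean = 8910, sd = 1495)  # default = 1

# Shared bins of width 500 that cover both groups
lo <- floor(min(debt0, debt1) / 500) * 500
hi <- ceiling(max(debt0, debt1) / 500) * 500
breaks <- seq(lo, hi, by = 500)

h0 <- hist(debt0, breaks = breaks, plot = FALSE)
h1 <- hist(debt1, breaks = breaks, plot = FALSE)

# On a count scale the small group is dwarfed by the large one
count_ratio <- max(h1$counts) / max(h0$counts)

# Within-group percentages put both groups on a comparable 0-100 scale
pct0 <- 100 * h0$counts / sum(h0$counts)
pct1 <- 100 * h1$counts / sum(h1$counts)

count_ratio
range(pct0)
range(pct1)
```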

rxSummary?

Results

rxSummary(~ccDebt:default,data=myData)
## Call:
## rxSummary(formula = ~ccDebt:default, data = myData)
## 
## Summary Statistics Results for: ~ccDebt:default
## Data: myData (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
##     Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Number of valid observations: 1e+05 
##  
##  Name           Mean     StdDev   Min Max   ValidObs MissingObs
##  ccDebt:default 41.96819 618.6241 0   14094 1e+05    0

Why did we only get one summary? Can we fix this?

Results

rxSummary(~ccDebt:F(default),data=myData,reportProgress = 0)
## Call:
## rxSummary(formula = ~ccDebt:F(default), data = myData, reportProgress = 0)
## 
## Summary Statistics Results for: ~ccDebt:F(default)
## Data: myData (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
##     Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Number of valid observations: 1e+05 
##  
##  Name             Mean     StdDev  Min Max   ValidObs MissingObs
##  ccDebt:F_default 5004.399 1988.02 0   14094 1e+05    0         
## 
## Statistics by category (2 categories):
## 
##  Category                F_default Means    StdDev   Min  Max   ValidObs
##  ccDebt for F(default)=0 0         4985.914 1971.760    0 12923 99529   
##  ccDebt for F(default)=1 1         8910.444 1494.536 3003 14094   471
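
One reading of the two outputs above: because default is stored as an integer, ccDebt:default appears to be treated as the numeric product of the two variables rather than as a grouping, while F(default) converts it to a factor on the fly. The check below uses only the numbers printed above (plain R, no RevoScaleR needed). The variable names are illustrative.

```r
# Values copied from the rxSummary output above
n0 <- 99529; mean0 <- 4985.914   # default = 0
n1 <- 471;   mean1 <- 8910.444   # default = 1
n  <- n0 + n1

# If ccDebt:default is the product ccDebt * default, its mean is the total
# ccDebt of the defaulters divided by the number of all observations.
product_mean <- mean1 * n1 / n

# The overall ccDebt mean is the count-weighted average of the group means.
overall_mean <- (mean0 * n0 + mean1 * n1) / n

round(product_mean, 5)
round(overall_mean, 3)
```

Both values agree, up to rounding, with the 41.96819 reported by the first call and the 5004.399 reported by the second.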

Credit Rating

Do the same exercise for the credit rating variable, creditScore.

Credit Rating Results

creditScore

rxHistogram(~creditScore|default,data = myData,histType = "Percent",reportProgress = 0)  

Let’s look at rxSummary.

Results

rxSummary(~creditScore:F(default),data=myData,reportProgress = 0)
## Call:
## rxSummary(formula = ~creditScore:F(default), data = myData, reportProgress = 0)
## 
## Summary Statistics Results for: ~creditScore:F(default)
## Data: myData (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
##     Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Number of valid observations: 1e+05 
##  
##  Name                  Mean     StdDev   Min Max ValidObs MissingObs
##  creditScore:F_default 699.8854 50.15867 470 925 1e+05    0         
## 
## Statistics by category (2 categories):
## 
##  Category                     F_default Means    StdDev   Min Max ValidObs
##  creditScore for F(default)=0 0         699.9573 50.14286 470 925 99529   
##  creditScore for F(default)=1 1         684.6943 51.23031 559 842   471
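
To compare the two variables on a common scale, the gap between group means can be expressed in standard deviation units. This is a rough sketch using only figures printed in the outputs above. Dividing by the non-default group's SD is a simplifying choice made here, not something from the notes, and std_gap is a hypothetical helper.

```r
# Difference in group means, scaled by the SD of the default = 0 group
std_gap <- function(mean1, mean0, sd0) (mean1 - mean0) / sd0

# Means and SDs copied from the two rxSummary outputs
gap_ccDebt      <- std_gap(8910.444, 4985.914, 1971.760)
gap_creditScore <- std_gap(684.6943, 699.9573, 50.14286)

round(c(ccDebt = gap_ccDebt, creditScore = gap_creditScore), 2)
```

On this measure, defaulters carry roughly two standard deviations more credit card debt but sit only about 0.3 standard deviations lower on credit score, so ccDebt separates the two groups far more clearly.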

Scatterplot?