hn
November 22, 2017
library(RevoScaleR)
# Note that this file is already an .xdf file,
# so I can just create a data source for it.
myData <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall.xdf")
rxGetInfo(myData, getVarInfo = TRUE)
## File name: C:\Program Files\Microsoft\ML Server\R_SERVER\library\RevoScaleR\SampleData\mortDefaultSmall.xdf
## Number of observations: 1e+05
## Number of variables: 6
## Number of blocks: 10
## Compression type: zlib
## Variable information:
## Var 1: creditScore, Type: integer, Low/High: (470, 925)
## Var 2: houseAge, Type: integer, Low/High: (0, 40)
## Var 3: yearsEmploy, Type: integer, Low/High: (0, 14)
## Var 4: ccDebt, Type: integer, Low/High: (0, 14094)
## Var 5: year, Type: integer, Low/High: (2000, 2009)
## Var 6: default, Type: integer, Low/High: (0, 1)
rxOptions(reportProgress = 0)
Which variables might be related to default?
What about credit card debt? How would we explore this? Use rxHistogram()?
rxHistogram(~ccDebt|default,data = myData)
This is hard to read on a count scale because there are so few defaults, so the default = 1 panel is nearly flat.
Try setting histType to "Percent".
rxHistogram(~ccDebt|default,data = myData,
histType = "Percent")
Much clearer!
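Why does the percent scale help? A minimal base R sketch, assuming a made-up toy data set (the `toy` data frame and its numbers are hypothetical, not the mortgage sample): counts are dominated by the large group, while percentages are computed within each group, so each row adds up to 100.

```r
# Hypothetical toy data: 95 non-defaults and 5 defaults (NOT the mortgage sample).
set.seed(1)
toy <- data.frame(
  default = rep(c(0, 1), times = c(95, 5)),
  ccDebt  = c(rnorm(95, mean = 5000, sd = 2000),
              rnorm(5,  mean = 9000, sd = 1500))
)

# Use the same bins for both groups so the panels are comparable.
breaks  <- seq(floor(min(toy$ccDebt)), ceiling(max(toy$ccDebt)), length.out = 6)
toy$bin <- cut(toy$ccDebt, breaks = breaks, include.lowest = TRUE)

# Raw counts per group and bin: the default = 1 row is tiny.
counts <- table(toy$default, toy$bin)

# Percent within each group (each row sums to 100), which is the idea
# behind a percent-scaled histogram per panel.
percents <- 100 * prop.table(counts, margin = 1)

print(counts)
print(round(percents, 1))
```

With raw counts the rare group barely registers; with row percentages both groups are on the same 0 to 100 scale, so the shapes can be compared.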
rxSummary(~ccDebt:default,data=myData)
## Call:
## rxSummary(formula = ~ccDebt:default, data = myData)
##
## Summary Statistics Results for: ~ccDebt:default
## Data: myData (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
## Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Number of valid observations: 1e+05
##
## Name Mean StdDev Min Max ValidObs MissingObs
## ccDebt:default 41.96819 618.6241 0 14094 1e+05 0
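Why did we get only one summary row? default is stored as an integer (see the variable information above), so it is not treated as a grouping variable. Can we fix this? Wrapping it in F() treats it as a factor: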
rxSummary(~ccDebt:F(default),data=myData,reportProgress = 0)
## Call:
## rxSummary(formula = ~ccDebt:F(default), data = myData, reportProgress = 0)
##
## Summary Statistics Results for: ~ccDebt:F(default)
## Data: myData (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
## Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Number of valid observations: 1e+05
##
## Name Mean StdDev Min Max ValidObs MissingObs
## ccDebt:F_default 5004.399 1988.02 0 14094 1e+05 0
##
## Statistics by category (2 categories):
##
## Category F_default Means StdDev Min Max ValidObs
## ccDebt for F(default)=0 0 4985.914 1971.760 0 12923 99529
## ccDebt for F(default)=1 1 8910.444 1494.536 3003 14094 471
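The single row from the first rxSummary call is consistent with ccDebt:default being summarized as the product of two numeric columns: its mean (41.96819) equals the mean debt of defaulters times the share of defaulters. Here is a small base R sketch of the same effect, assuming a hypothetical four-row `toy` data frame (not the mortgage sample), with `tapply()` standing in for the by-category summary.

```r
# Hypothetical toy data (NOT the mortgage sample).
toy <- data.frame(
  ccDebt  = c(1000, 2000, 3000, 9000),
  default = c(0L, 0L, 0L, 1L)
)

# Numeric x numeric: one number, the mean of the row-wise product.
product_mean <- mean(toy$ccDebt * toy$default)

# Factor grouping: one mean per level of default.
group_means <- tapply(toy$ccDebt, factor(toy$default), mean)

print(product_mean)
print(group_means)

# Same check with the figures printed on the slide:
# mean ccDebt of defaulters * number of defaulters / total observations.
slide_check <- 8910.444 * 471 / 1e5
print(slide_check)  # about 41.968, matching the first rxSummary mean
```

So the first call did not fail; it just answered a different question than the one we were asking.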
Do the same exercise for credit score.
creditScore
rxHistogram(~creditScore|default,data = myData,histType = "Percent",reportProgress = 0)
Let’s look at the rxSummary results.
rxSummary(~creditScore:F(default),data=myData,reportProgress = 0)
## Call:
## rxSummary(formula = ~creditScore:F(default), data = myData, reportProgress = 0)
##
## Summary Statistics Results for: ~creditScore:F(default)
## Data: myData (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
## Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Number of valid observations: 1e+05
##
## Name Mean StdDev Min Max ValidObs MissingObs
## creditScore:F_default 699.8854 50.15867 470 925 1e+05 0
##
## Statistics by category (2 categories):
##
## Category F_default Means StdDev Min Max ValidObs
## creditScore for F(default)=0 0 699.9573 50.14286 470 925 99529
## creditScore for F(default)=1 1 684.6943 51.23031 559 842 471
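To compare the two variables on a common scale, one option is a standardized mean difference (difference in group means divided by a pooled standard deviation). This is only a rough sketch, not part of the original analysis: the figures are copied from the by-category rows of the rxSummary output above, and the names `grp`, `pooled_sd` and `std_diff` are made up for illustration.

```r
# Group statistics copied from the rxSummary by-category output.
grp <- data.frame(
  variable = c("ccDebt", "creditScore"),
  mean0 = c(4985.914, 699.9573), sd0 = c(1971.760, 50.14286),  # default = 0
  mean1 = c(8910.444, 684.6943), sd1 = c(1494.536, 51.23031)   # default = 1
)
n0 <- 99529  # valid observations with default = 0
n1 <- 471    # valid observations with default = 1

# Pooled standard deviation across the two groups.
pooled_sd <- sqrt(((n0 - 1) * grp$sd0^2 + (n1 - 1) * grp$sd1^2) / (n0 + n1 - 2))

# Standardized difference: (defaulters - non-defaulters) / pooled SD.
grp$std_diff <- (grp$mean1 - grp$mean0) / pooled_sd

print(grp[, c("variable", "std_diff")])
```

On these numbers, defaulters carry roughly two pooled standard deviations more credit card debt, while their credit scores are lower by roughly 0.3 standard deviations, so ccDebt looks like the more strongly related variable of the two.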