Setup

Load the libraries and locate the mortgage default file.

library(RevoScaleR)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
rxOptions(reportProgress = 0)
# The sample file is already in xdf format,
# so its path can be used directly as a data source.
default <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall.xdf")
rxGetInfo(default, getVarInfo = TRUE)
## File name: C:\Program Files\Microsoft\ML Server\R_SERVER\library\RevoScaleR\SampleData\mortDefaultSmall.xdf 
## Number of observations: 1e+05 
## Number of variables: 6 
## Number of blocks: 10 
## Compression type: zlib 
## Variable information: 
## Var 1: creditScore, Type: integer, Low/High: (470, 925)
## Var 2: houseAge, Type: integer, Low/High: (0, 40)
## Var 3: yearsEmploy, Type: integer, Low/High: (0, 14)
## Var 4: ccDebt, Type: integer, Low/High: (0, 14094)
## Var 5: year, Type: integer, Low/High: (2000, 2009)
## Var 6: default, Type: integer, Low/High: (0, 1)

Random Sample from an xdf File

Visit

https://support.microsoft.com/en-us/help/3104278/qa-how-can-i-randomly-select-data-from-an--xdf-file

Adapt the code from that article to draw a 10% random sample from the mortgage default file. Modify it so that it returns a data frame instead of writing an xdf file.

Then summarize ccDebt and plot its histogram several times, removing the set.seed() call so that each run draws a different sample.

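set.seed(13)
xform <- function(data) {
  # Keep each row with probability 0.1 via the special .rxRowSelection variable
  data$.rxRowSelection <- as.logical(rbinom(length(data[[1]]), 1, 0.1))
  return(data)
}

# No outFile is given, so rxDataStep returns a data frame
sample1 <- rxDataStep(default, transformFunc = xform, overwrite = TRUE)

# Check that the subsetting worked and that the row-selection variable was not kept in the data set.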

str(sample1)
## 'data.frame':    10211 obs. of  6 variables:
##  $ creditScore: int  594 716 634 725 740 709 729 645 627 649 ...
##  $ houseAge   : int  29 23 3 7 13 27 22 23 26 14 ...
##  $ yearsEmploy: int  7 7 4 0 4 3 3 2 4 6 ...
##  $ ccDebt     : int  3897 5833 4957 4479 6261 6487 2100 7736 5596 2490 ...
##  $ year       : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ default    : int  0 0 0 0 0 0 0 0 0 0 ...
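
The sample above has 10211 rows rather than exactly 10000 because xform() keeps each row independently with probability 0.1. The following base-R sketch illustrates that selection rule on its own, without RevoScaleR. The names n_rows and keep are introduced here for illustration only, and the row count is taken from the rxGetInfo output.

```r
# Base-R illustration of the row-selection rule used in xform().
# n_rows matches the 1e+05 observations reported by rxGetInfo.
set.seed(13)                      # seed only so this sketch is repeatable
n_rows <- 100000
keep <- as.logical(rbinom(n_rows, size = 1, prob = 0.1))

sum(keep)                         # rows kept: near 10000, rarely exactly 10000
mean(keep)                        # realized sampling fraction: near 0.1

# Typical spread of the kept-row count: sd of a Binomial(n, p) variable
sqrt(n_rows * 0.1 * 0.9)          # about 95 rows
```

If an exact sample size were needed, a different scheme (for example, sampling row indices without replacement) would be required; the Bernoulli rule is convenient because it can be applied one block at a time.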

Repeat

Examine a summary and histogram of ccDebt on several different random samples by re-running the code without set.seed().

sample1 <- rxDataStep(default, transformFunc = xform, overwrite = TRUE)

hist(sample1$ccDebt)

summary(sample1$ccDebt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3635    4992    4971    6315   12862
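
To get a feel for how much the sample mean should move from one 10% sample to the next, the repetition can be automated. This is a minimal sketch that runs without RevoScaleR: fake_ccDebt is a synthetic stand-in (not the real column) with a center and spread loosely similar to ccDebt, and the choice of 20 repetitions is arbitrary.

```r
# Synthetic stand-in for ccDebt: NOT the real data, just a vector with a
# similar center and spread so the sketch runs in base R.
set.seed(1)
fake_ccDebt <- pmax(0, rnorm(100000, mean = 5000, sd = 2000))

# Draw several independent ~10% Bernoulli samples and record each mean.
sample_means <- replicate(20, {
  keep <- as.logical(rbinom(length(fake_ccDebt), 1, 0.1))
  mean(fake_ccDebt[keep])
})

range(sample_means)               # how far apart the sample means land
sd(sample_means)                  # empirical spread across samples
sd(fake_ccDebt) / sqrt(10000)     # rough theoretical standard error
```

With a standard deviation near 2000 and roughly 10000 rows per sample, the sample means should differ by a few tens of units, which is small relative to a mean near 5000.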

The Whole Thing

Do the summary and histogram on the entire xdf file using RevoScaleR functions.

rxSummary(~ccDebt, data = default)
## Call:
## rxSummary(formula = ~ccDebt, data = default)
## 
## Summary Statistics Results for: ~ccDebt
## Data: default (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
##     Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Number of valid observations: 1e+05 
##  
##  Name   Mean     StdDev  Min Max   ValidObs MissingObs
##  ccDebt 5004.399 1988.02 0   14094 1e+05    0
rxHistogram(~ccDebt, data = default)
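
Summaries like these can be computed on a file that is read one block at a time, because a mean and standard deviation only need running totals. The sketch below is conceptual and is not RevoScaleR's actual implementation: it uses synthetic data split into 10 blocks (matching the block count reported by rxGetInfo) and combines per-block counts, sums, and sums of squares. The names x, blocks, and acc are hypothetical.

```r
# Conceptual sketch only: accumulate count, sum, and sum of squares per block,
# then combine. The data are synthetic, not the mortgage file.
set.seed(2)
x <- rnorm(100000, mean = 5000, sd = 2000)
blocks <- split(x, rep(1:10, each = 10000))

acc <- list(n = 0, s = 0, ss = 0)
for (b in blocks) {               # each block could be read from disk in turn
  acc$n  <- acc$n  + length(b)
  acc$s  <- acc$s  + sum(b)
  acc$ss <- acc$ss + sum(b^2)
}

chunk_mean <- acc$s / acc$n
chunk_sd   <- sqrt((acc$ss - acc$n * chunk_mean^2) / (acc$n - 1))

all.equal(chunk_mean, mean(x))    # TRUE: same answer as the in-memory mean
all.equal(chunk_sd, sd(x))        # TRUE up to floating-point tolerance
```

Only three numbers are carried between blocks, so memory use does not grow with the number of rows.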

The Big Question

How much more do you know as a result of using the entire xdf file?

Look at a few scatterplots using ggplot2.

sample1 %>%
  ggplot(aes(ccDebt, creditScore, color = as.factor(default))) +
  geom_point(alpha = 0.2)

Linear Model

First, fit a linear model on sample1.

lm1 <- lm(default ~ ccDebt + creditScore, data = sample1)
summary(lm1)
## 
## Call:
## lm(formula = default ~ ccDebt + creditScore, data = sample1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03014 -0.00807 -0.00353  0.00109  0.99148 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.686e-03  8.437e-03  -0.318     0.75    
## ccDebt       3.436e-06  3.001e-07  11.448   <2e-16 ***
## creditScore -1.549e-05  1.182e-05  -1.310     0.19    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05904 on 9874 degrees of freedom
## Multiple R-squared:  0.01327,    Adjusted R-squared:  0.01307 
## F-statistic: 66.41 on 2 and 9874 DF,  p-value: < 2.2e-16

Now run a linear model on the whole xdf file.

lm2 <- rxLinMod(default ~ ccDebt + creditScore, data = default)
summary(lm2)
## Call:
## rxLinMod(formula = default ~ ccDebt + creditScore, data = default)
## 
## Linear Regression Results for: default ~ ccDebt + creditScore
## Data: default (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
##     Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Dependent variable(s): default
## Total independent variables: 3 
## Number of valid observations: 1e+05
## Number of missing observations: 0 
##  
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.228e-04  3.051e-03    0.27    0.787    
## ccDebt       4.652e-06  1.079e-07   43.12 2.22e-16 ***
## creditScore -2.771e-05  4.276e-06   -6.48 9.20e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06783 on 99997 degrees of freedom
## Multiple R-squared: 0.01868 
## Adjusted R-squared: 0.01866 
## F-statistic: 951.8 on 2 and 99997 DF,  p-value: < 2.2e-16 
## Condition number: 1.0079
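
One way to approach the big question is to compare the standard errors of the two fits. As a rough check, and assuming the predictors have a similar spread in the sample and in the full file, a coefficient's standard error should scale like the residual standard error divided by the square root of the number of rows. The sketch below plugs in the values printed in the two summaries; every variable name is introduced here for illustration. The sample row count is recovered from the residual degrees of freedom, since sample1 was redrawn after the str() output.

```r
# Values copied from the two model summaries above.
n_sample <- 9874 + 3              # residual df + 3 estimated coefficients
n_full   <- 99997 + 3
rse_sample <- 0.05904             # residual standard errors
rse_full   <- 0.06783

se_sample <- c(ccDebt = 3.001e-07, creditScore = 1.182e-05)
se_full   <- c(ccDebt = 1.079e-07, creditScore = 4.276e-06)

# Observed shrinkage of each standard error versus the rough prediction
observed  <- se_sample / se_full
predicted <- sqrt(n_full / n_sample) * rse_sample / rse_full

round(observed, 2)
round(predicted, 2)
```

The observed and predicted ratios land close together, which suggests that the main gain from the full file is tighter standard errors. That is enough to turn the creditScore coefficient from non-significant in the sample (p = 0.19) to clearly significant in the full fit.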