Load the libraries and locate the mortgage default file.
library(RevoScaleR)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
rxOptions(reportProgress = 0)
# Note that this file is already an xdf file,
# so I can use its path directly as the data source.
default <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall.xdf")
rxGetInfo(default, getVarInfo = TRUE)
## File name: C:\Program Files\Microsoft\ML Server\R_SERVER\library\RevoScaleR\SampleData\mortDefaultSmall.xdf
## Number of observations: 1e+05
## Number of variables: 6
## Number of blocks: 10
## Compression type: zlib
## Variable information:
## Var 1: creditScore, Type: integer, Low/High: (470, 925)
## Var 2: houseAge, Type: integer, Low/High: (0, 40)
## Var 3: yearsEmploy, Type: integer, Low/High: (0, 14)
## Var 4: ccDebt, Type: integer, Low/High: (0, 14094)
## Var 5: year, Type: integer, Low/High: (2000, 2009)
## Var 6: default, Type: integer, Low/High: (0, 1)
Visit https://support.microsoft.com/en-us/help/3104278/qa-how-can-i-randomly-select-data-from-an--xdf-file and apply the code there to draw a roughly 10% random sample from our mortgage default file. Modify the code so that it returns a data frame instead of writing an xdf file.
Then produce a summary and histogram of ccDebt several times, removing the set.seed() call so that each run draws a different sample.
set.seed(13)
xform <- function(data) {
  # Keep each row independently with probability 0.1
  data$.rxRowSelection <- as.logical(rbinom(length(data[[1]]), 1, 0.1))
  return(data)
}
# No outFile is given, so rxDataStep returns a data frame.
sample1 = rxDataStep(default, transformFunc = xform, overwrite = TRUE)
# Check that subsetting was done and that the row selection variable is not kept in the data set.
str(sample1)
## 'data.frame': 10211 obs. of 6 variables:
## $ creditScore: int 594 716 634 725 740 709 729 645 627 649 ...
## $ houseAge : int 29 23 3 7 13 27 22 23 26 14 ...
## $ yearsEmploy: int 7 7 4 0 4 3 3 2 4 6 ...
## $ ccDebt : int 3897 5833 4957 4479 6261 6487 2100 7736 5596 2490 ...
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ default : int 0 0 0 0 0 0 0 0 0 0 ...
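The row count above is close to, but not exactly, 10% of the file. That follows from how xform() works: each row is kept independently with probability 0.1, so the sample size is itself random. A minimal base R sketch of that effect, assuming only the 1e5 row count reported by rxGetInfo; the names n_rows, p_keep, and sample_sizes are illustrative and not part of RevoScaleR:

```r
# Illustrative only: mimic the Bernoulli row selection used in xform()
# on a plain vector of flags instead of the xdf file.
n_rows <- 1e5   # number of observations reported by rxGetInfo
p_keep <- 0.1   # selection probability used in xform()

# Draw the selection flags five times and count the kept rows each time.
sample_sizes <- replicate(5, sum(as.logical(rbinom(n_rows, 1, p_keep))))
print(sample_sizes)

# Expected sample size and its standard deviation under this scheme.
expected_size <- n_rows * p_keep
sd_size <- sqrt(n_rows * p_keep * (1 - p_keep))
print(c(expected = expected_size, sd = sd_size))
```

Under this scheme the sample size varies from run to run with a standard deviation of roughly 95 rows around 10000, which is why re-running without set.seed() changes the number of observations as well as the rows selected.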
Examine a summary and histogram of ccDebt on several different random samples by re-running the code without set.seed().
sample1 = rxDataStep(default,transformFunc=xform, overwrite=TRUE)
hist(sample1$ccDebt)
summary(sample1$ccDebt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0    3635    4992    4971    6315   12862
Do the summary and histogram on the entire xdf file using RevoScaleR functions.
rxSummary(~ccDebt,data=default)
## Call:
## rxSummary(formula = ~ccDebt, data = default)
##
## Summary Statistics Results for: ~ccDebt
## Data: default (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
## Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Number of valid observations: 1e+05
##
##  Name   Mean     StdDev  Min Max   ValidObs MissingObs
##  ccDebt 5004.399 1988.02 0   14094 1e+05    0
rxHistogram(~ccDebt,data=default)
How much more do you know as a result of using the entire xdf file?
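One way to put a number on this question is the standard error of the mean, StdDev / sqrt(n). The sketch below is a rough calculation using the StdDev printed by rxSummary and the row counts shown earlier; using the full-file StdDev as a stand-in for the sample's own StdDev is an assumption, and the variable names are illustrative:

```r
# Values copied from the output above, not recomputed from the data.
sd_ccDebt <- 1988.02   # StdDev of ccDebt from the rxSummary output
n_sample  <- 10211     # rows in sample1 from the str() output
n_full    <- 1e5       # rows in the full xdf file

# Standard error of the mean for each data set.
se_sample <- sd_ccDebt / sqrt(n_sample)
se_full   <- sd_ccDebt / sqrt(n_full)

print(c(sample = se_sample, full = se_full, ratio = se_sample / se_full))
```

Because the full file has about ten times as many rows, its standard error is smaller by a factor of about sqrt(10), roughly three. The sample mean of ccDebt is therefore already a fairly tight estimate; the full file mainly adds precision and the true extremes, such as the maximum.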
sample1 %>%
  ggplot(aes(ccDebt, creditScore, color = as.factor(default))) +
  geom_point(alpha = 0.2)
First, run a linear model on sample1.
lm1 = lm(default~ccDebt + creditScore,data=sample1)
summary(lm1)
##
## Call:
## lm(formula = default ~ ccDebt + creditScore, data = sample1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.03014 -0.00807 -0.00353  0.00109  0.99148
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.686e-03  8.437e-03  -0.318     0.75
## ccDebt       3.436e-06  3.001e-07  11.448   <2e-16 ***
## creditScore -1.549e-05  1.182e-05  -1.310     0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05904 on 9874 degrees of freedom
## Multiple R-squared: 0.01327, Adjusted R-squared: 0.01307
## F-statistic: 66.41 on 2 and 9874 DF, p-value: < 2.2e-16
Now run a linear model on the whole xdf file.
lm2 = rxLinMod(default~ccDebt + creditScore,data=default)
summary(lm2)
## Call:
## rxLinMod(formula = default ~ ccDebt + creditScore, data = default)
##
## Linear Regression Results for: default ~ ccDebt + creditScore
## Data: default (RxXdfData Data Source)
## File name: C:/Program Files/Microsoft/ML
## Server/R_SERVER/library/RevoScaleR/SampleData/mortDefaultSmall.xdf
## Dependent variable(s): default
## Total independent variables: 3
## Number of valid observations: 1e+05
## Number of missing observations: 0
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.228e-04  3.051e-03    0.27    0.787
## ccDebt       4.652e-06  1.079e-07   43.12 2.22e-16 ***
## creditScore -2.771e-05  4.276e-06   -6.48 9.20e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06783 on 99997 degrees of freedom
## Multiple R-squared: 0.01868
## Adjusted R-squared: 0.01866
## F-statistic: 951.8 on 2 and 99997 DF, p-value: < 2.2e-16
## Condition number: 1.0079
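To compare the two fits side by side, the printed coefficients can be placed in one table. The sketch below is illustrative: the numbers are copied by hand from the two summaries above rather than extracted from lm1 and lm2, the data frame name fits is hypothetical, and the row counts are inferred from the residual degrees of freedom plus the three estimated coefficients.

```r
# Estimates and standard errors copied from the lm and rxLinMod summaries.
fits <- data.frame(
  term       = c("ccDebt", "creditScore"),
  est_sample = c(3.436e-06, -1.549e-05),
  se_sample  = c(3.001e-07, 1.182e-05),
  est_full   = c(4.652e-06, -2.771e-05),
  se_full    = c(1.079e-07, 4.276e-06)
)

# How much larger each standard error is in the sample fit.
fits$se_ratio <- fits$se_sample / fits$se_full

# Row counts inferred from residual degrees of freedom + 3 coefficients.
n_sample <- 9874 + 3
n_full   <- 99997 + 3

# Ratio expected from sample size alone, if everything else were equal.
expected_ratio <- sqrt(n_full / n_sample)

print(fits)
print(expected_ratio)
```

The standard errors shrink by a factor in the neighborhood of sqrt(n_full / n_sample), though not exactly, since the residual standard error also differs between the two fits. That shrinkage is what moves creditScore from p = 0.19 in the sample to a clearly significant coefficient on the full file.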