What is this document?

This document explains, with graphs and figures, what the data tell us.

First, let's load the data:

setwd("C:/Users/Avner/MarketMaker") # Set the working directory
train <- read.csv("cs-training.csv") # Import the training data
test <- read.csv("cs-test.csv") # Import the test data
sample_submission <- read.csv("sampleEntry.csv") # Import the sample submission file
train$X <- NULL # Drop the row-index column X
names(train) # List the remaining column names

All variables:

There are 150,000 customers, each described by the following variables:

| Variable Name | Description | Type |
|---|---|---|
| SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (except real estate and no installment debt like car loans) divided by the sum of credit limits | percentage |
| age | Age of borrower in years | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer |
| DebtRatio | Monthly debt payments, alimony, and living costs divided by monthly gross income | percentage |
| MonthlyIncome | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards) | integer |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer |
| NumberOfDependents | Number of dependents in family excluding themselves (spouse, children, etc.) | integer |

Every variable is quantitative except ‘SeriousDlqin2yrs’, the binary (Y/N) outcome that the algorithm has to predict.

Let’s look at the summary statistics for each variable:

| Variable Name | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SeriousDlqin2yrs | 1 | 150000 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 1 | 1 | 3.47 | 10.03 | 0.00 |
| RevolvingUtilizationOfUnsecuredLines | 2 | 150000 | 6.05 | 249.76 | 0.15 | 0.27 | 0.22 | 0 | 50708 | 50708 | 97.63 | 14544.03 | 0.64 |
| Age | 3 | 150000 | 52.30 | 14.77 | 52.00 | 51.97 | 16.31 | 0 | 109 | 109 | 0.19 | -0.49 | 0.04 |
| NumberOfTime30.59DaysPastDueNotWorse | 4 | 150000 | 0.42 | 4.19 | 0.00 | 0.07 | 0.00 | 0 | 98 | 98 | 22.60 | 522.35 | 0.01 |
| DebtRatio | 5 | 150000 | 353.01 | 2037.82 | 0.37 | 51.65 | 0.36 | 0 | 329664 | 329664 | 95.16 | 13733.65 | 5.26 |
| MonthlyIncome | 6 | 120269 | 6670.22 | 14384.67 | 5400.00 | 5787.56 | 3435.18 | 0 | 3008750 | 3008750 | 114.04 | 19503.57 | 41.48 |
| NumberOfOpenCreditLinesAndLoans | 7 | 150000 | 8.45 | 5.15 | 8.00 | 7.96 | 4.45 | 0 | 58 | 58 | 1.22 | 3.09 | 0.01 |
| NumberOfTimes90DaysLate | 8 | 150000 | 0.27 | 4.17 | 0.00 | 0.00 | 0.00 | 0 | 98 | 98 | 23.09 | 537.71 | 0.01 |
| NumberRealEstateLoansOrLines | 9 | 150000 | 1.02 | 1.13 | 1.00 | 0.88 | 1.48 | 0 | 54 | 54 | 3.48 | 60.47 | 0.00 |
| NumberOfTime60.89DaysPastDueNotWorse | 10 | 150000 | 0.24 | 4.16 | 0.00 | 0.00 | 0.00 | 0 | 98 | 98 | 23.33 | 545.66 | 0.01 |
| NumberOfDependents | 11 | 146076 | 0.76 | 1.12 | 0.00 | 0.54 | 0.00 | 0 | 20 | 20 | 1.59 | 3.00 | 0.00 |
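
The column layout of this table resembles the output of `describe()` from the psych package, although the document does not say which function produced it. As a minimal base R sketch of how a few of these statistics can be computed, the example below uses a small made-up data frame (`toy` and `summarise_column` are hypothetical names, and the values are invented, not taken from `train`). It also shows why `n` is below 150000 for MonthlyIncome and NumberOfDependents: missing values are excluded from the count.

```r
# Hypothetical toy data standing in for `train` (values are made up)
toy <- data.frame(
  age = c(45, 40, 38, 30, 49),
  MonthlyIncome = c(9120, 2600, 3042, NA, 63588)
)

# Per-column summary that ignores missing values, like the table above
summarise_column <- function(x) {
  ok <- x[!is.na(x)]
  c(n = length(ok), mean = mean(ok), sd = sd(ok),
    median = median(ok), min = min(ok), max = max(ok))
}

# One row per variable, one column per statistic
stats <- t(sapply(toy, summarise_column))
print(round(stats, 2))
```

Applied to the real data, the same function could be called as `t(sapply(train, summarise_column))`.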

Now let’s look at the correlations between the variables.

Using ‘corrplot()’:
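
The plot itself is not reproduced here. As a hedged sketch of how such a figure could be produced, the example below builds a correlation matrix with base R's `cor()` and passes it to `corrplot()` from the corrplot package when that package is installed. The data frame `toy` is a hypothetical stand-in with invented values. `use = "pairwise.complete.obs"` is an assumption made here so that the missing values in MonthlyIncome do not turn whole rows of the matrix into `NA`.

```r
# Hypothetical toy data standing in for `train` (values are made up)
toy <- data.frame(
  age = c(45, 40, 38, 30, 49, 74),
  DebtRatio = c(0.80, 0.12, 0.09, 0.04, 0.02, 0.38),
  MonthlyIncome = c(9120, 2600, 3042, 3300, 63588, NA)
)

# Pearson correlations, using every pair of rows where both values are present
corr_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(corr_matrix, 2))

# Draw the correlation plot only if the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_matrix, method = "circle")
}
```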

Now let’s look at a plot of each variable.

For each variable: mean, standard deviation, plots (scatter plot, etc.)
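
One possible way to carry out this per-variable step is sketched below in base R: it computes the mean and standard deviation of every column and draws one histogram per variable. The data frame `toy` is hypothetical and its values are invented. Histograms are only one choice of plot here, not necessarily the one the author intended.

```r
# Hypothetical toy data standing in for `train` (values are made up)
toy <- data.frame(
  age = c(45, 40, 38, 30, 49, 74),
  DebtRatio = c(0.80, 0.12, 0.09, 0.04, 0.02, 0.38)
)

# Mean and standard deviation of each variable, ignoring missing values
col_means <- sapply(toy, mean, na.rm = TRUE)
col_sds <- sapply(toy, sd, na.rm = TRUE)
print(rbind(mean = col_means, sd = col_sds))

# One histogram per variable, side by side
old_par <- par(mfrow = c(1, ncol(toy)))
for (name in names(toy)) {
  hist(toy[[name]], main = name, xlab = name)
}
par(old_par)  # Restore the previous plotting layout
```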

We tried to draw a 3D plot, but the data are not displayed!

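library(rgl) # plot3d() comes from the rgl package
train_2 <- train[-(1:70000), ] # Drop the first 70,000 rows
plot3d(train_2$NumberOfTime30.59DaysPastDueNotWorse, train_2$NumberOfTime60.89DaysPastDueNotWorse, train_2$NumberOfTimes90DaysLate, type = "p", col = rainbow(1000), xlab = "NumberOfTime30-59DaysPastDueNotWorse", ylab = "NumberOfTime60-89DaysPastDueNotWorse", zlab = "NumberOfTimes90DaysLate") # Column names as produced by read.csv()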
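
One possible cause of an empty plot, offered here as an assumption rather than a confirmed diagnosis, is a column name that does not exist in the data frame: `$` returns `NULL` for an unknown column instead of raising an error. The sketch below checks the requested names against a hypothetical toy frame (`toy` and `wanted` are illustrative names) that uses the column names listed in the summary table.

```r
# Hypothetical toy frame using the column names that read.csv() produces
toy <- data.frame(
  NumberOfTime30.59DaysPastDueNotWorse = c(2, 0, 1),
  NumberOfTime60.89DaysPastDueNotWorse = c(0, 0, 1),
  NumberOfTimes90DaysLate = c(0, 1, 0)
)

# Names we intend to plot; the first one is not a column of `toy`
wanted <- c("NbPastDueNotWorse3059", "NumberOfTimes90DaysLate")

# Report any requested column that is absent before plotting
missing_cols <- setdiff(wanted, names(toy))
print(missing_cols)

# An unknown column yields NULL rather than an error
print(is.null(toy$NbPastDueNotWorse3059))
```

Running the same `setdiff()` check against `names(train)` before calling `plot3d()` would show whether the plotted columns are actually present.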
