This document explains, with graphs and plots, what the data tell us.
First, let's load the data:
setwd("C:/Users/Avner/MarketMaker")  # Set the working directory
train <- read.csv("cs-training.csv")  # Import the training data
test <- read.csv("cs-test.csv")  # Import the test data
sample_submission <- read.csv("sampleEntry.csv")  # Import the sample submission file
train$X <- NULL  # Drop the row-index column
names(train)  # List the remaining variable names
There are 150,000 customers, each described by the following variables:
| Variable Name | Description | Type |
|---|---|---|
| SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits | percentage |
| age | Age of borrower in years | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
| DebtRatio | Monthly debt payments, alimony and living costs divided by monthly gross income | percentage |
| MonthlyIncome | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards) | integer |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
| NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer |
Every variable is quantitative except the target, ‘SeriousDlqin2yrs’, which is binary (Y/N). Summary statistics for each variable:
| Variable Name | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SeriousDlqin2yrs | 1 | 150000 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 1 | 1 | 3.47 | 10.03 | 0.00 |
| RevolvingUtilizationOfUnsecuredLines | 2 | 150000 | 6.05 | 249.76 | 0.15 | 0.27 | 0.22 | 0 | 50708 | 50708 | 97.63 | 14544.03 | 0.64 |
| age | 3 | 150000 | 52.30 | 14.77 | 52.00 | 51.97 | 16.31 | 0 | 109 | 109 | 0.19 | -0.49 | 0.04 |
| NumberOfTime30.59DaysPastDueNotWorse | 4 | 150000 | 0.42 | 4.19 | 0.00 | 0.07 | 0.00 | 0 | 98 | 98 | 22.60 | 522.35 | 0.01 |
| DebtRatio | 5 | 150000 | 353.01 | 2037.82 | 0.37 | 51.65 | 0.36 | 0 | 329664 | 329664 | 95.16 | 13733.65 | 5.26 |
| MonthlyIncome | 6 | 120269 | 6670.22 | 14384.67 | 5400.00 | 5787.56 | 3435.18 | 0 | 3008750 | 3008750 | 114.04 | 19503.57 | 41.48 |
| NumberOfOpenCreditLinesAndLoans | 7 | 150000 | 8.45 | 5.15 | 8.00 | 7.96 | 4.45 | 0 | 58 | 58 | 1.22 | 3.09 | 0.01 |
| NumberOfTimes90DaysLate | 8 | 150000 | 0.27 | 4.17 | 0.00 | 0.00 | 0.00 | 0 | 98 | 98 | 23.09 | 537.71 | 0.01 |
| NumberRealEstateLoansOrLines | 9 | 150000 | 1.02 | 1.13 | 1.00 | 0.88 | 1.48 | 0 | 54 | 54 | 3.48 | 60.47 | 0.00 |
| NumberOfTime60.89DaysPastDueNotWorse | 10 | 150000 | 0.24 | 4.16 | 0.00 | 0.00 | 0.00 | 0 | 98 | 98 | 23.33 | 545.66 | 0.01 |
| NumberOfDependents | 11 | 146076 | 0.76 | 1.12 | 0.00 | 0.54 | 0.00 | 0 | 20 | 20 | 1.59 | 3.00 | 0.00 |
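In the table above, `n` is the number of non-missing values, so MonthlyIncome (120,269 of 150,000) and NumberOfDependents (146,076 of 150,000) contain missing entries. The column layout resembles the output of `psych::describe()`, although the source does not say which function produced it. The following is a minimal base-R sketch of the first few columns, run on a made-up data frame `toy` (an assumption standing in for `train`); `describe_basic` is a hypothetical helper name.

```r
# Toy stand-in for `train`; the values are invented for illustration only.
toy <- data.frame(
  SeriousDlqin2yrs   = c(0, 0, 1, 0, 0, 1),
  age                = c(45, 52, 38, 61, 29, 70),
  MonthlyIncome      = c(5400, NA, 3200, 8100, NA, 2500),
  NumberOfDependents = c(0, 2, NA, 1, 0, 0)
)

# One row per variable: non-missing count, then statistics computed without NAs.
describe_basic <- function(df) {
  stats <- sapply(df, function(x) {
    c(n      = sum(!is.na(x)),
      mean   = mean(x, na.rm = TRUE),
      sd     = sd(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      min    = min(x, na.rm = TRUE),
      max    = max(x, na.rm = TRUE))
  })
  as.data.frame(t(stats))
}

summary_table <- describe_basic(toy)
print(round(summary_table, 2))

# Missing values per variable: the gap between the row count and n.
missing_counts <- nrow(toy) - summary_table$n
names(missing_counts) <- rownames(summary_table)
print(missing_counts)
```

Applied to the real data, `describe_basic(train)` should reproduce the `n`, `mean`, `sd`, `median`, `min` and `max` columns, assuming the file was read as shown earlier.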
Correlations between the variables, drawn with ‘corrplot()’:
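The call that produced the correlation plot is not shown, so the following is only a sketch of one way to do it. It builds the correlation matrix with base R's `cor()`, using `use = "pairwise.complete.obs"` because some columns have missing values, and passes it to `corrplot::corrplot()` if that package is installed. The data frame `toy` is simulated and stands in for `train`.

```r
# Simulated stand-in for `train`; the values are random, not real data.
set.seed(1)
n <- 200
late_30_59 <- rpois(n, 0.4)
toy <- data.frame(
  NumberOfTime30.59DaysPastDueNotWorse = late_30_59,
  NumberOfTimes90DaysLate              = late_30_59 + rpois(n, 0.1),
  age                                  = round(runif(n, 21, 90)),
  MonthlyIncome                        = c(rep(NA, 20), round(rlnorm(n - 20, 8.5, 0.5)))
)

# Pairwise-complete correlations, so rows with a missing income are still
# used for the pairs of variables that do not involve income.
corr_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(corr_matrix, 2))

# Draw the matrix only when the corrplot package is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_matrix, method = "circle")
}
```

With the real data, `cor(train, use = "pairwise.complete.obs")` would take the place of the simulated frame.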
For each variable: mean, standard deviation, and plots (scatter plot, etc.)
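As a rough sketch of that per-variable pass, the loop below draws a histogram and a scatter plot against the target for each predictor, with the mean and standard deviation in the title. The data frame `toy` and the output file name are assumptions made for illustration; the plots are written to a temporary PDF so the code also runs without a display.

```r
# Toy stand-in for `train`; the values are invented for illustration only.
toy <- data.frame(
  age              = c(45, 52, 38, 61, 29, 70, 33, 57),
  DebtRatio        = c(0.37, 0.12, 0.80, 0.05, 1.50, 0.22, 0.45, 0.30),
  SeriousDlqin2yrs = c(0, 0, 1, 0, 1, 0, 0, 0)
)
predictors <- setdiff(names(toy), "SeriousDlqin2yrs")

# Write every plot to one PDF file, two pages per variable.
out_file <- file.path(tempdir(), "per_variable_plots.pdf")
pdf(out_file)
for (v in predictors) {
  x <- toy[[v]]
  label <- sprintf("%s (mean = %.2f, sd = %.2f)",
                   v, mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
  hist(x, main = label, xlab = v)                  # distribution of the variable
  plot(x, toy$SeriousDlqin2yrs, main = label,      # variable against the target
       xlab = v, ylab = "SeriousDlqin2yrs")
}
dev.off()
```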
We tried to visualize the three delinquency counts in a 3D plot, but the points were not displayed:
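library(rgl)  # Provides plot3d()
train_2 <- train[-(1:70000), ]  # Drop the first 70,000 rows
# Columns are referenced by the names read.csv() assigns ("-" becomes ".")
plot3d(train_2$NumberOfTime30.59DaysPastDueNotWorse, train_2$NumberOfTime60.89DaysPastDueNotWorse, train_2$NumberOfTimes90DaysLate, type = "p", col = rainbow(1000), xlab = "NumberOfTime30-59DaysPastDueNotWorse", ylab = "NumberOfTime60-89DaysPastDueNotWorse", zlab = "NumberOfTimes90DaysLate")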