The objective of this project is to classify people, described by a set of attributes, as good or bad credit risks: build a model that predicts the credit risk associated with a customer from the customer's profile attributes.
The German Credit Data records 20 profile attributes plus the classification of each applicant as a Good or a Bad credit risk (21 columns in all) for 1000 loan applicants.
When a bank receives a loan application, it has to decide, based on the applicant’s profile, whether to go ahead with the loan approval or not. Two types of risk are associated with the bank’s decision:
If the applicant is a good credit risk, i.e. is likely to repay the loan, then not approving the loan to the person results in a loss of business to the bank.
If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the loan to the person results in a financial loss to the bank.
Objective of Analysis:
Minimization of risk and maximization of profit on behalf of the bank.
To minimize loss from the bank’s perspective, the bank needs a decision rule for whom to approve and whom to refuse. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on a loan application. A predictive model developed on this data is expected to give a bank manager guidance on whether to approve a loan to a prospective applicant based on the applicant’s profile.
dim is used to check whether the data was imported correctly (number of observations and number of columns).
german_credit<-read.csv("F://desktop 18th jan//DATA ANALYTICS//1.R-prog//projects//00000 project3//german_credit.csv")
dim(german_credit)
## [1] 1000 21
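The dim check above can also be made to fail loudly when the shape is wrong. The following is a minimal sketch, not the code used for this report: the two-row toy data frame and the temporary file are assumptions that stand in for german_credit.csv, so that the example runs on any machine.

```r
# Toy stand-in for german_credit.csv (hypothetical: 2 rows, 3 columns)
toy <- data.frame(Creditability = c(1L, 0L),
                  Account.Balance = c(1L, 4L),
                  Credit.Amount = c(1049L, 6350L))

# Write to a temporary file and read it back, as read.csv() does for the real file
csv_path <- file.path(tempdir(), "toy_credit.csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_in <- read.csv(csv_path)

# Stop with an error if the imported shape is not the expected one
expected_dim <- c(2L, 3L)
stopifnot(identical(dim(toy_in), expected_dim))
dim(toy_in)
```

For the real file the expected shape would be c(1000L, 21L).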
names prints the names of all variables (columns) in the imported dataset.
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
head prints the first 6 observations.
head(german_credit)
## Creditability Account.Balance Duration.of.Credit..month.
## 1 1 1 18
## 2 1 1 9
## 3 1 2 12
## 4 1 1 12
## 5 1 1 12
## 6 1 1 10
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## 1 4 2 1049
## 2 4 0 2799
## 3 2 9 841
## 4 4 0 2122
## 5 4 0 2171
## 6 4 0 2241
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## 1 1 2 4
## 2 1 3 2
## 3 2 4 2
## 4 1 3 3
## 5 1 3 4
## 6 1 2 1
## Sex...Marital.Status Guarantors Duration.in.Current.address
## 1 2 1 4
## 2 3 1 2
## 3 2 1 4
## 4 3 1 2
## 5 3 1 4
## 6 3 1 3
## Most.valuable.available.asset Age..years. Concurrent.Credits
## 1 2 21 3
## 2 1 36 3
## 3 1 23 3
## 4 1 39 3
## 5 2 38 1
## 6 1 48 3
## Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## 1 1 1 3 1
## 2 1 2 3 2
## 3 1 1 2 1
## 4 1 2 2 2
## 5 2 2 2 1
## 6 1 2 2 2
## Telephone Foreign.Worker
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 2
## 5 1 2
## 6 1 2
tail prints the last 6 observations.
tail(german_credit)
## Creditability Account.Balance Duration.of.Credit..month.
## 995 0 1 12
## 996 0 1 24
## 997 0 1 24
## 998 0 4 21
## 999 0 2 12
## 1000 0 1 30
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## 995 0 3 6199
## 996 2 3 1987
## 997 2 0 2303
## 998 4 0 12680
## 999 2 3 6468
## 1000 2 2 6350
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## 995 1 3 4
## 996 1 3 2
## 997 1 5 4
## 998 5 5 4
## 999 5 1 2
## 1000 5 5 4
## Sex...Marital.Status Guarantors Duration.in.Current.address
## 995 3 1 2
## 996 3 1 4
## 997 3 2 1
## 998 3 1 4
## 999 3 1 1
## 1000 3 1 4
## Most.valuable.available.asset Age..years. Concurrent.Credits
## 995 2 28 3
## 996 1 21 3
## 997 1 45 3
## 998 4 30 3
## 999 4 52 3
## 1000 2 31 3
## Type.of.apartment No.of.Credits.at.this.Bank Occupation
## 995 1 2 3
## 996 1 1 2
## 997 2 1 3
## 998 3 1 4
## 999 2 1 4
## 1000 2 1 3
## No.of.dependents Telephone Foreign.Worker
## 995 1 2 1
## 996 2 1 1
## 997 1 1 1
## 998 1 2 1
## 999 1 2 1
## 1000 1 1 1
str shows the structure of the data: the type of each column and its first few values.
str(german_credit)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
View opens the dataset in a spreadsheet-style viewer to check whether it was imported correctly.
View(german_credit)
summary gives an overall numerical summary of every column of the dataset.
summary(german_credit)
## Creditability Account.Balance Duration.of.Credit..month.
## Min. :0.0 Min. :1.000 Min. : 4.0
## 1st Qu.:0.0 1st Qu.:1.000 1st Qu.:12.0
## Median :1.0 Median :2.000 Median :18.0
## Mean :0.7 Mean :2.577 Mean :20.9
## 3rd Qu.:1.0 3rd Qu.:4.000 3rd Qu.:24.0
## Max. :1.0 Max. :4.000 Max. :72.0
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## Min. :0.000 Min. : 0.000 Min. : 250
## 1st Qu.:2.000 1st Qu.: 1.000 1st Qu.: 1366
## Median :2.000 Median : 2.000 Median : 2320
## Mean :2.545 Mean : 2.828 Mean : 3271
## 3rd Qu.:4.000 3rd Qu.: 3.000 3rd Qu.: 3972
## Max. :4.000 Max. :10.000 Max. :18424
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median :1.000 Median :3.000 Median :3.000
## Mean :2.105 Mean :3.384 Mean :2.973
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :4.000
## Sex...Marital.Status Guarantors Duration.in.Current.address
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :3.000
## Mean :2.682 Mean :1.145 Mean :2.845
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :4.000 Max. :3.000 Max. :4.000
## Most.valuable.available.asset Age..years. Concurrent.Credits
## Min. :1.000 Min. :19.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:27.00 1st Qu.:3.000
## Median :2.000 Median :33.00 Median :3.000
## Mean :2.358 Mean :35.54 Mean :2.675
## 3rd Qu.:3.000 3rd Qu.:42.00 3rd Qu.:3.000
## Max. :4.000 Max. :75.00 Max. :3.000
## Type.of.apartment No.of.Credits.at.this.Bank Occupation
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000
## Median :2.000 Median :1.000 Median :3.000
## Mean :1.928 Mean :1.407 Mean :2.904
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.000 Max. :4.000
## No.of.dependents Telephone Foreign.Worker
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000
## Mean :1.155 Mean :1.404 Mean :1.037
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000
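summary treats the integer-coded categorical columns as numbers, so a value such as the mean of Account.Balance has no real interpretation. As a sketch under stated assumptions, such a column can be converted to a labelled factor first. The ten values are the first ten Account.Balance codes from the str output, and the labels paraphrase the Account.Balance category list given in this report; the variable names are illustrative only.

```r
# First ten Account.Balance codes, as shown in the str() output
account_balance <- c(1, 1, 2, 1, 1, 1, 1, 1, 4, 2)

# Labels paraphrase the Account.Balance category list in this report
balance_labels <- c("< 0 DM", "0 to < 200 DM", ">= 200 DM", "no checking account")
account_balance_f <- factor(account_balance, levels = 1:4, labels = balance_labels)

# summary() now reports a count per category instead of quartiles and a mean
summary(account_balance_f)
```

Categories that do not occur in the sample still appear, with a count of 0, because the levels are fixed in advance.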
number of missing values in the whole dataset
x<-sum(is.na(german_credit))
x
## [1] 0
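A single total says whether anything is missing but not where. The sketch below, on a small made-up data frame (the data and names are hypothetical), shows how the same is.na result can be broken down by column.

```r
# Toy data frame with deliberately missing values (hypothetical)
toy <- data.frame(a = c(1, NA, 3),
                  b = c("x", "y", NA),
                  c = 1:3)

# Total number of missing cells, as computed for german_credit above
total_missing <- sum(is.na(toy))

# Missing cells per column, which shows where the gaps are
missing_by_col <- colSums(is.na(toy))
missing_by_col
```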
Creditability is binary data: it can take only two possible values, coded numerically as 0 and 1.
0: No (bad credit risk), 1: Yes (good credit risk)
Binary data is a type of categorical data, which more generally represents experiments with a fixed number of possible outcomes.
This is the target/response variable.
Account.Balance contains qualitative data in four categories.
1 : … < 0 DM
2 : 0 <= … < 200 DM
3 : … >= 200 DM / salary assignments for at least 1 year
4 : no checking account
DM: Deutsche Mark, the former basic unit of money in Germany.
Because Account.Balance is qualitative, measures of central tendency and dispersion such as the mean and the variance are not meaningful. A frequency table, the mode and a bar plot are used instead. The mode is the most frequent category, not the largest value.
frequency table of Account.Balance
tab<-table(german_credit$Account.Balance)
tab
##
## 1 2 3 4
## 274 269 63 394
names(tab)
## [1] "1" "2" "3" "4"
x<-sum(is.na(german_credit$Account.Balance))
x
## [1] 0
1 stands for a balance below 0 DM, 2 for a balance from 0 to below 200 DM, 3 for a balance of 200 DM or more (or salary assignments for at least 1 year), 4 for no checking account.
mode of Account.Balance: the most frequent category.
temp <- table(as.vector(german_credit$Account.Balance))
names(temp)[temp == max(temp)]
## [1] "4"
The mode of Account.Balance is 4 (no checking account).
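The same mode calculation can be wrapped in a small helper and reused for the other categorical variables. This is only a sketch: cat_mode is a hypothetical name, and the counts are copied from the Account.Balance frequency table above.

```r
# Counts per Account.Balance category, copied from the frequency table above
counts <- c("1" = 274, "2" = 269, "3" = 63, "4" = 394)

# Hypothetical helper: name(s) of the most frequent category.
# If two categories tie, both names are returned.
cat_mode <- function(freq) names(freq)[freq == max(freq)]

cat_mode(counts)

# Share of applicants in each category
round(counts / sum(counts), 3)
```

The shares make it easier to see that category 4 covers about 39% of the applicants while category 3 covers only about 6%.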
ggplot of Account.Balance
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.3.2
qplot(factor(german_credit$Account.Balance), geom="bar", main="Account.Balance", xlab="Account.Balance category", ylab="Number of applicants", fill=I("purple"))
output description:
1 stands for 274 people with a balance below 0 DM.
2 stands for 269 people with a balance from 0 to below 200 DM.
3 stands for 63 people with a balance of 200 DM or more (or salary assignments for at least 1 year).
4 stands for 394 people with no checking account.
Duration.of.Credit..month. is numerical data.
quantiles and boxplot of Duration.of.Credit..month.
quantile(german_credit$Duration.of.Credit..month.)
## 0% 25% 50% 75% 100%
## 4 12 18 24 72
quantile(german_credit$Duration.of.Credit..month.,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 24 30 36 72
boxplot(german_credit$Duration.of.Credit..month.)
output description:
In this boxplot the minimum is 4, the maximum is 72 and the median is 18; the first quartile is 12 and the third quartile is 24.
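With its default settings, boxplot draws points beyond 1.5 times the interquartile range from the box as separate outliers. The sketch below works out those limits from the quartiles reported above. It assumes the box edges coincide with the quantile values 12 and 24; boxplot computes hinges, which can differ slightly from quantile in general.

```r
# Quartiles of Duration.of.Credit..month. from the quantile() output above
q1 <- 12
q3 <- 24

# Interquartile range and the 1.5 * IQR limits used by boxplot() by default
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```

Under these assumptions any duration above 42 months, including the maximum of 72, is drawn as an individual point rather than as part of the upper whisker.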
histogram of Duration.of.Credit..month.
hist(german_credit$Duration.of.Credit..month.)
correlation between Duration.of.Credit..month. and response
library("ltm")
## Warning: package 'ltm' was built under R version 3.3.2
## Loading required package: MASS
## Loading required package: msm
## Warning: package 'msm' was built under R version 3.3.2
## Loading required package: polycor
## Warning: package 'polycor' was built under R version 3.3.2
biserial.cor(german_credit$Duration.of.Credit..month.,german_credit$Creditability)
## [1] 0.2148192
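The sign of a point-biserial coefficient depends on which of the two groups is taken as the reference. As far as the ltm documentation describes it, biserial.cor uses the first level of the binary variable by default (level = 1), which here is Creditability = 0. The sketch below uses made-up data and a hand-written function (point_biserial is a hypothetical name, not part of ltm) to show that switching the reference group flips the sign, and that with reference group 1 the value equals the ordinary Pearson correlation.

```r
# Toy data (hypothetical): x tends to be larger when y == 0
x <- c(30, 36, 48, 24, 12, 10, 18, 9)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Point-biserial by hand: group mean difference scaled by sqrt(p * (1 - p)) / sd,
# using the population standard deviation (divisor n)
point_biserial <- function(x, y, ref) {
  in_ref <- y == ref
  p <- mean(in_ref)
  sd_pop <- sqrt(mean((x - mean(x))^2))
  (mean(x[in_ref]) - mean(x[!in_ref])) * sqrt(p * (1 - p)) / sd_pop
}

r_ref0 <- point_biserial(x, y, ref = 0)  # reference group: y == 0
r_ref1 <- point_biserial(x, y, ref = 1)  # reference group: y == 1

c(ref0 = r_ref0, ref1 = r_ref1, pearson = cor(x, y))
```

A positive coefficient with reference group 0 therefore means that larger x values go with y = 0, not with y = 1.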
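The coefficient is 0.21. By default biserial.cor takes the first level of Creditability (0) as the reference group, so the positive sign means that longer credit durations are associated with Creditability = 0 (bad credit risk).
t-test of Duration.of.Credit..month. (one-sample, null hypothesis: the true mean is 0)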
t.test(german_credit$Duration.of.Credit..month.)
##
## One Sample t-test
##
## data: german_credit$Duration.of.Credit..month.
## t = 54.816, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 20.15469 21.65131
## sample estimates:
## mean of x
## 20.903
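Testing whether the mean duration is 0 is not informative in itself; the useful part of this output is the 95 percent confidence interval for the mean. As a check, the interval can be rebuilt from the reported mean and t statistic. This is a sketch that uses only the rounded values printed above, so the result agrees with the output only up to rounding.

```r
# Values copied from the t.test() output above
mean_x <- 20.903
t_stat <- 54.816
df <- 999

# With a null value of 0, t = mean / standard error, so the standard error is
se <- mean_x / t_stat

# 95% confidence interval: mean +/- t quantile * standard error
half_width <- qt(0.975, df) * se
ci <- c(lower = mean_x - half_width, upper = mean_x + half_width)
round(ci, 2)
```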
Payment.Status.of.Previous.Credit is categorical data with 5 categories.
0: no credits taken
1: all credits at this bank paid back duly
2: existing credits paid back duly till now
3: delay in paying off in the past
4: critical account
Because Payment.Status.of.Previous.Credit is qualitative, measures of central tendency and dispersion are not meaningful. A frequency table, the mode and a bar plot are used instead. The mode is the most frequent payment status.
frequency table of Payment.Status.of.Previous.Credit
tab<-table(german_credit$Payment.Status.of.Previous.Credit)
tab
##
## 0 1 2 3 4
## 40 49 530 88 293
names(tab)
## [1] "0" "1" "2" "3" "4"
0: no credits taken
1: all credits at this bank paid back duly
2: existing credits paid back duly till now
3: delay in paying off in the past
4: critical account
mode of Payment.Status.of.Previous.Credit: the most frequent category.
temp <- table(as.vector(german_credit$Payment.Status.of.Previous.Credit))
names(temp)[temp == max(temp)]
## [1] "2"
The mode of Payment.Status.of.Previous.Credit is 2 (existing credits paid back duly till now).
ggplot of Payment.Status.of.Previous.Credit
library("ggplot2")
qplot(factor(german_credit$Payment.Status.of.Previous.Credit), geom="bar", main="Payment.Status.of.Previous.Credit", xlab="Payment status category", ylab="Number of customers", fill=I("purple"))
output description:
0 stands for 40 customers with no credits taken or all credits paid back duly.
1 stands for 49 customers who paid back all credits at this bank duly.
2 stands for 530 customers who have paid back existing credits duly till now.
3 stands for 88 customers who delayed paying off in the past.
4 stands for 293 customers with a critical account or other credits existing (not at this bank).
correlation between Payment.Status.of.Previous.Credit and creditability
library("ltm")
biserial.cor(german_credit$Payment.Status.of.Previous.Credit,german_credit$Creditability)
## [1] -0.2286703
library(vcd)
## Warning: package 'vcd' was built under R version 3.3.2
## Loading required package: grid
contin_table<-table(german_credit$Payment.Status.of.Previous.Credit,german_credit$Creditability)
contin_table
##
## 0 1
## 0 25 15
## 1 28 21
## 2 169 361
## 3 28 60
## 4 50 243
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 60.467 4 2.3139e-12
## Pearson 61.691 4 1.2792e-12
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.241
## Cramer's V : 0.248
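The coefficient is -0.23. biserial.cor uses Creditability = 0 as its reference group by default, so the negative sign means that higher status codes go with Creditability = 1, as the contingency table also shows. Because the status codes are categories, the chi-squared test (p = 1.28e-12) and Cramer's V (0.248) are the more appropriate measures: payment status and creditability are clearly associated.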
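Cramer's V can be reproduced from the chi-squared statistic with base R alone. The sketch below rebuilds the contingency table from the counts printed above and applies V = sqrt(chi2 / (n * (k - 1))), where k is the smaller of the number of rows and columns. The variable names are illustrative.

```r
# Contingency table copied from the output above
# rows: Payment.Status.of.Previous.Credit 0..4, columns: Creditability 0 and 1
tab <- matrix(c(25, 15,
                28, 21,
                169, 361,
                28, 60,
                50, 243),
              ncol = 2, byrow = TRUE,
              dimnames = list(status = c("0", "1", "2", "3", "4"),
                              creditability = c("0", "1")))

# Pearson chi-squared statistic
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Cramer's V: chi-squared scaled by sample size and table dimensions
n <- sum(tab)
k <- min(nrow(tab), ncol(tab)) - 1
cramers_v <- sqrt(chi2 / (n * k))

round(c(chi2 = chi2, cramers_v = cramers_v), 3)
```

V ranges from 0 (no association) to 1, so a value near 0.25 indicates a moderate association at most, even though the p-value is very small.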
library("gmodels")
## Warning: package 'gmodels' was built under R version 3.3.2
CrossTable(german_credit$Creditability, german_credit$Payment.Status.of.Previous.Credit, digits=1,prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Payment.Status.of.Previous.Credit
## german_credit$Creditability | 0 | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 25 | 28 | 169 | 28 | 50 | 300 |
## | 0.6 | 0.6 | 0.3 | 0.3 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 15 | 21 | 361 | 60 | 243 | 700 |
## | 0.4 | 0.4 | 0.7 | 0.7 | 0.8 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 40 | 49 | 530 | 88 | 293 | 1000 |
## | 0.0 | 0.0 | 0.5 | 0.1 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 61.6914 d.f. = 4 p = 1.279187e-12
##
##
##
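The "N / Col Total" rows of the CrossTable output are column proportions, which base R can produce with prop.table. The sketch below rebuilds the table from the counts printed above; it is an alternative way to get the same shares, not the method used in this report.

```r
# Counts copied from the CrossTable output above
# rows: Creditability 0 and 1, columns: payment status 0..4
tab <- matrix(c(25, 28, 169, 28, 50,
                15, 21, 361, 60, 243),
              nrow = 2, byrow = TRUE,
              dimnames = list(Creditability = c("0", "1"),
                              Payment.Status = c("0", "1", "2", "3", "4")))

# Share of bad (0) and good (1) credit risks within each payment status
col_share <- prop.table(tab, margin = 2)
round(col_share, 2)
```

Reading along the Creditability = 1 row shows the share of good credit risks rising from status 0 to status 4.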
Purpose is qualitative data. The coding lists 11 categories (0 to 10), but category 7 does not occur in the data.
0 : car (new)
1 : car (used)
2 : furniture/equipment
3 : radio/television
4 : domestic appliances
5 : repairs
6 : education
7 : (vacation - does not exist?)
8 : retraining
9 : business
10 : others
Because Purpose is qualitative, measures of central tendency and dispersion are not meaningful. A frequency table, the mode and a bar plot are used instead. The mode is the most frequent purpose.
frequency table of purpose
tab<-table(german_credit$Purpose)
tab
##
## 0 1 2 3 4 5 6 8 9 10
## 234 103 181 280 12 22 50 9 97 12
names(tab)
## [1] "0" "1" "2" "3" "4" "5" "6" "8" "9" "10"
0 : car (new)
1 : car (used)
2 : furniture/equipment
3 : radio/television
4 : domestic appliances
5 : repairs
6 : education
7 : (vacation - does not exist?)
8 : retraining
9 : business
10 : others
mode of Purpose: the most frequent category.
temp <- table(as.vector(german_credit$Purpose))
names(temp)[temp == max(temp)]
## [1] "3"
The mode of Purpose is 3 (radio/television).
ggplot of purpose
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(factor(german_credit$Purpose), geom="bar", main="Purpose", xlab="Purpose category", ylab="Number of customers", fill=I("purple"))
correlation between purpose and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Purpose,german_credit$Creditability)
## [1] 0.01796988
library(vcd)
contin_table<-table(german_credit$Purpose,german_credit$Creditability)
contin_table
##
## 0 1
## 0 89 145
## 1 17 86
## 2 58 123
## 3 62 218
## 4 4 8
## 5 8 14
## 6 22 28
## 8 1 8
## 9 34 63
## 10 5 7
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 34.510 9 7.2688e-05
## Pearson 33.356 9 1.1575e-04
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.18
## Cramer's V : 0.183
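The coefficient is 0.018, but Purpose is a nominal variable whose codes have no order, so a correlation coefficient is not meaningful here. The chi-squared test (p = 0.00012) and Cramer's V (0.183) indicate a weak but statistically significant association between purpose and creditability.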
crosstable of purpose and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Purpose, digits=1,prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Purpose
## german_credit$Creditability | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 9 | 10 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 89 | 17 | 58 | 62 | 4 | 8 | 22 | 1 | 34 | 5 | 300 |
## | 0.4 | 0.2 | 0.3 | 0.2 | 0.3 | 0.4 | 0.4 | 0.1 | 0.4 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 145 | 86 | 123 | 218 | 8 | 14 | 28 | 8 | 63 | 7 | 700 |
## | 0.6 | 0.8 | 0.7 | 0.8 | 0.7 | 0.6 | 0.6 | 0.9 | 0.6 | 0.6 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 234 | 103 | 181 | 280 | 12 | 22 | 50 | 9 | 97 | 12 | 1000 |
## | 0.2 | 0.1 | 0.2 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 33.35645 d.f. = 9 p = 0.0001157491
##
##
##
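The warning "Chi-squared approximation may be incorrect" usually appears when some expected cell counts are small; a common rule of thumb is to look for expected counts below 5. The sketch below rebuilds the Purpose table from the counts printed above and lists those cells. The threshold of 5 is a convention, not something stated in the output.

```r
# Counts copied from the CrossTable output above
# rows: Creditability 0 and 1, columns: Purpose categories present in the data
tab <- matrix(c(89, 17, 58, 62, 4, 8, 22, 1, 34, 5,
                145, 86, 123, 218, 8, 14, 28, 8, 63, 7),
              nrow = 2, byrow = TRUE,
              dimnames = list(Creditability = c("0", "1"),
                              Purpose = c("0", "1", "2", "3", "4",
                                          "5", "6", "8", "9", "10")))

# Run the test quietly; the warning is what is being investigated here
test <- suppressWarnings(chisq.test(tab, correct = FALSE))

# Expected counts under independence, and the cells that fall below 5
expected <- test$expected
n_small <- sum(expected < 5)
round(expected[, colSums(expected < 5) > 0], 1)
```

The small cells all belong to rarely chosen purposes. One option in such a case is a Monte Carlo p-value via chisq.test(tab, simulate.p.value = TRUE); another is to merge the rare categories before testing.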
Credit.Amount is numerical data. summary gives the minimum, the quartiles, the mean and the maximum of Credit.Amount.
summary(german_credit$Credit.Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18420
hist(german_credit$Credit.Amount)
boxplot of Credit.Amount
quantile(german_credit$Credit.Amount)
## 0% 25% 50% 75% 100%
## 250.00 1365.50 2319.50 3972.25 18424.00
quantile(german_credit$Credit.Amount,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 3972.25 4720.00 7179.40 18424.00
boxplot(german_credit$Credit.Amount)
output description: note that outliers are discussed later.
histogram of Credit.Amount
hist(german_credit$Credit.Amount)
correlation between Credit.Amount and response
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Credit.Amount,german_credit$Creditability)
## [1] 0.1546628
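The coefficient is 0.15. By default biserial.cor takes the first level of Creditability (0) as the reference group, so the positive sign means that larger credit amounts are associated with Creditability = 0 (bad credit risk).
t-test of Credit.Amount (one-sample, null hypothesis: the true mean is 0)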
t.test(german_credit$Credit.Amount)
##
## One Sample t-test
##
## data: german_credit$Credit.Amount
## t = 36.647, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 3096.083 3446.413
## sample estimates:
## mean of x
## 3271.248
Average balance in savings account (Value.Savings.Stocks)
The average balance in the savings account is qualitative data with 5 categories.
1 : < 100 DM
2 : 100<= … < 500 DM
3 : 500<= … < 1000 DM
4 : >= 1000 DM
5 : unknown/ no savings account
DM: Deutsche Mark, the former basic unit of money in Germany.
Because the average balance in the savings account is qualitative, measures of central tendency and dispersion are not meaningful. A frequency table, the mode and a bar plot are used instead. The mode is the most frequent category.
frequency table of Average balance in savings account
tab<-table(german_credit$Value.Savings.Stocks)
tab
##
## 1 2 3 4 5
## 603 103 63 48 183
names(tab)
## [1] "1" "2" "3" "4" "5"
1 : < 100 DM
2 : 100<= … < 500 DM
3 : 500<= … < 1000 DM
4 : >= 1000 DM
5 : unknown/ no savings account
mode of Value.Savings.Stocks: the most frequent category.
temp <- table(as.vector(german_credit$Value.Savings.Stocks))
names(temp)[temp == max(temp)]
## [1] "1"
The mode of Value.Savings.Stocks is 1 (< 100 DM).
ggplot of Value.Savings.Stocks
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(factor(german_credit$Value.Savings.Stocks), geom="bar", main="Value.Savings.Stocks", xlab="Value.Savings.Stocks category", ylab="Number of applicants", fill=I("purple"))
output description:
1 stands for 603 people with a balance below 100 DM.
2 stands for 103 people with a balance from 100 to below 500 DM.
3 stands for 63 people with a balance from 500 to below 1000 DM.
4 stands for 48 people with a balance of 1000 DM or more.
5 stands for 183 people with an unknown balance or no savings account.
correlation between Value.Savings.Stocks and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Value.Savings.Stocks,german_credit$Creditability)
## [1] -0.1788532
library(vcd)
contin_table<-table(german_credit$Value.Savings.Stocks,german_credit$Creditability)
contin_table
##
## 0 1
## 1 217 386
## 2 34 69
## 3 11 52
## 4 6 42
## 5 32 151
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 38.975 4 7.0491e-08
## Pearson 36.099 4 2.7612e-07
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.187
## Cramer's V : 0.19
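The coefficient is -0.18. biserial.cor uses Creditability = 0 as its reference group by default, so the negative sign means that higher savings codes go with Creditability = 1, as the contingency table also shows. The chi-squared test (Pearson p = 2.8e-07) and Cramer's V (0.19) indicate a weak but statistically significant association.
crosstable of Value.Savings.Stocks and creditability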
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Value.Savings.Stocks, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Value.Savings.Stocks
## german_credit$Creditability | 1 | 2 | 3 | 4 | 5 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 217 | 34 | 11 | 6 | 32 | 300 |
## | 0.4 | 0.3 | 0.2 | 0.1 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 386 | 69 | 52 | 42 | 151 | 700 |
## | 0.6 | 0.7 | 0.8 | 0.9 | 0.8 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 603 | 103 | 63 | 48 | 183 | 1000 |
## | 0.6 | 0.1 | 0.1 | 0.0 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 36.09893 d.f. = 4 p = 2.761214e-07
##
##
##
Length.of.current.employment is qualitative data with 5 categories.
1 : unemployed
2 : < 1 year
3 : 1 <= … < 4 years
4 : 4 <= … < 7 years
5 : >= 7 years
Because Length.of.current.employment is qualitative, measures of central tendency and dispersion are not meaningful. A frequency table, the mode and a bar plot are used instead. The mode is the most frequent category.
frequency table of Length.of.current.employment
tab<-table(german_credit$Length.of.current.employment)
tab
##
## 1 2 3 4 5
## 62 172 339 174 253
names(tab)
## [1] "1" "2" "3" "4" "5"
1 : unemployed
2: < 1 year
3 : 1 <= … < 4 years
4 : 4 <=… < 7 years
5 : >= 7 years
mode of Length.of.current.employment: the most frequent category.
temp <- table(as.vector(german_credit$Length.of.current.employment))
names(temp)[temp == max(temp)]
## [1] "3"
The mode of Length.of.current.employment is 3 (1 to less than 4 years).
ggplot of Length.of.current.employment
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(factor(german_credit$Length.of.current.employment), geom="bar", main="Length.of.current.employment", xlab="Length.of.current.employment category", ylab="Number of applicants", fill=I("purple"))
correlation between Length.of.current.employment and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Length.of.current.employment,german_credit$Creditability)
## [1] -0.115944
library(vcd)
contin_table<-table(german_credit$Length.of.current.employment,german_credit$Creditability)
contin_table
##
## 0 1
## 1 23 39
## 2 70 102
## 3 104 235
## 4 39 135
## 5 64 189
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 18.164 4 0.0011464
## Pearson 18.368 4 0.0010455
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.134
## Cramer's V : 0.136
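The coefficient is -0.12. biserial.cor uses Creditability = 0 as its reference group by default, so the negative sign means that higher employment-length codes go with Creditability = 1, as the contingency table also shows. The chi-squared test (p = 0.001) and Cramer's V (0.136) indicate a weak but statistically significant association.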
crosstable of Length.of.current.employment and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Length.of.current.employment, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Length.of.current.employment
## german_credit$Creditability | 1 | 2 | 3 | 4 | 5 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 23 | 70 | 104 | 39 | 64 | 300 |
## | 0.4 | 0.4 | 0.3 | 0.2 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 39 | 102 | 235 | 135 | 189 | 700 |
## | 0.6 | 0.6 | 0.7 | 0.8 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 62 | 172 | 339 | 174 | 253 | 1000 |
## | 0.1 | 0.2 | 0.3 | 0.2 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 18.36827 d.f. = 4 p = 0.001045452
##
##
##
Instalment.per.cent is the installment rate as a percentage of disposable income. It is qualitative data with 4 categories.
frequency table of Instalment.per.cent
tab<-table(german_credit$Instalment.per.cent)
tab
##
## 1 2 3 4
## 136 231 157 476
names(tab)
## [1] "1" "2" "3" "4"
mode of Instalment.per.cent: the most frequent category.
temp <- table(as.vector(german_credit$Instalment.per.cent))
names(temp)[temp == max(temp)]
## [1] "4"
The mode of Instalment.per.cent is 4.
ggplot of Instalment.per.cent
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(factor(german_credit$Instalment.per.cent), geom="bar", main="Instalment.per.cent", xlab="Instalment.per.cent category", ylab="Number of customers", fill=I("purple"))
correlation between Instalment.per.cent and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Instalment.per.cent,german_credit$Creditability)
## [1] 0.07236773
library(vcd)
contin_table<-table(german_credit$Instalment.per.cent,german_credit$Creditability)
contin_table
##
## 0 1
## 1 34 102
## 2 62 169
## 3 45 112
## 4 159 317
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 5.5065 3 0.13825
## Pearson 5.4768 3 0.14003
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.074
## Cramer's V : 0.074
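The coefficient is 0.072. biserial.cor uses Creditability = 0 as its reference group by default, so the positive sign means that higher instalment codes go slightly more often with Creditability = 0. The association is weak: the chi-squared test is not significant at the 5% level (p = 0.14) and Cramer's V is 0.074.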
crosstable of Instalment.per.cent and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Instalment.per.cent, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Instalment.per.cent
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 34 | 62 | 45 | 159 | 300 |
## | 0.2 | 0.3 | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 102 | 169 | 112 | 317 | 700 |
## | 0.8 | 0.7 | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 136 | 231 | 157 | 476 | 1000 |
## | 0.1 | 0.2 | 0.2 | 0.5 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 5.476792 d.f. = 3 p = 0.1400333
##
##
##
Variable 10 is personal status and sex (Sex...Marital.Status). It is qualitative data with 4 categories.
1 : male : divorced/separated
2 : female : divorced/separated/married
3 : male : single
4 : male : married/widowed
frequency table of Sex...Marital.Status
tab<-table(german_credit$Sex...Marital.Status)
tab
##
## 1 2 3 4
## 50 310 548 92
names(tab)
## [1] "1" "2" "3" "4"
1 : male : divorced/separated
2 : female : divorced/separated/married
3 : male : single
4 : male : married/widowed
mode of Sex...Marital.Status: the most frequent category.
temp <- table(as.vector(german_credit$Sex...Marital.Status))
names(temp)[temp == max(temp)]
## [1] "3"
The mode of Sex...Marital.Status is 3 (male : single).
ggplot of Sex...Marital.Status
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(factor(german_credit$Sex...Marital.Status), geom="bar", main="Sex...Marital.Status", xlab="Sex...Marital.Status category", ylab="Number of customers", fill=I("purple"))
output description:
1: 50 men are divorced/separated.
2: 310 women are divorced/separated/married.
3: 548 men are single.
4: 92 men are married/widowed.
correlation between Sex...Marital.Status and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Sex...Marital.Status,german_credit$Creditability)
## [1] -0.0881402
library(vcd)
contin_table<-table(german_credit$Sex...Marital.Status,german_credit$Creditability)
contin_table
##
## 0 1
## 1 20 30
## 2 109 201
## 3 146 402
## 4 25 67
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 9.4414 3 0.023963
## Pearson 9.6052 3 0.022238
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.098
## Cramer's V : 0.098
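The coefficient is -0.088, but Sex...Marital.Status is a nominal variable whose codes have no order, so a correlation coefficient is not meaningful here. The chi-squared test (p = 0.022) and Cramer's V (0.098) indicate a weak association with creditability that is significant at the 5% level.
crosstable of Sex...Marital.Status and creditability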
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Sex...Marital.Status, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Sex...Marital.Status
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 20 | 109 | 146 | 25 | 300 |
## | 0.4 | 0.4 | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 30 | 201 | 402 | 67 | 700 |
## | 0.6 | 0.6 | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 50 | 310 | 548 | 92 | 1000 |
## | 0.0 | 0.3 | 0.5 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 9.605214 d.f. = 3 p = 0.02223801
##
##
##
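With `digits=1` the column proportions are rounded to one decimal, so 109/310 and 20/50 both print as 0.4 and the column share 50/1000 prints as 0.0. A sketch of the same column proportions at three decimals, using base R's `prop.table()` on the counts from the crosstable (object names are illustrative):

```r
# Counts copied from the crosstable
# (rows: Creditability 0/1, columns: Sex...Marital.Status 1-4)
tab <- matrix(c(20, 30, 109, 201, 146, 402, 25, 67),
              nrow = 2,
              dimnames = list(creditability = c("0", "1"),
                              status = c("1", "2", "3", "4")))

# Share of each Creditability value within every status category
col_prop <- prop.table(tab, margin = 2)
round(col_prop, 3)
```

At this precision the share of Creditability = 0 ranges from about 0.27 (categories 3 and 4) to 0.40 (category 1), a spread that the one-decimal table largely hides.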
Guarantors is a qualitative variable with 3 categories.
1 : none
2 : co-applicant
3 : guarantor
frequency table of Guarantors
tab<-table(german_credit$Guarantors)
tab
##
## 1 2 3
## 907 41 52
names(tab)
## [1] "1" "2" "3"
1 stands for none, 2 for co-applicant, 3 for guarantor.
mode of Guarantors (the most frequent category)
temp <- table(as.vector(german_credit$Guarantors))
names(temp)[temp == max(temp)]
## [1] "1"
The mode of Guarantors is 1 (none).
ggplot of Guarantors
library(ggplot2)
qplot(german_credit$Guarantors, main="Guarantors", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1: 907 customers have no guarantor.
2: 41 customers have a co-applicant.
3: 52 customers have a guarantor.
correlation between Guarantors and creditability
library(ltm)
biserial.cor(german_credit$Guarantors,german_credit$Creditability)
## [1] -0.0251242
library(vcd)
contin_table<-table(german_credit$Guarantors,german_credit$Creditability)
contin_table
##
## 0 1
## 1 272 635
## 2 18 23
## 3 10 42
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 6.6501 2 0.035971
## Pearson 6.6454 2 0.036056
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.081
## Cramer's V : 0.082
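biserial.cor() gives -0.025, but the codes 1 to 3 are nominal labels, so this coefficient is not meaningful here. The chi-squared test (Pearson X^2 = 6.65, df = 2, p = 0.036; Cramer's V = 0.082) shows a weak association between Guarantors and creditability that is significant at the 5% level.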
crosstable of Guarantors and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$Guarantors, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Guarantors
## german_credit$Creditability | 1 | 2 | 3 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|
## 0 | 272 | 18 | 10 | 300 |
## | 0.3 | 0.4 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## 1 | 635 | 23 | 42 | 700 |
## | 0.7 | 0.6 | 0.8 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## Column Total | 907 | 41 | 52 | 1000 |
## | 0.9 | 0.0 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 6.645367 d.f. = 2 p = 0.03605595
##
##
##
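To see which categories drive the chi-squared result, one option (not used in the original analysis) is to inspect the standardized residuals that `chisq.test()` returns in its `stdres` component. A sketch on the printed counts; `tab` and `res` are illustrative names:

```r
# Counts copied from the crosstable
# (rows: Guarantors 1-3, columns: Creditability 0/1)
tab <- matrix(c(272, 18, 10,     # Creditability = 0
                635, 23, 42),    # Creditability = 1
              ncol = 2,
              dimnames = list(guarantors = c("1", "2", "3"),
                              creditability = c("0", "1")))

res <- chisq.test(tab)

# Positive residual: more customers in the cell than independence predicts
round(res$stdres, 2)
```

In this table the co-applicant row (2) has more Creditability = 0 customers than expected and the guarantor row (3) has fewer, while the large "none" row is almost exactly at its expected counts.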
Duration.in.Current.address is a categorical variable with 4 categories.
1: <= 1 year
2: 1 < ... <= 2 years
3: 2 < ... <= 3 years
4: > 4 years
frequency table of Duration.in.Current.address
tab<-table(german_credit$Duration.in.Current.address)
tab
##
## 1 2 3 4
## 130 308 149 413
names(tab)
## [1] "1" "2" "3" "4"
1: <= 1 year
2: 1 < ... <= 2 years
3: 2 < ... <= 3 years
4: > 4 years
mode of Duration.in.Current.address (the most frequent category)
temp <- table(as.vector(german_credit$Duration.in.Current.address))
names(temp)[temp == max(temp)]
## [1] "4"
mode of Duration.in.Current.address is 4.
ggplot of Duration.in.Current.address
library(ggplot2)
qplot(german_credit$Duration.in.Current.address, main="Duration.in.Current.address", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Duration.in.Current.address and creditability
library(ltm)
biserial.cor(german_credit$Duration.in.Current.address,german_credit$Creditability)
## [1] 0.002965675
library(vcd)
contin_table<-table(german_credit$Duration.in.Current.address,german_credit$Creditability)
contin_table
##
## 0 1
## 1 36 94
## 2 97 211
## 3 43 106
## 4 124 289
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 0.75207 3 0.86089
## Pearson 0.74930 3 0.86155
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.027
## Cramer's V : 0.027
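correlation is 0.003, which is effectively zero. The chi-squared test (Pearson X^2 = 0.75, df = 3, p = 0.86) also gives no evidence of an association between Duration.in.Current.address and creditability.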
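The p-value attached to a chi-squared statistic is the upper tail of the chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom. A short sketch that recomputes it in base R from the Pearson statistic printed above (variable names are illustrative):

```r
# Pearson statistic for Duration.in.Current.address, taken from the output
x2 <- 0.7492964

# 4 duration categories x 2 creditability classes
df <- (4 - 1) * (2 - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
p_value
```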
crosstable of Duration.in.Current.address and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$Duration.in.Current.address, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Duration.in.Current.address
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 36 | 97 | 43 | 124 | 300 |
## | 0.3 | 0.3 | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 94 | 211 | 106 | 289 | 700 |
## | 0.7 | 0.7 | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 130 | 308 | 149 | 413 | 1000 |
## | 0.1 | 0.3 | 0.1 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 0.7492964 d.f. = 3 p = 0.8615521
##
##
##
Most.valuable.available.asset is a qualitative variable with 4 categories.
1 : real estate
2 : if not A121 : building society savings agreement/life insurance
3 : if not A121/A122 : car or other, not in variable 7
4 : unknown / no property
frequency table of Most.valuable.available.asset
tab<-table(german_credit$Most.valuable.available.asset)
tab
##
## 1 2 3 4
## 282 232 332 154
names(tab)
## [1] "1" "2" "3" "4"
1 : real estate
2 : if not A121 : building society savings agreement/life insurance
3 : if not A121/A122 : car or other, not in variable 7
4 : unknown / no property
mode of Most.valuable.available.asset (the most frequent category)
temp <- table(as.vector(german_credit$Most.valuable.available.asset))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Most.valuable.available.asset is 3.
ggplot of Most.valuable.available.asset
library(ggplot2)
qplot(german_credit$Most.valuable.available.asset, main="Most.valuable.available.asset", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Most.valuable.available.asset and creditability
library(ltm)
biserial.cor(german_credit$Most.valuable.available.asset,german_credit$Creditability)
## [1] 0.1425406
library(vcd)
contin_table<-table(german_credit$Most.valuable.available.asset,german_credit$Creditability)
contin_table
##
## 0 1
## 1 60 222
## 2 71 161
## 3 102 230
## 4 67 87
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 23.546 3 3.1063e-05
## Pearson 23.720 3 2.8584e-05
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.152
## Cramer's V : 0.154
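correlation is 0.14. biserial.cor() uses level = 1 by default, so the sign is relative to the first level of Creditability (0): the positive value means that customers with Creditability = 0 have the higher average asset code, not that creditability rises with the code. The contingency table agrees: the share with Creditability = 1 falls from 222 of 282 (79%) for real estate to 87 of 154 (56%) for unknown / no property. The association is statistically significant (Pearson X^2 = 23.72, df = 3, p = 2.9e-05; Cramer's V = 0.154).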
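The sign of the coefficient depends on which level of Creditability is taken as the reference. The sketch below rebuilds one row per customer from the printed contingency table and compares the ordinary Pearson correlation with a hand-written point-biserial formula that uses Creditability = 0 as the reference group. The hand-written formula is an assumption about what `biserial.cor()` computes with its default `level = 1`; it is included because it reproduces the value printed above.

```r
# Contingency table copied from the output
# (rows: asset code 1-4, columns: Creditability 0/1)
counts <- matrix(c(60, 71, 102, 67,       # Creditability = 0
                   222, 161, 230, 87),    # Creditability = 1
                 ncol = 2)

# Expand the cell counts back into one record per customer
asset  <- rep(rep(1:4, times = 2), times = as.vector(counts))
credit <- rep(rep(0:1, each = 4),  times = as.vector(counts))

# Ordinary Pearson correlation with Creditability coded 0/1
pearson <- cor(asset, credit)

# Point-biserial with the first level (0) as reference group
ltm_style <- (mean(asset[credit == 0]) - mean(asset[credit == 1])) *
  sqrt(mean(credit == 0) * mean(credit == 1)) / sd(asset)

round(c(pearson = pearson, ltm_style = ltm_style), 4)
```

The two values have opposite signs: measured against Creditability = 1, a higher asset code goes with lower creditability.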
crosstable of Most.valuable.available.asset and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$Most.valuable.available.asset, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Most.valuable.available.asset
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 60 | 71 | 102 | 67 | 300 |
## | 0.2 | 0.3 | 0.3 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 222 | 161 | 230 | 87 | 700 |
## | 0.8 | 0.7 | 0.7 | 0.6 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 282 | 232 | 332 | 154 | 1000 |
## | 0.3 | 0.2 | 0.3 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 23.71955 d.f. = 3 p = 2.858442e-05
##
##
##
Age..years. is a numerical variable.
head(german_credit$Age..years.)
## [1] 21 36 23 39 38 48
Central tendencies of Age..years.
mean of Age..years.
mean(german_credit$Age..years.)
## [1] 35.542
median of Age..years.
median(german_credit$Age..years.)
## [1] 33
Dispersion of Age..years.
variance of Age..years.
var(german_credit$Age..years.)
## [1] 128.8831
Standard deviation of Age..years.
sd(german_credit$Age..years.)
## [1] 11.35267
summary() gives the minimum, the three quartiles, the mean and the maximum of Age..years.
summary(german_credit$Age..years.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 27.00 33.00 35.54 42.00 75.00
boxplot of Age..years.
quantile(german_credit$Age..years.)
## 0% 25% 50% 75% 100%
## 19 27 33 42 75
quantile(german_credit$Age..years.,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 42 44 52 75
boxplot(german_credit$Age..years.)
output description
In this boxplot the minimum is 19, the maximum is 75 and the median is 33. The first quartile is 27 and the third quartile is 42. Note that outliers are discussed later.
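The points that `boxplot()` draws beyond the whiskers follow from the quartiles: with the default `range = 1.5`, values more than 1.5 times the interquartile range outside the box are plotted individually. A sketch of that arithmetic using the quartiles printed above (`boxplot()` works from hinges, which can differ slightly from `quantile()`, so the fences below are approximate):

```r
# Quartiles of Age..years. taken from the quantile() output
q1 <- 27
q3 <- 42

# Interquartile range and the 1.5 * IQR fences
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

Since the minimum age (19) lies above the lower fence and the maximum (75) lies above the upper fence, any points drawn outside the whiskers can only be on the high side.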
histogram of Age..years.
hist(german_credit$Age..years.)
correlation between Age..years. and Creditability
library(ltm)
biserial.cor(german_credit$Age..years.,german_credit$Creditability)
## [1] -0.0912263
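correlation is -0.091. biserial.cor() uses level = 1 by default, so the sign is relative to the first level of Creditability (0, bad credit): the negative value means that bad-credit customers are slightly younger on average, i.e. creditability rises weakly with age.
t-test of Age..years.
The one-sample t-test below tests whether the mean age differs from 0, which is trivially true; its useful output is the 95 percent confidence interval for the mean age.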
t.test(german_credit$Age..years.)
##
## One Sample t-test
##
## data: german_credit$Age..years.
## t = 99.002, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 34.83751 36.24649
## sample estimates:
## mean of x
## 35.542
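The t statistic and the confidence interval can be reproduced from the summary statistics already computed for Age..years. (mean, standard deviation and sample size). A sketch with illustrative variable names:

```r
# Summary statistics taken from the output above
n <- 1000
m <- 35.542
s <- 11.35267

# Standard error of the mean
se <- s / sqrt(n)

# t statistic for H0: true mean = 0
t_stat <- (m - 0) / se

# 95% confidence interval for the mean, using the t distribution
ci <- m + c(-1, 1) * qt(0.975, df = n - 1) * se

round(c(t = t_stat, lower = ci[1], upper = ci[2]), 3)
```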
Concurrent.Credits is a qualitative variable with 3 categories.
1 : bank
2 : stores
3 : none
frequency table of Concurrent.Credits
tab<-table(german_credit$Concurrent.Credits)
tab
##
## 1 2 3
## 139 47 814
names(tab)
## [1] "1" "2" "3"
1 : bank
2 : stores
3 : none
mode of Concurrent.Credits (the most frequent category)
temp <- table(as.vector(german_credit$Concurrent.Credits))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Concurrent.Credits is 3.
ggplot of Concurrent.Credits
library(ggplot2)
qplot(german_credit$Concurrent.Credits, main="Concurrent.Credits", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1: 139 customers have concurrent credits at a bank.
2: 47 customers have concurrent credits at stores.
3: 814 customers have no concurrent credits.
correlation between Concurrent.Credits and creditability
library(ltm)
biserial.cor(german_credit$Concurrent.Credits,german_credit$Creditability)
## [1] -0.1097892
library(vcd)
contin_table<-table(german_credit$Concurrent.Credits,german_credit$Creditability)
contin_table
##
## 0 1
## 1 57 82
## 2 19 28
## 3 224 590
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 12.303 2 0.0021298
## Pearson 12.839 2 0.0016293
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.113
## Cramer's V : 0.113
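biserial.cor() gives -0.110, but the codes 1 to 3 are nominal labels, so the coefficient is hard to interpret. The chi-squared test (Pearson X^2 = 12.84, df = 2, p = 0.0016; Cramer's V = 0.113) shows a weak but statistically significant association between Concurrent.Credits and creditability.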
crosstable of Concurrent.Credits and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$Concurrent.Credits, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Concurrent.Credits
## german_credit$Creditability | 1 | 2 | 3 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|
## 0 | 57 | 19 | 224 | 300 |
## | 0.4 | 0.4 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## 1 | 82 | 28 | 590 | 700 |
## | 0.6 | 0.6 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## Column Total | 139 | 47 | 814 | 1000 |
## | 0.1 | 0.0 | 0.8 | |
## ----------------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 12.83919 d.f. = 2 p = 0.001629318
##
##
##
Type.of.apartment is a qualitative variable with 3 categories.
1 : rent
2 : own
3 : for free
frequency table of Type.of.apartment
tab<-table(german_credit$Type.of.apartment)
tab
##
## 1 2 3
## 179 714 107
names(tab)
## [1] "1" "2" "3"
1 : rent
2 : own
3 : for free
mode of Type.of.apartment (the most frequent category)
temp <- table(as.vector(german_credit$Type.of.apartment))
names(temp)[temp == max(temp)]
## [1] "2"
The mode of Type.of.apartment is 2 (own).
ggplot of Type.of.apartment
library(ggplot2)
qplot(german_credit$Type.of.apartment, main="Type.of.apartment", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1: 179 customers live in rented housing.
2: 714 customers own their housing.
3: 107 customers live rent-free.
correlation between Type.of.apartment and creditability
library(ltm)
biserial.cor(german_credit$Type.of.apartment,german_credit$Creditability)
## [1] -0.01810985
library(vcd)
contin_table<-table(german_credit$Type.of.apartment,german_credit$Creditability)
contin_table
##
## 0 1
## 1 70 109
## 2 186 528
## 3 44 63
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 18.129 2 1.1573e-04
## Pearson 18.674 2 8.8103e-05
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.135
## Cramer's V : 0.137
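biserial.cor() gives -0.018, which is close to zero, but the codes 1 to 3 are nominal labels and the coefficient only captures a linear trend across them. The chi-squared test (Pearson X^2 = 18.67, df = 2, p = 8.8e-05; Cramer's V = 0.137) shows a clear association between Type.of.apartment and creditability.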
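A near-zero correlation alongside a significant chi-squared test is possible when the relationship is not monotone in the category codes. The sketch below, built only from the printed contingency table, computes the share of Creditability = 1 per category and the Pearson correlation of the codes with Creditability (object names are illustrative):

```r
# Contingency table copied from the output
# (rows: Type.of.apartment, columns: Creditability 0/1)
counts <- matrix(c(70, 186, 44,      # Creditability = 0
                   109, 528, 63),    # Creditability = 1
                 ncol = 2,
                 dimnames = list(type = c("rent", "own", "for free"),
                                 creditability = c("0", "1")))

# Share of Creditability = 1 within each housing type
good_rate <- counts[, "1"] / rowSums(counts)
round(good_rate, 3)

# One record per customer, then the correlation of code with Creditability
type_code <- rep(rep(1:3, times = 2), times = as.vector(counts))
credit    <- rep(rep(0:1, each = 3),  times = as.vector(counts))
r <- cor(type_code, credit)
r
```

The middle category (own) has the highest share of Creditability = 1 and the two outer categories have lower shares, so a straight line through codes 1, 2, 3 is nearly flat even though the categories clearly differ.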
crosstable of Type.of.apartment and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$Type.of.apartment, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Type.of.apartment
## german_credit$Creditability | 1 | 2 | 3 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|
## 0 | 70 | 186 | 44 | 300 |
## | 0.4 | 0.3 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## 1 | 109 | 528 | 63 | 700 |
## | 0.6 | 0.7 | 0.6 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## Column Total | 179 | 714 | 107 | 1000 |
## | 0.2 | 0.7 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 18.67401 d.f. = 2 p = 8.810311e-05
##
##
##
No.of.Credits.at.this.Bank is a count with 4 distinct values (1 to 4); it is treated here as a categorical variable.
frequency table of No.of.Credits.at.this.Bank
tab<-table(german_credit$No.of.Credits.at.this.Bank)
tab
##
## 1 2 3 4
## 633 333 28 6
names(tab)
## [1] "1" "2" "3" "4"
mode of No.of.Credits.at.this.Bank (the most frequent value)
temp <- table(as.vector(german_credit$No.of.Credits.at.this.Bank))
names(temp)[temp == max(temp)]
## [1] "1"
mode of No.of.Credits.at.this.Bank is 1.
ggplot of No.of.Credits.at.this.Bank
library(ggplot2)
qplot(german_credit$No.of.Credits.at.this.Bank, main="No.of.Credits.at.this.Bank", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between No.of.Credits.at.this.Bank and creditability
library(ltm)
biserial.cor(german_credit$No.of.Credits.at.this.Bank,german_credit$Creditability)
## [1] -0.04570962
library(vcd)
contin_table<-table(german_credit$No.of.Credits.at.this.Bank,german_credit$Creditability)
contin_table
##
## 0 1
## 1 200 433
## 2 92 241
## 3 6 22
## 4 2 4
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 2.7425 3 0.43304
## Pearson 2.6712 3 0.44514
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.052
## Cramer's V : 0.052
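correlation is -0.046, which is close to zero. The chi-squared test (Pearson X^2 = 2.67, df = 3, p = 0.45) gives no evidence of an association between No.of.Credits.at.this.Bank and creditability.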
crosstable of No.of.Credits.at.this.Bank and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$No.of.Credits.at.this.Bank, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$No.of.Credits.at.this.Bank
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 200 | 92 | 6 | 2 | 300 |
## | 0.3 | 0.3 | 0.2 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 433 | 241 | 22 | 4 | 700 |
## | 0.7 | 0.7 | 0.8 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 633 | 333 | 28 | 6 | 1000 |
## | 0.6 | 0.3 | 0.0 | 0.0 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 2.671198 d.f. = 3 p = 0.4451441
##
##
##
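The warning "Chi-squared approximation may be incorrect" is raised when some expected cell counts are small (a common rule of thumb is below 5). The sketch below computes the expected counts from the printed table and, as one possible remedy that the original analysis does not use, obtains a Monte Carlo p-value with `simulate.p.value = TRUE`. The seed and the number of replicates are arbitrary choices.

```r
# Counts copied from the crosstable
# (rows: No.of.Credits.at.this.Bank 1-4, columns: Creditability 0/1)
counts <- matrix(c(200, 92, 6, 2,       # Creditability = 0
                   433, 241, 22, 4),    # Creditability = 1
                 ncol = 2,
                 dimnames = list(credits = c("1", "2", "3", "4"),
                                 creditability = c("0", "1")))

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 1)

# Monte Carlo p-value, which does not rely on the large-sample approximation
set.seed(1)
sim <- chisq.test(counts, simulate.p.value = TRUE, B = 10000)
sim$p.value
```

Only the row for 4 credits (6 customers) has expected counts below 5, which is what triggers the warning.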
Occupation is a qualitative variable with 4 categories.
1 : unemployed/ unskilled - non-resident
2 : unskilled - resident
3 : skilled employee / official
4 : management/ self-employed/highly qualified employee/ officer
frequency table of Occupation
tab<-table(german_credit$Occupation)
tab
##
## 1 2 3 4
## 22 200 630 148
names(tab)
## [1] "1" "2" "3" "4"
1 : unemployed/ unskilled - non-resident
2 : unskilled - resident
3 : skilled employee / official
4 : management/ self-employed/highly qualified employee/ officer
mode of Occupation (the most frequent category)
temp <- table(as.vector(german_credit$Occupation))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Occupation is 3.
ggplot of Occupation
library(ggplot2)
qplot(german_credit$Occupation, main="Occupation", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1 : 22 customers are unemployed/ unskilled - non-resident
2 : 200 customers are unskilled - resident
3 : 630 customers are skilled employee / official
4 : 148 customers are management/ self-employed/highly qualified employee/ officer
correlation between Occupation and creditability
library(ltm)
biserial.cor(german_credit$Occupation,german_credit$Creditability)
## [1] 0.03271863
library(vcd)
contin_table<-table(german_credit$Occupation,german_credit$Creditability)
contin_table
##
## 0 1
## 1 7 15
## 2 56 144
## 3 186 444
## 4 51 97
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 1.8540 3 0.60326
## Pearson 1.8852 3 0.59658
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.043
## Cramer's V : 0.043
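correlation is 0.033, which is close to zero. The chi-squared test (Pearson X^2 = 1.89, df = 3, p = 0.60) gives no evidence of an association between Occupation and creditability.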
crosstable of Occupation and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$Occupation, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Occupation
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 7 | 56 | 186 | 51 | 300 |
## | 0.3 | 0.3 | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 15 | 144 | 444 | 97 | 700 |
## | 0.7 | 0.7 | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 22 | 200 | 630 | 148 | 1000 |
## | 0.0 | 0.2 | 0.6 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1.885156 d.f. = 3 p = 0.5965816
##
##
##
No.of.dependents takes only 2 values (1 and 2) and is treated here as a categorical variable.
frequency table of No.of.dependents
tab<-table(german_credit$No.of.dependents)
tab
##
## 1 2
## 845 155
names(tab)
## [1] "1" "2"
mode of No.of.dependents (the most frequent value)
temp <- table(as.vector(german_credit$No.of.dependents))
names(temp)[temp == max(temp)]
## [1] "1"
The mode of No.of.dependents is 1.
ggplot of No.of.dependents
library(ggplot2)
qplot(german_credit$No.of.dependents, main="No.of.dependents", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between No.of.dependents and creditability
library(ltm)
biserial.cor(german_credit$No.of.dependents,german_credit$Creditability)
## [1] -0.003013345
library(vcd)
contin_table<-table(german_credit$No.of.dependents,german_credit$Creditability)
contin_table
##
## 0 1
## 1 254 591
## 2 46 109
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 0.0091047 1 0.92398
## Pearson 0.0090893 1 0.92405
##
## Phi-Coefficient : 0.003
## Contingency Coeff.: 0.003
## Cramer's V : 0.003
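correlation is -0.003, which is effectively zero. The chi-squared test (Pearson X^2 = 0.009, df = 1, p = 0.92) gives no evidence of an association between No.of.dependents and creditability.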
crosstable of No.of.dependents and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$No.of.dependents, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$No.of.dependents
## german_credit$Creditability | 1 | 2 | Row Total |
## ----------------------------|-----------|-----------|-----------|
## 0 | 254 | 46 | 300 |
## | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|
## 1 | 591 | 109 | 700 |
## | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|
## Column Total | 845 | 155 | 1000 |
## | 0.8 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 0.009089339 d.f. = 1 p = 0.9240463
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 0 d.f. = 1 p = 1
##
##
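For a 2 x 2 table the output reports the test both without and with Yates' continuity correction. The correction reduces each |observed - expected| by at most 0.5 before squaring; in this table every cell differs from its expected count by exactly 0.5, which is why the corrected statistic is 0 and its p-value is 1. A sketch in base R on the printed counts (object names are illustrative):

```r
# Counts copied from the crosstable
# (rows: No.of.dependents 1-2, columns: Creditability 0/1)
counts <- matrix(c(254, 46,      # Creditability = 0
                   591, 109),    # Creditability = 1
                 ncol = 2,
                 dimnames = list(dependents = c("1", "2"),
                                 creditability = c("0", "1")))

plain <- chisq.test(counts, correct = FALSE)  # Pearson statistic
yates <- chisq.test(counts, correct = TRUE)   # with continuity correction

# Observed minus expected: every cell is off by 0.5 in absolute value
counts - plain$expected

c(plain = unname(plain$statistic), yates = unname(yates$statistic))
```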
Telephone is a qualitative variable with 2 categories.
1 : none
2 : yes, registered under the customer's name
frequency table of Telephone
tab<-table(german_credit$Telephone)
tab
##
## 1 2
## 596 404
names(tab)
## [1] "1" "2"
1 stands for none, 2 for yes, registered under the customer's name.
mode of Telephone (the most frequent category)
temp <- table(as.vector(german_credit$Telephone))
names(temp)[temp == max(temp)]
## [1] "1"
The mode of Telephone is 1 (none).
ggplot of Telephone
library(ggplot2)
qplot(german_credit$Telephone, main="Telephone", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Telephone and creditability
library(ltm)
biserial.cor(german_credit$Telephone,german_credit$Creditability)
## [1] -0.03644795
library(vcd)
contin_table<-table(german_credit$Telephone,german_credit$Creditability)
contin_table
##
## 0 1
## 1 187 409
## 2 113 291
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 1.3359 1 0.24776
## Pearson 1.3298 1 0.24884
##
## Phi-Coefficient : 0.036
## Contingency Coeff.: 0.036
## Cramer's V : 0.036
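correlation is -0.036, which is close to zero. The chi-squared test (Pearson X^2 = 1.33, df = 1, p = 0.25) gives no evidence of an association between Telephone and creditability.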
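For two binary variables the Phi coefficient, sqrt(X^2 / n), equals the absolute value of the Pearson correlation between them, which is why Phi (0.036) and the magnitude of the correlation agree here. A sketch that checks this on the printed counts (object names are illustrative):

```r
# Contingency table copied from the output
# (rows: Telephone 1-2, columns: Creditability 0/1)
counts <- matrix(c(187, 113,     # Creditability = 0
                   409, 291),    # Creditability = 1
                 ncol = 2)

# Phi from the uncorrected Pearson statistic
x2  <- unname(chisq.test(counts, correct = FALSE)$statistic)
phi <- sqrt(x2 / sum(counts))

# One record per customer, then the ordinary Pearson correlation
telephone <- rep(rep(1:2, times = 2), times = as.vector(counts))
credit    <- rep(rep(0:1, each = 2),  times = as.vector(counts))
r <- cor(telephone, credit)

c(phi = phi, abs_r = abs(r))
```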
crosstable of Telephone and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$Telephone, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Telephone
## german_credit$Creditability | 1 | 2 | Row Total |
## ----------------------------|-----------|-----------|-----------|
## 0 | 187 | 113 | 300 |
## | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|
## 1 | 409 | 291 | 700 |
## | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|
## Column Total | 596 | 404 | 1000 |
## | 0.6 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1.329783 d.f. = 1 p = 0.2488438
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 1.172559 d.f. = 1 p = 0.2788762
##
##
Foreign.Worker is a qualitative variable with 2 categories.
1 : yes
2 : no
frequency table of Foreign.Worker
tab<-table(german_credit$Foreign.Worker)
tab
##
## 1 2
## 963 37
names(tab)
## [1] "1" "2"
1 stands for yes, 2 for no.
mode of Foreign.Worker (the most frequent category)
temp <- table(as.vector(german_credit$Foreign.Worker))
names(temp)[temp == max(temp)]
## [1] "1"
The mode of Foreign.Worker is 1 (yes).
ggplot of Foreign.Worker
library(ggplot2)
qplot(german_credit$Foreign.Worker, main="Foreign.Worker", ylab="customers", colour=I("purple"), size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Foreign.Worker and creditability
library(ltm)
a<-biserial.cor(german_credit$Foreign.Worker,german_credit$Creditability)
library(vcd)
contin_table<-table(german_credit$Foreign.Worker,german_credit$Creditability)
contin_table
##
## 0 1
## 1 296 667
## 2 4 33
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 8.0724 1 0.0044945
## Pearson 6.7370 1 0.0094431
##
## Phi-Coefficient : 0.082
## Contingency Coeff.: 0.082
## Cramer's V : 0.082
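correlation is -0.08 (the value stored in a). The chi-squared test (Pearson X^2 = 6.74, df = 1, p = 0.0094) shows a weak but statistically significant association (Phi = 0.082) between Foreign.Worker and creditability. In the contingency table, 33 of the 37 customers in category 2 have Creditability = 1, against 667 of the 963 customers in category 1.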
crosstable of Foreign.Worker and creditability
library(gmodels)
CrossTable(german_credit$Creditability, german_credit$Foreign.Worker, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Foreign.Worker
## german_credit$Creditability | 1 | 2 | Row Total |
## ----------------------------|-----------|-----------|-----------|
## 0 | 296 | 4 | 300 |
## | 0.3 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|
## 1 | 667 | 33 | 700 |
## | 0.7 | 0.9 | |
## ----------------------------|-----------|-----------|-----------|
## Column Total | 963 | 37 | 1000 |
## | 1.0 | 0.0 | |
## ----------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 6.737044 d.f. = 1 p = 0.009443096
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 5.821576 d.f. = 1 p = 0.01583075
##
##