Outline

The main goal of this task is to predict the Class of a loan profile.

1. Import Raw Data

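url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'

# The file has no header row. stringsAsFactors = TRUE keeps the coded columns
# as factors (the default changed to FALSE in R 4.0.0).
df <- read.table(url, sep = ' ', header = FALSE, stringsAsFactors = TRUE)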

head(df,n=5)
##    V1 V2  V3  V4   V5  V6  V7 V8  V9  V10 V11  V12 V13  V14  V15 V16  V17
## 1 A11  6 A34 A43 1169 A65 A75  4 A93 A101   4 A121  67 A143 A152   2 A173
## 2 A12 48 A32 A43 5951 A61 A73  2 A92 A101   2 A121  22 A143 A152   1 A173
## 3 A14 12 A34 A46 2096 A61 A74  2 A93 A101   3 A121  49 A143 A152   1 A172
## 4 A11 42 A32 A42 7882 A61 A74  2 A93 A103   4 A122  45 A143 A153   1 A173
## 5 A11 24 A33 A40 4870 A61 A73  3 A93 A101   4 A124  53 A143 A153   2 A173
##   V18  V19  V20 V21
## 1   1 A192 A201   1
## 2   1 A191 A201   2
## 3   2 A191 A201   1
## 4   2 A191 A201   1
## 5   2 A191 A201   2
str(df)
## 'data.frame':    1000 obs. of  21 variables:
##  $ V1 : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ V2 : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ V3 : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ V4 : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ V5 : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ V6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ V7 : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ V8 : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ V9 : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ V10: Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ V11: int  4 2 3 4 4 4 4 2 4 2 ...
##  $ V12: Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ V13: int  67 22 49 45 53 35 53 35 61 28 ...
##  $ V14: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ V15: Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ V16: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ V17: Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ V18: int  1 1 2 2 2 2 1 1 1 1 ...
##  $ V19: Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ V20: Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ V21: int  1 2 1 1 2 1 1 1 1 2 ...

As it stands, the data doesn’t tell us much. We don’t even know what each column represents, let alone what the values within each column mean. Our first task is to clean it so that we can perform initial exploration.
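One quick way to see what needs decoding is to list the levels of every factor column. The following is a minimal sketch on a small hand-made stand-in for `df`; the `toy` data frame and the names `is_coded` and `coded_levels` are assumptions for illustration, with values copied from the `head()` output above.

```r
# Toy stand-in for three columns of df (values taken from the head() output)
toy <- data.frame(
  V1 = factor(c("A11", "A12", "A14", "A11", "A11")),
  V2 = c(6L, 48L, 12L, 42L, 24L),
  V3 = factor(c("A34", "A32", "A34", "A32", "A33"))
)

# Flag the factor columns, then list the coded levels of each one
is_coded <- sapply(toy, is.factor)
coded_levels <- lapply(toy[is_coded], levels)
print(coded_levels)
```

Running the same two lines on the real `df` should give the full set of codes that the cleaning step has to translate.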

2. Clean the Raw Data

Since this data frame is small (only 21 columns, with at most 10 levels per factor column), we will build a new data frame “german_credit” from the loaded data frame, adding the desired details column by column.

  1. Add “Class” attribute

Add the Class attribute to the German credit data; it indicates whether the profile is good or bad. In the raw data, V21 is 1 for a good profile and 2 for a bad one.

german_credit <- data.frame(Class = df$V21)
german_credit$Class <- 'Good'
german_credit$Class[df$V21 == 2] <- 'Bad'
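The same recoding can also be written in one step with `factor()` and explicit `levels` and `labels`, which gives a factor instead of a character column. This is only a sketch on a toy vector; `v21` is a hypothetical stand-in for `df$V21`, and `class_factor` is an illustrative name.

```r
# First five values of V21, as shown in the head() output
v21 <- c(1L, 2L, 1L, 1L, 2L)

# Map code 1 to "Good" and code 2 to "Bad" in a single call
class_factor <- factor(v21, levels = c(1, 2), labels = c("Good", "Bad"))
print(class_factor)
print(table(class_factor))
```

Whether a factor or a character column is preferable depends on the modelling function used later, so treat this as an alternative rather than a replacement.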
  1. Add “CheckingAccountStatus” attribute

Add the CheckingAccountStatus attribute to the German credit data. Attribute 1 (qualitative) is the status of the existing checking account:

- A11 : … < 0 DM (we will say ‘lt.0’)
- A12 : 0 <= … < 200 DM (we will say ‘0.to.200’)
- A13 : … >= 200 DM / salary assignments for at least 1 year (we will say ‘gt.200’)
- A14 : no checking account (we will say ‘none’)

german_credit$CheckingAccountStatus <- df$V1
levels(german_credit$CheckingAccountStatus) <- c('lt.0'
                                                ,'0.to.200'
                                                ,'gt.200'
                                                ,'none'
)
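Assigning to `levels()` works because the new labels are listed in the same order as the existing levels. A more explicit alternative pairs each code with its label, so the mapping does not depend on position. The sketch below uses a toy vector; `codes`, `lookup`, and `status` are hypothetical names.

```r
# Toy stand-in for df$V1
codes <- factor(c("A11", "A12", "A14", "A11", "A13"))

# Named vector: raw code -> readable label
lookup <- c(A11 = "lt.0", A12 = "0.to.200", A13 = "gt.200", A14 = "none")

# levels gives the codes to look for, labels gives their replacements
status <- factor(codes, levels = names(lookup), labels = unname(lookup))
print(status)
```

With this form, a code that is missing from the data simply produces an unused level instead of shifting every later label by one.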
  1. Add “Duration” attribute
german_credit$Duration <- df$V2
  1. Add “CreditHistory” attribute
german_credit$CreditHistory <- df$V3
levels(german_credit$CreditHistory) <- c(
  'NoCredit.AllPaid'
  ,'ThisBank.AllPaid'
  ,'PaidDuly'
  ,'Delay'
  ,'Critical'
)
  1. Add “Purpose” attribute
german_credit$Purpose <- df$V4

levels(german_credit$Purpose) <- c(
  'NewCar'
  ,'UsedCar'
  ,'Others'
  ,'Furniture.Equipment'
  ,'Radio.Television' 
  ,'DomesticAppliance'
  ,'Repairs' 
  ,'Education'
  ,'Retraining' 
  ,'Business'
)
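The order of the labels above follows the sorted order of the codes, in which `A410` falls between `A41` and `A42`; that is why 'Others' comes third rather than last. A small check on a toy vector of the ten codes present in the data (`purpose_codes` and `purpose_levels` are illustrative names):

```r
# The ten Purpose codes that occur in the data, deliberately with A410 last
purpose_codes <- c("A40", "A41", "A42", "A43", "A44",
                   "A45", "A46", "A48", "A49", "A410")

# factor() sorts the levels as strings, so A410 moves up to third place
purpose_levels <- levels(factor(purpose_codes))
print(purpose_levels)
```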
  1. Add “Amount” attribute
german_credit$Amount <- df$V5
  1. Add “SavingsAccountBonds” attribute
german_credit$SavingsAccountBonds <- df$V6
levels(german_credit$SavingsAccountBonds) <- c(
  'lt.100'
  ,'100.to.500'
  ,'500.to.1000'
  ,'gt.1000'
  ,'Unknown'
)
  1. Add “EmploymentDuration” attribute
german_credit$EmploymentDuration <- df$V7
levels(german_credit$EmploymentDuration) <- c(
  'Unemployed'
  ,'0.to.1'
  ,'1.to.4'
  ,'4.to.7'
  ,'gt.7'
)
  1. Add “InstallmentRatePercentage” attribute
german_credit$InstallmentRatePercentage <- df$V8
  1. Add “Personal” attribute
german_credit$Personal <- df$V9
levels(german_credit$Personal) <- c(
  'Male.Divorced.Seperated'
  ,'Female.NotSingle'
  ,'Male.Single'
  ,'Male.Married.Widowed'
)
  1. Add “OtherDebtorsGuarantors” attribute
german_credit$OtherDebtorsGuarantors <- df$V10
levels(german_credit$OtherDebtorsGuarantors) <- c(
  'None'
  ,'CoApplicant'
  ,'Guarantor'
)
  1. Add “ResidenceDuration” attribute
german_credit$ResidenceDuration <- df$V11
  1. Add “Property” attribute
german_credit$Property <- df$V12
levels(german_credit$Property) <- c(
  'RealEstate'
  ,'Insurance'
  ,'CarOther'
  ,'Unknown'
)
  1. Add “Age” attribute
german_credit$Age <- df$V13
  1. Add “OtherInstallmentPlans” attribute
german_credit$OtherInstallmentPlans <- df$V14
levels(german_credit$OtherInstallmentPlans) <- c(
  'Bank'
  ,'Stores'
  ,'None'
)
  1. Add “Housing” attribute
german_credit$Housing <- df$V15
levels(german_credit$Housing) <- c('Rent', 'Own', 'ForFree')
  1. Add “NumberExistingCredits” attribute
german_credit$NumberExistingCredits <- df$V16
  1. Add “Job” attribute
german_credit$Job <- df$V17
levels(german_credit$Job) <- c(
  'UnemployedUnskilled'
  ,'UnskilledResident'
  ,'SkilledEmployee'
  ,'Management.SelfEmp.HighlyQualified'
)
  1. Add “NumberPeopleMaintenance” attribute
german_credit$NumberPeopleMaintenance <- df$V18
  1. Add “Telephone” attribute
german_credit$Telephone <- df$V19
levels(german_credit$Telephone) <- c(
  0
  ,1
)
  1. Add “ForeignWorker” attribute
german_credit$ForeignWorker <- df$V20
levels(german_credit$ForeignWorker) <- c(1, 0)
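Because every recoding here is positional, it can be worth cross-tabulating the raw codes against the new labels to confirm that each code landed on the intended label. A minimal sketch on toy data follows; `raw`, `recoded`, and `check` are hypothetical names, and the same `table()` call could be applied to, say, `df$V20` and `german_credit$ForeignWorker`.

```r
# Toy stand-in for df$V20 (A201 = yes, A202 = no)
raw <- factor(c("A201", "A201", "A202", "A201"))

# Same positional recoding as used for ForeignWorker
recoded <- raw
levels(recoded) <- c(1, 0)

# Each row should have counts in exactly one column
check <- table(raw, recoded)
print(check)
```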

Finally, let’s save all these changes to a file, which we can load at any time without repeating these steps.

save(german_credit, file = 'german_credit')
write.csv(german_credit, 'german_credit_full.csv',
          row.names = FALSE)
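To read the data back, `load()` restores the object saved with `save()`, and `read.csv()` reads the CSV copy. The sketch below shows the round trip on a toy data frame written to temporary files; `toy`, `rdata_path`, `csv_path`, `restored`, and `from_csv` are assumptions for illustration, not names used elsewhere in this walkthrough.

```r
# Toy stand-in for german_credit
toy <- data.frame(Class = c("Good", "Bad"), Duration = c(6L, 48L))

# Write both formats to temporary files
rdata_path <- tempfile(fileext = ".RData")
csv_path <- tempfile(fileext = ".csv")
save(toy, file = rdata_path)
write.csv(toy, csv_path, row.names = FALSE)

# load() recreates the object under its original name; a separate
# environment avoids overwriting the toy object in the workspace
restored <- new.env()
load(rdata_path, envir = restored)
from_csv <- read.csv(csv_path)

print(identical(restored$toy, toy))
print(from_csv)
```

The saved R object keeps factor columns and their level order, whereas the CSV stores only text, so factor levels would need to be set again after `read.csv()`.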