The main goal of this task is to predict the Class (good or bad) of a loan profile.
Step 1 : Import Raw Data
Step 2 : Clean Data
Step 3 : Data Exploration : Visual & Statistical
Step 4 : Prepare the data for Machine Learning Model
Step 5 : Prepare the ML model for training and define the Cost Function
Step 6 : Tune the parameters to minimize the cost further
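url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'
# The file is space-separated and has no header row.
# stringsAsFactors = TRUE keeps the categorical columns as factors on R >= 4.0.
df <- read.table(url, sep = ' ', header = FALSE, stringsAsFactors = TRUE)
head(df, n = 5)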
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
## 1 A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173
## 2 A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173
## 3 A14 12 A34 A46 2096 A61 A74 2 A93 A101 3 A121 49 A143 A152 1 A172
## 4 A11 42 A32 A42 7882 A61 A74 2 A93 A103 4 A122 45 A143 A153 1 A173
## 5 A11 24 A33 A40 4870 A61 A73 3 A93 A101 4 A124 53 A143 A153 2 A173
## V18 V19 V20 V21
## 1 1 A192 A201 1
## 2 1 A191 A201 2
## 3 2 A191 A201 1
## 4 2 A191 A201 1
## 5 2 A191 A201 2
str(df)
## 'data.frame': 1000 obs. of 21 variables:
## $ V1 : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ V2 : int 6 48 12 42 24 36 24 36 12 30 ...
## $ V3 : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ V4 : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ V5 : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ V6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ V7 : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ V8 : int 4 2 2 2 3 2 3 2 2 4 ...
## $ V9 : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ V10: Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ V11: int 4 2 3 4 4 4 4 2 4 2 ...
## $ V12: Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ V13: int 67 22 49 45 53 35 53 35 61 28 ...
## $ V14: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ V15: Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ V16: int 2 1 1 1 2 1 1 1 1 2 ...
## $ V17: Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ V18: int 1 1 2 2 2 2 1 1 1 1 ...
## $ V19: Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ V20: Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ V21: int 1 2 1 1 2 1 1 1 1 2 ...
As it stands, the data doesn’t tell us much. We don’t know what each column represents, let alone what the values within each column mean. Our first task is to clean it so that we can perform initial exploration.
Since this data frame is small (only 21 columns, with at most 10 levels in any factor column), we will build a new data frame, “german_credit”, from the loaded one, column by column, with descriptive names and labels.
First we add the Class attribute, which indicates whether the profile is good or bad. In the raw data, V21 is coded 1 for good and 2 for bad.
german_credit <- data.frame(Class = df$V21)
german_credit$Class <- 'Good'
german_credit$Class[df$V21 == 2] <- 'Bad'
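Class is stored above as a character vector. Many R modelling functions expect a factor response for classification, so one option is to build the factor in a single step. The following is a minimal sketch on a toy vector; `raw_class`, `class_factor` and `class_counts` are hypothetical names, with `raw_class` standing in for df$V21.

```r
# Toy stand-in for df$V21 (hypothetical values): 1 = good, 2 = bad
raw_class <- c(1, 2, 1, 1, 2)

# Map the numeric codes to labels explicitly; 'Bad' is listed first,
# so it becomes the first (reference) level
class_factor <- factor(raw_class, levels = c(2, 1), labels = c('Bad', 'Good'))

# Count how many profiles fall in each class
class_counts <- table(class_factor)
print(class_counts)
```

Whether 'Bad' or 'Good' should be the first level depends on the modelling function used later, so treat the ordering here as an assumption.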
Attribute 1 (qualitative): Status of existing checking account
- A11 : … < 0 DM (we will say ‘lt.0’)
- A12 : 0 <= … < 200 DM (we will say ‘0.to.200’)
- A13 : … >= 200 DM / salary assignments for at least 1 year (we will say ‘gt.200’)
- A14 : no checking account (we will say ‘none’)
german_credit$CheckingAccountStatus <- df$V1
levels(german_credit$CheckingAccountStatus) <- c('lt.0'
,'0.to.200'
,'gt.200'
,'none'
)
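Assigning to levels() relabels by position, so the code above relies on the raw codes being sorted as A11, A12, A13, A14. A hedged alternative is to name each code explicitly, so that a missing or reordered code cannot silently shift the labels. The sketch below uses a toy vector; `checking_codes`, `checking_labels` and `checking_status` are hypothetical names, not part of the original code.

```r
# Toy stand-in for df$V1 (hypothetical values)
checking_codes <- c('A11', 'A12', 'A14', 'A11', 'A13')

# Explicit code -> label lookup, taken from the attribute description
checking_labels <- c(A11 = 'lt.0', A12 = '0.to.200', A13 = 'gt.200', A14 = 'none')

# factor() pairs each code in `levels` with the label at the same position
checking_status <- factor(checking_codes,
                          levels = names(checking_labels),
                          labels = unname(checking_labels))
print(checking_status)
```

The same pattern would apply to every other qualitative column, at the cost of writing out each code once.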
german_credit$Duration <- df$V2
german_credit$CreditHistory <- df$V3
levels(german_credit$CreditHistory) <- c(
'NoCredit.AllPaid'
,'ThisBank.AllPaid'
,'PaidDuly'
,'Delay'
,'Critical'
)
german_credit$Purpose <- df$V4
levels(german_credit$Purpose) <- c(
'NewCar'
,'UsedCar'
,'Others'
,'Furniture.Equipment'
,'Radio.Television'
,'DomesticAppliance'
,'Repairs'
,'Education'
,'Retraining'
,'Business'
)
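The position of 'Others' in the list above follows from how R orders factor levels: the codes are sorted as strings, so A410 falls between A41 and A42 rather than at the end. A small check on a toy vector illustrates this; `purpose_codes` and `purpose_levels` are hypothetical names.

```r
# Toy stand-in for a few of the V4 codes (hypothetical selection)
purpose_codes <- c('A42', 'A410', 'A40', 'A41')

# factor() orders levels by sorting the strings, not by numeric value
purpose_levels <- levels(factor(purpose_codes))
print(purpose_levels)
```

Running levels(df$V4) on the real data before relabelling is a cheap way to confirm that the label vector lines up with the codes.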
german_credit$Amount <- df$V5
german_credit$SavingsAccountBonds <- df$V6
levels(german_credit$SavingsAccountBonds) <- c(
'lt.100'
,'100.to.500'
,'500.to.1000'
,'gt.1000'
,'Unknown'
)
german_credit$EmploymentDuration <- df$V7
levels(german_credit$EmploymentDuration) <- c(
'Unemployed'
,'0.to.1'
,'1.to.4'
,'4.to.7'
,'gt.7'
)
german_credit$InstallmentRatePercentage <- df$V8
german_credit$Personal <- df$V9
levels(german_credit$Personal) <- c(
'Male.Divorced.Seperated'
,'Female.NotSingle'
,'Male.Single'
,'Male.Married.Widowed'
)
german_credit$OtherDebtorsGuarantors <- df$V10
levels(german_credit$OtherDebtorsGuarantors) <- c(
'None'
,'CoApplicant'
,'Guarantor'
)
german_credit$ResidenceDuration <- df$V11
german_credit$Property <- df$V12
levels(german_credit$Property) <- c(
'RealEstate'
,'Insurance'
,'CarOther'
,'Unknown'
)
german_credit$Age <- df$V13
german_credit$OtherInstallmentPlans <- df$V14
levels(german_credit$OtherInstallmentPlans) <- c(
'Bank'
,'Stores'
,'None'
)
german_credit$Housing <- df$V15
levels(german_credit$Housing) <- c('Rent', 'Own', 'ForFree')
german_credit$NumberExistingCredits <- df$V16
german_credit$Job <- df$V17
levels(german_credit$Job) <- c(
'UnemployedUnskilled'
,'UnskilledResident'
,'SkilledEmployee'
,'Management.SelfEmp.HighlyQualified'
)
german_credit$NumberPeopleMaintenance <- df$V18
german_credit$Telephone <- df$V19
levels(german_credit$Telephone) <- c(
0
,1
)
german_credit$ForeignWorker <- df$V20
levels(german_credit$ForeignWorker) <- c(1, 0)
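Telephone and ForeignWorker remain factors after this step; the 0 and 1 are level labels stored as text, not numbers. If a later step needs numeric indicators, the conversion has to go through as.character(). A minimal sketch on a toy vector, where `telephone`, `index_values` and `label_values` are hypothetical names:

```r
# Toy stand-in for df$V19 (hypothetical values)
telephone <- factor(c('A192', 'A191', 'A191'))
levels(telephone) <- c(0, 1)

# as.numeric() on a factor returns the internal level indices, not the labels
index_values <- as.numeric(telephone)

# Going through as.character() recovers the 0/1 labels as numbers
label_values <- as.numeric(as.character(telephone))

print(index_values)
print(label_values)
```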
Finally, let’s save the cleaned data frame to disk, both as an R data file and as a CSV, so that we can load it later without repeating these steps.
save(german_credit, file = 'german_credit')
write.csv(german_credit, 'german_credit_full.csv',
row.names = FALSE)
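To pick the data up again in a later session, both kinds of file can be read back. The following is a minimal sketch on a small stand-in data frame written to a temporary directory; `toy_credit`, `rdata_path`, `csv_path`, `restored_env` and `from_csv` are hypothetical names used only here.

```r
# Small stand-in for german_credit (hypothetical values)
toy_credit <- data.frame(Class = factor(c('Good', 'Bad')),
                         Duration = c(6L, 48L))

rdata_path <- file.path(tempdir(), 'toy_credit.RData')
csv_path <- file.path(tempdir(), 'toy_credit.csv')

save(toy_credit, file = rdata_path)
write.csv(toy_credit, csv_path, row.names = FALSE)

# load() restores the object under its original name, factor types included;
# loading into a separate environment avoids overwriting anything in the workspace
restored_env <- new.env()
load(rdata_path, envir = restored_env)

# read.csv() returns text columns on R >= 4.0 unless factors are requested
from_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
str(from_csv)
```

The R data file preserves factor levels and their order exactly, while the CSV is easier to share with other tools but loses the level order.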