#Name of dataset: default of credit card clients Data Set

#Data Set Information:

This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel "Sorting Smoothing Method" to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default.

#Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- X4: Marital status (1 = married; 2 = single; 3 = others).
- X5: Age (year).
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .; X23 = amount paid in April, 2005.
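The data set description evaluates a model by regressing the real probability of default (Y) on the predicted probability (X) and checking that the intercept A is near zero and the slope B is near one. A minimal sketch of that check in base R is given here. The data are simulated purely for illustration (they are not taken from the credit card data set), and the names `x`, `y`, and `fit` are assumptions of this example.

```r
# Simulated stand-in data: nothing here comes from the credit card data set.
set.seed(1)
n <- 500
x <- runif(n)                   # predicted probability of default (X)
y <- x + rnorm(n, sd = 0.05)    # simulated "real" probability (Y); may fall slightly outside [0, 1]

fit <- lm(y ~ x)                # simple linear regression Y = A + B * X
A   <- unname(coef(fit)[1])     # regression intercept
B   <- unname(coef(fit)[2])     # regression coefficient (slope)
R2  <- summary(fit)$r.squared   # coefficient of determination

round(c(A = A, B = B, R2 = R2), 3)
```

A well-calibrated predictor would give A close to 0, B close to 1, and a high coefficient of determination, which is the criterion the description uses to compare the six methods.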

library(XLConnect)

## Loading required package: XLConnectJars

## XLConnect 0.2-15 by Mirai Solutions GmbH [aut],
##   Martin Studer [cre],
##   The Apache Software Foundation [ctb, cph] (Apache POI),
##   Graph Builder [ctb, cph] (Curvesapi Java library)

## http://www.mirai-solutions.com
## https://github.com/miraisolutions/xlconnect

library(readxl)
library(plyr)

## Warning: package 'plyr' was built under R version 3.6.3

library(e1071)

## Warning: package 'e1071' was built under R version 3.6.3

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.6.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

library(naniar) # for missing values

## Warning: package 'naniar' was built under R version 3.6.3

library(corrplot)

## Warning: package 'corrplot' was built under R version 3.6.3

## corrplot 0.84 loaded

library(caret)   # Classification and Regression Training

## Warning: package 'caret' was built under R version 3.6.3

## Loading required package: lattice

library(tidyr)   # Easily Tidy Data 
library(dplyr)   # A Grammar of Data Manipulation

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(pROC)    # Display and Analyze ROC Curves

## Warning: package 'pROC' was built under R version 3.6.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(car)     # Box-Cox, Yeo-Johnson and Basic Power Transformations

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(lmtest)  # Testing Linear Regression Models

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(e1071)   # Tuning of Functions Using Grid Search, Support Vector Machines
library(nnet)    # Feed-Forward Neural Networks and Multinomial Log-Linear Models
library(ranger)  # A Fast Implementation of Random Forests

## Warning: package 'ranger' was built under R version 3.6.2

## 
## Attaching package: 'ranger'

## The following object is masked from 'package:randomForest':
## 
##     importance

library(sandwich)    # Robust Covariance Matrix Estimators
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.6.2

## Loading required package: rpart

library(VIM)

## Loading required package: colorspace

## 
## Attaching package: 'colorspace'

## The following object is masked from 'package:pROC':
## 
##     coords

## Loading required package: grid

## VIM is ready to use.

## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

#setup dataset

set.seed(123)
mydata1 <- loadWorkbook("D:/BIG DATA SGH/semester2/STATISTIC LEARNING METHOD/default of credit card clients.xls") 
mydata2 <- readWorksheet(mydata1, sheet = 1, startRow =2, startCol = 1, autofitCol = TRUE)
View(mydata2)

Preparing our data set

##checking our dataset

glimpse(mydata2)

## Observations: 30,000
## Variables: 25
## $ ID                         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ LIMIT_BAL                  <dbl> 20000, 120000, 90000, 50000, 50000, 5000...
## $ SEX                        <dbl> 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1...
## $ EDUCATION                  <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2...
## $ MARRIAGE                   <dbl> 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2...
## $ AGE                        <dbl> 24, 26, 34, 37, 57, 37, 29, 23, 28, 35, ...
## $ PAY_0                      <dbl> 2, -1, 0, 0, -1, 0, 0, 0, 0, -2, 0, -1, ...
## $ PAY_2                      <dbl> 2, 2, 0, 0, 0, 0, 0, -1, 0, -2, 0, -1, 0...
## $ PAY_3                      <dbl> -1, 0, 0, 0, -1, 0, 0, -1, 2, -2, 2, -1,...
## $ PAY_4                      <dbl> -1, 0, 0, 0, 0, 0, 0, 0, 0, -2, 0, -1, -...
## $ PAY_5                      <dbl> -2, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, -1, -...
## $ PAY_6                      <dbl> -2, 2, 0, 0, 0, 0, 0, -1, 0, -1, -1, 2, ...
## $ BILL_AMT1                  <dbl> 3913, 2682, 29239, 46990, 8617, 64400, 3...
## $ BILL_AMT2                  <dbl> 3102, 1725, 14027, 48233, 5670, 57069, 4...
## $ BILL_AMT3                  <dbl> 689, 2682, 13559, 49291, 35835, 57608, 4...
## $ BILL_AMT4                  <dbl> 0, 3272, 14331, 28314, 20940, 19394, 542...
## $ BILL_AMT5                  <dbl> 0, 3455, 14948, 28959, 19146, 19619, 483...
## $ BILL_AMT6                  <dbl> 0, 3261, 15549, 29547, 19131, 20024, 473...
## $ PAY_AMT1                   <dbl> 0, 0, 1518, 2000, 2000, 2500, 55000, 380...
## $ PAY_AMT2                   <dbl> 689, 1000, 1500, 2019, 36681, 1815, 4000...
## $ PAY_AMT3                   <dbl> 0, 1000, 1000, 1200, 10000, 657, 38000, ...
## $ PAY_AMT4                   <dbl> 0, 1000, 1000, 1100, 9000, 1000, 20239, ...
## $ PAY_AMT5                   <dbl> 0, 0, 1000, 1069, 689, 1000, 13750, 1687...
## $ PAY_AMT6                   <dbl> 0, 2000, 5000, 1000, 679, 800, 13770, 15...
## $ default.payment.next.month <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...

##checking our dataset variable format

sapply(mydata2, class)

##                         ID                  LIMIT_BAL 
##                  "numeric"                  "numeric" 
##                        SEX                  EDUCATION 
##                  "numeric"                  "numeric" 
##                   MARRIAGE                        AGE 
##                  "numeric"                  "numeric" 
##                      PAY_0                      PAY_2 
##                  "numeric"                  "numeric" 
##                      PAY_3                      PAY_4 
##                  "numeric"                  "numeric" 
##                      PAY_5                      PAY_6 
##                  "numeric"                  "numeric" 
##                  BILL_AMT1                  BILL_AMT2 
##                  "numeric"                  "numeric" 
##                  BILL_AMT3                  BILL_AMT4 
##                  "numeric"                  "numeric" 
##                  BILL_AMT5                  BILL_AMT6 
##                  "numeric"                  "numeric" 
##                   PAY_AMT1                   PAY_AMT2 
##                  "numeric"                  "numeric" 
##                   PAY_AMT3                   PAY_AMT4 
##                  "numeric"                  "numeric" 
##                   PAY_AMT5                   PAY_AMT6 
##                  "numeric"                  "numeric" 
## default.payment.next.month 
##                  "numeric"

##Checking for missing values

vis_miss(mydata2)

aggr(mydata2, numbers = TRUE, prop = c(TRUE, FALSE))
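The two plots above show missingness visually. As a complement, the counts can be read off numerically with base R. This is only a sketch on a toy data frame with hypothetical values (`toy` is not the real data set); on the real data the same calls would be applied to `mydata2`.

```r
# Toy data frame (hypothetical values) with a few deliberate NAs.
toy <- data.frame(
  LIMIT_BAL = c(20000, NA, 90000, 50000),
  AGE       = c(24, 26, NA, NA),
  SEX       = c(2, 2, 2, 1)
)

na_count <- colSums(is.na(toy))    # number of missing values per column
na_share <- colMeans(is.na(toy))   # proportion of missing values per column

na_count
na_share
anyNA(toy)                         # TRUE if at least one value is missing
```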

##Checking for correlation between numeric variables

mydata2$default.payment.next.month = as.numeric(mydata2$default.payment.next.month)
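# Drop SEX, EDUCATION and MARRIAGE (columns 3-5); mydata2 has only 25 columns here,
# so the former indices 26 and 27 had no effect.
r = cor(mydata2[-c(3, 4, 5)])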
corrplot(r, method = "circle")

pairs(mydata2)
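Selecting columns by position is fragile once columns are added or removed. One possible alternative, sketched below on a toy data frame with hypothetical values, is to exclude columns by name before calling `cor()`. Dropping `ID` as well is an assumption of this sketch, not something the original code does.

```r
# Toy data frame with hypothetical values; column names mirror the real data set.
toy <- data.frame(
  ID        = 1:6,
  LIMIT_BAL = c(10000, 30000, 20000, 60000, 50000, 40000),
  SEX       = c(2, 2, 1, 1, 2, 1),
  EDUCATION = c(2, 1, 3, 2, 1, 2),
  MARRIAGE  = c(1, 2, 2, 1, 1, 2),
  AGE       = c(24, 31, 28, 45, 39, 35)
)

# Exclude the identifier and the coded categorical variables by name.
drop_cols <- c("ID", "SEX", "EDUCATION", "MARRIAGE")
keep_cols <- setdiff(names(toy), drop_cols)

r_toy <- cor(toy[keep_cols])   # correlation matrix of the remaining numeric columns
r_toy
```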

##Convert numeric variables to factors

# 1. DEFAULT to a categorical variable
mydata2$default.payment.next.month = as.factor(mydata2$default.payment.next.month)
# 2. SEX to a categorical variable
mydata2$SEX = as.factor(mydata2$SEX)

##checking dataframe

sapply(mydata2, class)

##                         ID                  LIMIT_BAL 
##                  "numeric"                  "numeric" 
##                        SEX                  EDUCATION 
##                   "factor"                  "numeric" 
##                   MARRIAGE                        AGE 
##                  "numeric"                  "numeric" 
##                      PAY_0                      PAY_2 
##                  "numeric"                  "numeric" 
##                      PAY_3                      PAY_4 
##                  "numeric"                  "numeric" 
##                      PAY_5                      PAY_6 
##                  "numeric"                  "numeric" 
##                  BILL_AMT1                  BILL_AMT2 
##                  "numeric"                  "numeric" 
##                  BILL_AMT3                  BILL_AMT4 
##                  "numeric"                  "numeric" 
##                  BILL_AMT5                  BILL_AMT6 
##                  "numeric"                  "numeric" 
##                   PAY_AMT1                   PAY_AMT2 
##                  "numeric"                  "numeric" 
##                   PAY_AMT3                   PAY_AMT4 
##                  "numeric"                  "numeric" 
##                   PAY_AMT5                   PAY_AMT6 
##                  "numeric"                  "numeric" 
## default.payment.next.month 
##                   "factor"

##Checking the distribution of 'default payment' values

table(mydata2$default.payment.next.month)

## 
##     0     1 
## 23364  6636
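The counts above imply a majority-class baseline: a model that always predicts "no default" would already be right for the larger class. The short sketch below works through that arithmetic using the two counts printed above; the variable names are assumptions of this example.

```r
# Class counts taken from the table() output above.
counts <- c(`0` = 23364, `1` = 6636)

props <- counts / sum(counts)      # class proportions
baseline_accuracy <- max(props)    # accuracy of always predicting the majority class

round(props, 4)
baseline_accuracy
```

Any classifier fitted later should be judged against this baseline rather than against 50%.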

#Add a GENDER feature with readable labels

mydata2$GENDER = ifelse(mydata2$SEX == 1, "Male", "Female")
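An alternative to a separate character column is to attach the labels to the factor itself. The sketch below uses a short vector of hypothetical codes and the coding from the attribute list (1 = male, 2 = female); `sex_code` and `gender` are illustrative names.

```r
# Hypothetical codes; 1 = male, 2 = female as in the attribute information.
sex_code <- c(2, 2, 1, 1, 2)

# factor() maps each code in `levels` to the label at the same position in `labels`.
gender <- factor(sex_code, levels = c(1, 2), labels = c("Male", "Female"))

gender
table(gender)
```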

Plotting Bar graph for GENDER

ggplot(data = mydata2, mapping = aes(x = GENDER, fill =default.payment.next.month)) +
  geom_bar() +
  ggtitle("Gender") +
  stat_count(aes(label = ..count..), geom = "label")

#Merging EDUCATION values 0, 5 and 6 into 4 (others)

mydata2$EDUCATION = ifelse(mydata2$EDUCATION == 0 |mydata2$EDUCATION == 5 | mydata2$EDUCATION == 6,
       4, mydata2$EDUCATION)
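The same recoding can be written more compactly with `%in%`. The following is a sketch on a hypothetical vector of education codes; the labels come from the attribute list (X3), and the names `edu`, `edu_merged`, and `edu_factor` are assumptions of this example.

```r
# Hypothetical education codes, including the undocumented values 0, 5 and 6.
edu <- c(0, 1, 2, 3, 4, 5, 6, 2)

# Fold every undocumented code into 4 ("others") in one comparison.
edu_merged <- ifelse(edu %in% c(0, 5, 6), 4, edu)

# Attach the documented labels while converting to a factor.
edu_factor <- factor(
  edu_merged,
  levels = 1:4,
  labels = c("graduate school", "university", "high school", "others")
)

table(edu_factor)
```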

#Converting EDUCATION to a categorical variable

mydata2$EDUCATION = as.factor(mydata2$EDUCATION)

#Plotting Bar graph for EDUCATION

ggplot(data = mydata2, mapping = aes(x = EDUCATION, fill = default.payment.next.month)) +
  geom_bar() +
  ggtitle("EDUCATION") +
  stat_count(aes(label = ..count..), geom = "label")

#Checking the unique value of MARRIAGE variable

unique(mydata2$MARRIAGE)

## [1] 1 2 3 0

Merge 3 into 0 (i.e. others)

mydata2$MARRIAGE = ifelse(mydata2$MARRIAGE == 3, 0, mydata2$MARRIAGE)

#Convert to a categorical variable

mydata2$MARRIAGE = as.factor(mydata2$MARRIAGE)
table(mydata2$MARRIAGE)

## 
##     0     1     2 
##   377 13659 15964

#Plotting Bar graph for MARRIAGE

ggplot(data = mydata2, mapping = aes(x = MARRIAGE , fill = default.payment.next.month)) +
  geom_bar() +
  xlab("Marital status") +
  ggtitle(" Defaulters on Marital Status") +
  stat_count(aes(label = ..count..), geom = "label")

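## merging values 0 and 3 for marriage

# MARRIAGE is already a factor with levels 0, 1, 2 here, so no value equals 3.
# ifelse() then returns the factor's integer codes, which relabels the levels
# as 1, 2, 3 (1 = others, 2 = married, 3 = single).
mydata2$MARRIAGE = ifelse(mydata2$MARRIAGE == 3, 0, mydata2$MARRIAGE)
mydata2$MARRIAGE = as.factor(mydata2$MARRIAGE)
table(mydata2$MARRIAGE)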

## 
##     1     2     3 
##   377 13659 15964
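The counts are unchanged but the level labels moved from 0, 1, 2 to 1, 2, 3. This happens because `ifelse()` applied to a factor returns the underlying integer codes rather than the labels. The sketch below reproduces the effect on a small hypothetical factor and shows one way to avoid it by converting to numeric through character first.

```r
# Hypothetical factor with the same levels as MARRIAGE after the first merge.
m <- factor(c(0, 1, 2, 1, 2))

# No element equals 3, so every element comes from the `no` branch,
# and the factor is reduced to its integer codes (1, 2, 3).
recoded <- ifelse(m == 3, 0, m)
recoded

# Safer route: recover the original numeric values before recoding.
m_num <- as.numeric(as.character(m))
safe  <- ifelse(m_num == 3, 0, m_num)
safe
```

After this step the codes in the data no longer match the attribute list (where 1 = married and 2 = single), which matters when reading model output for `MARRIAGE`.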

##Changing variable pay_0 into a categorical variable

mydata2$PAY_0 = as.factor(mydata2$PAY_0)

ggplot(data = mydata2, mapping = aes(x = PAY_0 , fill = default.payment.next.month)) +
  geom_bar() +
  xlab("pay_0") +
  ggtitle(" Defaulters regarding pay_0s") +
  stat_count(aes(label = ..count..), geom = "label")

## Distribution of LIMIT_BAL with its mean

ggplot(data = mydata2, mapping = aes(x = LIMIT_BAL)) + 
  geom_density(fill = "#f0f9a7") +
  ggtitle("LIMIT_BAL Distribution") +
  xlab("LIMIT_BAL") +
  geom_vline(xintercept = mean(mydata2$LIMIT_BAL), col = "green", 
             linetype = "dashed", size = 0.6) +
  annotate("text", 
           x = -Inf, y = Inf, 
           label = paste("Mean:", round(mean(mydata2$LIMIT_BAL), digits = 2)), 
           hjust = 0, vjust = 1, col = "blue", size = 3)
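The minimum, maximum, and mean can also be computed directly instead of being read from the plot. The sketch below uses only the first five `LIMIT_BAL` values shown in the `glimpse()` output, so its results describe those five rows and not the full data set; on the real data the calls would take `mydata2$LIMIT_BAL`.

```r
# First five LIMIT_BAL values as printed by glimpse() earlier in the document.
limit_bal <- c(20000, 120000, 90000, 50000, 50000)

lb_range <- range(limit_bal)   # c(minimum, maximum)
lb_mean  <- mean(limit_bal)

lb_range
lb_mean
summary(limit_bal)             # min, quartiles, mean and max in one call
```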

## keeping the relevant columns (dropping ID and GENDER)

remove_feature = c(1, 26)
modify.data2 = mydata2[, -remove_feature]

##checking final dataset

str(modify.data2)

## 'data.frame':    30000 obs. of  24 variables:
##  $ LIMIT_BAL                 : num  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ SEX                       : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION                 : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 1 1 2 3 3 ...
##  $ MARRIAGE                  : Factor w/ 3 levels "1","2","3": 2 3 3 2 2 3 3 3 2 3 ...
##  $ AGE                       : num  24 26 34 37 57 37 29 23 28 35 ...
##  $ PAY_0                     : Factor w/ 11 levels "-2","-1","0",..: 5 2 3 3 2 3 3 3 3 1 ...
##  $ PAY_2                     : num  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ PAY_3                     : num  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ PAY_4                     : num  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ PAY_5                     : num  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ PAY_6                     : num  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ BILL_AMT1                 : num  3913 2682 29239 46990 8617 ...
##  $ BILL_AMT2                 : num  3102 1725 14027 48233 5670 ...
##  $ BILL_AMT3                 : num  689 2682 13559 49291 35835 ...
##  $ BILL_AMT4                 : num  0 3272 14331 28314 20940 ...
##  $ BILL_AMT5                 : num  0 3455 14948 28959 19146 ...
##  $ BILL_AMT6                 : num  0 3261 15549 29547 19131 ...
##  $ PAY_AMT1                  : num  0 0 1518 2000 2000 ...
##  $ PAY_AMT2                  : num  689 1000 1500 2019 36681 ...
##  $ PAY_AMT3                  : num  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ PAY_AMT4                  : num  0 1000 1000 1100 9000 ...
##  $ PAY_AMT5                  : num  0 0 1000 1069 689 ...
##  $ PAY_AMT6                  : num  0 2000 5000 1000 679 ...
##  $ default.payment.next.month: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...

#Modeling

##Sampling

index = createDataPartition(y = modify.data2$default.payment.next.month, times = 1, 
                            p = 0.8, list = F)
trd = modify.data2[index, ]
tsd = modify.data2[-index, ]
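`createDataPartition()` samples within each class so that the class proportions are preserved in both subsets. The idea can be sketched in base R as follows. This is an illustration of stratified sampling on a hypothetical label vector, not a reimplementation of caret's function.

```r
set.seed(123)

# Hypothetical labels with an 80/20 class mix.
y <- factor(rep(c(0, 1), times = c(80, 20)))

# Sample 80% of the row indices separately within each class.
train_idx <- unlist(
  lapply(split(seq_along(y), y), function(idx) {
    sample(idx, size = round(0.8 * length(idx)))
  }),
  use.names = FALSE
)

train_y <- y[train_idx]
test_y  <- y[-train_idx]

prop.table(table(train_y))   # class shares in the training part
prop.table(table(test_y))    # class shares in the testing part
```

Both subsets keep the 80/20 mix of the full vector, which is what the proportion checks that follow verify for the real data.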

##Checking the distribution of our DEFAULT variable in our data sets

prop.table(table(modify.data2$default.payment.next.month))

## 
##      0      1 
## 0.7788 0.2212

##our training set

prop.table(table(trd$default.payment.next.month))

## 
##         0         1 
## 0.7788009 0.2211991

#our testing set
prop.table(table(tsd$default.payment.next.month))

## 
##         0         1 
## 0.7787965 0.2212035

###Fitting a Random Forest Classifier to our training set

classifier.rf = randomForest(formula = default.payment.next.month ~., 
                           data = trd, ntree = 10)
summary(classifier.rf)

##                 Length Class  Mode     
## call                4  -none- call     
## type                1  -none- character
## predicted       24001  factor numeric  
## err.rate           30  -none- numeric  
## confusion           6  -none- numeric  
## votes           48002  matrix numeric  
## oob.times       24001  -none- numeric  
## classes             2  -none- character
## importance         23  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             14  -none- list     
## y               24001  factor numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call

Predicting results for testing data

our.predict.rf = predict(classifier.rf, newdata = tsd, type = "class")

#####Evaluating model

confusionMatrix(our.predict.rf, tsd$default.payment.next.month)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4351  859
##          1  321  468
##                                          
##                Accuracy : 0.8033         
##                  95% CI : (0.793, 0.8133)
##     No Information Rate : 0.7788         
##     P-Value [Acc > NIR] : 1.941e-06      
##                                          
##                   Kappa : 0.3322         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9313         
##             Specificity : 0.3527         
##          Pos Pred Value : 0.8351         
##          Neg Pred Value : 0.5932         
##              Prevalence : 0.7788         
##          Detection Rate : 0.7253         
##    Detection Prevalence : 0.8685         
##       Balanced Accuracy : 0.6420         
##                                          
##        'Positive' Class : 0              
##
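The summary statistics follow directly from the four cell counts, with class 0 treated as the positive class. The sketch below rebuilds the matrix from the counts printed above and recomputes the main figures by hand; the variable names are assumptions of this example.

```r
# Cell counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(
  c(4351, 321, 859, 468),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

accuracy    <- sum(diag(cm)) / sum(cm)             # share of correct predictions
sensitivity <- cm["0", "0"] / sum(cm[, "0"])       # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])       # true 1s predicted as 1
balanced    <- (sensitivity + specificity) / 2     # balanced accuracy

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

Because the positive class is 0, the specificity of about 0.35 is the share of actual defaulters that the model identifies, which is the figure of most interest for risk management.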

##Evaluating Model using K-fold Cross-Validation

fold = createFolds(y = trd$default.payment.next.month, k = 10)
cv_rf = lapply(fold, function(x){
  trd_fold = trd[-x, ]
  tsd_fold = trd[x, ]
  
  classifier.rf_f = randomForest(formula =default.payment.next.month ~., 
                               data = trd_fold, ntree = 20)
  our.predict.rf_f = predict(classifier.rf_f, newdata = tsd_fold, type = "class")
  
  rf_cm_f = confusionMatrix(our.predict.rf_f, tsd_fold$default.payment.next.month)
  accuracy = rf_cm_f$overall[1]
  
  return(accuracy)
})

Mean cross-validated accuracy (predicting next month's default payment for the study population)

mean(as.numeric(cv_rf))

## [1] 0.8135912
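The fold-by-fold procedure above can be sketched without any packages. The example below simulates a small binary data set and swaps the random forest for a logistic regression (`glm`) so that it runs in base R. The data, the model, and all names are assumptions made for illustration only.

```r
set.seed(123)

# Simulated data: two predictors and a binary outcome (not the credit card data).
n   <- 300
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- factor(rbinom(n, 1, plogis(1.5 * x1 - x2)))
toy <- data.frame(x1, x2, y)

# Assign each row to one of k folds at random.
k       <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
acc <- sapply(seq_len(k), function(i) {
  train <- toy[fold_id != i, ]
  test  <- toy[fold_id == i, ]
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  p     <- predict(fit, newdata = test, type = "response")   # P(y = "1")
  pred  <- ifelse(p > 0.5, "1", "0")
  mean(pred == as.character(test$y))                         # fold accuracy
})

c(mean = mean(acc), sd = sd(acc))
```

Reporting the standard deviation alongside the mean gives a sense of how much the accuracy varies from fold to fold.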


BIGDATA_FINAL_PROJECT

sajad