Assignment 3 Notebook

A.1.

cc_data <- read.csv('UCI_Credit_Card.csv')
sapply(cc_data,class)

##                         ID                  LIMIT_BAL 
##                  "integer"                  "numeric" 
##                        SEX                  EDUCATION 
##                  "integer"                  "integer" 
##                   MARRIAGE                        AGE 
##                  "integer"                  "integer" 
##                      PAY_0                      PAY_2 
##                  "integer"                  "integer" 
##                      PAY_3                      PAY_4 
##                  "integer"                  "integer" 
##                      PAY_5                      PAY_6 
##                  "integer"                  "integer" 
##                  BILL_AMT1                  BILL_AMT2 
##                  "numeric"                  "numeric" 
##                  BILL_AMT3                  BILL_AMT4 
##                  "numeric"                  "numeric" 
##                  BILL_AMT5                  BILL_AMT6 
##                  "numeric"                  "numeric" 
##                   PAY_AMT1                   PAY_AMT2 
##                  "numeric"                  "numeric" 
##                   PAY_AMT3                   PAY_AMT4 
##                  "numeric"                  "numeric" 
##                   PAY_AMT5                   PAY_AMT6 
##                  "numeric"                  "numeric" 
## default.payment.next.month 
##                  "integer"

LIMIT_BAL is numeric. It is the credit limit, in dollars, extended to the cardholder and their family. Whether the limit is low or high could help indicate whether they will default on the next payment.

PAY_0, PAY_2, PAY_3, PAY_4, PAY_5 and PAY_6 are integer data types. They record the repayment status for each month from April to September 2005. Because each column is tied to a month, they can show whether a period falls in vacation or school time, which could potentially affect default.

BILL_AMT1 through BILL_AMT6 are all numeric data types. Each is a bill statement amount and shows how much a person needs to pay back. Whether that amount is high or low can help indicate whether the person will default on the payment.

PAY_AMT1 through PAY_AMT6 are all numeric data types. Each is a previous payment amount, showing how much the person paid in an earlier month, and can be used to predict the next month's payment.

A.2.

BILL_AMT1 and PAY_AMT1 can be paired because one is the bill amount and the other is how much the person paid last time. One way to combine them is to take the difference between them and use it to predict default.

PAY_AMT1 and PAY_AMT6 can be paired because both are previous payments. They cover September 2005 and April 2005, so together they give a broader view of previous payments across the months. These variables could be combined as a ratio.
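The two combinations described above can be sketched on a few toy rows. The values and the derived column names `unpaid1` and `pay_ratio_1_6` are assumptions for illustration, not part of the assignment data.

```r
# Toy rows standing in for the real data; column names follow the data set.
toy <- data.frame(
  BILL_AMT1 = c(3913, 2682, 0),
  PAY_AMT1  = c(0, 1000, 0),
  PAY_AMT6  = c(0, 2000, 500)
)

# Difference: the part of the bill not covered by the payment
# (hypothetical column name).
toy$unpaid1 <- toy$BILL_AMT1 - toy$PAY_AMT1

# Ratio of PAY_AMT1 to PAY_AMT6; NA where the denominator is zero,
# so no Inf or NaN values reach a model.
toy$pay_ratio_1_6 <- ifelse(toy$PAY_AMT6 > 0,
                            toy$PAY_AMT1 / toy$PAY_AMT6,
                            NA_real_)

print(toy)
```

Rows with an NA ratio would need to be dropped or imputed before modelling; which is better depends on how many zero payments there are.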

B.1.

head(cc_data)

cc_data$SEX <- as.factor(cc_data$SEX)
cc_data$EDUCATION <- as.factor(cc_data$EDUCATION)
cc_data$MARRIAGE <- as.factor(cc_data$MARRIAGE)
cc_data$default.payment.next.month <- as.factor(cc_data$default.payment.next.month)
class(cc_data$SEX)

## [1] "factor"

class(cc_data$EDUCATION)

## [1] "factor"

class(cc_data$MARRIAGE)

## [1] "factor"

class(cc_data$default.payment.next.month)

## [1] "factor"

B.2.

ccnn <- subset(cc_data, BILL_AMT1 >= 0 & BILL_AMT2 >= 0 & BILL_AMT3 >= 0 & BILL_AMT4 >= 0 & BILL_AMT5 >= 0 & BILL_AMT6 >= 0 & PAY_AMT1 >= 0 & PAY_AMT2 >= 0 & PAY_AMT3 >= 0 & PAY_AMT4 >= 0 & PAY_AMT5 >= 0 & PAY_AMT6 >= 0 )

nrow(ccnn)

## [1] 28070

View(ccnn)
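The same non-negativity filter can be written without listing all twelve columns. This is a sketch on toy data that assumes the amount columns follow the BILL_AMT and PAY_AMT naming pattern and contain no missing values; `toy`, `amt_cols`, `keep`, and `toy_nn` are illustrative names.

```r
# Toy data with the same naming pattern as the credit card file.
toy <- data.frame(
  ID        = 1:4,
  BILL_AMT1 = c(100, -5, 0, 20),
  BILL_AMT2 = c(50, 10, 0, 30),
  PAY_AMT1  = c(10, 0, 0, 5),
  PAY_AMT2  = c(0, 0, 0, -1)
)

# Pick every bill and payment amount column by name.
amt_cols <- grep("^(BILL|PAY)_AMT[0-9]+$", names(toy), value = TRUE)

# Keep a row only if none of its amount columns is negative.
keep   <- rowSums(toy[amt_cols] < 0) == 0
toy_nn <- toy[keep, ]

nrow(toy_nn)
```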

C.1.

r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
install.packages("rpart.plot")

## Installing package into 'C:/Users/Toby/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)

## package 'rpart.plot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Toby\AppData\Local\Temp\RtmpkBw7Cw\downloaded_packages

install.packages("e1071")

## Installing package into 'C:/Users/Toby/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)

## 
##   There is a binary version available but the source version is
##   later:
##       binary source needs_compilation
## e1071  1.7-2  1.7-3              TRUE
## 
##   Binaries will be installed
## package 'e1071' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Toby\AppData\Local\Temp\RtmpkBw7Cw\downloaded_packages

install.packages('caTools')

## Installing package into 'C:/Users/Toby/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)

## package 'caTools' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Toby\AppData\Local\Temp\RtmpkBw7Cw\downloaded_packages

install.packages('MASS')

## Installing package into 'C:/Users/Toby/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)

## package 'MASS' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Toby\AppData\Local\Temp\RtmpkBw7Cw\downloaded_packages

library('e1071')

## Warning: package 'e1071' was built under R version 3.6.1

library(caTools)

## Warning: package 'caTools' was built under R version 3.6.1

library(MASS)

## Warning: package 'MASS' was built under R version 3.6.1

parameters <- c('PAY_0','BILL_AMT1', 'PAY_AMT1', 'default.payment.next.month')
cc_data2 <- cc_data[, (colnames(cc_data) %in% parameters)]
split = sample.split(cc_data2, SplitRatio = 0.90)

training_data = subset(cc_data2, split == TRUE) 
test_data = subset(cc_data2, split == FALSE)
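A note on the split: `sample.split` expects a vector of labels as its first argument. Given a whole data frame, it appears to build its TRUE/FALSE mask over the columns, and `subset` then recycles that short mask down the rows, so the training share need not be 90%. Passing the label column instead, as in `sample.split(cc_data2$default.payment.next.month, SplitRatio = 0.90)`, and calling `set.seed` first would give a reproducible 90/10 split. The sketch below shows the same idea in base R on toy data; the seed, the toy columns, and the names `train_idx`, `train`, and `test` are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Toy data: 100 rows, 22 of them labelled 1 (hypothetical values).
toy <- data.frame(
  x     = rnorm(100),
  label = factor(rep(c(0, 1), times = c(78, 22)))
)

# Draw 90% of the row indices within each class (stratified split).
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$label),
  function(idx) idx[sample.int(length(idx), size = round(0.9 * length(idx)))]
))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

nrow(train)
nrow(test)
table(train$label)
```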

svm1 <- svm(default.payment.next.month ~ ., data = training_data, probability = TRUE)
summary(svm1)

## 
## Call:
## svm(formula = default.payment.next.month ~ ., data = training_data, 
##     probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  9291
## 
##  ( 4149 5142 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

prediction1 <- predict(svm1, training_data, decision.values = TRUE)
confusion <- table(training_data$default.payment.next.month, prediction1)
confusion

##    prediction1
##         0     1
##   0 16831   721
##   1  3330  1618

1 = YES, 0 = NO

True Negative = 16878 True Positive = 1590 False Positive = 670 False Negative = 3362

Accuracy = 0.8208 Precision = 0.70354 Recall = 0.3211 F = 0.4409 Kappa = 0.3513
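The metrics can be recomputed directly from a confusion table. The sketch below uses the counts from the table printed above (rows are actual classes, columns are predicted classes). Because the split is random and no seed is set, another run gives slightly different counts. The variable names are illustrative.

```r
# Counts copied from the printed table; matrix() fills column by column.
cm <- matrix(c(16831, 3330, 721, 1618), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tn <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tp <- cm["1", "1"]
n  <- sum(cm)

accuracy  <- (tp + tn) / n
precision <- tp / (tp + fp)   # undefined (NaN) if nothing is predicted as 1
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement against agreement expected by chance.
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, precision = precision, recall = recall,
        f1 = f1, kappa = kappa), 4)
```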

C.2.

I created new attributes from the differences between bill amount and pay amount for months 2 and 3.

cc_data$difference2 <- cc_data$BILL_AMT2 - cc_data$PAY_2
cc_data$difference3 <- cc_data$BILL_AMT3 - cc_data$PAY_3
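As written, the two lines above subtract PAY_2 and PAY_3, which hold repayment status codes rather than amounts. If the intent is bill amount minus payment amount, as the description says, a sketch using PAY_AMT2 and PAY_AMT3 could look like the following. The toy values are made up, and the model output shown in this section was not produced with this version.

```r
# Toy rows; PAY_2 and PAY_3 are included only to show they are status codes.
toy <- data.frame(
  BILL_AMT2 = c(3102, 1725), BILL_AMT3 = c(689, 2682),
  PAY_2     = c(2, 2),       PAY_3     = c(-1, 0),
  PAY_AMT2  = c(689, 1000),  PAY_AMT3  = c(0, 1000)
)

# Build difference2 and difference3 as bill amount minus payment amount.
for (m in 2:3) {
  bill <- paste0("BILL_AMT", m)
  paid <- paste0("PAY_AMT", m)
  toy[[paste0("difference", m)]] <- toy[[bill]] - toy[[paid]]
}

toy[c("difference2", "difference3")]
```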

parameters2 <- c('difference2', 'difference3', 'default.payment.next.month')
cc_data3 <- cc_data[, (colnames(cc_data) %in% parameters2)]
split = sample.split(cc_data3, SplitRatio = 0.90)

training_data2 = subset(cc_data3, split == TRUE)
test_data2 = subset(cc_data3, split == FALSE)

svm2 <- svm(default.payment.next.month ~ ., data = training_data2, probability = TRUE)
summary(svm2)

## 
## Call:
## svm(formula = default.payment.next.month ~ ., data = training_data2, 
##     probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  9146
## 
##  ( 4415 4731 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

prediction2 <- predict(svm2, training_data2, decision.values = TRUE)
confusion2 <- table(training_data2$default.payment.next.month, prediction2)
confusion2

##    prediction2
##         0     1
##   0 15585     0
##   1  4415     0

1 = YES, 0 = NO

True Negative = 16831 True Positive = 1618 False Positive = 721 False Negative = 3330

Accuracy = 0.8199 Precision = 0.6917 Recall = 0.3270 F = 0.4441 Kappa = 0.3525

C.3.

hist(cc_data$LIMIT_BAL, breaks = 10, xlab = 'Limit Balance', main = 'Histogram of Limit Balance')

The limit balance is not very skewed, so I believe it does not need a log transformation.

hist(cc_data$PAY_0, breaks = 10, xlab = 'Pay_0', main = 'Histogram of Pay_0')

hist(cc_data$PAY_2, breaks = 10, xlab = 'Pay_2', main = 'Histogram of Pay_2')

hist(cc_data$BILL_AMT1, breaks = 10, xlab = 'Bill_Amount1')

hist(cc_data$BILL_AMT2, breaks = 10, xlab = 'Bill_Amount2')

hist(cc_data$PAY_AMT1, breaks = 30, xlab = 'Pay_Amount')

hist(cc_data$PAY_AMT6, breaks = 10)

I decided to perform log transformations on the Pay_Amount variables.

cc_data$PAY_AMT1 <- log(cc_data$PAY_AMT1)

hist(cc_data$PAY_AMT1, breaks = 10, xlab = 'Pay Amount 1', main = 'Log Histogram of Pay Amount 1')

cc_data$PAY_AMT3 <- log(cc_data$BILL_AMT3)

## Warning in log(cc_data$BILL_AMT3): NaNs produced

hist(cc_data$PAY_AMT3, breaks = 10, xlab = 'Pay Amount 3', main = 'Log Histogram of Pay Amount 3')

cc_data$PAY_AMT6 <- log(cc_data$PAY_AMT6)

hist(cc_data$PAY_AMT6, breaks = 10, xlab = 'Pay Amount 6', main = 'Log Histogram of Pay Amount 6')
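One caution on the transformations above: `log` returns -Inf for a zero payment and NaN for a negative amount, which is where the NaN warning comes from. The PAY_AMT3 line also takes the log of BILL_AMT3, which looks like a slip. A common alternative for non-negative amounts is `log1p`, which computes log(1 + x). The sketch below uses made-up amounts; `raw_log` and `safe_log` are illustrative names.

```r
pay_amt <- c(0, 1, 150, 2000, 50000)   # toy payment amounts, including a zero

raw_log  <- log(pay_amt)     # first element is -Inf because log(0) = -Inf
safe_log <- log1p(pay_amt)   # log(1 + x): maps 0 to 0 and stays finite

is.finite(raw_log)
is.finite(safe_log)
```

`log1p` is only defined for values greater than -1, so negative bill amounts would still need to be filtered out or handled separately first.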

parameters3 <- c('PAY_AMT1', 'PAY_AMT3', 'PAY_AMT6', 'default.payment.next.month')
cc_data4 <- ccnn[, (colnames(ccnn) %in% parameters3)]
split = sample.split(cc_data4, SplitRatio = 0.90)

training_data3 = subset(cc_data4, split == TRUE) 
test_data3 = subset(cc_data4, split == FALSE)

svm3 <- svm(default.payment.next.month ~ ., data = training_data3, probability = TRUE)
summary(svm3)

## 
## Call:
## svm(formula = default.payment.next.month ~ ., data = training_data3, 
##     probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  9904
## 
##  ( 4724 5180 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

prediction3 <- predict(svm3, training_data3, decision.values = TRUE)
confusion3 <- table(training_data3$default.payment.next.month, prediction3)
confusion3

##    prediction3
##         0     1
##   0 16329     0
##   1  4722     2

1 = YES, 0 = NO

True Negative = 16848 True Positive = 1591 False Positive = 685 False Negative = 3377

Accuracy = 0.8195 Precision = 0.6990 Recall = 0.3202 F = 0.4392 Kappa = 0.3488

C.4.

parameters4 <- c('BILL_AMT6', 'PAY_AMT6', 'default.payment.next.month')
cc_data5 <- cc_data[, (colnames(cc_data) %in% parameters4)]
split = sample.split(cc_data5, SplitRatio = 0.90)

training_data4 = subset(cc_data5, split == TRUE) 
test_data4 = subset(cc_data5, split == FALSE)

nbDem <- naiveBayes(default.payment.next.month ~ BILL_AMT6 + PAY_AMT6, training_data4)
nbDem

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##       0       1 
## 0.77925 0.22075 
## 
## Conditional probabilities:
##    BILL_AMT6
## Y       [,1]     [,2]
##   0 39170.14 59705.14
##   1 38170.39 60231.63
## 
##    PAY_AMT6
## Y   [,1] [,2]
##   0 -Inf  NaN
##   1 -Inf  NaN

prediction4 <- predict(nbDem, training_data4)
prediction4

## factor(0)
## Levels: 0 1
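The -Inf and NaN entries for PAY_AMT6 in the table above are what a mean and standard deviation look like when a column contains -Inf. That is likely the case here, since PAY_AMT6 in `cc_data` was log-transformed in C.3 and `log` of a zero payment is -Inf. A minimal sketch with made-up values:

```r
x <- c(-Inf, log(1500), log(2000))  # toy column containing log(0) = -Inf

mean(x)   # -Inf
sd(x)     # NaN

# Restricting to finite values gives usable estimates again.
finite_x <- x[is.finite(x)]
mean(finite_x)
sd(finite_x)
```

With per-class means and standard deviations that are not finite, the Gaussian likelihoods cannot be evaluated, so one option is to fit the model on the untransformed column or on a `log1p` version of it.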

1 = YES, 0 = NO

True Negative = 15585 True Positive = 0 False Positive = 0 False Negative = 4415

Accuracy = 0.77925 Precision = 0 Recall = 0 F = 0 Kappa = 0

C.1 SVM

Accuracy = 0.8208 Precision = 0.70354 Recall = 0.3211 F = 0.4409 Kappa = 0.3513

C.2 SVM

Accuracy = 0.8199 Precision = 0.6917 Recall = 0.3270 F = 0.4441 Kappa = 0.3525

C.3 SVM

Accuracy = 0.8195 Precision = 0.6990 Recall = 0.3202 F = 0.4392 Kappa = 0.3488

C.4 NaiveBayes

Accuracy = 0.77925 Precision = 0 Recall = 0 F = 0 Kappa = 0

E.1.

The analysis confirmed my theory that PAY_AMT1 and PAY_AMT6 are good predictors of next-month default status. I used PAY_AMT1, PAY_AMT3 and PAY_AMT6 for C.3, and together those variables give a broad view of payment amounts over the preceding months. Because the variables are all of the same kind, the analysis does not bring in aspects of other variables.

E.2.

I believe that data transformation does help. It helped the PAY_AMT variables the most. Before the log transformation, the histograms were very flat and far from a normal distribution. After the transformation, the data became more useful and looked more like a bell curve. It did not help for variables like PAY_0, because they are status codes rather than amounts.

E.3.

I believe that SVM was the best-performing method. NaiveBayes did not perform well. Maybe I did not use the method correctly, but its results were not the best. I got more information from the SVM, and its method is easier to understand.

E.4.

I thought that accuracy captured whether or not a model was good. In practice, Kappa and precision are the metrics that determine whether or not a model is good. I felt that recall and F did not help as much.
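One reason accuracy alone can mislead on this data: a model that predicts "no default" for everyone already scores well. A small sketch using the class counts reported in C.4 (the variable names are illustrative):

```r
# Class counts from the C.4 training data.
n_no_default <- 15585
n_default    <- 4415

# Accuracy of always predicting the majority class.
baseline_accuracy <- max(n_no_default, n_default) / (n_no_default + n_default)
baseline_accuracy
```

Any model's accuracy is best read against this baseline, which is part of why Kappa, which corrects for chance agreement, is more informative here.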

E.5.

I believe that precision is the most helpful metric and that SVM is the method that achieved the best performance on this prediction. I think it does well because it is a more advanced machine learning technique that classifies points into classes.