CredX Business Context

Summary

In this research article, I have examined and analyzed the credit bureau and demographic datasets. These datasets have been collected and documented to be used for the purpose of analyzing credit card details of clients. These datasets can be analyzed used predictive modeling based on factors such as credit risk, acquisition and other personal details of clients.

Loading the datasets

#install.packages('knitr')
#install.packages('dplyr')
#install.packages('ggplot2')
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.1
credit <- read.csv("~/Credit_Bureau.csv")
demo <- read.csv("~/demogs.csv")

Duplicated Data

merged<-merge(credit,demo,"Application.ID")
duplicatedData <- merged[duplicated(merged$Application.ID)
                         |duplicated(merged$Application.ID, fromLast = TRUE),]
kable(duplicatedData)
Application.ID No.of.times.90.DPD.or.worse.in.last.6.months No.of.times.60.DPD.or.worse.in.last.6.months No.of.times.30.DPD.or.worse.in.last.6.months No.of.times.90.DPD.or.worse.in.last.12.months No.of.times.60.DPD.or.worse.in.last.12.months No.of.times.30.DPD.or.worse.in.last.12.months Avgas.CC.Utilization.in.last.12.months No.of.trades.opened.in.last.6.months No.of.trades.opened.in.last.12.months No.of.PL.trades.opened.in.last.6.months No.of.PL.trades.opened.in.last.12.months No.of.Inquiries.in.last.6.months..excluding.home…auto.loans. No.of.Inquiries.in.last.12.months..excluding.home…auto.loans. Presence.of.open.home.loan Outstanding.Balance Total.No.of.Trades Presence.of.open.auto.loan Performance.Tag.x Age Gender Marital.Status..at.the.time.of.application. No.of.dependents Income Education Profession Type.of.residence No.of.months.in.current.residence No.of.months.in.current.company Performance.Tag.y
46613 653287861 0 0 0 0 0 0 15 0 0 0 0 0 0 0 10494 2 0 0 26 M Married 3 25.0 Bachelor SAL Rented 112 58 0
46614 653287861 0 0 0 0 0 0 15 0 0 0 0 0 0 0 10494 2 0 0 40 M Married 5 32.0 Phd SE Rented 45 46 1
46615 653287861 1 1 1 2 2 2 113 2 5 1 3 1 3 0 628075 6 0 1 26 M Married 3 25.0 Bachelor SAL Rented 112 58 0
46616 653287861 1 1 1 2 2 2 113 2 5 1 3 1 3 0 628075 6 0 1 40 M Married 5 32.0 Phd SE Rented 45 46 1
47931 671989187 1 2 3 2 2 3 63 2 6 1 4 3 5 0 806473 7 1 0 27 M Married 2 35.0 Professional SAL Living with Parents 108 7 0
47932 671989187 1 2 3 2 2 3 63 2 6 1 4 3 5 0 806473 7 1 0 57 M Married 4 7.0 Professional SE Rented 42 3 0
47933 671989187 0 1 2 0 2 3 76 3 7 1 4 2 6 0 822298 7 0 0 27 M Married 2 35.0 Professional SAL Living with Parents 108 7 0
47934 671989187 0 1 2 0 2 3 76 3 7 1 4 2 6 0 822298 7 0 0 57 M Married 4 7.0 Professional SE Rented 42 3 0
54595 765011468 0 0 0 0 0 0 11 1 2 0 1 2 2 0 212656 3 0 0 57 M Single 4 4.5 Bachelor SAL Company provided 6 56 0
54596 765011468 0 0 0 0 0 0 11 1 2 0 1 2 2 0 212656 3 0 0 38 M Married 4 4.5 Professional SAL Rented 6 72 0
54597 765011468 0 0 0 0 0 0 11 1 3 0 0 1 3 0 29817 6 0 0 57 M Single 4 4.5 Bachelor SAL Company provided 6 56 0
54598 765011468 0 0 0 0 0 0 11 1 3 0 0 1 3 0 29817 6 0 0 38 M Married 4 4.5 Professional SAL Rented 6 72 0

Missing Data

The two datasets had missing values in their original existence. Some of variables with missing values include:

missingData <- colSums(is.na((merged)[colSums(is.na(merged)) >=50]))
kable(missingData)
x
Avgas.CC.Utilization.in.last.12.months 1058
Presence.of.open.home.loan 272
Outstanding.Balance 272
Performance.Tag.x 1425
Performance.Tag.y 1425
#Data Ommit
completedata<-na.omit(merged)
View(completedata)

Boxplot

The boxplot diagram below shows the outstanding balance based on gender. One can use this boxplot to make a comparison of the outstanding balance of the customers according to their gender.

Histogram

Below is a histogram showing the frequency of the age of the customers. The histogram shows the distribution of customers according to their age.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Observations

After completing this assignment, I was able to learn several things about data exploration and visualization using R studio. The whole exercise was fun as I was able to make several observations as far as the analysis of credit card and demographic data are concerned. The strength of the R programming language has been proven by its ability to handle cases of missing and incomplete data. Additionally, the diagrams obtained from this tool are visually impressive and can be used to make observations from the data. However, there are a few challenges that I experienced while handling this data analysis process. For example, merging the datasets was difficult to join the two datasets into one.