library(readr)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
This project analyzes two financial datasets in R. The first dataset is used for regression analysis to examine which factors are associated with loan amounts. The second dataset is used for classification analysis to predict credit card default.
The first dataset was obtained from Kaggle and contains information about loan applicants, including income, employment length, loan intent, and interest rates.
The dataset is appropriate for regression analysis because its target variable, the loan amount (loan_amnt), is continuous.
What factors are associated with loan amounts?
credit_data <- read_csv("credit_risk_dataset.csv")
Rows: 32581 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): person_home_ownership, loan_intent, loan_grade, cb_person_default_o...
dbl (8): person_age, person_income, person_emp_length, loan_amnt, loan_int_r...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(credit_data)
# A tibble: 6 × 12
  person_age person_income person_home_ownership person_emp_length loan_intent
       <dbl>         <dbl> <chr>                             <dbl> <chr>
1         22         59000 RENT                                123 PERSONAL
2         21          9600 OWN                                   5 EDUCATION
3         25          9600 MORTGAGE                              1 MEDICAL
4         23         65500 RENT                                  4 MEDICAL
5         24         54400 RENT                                  8 MEDICAL
6         21          9900 OWN                                   2 VENTURE
# ℹ 7 more variables: loan_grade <chr>, loan_amnt <dbl>, loan_int_rate <dbl>,
#   loan_status <dbl>, loan_percent_income <dbl>,
#   cb_person_default_on_file <chr>, cb_person_cred_hist_length <dbl>
The colnames() function was used to inspect all variables in the dataset.
colnames(credit_data)
 [1] "person_age"                 "person_income"
 [3] "person_home_ownership"      "person_emp_length"
 [5] "loan_intent"                "loan_grade"
 [7] "loan_amnt"                  "loan_int_rate"
 [9] "loan_status"                "loan_percent_income"
[11] "cb_person_default_on_file"  "cb_person_cred_hist_length"
The dataset contains demographic and financial variables related to loan applicants. Variables include age, income, employment length, loan intent, interest rate, loan amount, and credit history information.
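The first row of the preview shows an employment length of 123 years for a 22-year-old applicant, which cannot be correct. A minimal sketch of how such rows could be flagged is shown below. The data frame `toy` and the rule "employment length greater than age" are illustrative assumptions, not part of the original analysis, and whether flagged rows should be dropped or corrected is a judgment call.

```r
# Toy rows that mirror two columns of the preview; values are illustrative only.
toy <- data.frame(
  person_age        = c(22, 21, 25, 23),
  person_emp_length = c(123, 5, 1, NA)
)

# An employment length longer than the applicant's age is implausible.
# which() keeps only TRUE comparisons, so rows with NA are not flagged.
implausible <- which(toy$person_emp_length > toy$person_age)

# Show the flagged rows for manual review.
toy[implausible, ]
```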
The dataset was cleaned by selecting the relevant variables and removing rows with missing values.
# Keep only the variables used in the analysis, then drop incomplete rows
clean_credit <- credit_data %>%
  select(
    person_age,
    person_income,
    person_emp_length,
    loan_intent,
    loan_int_rate,
    loan_amnt,
    loan_status
  ) %>%
  na.omit()
head(clean_credit)
# A tibble: 6 × 7
  person_age person_income person_emp_length loan_intent loan_int_rate loan_amnt
       <dbl>         <dbl>             <dbl> <chr>               <dbl>     <dbl>
1         22         59000               123 PERSONAL             16.0     35000
2         21          9600                 5 EDUCATION            11.1      1000
3         25          9600                 1 MEDICAL              12.9      5500
4         23         65500                 4 MEDICAL              15.2     35000
5         24         54400                 8 MEDICAL              14.3     35000
6         21          9900                 2 VENTURE              7.14      2500
# ℹ 1 more variable: loan_status <dbl>
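Because na.omit() removes every row that has a missing value in any selected column, it can help to see where the missing values are and how many rows are lost. The sketch below uses a small made-up data frame (`toy`), not the real file, and base R only.

```r
# Made-up data with a few missing values; not taken from the real dataset.
toy <- data.frame(
  person_income     = c(59000, 9600, NA, 65500),
  person_emp_length = c(4, NA, 1, NA),
  loan_amnt         = c(35000, 1000, 5500, 35000)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Rows lost to listwise deletion (a row is dropped if any column is NA).
rows_before  <- nrow(toy)
rows_after   <- nrow(na.omit(toy))
rows_dropped <- rows_before - rows_after

na_per_column
rows_dropped
```

The same two checks could be run on the selected columns of credit_data before calling na.omit().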
Summary statistics were calculated for the loan amount variable to understand its distribution and variability.
summary(clean_credit$loan_amnt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    500    5000    8000    9656   12500   35000
sd(clean_credit$loan_amnt)
[1] 6329.683
The minimum loan amount is 500 and the maximum is 35,000. The median is 8,000 and the mean is 9,656. Because the mean is higher than the median, the distribution appears to be positively skewed. The standard deviation is about 6,330, which is large relative to the mean and indicates considerable variation in loan amounts across borrowers.
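Comparing the mean with the median gives only an informal hint of skewness. A moment-based skewness coefficient puts a number on it. The sketch below is one possible way to compute it in base R. The helper `sample_skewness()` is a hypothetical name introduced here, and the data are simulated rather than taken from the loan dataset.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2. NAs are removed first.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated right-skewed values (log-normal); parameters are arbitrary.
set.seed(1)
right_skewed <- rlnorm(1000, meanlog = 9, sdlog = 0.6)

sample_skewness(right_skewed)       # expected to be clearly positive
sample_skewness(log(right_skewed))  # expected to be close to zero
```

A value near zero suggests a roughly symmetric distribution, while a positive value indicates a longer right tail.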
A histogram was created to visualize the distribution of loan amounts.
ggplot(clean_credit, aes(x = loan_amnt)) +
  geom_histogram(bins = 30) +
  theme_minimal()
The histogram shows that the distribution of loan amounts is positively skewed. Most loans are relatively small, roughly between 5,000 and 15,000, while far fewer borrowers have very large loans. The long right tail reflects the presence of high-value loans.
Since the loan amount variable is skewed, a log transformation was applied to make the distribution more balanced.
clean_credit$log_loan <- log(clean_credit$loan_amnt)
ggplot(clean_credit, aes(x = log_loan)) +
  geom_histogram(bins = 30) +
  theme_minimal()
After the log transformation, the distribution is more symmetric. Compared with the original histogram, the extreme values are less prominent and the shape is closer to a normal distribution.
The second dataset includes financial and personal information about credit card customers in Taiwan. The variables describe customers' credit limits, repayment history, bill amounts, payment behavior, and demographic characteristics.
The dataset is appropriate for classification analysis because the target variable shows whether a customer defaulted on credit card payments in the following month.
Can we predict whether a customer will default on credit card payments?
credit_card <- read_csv("UCI_Credit_Card.csv")
Rows: 30000 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (25): ID, LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(credit_card)
# A tibble: 6 × 25
     ID LIMIT_BAL   SEX EDUCATION MARRIAGE   AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
  <dbl>     <dbl> <dbl>     <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     20000     2         2        1    24     2     2    -1    -1    -2
2     2    120000     2         2        2    26    -1     2     0     0     0
3     3     90000     2         2        2    34     0     0     0     0     0
4     4     50000     2         2        1    37     0     0     0     0     0
5     5     50000     1         2        1    57    -1     0    -1     0     0
6     6     50000     1         1        2    37     0     0     0     0     0
# ℹ 14 more variables: PAY_6 <dbl>, BILL_AMT1 <dbl>, BILL_AMT2 <dbl>,
#   BILL_AMT3 <dbl>, BILL_AMT4 <dbl>, BILL_AMT5 <dbl>, BILL_AMT6 <dbl>,
#   PAY_AMT1 <dbl>, PAY_AMT2 <dbl>, PAY_AMT3 <dbl>, PAY_AMT4 <dbl>,
#   PAY_AMT5 <dbl>, PAY_AMT6 <dbl>, default.payment.next.month <dbl>
The variables in the dataset were inspected using the colnames() function.
colnames(credit_card)
 [1] "ID"                         "LIMIT_BAL"
 [3] "SEX"                        "EDUCATION"
 [5] "MARRIAGE"                   "AGE"
 [7] "PAY_0"                      "PAY_2"
 [9] "PAY_3"                      "PAY_4"
[11] "PAY_5"                      "PAY_6"
[13] "BILL_AMT1"                  "BILL_AMT2"
[15] "BILL_AMT3"                  "BILL_AMT4"
[17] "BILL_AMT5"                  "BILL_AMT6"
[19] "PAY_AMT1"                   "PAY_AMT2"
[21] "PAY_AMT3"                   "PAY_AMT4"
[23] "PAY_AMT5"                   "PAY_AMT6"
[25] "default.payment.next.month"
The dataset contains variables related to customer demographics, financial status, repayment history, bill amounts, and payment behavior. Important variables include credit limit, age, education level, repayment status, and default payment status.
# Keep only the variables used in the analysis, then drop incomplete rows
clean_card <- credit_card %>%
  select(
    LIMIT_BAL,
    SEX,
    EDUCATION,
    MARRIAGE,
    AGE,
    PAY_0,
    BILL_AMT1,
    PAY_AMT1,
    default.payment.next.month
  ) %>%
  na.omit()
head(clean_card)
# A tibble: 6 × 9
  LIMIT_BAL   SEX EDUCATION MARRIAGE   AGE PAY_0 BILL_AMT1 PAY_AMT1
      <dbl> <dbl>     <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
1     20000     2         2        1    24     2      3913        0
2    120000     2         2        2    26    -1      2682        0
3     90000     2         2        2    34     0     29239     1518
4     50000     2         2        1    37     0     46990     2000
5     50000     1         2        1    57    -1      8617     2000
6     50000     1         1        2    37     0     64400     2500
# ℹ 1 more variable: default.payment.next.month <dbl>
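Since the target variable is stored as a number, a classification workflow would usually check how the two classes are balanced and convert the target to a factor. The sketch below shows one way to do this in base R on a small made-up vector. It assumes that 1 means "default" and 0 means "no default", and the column name `default` and the labels "no" and "yes" are illustrative choices.

```r
# Made-up target values; the 0/1 coding (1 = default) is an assumption here.
toy <- data.frame(default.payment.next.month = c(0, 0, 1, 0, 1, 0, 0, 0))

# Convert the numeric target into a labelled factor for classification.
toy$default <- factor(
  toy$default.payment.next.month,
  levels = c(0, 1),
  labels = c("no", "yes")
)

# Absolute counts and shares of each class.
class_counts <- table(toy$default)
class_share  <- prop.table(class_counts)

class_counts
class_share
```

If one class turns out to be much rarer than the other, plain accuracy can be a misleading measure of model quality.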
summary(clean_card$LIMIT_BAL)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10000   50000  140000  167484  240000 1000000
sd(clean_card$LIMIT_BAL)
[1] 129747.7
The minimum credit limit is 10,000 and the maximum is 1,000,000. The median is 140,000 and the mean is 167,484. Because the mean is higher than the median, the distribution appears to be positively skewed. The standard deviation is about 129,748, which is large relative to the mean and shows substantial variation in credit limits among customers.
A histogram was created to visualize the distribution of credit limits.
ggplot(clean_card, aes(x = LIMIT_BAL)) +
  geom_histogram(bins = 30) +
  theme_minimal()
The histogram shows that the distribution of credit limits is positively skewed. Most customers have low or medium credit limits, while only a small number have very high limits. The long right tail suggests the presence of extreme values.
clean_card$log_limit <- log(clean_card$LIMIT_BAL)
ggplot(clean_card, aes(x = log_limit)) +
  geom_histogram(bins = 30) +
  theme_minimal()
After the log transformation, the distribution is more balanced and less skewed. Compared with the original histogram, the extreme values are less prominent and the overall shape is closer to a normal distribution.
The loan amount variable is plausibly log-normal. Right skew in the original histogram is consistent with this, but the stronger evidence is that the log-transformed values look approximately normal, which is the defining property of a log-normal variable.
The credit limit variable also appears to be approximately log-normal. In the original histogram, most customers have lower credit limits while a few have very high limits. After the log transformation, the shape becomes more balanced and easier to interpret.
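Histograms give only a visual impression of log-normality. A normal Q-Q plot of the log-transformed values is a common complementary check: if the variable is log-normal, the points should lie close to a straight line. The sketch below uses simulated data and base R, so `simulated`, `r_raw`, and `r_log` are illustrative names and the parameters are arbitrary. The correlation between theoretical and sample quantiles is used here only as a rough numeric summary of how straight the Q-Q plot is.

```r
# Simulated log-normal values standing in for a right-skewed amount variable.
set.seed(42)
simulated <- rlnorm(2000, meanlog = 9, sdlog = 0.6)

# Theoretical vs. sample quantiles, computed without drawing a plot.
qq_raw <- qqnorm(simulated, plot.it = FALSE)
qq_log <- qqnorm(log(simulated), plot.it = FALSE)

# Correlation close to 1 means the Q-Q points are close to a straight line.
r_raw <- cor(qq_raw$x, qq_raw$y)
r_log <- cor(qq_log$x, qq_log$y)

# Q-Q plot of the log values with a reference line through the quartiles.
qqnorm(log(simulated), main = "Normal Q-Q plot of log values")
qqline(log(simulated))
```

Applied to the real data, the same check could be run on log_loan and log_limit. Systematic curvature at the ends of the plot would suggest that the log-normal description is only approximate.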