ECON 465 – Stage 1

Author

Efe Şahin-Ömer Faruk Yılmaz

Introduction

This project analyzes two financial datasets using data science techniques in R. The first dataset is used for regression analysis of the factors associated with loan amounts. The second dataset is used for classification analysis to predict credit card default.

Dataset 1 – Credit Risk Dataset

The first dataset was obtained from Kaggle and contains information about loan applicants, including income, employment length, loan intent, and interest rates.

The dataset is appropriate for regression analysis because it includes a continuous target variable, the loan amount (loan_amnt).

Economic Question

What factors are associated with loan amounts?

library(readr)
library(dplyr)


library(ggplot2)

credit_data <- read_csv("credit_risk_dataset.csv")

Rows: 32581 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): person_home_ownership, loan_intent, loan_grade, cb_person_default_o...
dbl (8): person_age, person_income, person_emp_length, loan_amnt, loan_int_r...

head(credit_data)

# A tibble: 6 × 12
  person_age person_income person_home_ownership person_emp_length loan_intent
       <dbl>         <dbl> <chr>                             <dbl> <chr>      
1         22         59000 RENT                                123 PERSONAL   
2         21          9600 OWN                                   5 EDUCATION  
3         25          9600 MORTGAGE                              1 MEDICAL    
4         23         65500 RENT                                  4 MEDICAL    
5         24         54400 RENT                                  8 MEDICAL    
6         21          9900 OWN                                   2 VENTURE    
# ℹ 7 more variables: loan_grade <chr>, loan_amnt <dbl>, loan_int_rate <dbl>,
#   loan_status <dbl>, loan_percent_income <dbl>,
#   cb_person_default_on_file <chr>, cb_person_cred_hist_length <dbl>

Variable Inspection

The colnames() function was used to inspect all variables in the dataset.

colnames(credit_data)

 [1] "person_age"                 "person_income"             
 [3] "person_home_ownership"      "person_emp_length"         
 [5] "loan_intent"                "loan_grade"                
 [7] "loan_amnt"                  "loan_int_rate"             
 [9] "loan_status"                "loan_percent_income"       
[11] "cb_person_default_on_file"  "cb_person_cred_hist_length"

The dataset contains demographic and financial variables related to loan applicants. Variables include age, income, employment length, loan intent, interest rate, loan amount, and credit history information.

Data Cleaning

The dataset was cleaned by selecting relevant variables and removing missing values.

clean_credit <- credit_data %>%
  select(
    person_age,
    person_income,
    person_emp_length,
    loan_intent,
    loan_int_rate,
    loan_amnt,
    loan_status
  ) %>%
  na.omit()

head(clean_credit)

# A tibble: 6 × 7
  person_age person_income person_emp_length loan_intent loan_int_rate loan_amnt
       <dbl>         <dbl>             <dbl> <chr>               <dbl>     <dbl>
1         22         59000               123 PERSONAL            16.0      35000
2         21          9600                 5 EDUCATION           11.1       1000
3         25          9600                 1 MEDICAL             12.9       5500
4         23         65500                 4 MEDICAL             15.2      35000
5         24         54400                 8 MEDICAL             14.3      35000
6         21          9900                 2 VENTURE              7.14      2500
# ℹ 1 more variable: loan_status <dbl>
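
Because na.omit() drops every row that has a missing value in any selected column, it can be useful to count the missing values per column and the number of rows removed. The first row of the output above also shows an employment length of 123 years for a 22-year-old, which na.omit() does not catch. The sketch below illustrates both checks in base R on a small made-up data frame (toy_credit is not the Kaggle data), and the plausibility rule, employment length not exceeding age, is an assumption made only for illustration.

```r
# Toy data standing in for credit_data (all values are made up)
toy_credit <- data.frame(
  person_age        = c(22, 21, 25, 23, 30),
  person_emp_length = c(123, 5, NA, 4, 8),
  loan_int_rate     = c(16.0, 11.1, 12.9, NA, 14.3),
  loan_amnt         = c(35000, 1000, 5500, 35000, 12000)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_credit))

# Rows removed by na.omit()
toy_complete <- na.omit(toy_credit)
rows_dropped <- nrow(toy_credit) - nrow(toy_complete)

# Assumed plausibility rule: employment length cannot exceed age
toy_plausible <- subset(toy_complete, person_emp_length <= person_age)

print(na_counts)
print(rows_dropped)
print(nrow(toy_plausible))
```

With the real data, credit_data would replace toy_credit. Whether to drop or correct implausible rows is a judgment call that depends on the analysis.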

Summary Statistics

Summary statistics were calculated for the loan amount variable to understand its distribution and variability.

summary(clean_credit$loan_amnt)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500    5000    8000    9656   12500   35000

sd(clean_credit$loan_amnt)

[1] 6329.683

The minimum loan amount is 500 and the maximum is 35,000. The median loan amount is 8,000, and the mean is 9,656. Because the mean exceeds the median, the distribution appears to be positively skewed. The standard deviation is about 6,330, roughly two-thirds of the mean, indicating considerable variation in loan amounts across borrowers.

Histogram of Loan Amount

A histogram was created to visualize the distribution of loan amounts.

ggplot(clean_credit, aes(x = loan_amnt)) +
  geom_histogram(bins = 30) +
  theme_minimal()

The histogram shows that the distribution of loan amounts is positively skewed. Most loans are relatively small: half of them fall between 5,000 and 12,500 (the first and third quartiles), while far fewer borrowers have very large loans. The long right tail reflects the presence of high-value loans, up to the maximum of 35,000.

Log Transformation

Since the loan amount variable is skewed, a log transformation was applied to make the distribution more balanced.

clean_credit$log_loan <- log(clean_credit$loan_amnt)

ggplot(clean_credit, aes(x = log_loan)) +
  geom_histogram(bins = 30) +
  theme_minimal()

After applying the log transformation, the distribution became more symmetric and balanced. Compared to the original histogram, the extreme values are less noticeable, and the shape is closer to a normal distribution.
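
The visual impression of reduced skewness can be backed by a number. The sketch below computes a moment-based sample skewness before and after the log transformation. It runs on simulated right-skewed data rather than the Kaggle data, the simulation parameters are arbitrary assumptions, and sample_skewness is a hypothetical helper written for this example (base R does not provide a skewness function).

```r
# Hypothetical helper: third standardized moment of a numeric vector
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(465)
# Simulated right-skewed amounts; meanlog and sdlog are arbitrary assumptions
sim_loan <- rlnorm(5000, meanlog = 9, sdlog = 0.6)

skew_raw <- sample_skewness(sim_loan)       # clearly positive for right-skewed data
skew_log <- sample_skewness(log(sim_loan))  # near zero if the logs are roughly normal

print(c(raw = skew_raw, log = skew_log))
```

With the real data, clean_credit$loan_amnt and clean_credit$log_loan would take the place of sim_loan and log(sim_loan). A value near zero after the transformation supports the description of a more symmetric distribution.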

Dataset 2 – Credit Card Default Dataset

This dataset includes financial and personal information about credit card customers in Taiwan. The variables describe customers' credit limits, repayment history, bill amounts, payment behavior, and demographic characteristics.

The dataset is appropriate for classification analysis because the target variable shows whether a customer defaulted on credit card payments in the following month.

Classification Question

Can we predict whether a customer will default on credit card payments?

credit_card <- read_csv("UCI_Credit_Card.csv")

Rows: 30000 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (25): ID, LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3,...

head(credit_card)

# A tibble: 6 × 25
     ID LIMIT_BAL   SEX EDUCATION MARRIAGE   AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
  <dbl>     <dbl> <dbl>     <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     20000     2         2        1    24     2     2    -1    -1    -2
2     2    120000     2         2        2    26    -1     2     0     0     0
3     3     90000     2         2        2    34     0     0     0     0     0
4     4     50000     2         2        1    37     0     0     0     0     0
5     5     50000     1         2        1    57    -1     0    -1     0     0
6     6     50000     1         1        2    37     0     0     0     0     0
# ℹ 14 more variables: PAY_6 <dbl>, BILL_AMT1 <dbl>, BILL_AMT2 <dbl>,
#   BILL_AMT3 <dbl>, BILL_AMT4 <dbl>, BILL_AMT5 <dbl>, BILL_AMT6 <dbl>,
#   PAY_AMT1 <dbl>, PAY_AMT2 <dbl>, PAY_AMT3 <dbl>, PAY_AMT4 <dbl>,
#   PAY_AMT5 <dbl>, PAY_AMT6 <dbl>, default.payment.next.month <dbl>

Variable Inspection

The variables in the dataset were inspected using the colnames() function.

colnames(credit_card)

 [1] "ID"                         "LIMIT_BAL"                 
 [3] "SEX"                        "EDUCATION"                 
 [5] "MARRIAGE"                   "AGE"                       
 [7] "PAY_0"                      "PAY_2"                     
 [9] "PAY_3"                      "PAY_4"                     
[11] "PAY_5"                      "PAY_6"                     
[13] "BILL_AMT1"                  "BILL_AMT2"                 
[15] "BILL_AMT3"                  "BILL_AMT4"                 
[17] "BILL_AMT5"                  "BILL_AMT6"                 
[19] "PAY_AMT1"                   "PAY_AMT2"                  
[21] "PAY_AMT3"                   "PAY_AMT4"                  
[23] "PAY_AMT5"                   "PAY_AMT6"                  
[25] "default.payment.next.month"

The dataset contains variables related to customer demographics, financial status, repayment history, bill amounts, and payment behavior. Important variables include credit limit, age, education level, repayment status, and default payment status.

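Data Cleaning

As with the first dataset, the data was cleaned by selecting relevant variables and removing missing values.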

clean_card <- credit_card %>%
  select(
    LIMIT_BAL,
    SEX,
    EDUCATION,
    MARRIAGE,
    AGE,
    PAY_0,
    BILL_AMT1,
    PAY_AMT1,
    default.payment.next.month
  ) %>%
  na.omit()

head(clean_card)

# A tibble: 6 × 9
  LIMIT_BAL   SEX EDUCATION MARRIAGE   AGE PAY_0 BILL_AMT1 PAY_AMT1
      <dbl> <dbl>     <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
1     20000     2         2        1    24     2      3913        0
2    120000     2         2        2    26    -1      2682        0
3     90000     2         2        2    34     0     29239     1518
4     50000     2         2        1    37     0     46990     2000
5     50000     1         2        1    57    -1      8617     2000
6     50000     1         1        2    37     0     64400     2500
# ℹ 1 more variable: default.payment.next.month <dbl>
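
Before a classification model is fitted, it is common to check how balanced the target variable is and to store coded columns as categorical rather than numeric. The sketch below shows both steps in base R on a small made-up data frame (toy_card uses the same column names as clean_card, but the values are invented). Treating SEX, EDUCATION, and MARRIAGE as factors is an assumption about how they will be used in later stages.

```r
# Toy data standing in for clean_card (all values are made up)
toy_card <- data.frame(
  SEX       = c(2, 2, 1, 1, 2, 1, 2, 2),
  EDUCATION = c(2, 2, 1, 3, 2, 1, 2, 3),
  MARRIAGE  = c(1, 2, 2, 1, 1, 2, 2, 1),
  default.payment.next.month = c(1, 0, 0, 0, 1, 0, 0, 0)
)

# Share of non-defaulters (0) and defaulters (1)
class_share <- prop.table(table(toy_card$default.payment.next.month))

# Store the coded columns as factors instead of doubles
coded_cols <- c("SEX", "EDUCATION", "MARRIAGE", "default.payment.next.month")
toy_card[coded_cols] <- lapply(toy_card[coded_cols], factor)

print(class_share)
str(toy_card)
```

A strongly unbalanced target would affect how accuracy should be interpreted later, so the class shares are worth reporting alongside the summary statistics.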

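Summary Statistics

Summary statistics were calculated for the credit limit variable (LIMIT_BAL) to understand its distribution and variability.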

summary(clean_card$LIMIT_BAL)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10000   50000  140000  167484  240000 1000000

sd(clean_card$LIMIT_BAL)

[1] 129747.7

The minimum credit limit is 10,000 and the maximum is 1,000,000. The median is 140,000, and the mean is 167,484. Because the mean exceeds the median, the distribution appears to be positively skewed. The standard deviation is about 129,748, roughly three-quarters of the mean, showing substantial variation in credit limits among customers.

Histogram of Credit Limit

A histogram was created to visualize the distribution of credit limits.

ggplot(clean_card, aes(x = LIMIT_BAL)) +
  geom_histogram(bins = 30) +
  theme_minimal()

The histogram shows that the distribution of credit limits is positively skewed. Most customers have low or medium credit limits (three-quarters are at or below 240,000), while only a small number have very high limits, up to 1,000,000. The long right tail suggests the presence of extreme values.

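Log Transformation

Since the credit limit variable is skewed, a log transformation was applied to make the distribution more balanced.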

clean_card$log_limit <- log(clean_card$LIMIT_BAL)

ggplot(clean_card, aes(x = log_limit)) +
  geom_histogram(bins = 30) +
  theme_minimal()

After applying the log transformation, the distribution became more balanced and less skewed. Compared to the original histogram, the extreme values are less noticeable, and the overall shape is closer to a normal distribution.

Theoretical Distribution

Regression Dataset

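The loan amount variable appears to be approximately log-normal. Right skewness alone does not establish this, but the log-transformed values are more symmetric and closer to a normal distribution, which is the defining property of a log-normal variable. Because loan amounts in the data are bounded between 500 and 35,000, the log-normal distribution can only be an approximation.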
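
One way to examine a log-normal assumption beyond the histogram is to estimate the two parameters from the logged values and compare sample quantiles with those of the fitted distribution. The following is a sketch on simulated data only: sim_loan is not the Kaggle data, and its parameters are arbitrary assumptions chosen for illustration.

```r
set.seed(465)
# Simulated amounts standing in for loan_amnt; parameters are assumptions
sim_loan <- rlnorm(5000, meanlog = 9, sdlog = 0.6)

# Log-normal parameters are the mean and sd of the logged values
meanlog_hat <- mean(log(sim_loan))
sdlog_hat   <- sd(log(sim_loan))

# Compare empirical quartiles with quartiles of the fitted log-normal
probs <- c(0.25, 0.50, 0.75)
comparison <- data.frame(
  prob      = probs,
  empirical = as.numeric(quantile(sim_loan, probs)),
  fitted    = qlnorm(probs, meanlog = meanlog_hat, sdlog = sdlog_hat)
)

print(comparison)
```

With the real data, clean_credit$loan_amnt would replace sim_loan. Large gaps between the empirical and fitted columns would argue against the log-normal description. A normal Q-Q plot of the logged values, using qqnorm() followed by qqline(), offers a visual version of the same check.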

Classification Dataset

The credit limit variable also appears to be approximately log-normal. In the original histogram, most customers have lower credit limits, while a few have very high limits. After the log transformation, the shape is more symmetric and easier to interpret.