credit_risk

Jakob Pürzelmayer

HOMEWORK ASSIGNMENT 1 - Descriptive statistics

mydata <- read.table("./credit_risk_dataset.csv", header=TRUE, sep=";", dec=".")

head(mydata)

##   person_age person_income loan_intent loan_amnt loan_status
## 1         22         59000    PERSONAL     35000           1
## 2         21          9600   EDUCATION      1000           0
## 3         25          9600     MEDICAL      5500           1
## 4         23         65500     MEDICAL     35000           1
## 5         24         54400     MEDICAL     35000           1
## 6         21          9900     VENTURE      2500           1
##   loan_percent_income
## 1                0.59
## 2                0.10
## 3                0.57
## 4                0.53
## 5                0.55
## 6                0.25

4) Data Explanation:

General:

The dataset contains 150 observations and 6 variables.
The unit of observation is an individual loan applicant.

Variables:

person_age
- Age of the applicant
- Type: Numeric – Ratio
- Unit: Years
person_income
- Annual income of the applicant
- Type: Numeric – Ratio
- Unit: Currency
loan_intent
- Purpose of the loan (e.g., education, medical, personal)
- Type: Categorical – Nominal
loan_amnt
- Loan amount requested
- Type: Numeric – Ratio
- Unit: Currency
loan_status
- Whether the loan was defaulted
- 0 = non-default, 1 = default
- Type: Categorical – Nominal (binary)
loan_percent_income
- Ratio of loan amount to income
- Type: Numeric – Ratio
- Unit: Proportion (decimal fraction, e.g. 0.59 = 59% of income)
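
The dimensions and variable types listed above can be checked directly in R. The following is a minimal sketch on a small data frame built from the first three rows shown by head(); the name `toy` is an assumption for illustration, and the same calls should work on `mydata`.

```r
# Mini data frame copied from the first three rows of the head() output;
# the object name `toy` is hypothetical
toy <- data.frame(
  person_age          = c(22, 21, 25),
  person_income       = c(59000, 9600, 9600),
  loan_intent         = c("PERSONAL", "EDUCATION", "MEDICAL"),
  loan_amnt           = c(35000, 1000, 5500),
  loan_status         = c(1, 0, 1),
  loan_percent_income = c(0.59, 0.10, 0.57),
  stringsAsFactors    = FALSE
)

n_obs  <- nrow(toy)  # number of observations (rows)
n_vars <- ncol(toy)  # number of variables (columns)

# Storage class of each column: numeric columns vs. the character column
col_classes <- sapply(toy, class)

print(c(observations = n_obs, variables = n_vars))
print(col_classes)
```

Note that loan_status is read as numeric here, which is why the report later converts it to a factor.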

5) Source of the data

Credit Risk Dataset - Kaggle.com
URL: https://www.kaggle.com/datasets/laotse/credit-risk-dataset/data
(22.03.2025)

6) Data manipulation

head(mydata)

##   person_age person_income loan_intent loan_amnt loan_status
## 1         22         59000    PERSONAL     35000           1
## 2         21          9600   EDUCATION      1000           0
## 3         25          9600     MEDICAL      5500           1
## 4         23         65500     MEDICAL     35000           1
## 5         24         54400     MEDICAL     35000           1
## 6         21          9900     VENTURE      2500           1
##   loan_percent_income
## 1                0.59
## 2                0.10
## 3                0.57
## 4                0.53
## 5                0.55
## 6                0.25

mydata$loan_status <- factor(mydata$loan_status, 
                             levels = c(0, 1), 
                             labels = c("non-default", "default"))

One individual had a recorded age of 144 years, which is not plausible. The value (row 82, column 1) is replaced with 27.

mydata[82, 1] <- 27
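
Instead of hard-coding the row number, the implausible value could also be located programmatically. This is only a sketch on made-up ages: the vector `ages` and the limit of 100 years are assumptions, and 27 simply mirrors the replacement value used in this report.

```r
# Hypothetical ages; 144 stands in for the implausible record
ages <- c(22, 21, 25, 144, 24)

# Assumed plausibility limit (not taken from the dataset documentation)
max_plausible_age <- 100

# Positions of all values above the limit
suspect_rows <- which(ages > max_plausible_age)
print(suspect_rows)

# Replace the flagged value(s) with 27, as done for row 82 in the report
ages_clean <- ages
ages_clean[suspect_rows] <- 27
print(ages_clean)
```
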

names(mydata)[names(mydata) == "person_age"] <- "Age"

head(mydata)

##   Age person_income loan_intent loan_amnt loan_status loan_percent_income
## 1  22         59000    PERSONAL     35000     default                0.59
## 2  21          9600   EDUCATION      1000 non-default                0.10
## 3  25          9600     MEDICAL      5500     default                0.57
## 4  23         65500     MEDICAL     35000     default                0.53
## 5  24         54400     MEDICAL     35000     default                0.55
## 6  21          9900     VENTURE      2500     default                0.25

7) Descriptive Statistics

library(pastecs)
round(stat.desc(mydata[ , -3])[ , -4], 1)

##                 Age person_income   loan_amnt loan_percent_income
## nbr.val       150.0         150.0       150.0               150.0
## nbr.null        0.0           0.0         0.0                 0.0
## nbr.na          0.0           0.0         0.0                 0.0
## min            21.0        9600.0      1000.0                 0.0
## max            27.0      500000.0     35000.0                 0.6
## range           6.0      490400.0     34000.0                 0.6
## sum          3564.0    16784761.0   3110300.0                42.6
## median         24.0       77550.0     25000.0                 0.3
## mean           23.8      111898.4     20735.3                 0.3
## SE.mean         0.1        7996.0       950.7                 0.0
## CI.mean.0.95    0.3       15800.2      1878.5                 0.0
## var             2.7  9590395264.4 135565110.6                 0.0
## std.dev         1.7       97930.6     11643.2                 0.2
## coef.var        0.1           0.9         0.6                 0.6

The table shows descriptive statistics for the numeric variables in the dataset; the categorical variables (loan_intent and loan_status) are excluded.

The mean income is 111,898.4 and the mean loan amount is 20,735.3. The mean income may be inflated by outliers (e.g. very high incomes).
The median income is 77,550, which is lower than the mean and suggests a right-skewed distribution.
The median loan amount is 25,000, meaning that half of the applicants request less than this amount and the other half request more.
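
The comparison of mean and median above can be illustrated with a small example. The incomes below are invented for illustration only; they show how a single very high value pulls the mean above the median.

```r
# Hypothetical incomes: mostly moderate values plus one very high income
income <- c(30000, 45000, 50000, 60000, 75000, 500000)

income_mean   <- mean(income)    # sensitive to the extreme value
income_median <- median(income)  # robust to the extreme value

print(c(mean = income_mean, median = income_median))

# A mean clearly above the median hints at a right-skewed distribution
print(income_mean > income_median)
```
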

Mode - reason for loan

library(modeest)
mlv(mydata$loan_intent)

## [1] "EDUCATION"

Education is the most frequent loan purpose among the observed applicants.
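
The mode of a categorical variable can also be obtained with base R, without an extra package. A minimal sketch, assuming a made-up vector `intent` (not the real column):

```r
# Hypothetical loan purposes
intent <- c("EDUCATION", "MEDICAL", "EDUCATION",
            "VENTURE", "EDUCATION", "MEDICAL")

# Frequency of each category
counts <- table(intent)
print(counts)

# Category (or categories, in case of ties) with the highest frequency
mode_intent <- names(counts)[counts == max(counts)]
print(mode_intent)
```
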

Standard deviation

sd(mydata$person_income)

## [1] 97930.56

On average, individual incomes deviate by about $97,930.56 from the mean income in the dataset, which indicates a wide spread in income levels.
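
To make the meaning of the standard deviation more concrete, the sketch below computes the sample standard deviation by hand on invented numbers and compares it with sd(). It also derives the coefficient of variation, which corresponds to the coef.var row of the stat.desc table (standard deviation divided by the mean).

```r
# Hypothetical values, chosen only to keep the arithmetic easy to follow
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample standard deviation: squared deviations from the mean,
# summed, divided by n - 1, then square-rooted
sd_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Coefficient of variation: spread relative to the mean
cv <- sd_manual / mean(x)

print(c(manual = sd_manual, builtin = sd(x), coef_var = cv))
```
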

8) Graph the distribution

options(scipen = 999)  # set before plotting to avoid scientific notation on the axis

hist(mydata$person_income, 
    main = "Distribution of Income", 
    xlab = "Income", 
    ylab = "Frequency",
    col  = "green")

Explanation:
The histogram shows the distribution of the applicants' annual income in the dataset.
A small number of individuals earn between $300,000 and $500,000; these outliers give the distribution a right skew, which is common for income data.
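
The bar heights of a histogram are simply counts per income interval. As a rough sketch, hist() can return these counts without drawing anything; the incomes and the bin width of 100,000 below are assumptions chosen for illustration.

```r
# Hypothetical incomes with a few very high values
income <- c(9600, 54400, 59000, 65500, 77550, 120000, 300000, 500000)

# Compute the bins only (plot = FALSE); assumed bin width of 100,000
h <- hist(income, breaks = seq(0, 500000, by = 100000), plot = FALSE)

# One row per bin: interval limits and number of observations
bins <- data.frame(
  from  = head(h$breaks, -1),
  to    = tail(h$breaks, -1),
  count = h$counts
)
print(bins)
```

Most observations fall into the first bin while the upper bins are nearly empty, which is the pattern of a right-skewed distribution.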

Number of Loans by Intent

library(ggplot2)
ggplot(mydata, aes(x= loan_intent))+
  geom_bar() + 
  ylab("Frequency") + 
  xlab("Loan Intent") + 
  ggtitle("Number of Loans by Intent") +
  theme_minimal()

Explanation:
The bar chart shows the number of loans for each loan intent.
Education loans are the most common, followed by venture and debt consolidation.

Loan Amount by Loan Intent

library(ggplot2)

ggplot(mydata, aes(x = loan_intent, y = loan_amnt)) +
  geom_boxplot(fill = "lightgray", color = "black") +
  xlab("Loan Intent") +
  ylab("Loan Amount") +
  ggtitle("Loan Amount by Loan Intent") +
  theme_minimal()

Explanation:
The boxplot displays the distribution of loan amounts for each loan intent.
Personal loans have the highest median loan amount, although the medians differ very little across categories.
All categories cover a similar range; only “Home Improvements” shows strong outliers.
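
The numbers behind a boxplot can be inspected directly. The sketch below uses invented data (the data frame `loans` and the vector `amounts` are assumptions) to compute the median per loan intent and to list the points a boxplot would draw as outliers, i.e. values more than 1.5 times the box height beyond the box.

```r
# Hypothetical loans for two intents
loans <- data.frame(
  loan_intent = c("MEDICAL", "MEDICAL", "MEDICAL",
                  "PERSONAL", "PERSONAL", "PERSONAL"),
  loan_amnt   = c(5500, 35000, 35000, 10000, 20000, 30000)
)

# Median loan amount per intent (the thick line in each box)
medians <- tapply(loans$loan_amnt, loans$loan_intent, median)
print(medians)

# Hypothetical amounts for one category with a single extreme value
amounts <- c(1000, 2000, 2500, 3000, 3500, 4000, 35000)

# Values beyond the whiskers, which a boxplot shows as separate points
outliers <- boxplot.stats(amounts)$out
print(outliers)
```
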

credit_risk_dataset

2025-03-22