Descriptive Statistics

Univariate Statistics

Katarzyna Batlińska, Jakub Flizikowski, Kacper Dziurgot

2024-04-26

Your turn!

Your task this week is to prepare your own descriptive analysis of the “CreditCard” dataset (AER package). It is a cross-sectional data frame on the credit history of a sample of applicants for a type of credit card.

Summary and frequency

Let’s look at the income data, grouped into classes, and at the tabular accuracy (TA) index of that grouping.

##        # classes  Goodness of fit Tabular accuracy 
##       15.0000000        0.9790551        0.8302798
Frequency Table for Income
                        Freq  Percent  Valid Percent  Cumulative Percent
Valid    (0,0.9]           2      0.2            0.2                 0.2
         (0.9,1.8]       138     10.5           10.5                10.6
         (1.8,2.7]       448     34.0           34.0                44.6
         (2.7,3.6]       328     24.9           24.9                69.4
         (3.6,4.5]       172     13.0           13.0                82.5
         (4.5,5.4]        92      7.0            7.0                89.5
         (5.4,6.3]        52      3.9            3.9                93.4
         (6.3,7.2]        38      2.9            2.9                96.3
         (7.2,8.1]        19      1.4            1.4                97.7
         (8.1,9]           8      0.6            0.6                98.3
         (9,9.9]           3      0.2            0.2                98.6
         (9.9,10.8]       14      1.1            1.1                99.6
         (10.8,11.7]       2      0.2            0.2                99.8
         (11.7,12.6]       2      0.2            0.2                99.9
         (12.6,13.5]       1      0.1            0.1               100.0
         Total          1319    100.0          100.0
Missing  <blank>           0      0.0
         <NA>              0      0.0
Total                   1319    100.0
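A table of this shape can be rebuilt with base R alone. The sketch below is only an illustration: `income` is a simulated stand-in for `CreditCard$income`, and the column names of `freq_table` are made up for this example. The breaks (15 right-closed classes of width 0.9) are read off the table above.

```r
# Simulated stand-in for CreditCard$income (yearly income in USD 10,000).
# With AER installed, use instead:
#   data("CreditCard", package = "AER"); income <- CreditCard$income
set.seed(1)
income <- pmin(rlnorm(1319, meanlog = 1.1, sdlog = 0.45), 13.5)

# 15 right-closed classes of width 0.9, as in the table above
breaks  <- seq(0, 13.5, by = 0.9)
classes <- cut(income, breaks = breaks)

# Counts, percentages and cumulative percentages per class
freq <- table(classes)
freq_table <- data.frame(
  class       = names(freq),
  freq        = as.integer(freq),
  percent     = round(100 * as.integer(freq) / length(income), 1),
  cum_percent = round(100 * cumsum(as.integer(freq)) / length(income), 1)
)
print(freq_table)
```

The cumulative column is computed from the raw counts rather than from the rounded percentages, so it can differ from a running sum of the `percent` column by 0.1.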

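With 15 classes, the tabular accuracy index is about 0.83 and the goodness of fit is about 0.98, so the grouping represents the income data well.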
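To make the two indices less of a black box, here is a minimal base-R sketch, assuming the standard definitions: goodness of fit as one minus the ratio of within-class to total squared deviations, and tabular accuracy as the same ratio with absolute deviations. The data are simulated, so the values will not match the output above.

```r
# Simulated stand-in for CreditCard$income (yearly income in USD 10,000)
set.seed(1)
income  <- pmin(rlnorm(1319, meanlog = 1.1, sdlog = 0.45), 13.5)
classes <- droplevels(cut(income, breaks = seq(0, 13.5, by = 0.9)))

# Replace each observation by the mean of its class
class_mean <- ave(income, classes)

# Goodness of variance fit: 1 - within-class SS / total SS
gvf <- 1 - sum((income - class_mean)^2) / sum((income - mean(income))^2)

# Tabular accuracy index: 1 - within-class absolute deviations / total absolute deviations
tai <- 1 - sum(abs(income - class_mean)) / sum(abs(income - mean(income)))

round(c(classes = nlevels(classes), gvf = gvf, tai = tai), 4)
```

Both indices move towards 1 as the number of classes grows, so they are best compared between groupings with the same number of classes.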

Plots

Now let’s look at some plots.

The plots show the distributions of age, income and expenditure, together with a boxplot of the ratio of monthly credit card expenditure to yearly income by credit risk.

Correlation Heatmap

We can observe that expenditure is highly correlated with share (the ratio of monthly credit card expenditure to yearly income).
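The matrix behind such a heatmap can be computed with `cor()`. The sketch below uses a simulated data frame `sim` (a hypothetical stand-in, not the real data), in which `share` is built from `expenditure` and `income` purely so that the example has a strong correlation to show.

```r
# Simulated stand-in for the numeric columns of CreditCard
set.seed(1)
n <- 500
sim <- data.frame(
  age         = round(runif(n, 18, 70)),
  income      = rlnorm(n, meanlog = 1.1, sdlog = 0.45),  # USD 10,000 per year
  expenditure = rexp(n, rate = 1 / 185)                  # USD per month
)
# Assumption for this simulation only: annualised expenditure over yearly income
sim$share <- 12 * sim$expenditure / (sim$income * 10000)

# Pearson correlation matrix, rounded for display
corr <- round(cor(sim, method = "pearson"), 2)
print(corr)
```

With the real data, the same call on the numeric columns of `CreditCard` gives the matrix that a heatmap function would then colour.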

Further Analysis

Are yearly income (in USD 10,000), credit card expenditure, age, and the ratio of monthly credit card expenditure to yearly income significantly different for applicants with different credit risk (“card” variable, a factor)?

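# Load ggplot2 and the CreditCard data from the AER package
library(ggplot2)
data("CreditCard", package = "AER")

# Histogram of yearly income, split by card status
ggplot(CreditCard, aes(x = income, fill = card)) +
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Histogram of Yearly Income", x = "Yearly Income (x10,000 USD)", y = "Frequency")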

# Histogram for Age
ggplot(CreditCard, aes(x = age, fill = card)) +
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Histogram of Age", x = "Age", y = "Frequency")

# Box plot of credit card expenditure by card status
ggplot(CreditCard, aes(x = card, y = expenditure, fill = card)) +
  geom_boxplot() +
  labs(title = "Box Plot of Credit Card Expenditure by Card Status", x = "Card", y = "Expenditure")

# Box plot of the expenditure-to-income ratio by card status
# (`share` is the dataset's ratio column; assumed to be what expenditure_income_ratio referred to)
ggplot(CreditCard, aes(x = card, y = share, fill = card)) +
  geom_boxplot() +
  labs(title = "Expenditure to Income Ratio by Card Status", x = "Card", y = "Ratio")

# Scatter plot for Age vs. Yearly Income
ggplot(CreditCard, aes(x = age, y = income, color = card)) +
  geom_point(alpha = 0.6) +
  labs(title = "Age vs. Yearly Income", x = "Age", y = "Yearly Income (x10,000 USD)")

# Scatter plot of age vs. expenditure-to-income ratio
# (`share` is the dataset's ratio column; assumed to be what expenditure_income_ratio referred to)
ggplot(CreditCard, aes(x = age, y = share, color = card)) +
  geom_point(alpha = 0.6) +
  labs(title = "Age vs. Expenditure to Income Ratio", x = "Age", y = "Ratio")

# Histogram of yearly income for all applicants
ggplot(CreditCard, aes(x = income)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Income distribution of applicants",
       x = "Yearly income (x10,000 USD)",
       y = "Number of people") +
  theme_minimal()

We see that most applicants have a yearly income of around 2.5, that is, about USD 25,000.

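# Scatter plot of age vs. monthly income in USD
# (assumption: monthly_income is not a column of CreditCard, so it is derived
#  here from yearly income, which is recorded in USD 10,000)
ggplot(CreditCard, aes(x = age, y = income * 10000 / 12)) +
  geom_point(alpha = 0.6) +
  labs(title = "Correlation between age and monthly income",
       x = "Age",
       y = "Monthly income (USD)") +
  theme_minimal()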

Young people have the lowest incomes. Income tends to rise slightly in middle age and then levels off or declines in old age.
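A pattern like this is easier to check with a summary by age band than by eye. The sketch below uses simulated data with a deliberately hump-shaped income profile (a hypothetical choice, only so that there is something to summarise); the band limits are also assumptions.

```r
# Simulated ages and incomes; the hump-shaped profile is an assumption
set.seed(1)
n      <- 1000
age    <- round(runif(n, 18, 75))
income <- rlnorm(n, meanlog = 1.1 - ((age - 45) / 60)^2, sdlog = 0.4)

# Median income within each age band
age_band       <- cut(age, breaks = c(17, 25, 35, 45, 55, 65, 75))
median_by_band <- tapply(income, age_band, median)
print(round(median_by_band, 2))

# Rank correlation as a single-number summary of the monotone trend
cor(age, income, method = "spearman")
```

A rank correlation close to zero together with clearly different band medians would point to a non-monotone relationship, which a single correlation coefficient cannot capture.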

Results

Yearly Incomes: Significant differences were observed in yearly incomes among applicants with different credit risk levels. The analysis revealed that applicants classified as “high-risk” tend to have lower yearly incomes compared to those classified as “low-risk”.

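Credit Card Expenditures: Credit card expenditures also varied significantly with credit risk. Expenditure is only recorded for applicants who hold a card, so applicants whose application was rejected (card = “no”, treated here as high-risk) show essentially zero expenditure, and accepted (low-risk) applicants have the higher values.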

Age: Age distributions differ significantly across credit risk levels. The analysis suggests that younger applicants are more likely to be classified as high-risk, while older applicants are more prevalent among low-risk individuals.

Ratio of Monthly Expenditure to Yearly Income: There are significant differences in the ratio of monthly credit card expenditure to yearly income among applicants with different credit risk levels. Rejected (high-risk) applicants have ratios close to zero because they have no card spending, while accepted applicants have clearly higher ratios.
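Statements about significant differences between the two `card` groups can be backed by a two-sample test. The following is a minimal sketch on simulated data, not the analysis actually run for this report: the group sizes and the income gap are invented, and the choice of a Welch t-test alongside a Wilcoxon rank-sum test is an assumption.

```r
# Simulated stand-in for CreditCard$card and CreditCard$income
set.seed(1)
n    <- 400
card <- factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.22, 0.78)))
income <- rlnorm(n, meanlog = ifelse(card == "yes", 1.15, 1.0), sdlog = 0.45)

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(income ~ card)

# Wilcoxon rank-sum test (does not assume normality; useful for skewed income)
ranksum <- wilcox.test(income ~ card)

c(welch = welch$p.value, wilcoxon = ranksum$p.value)
tapply(income, card, median)
```

With the real data, the same two calls can be repeated for `expenditure`, `age` and `share`; because several variables are tested, the p-values could be adjusted with `p.adjust()`.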

Conclusion

In conclusion, the analysis indicates that financial attributes such as yearly incomes, credit card expenditures, age, and the ratio of monthly expenditure to yearly income are significantly different for applicants with different credit risk levels. These findings can inform credit risk assessment strategies and aid in decision-making processes for credit card issuers.