Descriptive Statistics

Univariate Statistics

Your turn!

Your task this week is to: prepare your own descriptive analysis for the “CreditCard” dataset (AER package). It is a cross-sectional dataframe on the credit history for a sample of applicants for a type of credit card.

New dataset

this dataset displays info about credit cards of type “yes” and “no”. based on this test we will perform analisys of those two types.

data(CreditCard)

head(CreditCard, 10)
##    card reports      age income       share expenditure owner selfemp
## 1   yes       0 37.66667 4.5200 0.033269910  124.983300   yes      no
## 2   yes       0 33.25000 2.4200 0.005216942    9.854167    no      no
## 3   yes       0 33.66667 4.5000 0.004155556   15.000000   yes      no
## 4   yes       0 30.50000 2.5400 0.065213780  137.869200    no      no
## 5   yes       0 32.16667 9.7867 0.067050590  546.503300   yes      no
## 6   yes       0 23.25000 2.5000 0.044438400   91.996670    no      no
## 7   yes       0 27.91667 3.9600 0.012575760   40.833330    no      no
## 8   yes       0 29.16667 2.3700 0.076433760  150.790000   yes      no
## 9   yes       0 37.00000 3.8000 0.245627900  777.821700   yes      no
## 10  yes       0 28.41667 3.2000 0.019780000   52.580000    no      no
##    dependents months majorcards active
## 1           3     54          1     12
## 2           3     34          1     13
## 3           4     58          1      5
## 4           0     25          1      7
## 5           2     64          1      5
## 6           0     54          1      1
## 7           2      7          1      5
## 8           0     77          1      3
## 9           0     97          1      6
## 10          0     65          1     18

Format

A data frame containing 1,319 observations on 12 variables.

card: Factor. Was the application for a credit card accepted?

reports: Number of major derogatory reports.

age: Age in years plus twelfths of a year.

income: Yearly income (in USD 10,000).

share: Ratio of monthly credit card expenditure to yearly income.

expenditure: Average monthly credit card expenditure.

owner: Factor. Does the individual own their home?

selfemp: Factor. Is the individual self-employed?

dependents: Number of dependents.

months: Months living at current address.

majorcards: Number of major credit cards held.

active: Number of active credit accounts.

Age in relation to succsessful credit aplication

First analysis covers the relevance of age in getting approval for credit. my first step will be to round the age to integer form so it is easier to group it.

# add column of age in integer form

age_analysis <- CreditCard
age_analysis$int_age <- as.integer(age_analysis$age)

# remove old column of age in double form

age_analysis <- subset(age_analysis, select = c (card, int_age))

# i found some age columns equal 0, it seems like wrong input/ missing data
# therefore i will get rid of them

age_analysis <- filter(age_analysis, int_age !=0)

age_analysis$age_group = as.character("")

age_analysis$age_group <- case_when(
  age_analysis$int_age >= 18 & age_analysis$int_age <= 25 ~ "18 - 25",
  age_analysis$int_age >= 26 & age_analysis$int_age <= 35 ~ "26 - 35",
  age_analysis$int_age >= 36 & age_analysis$int_age <= 50 ~ "36 - 50",
  age_analysis$int_age >= 51  ~ "51",

  TRUE ~ NA_character_ 
)

result <- age_analysis %>%
  group_by(age_group) %>%
  summarise(Yes_Count = sum(card == "yes"), No_Count = sum(card == "no"))

result <- result %>%
  mutate(Percentage_Yes = (Yes_Count / (Yes_Count+No_Count)) * 100)

The graph shows us that most of the samples were collected from age group 26-35 years. Every age group had more instances of successful credit applications. From this data we can assume that age isnt a significant factor in credit aplications.

ggplot(age_analysis, aes(x = age_group, fill = card)) +
  geom_bar(position = "stack") +
  labs(title = "Accepted and denied credit cards in different age groups",
       x = "Age Group",
       y = "Ammount of applications") +
  scale_fill_manual(values = c("yes" = "lightgreen", "no" = "red")) +
  theme_minimal()

## Yearly income relevance in connection to application for credit

Now we can take a look at yearly income.

mean_income <- aggregate(income ~ card, data = CreditCard, FUN = mean)

ggplot(CreditCard, aes(x = income, colour = card, fill = card)) +
  geom_histogram(position = "stack", alpha=0.5, bins = 30, size = 0.75) +
  geom_vline(data = mean_income, aes(xintercept = income, color = card), linetype = "dashed", size=1) +
  scale_color_manual(values = c("yes" = "darkblue", "no" = "gold")) +
  scale_fill_manual(values = c("yes" = "darkblue", "no" = "gold")) +
  labs(title="Credit applications and their outcomes in relation to income",x = "Income in USD 10.000", y = "Applications") +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

From the histogram above we can determine that our income data doesn’t follow normal distribution.

mean of accepted applications is slightly different from the denied ones but the difference isn’t really significant. The values mostly cover each other and behave almost identically therefore i can’t see any significant influence of income size to successful credit card application

Ratio of monthly credit card expenditure to yearly income relation to credit

CreditCard$income = as.numeric(CreditCard$income)
CreditCard= na.omit(CreditCard)

CreditCard$binary_outcome <- as.factor(CreditCard$card)

model <- glm(binary_outcome ~ income, data = CreditCard, family = binomial())

summary(model)
## 
## Call:
## glm(formula = binary_outcome ~ income, family = binomial(), data = CreditCard)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.73452    0.15842   4.636 3.55e-06 ***
## income       0.15582    0.04597   3.390 0.000699 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1404.6  on 1318  degrees of freedom
## Residual deviance: 1391.6  on 1317  degrees of freedom
## AIC: 1395.6
## 
## Number of Fisher Scoring iterations: 4

Average monthly credit card expenditure regarding credit card risk

Now we go on and start analyzing the expenditure. First, for better understanding I am calculating the mean of expenditures, when the credit card risk is ‘yes’ or ‘no’.

# Calculate mean expenditures by card approval status
mean_expenditures <- CreditCard %>%
  group_by(card) %>%
  summarise(Mean_Expenditure = mean(expenditure, na.rm = TRUE))

# Print the table
knitr::kable(mean_expenditures, caption = "Mean of Monthly Expenditure by Card Approval Status")
Mean of Monthly Expenditure by Card Approval Status
card Mean_Expenditure
no 0.0000
yes 238.6024

As we can see, when the card risk is ‘no’, the monthly expenditure is equal 0. Then it is clear that the expenditure differs significantly with various credit risk. Now I will try to visualize the data, only when credit is ‘yes’ to see more clearly how this data changes.

# Filter the data to include only approved applications
approved_expenditures <- CreditCard %>% 
  filter(card == "yes")

# Plotting the density plot
ggplot(approved_expenditures, aes(x = expenditure, fill = "pink")) +
  geom_density(alpha = 0.5) +
  labs(subtitle = "Distribution of expenditures for approved credit card applications",
       x = "Monthly Expenditure (USD)",
       y = "Density") +
  theme_minimal() +
  theme(legend.position = "none")  # Hide legend if not needed

We can see that in this case monthly expenditure changes with it’s growth. The highest density of expenditure is at about 100 USD and then it begins to drop rapidly. Then it is quite rare for the expenditure to be more than 1000 USD a month.

So in this case, the difference in credit risk notably affects how the monthly expenditure changes. When credit is ‘no’ there are no expenditures (they are equal to 0) and when it is ‘yes’ - expenditures are changing with their prices.

ggplot(CreditCard, aes(x = income, y = expenditure, color = card)) +
  geom_point() +
  labs(title = "Relationship between Monthly Credit Card Expenditure and Yearly Income",
       x = "Yearly Income",
       y = "Monthly Credit Card Expenditure") +
  scale_color_manual(values = c("yes" = "lightblue", "no" = "yellow"),
                     labels = c("Denied", "Accepted")) +
  theme_minimal()

From the above plot, we can observe a relationship between expenditures and incomes for applicants, for denied ones expenditures are zero.