Descriptive Statistics
Univariate Statistics
Your turn!
Your task this week is to: prepare your own descriptive analysis for the “CreditCard” dataset (AER package). It is a cross-sectional dataframe on the credit history for a sample of applicants for a type of credit card.
New dataset
this dataset displays info about credit cards of type “yes” and “no”. based on this test we will perform analisys of those two types.
## card reports age income share expenditure owner selfemp
## 1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no
## 2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no
## 3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no
## 4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no
## 5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no
## 6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no
## 7 yes 0 27.91667 3.9600 0.012575760 40.833330 no no
## 8 yes 0 29.16667 2.3700 0.076433760 150.790000 yes no
## 9 yes 0 37.00000 3.8000 0.245627900 777.821700 yes no
## 10 yes 0 28.41667 3.2000 0.019780000 52.580000 no no
## dependents months majorcards active
## 1 3 54 1 12
## 2 3 34 1 13
## 3 4 58 1 5
## 4 0 25 1 7
## 5 2 64 1 5
## 6 0 54 1 1
## 7 2 7 1 5
## 8 0 77 1 3
## 9 0 97 1 6
## 10 0 65 1 18
Format
A data frame containing 1,319 observations on 12 variables.
card: Factor. Was the application for a credit card accepted?
reports: Number of major derogatory reports.
age: Age in years plus twelfths of a year.
income: Yearly income (in USD 10,000).
share: Ratio of monthly credit card expenditure to yearly income.
expenditure: Average monthly credit card expenditure.
owner: Factor. Does the individual own their home?
selfemp: Factor. Is the individual self-employed?
dependents: Number of dependents.
months: Months living at current address.
majorcards: Number of major credit cards held.
active: Number of active credit accounts.
Age in relation to succsessful credit aplication
First analysis covers the relevance of age in getting approval for credit. my first step will be to round the age to integer form so it is easier to group it.
# add column of age in integer form
age_analysis <- CreditCard
age_analysis$int_age <- as.integer(age_analysis$age)
# remove old column of age in double form
age_analysis <- subset(age_analysis, select = c (card, int_age))
# i found some age columns equal 0, it seems like wrong input/ missing data
# therefore i will get rid of them
age_analysis <- filter(age_analysis, int_age !=0)
age_analysis$age_group = as.character("")
age_analysis$age_group <- case_when(
age_analysis$int_age >= 18 & age_analysis$int_age <= 25 ~ "18 - 25",
age_analysis$int_age >= 26 & age_analysis$int_age <= 35 ~ "26 - 35",
age_analysis$int_age >= 36 & age_analysis$int_age <= 50 ~ "36 - 50",
age_analysis$int_age >= 51 ~ "51",
TRUE ~ NA_character_
)
result <- age_analysis %>%
group_by(age_group) %>%
summarise(Yes_Count = sum(card == "yes"), No_Count = sum(card == "no"))
result <- result %>%
mutate(Percentage_Yes = (Yes_Count / (Yes_Count+No_Count)) * 100)The graph shows us that most of the samples were collected from age group 26-35 years. Every age group had more instances of successful credit applications. From this data we can assume that age isnt a significant factor in credit aplications.
ggplot(age_analysis, aes(x = age_group, fill = card)) +
geom_bar(position = "stack") +
labs(title = "Accepted and denied credit cards in different age groups",
x = "Age Group",
y = "Ammount of applications") +
scale_fill_manual(values = c("yes" = "lightgreen", "no" = "red")) +
theme_minimal()
## Yearly income relevance in connection to application for credit
Now we can take a look at yearly income.
mean_income <- aggregate(income ~ card, data = CreditCard, FUN = mean)
ggplot(CreditCard, aes(x = income, colour = card, fill = card)) +
geom_histogram(position = "stack", alpha=0.5, bins = 30, size = 0.75) +
geom_vline(data = mean_income, aes(xintercept = income, color = card), linetype = "dashed", size=1) +
scale_color_manual(values = c("yes" = "darkblue", "no" = "gold")) +
scale_fill_manual(values = c("yes" = "darkblue", "no" = "gold")) +
labs(title="Credit applications and their outcomes in relation to income",x = "Income in USD 10.000", y = "Applications") +
theme_minimal()## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
From the histogram above we can determine that our income data doesn’t follow normal distribution.
mean of accepted applications is slightly different from the denied ones but the difference isn’t really significant. The values mostly cover each other and behave almost identically therefore i can’t see any significant influence of income size to successful credit card application
Ratio of monthly credit card expenditure to yearly income relation to credit
CreditCard$income = as.numeric(CreditCard$income)
CreditCard= na.omit(CreditCard)
CreditCard$binary_outcome <- as.factor(CreditCard$card)
model <- glm(binary_outcome ~ income, data = CreditCard, family = binomial())
summary(model)##
## Call:
## glm(formula = binary_outcome ~ income, family = binomial(), data = CreditCard)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.73452 0.15842 4.636 3.55e-06 ***
## income 0.15582 0.04597 3.390 0.000699 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1404.6 on 1318 degrees of freedom
## Residual deviance: 1391.6 on 1317 degrees of freedom
## AIC: 1395.6
##
## Number of Fisher Scoring iterations: 4
Average monthly credit card expenditure regarding credit card risk
Now we go on and start analyzing the expenditure. First, for better understanding I am calculating the mean of expenditures, when the credit card risk is ‘yes’ or ‘no’.
# Calculate mean expenditures by card approval status
mean_expenditures <- CreditCard %>%
group_by(card) %>%
summarise(Mean_Expenditure = mean(expenditure, na.rm = TRUE))
# Print the table
knitr::kable(mean_expenditures, caption = "Mean of Monthly Expenditure by Card Approval Status")| card | Mean_Expenditure |
|---|---|
| no | 0.0000 |
| yes | 238.6024 |
As we can see, when the card risk is ‘no’, the monthly expenditure is equal 0. Then it is clear that the expenditure differs significantly with various credit risk. Now I will try to visualize the data, only when credit is ‘yes’ to see more clearly how this data changes.
# Filter the data to include only approved applications
approved_expenditures <- CreditCard %>%
filter(card == "yes")
# Plotting the density plot
ggplot(approved_expenditures, aes(x = expenditure, fill = "pink")) +
geom_density(alpha = 0.5) +
labs(subtitle = "Distribution of expenditures for approved credit card applications",
x = "Monthly Expenditure (USD)",
y = "Density") +
theme_minimal() +
theme(legend.position = "none") # Hide legend if not neededWe can see that in this case monthly expenditure changes with it’s growth. The highest density of expenditure is at about 100 USD and then it begins to drop rapidly. Then it is quite rare for the expenditure to be more than 1000 USD a month.
So in this case, the difference in credit risk notably affects how the monthly expenditure changes. When credit is ‘no’ there are no expenditures (they are equal to 0) and when it is ‘yes’ - expenditures are changing with their prices.
ggplot(CreditCard, aes(x = income, y = expenditure, color = card)) +
geom_point() +
labs(title = "Relationship between Monthly Credit Card Expenditure and Yearly Income",
x = "Yearly Income",
y = "Monthly Credit Card Expenditure") +
scale_color_manual(values = c("yes" = "lightblue", "no" = "yellow"),
labels = c("Denied", "Accepted")) +
theme_minimal()
From the above plot, we can observe a relationship between expenditures
and incomes for applicants, for denied ones expenditures are zero.