Univariate Analysis

Your turn!

The R code below is a first step in analyzing the “CreditCard” dataset from the AER package. data(CreditCard) loads the dataset into the R environment, and cardHead <- head(CreditCard) stores the first six rows for a quick preview. str(CreditCard) then shows the structure of the dataset: 1319 observations of 12 variables, together with the type of each variable. Finally, any(is.na(CreditCard)) checks whether any value is missing; it returns FALSE, so the dataset is complete and ready for further analysis.

library(AER)  # provides the CreditCard dataset
data(CreditCard)
cardHead <- head(CreditCard)
str(CreditCard)

## 'data.frame':    1319 obs. of  12 variables:
##  $ card       : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ reports    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ age        : num  37.7 33.2 33.7 30.5 32.2 ...
##  $ income     : num  4.52 2.42 4.5 2.54 9.79 ...
##  $ share      : num  0.03327 0.00522 0.00416 0.06521 0.06705 ...
##  $ expenditure: num  124.98 9.85 15 137.87 546.5 ...
##  $ owner      : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 2 1 ...
##  $ selfemp    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dependents : num  3 3 4 0 2 0 2 0 0 0 ...
##  $ months     : num  54 34 58 25 64 54 7 77 97 65 ...
##  $ majorcards : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ active     : num  12 13 5 7 5 1 5 3 6 18 ...

# First, check whether the dataset contains any NA values.
any(is.na(CreditCard))

## [1] FALSE

# No missing values were found, so the dataset is clean.
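If the overall check ever returned TRUE, a per-column count would show where the gaps are. The following is a minimal sketch on a small made-up data frame; toy_df and its values are assumptions for illustration and are not part of CreditCard.

```r
# Toy data frame with deliberately missing values (illustrative only)
toy_df <- data.frame(
  age    = c(37.7, NA, 33.7),
  income = c(4.52, 2.42, NA),
  card   = factor(c("yes", "no", "yes"))
)

# Overall check, as in the text
has_na <- any(is.na(toy_df))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_df))
print(has_na)
print(na_per_column)
```

The same two calls can be applied to CreditCard directly once the AER package is loaded.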

In this segment of the R code, the columns of the “CreditCard” dataset are explicitly set to the data types used in the rest of the analysis. The ‘card’, ‘owner’, and ‘selfemp’ columns are stored as factors because they represent categories. The columns ‘reports’, ‘age’, ‘income’, ‘share’, ‘expenditure’, ‘dependents’, ‘months’, ‘majorcards’, and ‘active’ are stored as numeric because they are used in numerical calculations and statistical analyses. The str() output above shows that the columns already have these types, so the conversions act as a safeguard rather than changing the data.

CreditCard$card <- as.factor(CreditCard$card)
CreditCard$owner <- as.factor(CreditCard$owner)
CreditCard$selfemp <- as.factor(CreditCard$selfemp)
CreditCard$reports <- as.numeric(CreditCard$reports)
CreditCard$age <- as.numeric(CreditCard$age)
CreditCard$income <- as.numeric(CreditCard$income)
CreditCard$share <- as.numeric(CreditCard$share)
CreditCard$expenditure <- as.numeric(CreditCard$expenditure)
CreditCard$dependents <- as.numeric(CreditCard$dependents)
CreditCard$months <- as.numeric(CreditCard$months)
CreditCard$majorcards <- as.numeric(CreditCard$majorcards)
CreditCard$active <- as.numeric(CreditCard$active)
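Since str() already reports factor and numeric columns, one way to confirm that such conversions leave the data untouched is to compare the data frame before and after. This is a small sketch on a hypothetical toy_df, not on the real dataset.

```r
# Toy data frame whose columns already have the target types
toy_df <- data.frame(
  card    = factor(c("yes", "no", "yes")),
  reports = c(0, 1, 0)
)

# Column classes before converting
classes_before <- sapply(toy_df, class)

converted <- toy_df
converted$card    <- as.factor(converted$card)
converted$reports <- as.numeric(converted$reports)

# identical() is TRUE when the conversion changed nothing
unchanged <- identical(toy_df, converted)
print(classes_before)
print(unchanged)
```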

In the code below, we segment the yearly incomes from the “CreditCard” dataset into brackets from “0-20 kUSD” to “120-140 kUSD” using R’s ‘cut’ function. The income column is recorded in units of USD 10,000, so breaks from 0 to 14 in steps of 2 correspond to brackets of 20 kUSD. Each label names one income range, which simplifies the analysis of the income distribution among credit card applicants. The ‘freq’ function then creates a frequency table of these income intervals, formatted as HTML for use in web-based output.

# yearly income
incomes <- c("0-20 kUSD", "20-40 kUSD", "40-60 kUSD", "60-80 kUSD", "80-100 kUSD", "100-120 kUSD", "120-140 kUSD")
income_intervals <- cut(CreditCard$income, breaks = seq(0, 14, by = 2), labels = incomes)
table_of_incomes <- freq(income_intervals, type = "html")

In this R code, we build a frequency table of the income data. The ‘Freq’ function bins the incomes with the same breaks, and ‘useNA=“always”’ adds a row for missing values even when there are none. The table is then formatted with ‘kable’ from the ‘knitr’ package, given descriptive column names, and styled with ‘kable_classic’ in the “Comic Sans MS” font. Note that the two “Percentage” columns contain proportions between 0 and 1, so 0.59 means 59%. The table shows that most applicants (about 59%) have an income between 20 and 40 kUSD.

tab1<-Freq(CreditCard$income,breaks=seq(0, 14,by = 2), useNA="always")
tab1 %>% kable(col.names = c("Incomes of Credit Card Users","Frequency","Percentage %","Cumulative frequency","Cumulative percentage %")) %>% kable_classic(full_width = T, html_font = "Comic Sans MS")

Incomes of Credit Card Users	Frequency	Percentage %	Cumulative frequency	Cumulative percentage %
[0,2]	236	0.1789234	236	0.1789234
(2,4]	783	0.5936315	1019	0.7725550
(4,6]	205	0.1554208	1224	0.9279757
(6,8]	63	0.0477635	1287	0.9757392
(8,10]	23	0.0174375	1310	0.9931766
(10,12]	7	0.0053071	1317	0.9984837
(12,14]	2	0.0015163	1319	1.0000000
<NA>	0	0.0000000	1319	1.0000000
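The five columns of the table above (interval, count, proportion, and the two running totals) can also be sketched with base R alone. The example below uses made-up incomes; toy_income and the column names level, freq, perc, cumfreq, and cumperc are chosen for illustration and are not taken from the report.

```r
# Toy incomes in units of USD 10,000 (illustrative values only)
toy_income <- c(1.5, 2.4, 3.0, 3.9, 4.5, 7.2)

# include.lowest = TRUE closes the first interval on the left: [0,2]
bins <- cut(toy_income, breaks = seq(0, 14, by = 2), include.lowest = TRUE)

# useNA = "always" adds a row for missing values even if there are none
counts <- table(bins, useNA = "always")

freq_table <- data.frame(
  level   = names(counts),
  freq    = as.vector(counts),
  perc    = as.vector(counts) / sum(counts),
  cumfreq = cumsum(as.vector(counts)),
  cumperc = cumsum(as.vector(counts)) / sum(counts)
)
print(freq_table)
```

Multiplying perc and cumperc by 100 would turn the proportions into percentages if that matches the column headings better.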

We organize the credit card expenditure data into ranges from “0-200” up to “2000-4000” to simplify the analysis. The cut function does this: ‘breaks’ sets the boundaries of each range, with steps of 200 up to 1000 and wider ranges above that, and ‘labels’ names the ranges.

The freq function then generates an HTML-formatted frequency table (‘table_of_expenditures’) of these intervals, which shows how expenditures are distributed across the different levels in the dataset.

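# expenditures
expenditure <- c("0-200", "200-400", "400-600", "600-800", "800-1000", "1000-2000", "2000-4000")
# The upper break points are listed explicitly: seq(1001, 2000, by = 1000) and
# seq(2001, 4000, by = 2000) would contribute only 1001 and 2001, which leaves
# the labels "1000-2000" and "2000-4000" attached to (1000, 1001] and (1001, 2001].
breaks <- c(seq(0, 1000, by = 200), 2000, 4000)
# include.lowest = TRUE keeps zero expenditures in the first interval instead of NA.
expenditure_intervals <- cut(CreditCard$expenditure, breaks = breaks, labels = expenditure, include.lowest = TRUE)
table_of_expenditures <- freq(expenditure_intervals, type = "HTML")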

We analyze the expenditure data with the Freq function, which creates a frequency table for expenditures in intervals from 0-200 up to 2000-4000. Setting ‘useNA=“always”’ adds a row for missing values. The table (tab2) is then formatted with ‘kable’ from the ‘knitr’ package, with column names that describe the expenditure categories and their statistics, and styled with ‘kable_classic’ at full width in the “Comic Sans MS” font. Because the break vector contains both 1000 and 1001, the table includes an empty (1000,1001] row. The table shows that about 69% of applicants spend 200 or less.

tab2<-Freq(CreditCard$expenditure,breaks=c(seq(0, 1000,by = 200), seq(1001, 2000, by = 1000), seq(2000, 4000, by = 2000)), useNA="always")
tab2 %>% kable(col.names = c("Expenditures of Credit Card Users","Frequency","Percentage %","Cumulative frequency","Cumulative percentage %")) %>% kable_classic(full_width = T, html_font = "Comic Sans MS")

Expenditures of Credit Card Users	Frequency	Percentage %	Cumulative frequency	Cumulative percentage %
[0,200]	915	0.6937074	915	0.6937074
(200,400]	223	0.1690675	1138	0.8627748
(400,600]	102	0.0773313	1240	0.9401061
(600,800]	37	0.0280516	1277	0.9681577
(800,1000]	21	0.0159212	1298	0.9840788
(1000,1001]	0	0.0000000	1298	0.9840788
(1001,2000]	18	0.0136467	1316	0.9977255
(2000,4000]	3	0.0022745	1319	1.0000000
<NA>	0	0.0000000	1319	1.0000000
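The empty (1000,1001] row above comes from how seq() builds the break points. The sketch below shows what each seq() call contributes and how cut() treats values at or beyond the boundaries. The vector toy_spend and the other variable names are made up for illustration.

```r
# What the chained seq() calls actually contribute
upper_a <- seq(1001, 2000, by = 1000)  # 1001 only, since 2001 exceeds 2000
upper_b <- seq(2001, 4000, by = 2000)  # 2001 only, since 4001 exceeds 4000
chained_breaks <- c(seq(0, 1000, by = 200), upper_a, upper_b)

# Explicit break points covering 0 to 4000
explicit_breaks <- c(seq(0, 1000, by = 200), 2000, 4000)
labels <- c("0-200", "200-400", "400-600", "600-800", "800-1000",
            "1000-2000", "2000-4000")

toy_spend <- c(0, 150, 1500, 3000)  # illustrative values only

# Default cut(): intervals are (a, b], so 0 is excluded from (0, 200]
with_chained <- cut(toy_spend, breaks = chained_breaks, labels = labels)

# include.lowest = TRUE turns the first interval into [0, 200]
with_explicit <- cut(toy_spend, breaks = explicit_breaks, labels = labels,
                     include.lowest = TRUE)

print(data.frame(toy_spend, with_chained, with_explicit))
```

With the chained breaks, 0 and 3000 are left unlabeled and 1500 receives the label "2000-4000"; with the explicit breaks every toy value lands in the range its label names.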

The R code segments ages from the “CreditCard” dataset into decade groups like “0-10” to “80-90.” Using ‘cut’, ages are categorized into these brackets. Then, ‘freq’ creates a straightforward HTML frequency table, allowing for a quick visualization of age distribution among the credit card applicants.

#age
age <- c("0-10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80", "80-90")
breaks <- c(seq(0, 90, by = 10))
age_intervals <- cut(CreditCard$age, breaks = breaks, labels = age)
table_of_age <- freq(age_intervals, type="html")

We analyze the age data by grouping it into ten-year intervals from 0 to 90 with the ‘Freq’ function; ‘useNA=“always”’ adds a row for missing values. The resulting frequency table (tab3) is formatted with ‘kable’ from the ‘knitr’ package, with custom column names such as “Ages of Credit Card Users”, and styled with ‘kable_classic’ at full width in the “Comic Sans MS” font. The table shows that about 75% of the applicants are between 20 and 40 years old.

tab3<-Freq(CreditCard$age,breaks=seq(0, 90,by = 10), useNA="always")
tab3 %>% kable(col.names = c("Ages of Credit Card Users","Frequency","Percentage %","Cumulative frequency","Cumulative percentage %")) %>% kable_classic(full_width = T, html_font = "Comic Sans MS")

Ages of Credit Card Users	Frequency	Percentage %	Cumulative frequency	Cumulative percentage %
[0,10]	7	0.0053071	7	0.0053071
(10,20]	22	0.0166793	29	0.0219864
(20,30]	565	0.4283548	594	0.4503412
(30,40]	422	0.3199393	1016	0.7702805
(40,50]	219	0.1660349	1235	0.9363154
(50,60]	61	0.0462472	1296	0.9825625
(60,70]	19	0.0144049	1315	0.9969674
(70,80]	2	0.0015163	1317	0.9984837
(80,90]	2	0.0015163	1319	1.0000000
<NA>	0	0.0000000	1319	1.0000000

In this segment of our R code, we classify the number of dependents for each credit card applicant into categories from “0” to “6”. Because ‘cut’ builds intervals that are open on the left and closed on the right, the breaks run from -1 to 6 (‘seq(-1, 6, by = 1)’): the interval (-1, 0] holds applicants with no dependents, (0, 1] those with one dependent, and so on, so each label matches the count it names.

Once these intervals are defined, the freq function generates an HTML frequency table (‘table_of_dependents’) that shows how many applicants fall into each dependents category.

#dependents
dependents <- c("0", "1", "2", "3", "4", "5", "6")
# Breaks start at -1 so that (-1, 0] maps to "0"; with seq(0, 7, by = 1) a count of 0 would become NA
dependents_intervals <- cut(CreditCard$dependents, breaks = seq(-1, 6, by = 1), labels = dependents)
table_of_dependents <- freq(dependents_intervals, type = "html")

tab4<-Freq(CreditCard$dependents,breaks=seq(0, 6,by = 1), useNA="always")
tab4 %>% kable(col.names = c("The Numbers of Dependents","Frequency","Percentage %","Cumulative frequency","Cumulative percentage %")) %>% kable_classic(full_width = T, html_font = "Comic Sans MS")

The Numbers of Dependents	Frequency	Percentage %	Cumulative frequency	Cumulative percentage %
[0,1]	926	0.7020470	926	0.7020470
(1,2]	218	0.1652767	1144	0.8673237
(2,3]	115	0.0871873	1259	0.9545110
(3,4]	44	0.0333586	1303	0.9878696
(4,5]	9	0.0068234	1312	0.9946929
(5,6]	7	0.0053071	1319	1.0000000
<NA>	0	0.0000000	1319	1.0000000

We also summarize the dependents data with the ‘Freq’ function, using breaks from 0 to 6; ‘useNA=“always”’ adds a row for missing values. Note that the first interval, [0,1], is closed at both ends, so it combines applicants with no dependents and applicants with one dependent (926 in total, about 70%). The resulting frequency table, ‘tab4’, is formatted with ‘kable’ from the ‘knitr’ package, with explicit column names such as “The Numbers of Dependents”, and styled with ‘kable_classic’ at full width in the “Comic Sans MS” font.
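For a count variable such as the number of dependents, interval binning and per-value counting give different tables. The sketch below contrasts the two on made-up values; toy_dependents is an assumption for illustration.

```r
toy_dependents <- c(0, 0, 1, 2, 0, 3, 1)  # illustrative values only

# Binning with breaks 0..6 merges 0 and 1 into the first interval [0,1]
binned <- cut(toy_dependents, breaks = seq(0, 6, by = 1), include.lowest = TRUE)
binned_counts <- table(binned)
print(binned_counts)

# Treating each count as its own category keeps 0 and 1 apart
per_count <- table(factor(toy_dependents, levels = 0:6))
print(per_count)
```

Treating the variable as a factor is one option when each distinct count should have its own row.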

CreditCard$ratio <- (CreditCard$expenditure / CreditCard$months) / (CreditCard$income*10000)
#ratio of monthly credit card expenditure to yearly income
ratio <- c("0-0.004", "0.004-0.008", "0.008-0.012", "0.012-0.016", "0.016-0.020", "0.020-0.024", "0.024-0.028")
breaks <- c(seq(0, 0.028, by = 0.004))
ratio_intervals <- cut(CreditCard$ratio, breaks = breaks, labels = ratio)
table_of_ratio <- freq(ratio_intervals, type = "html")

Here, we look at the ratio of expenditure to income, computed above as expenditure divided by months and then by income in USD (income × 10,000). The ‘Freq’ function segments this ratio into intervals from 0 to 0.028 in steps of 0.004, and ‘useNA=“always”’ adds a row for values that are missing or that fall outside these intervals.

The resulting frequency table, ‘tab5’, is formatted with ‘kable’ from the knitr package, with column headings such as “Incomes/Expenditures Ratio” (although the quantity is expenditure over income), and styled with ‘kable_classic’ at full width in the “Comic Sans MS” font. The table shows that about 97% of applicants have a ratio of at most 0.004 and that three values are not assigned to any interval.

tab5<-Freq(CreditCard$ratio,breaks=seq(0, 0.028,by = 0.004), useNA="always")
tab5 %>% kable(col.names = c("Incomes/Expenditures Ratio","Frequency","Percentage %","Cumulative frequency","Cumulative percentage %")) %>% kable_classic(full_width = T, html_font = "Comic Sans MS")

Incomes/Expenditures Ratio	Frequency	Percentage %	Cumulative frequency	Cumulative percentage %
[0,0.004]	1286	0.9749810	1286	0.9749810
(0.004,0.008]	18	0.0136467	1304	0.9886277
(0.008,0.012]	7	0.0053071	1311	0.9939348
(0.012,0.016]	2	0.0015163	1313	0.9954511
(0.016,0.02]	2	0.0015163	1315	0.9969674
(0.02,0.024]	0	0.0000000	1315	0.9969674
(0.024,0.028]	1	0.0007582	1316	0.9977255
<NA>	3	0.0022745	1319	1.0000000
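The three <NA> entries are values that fall outside every interval. One possible source, assuming some applicants have months equal to 0, is division by zero; values above 0.028 would end up in the same row. The toy example below (all numbers made up) shows how such values behave when binned.

```r
# Illustrative values only; the second and third rows have months = 0
toy_expenditure <- c(120, 50, 0)
toy_months      <- c(12, 0, 0)
toy_income      <- c(4.5, 3.0, 2.0)  # units of USD 10,000

toy_ratio <- (toy_expenditure / toy_months) / (toy_income * 10000)
print(toy_ratio)  # a finite value, then Inf (50 / 0), then NaN (0 / 0)

# Inf and NaN match no interval, so cut() returns NA for them
binned <- cut(toy_ratio, breaks = seq(0, 0.028, by = 0.004), include.lowest = TRUE)
print(table(binned, useNA = "always"))
```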

The main purpose of the code above is to divide the variables of the CreditCard dataset into intervals, count the observations in each interval, and present the results as HTML tables in the chosen font.

PLOTS

incomes <- c("0-20 kUSD", "20-40 kUSD", "40-60 kUSD", "60-80 kUSD", "80-100 kUSD", "100-120 kUSD", "120-140 kUSD")
income_intervals <- cut(CreditCard$income, breaks = seq(0, 14, by = 2), labels = incomes)


CreditCard$income_intervals <- income_intervals


plot <- ggplot(CreditCard, aes(x = income_intervals, fill = factor(card))) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Number of People by Income and Card Status", x = "Income Bracket", y = "Number of People", fill = "Card Status") +
  coord_flip() + scale_fill_manual(values = c("yes" = "blue", "no" = "red")) + theme_minimal()

print(plot)

ages <- c("0-10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80", "80-90", "90-100")
age_intervals <- cut(CreditCard$age, breaks = seq(0, 100, by = 10), labels = ages)


CreditCard$age_intervals <- age_intervals


plot <- ggplot(CreditCard, aes(x = age_intervals, fill = factor(card))) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Number of People by Age and Card Status", x = "Age Bracket", y = "Number of People", fill = "Card Status") + coord_flip() +  
  scale_fill_manual(values = c("yes" = "blue", "no" = "red")) + theme_minimal()  

print(plot)

This R code uses the income, age, and card columns of the CreditCard data frame: it groups applicants into income and age brackets and shows the number of people in each bracket by credit card status. The resulting bar charts are a simple way to compare card ownership across income and age levels.

In the first graph, we can see that the largest group of card holders earns 20-40 kUSD, and the largest group of non-card holders also falls in this bracket, so most applicants in the sample earn this amount.

In the second graph, people between the ages of 20 and 30 form the largest group of card holders. Card holders also outnumber non-card holders in every age group, which suggests that card use is widespread in this sample.
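Bar heights show counts, so large brackets dominate the picture. Comparing the share of card holders within each bracket can complement the plots. The sketch below does this with base R on a made-up data frame; toy, its brackets, and its values are assumptions for illustration.

```r
# Toy applicants (illustrative only)
toy <- data.frame(
  bracket = factor(c("20-40", "20-40", "20-40", "40-60", "40-60", "20-40")),
  card    = factor(c("yes", "no", "yes", "yes", "yes", "no"))
)

# Counts per bracket and card status, as shown by the dodged bars
counts <- table(toy$bracket, toy$card)

# Share of "no" and "yes" within each bracket (each row sums to 1)
rates <- prop.table(counts, margin = 1)
print(counts)
print(rates)
```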

age <- c("0-10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80", "80-90")
breaks <- c(seq(0, 90, by = 10))
CreditCard$age_group <- cut(CreditCard$age, breaks = breaks, labels = age, include.lowest = TRUE)

avg_expenditures <- CreditCard %>%
  group_by(age, card) %>%
  summarise(avg_expenditure = mean(expenditure, na.rm = TRUE), .groups = 'drop')

ggplot(avg_expenditures, aes(x = age, y = avg_expenditure)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Expenditures by Ages", x = "Age Group", y = "Average Expenditure") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

When we examine the plot above, we see that the young population (20-30) is the most likely to pay by credit card, but their average spending does not exceed 500. Because the summary is grouped by the raw age column rather than by age_group, the x-axis shows individual ages rather than ten-year groups, and the highest spending appears around the ages of 33-35. With increasing age, both credit card usage and the amount spent decrease, most noticeably from the age of 50 onwards.
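An average per ten-year group, rather than per individual age, can be sketched in base R as follows. The data frame toy and its values are made up for illustration.

```r
# Toy applicants (illustrative only)
toy <- data.frame(
  age         = c(24, 28, 33, 35, 52, 61),
  expenditure = c(100, 200, 400, 600, 150, 50)
)

# Ten-year age groups, with the lowest boundary included
toy$age_group <- cut(toy$age, breaks = seq(0, 90, by = 10), include.lowest = TRUE)

# Mean expenditure per age group; empty groups come back as NA and are dropped
avg_by_group <- tapply(toy$expenditure, toy$age_group, mean)
avg_by_group <- avg_by_group[!is.na(avg_by_group)]
print(avg_by_group)
```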

CreditCard <- CreditCard %>%
  mutate(monthly_expenditure = expenditure / 12, expenditure_income_ratio = monthly_expenditure / income)

ggplot(CreditCard, aes(x = expenditure_income_ratio)) +
  geom_histogram(binwidth = 0.75, fill = "blue", color = "black") +
  labs(title = "Monthly Expenditure to Yearly Income Ratio",
       x = "Expenditure/Income Ratio",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

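This R script adds two new variables to the CreditCard data frame, monthly_expenditure (expenditure divided by 12) and expenditure_income_ratio (monthly_expenditure divided by income), and then draws a histogram of the ratio with a bin width of 0.75. The histogram shows how spending relates to income across applicants. Two points matter when reading the scale: income is recorded in units of USD 10,000, and the AER documentation describes expenditure as the average monthly credit card expenditure, so dividing by 12 rescales the value rather than converting annual spending to monthly spending.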
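To make the scale of this ratio concrete, the sketch below evaluates it for the first applicant, using the rounded values printed by str(CreditCard). The second quantity, expenditure multiplied by 12 and divided by income in USD, is an assumption about how the dataset's share column may relate to the other two columns; the source does not state this, and it is only compared against this one row.

```r
# First applicant, rounded values as shown by str(CreditCard)
expenditure <- 124.98
income      <- 4.52     # units of USD 10,000
share       <- 0.03327

# Ratio as computed in the text: (expenditure / 12) / income
ratio_in_text <- (expenditure / 12) / income

# Assumed alternative: annualised expenditure over income in USD
annualised_share <- (expenditure * 12) / (income * 10000)

print(c(ratio_in_text = ratio_in_text,
        annualised_share = annualised_share,
        share = share))
```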

expenditure_labels <- c("0-200", "200-400", "400-600", "600-800", "800-1000", "1000-2000", "2000-4000")
expenditure_breaks <- c(seq(0, 1000, by = 200), seq(1001, 2000, by = 1000), seq(2001, 4000, by = 2000))

CreditCard$expenditure_intervals <- cut(CreditCard$expenditure, breaks = expenditure_breaks, labels = expenditure_labels)

plot <- ggplot(CreditCard, aes(x = dependents, y = after_stat(count), fill = expenditure_intervals)) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Number of People by Dependent Count and Expenditure Bracket",
       x = "Dependent Count",
       y = "Number of People",
       fill = "Expenditure Bracket") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 1)) +
  scale_fill_brewer(palette = "Set3")

print(plot)

This R code divides the expenditure variable into the expenditure intervals defined above and plots the number of people in each interval for each number of dependents, using a dodged bar chart with one color per expenditure bracket. Two details of the breaks affect how the chart and the counts that follow should be read. First, ‘cut’ leaves the lower bound 0 outside the first interval, so applicants with zero expenditure are shown as NA. Second, seq(1001, 2000, by = 1000) and seq(2001, 4000, by = 2000) contribute only the values 1001 and 2001, so the label “1000-2000” covers (1000, 1001] and the label “2000-4000” covers (1001, 2001].

count_data <- CreditCard %>%
  group_by(dependents, expenditure_intervals) %>%
  summarise(count = n(), .groups = 'drop')

total_population <- sum(count_data$count)

count_data <- count_data %>%
  mutate(ratio = count / total_population)

print(count_data)

## # A tibble: 43 × 4
##    dependents expenditure_intervals count   ratio
##         <dbl> <fct>                 <int>   <dbl>
##  1          0 0-200                   306 0.232  
##  2          0 200-400                 112 0.0849 
##  3          0 400-600                  53 0.0402 
##  4          0 600-800                  14 0.0106 
##  5          0 800-1000                 11 0.00834
##  6          0 2000-4000                10 0.00758
##  7          0 <NA>                    153 0.116  
##  8          1 0-200                   127 0.0963 
##  9          1 200-400                  47 0.0356 
## 10          1 400-600                  23 0.0174 
## # ℹ 33 more rows

When we examine these rates, we do not see a clear relationship between the number of dependents and the spending ranges. In the rows shown, the ordering of the spending brackets is similar for applicants with 0 and with 1 dependent: 0-200 is the most common bracket, followed by 200-400 and 400-600. Note that the ratio column divides each count by the total population, so it also reflects how many applicants have a given number of dependents. Based on this descriptive comparison, the number of dependents does not appear to have a strong effect on spending habits.
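Shares within one dependents group are easier to compare across groups than shares of the total population. The sketch below computes them for applicants with 0 dependents, using the seven counts printed in the output above; the variable names are made up for illustration.

```r
# Counts for applicants with 0 dependents, copied from the printed output
counts_dep0 <- c("0-200" = 306, "200-400" = 112, "400-600" = 53,
                 "600-800" = 14, "800-1000" = 11, "2000-4000" = 10, "NA" = 153)

# Size of the group and the share of each bracket within it
group_total <- sum(counts_dep0)
within_group_share <- counts_dep0 / group_total

print(group_total)
print(round(within_group_share, 3))
```

Repeating this for every number of dependents would give distributions that can be compared directly, independent of group size.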

count_data <- CreditCard %>%
  group_by(selfemp, card) %>%
  summarise(count = n(), .groups = 'drop')

ggplot(count_data, aes(x = selfemp, y = count, fill = card)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Relationship Between Self-Employment and Card Ownership",
       x = "Self-Employment Status",
       y = "Number of People",
       fill = "Card Ownership") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This code counts how many applicants are self-employed and how many are not, split by whether they have a card, and plots the four counts as a grouped bar chart.

count_data <- CreditCard %>%
  group_by(selfemp, card) %>%
  summarise(count = n(), .groups = 'drop')

print(count_data)

## # A tibble: 4 × 3
##   selfemp card  count
##   <fct>   <fct> <int>
## 1 no      no      268
## 2 no      yes     960
## 3 yes     no       28
## 4 yes     yes      63

self_emp_no_card_no <- 268
self_emp_no_card_yes <- 960
self_emp_yes_card_no <- 28
self_emp_yes_card_yes <- 63

total_population <- self_emp_no_card_no + self_emp_no_card_yes + self_emp_yes_card_no + self_emp_yes_card_yes

# Card ownership rate within each self-employment group
ratio1 <- self_emp_no_card_yes / (self_emp_no_card_no + self_emp_no_card_yes)
ratio2 <- self_emp_yes_card_yes / (self_emp_yes_card_no + self_emp_yes_card_yes)
# Share of each of the four groups in the total population
ratio3 <- self_emp_no_card_no / total_population
ratio4 <- self_emp_no_card_yes / total_population
ratio5 <- self_emp_yes_card_no / total_population
ratio6 <- self_emp_yes_card_yes / total_population

cat("Card ownership rate among people who are not self-employed:", ratio1, "\n")

## Card ownership rate among people who are not self-employed: 0.781759

cat("Card ownership rate among people who are self-employed:", ratio2, "\n")

## Card ownership rate among people who are self-employed: 0.6923077

cat("Share of all people who are not self-employed and do not have a card:", ratio3, "\n")

## Share of all people who are not self-employed and do not have a card: 0.2031842

cat("Share of all people who are not self-employed and have a card:", ratio4, "\n")

## Share of all people who are not self-employed and have a card: 0.7278241

cat("Share of all people who are self-employed and do not have a card:", ratio5, "\n")

## Share of all people who are self-employed and do not have a card: 0.0212282

cat("Share of all people who are self-employed and have a card:", ratio6, "\n")

## Share of all people who are self-employed and have a card: 0.04776346

This R code groups the CreditCard data frame by self-employment status (selfemp) and credit card ownership (card) and counts the observations in each group. It then computes the card ownership rate within each self-employment group and the share of each of the four groups in the total population.

Taking these rates into consideration, 960 of the 1228 applicants who are not self-employed have a card (about 78%), compared with 63 of the 91 self-employed applicants (about 69%). The rate for the self-employed is somewhat lower, but that group is small, so this descriptive comparison alone gives no conclusive evidence of a dependence between self-employment and card ownership.
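A formal check of independence is one way to go beyond comparing the two rates by eye. The sketch below applies base R's chisq.test to the four counts printed above; the choice of this test is an assumption and not part of the original analysis, and no result is quoted here.

```r
# 2 x 2 table of counts copied from the printed output
counts <- matrix(c(268, 960, 28, 63), nrow = 2, byrow = TRUE,
                 dimnames = list(selfemp = c("no", "yes"),
                                 card    = c("no", "yes")))

# Card ownership rate within each self-employment group
card_rate <- prop.table(counts, margin = 1)[, "yes"]
print(card_rate)

# Chi-squared test of independence (continuity correction applied by default)
test_result <- chisq.test(counts)
print(test_result)
```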

summary_ratio <- CreditCard %>%
  group_by(card) %>%
  summarise(
    "Minimum" = min(expenditure_income_ratio),
    "Maximum" = max(expenditure_income_ratio),
    "Median" = median(expenditure_income_ratio),
    "Mean" = mean(expenditure_income_ratio),
    "Quartile 1" = quantile(expenditure_income_ratio, 0.25),
    "Quartile 3" = quantile(expenditure_income_ratio, 0.75),
    "Sd" = sd(expenditure_income_ratio),
    "IQR" = IQR(expenditure_income_ratio),
    "Sx" = IQR(expenditure_income_ratio) / 2,
    "Var %" = sd(expenditure_income_ratio) / mean(expenditure_income_ratio),
    "IQR Var %" = IQR(expenditure_income_ratio) / median(expenditure_income_ratio),
    "Skewness" = skewness(expenditure_income_ratio),
    "Kurtosis" = kurtosis(expenditure_income_ratio)
  )

summary_table <- summary_ratio %>%
  pivot_longer(-card) %>%
  pivot_wider(names_from = card, values_from = value)

kbl(summary_table, digits = 2,
    caption = "Expenditure by income ratio on card ownership",
    col.names = c("", "No", "Yes"),
    row.names = FALSE, escape = FALSE) %>% kable_classic(full_width = F, html_font = "Cambria")%>% kable_styling(bootstrap_options = c("striped", "hover"))

Expenditure by income ratio on card ownership
	No	Yes
Minimum	0	0.00
Maximum	0	62.94
Median	0	4.18
Mean	0	6.14
Quartile 1	0	1.84
Quartile 3	0	7.90
Sd	0	6.88
IQR	0	6.07
Sx	0	3.03
Var %	NaN	1.12
IQR Var %	NaN	1.45
Skewness	NaN	3.04
Kurtosis	NaN	14.83

This R script calculates summary statistics of expenditure_income_ratio by credit card ownership status (card) and arranges them in a table, so that the ratio can be compared between the two groups.

Comparing the groups, every statistic for non-card owners is 0 because their expenditure is 0, and the measures that divide by the mean or the median are therefore NaN. Card owners have a maximum ratio of 62.94 and a wide spread (standard deviation 6.88), so their spending relative to income varies considerably, perhaps because they have more access to credit. Among card owners, the mean (6.14) is higher than the median (4.18) and the skewness is positive (3.04), which indicates that a few applicants have very high expenditure-income ratios.
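The NaN entries in the “No” column follow from dividing zero by zero. The sketch below reproduces this on a made-up vector of zeros and shows one possible guard; no_card_ratio and the helper safe_ratio are hypothetical names introduced for illustration.

```r
# Illustrative group in which every ratio is zero
no_card_ratio <- c(0, 0, 0, 0)

# Coefficient of variation and its IQR-based analogue: both are 0 / 0
cv     <- sd(no_card_ratio) / mean(no_card_ratio)
iqr_cv <- IQR(no_card_ratio) / median(no_card_ratio)
print(c(cv = cv, iqr_cv = iqr_cv))

# Hypothetical helper: return NA instead of dividing by zero
safe_ratio <- function(numerator, denominator) {
  if (denominator == 0) NA_real_ else numerator / denominator
}
print(safe_ratio(sd(no_card_ratio), mean(no_card_ratio)))
```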

summary_ratio <- CreditCard %>%
  group_by(selfemp) %>%
  summarise(
    "Minimum" = min(expenditure_income_ratio),
    "Maximum" = max(expenditure_income_ratio),
    "Median" = median(expenditure_income_ratio),
    "Mean" = mean(expenditure_income_ratio),
    "Quartile 1" = quantile(expenditure_income_ratio, 0.25),
    "Quartile 3" = quantile(expenditure_income_ratio, 0.75),
    "Sd" = sd(expenditure_income_ratio),
    "IQR" = IQR(expenditure_income_ratio),
    "Sx" = IQR(expenditure_income_ratio) / 2,
    "Var %" = sd(expenditure_income_ratio) / mean(expenditure_income_ratio),
    "IQR Var %" = IQR(expenditure_income_ratio) / median(expenditure_income_ratio),
    "Skewness" = skewness(expenditure_income_ratio),
    "Kurtosis" = kurtosis(expenditure_income_ratio)
  )

summary_table <- summary_ratio %>%
  pivot_longer(-selfemp) %>%
  pivot_wider(names_from = selfemp, values_from = value)

kbl(summary_table, digits = 2,
    caption = "Expenditure by income ratio on self employment",
    col.names = c("", "No", "Yes"),
    row.names = FALSE, escape = FALSE) %>% kable_classic(full_width = F, html_font = "Cambria")%>% kable_styling(bootstrap_options = c("striped", "hover"))

Expenditure by income ratio on self employment
	No	Yes
Minimum	0.00	0.00
Maximum	62.94	17.93
Median	2.85	1.28
Mean	4.90	2.85
Quartile 1	0.18	0.00
Quartile 3	6.63	3.89
Sd	6.71	4.09
IQR	6.45	3.89
Sx	3.23	1.95
Var %	1.37	1.43
IQR Var %	2.27	3.05
Skewness	3.13	1.83
Kurtosis	15.70	2.80

This R script calculates the same summary statistics grouped by self-employment status (selfemp) and displays them as a table, so that expenditure-income ratios can be compared between the two groups.

Comparing the groups, non-self-employed people have a higher maximum expenditure-income ratio (62.94) than self-employed people (17.93), which suggests that some non-self-employed people spend very heavily relative to their income. The interquartile range, the spread between the 25th and 75th percentiles, is also wider for non-self-employed (6.45) than for self-employed (3.89) people, and their median ratio is higher (2.85 versus 1.28). Spending habits are therefore more varied among non-self-employed people.
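The report does not say which package provides skewness() and kurtosis(). As a rough guide to what these rows measure, the sketch below implements the plain moment-based definitions in base R. This is an assumption: packages differ, for example in whether kurtosis is reported as raw or excess, so the function names and the toy vector are illustrative only.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (raw, not excess) kurtosis: fourth central moment over the squared second
moment_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Toy ratios with one very large value (illustrative only)
toy <- c(0, 1, 1, 2, 2, 3, 20)
print(c(skewness = moment_skewness(toy), kurtosis = moment_kurtosis(toy)))
```

A single large value pulls both measures up, which is the pattern the tables show for the groups with a maximum far above the third quartile.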

Emre Aydin, Semih Elmas

2024-04-20
