Introduction

Shopping is often gendered as a pursuit for women.
It’s a popular notion that Women tend to spend more than their counterparts and this has been engrained into the fabric of our society for decades.
By collating facts and figures we seek to answer the age-old stereotyped question of “Who spends more money?”
People are more inclined to pay with credit cards than cash and so we look at the card swipes(i.e. credit card bills) for the analysis.
Analyzing the bill statements of the credit card for each sex, we derive comprehensible insight of their expense patterns.

Introduction Cont.

Men VS Women

“THE BATTLE”

Who spends more?

Problem Statement

This study drives insights from data to unravel the question of who swipes more i.e. spends more money.
The chief objective is to scrutinize spending habits of males and females by observing their monthly credit card bills.The average bill amount of past 6 months is used to infer the trend of expenses.
Using two sample t-test we determine whether there is any statistically significant difference between spending by male and female.
We then look further into whether the ability to pay credit bills on time is associated to gender type using Chi-square test of association and determine the tendency for a person(male/female) to default(not pay bills).

Data

Data Source: UCI machine learning repository
Data Description:
– The dataset used for this investigation beholds 30,000 observations of Taiwan Customers, whose details are observed across 25 variables.
– It contains customer details such as Gender, Education, Marital Status, Age and Financial Details such as Credit Limit, Payment Status, Bill Amount and Amount paid in past 6 months and Default Payment Status.
List of 25 variables of the dataset is provided below:

Sr.No	Variable	Variable Explanation
1	ID	A unique ID to differentiate every record
2	Balance Limit	Credit card Balance Limit (NT Dollars)
3	Gender	1 = Male; 2 = Female
4	Education	1 = graduate school; 2 = university; 3 = high school; 4 = others
5	Marital status	1 = married; 2 = single; 3 = others
6	Age	In Years
7	Pay_0,Pay_2,Pay_3,	History of past monthly payment records (from April to September).
	Pay_4,Pay_5,Pay_6	The measurement scale for the repayment status is:
		-1 = pay duly;
		1 = payment delay for one month;
		2 = payment delay for two months; . . .;
		8 = payment delay for eight months.
8	Bill_Amt1, Bill_Amt2,	Amount of bill statement (NT dollar,from April to September)
	Bill_Amt3,Bill_Amt4,	Bill_Amt1 = amount of bill statement in September
	Bill_Amt5,Bill_Amt6	Bill_Amt2= amount of bill statement in August
		. . .; Bill_Amt6 = amount of bill statement in April.
9	Pay_Amt1,Pay_Amt2,	Amount of previous payment from (NT dollar,April to September)
	Pay_Amt3,Pay_Amt4,	Pay_Amt1= amount paid in September
	Pay_Amt5,Pay_Amt6	Pay_Amt2 = amount paid in August;
		. . .;Pay_Amt6 = amount paid in April.
10	default payment	Yes = 1, No = 0

Data Cont.

#Reading and importing the dataset
spend<-read.csv("C:\\Users\\aditi\\Downloads\\default of credit card clients.csv")
#Compactly display the structure of the variables
str(spend)

## 'data.frame':    30000 obs. of  25 variables:
##  $ ID                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LIMIT_BAL                 : int  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ SEX                       : int  2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION                 : int  2 2 2 2 2 1 1 2 3 3 ...
##  $ MARRIAGE                  : int  1 2 2 1 1 2 2 2 1 2 ...
##  $ AGE                       : int  24 26 34 37 57 37 29 23 28 35 ...
##  $ PAY_0                     : int  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ PAY_2                     : int  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ PAY_3                     : int  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ PAY_4                     : int  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ PAY_5                     : int  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ PAY_6                     : int  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ BILL_AMT1                 : int  3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
##  $ BILL_AMT2                 : int  3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
##  $ BILL_AMT3                 : int  689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
##  $ BILL_AMT4                 : int  0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
##  $ BILL_AMT5                 : int  0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
##  $ BILL_AMT6                 : int  0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
##  $ PAY_AMT1                  : int  0 0 1518 2000 2000 2500 55000 380 3329 0 ...
##  $ PAY_AMT2                  : int  689 1000 1500 2019 36681 1815 40000 601 0 0 ...
##  $ PAY_AMT3                  : int  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ PAY_AMT4                  : int  0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
##  $ PAY_AMT5                  : int  0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
##  $ PAY_AMT6                  : int  0 2000 5000 1000 679 800 13770 1542 1000 0 ...
##  $ default.payment.next.month: int  1 1 0 0 0 0 0 0 0 0 ...

#Providing an Overview of the dataset
hd<-head(spend)

hd %>% kable() %>% kable_styling()

ID	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default.payment.next.month
1	20000	2	2	1	24	2	2	-1	-1	-2	-2	3913	3102	689	0	0	0	0	689	0	0	0	0	1
2	120000	2	2	2	26	-1	2	0	0	0	2	2682	1725	2682	3272	3455	3261	0	1000	1000	1000	0	2000	1
3	90000	2	2	2	34	0	0	0	0	0	0	29239	14027	13559	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
4	50000	2	2	1	37	0	0	0	0	0	0	46990	48233	49291	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
5	50000	1	2	1	57	-1	0	-1	0	0	0	8617	5670	35835	20940	19146	19131	2000	36681	10000	9000	689	679	0
6	50000	1	1	2	37	0	0	0	0	0	0	64400	57069	57608	19394	19619	20024	2500	1815	657	1000	1000	800	0

Data Pre-processing

Looking for inconsistencies in data
– Read CSV file
– Check for missing and impossible values
– Check data types of variables
Cleaning the data
– Replace impossible values by appropriate nearest possible value; categorical variables have values which do not correspond to any category
– Convert numerical categories to factor with proper labels; categorical variables present with numerical data

#As levels 0,5,6 of variable Education belongs to unknown categories it has been assigned to category 4(Others).
spend<-spend %>% mutate(EDUCATION = ifelse(EDUCATION==0|EDUCATION==5|EDUCATION==6,4,EDUCATION))

#As level 0 of variable Marriage belongs to unknown category it has been assigned to category 3(Others).
spend<-spend %>% mutate(MARRIAGE = ifelse(MARRIAGE==0,3,MARRIAGE))

#As level 0 and -2 of below listed set of variables belongs to unknown category it has been assigned to category -1(Pay duly).
spend<-spend %>% mutate(PAY_0 = ifelse(PAY_0==0|PAY_0==-2,-1,PAY_0))
spend<-spend %>% mutate(PAY_2 = ifelse(PAY_2==0|PAY_2==-2,-1,PAY_2))
spend<-spend %>% mutate(PAY_3 = ifelse(PAY_3==0|PAY_3==-2,-1,PAY_3))
spend<-spend %>% mutate(PAY_4 = ifelse(PAY_4==0|PAY_4==-2,-1,PAY_4))
spend<-spend %>% mutate(PAY_5 = ifelse(PAY_5==0|PAY_5==-2,-1,PAY_5))
spend<-spend %>% mutate(PAY_6 = ifelse(PAY_6==0|PAY_6==-2,-1,PAY_6))

Data Pre-processing Cont.

#Checking for no.of levels for all variables to be coverted to factors
#Assigning appropriate labels to the levels of variable Education.
spend$EDUCATION<- factor(spend$EDUCATION, levels = c(1,2,3,4), labels=c("Graduate School", "University", "High School", "Others") )
#Assigning appropriate labels to the levels of variable Marriage.
spend$MARRIAGE<- factor(spend$MARRIAGE, levels = c(1,2,3), labels=c("Married", "Single", "Others"))
#Assigning appropriate labels to the levels of variable Sex.
spend$SEX <- factor(spend$SEX ,levels = c('1','2'), labels=c("MALE", "FEMALE"))
#Assigning appropriate labels to the variable status of payment
spend$default.payment.next.month <- factor(spend$default.payment.next.month,levels = c(0,1), labels=c("No  Default", "Default"))
#Following function aids to efforlessly assign labels to levels.
convtfactor<-function(x){factor(x, levels= c(-1,1,2,3,4,5,6,7,8), labels = c( "Pay duly", "payment delay for one month",
                      "payment delay for two months", "payment delay for three months", "payment delay for four months", 
                      "payment delay for five months", "payment delay for six months", "payment delay for seven months",
                       "payment delay for eight months"))}
#Applying the above function on the variables PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6.
spend$PAY_0<-convtfactor(spend$PAY_0)
spend$PAY_2<-convtfactor(spend$PAY_2)
spend$PAY_3<-convtfactor(spend$PAY_3)
spend$PAY_4<-convtfactor(spend$PAY_4)
spend$PAY_5<-convtfactor(spend$PAY_5)
spend$PAY_6<-convtfactor(spend$PAY_6)

#Applying Mutate to determine Average Bill amount paid in past 6 months.              
spend<-mutate(spend,BILL_AMOUNT_AVG=(BILL_AMT1+BILL_AMT2+BILL_AMT3+BILL_AMT4+BILL_AMT5+BILL_AMT6)/6)

Outlier Treatment

Outliers for the variable Bill amount average are handled using boxplot approach.
The observations having outliers are removed from the dataset.
The observations having negative values are removed

#In this chunk we would be dealing with outliers
#Handling outliers for Average Bill Amount variable
outlier <- boxplot(spend$BILL_AMOUNT_AVG,plot=FALSE)$out
spend<- spend[-which(spend$BILL_AMOUNT_AVG %in% outlier),]
spend<- spend[-which(spend$BILL_AMOUNT_AVG <0),]

Checking the no. of total observations left in the dataset

dim(spend)

## [1] 27224    26

Descriptive Statistics and Visualisation.

The table provides a brief summary statistics of ‘bill amount average’ variable to represent quantitative insights.
Summarizing the data gender-wise
– The mean ‘bill amount average’ for males and females are not the same.

#Providing descriptive summary for variable BILL_AMOUNT_AVG separately for each Sex.
table1<-spend %>% group_by(SEX) %>% summarise(Min = min(BILL_AMOUNT_AVG,na.rm = TRUE),
                                             Q1 = quantile(BILL_AMOUNT_AVG,probs = .25,na.rm = TRUE),
                                             Median = median(BILL_AMOUNT_AVG, na.rm = TRUE),
                                             Q3 = quantile(BILL_AMOUNT_AVG,probs = .75,na.rm = TRUE),
                                             Max = max(BILL_AMOUNT_AVG,na.rm = TRUE),
                                             Mean = mean(BILL_AMOUNT_AVG, na.rm = TRUE),
                                             SD = sd(BILL_AMOUNT_AVG, na.rm = TRUE),
                                             n = n(),
                                             Missing = sum(is.na(BILL_AMOUNT_AVG)))

table1 %>% kable() %>% kable_styling()

SEX	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
MALE	0	5211.667	18864.00	44099.50	135506.8	30227.18	32143.58	10685	0
FEMALE	0	3525.917	17470.17	44019.92	135548.5	29448.48	32697.98	16539	0

Descriptive Statistics Cont.

Visualizing distribution of bill amount in male and female using box plot.

#Plotting Box Plot to visualize the data
spend %>% boxplot(BILL_AMOUNT_AVG ~ SEX, data=.,ylab="Average Bill Amount", main = "Boxplot - Average Amount spent in dollars by Gender", col = c(rgb(0.1,0.1,0.7,0.5),rgb(0.8,0.1,0.3,0.6)))

It can be inferred from the box plot that, the range of values of bill amount is almost similar in both genders.
Also, the median value of males is at a slightly higher height.

Hypothesis Testing

To compare the mean spending amount of both categories, two sample t-test is performed.
The result of t-test can give proper insight of whether there is any statistically significant difference between mean values of female and male bill amounts.

Hypotheses
Null Hypothesis: There is no statistically significant difference between mean of bill amount of male and female. \[H_0: \mu_1 = \mu_2 \]

Alternate Hypothesis: There is a statistically significant difference between mean of bill amount of male and female. \[H_A: \mu_1 \ne \mu_2\] Assumptions
– Comparing two independent population means i.e. bill amounts of males and females with unknown variance
– Population homogeneity of variance
– Population data sample used contains large data of 27,224 observations (n>30)

#Plotting QQPlot for Male BILL_AMOUNT_AVG
bill_male <- spend %>% filter(SEX == "MALE")
bill_male$BILL_AMOUNT_AVG %>% qqPlot(dist="norm")

## [1] 5991 9658

title(main="BIll Amount for Male")

#Plotting QQPlot for Female BILL_AMOUNT_AVG
bill_female <- spend %>% filter(SEX == "FEMALE")
bill_female$BILL_AMOUNT_AVG %>% qqPlot(dist="norm")

## [1] 7107 6754

title(main="BIll Amount for Female")

Hypothesis Testing Cont.

Observing the qqplot, though the deviation from the normal line is high, the normality can be assumed considering CLT(the sample size of both groups are greater than 30).
Check for equal variance of both population using Levene test.
Following hypothesis is checked:

Null Hypothesis - There is no statistically significant difference in the variance of bill amount of males and females. \[H_0: \sigma_1^2 = \sigma_2^2 \] Alternate Hypothesis - There is a statistically significant difference in the variance of bill amount of males and females. \[H_A: \sigma_1^2 \ne \sigma_2^2\]

#check for equal variance
levene<-leveneTest(BILL_AMOUNT_AVG ~ SEX, data = spend)

levene %>% kable() %>% kable_styling()

	Df	F value	Pr(>F)
group	1	3.527547	0.0603677
	27222	NA	NA

Result
– Levene’s test for homogeneity of variance returns p-value as 0.06 which is higher than 0.05. Therefore, we fail to reject hypothesis
– We assume equal variance.

Hypothesis Testing Cont.

Now, considering the result of levene test, we can apply t-test with TRUE value of equal variance.
Also, the t-test is applied with 95% of CI and value of alternative attribute as TRUE shows that it is two sample t-test.

#Applying two sample t-test
t.test(  BILL_AMOUNT_AVG ~ SEX, data = spend, var.equal = TRUE, alternative = "two.sided" )

## 
##  Two Sample t-test
## 
## data:  BILL_AMOUNT_AVG by SEX
## t = 1.9315, df = 27222, p-value = 0.05343
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   -11.49918 1568.90368
## sample estimates:
##   mean in group MALE mean in group FEMALE 
##             30227.18             29448.48

p-value = 0.053 i.e.( p-value> 0.050).
Also, the 95% CI for difference in means is [-11.50 1568.90], which captures \(H_0\).
As per results, we fail to reject \(H_0\), subsequently it is concluded that there is no statistically significant difference between mean of bill amount of male and female.
Therefore, in the constrained extent of the present data investigated, it is evident that female and male represents quite similar spending behaviour.

Ability to pay credit bills

Scrutinizing further the tendency of males and females to pay the bills timely we perform ‘Chi-square of test of association’ to see if status of bills payment is dependent on gender
Hypothesis for the Chi-square of test of association :
\(H_0\)- There is no association between gender and tendency for a person to default (Gender and ability to payback are independent)
\(H_A\)- Alternate Hypothesis - There is an association between gender and tendency for a person to default (Gender and ability to payback are not independent)
Assumptions
That no more than 25% of the cells have expected counts below 5.

First, we report the chi-square, statistic, df and p-value

Test - Chi-square of test of association

Result : p-value < 0.05

#Check for equal variance
chi <- chisq.test(table(spend$default.payment.next.month, spend$SEX))
chi

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(spend$default.payment.next.month, spend$SEX)
## X-squared = 45.522, df = 1, p-value = 1.509e-11

chi$observed

##              
##                MALE FEMALE
##   No  Default  8086  13093
##   Default      2599   3446

chi$expected

##              
##                   MALE    FEMALE
##   No  Default 8312.431 12866.569
##   Default     2372.569  3672.431

Interpretation - Chi-square of test of association

Summary of test
- 0% of expected counts were less than
- chi-square = 45.522, \(df\)= 1
- p-value < 0.05

Decision
- Reject \(H_0\)

We conclude the result of the test discovered a statistically significant association between gender and the default status. According to the bar plot more number of both males and females positively accomplish in clearing the bill amounts thus they fall under the category of ‘No Default’. However, more number of females show this behaviour than males.

tab1 <- table(spend$default.payment.next.month, spend$SEX)
tab2 <- tab1 %>% prop.table(margin=2)
tab2

##              
##                    MALE    FEMALE
##   No  Default 0.7567618 0.7916440
##   Default     0.2432382 0.2083560

barplot(tab2, main = "Default by Gender",ylab="Proportion within Gender", ylim=c(0,1.2), legend=rownames(tab2),beside=TRUE, args.legend=c(x = "top",horiz=TRUE,title="Status of Payment"), xlab="Gender", col = c(rgb(0.1,0.1,0.7,0.8),rgb(0.2,0.7,0.7,0.7)))

Discussion

From the results of two sample t-test, we can say that “Males and Females spend equally”
The conclusion refutes the stereotype that females tend to spend more money and shows that males and females have similar habit to spend money.
Furthermore, we notice that there is an association between gender and status of payment and more number of females clear payments on time.
For future, we can look at the spending habits of males and females closely to find where they spend the most of the money.
The current data looks at customers from Taiwan, for future we can expand the dataset to include customers from across the world.

“Who Swipes More?”

Statistics on Men and Women Spending Habits

RPubs link information

Introduction

Introduction Cont.

Problem Statement

Data

Data Cont.

Data Pre-processing

Data Pre-processing Cont.

Outlier Treatment

Descriptive Statistics and Visualisation.

Descriptive Statistics Cont.

Hypothesis Testing

Hypothesis Testing Cont.

Hypothesis Testing Cont.

Ability to pay credit bills

Test - Chi-square of test of association

Interpretation - Chi-square of test of association

Discussion

References