“Who Swipes More?”

Statistics on Men and Women Spending Habits

Aditi Mahajan(s3757732) Salina Bharthu(s3736867) Neha Rao(s3755214)

Last updated: 02 June, 2019

Introduction

Introduction Cont.

Men VS Women

“THE BATTLE”

Who spends more?

Problem Statement

Data

Sr.No Variable Variable Explanation
1 ID A unique ID to differentiate every record
2 Balance Limit Credit card Balance Limit (NT Dollars)
3 Gender 1 = Male; 2 = Female
4 Education 1 = graduate school; 2 = university; 3 = high school; 4 = others
5 Marital status 1 = married; 2 = single; 3 = others
6 Age In Years
7 Pay_0,Pay_2,Pay_3, History of past monthly payment records (from April to September).
Pay_4,Pay_5,Pay_6 The measurement scale for the repayment status is:
-1 = pay duly;
1 = payment delay for one month;
2 = payment delay for two months; . . .;
8 = payment delay for eight months.
8 Bill_Amt1, Bill_Amt2, Amount of bill statement (NT dollar,from April to September)
Bill_Amt3,Bill_Amt4, Bill_Amt1 = amount of bill statement in September
Bill_Amt5,Bill_Amt6 Bill_Amt2= amount of bill statement in August
. . .; Bill_Amt6 = amount of bill statement in April.
9 Pay_Amt1,Pay_Amt2, Amount of previous payment from (NT dollar,April to September)
Pay_Amt3,Pay_Amt4, Pay_Amt1= amount paid in September
Pay_Amt5,Pay_Amt6 Pay_Amt2 = amount paid in August;
. . .;Pay_Amt6 = amount paid in April.
10 default payment Yes = 1, No = 0

Data Cont.

#Reading and importing the dataset
spend<-read.csv("C:\\Users\\aditi\\Downloads\\default of credit card clients.csv")
#Compactly display the structure of the variables
str(spend)
## 'data.frame':    30000 obs. of  25 variables:
##  $ ID                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LIMIT_BAL                 : int  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ SEX                       : int  2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION                 : int  2 2 2 2 2 1 1 2 3 3 ...
##  $ MARRIAGE                  : int  1 2 2 1 1 2 2 2 1 2 ...
##  $ AGE                       : int  24 26 34 37 57 37 29 23 28 35 ...
##  $ PAY_0                     : int  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ PAY_2                     : int  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ PAY_3                     : int  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ PAY_4                     : int  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ PAY_5                     : int  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ PAY_6                     : int  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ BILL_AMT1                 : int  3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
##  $ BILL_AMT2                 : int  3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
##  $ BILL_AMT3                 : int  689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
##  $ BILL_AMT4                 : int  0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
##  $ BILL_AMT5                 : int  0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
##  $ BILL_AMT6                 : int  0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
##  $ PAY_AMT1                  : int  0 0 1518 2000 2000 2500 55000 380 3329 0 ...
##  $ PAY_AMT2                  : int  689 1000 1500 2019 36681 1815 40000 601 0 0 ...
##  $ PAY_AMT3                  : int  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ PAY_AMT4                  : int  0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
##  $ PAY_AMT5                  : int  0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
##  $ PAY_AMT6                  : int  0 2000 5000 1000 679 800 13770 1542 1000 0 ...
##  $ default.payment.next.month: int  1 1 0 0 0 0 0 0 0 0 ...
#Providing an Overview of the dataset
hd<-head(spend)

hd %>% kable() %>% kable_styling()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
1 20000 2 2 1 24 2 2 -1 -1 -2 -2 3913 3102 689 0 0 0 0 689 0 0 0 0 1
2 120000 2 2 2 26 -1 2 0 0 0 2 2682 1725 2682 3272 3455 3261 0 1000 1000 1000 0 2000 1
3 90000 2 2 2 34 0 0 0 0 0 0 29239 14027 13559 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
4 50000 2 2 1 37 0 0 0 0 0 0 46990 48233 49291 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
5 50000 1 2 1 57 -1 0 -1 0 0 0 8617 5670 35835 20940 19146 19131 2000 36681 10000 9000 689 679 0
6 50000 1 1 2 37 0 0 0 0 0 0 64400 57069 57608 19394 19619 20024 2500 1815 657 1000 1000 800 0

Data Pre-processing

#As levels 0,5,6 of variable Education belongs to unknown categories it has been assigned to category 4(Others).
spend<-spend %>% mutate(EDUCATION = ifelse(EDUCATION==0|EDUCATION==5|EDUCATION==6,4,EDUCATION))

#As level 0 of variable Marriage belongs to unknown category it has been assigned to category 3(Others).
spend<-spend %>% mutate(MARRIAGE = ifelse(MARRIAGE==0,3,MARRIAGE))

#As level 0 and -2 of below listed set of variables belongs to unknown category it has been assigned to category -1(Pay duly).
spend<-spend %>% mutate(PAY_0 = ifelse(PAY_0==0|PAY_0==-2,-1,PAY_0))
spend<-spend %>% mutate(PAY_2 = ifelse(PAY_2==0|PAY_2==-2,-1,PAY_2))
spend<-spend %>% mutate(PAY_3 = ifelse(PAY_3==0|PAY_3==-2,-1,PAY_3))
spend<-spend %>% mutate(PAY_4 = ifelse(PAY_4==0|PAY_4==-2,-1,PAY_4))
spend<-spend %>% mutate(PAY_5 = ifelse(PAY_5==0|PAY_5==-2,-1,PAY_5))
spend<-spend %>% mutate(PAY_6 = ifelse(PAY_6==0|PAY_6==-2,-1,PAY_6))

Data Pre-processing Cont.

#Checking for no.of levels for all variables to be coverted to factors
#Assigning appropriate labels to the levels of variable Education.
spend$EDUCATION<- factor(spend$EDUCATION, levels = c(1,2,3,4), labels=c("Graduate School", "University", "High School", "Others") )
#Assigning appropriate labels to the levels of variable Marriage.
spend$MARRIAGE<- factor(spend$MARRIAGE, levels = c(1,2,3), labels=c("Married", "Single", "Others"))
#Assigning appropriate labels to the levels of variable Sex.
spend$SEX <- factor(spend$SEX ,levels = c('1','2'), labels=c("MALE", "FEMALE"))
#Assigning appropriate labels to the variable status of payment
spend$default.payment.next.month <- factor(spend$default.payment.next.month,levels = c(0,1), labels=c("No  Default", "Default"))
#Following function aids to efforlessly assign labels to levels.
convtfactor<-function(x){factor(x, levels= c(-1,1,2,3,4,5,6,7,8), labels = c( "Pay duly", "payment delay for one month",
                      "payment delay for two months", "payment delay for three months", "payment delay for four months", 
                      "payment delay for five months", "payment delay for six months", "payment delay for seven months",
                       "payment delay for eight months"))}
#Applying the above function on the variables PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6.
spend$PAY_0<-convtfactor(spend$PAY_0)
spend$PAY_2<-convtfactor(spend$PAY_2)
spend$PAY_3<-convtfactor(spend$PAY_3)
spend$PAY_4<-convtfactor(spend$PAY_4)
spend$PAY_5<-convtfactor(spend$PAY_5)
spend$PAY_6<-convtfactor(spend$PAY_6)

#Applying Mutate to determine Average Bill amount paid in past 6 months.              
spend<-mutate(spend,BILL_AMOUNT_AVG=(BILL_AMT1+BILL_AMT2+BILL_AMT3+BILL_AMT4+BILL_AMT5+BILL_AMT6)/6)

Outlier Treatment

#In this chunk we would be dealing with outliers
#Handling outliers for Average Bill Amount variable
outlier <- boxplot(spend$BILL_AMOUNT_AVG,plot=FALSE)$out
spend<- spend[-which(spend$BILL_AMOUNT_AVG %in% outlier),]
spend<- spend[-which(spend$BILL_AMOUNT_AVG <0),]
dim(spend)
## [1] 27224    26

Descriptive Statistics and Visualisation.

#Providing descriptive summary for variable BILL_AMOUNT_AVG separately for each Sex.
table1<-spend %>% group_by(SEX) %>% summarise(Min = min(BILL_AMOUNT_AVG,na.rm = TRUE),
                                             Q1 = quantile(BILL_AMOUNT_AVG,probs = .25,na.rm = TRUE),
                                             Median = median(BILL_AMOUNT_AVG, na.rm = TRUE),
                                             Q3 = quantile(BILL_AMOUNT_AVG,probs = .75,na.rm = TRUE),
                                             Max = max(BILL_AMOUNT_AVG,na.rm = TRUE),
                                             Mean = mean(BILL_AMOUNT_AVG, na.rm = TRUE),
                                             SD = sd(BILL_AMOUNT_AVG, na.rm = TRUE),
                                             n = n(),
                                             Missing = sum(is.na(BILL_AMOUNT_AVG)))

table1 %>% kable() %>% kable_styling()
SEX Min Q1 Median Q3 Max Mean SD n Missing
MALE 0 5211.667 18864.00 44099.50 135506.8 30227.18 32143.58 10685 0
FEMALE 0 3525.917 17470.17 44019.92 135548.5 29448.48 32697.98 16539 0

Descriptive Statistics Cont.

#Plotting Box Plot to visualize the data
spend %>% boxplot(BILL_AMOUNT_AVG ~ SEX, data=.,ylab="Average Bill Amount", main = "Boxplot - Average Amount spent in dollars by Gender", col = c(rgb(0.1,0.1,0.7,0.5),rgb(0.8,0.1,0.3,0.6)))

Hypothesis Testing

Hypotheses
Null Hypothesis: There is no statistically significant difference between mean of bill amount of male and female. \[H_0: \mu_1 = \mu_2 \]

Alternate Hypothesis: There is a statistically significant difference between mean of bill amount of male and female. \[H_A: \mu_1 \ne \mu_2\] Assumptions
– Comparing two independent population means i.e. bill amounts of males and females with unknown variance
– Population homogeneity of variance
– Population data sample used contains large data of 27,224 observations (n>30)

#Plotting QQPlot for Male BILL_AMOUNT_AVG
bill_male <- spend %>% filter(SEX == "MALE")
bill_male$BILL_AMOUNT_AVG %>% qqPlot(dist="norm")
## [1] 5991 9658
title(main="BIll Amount for Male")

#Plotting QQPlot for Female BILL_AMOUNT_AVG
bill_female <- spend %>% filter(SEX == "FEMALE")
bill_female$BILL_AMOUNT_AVG %>% qqPlot(dist="norm")
## [1] 7107 6754
title(main="BIll Amount for Female")

Hypothesis Testing Cont.

Null Hypothesis - There is no statistically significant difference in the variance of bill amount of males and females. \[H_0: \sigma_1^2 = \sigma_2^2 \] Alternate Hypothesis - There is a statistically significant difference in the variance of bill amount of males and females. \[H_A: \sigma_1^2 \ne \sigma_2^2\]

#check for equal variance
levene<-leveneTest(BILL_AMOUNT_AVG ~ SEX, data = spend)

levene %>% kable() %>% kable_styling()
Df F value Pr(>F)
group 1 3.527547 0.0603677
27222 NA NA

Hypothesis Testing Cont.

#Applying two sample t-test
t.test(  BILL_AMOUNT_AVG ~ SEX, data = spend, var.equal = TRUE, alternative = "two.sided" )
## 
##  Two Sample t-test
## 
## data:  BILL_AMOUNT_AVG by SEX
## t = 1.9315, df = 27222, p-value = 0.05343
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   -11.49918 1568.90368
## sample estimates:
##   mean in group MALE mean in group FEMALE 
##             30227.18             29448.48

Ability to pay credit bills

First, we report the chi-square, statistic, df and p-value

Test - Chi-square of test of association

Result : p-value < 0.05

#Check for equal variance
chi <- chisq.test(table(spend$default.payment.next.month, spend$SEX))
chi
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(spend$default.payment.next.month, spend$SEX)
## X-squared = 45.522, df = 1, p-value = 1.509e-11
chi$observed
##              
##                MALE FEMALE
##   No  Default  8086  13093
##   Default      2599   3446
chi$expected
##              
##                   MALE    FEMALE
##   No  Default 8312.431 12866.569
##   Default     2372.569  3672.431

Interpretation - Chi-square of test of association

Summary of test
- 0% of expected counts were less than
- chi-square = 45.522, \(df\)= 1
- p-value < 0.05

Decision
- Reject \(H_0\)

We conclude the result of the test discovered a statistically significant association between gender and the default status. According to the bar plot more number of both males and females positively accomplish in clearing the bill amounts thus they fall under the category of ‘No Default’. However, more number of females show this behaviour than males.

tab1 <- table(spend$default.payment.next.month, spend$SEX)
tab2 <- tab1 %>% prop.table(margin=2)
tab2
##              
##                    MALE    FEMALE
##   No  Default 0.7567618 0.7916440
##   Default     0.2432382 0.2083560
barplot(tab2, main = "Default by Gender",ylab="Proportion within Gender", ylim=c(0,1.2), legend=rownames(tab2),beside=TRUE, args.legend=c(x = "top",horiz=TRUE,title="Status of Payment"), xlab="Gender", col = c(rgb(0.1,0.1,0.7,0.8),rgb(0.2,0.7,0.7,0.7)))

Discussion

References