Aditi Mahajan(s3757732) Salina Bharthu(s3736867) Neha Rao(s3755214)
Last updated: 02 June, 2019
Presentation published on Rpubs
Link to Rpubs: Spend_Analysis
Shopping is often gendered as a pursuit for women.
It’s a popular notion that Women tend to spend more than their counterparts and this has been engrained into the fabric of our society for decades.
By collating facts and figures we seek to answer the age-old stereotyped question of “Who spends more money?”
People are more inclined to pay with credit cards than cash and so we look at the card swipes(i.e. credit card bills) for the analysis.
Analyzing the bill statements of the credit card for each sex, we derive comprehensible insight of their expense patterns.
Men VS Women
“THE BATTLE”
Who spends more?
This study drives insights from data to unravel the question of who swipes more i.e. spends more money.
The chief objective is to scrutinize spending habits of males and females by observing their monthly credit card bills.The average bill amount of past 6 months is used to infer the trend of expenses.
Using two sample t-test we determine whether there is any statistically significant difference between spending by male and female.
We then look further into whether the ability to pay credit bills on time is associated to gender type using Chi-square test of association and determine the tendency for a person(male/female) to default(not pay bills).
Data Source: UCI machine learning repository
Data Description:
– The dataset used for this investigation beholds 30,000 observations of Taiwan Customers, whose details are observed across 25 variables.
– It contains customer details such as Gender, Education, Marital Status, Age and Financial Details such as Credit Limit, Payment Status, Bill Amount and Amount paid in past 6 months and Default Payment Status.
List of 25 variables of the dataset is provided below:
| Sr.No | Variable | Variable Explanation |
|---|---|---|
| 1 | ID | A unique ID to differentiate every record |
| 2 | Balance Limit | Credit card Balance Limit (NT Dollars) |
| 3 | Gender | 1 = Male; 2 = Female |
| 4 | Education | 1 = graduate school; 2 = university; 3 = high school; 4 = others |
| 5 | Marital status | 1 = married; 2 = single; 3 = others |
| 6 | Age | In Years |
| 7 | Pay_0,Pay_2,Pay_3, | History of past monthly payment records (from April to September). |
| Pay_4,Pay_5,Pay_6 | The measurement scale for the repayment status is: | |
| -1 = pay duly; | ||
| 1 = payment delay for one month; | ||
| 2 = payment delay for two months; . . .; | ||
| 8 = payment delay for eight months. | ||
| 8 | Bill_Amt1, Bill_Amt2, | Amount of bill statement (NT dollar,from April to September) |
| Bill_Amt3,Bill_Amt4, | Bill_Amt1 = amount of bill statement in September | |
| Bill_Amt5,Bill_Amt6 | Bill_Amt2= amount of bill statement in August | |
| . . .; Bill_Amt6 = amount of bill statement in April. | ||
| 9 | Pay_Amt1,Pay_Amt2, | Amount of previous payment from (NT dollar,April to September) |
| Pay_Amt3,Pay_Amt4, | Pay_Amt1= amount paid in September | |
| Pay_Amt5,Pay_Amt6 | Pay_Amt2 = amount paid in August; | |
| . . .;Pay_Amt6 = amount paid in April. | ||
| 10 | default payment | Yes = 1, No = 0 |
#Reading and importing the dataset
spend<-read.csv("C:\\Users\\aditi\\Downloads\\default of credit card clients.csv")
#Compactly display the structure of the variables
str(spend)## 'data.frame': 30000 obs. of 25 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ LIMIT_BAL : int 20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
## $ SEX : int 2 2 2 2 1 1 1 2 2 1 ...
## $ EDUCATION : int 2 2 2 2 2 1 1 2 3 3 ...
## $ MARRIAGE : int 1 2 2 1 1 2 2 2 1 2 ...
## $ AGE : int 24 26 34 37 57 37 29 23 28 35 ...
## $ PAY_0 : int 2 -1 0 0 -1 0 0 0 0 -2 ...
## $ PAY_2 : int 2 2 0 0 0 0 0 -1 0 -2 ...
## $ PAY_3 : int -1 0 0 0 -1 0 0 -1 2 -2 ...
## $ PAY_4 : int -1 0 0 0 0 0 0 0 0 -2 ...
## $ PAY_5 : int -2 0 0 0 0 0 0 0 0 -1 ...
## $ PAY_6 : int -2 2 0 0 0 0 0 -1 0 -1 ...
## $ BILL_AMT1 : int 3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
## $ BILL_AMT2 : int 3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
## $ BILL_AMT3 : int 689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
## $ BILL_AMT4 : int 0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
## $ BILL_AMT5 : int 0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
## $ BILL_AMT6 : int 0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
## $ PAY_AMT1 : int 0 0 1518 2000 2000 2500 55000 380 3329 0 ...
## $ PAY_AMT2 : int 689 1000 1500 2019 36681 1815 40000 601 0 0 ...
## $ PAY_AMT3 : int 0 1000 1000 1200 10000 657 38000 0 432 0 ...
## $ PAY_AMT4 : int 0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
## $ PAY_AMT5 : int 0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
## $ PAY_AMT6 : int 0 2000 5000 1000 679 800 13770 1542 1000 0 ...
## $ default.payment.next.month: int 1 1 0 0 0 0 0 0 0 0 ...
#Providing an Overview of the dataset
hd<-head(spend)
hd %>% kable() %>% kable_styling()| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | -2 | 3913 | 3102 | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | 2 | 2682 | 1725 | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 29239 | 14027 | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 46990 | 48233 | 49291 | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | 0 | 8617 | 5670 | 35835 | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
| 6 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 64400 | 57069 | 57608 | 19394 | 19619 | 20024 | 2500 | 1815 | 657 | 1000 | 1000 | 800 | 0 |
Looking for inconsistencies in data
– Read CSV file
– Check for missing and impossible values
– Check data types of variables
Cleaning the data
– Replace impossible values by appropriate nearest possible value; categorical variables have values which do not correspond to any category
– Convert numerical categories to factor with proper labels; categorical variables present with numerical data
#As levels 0,5,6 of variable Education belongs to unknown categories it has been assigned to category 4(Others).
spend<-spend %>% mutate(EDUCATION = ifelse(EDUCATION==0|EDUCATION==5|EDUCATION==6,4,EDUCATION))
#As level 0 of variable Marriage belongs to unknown category it has been assigned to category 3(Others).
spend<-spend %>% mutate(MARRIAGE = ifelse(MARRIAGE==0,3,MARRIAGE))
#As level 0 and -2 of below listed set of variables belongs to unknown category it has been assigned to category -1(Pay duly).
spend<-spend %>% mutate(PAY_0 = ifelse(PAY_0==0|PAY_0==-2,-1,PAY_0))
spend<-spend %>% mutate(PAY_2 = ifelse(PAY_2==0|PAY_2==-2,-1,PAY_2))
spend<-spend %>% mutate(PAY_3 = ifelse(PAY_3==0|PAY_3==-2,-1,PAY_3))
spend<-spend %>% mutate(PAY_4 = ifelse(PAY_4==0|PAY_4==-2,-1,PAY_4))
spend<-spend %>% mutate(PAY_5 = ifelse(PAY_5==0|PAY_5==-2,-1,PAY_5))
spend<-spend %>% mutate(PAY_6 = ifelse(PAY_6==0|PAY_6==-2,-1,PAY_6))#Checking for no.of levels for all variables to be coverted to factors
#Assigning appropriate labels to the levels of variable Education.
spend$EDUCATION<- factor(spend$EDUCATION, levels = c(1,2,3,4), labels=c("Graduate School", "University", "High School", "Others") )
#Assigning appropriate labels to the levels of variable Marriage.
spend$MARRIAGE<- factor(spend$MARRIAGE, levels = c(1,2,3), labels=c("Married", "Single", "Others"))
#Assigning appropriate labels to the levels of variable Sex.
spend$SEX <- factor(spend$SEX ,levels = c('1','2'), labels=c("MALE", "FEMALE"))
#Assigning appropriate labels to the variable status of payment
spend$default.payment.next.month <- factor(spend$default.payment.next.month,levels = c(0,1), labels=c("No Default", "Default"))
#Following function aids to efforlessly assign labels to levels.
convtfactor<-function(x){factor(x, levels= c(-1,1,2,3,4,5,6,7,8), labels = c( "Pay duly", "payment delay for one month",
"payment delay for two months", "payment delay for three months", "payment delay for four months",
"payment delay for five months", "payment delay for six months", "payment delay for seven months",
"payment delay for eight months"))}
#Applying the above function on the variables PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6.
spend$PAY_0<-convtfactor(spend$PAY_0)
spend$PAY_2<-convtfactor(spend$PAY_2)
spend$PAY_3<-convtfactor(spend$PAY_3)
spend$PAY_4<-convtfactor(spend$PAY_4)
spend$PAY_5<-convtfactor(spend$PAY_5)
spend$PAY_6<-convtfactor(spend$PAY_6)
#Applying Mutate to determine Average Bill amount paid in past 6 months.
spend<-mutate(spend,BILL_AMOUNT_AVG=(BILL_AMT1+BILL_AMT2+BILL_AMT3+BILL_AMT4+BILL_AMT5+BILL_AMT6)/6)Outliers for the variable Bill amount average are handled using boxplot approach.
The observations having outliers are removed from the dataset.
The observations having negative values are removed
#In this chunk we would be dealing with outliers
#Handling outliers for Average Bill Amount variable
outlier <- boxplot(spend$BILL_AMOUNT_AVG,plot=FALSE)$out
spend<- spend[-which(spend$BILL_AMOUNT_AVG %in% outlier),]
spend<- spend[-which(spend$BILL_AMOUNT_AVG <0),]dim(spend)## [1] 27224 26
The table provides a brief summary statistics of ‘bill amount average’ variable to represent quantitative insights.
Summarizing the data gender-wise
– The mean ‘bill amount average’ for males and females are not the same.
#Providing descriptive summary for variable BILL_AMOUNT_AVG separately for each Sex.
table1<-spend %>% group_by(SEX) %>% summarise(Min = min(BILL_AMOUNT_AVG,na.rm = TRUE),
Q1 = quantile(BILL_AMOUNT_AVG,probs = .25,na.rm = TRUE),
Median = median(BILL_AMOUNT_AVG, na.rm = TRUE),
Q3 = quantile(BILL_AMOUNT_AVG,probs = .75,na.rm = TRUE),
Max = max(BILL_AMOUNT_AVG,na.rm = TRUE),
Mean = mean(BILL_AMOUNT_AVG, na.rm = TRUE),
SD = sd(BILL_AMOUNT_AVG, na.rm = TRUE),
n = n(),
Missing = sum(is.na(BILL_AMOUNT_AVG)))
table1 %>% kable() %>% kable_styling()| SEX | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| MALE | 0 | 5211.667 | 18864.00 | 44099.50 | 135506.8 | 30227.18 | 32143.58 | 10685 | 0 |
| FEMALE | 0 | 3525.917 | 17470.17 | 44019.92 | 135548.5 | 29448.48 | 32697.98 | 16539 | 0 |
#Plotting Box Plot to visualize the data
spend %>% boxplot(BILL_AMOUNT_AVG ~ SEX, data=.,ylab="Average Bill Amount", main = "Boxplot - Average Amount spent in dollars by Gender", col = c(rgb(0.1,0.1,0.7,0.5),rgb(0.8,0.1,0.3,0.6)))Hypotheses
Null Hypothesis: There is no statistically significant difference between mean of bill amount of male and female. \[H_0: \mu_1 = \mu_2 \]
Alternate Hypothesis: There is a statistically significant difference between mean of bill amount of male and female. \[H_A: \mu_1 \ne \mu_2\] Assumptions
– Comparing two independent population means i.e. bill amounts of males and females with unknown variance
– Population homogeneity of variance
– Population data sample used contains large data of 27,224 observations (n>30)
#Plotting QQPlot for Male BILL_AMOUNT_AVG
bill_male <- spend %>% filter(SEX == "MALE")
bill_male$BILL_AMOUNT_AVG %>% qqPlot(dist="norm")## [1] 5991 9658
title(main="BIll Amount for Male")#Plotting QQPlot for Female BILL_AMOUNT_AVG
bill_female <- spend %>% filter(SEX == "FEMALE")
bill_female$BILL_AMOUNT_AVG %>% qqPlot(dist="norm")## [1] 7107 6754
title(main="BIll Amount for Female")Observing the qqplot, though the deviation from the normal line is high, the normality can be assumed considering CLT(the sample size of both groups are greater than 30).
Check for equal variance of both population using Levene test.
Following hypothesis is checked:
Null Hypothesis - There is no statistically significant difference in the variance of bill amount of males and females. \[H_0: \sigma_1^2 = \sigma_2^2 \] Alternate Hypothesis - There is a statistically significant difference in the variance of bill amount of males and females. \[H_A: \sigma_1^2 \ne \sigma_2^2\]
#check for equal variance
levene<-leveneTest(BILL_AMOUNT_AVG ~ SEX, data = spend)
levene %>% kable() %>% kable_styling()| Df | F value | Pr(>F) | |
|---|---|---|---|
| group | 1 | 3.527547 | 0.0603677 |
| 27222 | NA | NA |
#Applying two sample t-test
t.test( BILL_AMOUNT_AVG ~ SEX, data = spend, var.equal = TRUE, alternative = "two.sided" )##
## Two Sample t-test
##
## data: BILL_AMOUNT_AVG by SEX
## t = 1.9315, df = 27222, p-value = 0.05343
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.49918 1568.90368
## sample estimates:
## mean in group MALE mean in group FEMALE
## 30227.18 29448.48
Hypothesis for the Chi-square of test of association :
\(H_0\)- There is no association between gender and tendency for a person to default (Gender and ability to payback are independent)
\(H_A\)- Alternate Hypothesis - There is an association between gender and tendency for a person to default (Gender and ability to payback are not independent)
Assumptions
That no more than 25% of the cells have expected counts below 5.
First, we report the chi-square, statistic, df and p-value
Result : p-value < 0.05
#Check for equal variance
chi <- chisq.test(table(spend$default.payment.next.month, spend$SEX))
chi##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(spend$default.payment.next.month, spend$SEX)
## X-squared = 45.522, df = 1, p-value = 1.509e-11
chi$observed##
## MALE FEMALE
## No Default 8086 13093
## Default 2599 3446
chi$expected##
## MALE FEMALE
## No Default 8312.431 12866.569
## Default 2372.569 3672.431
Summary of test
- 0% of expected counts were less than
- chi-square = 45.522, \(df\)= 1
- p-value < 0.05
Decision
- Reject \(H_0\)
We conclude the result of the test discovered a statistically significant association between gender and the default status. According to the bar plot more number of both males and females positively accomplish in clearing the bill amounts thus they fall under the category of ‘No Default’. However, more number of females show this behaviour than males.
tab1 <- table(spend$default.payment.next.month, spend$SEX)
tab2 <- tab1 %>% prop.table(margin=2)
tab2##
## MALE FEMALE
## No Default 0.7567618 0.7916440
## Default 0.2432382 0.2083560
barplot(tab2, main = "Default by Gender",ylab="Proportion within Gender", ylim=c(0,1.2), legend=rownames(tab2),beside=TRUE, args.legend=c(x = "top",horiz=TRUE,title="Status of Payment"), xlab="Gender", col = c(rgb(0.1,0.1,0.7,0.8),rgb(0.2,0.7,0.7,0.7)))