Homework assignment 2 at the course Applied Data Analysis in Business with R

Author: Hui-Ju Huang

1 Research question

1.1 Does the average monthly sales revenue differ significantly between customers who use member cards and those who do not?

1.2 Variables

Independent Variable: Customer type (Member or Normal)
Dependent Variable: Total Sales Revenue

1.3 Research hypotheses

H0: The average monthly sales revenue is the same for both customer types.
H1: The average monthly sales revenue differs between customers who use member cards and those who do not.

2 Data

2.1 Import of the data

mydata <- read.table("./supermarket_sales.csv", header = TRUE, sep = ",", dec = ".")

head(mydata)

##    Invoice.ID Branch      City Customer.type Gender           Product.line
## 1 750-67-8428      A    Yangon        Member Female      Health and beauty
## 2 226-31-3081      C Naypyitaw        Normal Female Electronic accessories
## 3 631-41-3108      A    Yangon        Normal   Male     Home and lifestyle
## 4 123-19-1176      A    Yangon        Member   Male      Health and beauty
## 5 373-73-7910      A    Yangon        Normal   Male      Sports and travel
## 6 699-14-3026      C Naypyitaw        Normal   Male Electronic accessories
##   Unit.price Quantity  Tax.5.    Total      Date  Time     Payment   cogs
## 1      74.69        7 26.1415 548.9715  1/5/2019 13:08     Ewallet 522.83
## 2      15.28        5  3.8200  80.2200  3/8/2019 10:29        Cash  76.40
## 3      46.33        7 16.2155 340.5255  3/3/2019 13:23 Credit card 324.31
## 4      58.22        8 23.2880 489.0480 1/27/2019 20:33     Ewallet 465.76
## 5      86.31        7 30.2085 634.3785  2/8/2019 10:37     Ewallet 604.17
## 6      85.39        7 29.8865 627.6165 3/25/2019 18:30     Ewallet 597.73
##   gross.margin.percentage gross.income Rating
## 1                4.761905      26.1415    9.1
## 2                4.761905       3.8200    9.6
## 3                4.761905      16.2155    7.4
## 4                4.761905      23.2880    8.4
## 5                4.761905      30.2085    5.3
## 6                4.761905      29.8865    4.1

2.2 Explanation of the dataset

Unit of Observation: Each row represents a sales transaction in the supermarket.
Sample Size: Total number of rows in the dataset (1000).
Definition of Variables
- Invoice id: Computer generated sales slip invoice identification number.
- Branch: Branch of supermarket (3 branches are available identified by A, B and C).
- City: Location of supermarkets.
- Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card.
- Gender: Gender type of customer.
- Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel.
- Unit price: Price of each product in $.
- Quantity: Number of products purchased by customer.
- Tax: 5% tax fee for customer buying.
- Total: Total price including tax.
- Date: Date of purchase (Record available from January 2019 to March 2019).
- Time: Purchase time (10am to 9pm).
- Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet).
- COGS: Cost of goods sold.
- Gross margin percentage: Gross margin percentage.
- Gross income: Gross income.
- Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10).

2.3 Source of the dataset

Kaggle: Supermarket sales
https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales

2.4 Basic descriptive statistics

Summary of the Total

summary(mydata$Total)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.68  124.42  253.85  322.97  471.35 1042.65

The mean of the total price is 322.97.

Convert categorical variables to factors

mydata$Customer.type <- factor(mydata$Customer.type)

Determine sample sizes

sample_size_member <- sum(mydata$Customer.type == "Member")
sample_size_normal <- sum(mydata$Customer.type == "Normal")
sample_size_member

## [1] 501

sample_size_normal

## [1] 499

There are 501 people who does have member in the sample.
There are 499 people who does not have member in the sample.

#install.packages("psych")
library(psych)
sales_member <- mydata$Total[mydata$Customer.typ == "Member"]
describe(sales_member)

##    vars   n   mean     sd median trimmed   mad   min     max   range skew
## X1    1 501 327.79 247.99 266.03  299.64 251.8 10.68 1042.65 1031.97 0.84
##    kurtosis    se
## X1    -0.22 11.08

The mean of the total price who does have a member is 327.79.
The standard deviation of the total price who does have a member is 247.99.

sales_normal <- mydata$Total[mydata$Customer.typ == "Normal"]
describe(sales_normal)

##    vars   n   mean    sd median trimmed    mad   min     max  range skew
## X1    1 499 318.12 243.9 237.43   288.7 217.29 12.69 1039.29 1026.6 0.94
##    kurtosis    se
## X1     0.04 10.92

The mean of the total price who does not have a member is 318.12.
The standard deviation of the total price who does not have a member is 243.9.

3 Analysis

3.1 Determine the statistical test to use

We use an independent samples t-test to compare the average monthly sales revenue between customers who use member cards and those who do not. This is appropriate since we are comparing means of two independent groups (customer types).

3.2 Evaluate all assumptions

Check whether a variable is normally distributed.

H0: Variable is normally distributed.
H1: Variable is not normally distributed.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

Member <- ggplot(mydata[mydata$Customer.type == "Member", ], aes(x = Total)) +
  geom_histogram(position = "identity", binwidth = 100, colour = "black") +
  ylab("Count") +
  xlab("Total") +
  ggtitle("Member")

Normal <- ggplot(mydata[mydata$Customer.type == "Normal", ], aes(x = Total)) +
  geom_histogram(position = "identity", binwidth = 100, colour = "black") +
  ylab("Count") +
  xlab("Total") +
  ggtitle("Normal")

#install.packages("ggpubr")
library(ggpubr)
ggarrange(Member, Normal,ncol = 2, nrow=1)

The two histograms shows the number of total price for both groups of customers. Both do not look normal, which can be formally verified with the Shapiro-Wilk normality tests.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(Customer.type) %>%
  shapiro_test(Total)

## # A tibble: 2 × 4
##   Customer.type variable statistic        p
##   <fct>         <chr>        <dbl>    <dbl>
## 1 Member        Total        0.912 2.02e-16
## 2 Normal        Total        0.904 3.49e-17

We performed two normality tests, separately for each subgroup. In both cases, the p-values are less than 0.05, so that we reject the null hypothesis. This means that we assume that the normality assumption is not fulfilled.
Therefore, the appropriate non-parametric alternative test must be performed.

3.3 Statistical test

Non-parametric test: Wilcoxon rank sum test
- To statistically compare both groups, we use the Wilcoxon rank sum test, which is an alternative to the t-test for independent samples.
The hypotheses
- H0: Distribution locations of average monthly sales revenue is the same for both customer types.
- H1: Distribution locations of average monthly sales revenue differs between customers who use member cards and those who do not.

wilcox.test(mydata$Total ~ mydata$Customer.type,
  paired = FALSE,
  correct = FALSE,
  exact = FALSE,
  alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Total by mydata$Customer.type
## W = 127453, p-value = 0.5911
## alternative hypothesis: true location shift is not equal to 0

We cannot reject the null hypothesis that the distributions of average monthly sales revenue is same for both customer types.

3.4 Calculation of the effect size

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(mydata$Total ~ mydata$Customer.type,
  paired = FALSE,
  correct = FALSE,
  exact = FALSE,
  alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.02              | [-0.05, 0.09]

library(effectsize)
interpret_rank_biserial(0.02 )

## [1] "tiny"
## (Rules: funder2019)

The effect size shows the tiny differences between the distributions.

4 Conclusion

Based on the sample data, we find that there is no significant difference in average monthly sales revenue between customers who use member cards and those who do not (p > 0.05). Specifically, customers who use member cards and those who do not have similar average monthly sales revenue. Therefore, we fail to reject the null hypothesis, and we conclude that there is a tiny or negligible difference (r = 0.02) in sales revenue between the two customer types.

Homework assignment 2 at the course Applied Data Analysis in Business with R

2024-04-26

1 Research question

1.1 Does the average monthly sales revenue differ significantly between customers who use member cards and those who do not?

1.2 Variables

1.3 Research hypotheses

2 Data

2.1 Import of the data

2.2 Explanation of the dataset

2.3 Source of the dataset

2.4 Basic descriptive statistics

3 Analysis

3.1 Determine the statistical test to use

3.2 Evaluate all assumptions

3.3 Statistical test

3.4 Calculation of the effect size

4 Conclusion