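# Load packages (assumed; the original load step is not shown)
library(dplyr); library(ggplot2); library(stargazer); library(Amelia)
# Load data
diwali <- read.csv("/Users/pin.lyu/Desktop/Tidy_Tuesday/Data/Diwali_Sales_Data.csv")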
Data Description:
# Change variable types: strip the "P" prefix and store Product_ID as a number
diwali$Product_ID <- gsub("P", "", diwali$Product_ID) |>
  as.numeric()
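As a quick check of the conversion above, the sketch below applies the same `gsub()` and `as.numeric()` steps to a few made-up IDs. The ID strings are hypothetical and only assume the "P followed by digits" format; any value that is not purely digits after the "P" is removed would become `NA` with a warning.

```r
# Hypothetical IDs in a "P" + digits format (not taken from the data set)
ids <- c("P00012345", "P00000007", "P00370001")

# Same steps as above: drop the "P", then convert the rest to numeric
ids_num <- gsub("P", "", ids) |>
  as.numeric()

print(ids_num)
```

Leading zeros are dropped by the conversion, so the numeric IDs are only useful as labels; their mean or standard deviation has no real interpretation.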
stargazer(data = diwali, type = "text", title = "Table 1: Data Summary Statistics")
##
## Table 1: Data Summary Statistics
## ====================================================================
## Statistic           N          Mean    St. Dev.       Min        Max
## --------------------------------------------------------------------
## User_ID        11,251 1,003,004.000   1,716.125 1,000,001  1,006,040
## Product_ID     11,251   175,109.900 101,252.600       142    370,642
## Age            11,251        35.421      12.754        12         92
## Marital_Status 11,251         0.420       0.494         0          1
## Orders         11,251         2.489       1.115         1          4
## Amount         11,239     9,453.611   5,222.356   188.000 23,952.000
## --------------------------------------------------------------------
Comment: The data set contains 11,251 observations, and the mean age is about 35. The mean of the binary variable “Marital_Status” is 0.42, so about 42% of customers are married and unmarried customers are the majority. “Amount” has only 11,239 non-missing values, so 12 observations lack a purchase amount. On average, customers in this data set spent 9,453 rupees during Diwali, which is roughly 112 dollars. Lastly, the average number of orders of a product by a single customer is about 2.5.
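The arithmetic behind these statements can be sketched as follows. The inputs are copied from Table 1; the head count is only approximate because the mean is rounded to three decimals, and the exchange rate is simply the one implied by the 112-dollar figure, not an official rate.

```r
# Values copied from Table 1
n_obs        <- 11251
mean_married <- 0.420     # mean of the 0/1 Marital_Status variable
mean_amount  <- 9453.611  # mean Amount in rupees

# The mean of a 0/1 variable is the share of ones, so the rest are zeros
share_unmarried <- 1 - mean_married

# Approximate number of married customers (mean is rounded, so this is rough)
approx_married <- round(mean_married * n_obs)

# Rupees per dollar implied by "9,453 rupees is roughly 112 dollars"
implied_rate <- mean_amount / 112

print(c(share_unmarried = share_unmarried,
        approx_married  = approx_married,
        implied_rate    = implied_rate))
```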
# Check missing values
missmap(obj = diwali)
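`missmap()` gives a visual overview; a numeric count per column can complement it. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the Diwali data) to show the idea with base R only.

```r
# Toy data frame standing in for diwali (values are made up)
toy <- data.frame(
  Age    = c(25, 31, NA, 44),
  Orders = c(1, 2, 3, 4),
  Amount = c(500, NA, NA, 1200)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Applied to `diwali`, the same call would be expected to report 12 missing values for `Amount`, given the difference between 11,251 and 11,239 in Table 1.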
# Remove unnecessary variables
diwali <- diwali |>
  select(-Status, -unnamed1)
table(diwali$Gender)
##
## F M
## 7842 3409
Comment: This data set contains more than twice as many records of women’s purchases (7,842) as of men’s (3,409) across India. This imbalance may bias our analysis.
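The imbalance can be quantified directly from the counts printed above. This is a minimal sketch that re-enters the two counts by hand rather than reading the data.

```r
# Counts copied from the table(diwali$Gender) output above
gender_counts <- c(F = 7842, M = 3409)

# Share of each gender among all records
gender_share <- prop.table(gender_counts)

# Ratio of female to male records
f_to_m <- gender_counts[["F"]] / gender_counts[["M"]]

print(round(gender_share, 3))
print(round(f_to_m, 2))
```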
length(unique(diwali$Product_Category))
## [1] 18
Comment: The products in this data set fall into 18 categories.
length(unique(diwali$Product_ID))
## [1] 2351
Comment: A total of 2,351 unique products are recorded in this data set.
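# Note: group_by() without summarise() does not aggregate; every row is plotted
graph_1 <- diwali |>
  group_by(Age.Group, Amount, Gender)
# With position = "dodge", rows that share a bar are drawn on top of each other,
# so the visible height is the largest single Amount, not the group total
ggplot(data = graph_1, aes(x = Age.Group, y = Amount, fill = Gender)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(title = "Sales by Age Group During Diwali",
       y = "Total Amount of Expenditure",
       x = "Age Groups") +
  scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
  theme_classic()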
## Warning: Removed 12 rows containing missing values (`geom_bar()`).
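If the goal is to show total spending per group, one option is to sum `Amount` before plotting so that each dodged bar holds exactly one value. The sketch below is not the code used for the graph above; it uses a made-up data frame (`toy`) with the same column names to show the approach.

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for diwali (values are made up)
toy <- data.frame(
  Age.Group = c("0-17", "0-17", "0-17", "18-25", "18-25", "18-25"),
  Gender    = c("F", "F", "M", "F", "M", "M"),
  Amount    = c(100, 300, 350, 200, NA, 150)
)

# One row per Age.Group x Gender, holding the summed Amount
totals <- toy |>
  group_by(Age.Group, Gender) |>
  summarise(Total = sum(Amount, na.rm = TRUE), .groups = "drop")

print(totals)

# Each dodged bar now has a single value, so its height is the group total
p <- ggplot(totals, aes(x = Age.Group, y = Total, fill = Gender)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
  theme_classic()
```

In the toy data the largest single purchase in the "0-17" group belongs to a boy (350), while the girls' total (400) is higher, so the two ways of drawing the bars can rank the groups differently.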
diwali |>
  filter(Age < 18, Gender == 'F') |>
  count()
## n
## 1 162
diwali |>
  filter(Age < 18, Gender == 'M') |>
  count()
## n
## 1 134
Comments: In the “0-17” age group there are fewer boys (134) than girls (162), yet the graph shows higher spending for boys than for girls. This could suggest two things: 1) boys in this age group purchase more expensive items than girls do, or 2) they purchase multiples of the same product during a single trip to the retail store. One caveat: the dodged bars are drawn from unaggregated rows, so each bar's height reflects the largest single purchase in the group rather than the group total.
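The two explanations could be told apart by comparing the mean `Amount` and the mean `Orders` per record for each gender among under-18 customers. The sketch below uses made-up values in a hypothetical `toy` data frame and assumes `Amount` is the total paid for a record; it is only meant to show the shape of the check.

```r
# Toy data standing in for diwali (values are made up)
toy <- data.frame(
  Age    = c(15, 16, 17, 14, 16, 30),
  Gender = c("F", "F", "F", "M", "M", "M"),
  Orders = c(1, 2, 3, 2, 4, 1),
  Amount = c(400, 600, 500, 900, 1100, 700)
)

# Keep only customers under 18, as in the filter() calls above
minors <- subset(toy, Age < 18)

# Mean Amount and mean Orders per record, by gender
by_gender <- aggregate(cbind(Amount, Orders) ~ Gender, data = minors, FUN = mean)
print(by_gender)
```

A higher mean `Amount` with similar mean `Orders` would point toward more expensive items; a higher mean `Orders` would point toward buying multiples.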
graph_2 <- diwali |>
  group_by(Marital_Status, Amount, Gender)
# Bars are stacked (the default position), so heights add up to group totals
ggplot(data = graph_2, aes(x = Marital_Status, y = Amount, fill = Gender)) +
  geom_bar(stat = "identity") +
  labs(title = "Expenditure by Marital Status",
       y = "Total Amount of Expenditure",
       x = "Marital Status") +
  scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
  theme_classic()
## Warning: Removed 12 rows containing missing values (`position_stack()`).
Comments: Judging from this graph alone, we might conjecture that women, whether married or single, tend to spend more than men. However, as noted earlier, the data set records more women than men, so the totals are not directly comparable, and it is hard to extrapolate to the broader population of India. One possible reason for the larger number of women in the data is that women are the target customers of this retail brand.
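Because group sizes differ, average spending per record is a fairer comparison than total spending. The sketch below contrasts the two on a made-up data frame (`toy` is hypothetical); it is an illustration of the idea, not a result from the Diwali data.

```r
# Toy data standing in for diwali (values are made up)
toy <- data.frame(
  Gender         = c("F", "F", "F", "F", "M", "M"),
  Marital_Status = c(0, 0, 1, 1, 0, 1),
  Amount         = c(100, 300, 200, 400, 500, 300)
)

groups <- list(Gender = toy$Gender, Marital_Status = toy$Marital_Status)

# Total spending per group: driven partly by how many records each group has
group_total <- tapply(toy$Amount, groups, sum)

# Mean spending per record: comparable across groups of different size
group_mean <- tapply(toy$Amount, groups, mean)

print(group_total)
print(group_mean)
```

In the toy data, unmarried women have a total of 400 against 500 for unmarried men, but married women have the larger total (600 against 300) only because there are twice as many of them; their means are equal.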
graph_3 <- diwali |>
  group_by(State, Amount)
# Rows are not aggregated and bars are dodged, so each bar shows the
# largest single Amount in a state rather than the state total
ggplot(data = graph_3, aes(x = State, y = Amount, fill = State)) +
  geom_col(position = "dodge") +
  labs(title = "Total Expenditure by State",
       y = "Total Amount of Expenditure",
       x = "States") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 12 rows containing missing values (`geom_col()`).
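Comments: This gives an interesting state-level view. However, the bars show little difference in spending between states. Because the bars are dodged and the rows are not aggregated, each bar reflects the largest single purchase in a state rather than the state total, which may explain why the states look so similar.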
graph_4 <- diwali |>
  group_by(Age.Group, Amount, Product_Category)
# Bars are stacked (the default position), so heights add up to category totals
ggplot(data = graph_4, aes(x = Product_Category, y = Amount, fill = Age.Group)) +
  geom_bar(stat = "identity") +
  labs(title = "Expenditure by Product Category",
       y = "Total Amount of Expenditure",
       x = "Product Category") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 12 rows containing missing values (`position_stack()`).
Comment: Most of the spending goes to Food, Clothing & Apparel, Electronics & Gadgets, and Footwear & Shoes. This fits what we would expect from our own experience. People are more likely to eat out, to shop for gifts for family and friends, and to upgrade their electronic devices, since holiday promotions timed to the holiday period give them an incentive to spend.
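"Most of the spending" can be made concrete by computing each category's share of total spending. The sketch below does this on a made-up data frame (`toy` and its values are hypothetical); the category names are borrowed from the comment above only as labels.

```r
# Toy data standing in for diwali (values are made up)
toy <- data.frame(
  Product_Category = c("Food", "Food", "Clothing & Apparel",
                       "Footwear & Shoes", "Footwear & Shoes"),
  Amount           = c(300, 200, 250, 150, NA)
)

# Total spending per category, ignoring missing amounts
cat_total <- tapply(toy$Amount, toy$Product_Category, sum, na.rm = TRUE)

# Share of overall spending, largest first
cat_share <- cat_total / sum(cat_total)
cat_share <- cat_share[order(cat_share, decreasing = TRUE)]

print(round(cat_share, 3))
```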
Regression one
R1 <- lm(Amount ~ Orders, data = diwali)
Regression two
R2 <- lm(Amount ~ Marital_Status + Orders + Age, data = diwali)
# Summary
stargazer(R1, R2, type = 'text', title = 'Summary of Two Linear Regressions')
##
## Summary of Two Linear Regressions
## ===================================================================
##                                   Dependent variable:
##                     -----------------------------------------------
##                                         Amount
##                               (1)                    (2)
## -------------------------------------------------------------------
## Marital_Status                                    -181.556*
##                                                    (99.756)
##
## Orders                      -61.750                -63.198
##                            (44.182)                (44.159)
##
## Age                                               12.628***
##                                                    (3.861)
##
## Constant                 9,607.345***            9,240.068***
##                            (120.522)              (186.781)
##
## -------------------------------------------------------------------
## Observations                11,239                  11,239
## R2                          0.0002                  0.001
## Adjusted R2                 0.0001                  0.001
## Residual Std. Error 5,222.134 (df = 11237)  5,219.311 (df = 11235)
## F Statistic          1.953 (df = 1; 11237) 5.371*** (df = 3; 11235)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01
Comments: First, I ran a simple linear regression of the amount spent on “Orders”, a variable that records how many units of the same product a customer bought during a single trip to the retail store. This variable turned out to be statistically insignificant. Therefore, I ran a multiple regression of “Amount” on “Orders”, “Marital_Status”, and “Age”. “Orders” is still statistically insignificant. However, the other two variables are statistically significant: “Age” at the 1% level and “Marital_Status” only at the 10% level. Both models explain very little of the variation in “Amount” (R2 of at most 0.001). The coefficients of these two variables in the second regression are interpreted as follows:
(0 = Unmarried, 1 = Married) Holding the other variables constant, a married person is predicted to spend 181.556 rupees less than an unmarried person.
Holding the other variables constant, each additional year of age is associated with 12.628 rupees more spending during Diwali.
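To make the interpretation concrete, the sketch below plugs the coefficients from column (2) into the fitted equation for an illustrative customer. The customer (35 years old, 2 orders) is a hypothetical example, and `predict_amount()` is a helper written for this illustration rather than part of the analysis above.

```r
# Coefficients and standard errors copied from column (2) of the table above
b  <- c(Constant = 9240.068, Marital_Status = -181.556,
        Orders = -63.198, Age = 12.628)
se <- c(Marital_Status = 99.756, Orders = 44.159, Age = 3.861)

# Hypothetical helper: predicted Amount from the fitted linear equation
predict_amount <- function(married, orders, age) {
  b[["Constant"]] +
    b[["Marital_Status"]] * married +
    b[["Orders"]] * orders +
    b[["Age"]] * age
}

# Illustrative customer: 35 years old with 2 orders, unmarried vs. married
unmarried_35 <- predict_amount(married = 0, orders = 2, age = 35)
married_35   <- predict_amount(married = 1, orders = 2, age = 35)

# t statistics: coefficient divided by its standard error
t_stats <- b[names(se)] / se

print(c(unmarried = unmarried_35, married = married_35))
print(round(t_stats, 2))
```

The gap between the two predictions equals the “Marital_Status” coefficient, and the t statistics show why only “Age” clears the 1% threshold.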