Introduction

The data set provides comprehensive insights into Retail Sales and Customer Demographics. Comprising 9 variables, it represents retail transactions sourced from Kaggle.com. The data set encompasses crucial attributes, including; Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount.

The data is a valuable resource for exploring diverse inquiries, and unraveling trends such as sales patterns, demographic impacts, and purchasing behaviors. Noteworthy aspects of the data set include unique identifiers for each transaction and the corresponding transaction date. Customers making purchases are distinctly identified, with some details, such as age and gender, being specified. The data set illuminates the acquisition of products across three categories, shedding light on the quantity and price involved in each transaction.

This analysis focuses on three pivotal questions to glean insights from the data set:

  1. Identifying the product categories that resonate most with customers.

  2. Uncovering discernible sales patterns across distinct periods.

  3. Examining whether customers tend to buy more of lower-priced or higher-priced items.

library(ggplot2)
library(reshape2)
library(openxlsx2)
library(readxl)
## 
## Attaching package: 'readxl'
## The following object is masked from 'package:openxlsx2':
## 
##     read_xlsx
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-3
library(rsconnect)
library(rmarkdown)
data <- read_excel("C:/Users/Chukwujekwu/Desktop/Project/Retail/retail.xlsx")

Data Cleaning and Exploration

The above code were the analysis which involves exploring and understanding the data set. Key aspects of the data were examined to uncover patterns, trends, and essential characteristics and prepare the data for analysis.

Here is a breakdown of the data exploration conducted on the Retail Sales and Customer. It is important to check for missing values. The output from the script indicates that there are no missing values in any of the columns in the data set. Having looked at the data in preparation for further analysis, it is important to transform some of the data to suit analysis, especially the character type of data. However, the Gender was converted to numeric to reflect the change of values from “Male” and “Female” to 1 and 0, respectively.

To have a better understating of the data, the statistical distribution of the data was examined. The results show that the minimum age is 18 and the maximum age of 64. The median age of 42. For the Product Category, there are three categories: Beauty, Clothing, and Electronics. Beauty has 307 counts, Clothing is 351), while Electronics is 342. In terms of transaction, the minimum of the total amount of the transaction is 25 while the maximum amount is 2,000. The Date spans from January 1, 2023, to January 1, 2024. Also, gender has a distribution of 490 Males and 510 Females. For the quantity of products purchased by the customers, it ranges from 1 to 4 with a mean value of 2.51. While the price per unit ranges from 25 to 500 with a mean value of 179.9. The relationship between the variables was examined and it showed that a strong positive correlation of 0.85 is found between “Price per Unit” and “Total Amount” and a moderate positive correlation of 0.37 between “Quantity” and “Total Amount” while other correlations seem to be relatively weak.

head(data) 
## # A tibble: 6 × 9
##   `Transaction ID` Date                `Customer ID` Gender   Age
##              <dbl> <dttm>              <chr>         <chr>  <dbl>
## 1              180 2023-01-01 00:00:00 CUST180       Male      41
## 2              522 2023-01-01 00:00:00 CUST522       Male      46
## 3              559 2023-01-01 00:00:00 CUST559       Female    40
## 4              163 2023-01-02 00:00:00 CUST163       Female    64
## 5              303 2023-01-02 00:00:00 CUST303       Male      19
## 6              421 2023-01-02 00:00:00 CUST421       Female    37
## # ℹ 4 more variables: `Product Category` <chr>, Quantity <dbl>,
## #   `Price per Unit` <dbl>, `Total Amount` <dbl>
sapply(data, function(x) sum(is.na(x))) 
##   Transaction ID             Date      Customer ID           Gender 
##                0                0                0                0 
##              Age Product Category         Quantity   Price per Unit 
##                0                0                0                0 
##     Total Amount 
##                0
sapply(data, class)
## $`Transaction ID`
## [1] "numeric"
## 
## $Date
## [1] "POSIXct" "POSIXt" 
## 
## $`Customer ID`
## [1] "character"
## 
## $Gender
## [1] "character"
## 
## $Age
## [1] "numeric"
## 
## $`Product Category`
## [1] "character"
## 
## $Quantity
## [1] "numeric"
## 
## $`Price per Unit`
## [1] "numeric"
## 
## $`Total Amount`
## [1] "numeric"
data$Gender <- as.numeric(factor(data$Gender, levels = c("Male", "Female"), labels = c(1, 0))) 
str(data) 
## tibble [1,000 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Transaction ID  : num [1:1000] 180 522 559 163 303 421 979 610 32 231 ...
##  $ Date            : POSIXct[1:1000], format: "2023-01-01" "2023-01-01" ...
##  $ Customer ID     : chr [1:1000] "CUST180" "CUST522" "CUST559" "CUST163" ...
##  $ Gender          : num [1:1000] 1 1 2 2 1 2 2 2 1 2 ...
##  $ Age             : num [1:1000] 41 46 40 64 19 37 19 26 30 23 ...
##  $ Product Category: chr [1:1000] "Clothing" "Beauty" "Clothing" "Clothing" ...
##  $ Quantity        : num [1:1000] 3 3 4 3 3 3 1 2 3 3 ...
##  $ Price per Unit  : num [1:1000] 300 500 300 50 30 500 25 300 30 50 ...
##  $ Total Amount    : num [1:1000] 900 1500 1200 150 90 1500 25 600 90 150 ...
summary(data)
##  Transaction ID        Date                       Customer ID       
##  Min.   :   1.0   Min.   :2023-01-01 00:00:00.0   Length:1000       
##  1st Qu.: 250.8   1st Qu.:2023-04-08 00:00:00.0   Class :character  
##  Median : 500.5   Median :2023-06-29 12:00:00.0   Mode  :character  
##  Mean   : 500.5   Mean   :2023-07-03 00:25:55.2                     
##  3rd Qu.: 750.2   3rd Qu.:2023-10-04 00:00:00.0                     
##  Max.   :1000.0   Max.   :2024-01-01 00:00:00.0                     
##      Gender          Age        Product Category      Quantity    
##  Min.   :1.00   Min.   :18.00   Length:1000        Min.   :1.000  
##  1st Qu.:1.00   1st Qu.:29.00   Class :character   1st Qu.:1.000  
##  Median :2.00   Median :42.00   Mode  :character   Median :3.000  
##  Mean   :1.51   Mean   :41.39                      Mean   :2.514  
##  3rd Qu.:2.00   3rd Qu.:53.00                      3rd Qu.:4.000  
##  Max.   :2.00   Max.   :64.00                      Max.   :4.000  
##  Price per Unit   Total Amount 
##  Min.   : 25.0   Min.   :  25  
##  1st Qu.: 30.0   1st Qu.:  60  
##  Median : 50.0   Median : 135  
##  Mean   :179.9   Mean   : 456  
##  3rd Qu.:300.0   3rd Qu.: 900  
##  Max.   :500.0   Max.   :2000
gender_distribution <-
table(data$Gender) 
print("Gender Distribution:") 
## [1] "Gender Distribution:"
print(gender_distribution) 
## 
##   1   2 
## 490 510
summary_stats <- summary(data$Age)
# Print age distribution summary
cat("Age Distribution Summary:\n") 
## Age Distribution Summary:
cat("Minimum Age:", min(data$Age), "\n")
## Minimum Age: 18
cat("1st Quartile (25th percentile):", quantile(data$Age, 0.25), "\n")
## 1st Quartile (25th percentile): 29
cat("Median Age (50th percentile):", median(data$Age), "\n")
## Median Age (50th percentile): 42
cat("3rd Quartile (75th percentile):", quantile(data$Age, 0.75), "\n")
## 3rd Quartile (75th percentile): 53
cat("Maximum Age:", max(data$Age), "\n")
## Maximum Age: 64
cat("\n")
# Print sample of age values
cat("Sample of Age Values:\n")
## Sample of Age Values:
cat(head(data$Age), "\n")
## 41 46 40 64 19 37
# Compute and print product category distribution
product_category_distribution <- table(data$`Product Category`)
print("Product Category Distribution:")
## [1] "Product Category Distribution:"
print(product_category_distribution)
## 
##      Beauty    Clothing Electronics 
##         307         351         342
# Compute and print total amount distribution summary
summary_stats <- summary(data$`Total Amount`)
cat("Total Amount Distribution Summary:\n")
## Total Amount Distribution Summary:
cat("Minimum Total Amount:", min(data$`Total Amount`), "\n")
## Minimum Total Amount: 25
cat("1st Quartile (25th percentile):", quantile(data$`Total Amount`, 0.25), "\n")
## 1st Quartile (25th percentile): 60
cat("Median Total Amount (50th percentile):", median(data$`Total Amount`), "\n")
## Median Total Amount (50th percentile): 135
cat("3rd Quartile (75th percentile):", quantile(data$`Total Amount`, 0.75), "\n")
## 3rd Quartile (75th percentile): 900
cat("Maximum Total Amount:", max(data$`Total Amount`), "\n")
## Maximum Total Amount: 2000
cat("\n")
# Print sample of total amount values
cat("Sample of Total Amount Values:\n")
## Sample of Total Amount Values:
cat(head(data$`Total Amount`), "\n")
## 900 1500 1200 150 90 1500
# Extract numeric data and print structure
numeric_data <- data[, sapply(data, is.numeric)]
str(numeric_data)
## tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Transaction ID: num [1:1000] 180 522 559 163 303 421 979 610 32 231 ...
##  $ Gender        : num [1:1000] 1 1 2 2 1 2 2 2 1 2 ...
##  $ Age           : num [1:1000] 41 46 40 64 19 37 19 26 30 23 ...
##  $ Quantity      : num [1:1000] 3 3 4 3 3 3 1 2 3 3 ...
##  $ Price per Unit: num [1:1000] 300 500 300 50 30 500 25 300 30 50 ...
##  $ Total Amount  : num [1:1000] 900 1500 1200 150 90 1500 25 600 90 150 ...
# Compute and print correlation matrix
correlation_matrix <- cor(numeric_data)
print(correlation_matrix)
##                Transaction ID        Gender          Age    Quantity
## Transaction ID     1.00000000 -0.0103874595  0.065190946 -0.02662252
## Gender            -0.01038746  1.0000000000 -0.002621444  0.02802265
## Age                0.06519095 -0.0026214444  1.000000000 -0.02373658
## Quantity          -0.02662252  0.0280226527 -0.023736579  1.00000000
## Price per Unit    -0.06083662  0.0009612312 -0.038422845  0.01750128
## Total Amount      -0.07503432  0.0010007048 -0.060568024  0.37370705
##                Price per Unit Total Amount
## Transaction ID  -0.0608366201 -0.075034320
## Gender           0.0009612312  0.001000705
## Age             -0.0384228451 -0.060568024
## Quantity         0.0175012801  0.373707054
## Price per Unit   1.0000000000  0.851924840
## Total Amount     0.8519248404  1.000000000

Product Categories and Customer Preference

Attempting to answer the question on the product category that holds the highest appeal among customers, the distribution of product categories by gender, and total sales. These give insights into the popularity of each product category among male and female customers as well as how each product performed by the amount sold. An analysis of the distribution of total amounts spent across different product categories, with a focus on gender-specific preferences as well as the general performance of the product categories was conducted.The figure compared the distribution of transactions across product categories for males and females.

From the figure, we can observe the buying patterns of males and females across different product categories. Understanding the gender-specific preferences for different product categories can inform targeted marketing strategies. For instance, Beauty products are more popular among women, thereby, marketing efforts can be tailored to target men if the company wants to expand its sales to men or it can be tailored to women to further encourage more sales. This depends on the strategy of the company. Meanwhile, the data also showed that clothing and electronics performed relatively well among men compared to women, though, the difference is not substantial.

Looking at the performance of the product category, the aggregated summary of total amounts of sales on each product category provides insights into the overall revenue generated by different product categories. The above chart shows the performance of the product categories. The three product categories based on the total amount generated are Electronics with a revenue of $156,905, followed by Clothing with a revenue of $155,580, and lastly, the Beauty product category with a revenue of $143,515. From the insight generated, the business can prioritize based on its revenue to improve sales efforts for these categories: Electronics, Clothing, and Beauty being the top revenue-generating product categories.

# Attempting to answer the question on "Which product categories hold the highest appeal among customers?"

ggplot(data, aes(x = `Product Category`, fill = as.factor(Gender))) + geom_bar(position = "dodge") + labs(title = 'Product Category Distribution by Gender', fill = 'Gender') + scale_fill_manual(values = c("blue", "pink"), labels = c("Male", "Female"))

category_summary <- aggregate(data$`Total Amount`, by = list(data$`Product Category`), FUN = sum)
colnames(category_summary) <- c('Product Category', 'Total Amount')
category_summary <- category_summary[order(category_summary$`Total Amount`, decreasing = TRUE), ]
print(category_summary)
##   Product Category Total Amount
## 3      Electronics       156905
## 2         Clothing       155580
## 1           Beauty       143515
ggplot(category_summary, aes(x = `Product Category`, y = `Total Amount`, fill = `Product Category`)) + geom_bar(stat = 'identity') + labs(title = 'Total Amount Spent by Product Category', fill = 'Product Category')

Sales Pattern Across Board and Month-on-Month Performance

The chart showing the sales trends over time, provided the overviews of the sales. The figures show that there are fluctuations in sales and the peaks or valleys on specific dates.

To further gain insight into the monthly sales to understand seasonality and the month with the highest sales, there is a need to break down the data into monthly sales. Monthly sales show variability. For example, there is a noticeable spike in sales in May and October. May is the month with the highest sales. The month of May has significantly higher sales compared to the surrounding months, indicating a potential month with special event or promotion.

The percentage changes from one month to the next vary. For instance, there’s a substantial increase from April to May and a significant decrease from May to June. The average monthly sales is $35,076.9, providing a baseline for typical performance. It could be seen that the months below the average sales were Match, September, and January 2024. Further investigation is needed to understand the factors contributing to the sales spikes, especially in May and October. Opportunities may lie in capitalizing on the observed spikes. Challenges could involve addressing the lower sales periods.

# Attempting to answer the question on "Are there discernible patterns in sales across different time periods?"

ggplot(data, aes(x = Date, y = `Total Amount`)) + geom_line() + labs(title = 'Total Amount Over Time', x = 'Date', y = 'Total Amount')

data_monthly <- data %>%
mutate(Month = format(Date, "%Y-%m")) %>%
group_by(Month) %>%
summarize(Total_Sales = sum(`Total Amount`)) %>%
filter(!is.na(Month))
data_monthly$Month <- as.Date(paste0(data_monthly$Month, "-01"), format = "%Y-%m-%d")
average_monthly_sales <- mean(data_monthly$Total_Sales)
max_index <- which.max(data_monthly$Total_Sales)
min_index <- which.min(data_monthly$Total_Sales)
p <- ggplot(data_monthly, aes(x = Month, y = Total_Sales)) +
geom_line() +
labs(title = 'Total Sales Over Time (Monthly)', x = 'Month', y = 'Total Sales') +
theme_minimal() +
annotate("point", x = data_monthly$Month[max_index], y = data_monthly$Total_Sales[max_index], color = "red", size = 3) +
annotate("point", x = data_monthly$Month[min_index], y = data_monthly$Total_Sales[min_index], color = "blue", size = 3) +
geom_hline(yintercept = average_monthly_sales, linetype = "dashed", color = "green") # Add a horizontal line for the average monthly sales
print(p)

Price Per Unit Against Quantity Purchased

Here, we want to examine whether customers tend to buy more due to price changes. To do this, we needed to look at the relationship between price per unit, quantity purchased as well as total sales.

The scatter plot shows individual data points for Quantity and Price per Unit. It helps visualize the distribution of data. It could be seen that there is a weak correlation which means that there is no strong linear relationship between these two variables. Importantly, the correlation coefficient between Quantity and Price per Unit is approximately 0.0175. This indicates a very weak positive correlation between the two variables. This means that as prices increase, the quantity slightly increases.

To further understand how price affects the quantity of product that is being purchased by the customers, the variable was extended to include Total Amount, and the Generalized Additive Model (GAM) was adopted. A GAM model was fitted to explore the relationship more flexibly. The summary of the GAM model indicates that the nonparametric effect of Price per Unit is not statistically significant (p-value = 0.5804). This further supports the idea that there is no clear non-linear relationship between Quantity and Price per Unit. It is important to state that the relationship was positive. The smoothed line plot, generated using linear regression, shows a slightly flat line, reinforcing the lack of a strong linear relationship between Quantity and Price per Unit. The weak correlation and the results from the GAM and linear regression models suggest that changes in the price per Unit do not have a significant impact on the quantity sold. However, the upward slopes from left to right suggest a positive correlation. As the price per unit increases, the quantity purchased also tends to increase. The steepness of the line indicates the strength of the correlation. The blue steeper line suggests a weak correlation between the two variables. Therefore, customers may not be highly sensitive to changes in the price of the product, at least within the observed range.

# Examing whether customers tend to buy more of lower-priced or higher-priced items.

correlation_matrix <- cor(data[, c("Quantity", "Price per Unit")])
print(correlation_matrix)
##                  Quantity Price per Unit
## Quantity       1.00000000     0.01750128
## Price per Unit 0.01750128     1.00000000
ggplot(data, aes(x = `Quantity`, y = `Price per Unit`)) + geom_point() + labs(title = 'Relationship between Quantity and Price per Unit', x = 'Quantity', y = 'Price per Unit')

gam_model <- gam(Quantity ~ s(`Price per Unit`), data = data)
summary(gam_model)
## 
## Call: gam(formula = Quantity ~ s(`Price per Unit`), data = data)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6294 -1.4317  0.3706  1.3706  1.5683 
## 
## (Dispersion Parameter for gaussian family taken to be 1.284)
## 
##     Null Deviance: 1281.804 on 999 degrees of freedom
## Residual Deviance: 1277.571 on 995 degrees of freedom
## AIC: 3094.838 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                      Df  Sum Sq Mean Sq F value Pr(>F)
## s(`Price per Unit`)   1    0.39 0.39261  0.3058 0.5804
## Residuals           995 1277.57 1.28399               
## 
## Anova for Nonparametric Effects
##                     Npar Df  Npar F  Pr(F)
## (Intercept)                               
## s(`Price per Unit`)       3 0.99702 0.3935
ggplot(data, aes(x = `Price per Unit`, y = Quantity)) +
geom_smooth(method = 'lm') + # Use linear regression for smoothing
labs(title = 'Smoothed Line Plot: Relationship between Quantity and Price per Unit', x = 'Price per Unit', y = 'Quantity')
## `geom_smooth()` using formula = 'y ~ x'

In conclusion, the key findings include gender-specific product preferences, identification of top-performing product categories, and recognition of sales trends over time. The weak correlation between price per unit and quantity suggests that customers may not be highly sensitive to price changes within the observed range. The analysis provides actionable recommendations for targeted marketing, sales optimization during peak months, and considerations for pricing strategies. Overall, the findings offer a foundation for informed decision-making to enhance retail performance and customer engagement.