Introduction

The purpose of this report is to assess the shopping_trends.csv dataset from Kaggle and study retail purchasing behavior across product categories, seasons, and customer demographics. The dataset holds item-level transaction data, including product category, gender, season, number of previous purchases, and purchase amount. It has no missing values, so every record can be used in each analysis.

This report uses descriptive statistics and visualizations to assess the distribution of demand for the best-selling items, seasonal buying habits, and the relationship between previous purchases, purchase counts, and purchase amount. The aim is to examine whether category demand, purchase frequency, and average expenditure differ noticeably by season or gender. The visualizations were designed to highlight the key factors, trends, and behavioral patterns.

Dataset

Descriptive Statistics

(code shown in Tab 1)

The dataset contains 3,900 observations of 19 variables, both categorical and numeric. Purchase amounts range from $20 to $100, with a mean of $59.76 and a median of $60. The number of previous purchases ranges from 1 to 50, with a mean of 25.35. There are no missing values in any of the variables.

The shopping_trends.csv file contains individual retail transaction records, with each row recording one purchase. Purchase amount (USD) and previous purchase count are recorded as numerical variables, while product category, season, gender, item purchased, and location are recorded as categorical variables.

Because the records are complete, every analysis and visualization draws on the same 3,900 rows. The combination of customer attributes, product details, and transaction data provides a good structure for analyzing buying behavior across several dimensions.

Findings

Overall, the descriptive analysis shows that purchasing patterns are consistent and stable across the dataset. In every season, Clothing is the most common product category, followed by Accessories, with Footwear and Outerwear showing the lowest purchase counts. There are small seasonal fluctuations (for example, Spring and Fall see an increase in Clothing purchases), but the stable relative ranking of categories suggests that customer demand is driven by product type rather than seasonality.

Purchases are geographically dispersed. Most purchases fall in the Other location group rather than in either of the individually named states (Montana and California, see Tab 8), which means that sales are spread across many locations instead of being concentrated in one.

When comparing genders, purchasing behavior is relatively balanced across the top items, with neither gender dominating. Hence, the popular items appeal to both groups. The distribution of previous purchase counts is relatively flat, with no dominant trend indicating that certain purchase-history values occur more often than others. There are slight peaks and dips, but they do not follow a consistent trend. Overall, product category seems to have the greatest influence on total demand, and the evidence shows that buying patterns in this dataset are consistent across seasons and demographic groups.

Tab 1

This section contains the complete descriptive statistics and corresponding R output used for dataset summarization.

# Basic dataset info
colnames(df)
##  [1] "Customer ID"              "Age"                     
##  [3] "Gender"                   "Item Purchased"          
##  [5] "Category"                 "Purchase Amount (USD)"   
##  [7] "Location"                 "Size"                    
##  [9] "Color"                    "Season"                  
## [11] "Review Rating"            "Subscription Status"     
## [13] "Payment Method"           "Shipping Type"           
## [15] "Discount Applied"         "Promo Code Used"         
## [17] "previous_purchases"       "Preferred Payment Method"
## [19] "Frequency of Purchases"
head(df)
##    Customer ID   Age Gender Item Purchased Category Purchase Amount (USD)
##          <int> <int> <char>         <char>   <char>                 <int>
## 1:           1    55   Male         Blouse Clothing                    53
## 2:           2    19   Male        Sweater Clothing                    64
## 3:           3    50   Male          Jeans Clothing                    73
## 4:           4    21   Male        Sandals Footwear                    90
## 5:           5    45   Male         Blouse Clothing                    49
## 6:           6    46   Male       Sneakers Footwear                    20
##         Location   Size     Color Season Review Rating Subscription Status
##           <char> <char>    <char> <char>         <num>              <char>
## 1:      Kentucky      L      Gray Winter           3.1                 Yes
## 2:         Maine      L    Maroon Winter           3.1                 Yes
## 3: Massachusetts      S    Maroon Spring           3.1                 Yes
## 4:  Rhode Island      M    Maroon Spring           3.5                 Yes
## 5:        Oregon      M Turquoise Spring           2.7                 Yes
## 6:       Wyoming      M     White Summer           2.9                 Yes
##    Payment Method Shipping Type Discount Applied Promo Code Used
##            <char>        <char>           <char>          <char>
## 1:    Credit Card       Express              Yes             Yes
## 2:  Bank Transfer       Express              Yes             Yes
## 3:           Cash Free Shipping              Yes             Yes
## 4:         PayPal  Next Day Air              Yes             Yes
## 5:           Cash Free Shipping              Yes             Yes
## 6:          Venmo      Standard              Yes             Yes
##    previous_purchases Preferred Payment Method Frequency of Purchases
##                 <int>                   <char>                 <char>
## 1:                 14                    Venmo            Fortnightly
## 2:                  2                     Cash            Fortnightly
## 3:                 23              Credit Card                 Weekly
## 4:                 49                   PayPal                 Weekly
## 5:                 31                   PayPal               Annually
## 6:                 14                    Venmo                 Weekly
dim(df)
## [1] 3900   19
str(df)
## Classes 'data.table' and 'data.frame':   3900 obs. of  19 variables:
##  $ Customer ID             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                     : int  55 19 50 21 45 46 63 27 26 57 ...
##  $ Gender                  : chr  "Male" "Male" "Male" "Male" ...
##  $ Item Purchased          : chr  "Blouse" "Sweater" "Jeans" "Sandals" ...
##  $ Category                : chr  "Clothing" "Clothing" "Clothing" "Footwear" ...
##  $ Purchase Amount (USD)   : int  53 64 73 90 49 20 85 34 97 31 ...
##  $ Location                : chr  "Kentucky" "Maine" "Massachusetts" "Rhode Island" ...
##  $ Size                    : chr  "L" "L" "S" "M" ...
##  $ Color                   : chr  "Gray" "Maroon" "Maroon" "Maroon" ...
##  $ Season                  : chr  "Winter" "Winter" "Spring" "Spring" ...
##  $ Review Rating           : num  3.1 3.1 3.1 3.5 2.7 2.9 3.2 3.2 2.6 4.8 ...
##  $ Subscription Status     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Payment Method          : chr  "Credit Card" "Bank Transfer" "Cash" "PayPal" ...
##  $ Shipping Type           : chr  "Express" "Express" "Free Shipping" "Next Day Air" ...
##  $ Discount Applied        : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Promo Code Used         : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ previous_purchases      : int  14 2 23 49 31 14 49 19 8 4 ...
##  $ Preferred Payment Method: chr  "Venmo" "Cash" "Credit Card" "PayPal" ...
##  $ Frequency of Purchases  : chr  "Fortnightly" "Fortnightly" "Weekly" "Weekly" ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(df)
##   Customer ID          Age           Gender          Item Purchased    
##  Min.   :   1.0   Min.   :18.00   Length:3900        Length:3900       
##  1st Qu.: 975.8   1st Qu.:31.00   Class :character   Class :character  
##  Median :1950.5   Median :44.00   Mode  :character   Mode  :character  
##  Mean   :1950.5   Mean   :44.07                                        
##  3rd Qu.:2925.2   3rd Qu.:57.00                                        
##  Max.   :3900.0   Max.   :70.00                                        
##    Category         Purchase Amount (USD)   Location             Size          
##  Length:3900        Min.   : 20.00        Length:3900        Length:3900       
##  Class :character   1st Qu.: 39.00        Class :character   Class :character  
##  Mode  :character   Median : 60.00        Mode  :character   Mode  :character  
##                     Mean   : 59.76                                             
##                     3rd Qu.: 81.00                                             
##                     Max.   :100.00                                             
##     Color              Season          Review Rating  Subscription Status
##  Length:3900        Length:3900        Min.   :2.50   Length:3900        
##  Class :character   Class :character   1st Qu.:3.10   Class :character   
##  Mode  :character   Mode  :character   Median :3.70   Mode  :character   
##                                        Mean   :3.75                      
##                                        3rd Qu.:4.40                      
##                                        Max.   :5.00                      
##  Payment Method     Shipping Type      Discount Applied   Promo Code Used   
##  Length:3900        Length:3900        Length:3900        Length:3900       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  previous_purchases Preferred Payment Method Frequency of Purchases
##  Min.   : 1.00      Length:3900              Length:3900           
##  1st Qu.:13.00      Class :character         Class :character      
##  Median :25.00      Mode  :character         Mode  :character      
##  Mean   :25.35                                                     
##  3rd Qu.:38.00                                                     
##  Max.   :50.00
# Missing values
colSums(is.na(df))
##              Customer ID                      Age                   Gender 
##                        0                        0                        0 
##           Item Purchased                 Category    Purchase Amount (USD) 
##                        0                        0                        0 
##                 Location                     Size                    Color 
##                        0                        0                        0 
##                   Season            Review Rating      Subscription Status 
##                        0                        0                        0 
##           Payment Method            Shipping Type         Discount Applied 
##                        0                        0                        0 
##          Promo Code Used       previous_purchases Preferred Payment Method 
##                        0                        0                        0 
##   Frequency of Purchases 
##                        0
# Unique items
length(unique(df$`Item Purchased`))
## [1] 25

Tab 2

This graph shows the top 10 most purchased items, split by gender. The Other group, which appears to combine the remaining items, accounts for 2,242 purchases, far more than any single item. Among specific products, Pants, Jewelry, and Blouses lead with 171 purchases each, followed by Shirts, Dresses, and Sweaters. Almost all items show balanced purchasing between the two genders, and no single item is strongly skewed toward one gender, which suggests that the top items appeal broadly.

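# Assumes new_df (counts by item and gender) and agg_tot (totals per item) already exist.
# round_any() is from plyr; max_y is computed here but not used in the plot below.
max_y <- round_any(max(agg_tot$tot), 10, ceiling)
ggplot(new_df, aes(x = reorder(Item.Purchased, n, sum), y = n, fill = Gender)) +
  geom_col() +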
  coord_flip() +
  labs(title = "Purchases by Item (Top 10) by Gender", x = "", 
       y = "Purchases Count", fill = "Gender") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(data = agg_tot, aes(x = Item.Purchased, y = tot, 
                                label = scales::comma(tot), fill = NULL), 
            hjust = -0.1, size = 4) +
  scale_y_continuous(labels = comma, 
            expand = expansion(mult = c(0, 0.1)))
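
The plot above relies on two tables, new_df and agg_tot, whose construction is not shown. The following is a minimal sketch of how such tables could be built in base R, assuming that items outside the top group are lumped into "Other". The toy data, the top_k value, and the lumping step are illustrative assumptions, not the report's actual preparation code.

```r
# Toy stand-in for df (assumption); the real data has 3,900 rows and 25 items,
# and the real column is named `Item Purchased`.
df <- data.frame(
  Item.Purchased = c("Pants", "Pants", "Pants", "Blouse",
                     "Blouse", "Belt", "Hat", "Pants"),
  Gender = c("Male", "Female", "Male", "Female",
             "Male", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

top_k <- 2  # the report uses 10

# Rank items by total purchases and keep the names of the top_k items.
item_counts <- sort(table(df$Item.Purchased), decreasing = TRUE)
top_items <- names(item_counts)[seq_len(top_k)]

# Lump every other item into a single "Other" group (assumed step).
df$Item.Purchased <- ifelse(df$Item.Purchased %in% top_items,
                            df$Item.Purchased, "Other")

# One row per item/gender pair: the input for the stacked bars.
new_df <- as.data.frame(
  table(Item.Purchased = df$Item.Purchased, Gender = df$Gender),
  responseName = "n", stringsAsFactors = FALSE
)

# One row per item with its overall total: the input for the bar-end labels.
agg_tot <- aggregate(n ~ Item.Purchased, data = new_df, FUN = sum)
names(agg_tot)[2] <- "tot"
```

With these two tables in place, the ggplot call above can stack new_df by gender and label each bar with agg_tot$tot.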

Tab 3

This line graph shows the number of records for each value of previous purchases. The counts rise and fall from one value to the next, but they show no clear upward or downward tendency as the number of previous purchases increases. Most counts fall in the mid-70s to mid-80s, close to the 78 records per value (3,900 / 50) that a perfectly even spread would give, so customers are distributed fairly uniformly across the purchase-history spectrum. There are a few isolated extremes: the highest point has 97 records and the lowest has only 58. Overall, the number of customers is roughly uniform across previous-purchase counts.

x_axis_breaks <- seq(min(prev_df$previous_purchases), max(prev_df$previous_purchases), by = 5)

ggplot(prev_df, aes(x = previous_purchases, y = n)) +
  geom_line(color = "black", linewidth = 1) +
  geom_point(shape = 21, size = 4, color = "black", fill = "white") +
  geom_point(data = hi_lo, shape = 21, size = 4, fill = "red", color = "red") +
  geom_label_repel(data = hi_lo, aes(label = scales::comma(n)),
                   box.padding = 1, point.padding = 1, size = 4,
                   color = "grey50", segment.color = "black") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = x_axis_breaks, minor_breaks = NULL) +
  labs(title = "Records by Previous Purchases", x = "Number of Previous Purchases",
       y = "Count of Records", caption = "Source: shopping_trends.csv") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5))
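
The plot above uses prev_df and hi_lo without showing how they are built. A minimal base R sketch, assuming prev_df holds one count per distinct value and hi_lo holds the rows with the highest and lowest counts, might look like this. The toy vector and the expected_n benchmark are illustrative additions, not part of the original code.

```r
# Toy stand-in for df$previous_purchases (assumption); the real column spans 1 to 50.
previous_purchases <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)

# Count of records for each distinct number of previous purchases.
prev_df <- as.data.frame(
  table(previous_purchases = previous_purchases),
  responseName = "n", stringsAsFactors = FALSE
)
prev_df$previous_purchases <- as.integer(prev_df$previous_purchases)

# Rows with the highest and lowest counts (the points highlighted in red).
hi_lo <- prev_df[prev_df$n %in% range(prev_df$n), ]

# Benchmark: the count each value would have if records were spread evenly.
expected_n <- length(previous_purchases) / nrow(prev_df)
```

Applied to the full dataset, the same benchmark is 3,900 / 50 = 78, which gives a reference line for judging how far the spikes at 97 and 58 sit from an even spread.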

Tab 4

This series of bar charts shows the number of purchases in each product category for each season. In every panel, Clothing has the most purchases, followed by Accessories, while Footwear and Outerwear have the fewest. The only notable seasonal variation is in Clothing, where purchases are slightly higher in Spring and Fall. Accessories and Footwear show fairly consistent demand, which suggests that demand is influenced more by product type than by season. Clothing leads in purchases year-round.

ggplot(category_season_df, aes(x = Category, y = n, fill = Season)) +
  geom_col(width = 0.8) +
  facet_wrap(~Season, ncol = 2) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Set2", name = "Season") +
  labs(title = "Purchases by Category Across Seasons",
       x = "Product Category", y = "Purchase Count") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 30, hjust = 1),
        panel.spacing = unit(1, "lines"))
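
The faceted chart above reads from category_season_df, which is not defined in this tab. As a rough sketch, assuming it is simply a count of rows per category and season, it could be produced in base R as follows. The toy data and the explicit season ordering are assumptions made for illustration.

```r
# Toy stand-in for df (assumption); only the two columns used here are included.
df <- data.frame(
  Category = c("Clothing", "Clothing", "Accessories",
               "Footwear", "Clothing", "Outerwear"),
  Season = c("Spring", "Fall", "Spring", "Summer", "Spring", "Winter"),
  stringsAsFactors = FALSE
)

# One row per category/season pair with its purchase count n.
category_season_df <- as.data.frame(
  table(Category = df$Category, Season = df$Season),
  responseName = "n", stringsAsFactors = FALSE
)

# Store Season as an ordered set of levels so facets follow the calendar
# rather than the alphabet.
category_season_df$Season <- factor(
  category_season_df$Season,
  levels = c("Winter", "Spring", "Summer", "Fall")
)
```

Pairs that never occur keep a count of zero, so every facet shows all four categories.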

Tab 5

The line graph shows the number of purchases across product categories, with a separate line for each season. Clothing has the highest purchase count, followed by Accessories, while Footwear and Outerwear have the lowest. Purchase counts show very little seasonality apart from an increase in Clothing purchases in Spring and Winter; all other counts remain roughly constant. The nearly parallel lines show that the category pattern is stable, further supporting the argument that consumer preferences are consistent across seasons.

ggplot(cat_df, aes(x = Category, y = n, group = Season, color = Season)) +
  geom_line(linewidth = 1.2) +
  geom_point(shape = 21, size = 3, color = "black", fill = "white") +
  labs(title = "Purchase Count by Category (Lines by Season)",
       x = "Category", y = "Purchase Count", color = "Season") +
  scale_y_continuous(labels = comma) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 25, hjust = 1)) +
  scale_color_brewer(palette = "Paired")
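
The claim that the lines are nearly parallel can also be checked numerically by ranking the categories within each season and comparing the rankings. The sketch below does this on hypothetical counts, which are placeholders and not the report's figures. With the real data, the matrix could be replaced by table(df$Category, df$Season).

```r
# Hypothetical counts for illustration only: rows are categories, columns are seasons.
counts <- matrix(
  c(40, 30, 15, 8,    # Winter
    45, 28, 16, 7,    # Spring
    38, 31, 14, 9,    # Summer
    42, 29, 13, 8),   # Fall
  nrow = 4,
  dimnames = list(
    Category = c("Clothing", "Accessories", "Footwear", "Outerwear"),
    Season = c("Winter", "Spring", "Summer", "Fall")
  )
)

# Rank categories within each season (1 = most purchases).
ranks <- apply(counts, 2, function(col) rank(-col, ties.method = "min"))

# TRUE when every season orders the categories the same way as the first season.
same_order <- all(ranks == ranks[, 1])
```

If same_order is TRUE, the category ranking does not change from season to season, which is the pattern the chart describes.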

Tab 6

This dual-axis chart combines purchase counts by season with average spend per purchase for the top 10 items. The horizontal stacked bars show how demand for each item is distributed across Winter, Spring, Summer, and Fall. Shirts, Pants, and Jewelry sustain high purchase volume across the seasons, while Belts and Coats do not. The black line shows the average spend per purchase for each item. Although purchase volume varies widely between items, average spend does not, suggesting that customers spend similar amounts per purchase regardless of how popular the item is. The chart suggests that purchase frequency, not average spend per item, is the main driver of total sales.

# Counts per item and season for the top items (top_items is assumed to be defined earlier).
new_df <- df %>%
  filter(`Item Purchased` %in% top_items) %>%
  dplyr::count(`Item Purchased`, Season) %>%
  rename(Item.Purchased = `Item Purchased`) %>%
  as.data.frame()

avg_df <- df %>%
  filter(`Item Purchased` %in% top_items) %>%
  group_by(`Item Purchased`) %>%
  summarise(
    purchase_count = n(),
    total_spent = sum(`Purchase Amount (USD)`, na.rm = TRUE),
    avg_spend = total_spent / purchase_count,
    .groups = "drop"
  ) %>%
  as.data.frame()

item_order <- new_df %>%
  group_by(Item.Purchased) %>%
  summarise(tot = sum(n), .groups = "drop") %>%
  arrange(tot) %>%
  pull(Item.Purchased)

ylab <- seq(0, max(avg_df$avg_spend, na.rm = TRUE), length.out = 5)
my_labels <- dollar(ylab)

scale_factor <- max(new_df$n, na.rm = TRUE) / max(avg_df$avg_spend, na.rm = TRUE)

ggplot(new_df, aes(x = Item.Purchased, y = n, fill = Season)) +
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
  coord_flip() +
  theme_light() +
  labs(
    title = "Purchase Count and Average Spend per Purchase (Top 10 Items)",
    x = "",
    y = "Purchase Count",
    fill = "Season") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Spectral", guide = guide_legend(reverse = TRUE)) +
  geom_line(
    inherit.aes = FALSE,
    data = avg_df,
    aes(x = `Item Purchased`, y = avg_spend * scale_factor, colour = "Avg Spend", group = 1),
    linewidth = 1) +
  scale_color_manual(NULL, values = "black") +
  scale_y_continuous(
    labels = comma,
    sec.axis = sec_axis(~ . / scale_factor,
      name = "Avg Spend per Purchase",
      breaks = ylab,
      labels = my_labels)) +
  geom_point(
    inherit.aes = FALSE,
    data = avg_df,
    aes(x = `Item Purchased`, y = avg_spend * scale_factor, group = 1),
    size = 3,
    shape = 21,
    fill = "white",
    color = "black") +
  theme(
    legend.background = element_rect(fill = "transparent"),
    legend.box.background = element_rect(fill = "transparent", colour = NA),
    legend.spacing = unit(-1, "lines"))
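
The dual-axis effect in the code above comes from rescaling: the dollar values are multiplied by scale_factor so they can be drawn on the count axis, and sec_axis() divides by the same factor to label the secondary axis in dollars. A small sketch of that round trip, using hypothetical numbers that are not taken from the dataset:

```r
# Hypothetical values chosen only to illustrate the rescaling.
max_count <- 60                                          # largest value on the count axis
avg_spend <- c(Shirt = 61.1, Pants = 58.4, Belt = 59.8)  # dollars per purchase

# Factor that stretches the dollar values onto the count axis.
scale_factor <- max_count / max(avg_spend)

# Forward transform: the positions geom_line() and geom_point() actually draw.
plotted <- avg_spend * scale_factor

# Inverse transform: what sec_axis(~ . / scale_factor) prints as axis labels.
recovered <- plotted / scale_factor
```

One detail worth noting: in the code above, scale_factor is based on max(new_df$n), the largest single item/season count, rather than on the full height of a stacked bar, so the spend line tops out at the height of the largest single segment.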

Tab 7

This heatmap shows seasonal demand by product category. Clothing is the most popular category all year round, peaking in Spring at 454 purchases. Demand for Accessories is moderate and fairly stable across seasons, rising slightly to a peak in Fall at 324 purchases. Footwear demand is highest in Spring at 163 purchases and lowest in Fall at 136, showing slight seasonal variation. Outerwear has the lowest demand in every season, with its minimum in Summer at 75 purchases. The heatmap indicates that product category is the main determinant of demand, since seasonal fluctuations are minor.

# g is the ggplot heatmap object, assumed to be built earlier (not shown here).
ggplotly(g, tooltip = c("n", "Season", "Category")) %>%
  style(hoverlabel = list(bgcolor = "white"))

Tab 8

The trellis chart breaks down purchases by location group (Montana, California, and Other) for each season. In every season, the Other group accounts for about 95% of purchases, while Montana and California each contribute a small share. The proportions are nearly the same from Winter to Fall, showing that the geographic mix of purchases does not change over the year. Seasonal variation in purchases does not seem to affect where demand comes from.

plot_ly(textposition = "inside", labels = ~myLoc, values = ~n) %>%
  add_pie(data = loc_df[loc_df$Season == "Winter", ],
          name = "Winter", title = "Winter",
          domain = list(row = 0, column = 0)) %>%
  add_pie(data = loc_df[loc_df$Season == "Spring", ],
          name = "Spring", title = "Spring",
          domain = list(row = 0, column = 1)) %>%
  add_pie(data = loc_df[loc_df$Season == "Summer", ],
          name = "Summer", title = "Summer",
          domain = list(row = 1, column = 0)) %>%
  add_pie(data = loc_df[loc_df$Season == "Fall", ],
          name = "Fall", title = "Fall",
          domain = list(row = 1, column = 1)) %>%
  layout(title = "Trellis Chart: Purchases by Location Group and Season",
         showlegend = TRUE, grid = list(rows = 2, columns = 2))
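
The pie charts above read from loc_df and its myLoc column, neither of which is defined in this tab. A minimal base R sketch of one way to build them, assuming every state other than Montana and California is grouped as "Other", is given below. The toy data and the pct column are illustrative assumptions.

```r
# Toy stand-in for df (assumption); only Location and Season are needed here.
df <- data.frame(
  Location = c("Montana", "California", "Idaho", "Texas",
               "Ohio", "Montana", "Maine", "Utah"),
  Season = c("Winter", "Winter", "Winter", "Winter",
             "Fall", "Fall", "Fall", "Fall"),
  stringsAsFactors = FALSE
)

# Keep the two named states and group every other location as "Other".
named_states <- c("Montana", "California")
df$myLoc <- ifelse(df$Location %in% named_states, df$Location, "Other")

# One row per location group and season with its purchase count n.
loc_df <- as.data.frame(
  table(myLoc = df$myLoc, Season = df$Season),
  responseName = "n", stringsAsFactors = FALSE
)

# Share of each location group within its season, in percent.
loc_df$pct <- 100 * loc_df$n / ave(loc_df$n, loc_df$Season, FUN = sum)
```

The pct column makes it possible to compare the per-season shares directly, for example to confirm the roughly 95% share reported for the Other group.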

Conclusion

The dataset shows consistent patterns in purchasing behavior. In all visualizations, product type stands out as the biggest influence on demand, as Clothing shows the most purchases regardless of season, gender, and location. Although some seasonal fluctuations are present, there are no large changes in the order of the product categories, suggesting that consumer preferences are consistent year-round.

This trend continues in the analyses of location and gender. Purchases are spread across many locations, and buying behavior is well balanced between genders. Furthermore, record counts vary little across previous purchase counts, showing that purchase history follows no clear pattern.

In conclusion, product category exerts the most influence on purchasing behavior in this dataset, even when seasonal, geographic, and demographic differences are taken into account. This descriptive analysis summarizes the data in its simplest form and lays the groundwork for more complex future analysis.