2025-11-02

Data

- This project uses the shopping_behavior_updated dataset from Kaggle.

- The dataset has 18 columns; the key ones are the item purchased, the product category, and the purchase amount.

Rows/Columns: 3900 rows × 18 columns

Columns:
• **customer_id** (numeric)
• **age** (numeric)
• **gender** (character)
• **item_purchased** (character)
• **category** (character)
• **purchase_amount_usd** (numeric)
• **location** (character)
• **size** (character)
• **color** (character)
• **season** (character)
• **review_rating** (numeric)
• **subscription_status** (character)
• **shipping_type** (character)
• **discount_applied** (character)
• **promo_code_used** (character)
• **previous_purchases** (numeric)
• **payment_method** (character)
• **frequency_of_purchases** (character)

Quick Glance

First 10 rows
| customer_id | age | gender | item_purchased | category | purchase_amount_usd | location | size | color | season | review_rating | subscription_status | shipping_type | discount_applied | promo_code_used | previous_purchases | payment_method | frequency_of_purchases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Express | Yes | Yes | 14 | Venmo | Fortnightly |
| 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Express | Yes | Yes | 2 | Cash | Fortnightly |
| 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly |
| 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | Next Day Air | Yes | Yes | 49 | PayPal | Weekly |
| 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Free Shipping | Yes | Yes | 31 | PayPal | Annually |
| 6 | 46 | Male | Sneakers | Footwear | 20 | Wyoming | M | White | Summer | 2.9 | Yes | Standard | Yes | Yes | 14 | Venmo | Weekly |
| 7 | 63 | Male | Shirt | Clothing | 85 | Montana | M | Gray | Fall | 3.2 | Yes | Free Shipping | Yes | Yes | 49 | Cash | Quarterly |
| 8 | 27 | Male | Shorts | Clothing | 34 | Louisiana | L | Charcoal | Winter | 3.2 | Yes | Free Shipping | Yes | Yes | 19 | Credit Card | Weekly |
| 9 | 26 | Male | Coat | Outerwear | 97 | West Virginia | L | Silver | Summer | 2.6 | Yes | Express | Yes | Yes | 8 | Venmo | Annually |
| 10 | 57 | Male | Handbag | Accessories | 31 | Missouri | M | Pink | Spring | 4.8 | Yes | 2-Day Shipping | Yes | Yes | 4 | Cash | Quarterly |

Cleaning & Feature Engineering

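library(dplyr)

# `raw` is assumed to be the imported CSV with snake_case column names
# (for example via readr::read_csv() and janitor::clean_names()).
dat <- raw |>
  mutate(
    # Categorical columns as factors; Yes/No flags use "No" as the baseline
    gender = factor(gender),
    category = factor(category),
    season = factor(season),
    subscription_status = factor(subscription_status, levels = c("No", "Yes")),
    discount_applied = factor(discount_applied, levels = c("No", "Yes")),
    promo_code_used = factor(promo_code_used, levels = c("No", "Yes")),
    shipping_type = factor(shipping_type),
    payment_method = factor(payment_method),
    # Levels listed from smallest to largest so plots and tables sort sensibly
    size = factor(size, levels = c("XS", "S", "M", "L", "XL")),
    # Shorter aliases for numeric columns (review_rating keeps its name)
    purchase_amount = purchase_amount_usd,
    prev_purchases = previous_purchases
  ) |>
  # Keep only rows with a valid, positive purchase amount
  filter(!is.na(purchase_amount), purchase_amount > 0)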

summary(dat$purchase_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   39.00   60.00   59.76   81.00  100.00

Plots (ggplot)

Product Category Mix (Bar)

Why this plot? This bar chart visualizes the distribution of purchases across different product categories. Each bar represents the total number of transactions within a category, allowing for an easy comparison of which types of products are most popular among customers.
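
A bar chart of this kind could be built roughly as follows. This is a minimal sketch, not the code behind the slide: the small `dat` data frame is a synthetic stand-in so the snippet runs on its own, and the fill color and axis labels are illustrative choices.

```r
library(ggplot2)

# Synthetic stand-in for the cleaned `dat` table (NOT the real data).
set.seed(1)
dat <- data.frame(
  category = factor(sample(
    c("Accessories", "Clothing", "Footwear", "Outerwear"),
    200, replace = TRUE
  ))
)

# geom_bar() counts rows per category, so bar height = number of purchases.
p_category <- ggplot(dat, aes(x = category)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Product Category Mix",
    x = "Category",
    y = "Number of purchases"
  ) +
  theme_minimal(base_size = 14)

p_category
```

With the real data, dropping the synthetic `dat` and reusing the cleaned table should be enough.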

Purchase Amount by Gender (Box + Jitter)

Why this plot? This box-and-jitter plot compares purchase amounts across gender groups, showing the center, spread, and any outliers for each group. The box marks the median (central line) and the interquartile range (IQR), where the middle 50% of purchase values lie for each gender; the jittered points show the individual purchases. Across the full sample the median purchase is $60 and the IQR runs from $39 to $81, so spending is spread widely over the $20 to $100 range rather than clustered in a narrow band. Both distributions are roughly symmetric and similar in range, which suggests gender does not strongly influence purchase amount in this dataset.
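
One way to layer a boxplot and jittered points is sketched below. The data frame is synthetic (uniform amounts between 20 and 100, chosen only to resemble the range reported in the summary), and the jitter width and transparency are assumptions rather than the settings used for the slide.

```r
library(ggplot2)

# Synthetic stand-in for the cleaned `dat` table (NOT the real data).
set.seed(2)
dat <- data.frame(
  gender = factor(sample(c("Female", "Male"), 300, replace = TRUE)),
  purchase_amount = round(runif(300, min = 20, max = 100))
)

p_gender <- ggplot(dat, aes(x = gender, y = purchase_amount)) +
  # Hide the boxplot's own outlier markers so no point is drawn twice.
  geom_boxplot(outlier.shape = NA, width = 0.5) +
  # Jitter horizontally only; height = 0 keeps the amounts exact.
  geom_jitter(width = 0.2, height = 0, alpha = 0.3, size = 1) +
  scale_y_continuous(labels = scales::dollar) +
  labs(
    title = "Purchase Amount by Gender",
    x = "Gender",
    y = "Purchase amount (USD)"
  ) +
  theme_minimal(base_size = 14)

p_gender
```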

Plots (plotly)

3D Scatter — Three Variables

Why this plot? The 3D scatter plot effectively captures how age, previous purchase history, and spending amount interact across customers, while also revealing whether gender differences play a role. It helps uncover whether older or more loyal customers tend to spend more, and whether such patterns differ between genders.
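
A 3D scatter along these lines can be sketched with plotly's `scatter3d` trace type. The data below are synthetic placeholders, and the marker size and opacity are illustrative assumptions.

```r
library(plotly)

# Synthetic stand-in for the cleaned `dat` table (NOT the real data).
set.seed(3)
dat <- data.frame(
  age = sample(18:70, 150, replace = TRUE),
  prev_purchases = sample(1:50, 150, replace = TRUE),
  purchase_amount = round(runif(150, min = 20, max = 100)),
  gender = factor(sample(c("Female", "Male"), 150, replace = TRUE))
)

# Three numeric axes, with gender mapped to marker color.
fig_3d <- plot_ly(
  dat,
  x = ~age, y = ~prev_purchases, z = ~purchase_amount,
  color = ~gender,
  type = "scatter3d", mode = "markers",
  marker = list(size = 3, opacity = 0.7)
) |>
  layout(scene = list(
    xaxis = list(title = "Age"),
    yaxis = list(title = "Previous purchases"),
    zaxis = list(title = "Purchase amount (USD)")
  ))

fig_3d
```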

Share of Purchases by Season (Donut)

Why this plot? The “Seasonal Purchase Share” plot indicates that shopping activity is consistent throughout the year, with only minor variations across seasons.
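
In plotly a donut is a pie trace with a `hole`. The sketch below counts purchases per season and plots the shares; the season values are synthetic and the hole size is an arbitrary choice.

```r
library(plotly)

# Synthetic stand-in for the cleaned `dat` table (NOT the real data).
set.seed(4)
dat <- data.frame(
  season = factor(sample(c("Fall", "Spring", "Summer", "Winter"),
                         400, replace = TRUE))
)

# One row per season with its purchase count in column `n`.
season_share <- as.data.frame(table(season = dat$season), responseName = "n")

# A pie trace with hole > 0 renders as a donut; plotly computes the percentages.
fig_donut <- plot_ly(
  season_share,
  labels = ~season, values = ~n,
  type = "pie", hole = 0.5,
  textinfo = "label+percent"
) |>
  layout(title = "Share of Purchases by Season", showlegend = FALSE)

fig_donut
```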

Stats

Descriptive Statistics

Key Descriptive Statistics
| n | mean_amount | median_amount | sd_amount | min_amount | max_amount | mean_age | mean_prev | mean_rating |
|---|---|---|---|---|---|---|---|---|
| 3900 | 59.76 | 60 | 23.69 | 20 | 100 | 44.07 | 25.35 | 3.75 |
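
A one-row summary with these column names could be computed with `dplyr::summarise()`, as sketched below. The input is a synthetic stand-in, so the resulting numbers will not match the table; only the column layout is meant to correspond.

```r
library(dplyr)

# Synthetic stand-in for the cleaned `dat` table (NOT the real data).
set.seed(5)
dat <- data.frame(
  purchase_amount = round(runif(100, min = 20, max = 100)),
  age = sample(18:70, 100, replace = TRUE),
  prev_purchases = sample(1:50, 100, replace = TRUE),
  review_rating = round(runif(100, min = 2.5, max = 5), 1)
)

# One summary row, rounded to two decimals for display.
desc <- dat |>
  summarise(
    n = n(),
    mean_amount = mean(purchase_amount),
    median_amount = median(purchase_amount),
    sd_amount = sd(purchase_amount),
    min_amount = min(purchase_amount),
    max_amount = max(purchase_amount),
    mean_age = mean(age),
    mean_prev = mean(prev_purchases),
    mean_rating = mean(review_rating)
  ) |>
  mutate(across(everything(), \(x) round(x, 2)))

desc
```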

Linear Regression: What drives purchase amount?

Linear model: purchase_amount ~ age + gender + prev_purchases + discount_applied + subscription_status
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 60.627 | 1.421 | 42.668 | 0.000 | 57.841 | 63.413 |
| age | -0.017 | 0.025 | -0.670 | 0.503 | -0.066 | 0.032 |
| genderMale | -0.286 | 1.013 | -0.282 | 0.778 | -2.272 | 1.700 |
| prev_purchases | 0.015 | 0.026 | 0.554 | 0.580 | -0.037 | 0.066 |
| discount_appliedYes | -1.052 | 1.215 | -0.866 | 0.387 | -3.433 | 1.330 |
| subscription_statusYes | 0.563 | 1.197 | 0.470 | 0.638 | -1.785 | 2.910 |

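Commentary:
- The prev_purchases coefficient is positive but tiny (0.015 USD per prior purchase) and not significant (p = 0.580), so there is no evidence that more loyal customers spend more per purchase, holding other factors constant.
- discount_appliedYes is negative (-1.052) and not significant (p = 0.387), so the model gives no evidence that discounts increase basket size.
- genderMale captures the average difference versus the baseline gender level; at -0.286 (p = 0.778) it is indistinguishable from zero.
- Every predictor's 95% confidence interval includes zero, so none of these variables is a statistically significant driver of purchase amount. Only the intercept (about 60.6 USD) differs clearly from zero.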
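
A coefficient table with these columns can be produced by fitting the model with `lm()` and tidying it with `broom::tidy(conf.int = TRUE)`. The sketch below uses synthetic data in which purchase amount is unrelated to the predictors by construction, so its estimates are placeholders and will not reproduce the numbers above.

```r
library(broom)

# Synthetic stand-in for the cleaned `dat` table (NOT the real data).
set.seed(6)
n_obs <- 500
yes_no <- function(n) factor(sample(c("No", "Yes"), n, replace = TRUE),
                             levels = c("No", "Yes"))
dat <- data.frame(
  purchase_amount = round(runif(n_obs, min = 20, max = 100)),
  age = sample(18:70, n_obs, replace = TRUE),
  gender = factor(sample(c("Female", "Male"), n_obs, replace = TRUE)),
  prev_purchases = sample(1:50, n_obs, replace = TRUE),
  discount_applied = yes_no(n_obs),
  subscription_status = yes_no(n_obs)
)

# Same formula as the model reported in the table.
fit <- lm(
  purchase_amount ~ age + gender + prev_purchases +
    discount_applied + subscription_status,
  data = dat
)

# One row per term, with 95% confidence bounds in conf.low / conf.high.
coef_table <- tidy(fit, conf.int = TRUE, conf.level = 0.95)
coef_table
```

A term is conventionally called significant at the 5% level when its `p.value` is below 0.05, which is equivalent to its 95% interval excluding zero.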

Code (optional)

Reusable Helpers (Shown)

library(ggplot2)

# Format numbers as whole-dollar labels, e.g. 59.76 -> "$60"
fmt_dollar <- function(x) scales::dollar(x, accuracy = 1)

# Shared plot theme: minimal base with larger text and no minor grid lines
theme_course <- function() {
  theme_minimal(base_size = 14) +
    theme(panel.grid.minor = element_blank())
}

Appendix

Bonus: ggplot → plotly
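
The conversion named in this heading is usually done with `plotly::ggplotly()`, which takes an existing ggplot object and returns an interactive plotly widget. The sketch below is an illustration on synthetic data; which chart the original slide converted is not stated, so the bar chart here is an assumption.

```r
library(ggplot2)
library(plotly)

# Synthetic stand-in for the cleaned `dat` table (NOT the real data).
set.seed(7)
dat <- data.frame(
  category = factor(sample(
    c("Accessories", "Clothing", "Footwear", "Outerwear"),
    200, replace = TRUE
  ))
)

# Build a static ggplot first ...
p_static <- ggplot(dat, aes(x = category, fill = category)) +
  geom_bar(show.legend = FALSE) +
  labs(x = "Category", y = "Number of purchases") +
  theme_minimal(base_size = 14)

# ... then hand it to ggplotly() to get hover tooltips, zoom, and pan.
fig_interactive <- ggplotly(p_static)

fig_interactive
```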

References

  • Course dataset shopping_behavior_updated.csv (local CSV).
  • Libraries: tidyverse, plotly, janitor, broom, scales, kableExtra.