Exploring Discount Usage by Customer Demographics and Shopping Behavior

Introduction

For this project, I will be using the Shopping Behaviors dataset available on Kaggle. The dataset contains 3900 customer records and 18 variables.

Predictors:
- Gender
- Location
- Subscription Status
Outcome:
- Discount Usage
Goal:
- Uncover how each predictor relates to Discount Usage

Loading Libraries

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

library(ggplot2) 
library(scales)
library(broom)
library(dplyr)

Loading the Dataset

Load the Shopping Behaviors dataset
Convert the categorical variables into factors

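# Read the Kaggle CSV (expected in the working directory)
shopping <- read.csv("shopping_behavior_updated.csv")

# Outcome as a factor. glm() treats the first level ("Yes") as failure,
# so a binomial model on this factor estimates the log-odds of "No".
shopping$Discount.Applied <- factor(shopping$Discount.Applied,
                                    levels = c("Yes", "No"))
# Predictors as factors
shopping$Gender <- as.factor(shopping$Gender)
shopping$Location <- as.factor(shopping$Location)
shopping$Subscription.Status <- as.factor(shopping$Subscription.Status)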
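The level order chosen for Discount.Applied affects how the later logistic regression should be read. The following is a minimal sketch on made-up values (the vector `y` is hypothetical, not taken from the dataset) showing that, with levels c("Yes", "No"), a binomial glm() returns the probability of "No":

```r
# Toy response: three "Yes" and one "No" (illustrative values only)
y <- factor(c("Yes", "Yes", "Yes", "No"), levels = c("Yes", "No"))

# Intercept-only logistic regression
fit <- glm(y ~ 1, family = binomial)

# Back-transform the intercept from log-odds to a probability.
# The result is the share of the SECOND level ("No"), not of "Yes".
p_hat <- unname(plogis(coef(fit)))
print(p_hat)
```

If the probability of "Yes" is the quantity of interest, one option is to declare the factor with levels = c("No", "Yes") instead; this flips the sign of every coefficient but does not change the fit.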

Checking the Dataset

Confirm the data loaded properly using head() and summary()
Check for missing data using colSums(is.na())

head(shopping)
summary(shopping)
colSums(is.na(shopping))

Data Exploration

Overall Discount Usage

By Gender:

By Subscription Status:

By Top 25% Locations:
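The group breakdowns listed above can be computed as simple proportions. This is only a sketch of one way to do it with dplyr, run on a small made-up data frame (`toy` and its values are hypothetical stand-ins for the shopping data, and `usage_by_gender` is a name introduced here for illustration):

```r
library(dplyr)

# Toy stand-in for the shopping data (illustrative values only)
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female"),
  Discount.Applied = c("Yes", "Yes", "No", "No", "No")
)

# Share of customers in each group who used a discount
usage_by_gender <- toy %>%
  group_by(Gender) %>%
  summarise(
    n = n(),                                        # group size
    pct_discount = mean(Discount.Applied == "Yes")  # proportion using a discount
  )

print(usage_by_gender)
```

Swapping Gender for Subscription.Status or Location in group_by() would give the other breakdowns in the same shape.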

Predictive Regression Model

discount_mod <- glm(
  Discount.Applied ~ Gender + Location + Subscription.Status,
  family = binomial,
  data = shopping
)

summary(discount_mod)

Model Summary

## # A tibble: 52 × 5
##    term                estimate std.error statistic p.value
##    <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)          20.8      497.       0.0418   0.967
##  2 GenderMale          -20.1      497.      -0.0405   0.968
##  3 LocationAlaska       -0.0725     0.503   -0.144    0.885
##  4 LocationArizona       0.480      0.576    0.832    0.405
##  5 LocationArkansas     -0.619      0.477   -1.30     0.194
##  6 LocationCalifornia    0.241      0.489    0.493    0.622
##  7 LocationColorado     -0.0313     0.515   -0.0607   0.952
##  8 LocationConnecticut   0.297      0.500    0.595    0.552
##  9 LocationDelaware      0.0741     0.496    0.149    0.881
## 10 LocationFlorida      -0.0884     0.518   -0.171    0.865
## # ℹ 42 more rows

Conclusion and Findings

Gender:
- Large imbalance
- Female = 100% NO discount usage
- Male = ~37% NO discount usage
Subscription Status:
- Large imbalance
- Subscribed = 100% discount usage
- NOT subscribed = 22% discount usage
Location:
- Indiana and Wisconsin
- Minor variation but nothing significant

Conclusion and Findings (cont.)

The Predictive Regression Model:
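- Attempted but found to be unreliable
- Quasi-complete separation in gender and subscription status: every female customer has no discount and every subscriber has one, which inflates the estimates and standard errors (e.g. GenderMale = -20.1 with a standard error of 497)
- Not a good indicator of discount usage
- Results are descriptive patterns, NOT conclusive evidence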
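Separation of this kind can usually be spotted before fitting a model by cross-tabulating each predictor against the outcome and looking for empty cells. The sketch below uses a made-up data frame that mimics the pattern described above (`toy`, `tab`, and `has_zero_cell` are hypothetical names, and the counts are not from the dataset):

```r
# Toy data mimicking the reported pattern (illustrative values only)
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Discount.Applied = c("No", "No", "No", "Yes", "Yes", "No")
)

# Cross-tabulate the predictor against the outcome
tab <- table(toy$Gender, toy$Discount.Applied)
print(tab)

# A zero cell means at least one group shows only a single outcome,
# which is a warning sign for (quasi-)complete separation in glm()
has_zero_cell <- any(tab == 0)
print(has_zero_cell)
```

When a zero cell shows up, the maximum likelihood estimate for that group is not finite, which is why glm() reports very large coefficients with even larger standard errors.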

In the future, other predictor variables can be taken into consideration to test their relational significance to discount usage. This could help a company tailor promotional advertisements to the correct target demographic based on those variables.