Introduction

For this project, I will be using the Shopping Behaviors dataset available on Kaggle. The dataset includes 3,900 customer records and 18 variables.

  • Predictors:
    • Gender
    • Location
    • Subscription Status
  • Outcome:
    • Discount Usage
  • Goal:
    • Uncover how each predictor relates to Discount Usage

Loading Libraries

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

library(ggplot2) 
library(scales)
library(broom)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Loading the Dataset

  • Load in Shopping Behaviors dataset
  • Convert categorical variables into factors

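# Read the Kaggle CSV (expected in the working directory)
shopping = read.csv("shopping_behavior_updated.csv")

# Outcome as a factor. With levels c("Yes", "No"), a binomial glm()
# treats "Yes" as the reference level and models the probability of "No".
shopping$Discount.Applied = factor(shopping$Discount.Applied, 
                                   levels = c("Yes", "No"))
# Predictors as factors
shopping$Gender = as.factor(shopping$Gender)
shopping$Location = as.factor(shopping$Location)
shopping$Subscription.Status = as.factor(shopping$Subscription.Status)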

Checking the Dataset

  • Confirm the data loaded properly using head() and summary()
  • Check for any missing data using colSums(is.na())

head(shopping)
summary(shopping)
colSums(is.na(shopping))

Data Exploration

Overall Discount Usage

By Gender:

By Subscription Status:

By Top 25% Locations:
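
The gender and subscription breakdowns above can be tabulated with base R before plotting. This is a minimal sketch, not the code behind the original charts: the `toy` data frame and its values are hypothetical stand-ins for `shopping`, and the location breakdown is left out because the source does not say how the top 25% of locations were chosen.

```r
# Hypothetical toy data standing in for the `shopping` data frame;
# the values are illustrative and not taken from the Kaggle file.
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female", "Male"),
  Subscription.Status = c("Yes", "No", "No", "No", "No", "Yes"),
  Discount.Applied = c("Yes", "Yes", "No", "No", "No", "Yes")
)

# Overall share of each Discount.Applied value
overall <- prop.table(table(toy$Discount.Applied))

# Row-wise shares: within each gender, the proportion of "No" and "Yes"
by_gender <- prop.table(
  table(toy$Gender, toy$Discount.Applied), margin = 1
)

# Row-wise shares within each subscription status
by_subscription <- prop.table(
  table(toy$Subscription.Status, toy$Discount.Applied), margin = 1
)

print(round(by_gender, 2))
print(round(by_subscription, 2))
```

Using `margin = 1` makes each row sum to 1, so the values read as "share of this group that used a discount", which is the quantity reported in the findings.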

Predictive Regression Model

# Logistic regression of discount usage on the three predictors
discount_mod <- glm(
  Discount.Applied ~ Gender + Location + Subscription.Status,
  family = binomial,
  data = shopping
)

summary(discount_mod)
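
Because Discount.Applied was given the levels c("Yes", "No"), it helps to check which outcome the coefficients describe. For a factor response, a binomial glm() treats the first level as failure and the remaining levels as success. The sketch below illustrates this on a hypothetical four-value outcome `y` (not the real data) using an intercept-only model.

```r
# Hypothetical toy outcome: three "Yes" and one "No",
# with the same level order used for Discount.Applied
y <- factor(c("Yes", "Yes", "Yes", "No"), levels = c("Yes", "No"))

# Intercept-only logistic regression
fit <- glm(y ~ 1, family = binomial)

# plogis() inverts the logit. The first level ("Yes") counts as failure,
# so the fitted probability is P(y == "No") = 1/4.
p_no <- plogis(coef(fit)[["(Intercept)"]])

# Making "No" the first level flips the model to P(y == "Yes") = 3/4
y_flipped <- relevel(y, ref = "No")
fit_flipped <- glm(y_flipped ~ 1, family = binomial)
p_yes <- plogis(coef(fit_flipped)[["(Intercept)"]])

print(c(p_no = p_no, p_yes = p_yes))
```

Under this level order, a positive estimate in the model output points toward "No" (no discount applied), which is worth keeping in mind when reading the signs.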

Model Summary

## # A tibble: 52 × 5
##    term                estimate std.error statistic p.value
##    <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)          20.8      497.       0.0418   0.967
##  2 GenderMale          -20.1      497.      -0.0405   0.968
##  3 LocationAlaska       -0.0725     0.503   -0.144    0.885
##  4 LocationArizona       0.480      0.576    0.832    0.405
##  5 LocationArkansas     -0.619      0.477   -1.30     0.194
##  6 LocationCalifornia    0.241      0.489    0.493    0.622
##  7 LocationColorado     -0.0313     0.515   -0.0607   0.952
##  8 LocationConnecticut   0.297      0.500    0.595    0.552
##  9 LocationDelaware      0.0741     0.496    0.149    0.881
## 10 LocationFlorida      -0.0884     0.518   -0.171    0.865
## # ℹ 42 more rows

Conclusion and Findings

  • Gender:
    • Large imbalance
    • Female = 100% NO discount usage
    • Male = ~37% NO discount usage
  • Subscription Status:
    • Large imbalance
    • Subscribed = 100% discount usage
    • NOT subscribed = 22% discount usage
  • Location:
    • Indiana and Wisconsin
    • Minor variation but nothing significant

Conclusion and Findings (cont.)

  • The Predictive Regression Model:
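    • Attempted but found to be unreliable
    • Separation (quasi-complete) in gender and subscription status: every female customer is a "No" and every subscriber is a "Yes"
    • Separation inflates the estimates and standard errors (e.g. GenderMale: estimate -20.1, std. error 497)
    • Not a good indicator of discount usage
    • Results are descriptive patterns, NOT conclusive evidence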
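
The separation problem described above can be spotted before fitting by cross-tabulating each predictor against the outcome and looking for empty cells. The following is a minimal sketch on hypothetical data (the `toy` data frame is an illustrative stand-in, not the Kaggle file) that also shows the inflated standard error separation produces.

```r
# Hypothetical toy data: every "Female" row is "No", mirroring the
# pattern reported in the findings
toy <- data.frame(
  Gender = factor(c("Female", "Female", "Female",
                    "Male", "Male", "Male", "Male")),
  Discount.Applied = factor(c("No", "No", "No",
                              "Yes", "No", "Yes", "Yes"),
                            levels = c("Yes", "No"))
)

# Cross-tabulate predictor against outcome
tab <- table(toy$Gender, toy$Discount.Applied)
print(tab)

# An empty cell means one group has only a single outcome value
has_empty_cell <- any(tab == 0)

# Fitting anyway: glm() warns about fitted probabilities of 0 or 1,
# which is suppressed here to keep the output short
fit <- suppressWarnings(
  glm(Discount.Applied ~ Gender, family = binomial, data = toy)
)

# Standard errors from the coefficient table
se <- coef(summary(fit))[, "Std. Error"]
print(se)
```

When `has_empty_cell` is TRUE for a predictor, the maximum likelihood estimate for that term is not finite, so large estimates paired with very large standard errors are expected rather than informative.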

In the future, other predictor variables could be tested for their relationship with discount usage. This could help a company target promotional advertisements at the right demographic based on those variables.