About the Dataset

The Online Shopping Dataset compiles a wealth of information regarding customer transactions in an e-commerce platform. It encompasses various aspects such as customer demographics (CustomerID, Gender, Location), transaction details (Transaction_ID, Transaction_Date), product specifics (Product_SKU, Product_Category), and transactional elements (Quantity, Avg_Price, Discount_pct). Additionally, it includes spending breakdowns (Offline_Spend, Online_Spend), coupon-related data (Coupon_Status, Coupon_Code). This dataset offers a comprehensive view of customer behavior, purchasing patterns, and the effectiveness of promotional strategies in the realm of online shopping.

Link to Dataset

Objectives
  1. Customer Behavior and Product Insights:

    • Which product categories are most frequently purchased? Can we visualize this trend over time?

    • Is there any correlation between tenure (length of association) and the average price or quantity of products purchased?

  2. Coupon Analysis:

    • What is the distribution of coupon usage among different product categories or locations?

    • Can we analyze the impact of different coupon codes on the average transaction amount or quantity purchased?

  3. Geographical Patterns and Spending Habits:

    • How does customer spending behavior vary across different locations?

    • Can we visualize the distribution of offline versus online spending across different regions or customer segments?

  4. Temporal Trends and Seasonality:

    • Are there seasonal trends in purchases? How do they vary across product categories?

    • Can we identify any patterns in offline or online spending behavior over the months?

  5. Impact of Discounts and GST:

    • Is there a noticeable relationship between the discount percentage applied and the quantity or total spend in transactions?

    • How does the GST affect the total transaction amount, and is there a variation across product categories or locations?

  6. Performance Metrics:

    • What are the key performance indicators (KPIs) that can be derived from this dataset, such as average revenue per customer, average transaction value, or customer acquisition trends over time?

Analysis

# installing necessary packages
install.packages(c("tidyverse", "plotly", "ggthemes", "tinytex"),
                 repos = "https://cran.rstudio.com/")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## package 'plotly' successfully unpacked and MD5 sums checked
## package 'ggthemes' successfully unpacked and MD5 sums checked
## package 'tinytex' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\HP\AppData\Local\Temp\RtmpyKkqgt\downloaded_packages
library(tinytex)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggthemes)
library(lubridate)
library(knitr)

# importing the dataset
on_shop <- read.csv("F:\\R practice\\R Projects\\Project 6\\Online Shopping Dataset\\Online_shopping.csv")

# Checking head and str()
str(on_shop)
## 'data.frame':    52955 obs. of  20 variables:
##  $ CustomerID         : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 17850 ...
##  $ Gender             : chr  "M" "M" "M" "M" ...
##  $ Location           : chr  "Chicago" "Chicago" "Chicago" "Chicago" ...
##  $ Tenure_Months      : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ Transaction_ID     : int  16679 16680 16696 16699 16700 16701 16702 16703 16704 16710 ...
##  $ Transaction_Date   : chr  "1/1/2019" "1/1/2019" "1/1/2019" "1/1/2019" ...
##  $ Product_SKU        : chr  "GGOENEBJ079499" "GGOENEBJ079499" "GGOENEBQ078999" "GGOENEBQ079099" ...
##  $ Product_Description: chr  "Nest Learning Thermostat 3rd Gen-USA - Stainless Steel" "Nest Learning Thermostat 3rd Gen-USA - Stainless Steel" "Nest Cam Outdoor Security Camera - USA" "Nest Protect Smoke + CO White Battery Alarm-USA" ...
##  $ Product_Category   : chr  "Nest-USA" "Nest-USA" "Nest-USA" "Nest-USA" ...
##  $ Quantity           : int  1 1 2 1 1 1 2 2 1 1 ...
##  $ Avg_Price          : num  153.7 153.7 122.8 81.5 153.7 ...
##  $ Delivery_Charges   : num  6.5 6.5 6.5 6.5 6.5 ...
##  $ Coupon_Status      : chr  "Used" "Used" "Not Used" "Clicked" ...
##  $ GST                : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ Date               : chr  "1/1/2019" "1/1/2019" "1/1/2019" "1/1/2019" ...
##  $ Offline_Spend      : int  4500 4500 4500 4500 4500 4500 4500 4500 4500 4500 ...
##  $ Online_Spend       : num  2424 2424 2424 2424 2424 ...
##  $ Month              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Coupon_Code        : chr  "ELEC10" "ELEC10" "ELEC10" "ELEC10" ...
##  $ Discount_pct       : int  10 10 10 10 10 10 10 10 10 10 ...
head(on_shop)
##   CustomerID Gender Location Tenure_Months Transaction_ID Transaction_Date
## 1      17850      M  Chicago            12          16679         1/1/2019
## 2      17850      M  Chicago            12          16680         1/1/2019
## 3      17850      M  Chicago            12          16696         1/1/2019
## 4      17850      M  Chicago            12          16699         1/1/2019
## 5      17850      M  Chicago            12          16700         1/1/2019
## 6      17850      M  Chicago            12          16701         1/1/2019
##      Product_SKU                                    Product_Description
## 1 GGOENEBJ079499 Nest Learning Thermostat 3rd Gen-USA - Stainless Steel
## 2 GGOENEBJ079499 Nest Learning Thermostat 3rd Gen-USA - Stainless Steel
## 3 GGOENEBQ078999                 Nest Cam Outdoor Security Camera - USA
## 4 GGOENEBQ079099        Nest Protect Smoke + CO White Battery Alarm-USA
## 5 GGOENEBJ079499 Nest Learning Thermostat 3rd Gen-USA - Stainless Steel
## 6 GGOENEBJ079499 Nest Learning Thermostat 3rd Gen-USA - Stainless Steel
##   Product_Category Quantity Avg_Price Delivery_Charges Coupon_Status GST
## 1         Nest-USA        1    153.71              6.5          Used 0.1
## 2         Nest-USA        1    153.71              6.5          Used 0.1
## 3         Nest-USA        2    122.77              6.5      Not Used 0.1
## 4         Nest-USA        1     81.50              6.5       Clicked 0.1
## 5         Nest-USA        1    153.71              6.5       Clicked 0.1
## 6         Nest-USA        1    153.71              6.5       Clicked 0.1
##       Date Offline_Spend Online_Spend Month Coupon_Code Discount_pct
## 1 1/1/2019          4500       2424.5     1      ELEC10           10
## 2 1/1/2019          4500       2424.5     1      ELEC10           10
## 3 1/1/2019          4500       2424.5     1      ELEC10           10
## 4 1/1/2019          4500       2424.5     1      ELEC10           10
## 5 1/1/2019          4500       2424.5     1      ELEC10           10
## 6 1/1/2019          4500       2424.5     1      ELEC10           10
# Checking for missing values
any(is.na(on_shop)) #there are NA's
## [1] TRUE
# Let's look for column wise missings
nulls <- colSums(is.na(on_shop))
nulls[nulls > 0]
##       CustomerID    Tenure_Months   Transaction_ID         Quantity 
##               31               31               31               31 
##        Avg_Price Delivery_Charges              GST    Offline_Spend 
##               31               31               31               31 
##     Online_Spend     Discount_pct 
##               31              400

Dealing missing values :

Substituting 400 missing ‘Discount_pct’ values with zeros prevented significant data loss, avoiding the need to delete those rows.

on_shop$Discount_pct[is.na(on_shop$Discount_pct)] <- 0

any(is.na(on_shop$Discount_pct))
## [1] FALSE
# Dropping rows with null values in key fields preserved essential data integrity.
on_shop <- na.omit(on_shop)
any(is.na(on_shop))
## [1] FALSE
# Effectively managed missing values within the dataset.

# Checking summary 
summary(on_shop)
##    CustomerID       Gender            Location         Tenure_Months  
##  Min.   :12346   Length:52924       Length:52924       Min.   : 2.00  
##  1st Qu.:13869   Class :character   Class :character   1st Qu.:15.00  
##  Median :15311   Mode  :character   Mode  :character   Median :27.00  
##  Mean   :15347                                         Mean   :26.13  
##  3rd Qu.:16996                                         3rd Qu.:37.00  
##  Max.   :18283                                         Max.   :50.00  
##  Transaction_ID  Transaction_Date   Product_SKU        Product_Description
##  Min.   :16679   Length:52924       Length:52924       Length:52924       
##  1st Qu.:25384   Class :character   Class :character   Class :character   
##  Median :32626   Mode  :character   Mode  :character   Mode  :character   
##  Mean   :32410                                                            
##  3rd Qu.:39126                                                            
##  Max.   :48497                                                            
##  Product_Category      Quantity         Avg_Price      Delivery_Charges
##  Length:52924       Min.   :  1.000   Min.   :  0.39   Min.   :  0.00  
##  Class :character   1st Qu.:  1.000   1st Qu.:  5.70   1st Qu.:  6.00  
##  Mode  :character   Median :  1.000   Median : 16.99   Median :  6.00  
##                     Mean   :  4.498   Mean   : 52.24   Mean   : 10.52  
##                     3rd Qu.:  2.000   3rd Qu.:102.13   3rd Qu.:  6.50  
##                     Max.   :900.000   Max.   :355.74   Max.   :521.36  
##  Coupon_Status           GST             Date           Offline_Spend 
##  Length:52924       Min.   :0.0500   Length:52924       Min.   : 500  
##  Class :character   1st Qu.:0.1000   Class :character   1st Qu.:2500  
##  Mode  :character   Median :0.1800   Mode  :character   Median :3000  
##                     Mean   :0.1375                      Mean   :2831  
##                     3rd Qu.:0.1800                      3rd Qu.:3500  
##                     Max.   :0.1800                      Max.   :5000  
##   Online_Spend        Month        Coupon_Code         Discount_pct 
##  Min.   : 320.2   Min.   : 1.000   Length:52924       Min.   : 0.0  
##  1st Qu.:1252.6   1st Qu.: 4.000   Class :character   1st Qu.:10.0  
##  Median :1837.9   Median : 7.000   Mode  :character   Median :20.0  
##  Mean   :1893.1   Mean   : 6.652                      Mean   :19.8  
##  3rd Qu.:2425.3   3rd Qu.: 9.000                      3rd Qu.:30.0  
##  Max.   :4556.9   Max.   :12.000                      Max.   :30.0
# Converting date from 'character' to 'date' format    
on_shop$Transaction_Date <- mdy(on_shop$Transaction_Date)


# extracting transaction day, month, year, week day.
on_shop <- on_shop %>% 
  mutate(trans_day = day(Transaction_Date),
         trans_month = month(Transaction_Date),
         trans_year = year(Transaction_Date),
         trans_week = wday(Transaction_Date))

#'Transaction_date' and 'Date' columns share identical values; thus, 
# 'Date' column deletion is necessary.
on_shop <- on_shop %>% 
  select(-Date)
  1. Customer Behavior and Product Insights:

    • Which product categories are most frequently purchased? Can we visualize this trend over time?

    • Is there any correlation between tenure (length of association) and the average price or quantity of products purchased?

theme_set(theme_solarized())

prod_cat <- on_shop %>% 
  group_by(Product_Category) %>% 
  summarise(Product_Count = n()) %>% 
  arrange(-Product_Count)

 ggplot(prod_cat, aes(x = Product_Category, y = Product_Count)) +
  geom_bar(stat = "identity", fill = "#76D7C4") +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title = "Purchase trend across product categories",
       x = "Category", y = "Count", 
       subtitle = "Apparel stands as the top-selling category, while Android ranks as the least sold") +
  theme(axis.text.x = element_text(size = 8, color = "#39858C", face = "bold")) +
  theme(plot.title = element_text(color = "#3498DB", face = "bold")) +
  theme(plot.subtitle = element_text(size = 8, color = "#5F75CE", face = "italic"))

# Visualizing trend over time
prod_cat2 <- on_shop %>% 
  group_by(Transaction_Date, Product_Category) %>% 
  summarise(Count = n()) %>% 
  arrange(- Count)
## `summarise()` has grouped output by 'Transaction_Date'. You can override using
## the `.groups` argument.
k <- ggplot(prod_cat2, aes(x = Transaction_Date, y = Count)) +
  geom_line(aes(color = Product_Category), size = 0.2) +
  labs(title = "Trend analysis of product categories over time",
       x = "Transaction Date", color = "Category") +
  theme(plot.title = element_text(color = "#3498DB", face = "bold",
                                  size = 12)) +
  theme(legend.text = element_text(size = 8))
  
# Interactive Viz
ggplotly(k)

Interactive viz: Navigate data using the cursor.

# correlation between tenure and quantity purchased
cor(on_shop$Quantity, on_shop$Tenure_Months) # 0.006867134
## [1] 0.006867134
# correlation between tenure and avg_price purchased
cor(on_shop$Tenure_Months, on_shop$Avg_Price) # -0.0007849942
## [1] -0.0007849942

There is a negligible correlation between tenure and quantity purchased (0.006867134), indicating weak to no relationship. Conversely, there appears to be a slight negative correlation (−0.0007849942) between tenure and average price, suggesting a minimal inverse association.

  1. Coupon Analysis:

    • What is the distribution of coupon usage among different product categories or locations?

    • Can we analyze the impact of different coupon codes on the average transaction amount or quantity purchased?

colnames(on_shop)
##  [1] "CustomerID"          "Gender"              "Location"           
##  [4] "Tenure_Months"       "Transaction_ID"      "Transaction_Date"   
##  [7] "Product_SKU"         "Product_Description" "Product_Category"   
## [10] "Quantity"            "Avg_Price"           "Delivery_Charges"   
## [13] "Coupon_Status"       "GST"                 "Offline_Spend"      
## [16] "Online_Spend"        "Month"               "Coupon_Code"        
## [19] "Discount_pct"        "trans_day"           "trans_month"        
## [22] "trans_year"          "trans_week"
coup_loc <- on_shop %>% 
  group_by(Coupon_Status, Location) %>% 
  summarise(Count = n()) %>% 
  arrange(- Count)
## `summarise()` has grouped output by 'Coupon_Status'. You can override using the
## `.groups` argument.
 ggplot(coup_loc, aes(x = Coupon_Status, y = Count)) +
  geom_col(aes(fill = Location), color = "black") +
  labs(title = "Distribution of Coupon Status across Locations", 
       x = "Coupon Status",
       subtitle = "Chicago tops coupon usage, while Washington DC ranks lowest") +
  theme(plot.subtitle = element_text(size = 8, color = "#5F75CE", face = "italic")) +
  scale_fill_manual(values = c("#2B3E67", "#2D7246", 
                               "#F4FC4A", "#E91601", "#71522A")) +
   theme(plot.title = element_text(color = "#3498DB", face = "bold", size = 12))

The predominant Coupon Status is ‘Clicked’, signifying its prevalence among transactions. Chicago residents utilize the highest number of coupons, while Washington DC residents use the fewest.

# Product category and coupon code
pcc <- on_shop %>% 
  group_by(Coupon_Code) %>% 
  summarise(Count = n()) %>% 
  arrange(-Count)

# Top 10 most used coupon code
top_10_cc <- head(pcc, 10)

kable(top_10_cc, full_width = FALSE)
Coupon_Code Count
SALE20 6373
SALE30 5915
SALE10 5838
ELEC10 4826
ELEC30 4647
ELEC20 4540
EXTRA10 2317
OFF10 2250
EXTRA20 2211
OFF20 2202
# Least used Coupon Codes
least_cc <- tail(pcc, 5)

kable(least_cc, full_width = FALSE)
Coupon_Code Count
HOU30 30
GC30 29
AND30 16
AND10 14
AND20 13
# Product Category and Coupon
top_codes <- on_shop %>% 
  group_by(Product_Category, Coupon_Code) %>% 
  summarise(Count = n()) %>% 
  arrange(- Count) %>% 
  head(30)
## `summarise()` has grouped output by 'Product_Category'. You can override using
## the `.groups` argument.
interactive <- ggplot(top_codes, aes(x = Product_Category, y = Count)) +
  geom_col(color = "black", aes(fill = Coupon_Code)) +
  labs(title = "Top Coupon Codes used for Product Purchases",
       x = "Product Category", fill = "Coupon Code") +
  theme(axis.text.x = element_text(angle = 45)) +
  theme(axis.text.x = element_text(size = 8)) +
  scale_fill_manual(values = c(
    "#336600", "#FF9900", "#000066", "#FF3399", "#660099",
    "#66CCFF", "#993200", "#00CC66", "#663300", "#FFCC00",
    "#333333", "#FF0033", "#F5F5F5", "#9966CC", "#FAFAFA",
    "#3333FF", "#F0F8FF", "#009999", "#E5B7B7", "#66FF33",
    "#F08080", "#CC6699", "#DE3B3B", "#FFCC99", "#CD5C5C",
    "#FFFF00", "#993232", "#FF99CC", "#8B0000", "#9900FF"
  )) +
  guides(fill = FALSE) +
  theme(plot.title = element_text(color = "#3498DB", face = "bold",
                                  size = 12))
  
ggplotly(interactive)

Interactive viz: Navigate data using the cursor.

# impact of different coupon codes on the average 
# transaction amount or quantity purchased

ggplot(on_shop, aes(x = Coupon_Code, y = Avg_Price)) +
  geom_boxplot(fill = "#FB3F03", 
               color = "#714702", 
               alpha = 0.6) +
  theme(axis.text.x = element_text(size = 7, 
                                   angle = 90, 
                                   color = "#363744")) +
  labs(title = "Distribution of Average Price across Coupon Codes",
       x = "Coupon Code", y = "Average Price") +
  theme(plot.title = element_text(color = "#3498DB", face = "bold",
                                  size = 12))

There is a larger variance in average prices for coupon codes GC10, GC20, and GC30, while the distribution of average prices among other coupon codes appears relatively consistent. Coupon code ACC30 stands unique without any coupon pairs, unlike the other codes, which exhibit three distinct categories. Notably, the highest average prices are associated with coupon codes NE10, NE20, and NE30.

  1. Geographical Patterns and Spending Habits:

    • How does customer spending behavior vary across different locations?

    • Can we visualize the distribution of offline versus online spending across different regions or customer segments?

# Adding up values of Offline Spend and Online Spend and creating a new column called total_spending
on_shop <- on_shop %>% 
  mutate(total_spending = Offline_Spend + Online_Spend)

# Combination of violin and boxplot is ideal for understanding the data distribution
ggplot(on_shop, aes(x = Location, y = total_spending)) +
  geom_violin(color ="#59905B", fill = "#E0FEAF") +
  geom_boxplot(color = "#D08115", width = 0.4, 
               fill = "#B06860", alpha = 0.5) +
  labs(title = "Spendings distribution across locations",
       y = "Total Spending (Online + Offline)") +
  theme(plot.title = element_text(color = "#3498DB", face = "bold", size = 12))

The minimum spending in Washington DC stands notably higher than in other locations, while spending patterns across all other locations appear similar. California exhibits the highest maximum spending density, while New Jersey registers the lowest.

# offline vs online
on_off <- on_shop %>% 
  select(Location, Online_Spend, Offline_Spend) %>% 
  group_by(Location) %>% 
  summarise(Total_Online_Spend = sum(Online_Spend),
            Total_Offline_Spend = sum(Offline_Spend))

# Reshaping data to long format for plotting
on_off_long <- on_off %>% 
  pivot_longer(cols = c(Total_Online_Spend, Total_Offline_Spend),
               names_to = "Spend_Type",
               values_to = "Total_Spend")

ggplot(on_off_long, aes(x = Location, y = Total_Spend, fill = Spend_Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Offline vs Online spending distribution by location",
       x = "Location",
       y = "Total Spending",
       fill = "Spend Type") +
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
  scale_fill_manual(values = c("#0D126C", "#F701FE")) +
  theme(plot.title = element_text(size = 12, color = "#3498DB", 
                                  face = "bold")) +
  theme(legend.text = element_text(size = 7, face = "bold"),
        legend.title = element_text(size = 8))

# Offline transactions account for the highest spending
  1. Temporal Trends and Seasonality:

    • Are there seasonal trends in purchases? How do they vary across product categories?

    • Can we identify any patterns in offline or online spending behavior over the months?

purchase_trends <- ggplot(on_shop, aes(x = Product_Category)) +
  geom_bar(aes(fill = Product_Category)) +
  facet_wrap(~ factor(trans_month), nrow = 4) +
  theme(axis.text.x = element_blank()) +
  labs(title = "Monthly purchase trends by product category",
       x = "Product Category",
       y = "Total Purchases") +
  theme(plot.title = element_text(size = 12, color = "#3498DB", 
                                  face = "bold")) +
  theme(legend.text = element_text(size = 9, face = "bold"),
        legend.title = element_text(size = 9))

ggplotly(purchase_trends)

‘Apparel’ demonstrates elevated sales in July and August, with the lowest recorded during November. ‘Nest_USA’ experiences heightened sales in January and December. ‘Bags’ exhibit increased sales during July and August, while ‘Drinkware’ sees higher sales in March and August. ‘Lifestyle’ products depict a steady sales increase from March to October, followed by a decline after October. The ‘Nest’ category demonstrates an increasing trend from August to December.

# Finding patterns in offline and online spending behavior 
spend_by_month <- on_shop %>% 
  group_by(trans_month) %>% 
  summarise(Total_Online_Spend = sum(Online_Spend),
            Total_Offline_Spend = sum(Offline_Spend))

# Reshaping data to long format for plotting
spend_by_month_long <- spend_by_month %>% 
  pivot_longer(cols = c(Total_Online_Spend, Total_Offline_Spend),
               names_to = "Spend_Type",
               values_to = "Total_Spend")


ggplot(spend_by_month_long, aes(x = trans_month, 
                                y = Total_Spend, 
                                color = Spend_Type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 4) +
  scale_x_continuous(breaks = 1:12, labels = month.abb[1:12]) +
  labs(title = "Montly Offline vs Online Purchase Trend", 
       x = "Transaction Month", y = "Total Spend",
       subtitle = "Sales peak notably in August and December",
       color = "Spend Type") +
  theme(axis.text.x = element_text(size = 7, face = "bold")) +
  theme(plot.subtitle = element_text(size = 9, face = "italic",
                                     color = "#5F75CE")) +
  scale_color_brewer(palette = "Set1") +
  theme(plot.title = element_text(color = "#3498DB", face = "bold", size = 12)) +
  theme(legend.text = element_text(size = 7),
        legend.title = element_text(size = 8))

Offline transactions surpass online transactions in volume. Sales peak in August and December, whereas they reach a minimum in February, June, and September.

  1. Impact of Discounts and GST:

    • Is there a noticeable relationship between the discount percentage applied and the quantity or total spend in transactions?

    • How does the GST affect the total transaction amount, and is there a variation across product categories or locations?

analysis_5 <-on_shop %>% 
  select(Discount_pct, Quantity, Offline_Spend, Online_Spend, Location, GST) %>% 
  mutate(Total_Spend = Offline_Spend + Online_Spend) %>% 
  filter(Discount_pct > 0)

attach(analysis_5)

cor(Discount_pct, Total_Spend) 
## [1] 0.04769603
cor(Discount_pct, Quantity) 
## [1] -0.01138081

The correlation between Discount Percentage and Total_Spend is weak (0.04769603), suggesting little to no relationship. Conversely, a negative correlation (-0.01138081) exists between Discount Percentage and Quantity.

# Finding the total Amount including GST

# Add GST: GST amount = (Original cost x GST%)/100. 
# Net price = original cost + GST amount.
analysis_5 <- analysis_5 %>% 
  mutate(GST_Amount = (Total_Spend * GST))

analysis_5$GST_Amount <- analysis_5$Total_Spend + analysis_5$GST_Amount

gst_locate <- analysis_5 %>% 
  group_by(Location) %>% 
  summarise(Without_GST = sum(Total_Spend),
            With_GST = sum(GST_Amount)) 

# Reshaping for visualization purpose
gst_locate <- gst_locate %>% 
  pivot_longer(cols = c(Without_GST, With_GST),
               names_to = "Amount_Type",
               values_to = "Total_Amount")

ggplot(gst_locate, aes(x = Location, y = Total_Amount, fill = Amount_Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(size = 8),
        plot.title = element_text(size = 13, face = "bold")) +
  labs(title = "Comparing total amounts with and without GST",
       y = "Total Amount", fill = "Amount Type") +
  scale_fill_manual(values = c("#D42626", "#169483")) +
  theme(plot.title = element_text(color = "#3498DB", face = "bold")) +
  theme(legend.text = element_text(size = 8))

The highest impact of GST was observed in New Jersey, resulting in a 13.73% increase from the original amount. California and New York experienced the lowest impact, with a 13.58% increase due to GST.

  1. Performance Metrics:

    • What are the key performance indicators (KPIs) that can be derived from this dataset, such as average revenue per customer, average transaction value, or customer acquisition trends over time?
# Average Transaction Value
avg_transaction_value <- mean(on_shop$total_spending)

print(avg_transaction_value)
## [1] 4724.023
# Customer Acquisition Trends Over Time
c_aq_trend <- on_shop %>% 
  group_by(Transaction_Date) %>% 
  summarise(New_Customers = n_distinct(CustomerID)) %>% 
  arrange(- New_Customers)


# Plotting Customer Acquisition Trends Over Time
ggplot(c_aq_trend, aes(x = Transaction_Date, y = New_Customers)) +
  geom_line(color = "#2C3E50", size = 1) +
    labs(title = "Customer Acquisition Trend Over Time",
         x = "Transaction Date", y = "New Customers",
         subtitle = "August saw the highest influx of new customers") +
  theme(plot.title = element_text(color = "#3498DB", face = "bold")) +
  theme(plot.subtitle = element_text(size = 10, face = "italic",
                                     color = "#5F75CE"))

Conclusion:
  1. Product Insights:

    • Apparel emerged as the most frequently purchased category, while Android products were the least sold.

    • Purchase trends across categories varied throughout the year, with distinct peaks for specific categories in different months.

  2. Coupon Analysis:

    • Chicago exhibited the most significant coupon usage, while Washington DC had the least.

    • Different coupon codes affected average transaction prices differently, with varying price fluctuations among codes.

  3. Geographical Spending Patterns:

    • Washington DC showed notably higher minimum spending compared to other locations.

    • Offline transactions dominated over online transactions, and peak sales occurred in August and December.

  4. Discount and Quantity Relationship:

    • There appeared to be little to no relationship between the discount percentage and total spend, but a negative correlation was observed between the discount percentage and quantity purchased.
  5. GST Impact:

    • GST had varying impacts across locations, with New Jersey experiencing the highest increase in the total amount due to GST.
Overall Summary:

The project provided insights into customer behavior, preferences, and seasonal trends in the online shopping domain, shedding light on various nuances and trends within the dataset.