1 Assignment Overview

1.1 Background

Understanding customer behavior and purchasing patterns is vital for online businesses. Your objective is to analyze transaction data from an online retailer to uncover sales trends, customer segments, and potential improvements.

1.2 Data Description

The dataset captures transactions from December 2010 to December 2011 from a UK-based online retaile

2 Start Your Work Below

2.1 Data Import & Setup

library(tidyverse)
library(lubridate)
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(dlookr)
library(knitr)
data <- read.csv("C:/Users/jade9/Downloads/online Retail.csv")

2.2 Data Diagnostics

library(xray)
data <- read.csv("C:/Users/jade9/Downloads/Online Retail.csv")
xray::anomalies(data)
## $variables
##      Variable      q    qNA    pNA qZero pZero qBlank pBlank qInf pInf
## 1  CustomerID 541909 135080 24.93%     0     -      0      -    0    -
## 2   UnitPrice 541909      0      -  2515 0.46%      0      -    0    -
## 3 Description 541909      0      -     0     -   1454  0.27%    0    -
## 4     Country 541909      0      -     0     -      0      -    0    -
## 5    Quantity 541909      0      -     0     -      0      -    0    -
## 6   StockCode 541909      0      -     0     -      0      -    0    -
## 7 InvoiceDate 541909      0      -     0     -      0      -    0    -
## 8   InvoiceNo 541909      0      -     0     -      0      -    0    -
##   qDistinct      type anomalous_percent
## 1      4373   Integer            24.93%
## 2      1630   Numeric             0.46%
## 3      4224 Character             0.27%
## 4        38 Character                 -
## 5       722   Integer                 -
## 6      4070 Character                 -
## 7     23260 Character                 -
## 8     25900 Character                 -
## 
## $problem_variables
##  [1] Variable          q                 qNA               pNA              
##  [5] qZero             pZero             qBlank            pBlank           
##  [9] qInf              pInf              qDistinct         type             
## [13] anomalous_percent problems         
## <0 rows> (or 0-length row.names)
#Key observation: We can see that only 75% of the CustomerID entries have values,
#UnitPrice has 2,515 rows with a value of 0,
#and the product description has 1,454 rows with a very small number of blank values.
library(dplyr)
library(tidyverse)
diagnose(data) %>% 
  arrange(desc(missing_count)) %>%
  select(variables, types, missing_count, missing_percent, unique_count) %>%
  knitr::kable(caption = "Data Quality Overview")
Data Quality Overview
variables types missing_count missing_percent unique_count
CustomerID integer 135080 24.92669 4373
InvoiceNo character 0 0.00000 25900
StockCode character 0 0.00000 4070
Description character 0 0.00000 4224
Quantity integer 0 0.00000 722
InvoiceDate character 0 0.00000 23260
UnitPrice numeric 0 0.00000 1630
Country character 0 0.00000 38
diagnose_outlier(data)
##    variables outliers_cnt outliers_ratio outliers_mean    with_mean
## 1   Quantity        58619      10.817130      50.59097     9.552250
## 2  UnitPrice        39627       7.312482      31.08475     4.611114
## 3 CustomerID            0       0.000000           NaN 15287.690570
##   without_mean
## 1     4.574599
## 2     2.522504
## 3 15287.690570

2.3 Data Preparation

2.5 Question 2: Customer segmentation

## Question 2: Customer segmentation
library(xray)
library(dplyr)
library(ggplot2)
library(tidyverse)

data <- read.csv("C:/Users/jade9/Downloads/Online Retail.csv")
head(data, 6)
##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##        InvoiceDate UnitPrice CustomerID        Country
## 1 01/12/2010 08:26      2.55      17850 United Kingdom
## 2 01/12/2010 08:26      3.39      17850 United Kingdom
## 3 01/12/2010 08:26      2.75      17850 United Kingdom
## 4 01/12/2010 08:26      3.39      17850 United Kingdom
## 5 01/12/2010 08:26      3.39      17850 United Kingdom
## 6 01/12/2010 08:26      7.65      17850 United Kingdom
newdata <- data %>%
  filter(!is.na(CustomerID)) %>%
  filter(Quantity > 0, UnitPrice > 0) %>%
  mutate(total_price = Quantity * UnitPrice)

customer_summary <- newdata %>%
  group_by(CustomerID) %>%
  summarise(total_spend = sum(total_price),
    frequency = n()  # AI-Assisted I would like to calculate the number of the frequency
  )

segment <- customer_summary %>%
  mutate(segment = case_when(
    frequency > 20 ~ "Frequent",
    frequency > 5 & frequency <= 20 ~ "Regular",
    TRUE ~ "One-time"
  ))  # AI-Assisted I want to classify according to frequency

ggplot(segment, aes(segment, fill = segment)) +
  geom_bar() +
  labs(
    title = "Customer Segmentation by Purchase Frequency",
    x = "Type",
    y = "Number of Customers"
  ) +
  theme_bw()

#The bar chart shows that most customers are “Frequent” buyers, with only a small number of one-time customers, indicating a relatively high level of customer loyalty

ggplot(segment, aes(x = segment, y =total_spend, fill = segment)) +
  geom_boxplot(alpha = 0.7) +
  scale_y_log10() + # AI-assisted: The original graph was compressed due to large outliers, so I applied a log scale to make the distribution clearer.
  labs( title = "Customer Spending Distribution by Segment",
    x = "Customer Segment",
    y = "Total Spend"
  ) +
  theme_minimal()

2.5.1 AI-Assisted

#I wanted to analyze the top-selling products in each country using a scatter plot, but I wasn’t sure which variables to use or how to identify the top products. I asked AI for guidance, and it suggested using Country and Description as the grouping variables. After that, I manually selected the top five countries based on total sales, and then extracted the top three best-selling products within each of those countries.

2.6 Question 3: Country-specific patterns

library(dplyr)
library(ggplot2)
library(readr)
library(ggrepel)

data <- read_csv("C:/Users/jade9/Downloads/Online Retail.csv")

data <- data %>%
  mutate(Sales = Quantity * UnitPrice)

country_sales <- data %>%
  group_by(Country) %>% # to arrange the order by country
  summarise(total_sales = sum(Sales, na.rm = TRUE)) %>%
  arrange(desc(total_sales)) # to see the highest
head(country_sales, 10)
## # A tibble: 10 Ă— 2
##    Country        total_sales
##    <chr>                <dbl>
##  1 United Kingdom    8187806.
##  2 Netherlands        284662.
##  3 EIRE               263277.
##  4 Germany            221698.
##  5 France             197404.
##  6 Australia          137077.
##  7 Switzerland         56385.
##  8 Spain               54775.
##  9 Belgium             40911.
## 10 Sweden              36596.
customer_prefer <- data %>%
  group_by(Country, Description) %>%
  summarise(total_quantity = sum(Quantity, na.rm = TRUE))
customer_prefer    
## # A tibble: 20,155 Ă— 3
## # Groups:   Country [38]
##    Country   Description                        total_quantity
##    <chr>     <chr>                                       <dbl>
##  1 Australia 10 COLOUR SPACEBOY PEN                         48
##  2 Australia 12 PENCIL SMALL TUBE WOODLAND                 384
##  3 Australia 12 PENCILS TALL TUBE POSY                     252
##  4 Australia 12 PENCILS TALL TUBE RED RETROSPOT             12
##  5 Australia 16 PIECE CUTLERY SET PANTRY DESIGN             24
##  6 Australia 20 DOLLY PEGS RETROSPOT                        24
##  7 Australia 3 HOOK HANGER MAGIC GARDEN                     12
##  8 Australia 3 STRIPEY MICE FELTCRAFT                      200
##  9 Australia 3 TIER CAKE TIN GREEN AND CREAM                32
## 10 Australia 3 TIER CAKE TIN RED AND CREAM                  48
## # ℹ 20,145 more rows
highest5 <- c("United Kingdom", "Netherlands", "EIRE", "Germany", "France")

chose_5 <- data %>%
  filter(Country %in% highest5) %>%
  group_by(Country, Description) %>%
  summarise(final = sum(Quantity, na.rm = TRUE))
chose_5
## # A tibble: 10,285 Ă— 3
## # Groups:   Country [5]
##    Country Description                         final
##    <chr>   <chr>                               <dbl>
##  1 EIRE    10 COLOUR SPACEBOY PEN                240
##  2 EIRE    12 COLOURED PARTY BALLOONS             20
##  3 EIRE    12 DAISY PEGS IN WOOD BOX              48
##  4 EIRE    12 EGG HOUSE PAINTED WOOD              25
##  5 EIRE    12 IVORY ROSE PEG PLACE SETTINGS       17
##  6 EIRE    12 MESSAGE CARDS WITH ENVELOPES        36
##  7 EIRE    12 PENCIL SMALL TUBE WOODLAND          48
##  8 EIRE    12 PENCILS SMALL TUBE RED RETROSPOT    24
##  9 EIRE    12 PENCILS TALL TUBE POSY             144
## 10 EIRE    12 PENCILS TALL TUBE RED RETROSPOT    216
## # ℹ 10,275 more rows
top3_in5 <- chose_5 %>%
  arrange(Country, desc(final)) %>%  
  group_by(Country) %>%              
  slice_head(n = 3) # AI-Assisted I would like to pick up top 3 products in each country

ggplot(top3_in5, aes(x = Country, y = final, color = Country, label = Description)) +
  geom_point(size = 1) +
  geom_text_repel(size = 3, show.legend = FALSE) + # AI-Assisted I would like to show the product name in graph so I used GPT to help me add the label of Description
  theme_bw() +
  labs(title = "Top 3 Products in Each Country",
    x = "Country",
    y = "Quantity"
  )

#from the chart, we can see the top three best-selling products in each of the five selected countries. Notably, sales in the United Kingdom are significantly higher than in the other four countries. This suggests that the company could focus more of its marketing strategy on the UK market.

ggplot(filter(country_sales, Country %in% highest5),
       aes(x = reorder(Country, -total_sales), y = total_sales, fill = Country)) +
  #I initially tried using desc() to sort the countries in descending order, but it didn’t work. After asking AI, I learned that desc() cannot be applied inside, and that using reorder and -total_sales is the correct way to achieve descending order.
geom_col() +
theme_minimal() +
labs(title = "Total Sales of Top 5 Countries", x = "Country", y = "Total Sales")

2.6.1 AI-Assisted

[If you used AI here: briefly describe what you asked and what you learned]

#In this analysis, I used GPT to assist me throughout the entire process. For the parts I wanted to focus on, I first expressed my own viewpoints, then let GPT check whether they matched the question. I also provided it with some basic formulas so it could make corrections, and afterward, I retyped everything myself.

2.7 Summary & Key Takeaways

#Through this analysis, I identified the most popular products in the OnlineSales dataset and examined customer purchasing frequency.

#Customers were classified into three categories: regular customers, one-time buyers, and returning customers.

#Using scatter plot analysis, I further identified the top three products preferred in each region.

#From a business perspective, these findings suggest that companies can optimize product offerings by region or country,

#focusing production and marketing efforts on the items most favored by local customers.

2.8 Tableau Dashboard Description

#(1) What story does your dashboard tell? The interactive dashboard focuses on analyzing sales performance across different countries and identifying regional differences in total sales and seasonal patterns.

#(2) Which 2-3 key insights did you choose to visualize? The visualization reveals that the United Kingdom contributes the largest share of sales, far exceeding other Western European countries. It also displays monthly sales trends and seasonal changes among the top ten popular products.

#(3) Why are these insights important for business decisions? These insights help businesses understand that the U.K. is the key export market, allowing them to allocate resources, adjust inventory, and plan marketing strategies more effectively toward this dominant region.