2 Start Your Work Below

2.1 Data Import & Setup

library(tidyverse)
library(lubridate)
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(dlookr)
library(knitr)

data <- read.csv("C:/Users/jade9/Downloads/online Retail.csv")

2.2 Data Diagnostics

library(xray)
data <- read.csv("C:/Users/jade9/Downloads/Online Retail.csv")
xray::anomalies(data)

## $variables
##      Variable      q    qNA    pNA qZero pZero qBlank pBlank qInf pInf
## 1  CustomerID 541909 135080 24.93%     0     -      0      -    0    -
## 2   UnitPrice 541909      0      -  2515 0.46%      0      -    0    -
## 3 Description 541909      0      -     0     -   1454  0.27%    0    -
## 4     Country 541909      0      -     0     -      0      -    0    -
## 5    Quantity 541909      0      -     0     -      0      -    0    -
## 6   StockCode 541909      0      -     0     -      0      -    0    -
## 7 InvoiceDate 541909      0      -     0     -      0      -    0    -
## 8   InvoiceNo 541909      0      -     0     -      0      -    0    -
##   qDistinct      type anomalous_percent
## 1      4373   Integer            24.93%
## 2      1630   Numeric             0.46%
## 3      4224 Character             0.27%
## 4        38 Character                 -
## 5       722   Integer                 -
## 6      4070 Character                 -
## 7     23260 Character                 -
## 8     25900 Character                 -
## 
## $problem_variables
##  [1] Variable          q                 qNA               pNA              
##  [5] qZero             pZero             qBlank            pBlank           
##  [9] qInf              pInf              qDistinct         type             
## [13] anomalous_percent problems         
## <0 rows> (or 0-length row.names)

#Key observation: We can see that only 75% of the CustomerID entries have values,
#UnitPrice has 2,515 rows with a value of 0,
#and the product description has 1,454 rows with a very small number of blank values.
library(dplyr)
library(tidyverse)
diagnose(data) %>% 
  arrange(desc(missing_count)) %>%
  select(variables, types, missing_count, missing_percent, unique_count) %>%
  knitr::kable(caption = "Data Quality Overview")

Data Quality Overview
variables	types	missing_count	missing_percent	unique_count
CustomerID	integer	135080	24.92669	4373
InvoiceNo	character	0	0.00000	25900
StockCode	character	0	0.00000	4070
Description	character	0	0.00000	4224
Quantity	integer	0	0.00000	722
InvoiceDate	character	0	0.00000	23260
UnitPrice	numeric	0	0.00000	1630
Country	character	0	0.00000	38

diagnose_outlier(data)

##    variables outliers_cnt outliers_ratio outliers_mean    with_mean
## 1   Quantity        58619      10.817130      50.59097     9.552250
## 2  UnitPrice        39627       7.312482      31.08475     4.611114
## 3 CustomerID            0       0.000000           NaN 15287.690570
##   without_mean
## 1     4.574599
## 2     2.522504
## 3 15287.690570

2.3 Data Preparation

2.4 Question 1: Most popular products and sales over time

2.4.1 AI-Assisted

#When I tried to plot the data, I noticed that the InvoiceDate column was stored as text, so R could not treat it as a timeline. Therefore, I asked AI for help. At the same time, I also wanted to summarise the sales within each day, which helped me understand more clearly how to use summarise().

#I learned that dates in R must be converted into a proper date/datetime format. In addition, because I wanted to highlight the maximum and minimum points to make the graph more informative, I also asked the AI for assistance.

#I asked how to highlight the maximum point, and then I wrote the code for the minimum point by myself.

#I used AI to help troubleshoot errors in ggplot2, and based on its suggestions, I converted the bar chart into a horizontal bar chart (using coord_flip()) to better display the long product descriptions.

library(dplyr)
library(ggplot2)

data <- read.csv("C:/Users/jade9/Downloads/online Retail.csv")
  mp <- data %>%
  group_by(Description) %>%
  summarise(total_quantity = sum(Quantity, na.rm = TRUE)) %>% ### AI-Assisted
  arrange(desc(total_quantity))
  
head10<-head(mp,10)

ggplot(head10, aes(x = reorder(Description, total_quantity), y = total_quantity,fill = Description)) + ### reorder AI-Assisted
  geom_col() +
  coord_flip() + ### AI-Assisted
  labs(x = "Description",
       y = "Total Quantity",
       title = "Top 10 Most Popular Products")

mostpopular <- "WORLD WAR 2 GLIDERS ASSTD DESIGNS"

sales_mpproduct <- data %>%
  filter(Description == mostpopular) %>%
  mutate(InvoiceDate = as.POSIXct(InvoiceDate, format = "%d/%m/%Y %H:%M")) %>%
   mutate(date = as.Date(InvoiceDate)) %>%       
  group_by(date) %>% #AI-Assisted , i would like to arrange the date so i have to translate the format of date
  summarise(total_quantity = sum(Quantity)) %>% #to show the product sales in one day
  arrange(date)

sales_mpproduct

## # A tibble: 235 × 2
##    date       total_quantity
##    <date>              <int>
##  1 2010-12-02           3264
##  2 2010-12-03             49
##  3 2010-12-05             96
##  4 2010-12-06              8
##  5 2010-12-07             50
##  6 2010-12-08            240
##  7 2010-12-09             52
##  8 2010-12-10             67
##  9 2010-12-12             96
## 10 2010-12-13            242
## # ℹ 225 more rows

avg <- mean(sales_mpproduct$total_quantity, na.rm = TRUE) 
#because the plot looked too plain, I added an average line.t
max_val <- max(sales_mpproduct$total_quantity, na.rm = TRUE) ### AI-Assisted
max_point <- subset(sales_mpproduct, total_quantity == max_val)### AI-Assisted
min_val <-min(sales_mpproduct$total_quantity,na.rm=TRUE)
min_point <- subset(sales_mpproduct,total_quantity==min_val)

ggplot(sales_mpproduct, aes(date, total_quantity)) +
  geom_line(color = "steelblue", size = 1) +
  geom_hline(yintercept = avg, color = "red",size=0.8) +### AI-Assisted
 geom_point(data = max_point, color = "black", size = 2)+### AI-Assisted
  geom_point(data = min_point, color = "black", size = 2)+
  geom_text( data = max_point, aes(label = paste("Max:", total_quantity)),
    vjust = -1,
    color = "red",
    size = 3
  ) + ### AI-Assisted
  geom_text( data = min_point, aes(label = paste("Min:", total_quantity)),
    vjust = -1,
    color = "red",
    size = 3
  ) +
  
  labs( title = "Daily sales of most popular products",
    x = "date",
    y = "Quantity"
  ) +
  theme_minimal()

  #From the line chart, it can be observed that sales increased significantly toward the end of the year. This rise may be attributed to the year-end shopping season, including holidays such as Christmas, or promotional activities typically held during that period.

2.4.2 AI-Assisted

#I wanted to segment customers into three groups—Frequent, Regular, and One-time buyers. However, I wasn’t sure how to calculate how many times each customer made a purchase. I asked AI, and it explained that I could use frequency = n() inside summarise() to count the number of transactions per customer. After getting the frequency, I used my own thresholds to classify the customers into the three segments.
#To enrich the analysis and better understand customer spending behavior, I asked AI which visualization would be most suitable, and based on the suggestion, I added a boxplot to compare total spending across customer segments.

2.5 Question 2: Customer segmentation

## Question 2: Customer segmentation
library(xray)
library(dplyr)
library(ggplot2)
library(tidyverse)

data <- read.csv("C:/Users/jade9/Downloads/Online Retail.csv")
head(data, 6)

##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2
##        InvoiceDate UnitPrice CustomerID        Country
## 1 01/12/2010 08:26      2.55      17850 United Kingdom
## 2 01/12/2010 08:26      3.39      17850 United Kingdom
## 3 01/12/2010 08:26      2.75      17850 United Kingdom
## 4 01/12/2010 08:26      3.39      17850 United Kingdom
## 5 01/12/2010 08:26      3.39      17850 United Kingdom
## 6 01/12/2010 08:26      7.65      17850 United Kingdom

newdata <- data %>%
  filter(!is.na(CustomerID)) %>%
  filter(Quantity > 0, UnitPrice > 0) %>%
  mutate(total_price = Quantity * UnitPrice)

customer_summary <- newdata %>%
  group_by(CustomerID) %>%
  summarise(total_spend = sum(total_price),
    frequency = n()  # AI-Assisted I would like to calculate the number of the frequency
  )

segment <- customer_summary %>%
  mutate(segment = case_when(
    frequency > 20 ~ "Frequent",
    frequency > 5 & frequency <= 20 ~ "Regular",
    TRUE ~ "One-time"
  ))  # AI-Assisted I want to classify according to frequency

ggplot(segment, aes(segment, fill = segment)) +
  geom_bar() +
  labs(
    title = "Customer Segmentation by Purchase Frequency",
    x = "Type",
    y = "Number of Customers"
  ) +
  theme_bw()

#The bar chart shows that most customers are “Frequent” buyers, with only a small number of one-time customers, indicating a relatively high level of customer loyalty

ggplot(segment, aes(x = segment, y =total_spend, fill = segment)) +
  geom_boxplot(alpha = 0.7) +
  scale_y_log10() + # AI-assisted: The original graph was compressed due to large outliers, so I applied a log scale to make the distribution clearer.
  labs( title = "Customer Spending Distribution by Segment",
    x = "Customer Segment",
    y = "Total Spend"
  ) +
  theme_minimal()

2.5.1 AI-Assisted

#I wanted to analyze the top-selling products in each country using a scatter plot, but I wasn’t sure which variables to use or how to identify the top products. I asked AI for guidance, and it suggested using Country and Description as the grouping variables. After that, I manually selected the top five countries based on total sales, and then extracted the top three best-selling products within each of those countries.

2.6 Question 3: Country-specific patterns

library(dplyr)
library(ggplot2)
library(readr)
library(ggrepel)

data <- read_csv("C:/Users/jade9/Downloads/Online Retail.csv")

data <- data %>%
  mutate(Sales = Quantity * UnitPrice)

country_sales <- data %>%
  group_by(Country) %>% # to arrange the order by country
  summarise(total_sales = sum(Sales, na.rm = TRUE)) %>%
  arrange(desc(total_sales)) # to see the highest
head(country_sales, 10)

## # A tibble: 10 × 2
##    Country        total_sales
##    <chr>                <dbl>
##  1 United Kingdom    8187806.
##  2 Netherlands        284662.
##  3 EIRE               263277.
##  4 Germany            221698.
##  5 France             197404.
##  6 Australia          137077.
##  7 Switzerland         56385.
##  8 Spain               54775.
##  9 Belgium             40911.
## 10 Sweden              36596.

customer_prefer <- data %>%
  group_by(Country, Description) %>%
  summarise(total_quantity = sum(Quantity, na.rm = TRUE))
customer_prefer

## # A tibble: 20,155 × 3
## # Groups:   Country [38]
##    Country   Description                        total_quantity
##    <chr>     <chr>                                       <dbl>
##  1 Australia 10 COLOUR SPACEBOY PEN                         48
##  2 Australia 12 PENCIL SMALL TUBE WOODLAND                 384
##  3 Australia 12 PENCILS TALL TUBE POSY                     252
##  4 Australia 12 PENCILS TALL TUBE RED RETROSPOT             12
##  5 Australia 16 PIECE CUTLERY SET PANTRY DESIGN             24
##  6 Australia 20 DOLLY PEGS RETROSPOT                        24
##  7 Australia 3 HOOK HANGER MAGIC GARDEN                     12
##  8 Australia 3 STRIPEY MICE FELTCRAFT                      200
##  9 Australia 3 TIER CAKE TIN GREEN AND CREAM                32
## 10 Australia 3 TIER CAKE TIN RED AND CREAM                  48
## # ℹ 20,145 more rows

highest5 <- c("United Kingdom", "Netherlands", "EIRE", "Germany", "France")

chose_5 <- data %>%
  filter(Country %in% highest5) %>%
  group_by(Country, Description) %>%
  summarise(final = sum(Quantity, na.rm = TRUE))
chose_5

## # A tibble: 10,285 × 3
## # Groups:   Country [5]
##    Country Description                         final
##    <chr>   <chr>                               <dbl>
##  1 EIRE    10 COLOUR SPACEBOY PEN                240
##  2 EIRE    12 COLOURED PARTY BALLOONS             20
##  3 EIRE    12 DAISY PEGS IN WOOD BOX              48
##  4 EIRE    12 EGG HOUSE PAINTED WOOD              25
##  5 EIRE    12 IVORY ROSE PEG PLACE SETTINGS       17
##  6 EIRE    12 MESSAGE CARDS WITH ENVELOPES        36
##  7 EIRE    12 PENCIL SMALL TUBE WOODLAND          48
##  8 EIRE    12 PENCILS SMALL TUBE RED RETROSPOT    24
##  9 EIRE    12 PENCILS TALL TUBE POSY             144
## 10 EIRE    12 PENCILS TALL TUBE RED RETROSPOT    216
## # ℹ 10,275 more rows

top3_in5 <- chose_5 %>%
  arrange(Country, desc(final)) %>%  
  group_by(Country) %>%              
  slice_head(n = 3) # AI-Assisted I would like to pick up top 3 products in each country

ggplot(top3_in5, aes(x = Country, y = final, color = Country, label = Description)) +
  geom_point(size = 1) +
  geom_text_repel(size = 3, show.legend = FALSE) + # AI-Assisted I would like to show the product name in graph so I used GPT to help me add the label of Description
  theme_bw() +
  labs(title = "Top 3 Products in Each Country",
    x = "Country",
    y = "Quantity"
  )

#from the chart, we can see the top three best-selling products in each of the five selected countries. Notably, sales in the United Kingdom are significantly higher than in the other four countries. This suggests that the company could focus more of its marketing strategy on the UK market.

ggplot(filter(country_sales, Country %in% highest5),
       aes(x = reorder(Country, -total_sales), y = total_sales, fill = Country)) +
  #I initially tried using desc() to sort the countries in descending order, but it didn’t work. After asking AI, I learned that desc() cannot be applied inside, and that using reorder and -total_sales is the correct way to achieve descending order.
geom_col() +
theme_minimal() +
labs(title = "Total Sales of Top 5 Countries", x = "Country", y = "Total Sales")

2.6.1 AI-Assisted

[If you used AI here: briefly describe what you asked and what you learned]

#In this analysis, I used GPT to assist me throughout the entire process. For the parts I wanted to focus on, I first expressed my own viewpoints, then let GPT check whether they matched the question. I also provided it with some basic formulas so it could make corrections, and afterward, I retyped everything myself.

2.7 Summary & Key Takeaways

#Through this analysis, I identified the most popular products in the OnlineSales dataset and examined customer purchasing frequency.

#Customers were classified into three categories: regular customers, one-time buyers, and returning customers.

#Using scatter plot analysis, I further identified the top three products preferred in each region.

#From a business perspective, these findings suggest that companies can optimize product offerings by region or country,

#focusing production and marketing efforts on the items most favored by local customers.

2.8 Tableau Dashboard Description

#(1) What story does your dashboard tell? The interactive dashboard focuses on analyzing sales performance across different countries and identifying regional differences in total sales and seasonal patterns.

#(2) Which 2-3 key insights did you choose to visualize? The visualization reveals that the United Kingdom contributes the largest share of sales, far exceeding other Western European countries. It also displays monthly sales trends and seasonal changes among the top ten popular products.

#(3) Why are these insights important for business decisions? These insights help businesses understand that the U.K. is the key export market, allowing them to allocate resources, adjust inventory, and plan marketing strategies more effectively toward this dominant region.

E-commerce Exploratory Analytics & Dashboard

MN50749 - Databases and BI - Individual Assignment

259435637

1 Assignment Overview

1.1 Background

1.2 Data Description

2 Start Your Work Below

2.1 Data Import & Setup

2.2 Data Diagnostics

2.3 Data Preparation

2.4 Question 1: Most popular products and sales over time

2.4.1 AI-Assisted

2.4.2 AI-Assisted

2.5 Question 2: Customer segmentation

2.5.1 AI-Assisted

2.6 Question 3: Country-specific patterns

2.6.1 AI-Assisted

2.7 Summary & Key Takeaways

2.8 Tableau Dashboard Description