Understanding customer behavior and purchasing patterns is vital for online businesses. Your objective is to analyze transaction data from an online retailer to uncover sales trends, customer segments, and potential improvements.
The dataset captures transactions from December 2010 to December 2011 from a UK-based online retaile
library(tidyverse)
library(lubridate)
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(dlookr)
library(knitr)
data <- read.csv("C:/Users/jade9/Downloads/online Retail.csv")
library(xray)
data <- read.csv("C:/Users/jade9/Downloads/Online Retail.csv")
xray::anomalies(data)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf
## 1 CustomerID 541909 135080 24.93% 0 - 0 - 0 -
## 2 UnitPrice 541909 0 - 2515 0.46% 0 - 0 -
## 3 Description 541909 0 - 0 - 1454 0.27% 0 -
## 4 Country 541909 0 - 0 - 0 - 0 -
## 5 Quantity 541909 0 - 0 - 0 - 0 -
## 6 StockCode 541909 0 - 0 - 0 - 0 -
## 7 InvoiceDate 541909 0 - 0 - 0 - 0 -
## 8 InvoiceNo 541909 0 - 0 - 0 - 0 -
## qDistinct type anomalous_percent
## 1 4373 Integer 24.93%
## 2 1630 Numeric 0.46%
## 3 4224 Character 0.27%
## 4 38 Character -
## 5 722 Integer -
## 6 4070 Character -
## 7 23260 Character -
## 8 25900 Character -
##
## $problem_variables
## [1] Variable q qNA pNA
## [5] qZero pZero qBlank pBlank
## [9] qInf pInf qDistinct type
## [13] anomalous_percent problems
## <0 rows> (or 0-length row.names)
#Key observation: We can see that only 75% of the CustomerID entries have values,
#UnitPrice has 2,515 rows with a value of 0,
#and the product description has 1,454 rows with a very small number of blank values.
diagnose(data) %>%
arrange(desc(missing_count)) %>%
select(variables, types, missing_count, missing_percent, unique_count) %>%
knitr::kable(caption = "Data Quality Overview")
| variables | types | missing_count | missing_percent | unique_count |
|---|---|---|---|---|
| CustomerID | integer | 135080 | 24.92669 | 4373 |
| InvoiceNo | character | 0 | 0.00000 | 25900 |
| StockCode | character | 0 | 0.00000 | 4070 |
| Description | character | 0 | 0.00000 | 4224 |
| Quantity | integer | 0 | 0.00000 | 722 |
| InvoiceDate | character | 0 | 0.00000 | 23260 |
| UnitPrice | numeric | 0 | 0.00000 | 1630 |
| Country | character | 0 | 0.00000 | 38 |
diagnose_outlier(data)
## variables outliers_cnt outliers_ratio outliers_mean with_mean
## 1 Quantity 58619 10.817130 50.59097 9.552250
## 2 UnitPrice 39627 7.312482 31.08475 4.611114
## 3 CustomerID 0 0.000000 NaN 15287.690570
## without_mean
## 1 4.574599
## 2 2.522504
## 3 15287.690570
#When I tried to plot the data, I noticed that the InvoiceDate column was stored as text, so R could not treat it as a timeline. Therefore, I asked AI for help. At the same time, I also wanted to summarise the sales within each day, which helped me understand more clearly how to use summarise().
#I learned that dates in R must be converted into a proper date/datetime format. In addition, because I wanted to highlight the maximum and minimum points to make the graph more informative, I also asked the AI for assistance.
#I asked how to highlight the maximum point, and then I wrote the code for the minimum point by myself.
#I used AI to help troubleshoot errors in ggplot2, and based on its suggestions, I converted the bar chart into a horizontal bar chart (using coord_flip()) to better display the long product descriptions.
library(dplyr)
library(ggplot2)
data <- read.csv("C:/Users/jade9/Downloads/online Retail.csv")
mp <- data %>%
group_by(Description) %>%
summarise(total_quantity = sum(Quantity, na.rm = TRUE)) %>%
arrange(desc(total_quantity))
head10<-head(mp,10)
ggplot(head10, aes(x = reorder(Description, total_quantity), y = total_quantity,fill = Description)) +
geom_col() +
coord_flip() +
labs(x = "Description",
y = "Total Quantity",
title = "Top 10 Most Popular Products")
mostpopular <- "WORLD WAR 2 GLIDERS ASSTD DESIGNS"
sales_mpproduct <- data %>%
filter(Description == mostpopular) %>%
mutate(InvoiceDate = as.POSIXct(InvoiceDate, format = "%d/%m/%Y %H:%M")) %>%
mutate(date = as.Date(InvoiceDate)) %>%
group_by(date) %>% #AI-Assisted , i would like to arrange the date so i have to translate the format of date
summarise(total_quantity = sum(Quantity)) %>% #to show the product sales in one day
arrange(date)
sales_mpproduct
## # A tibble: 235 Ă— 2
## date total_quantity
## <date> <int>
## 1 2010-12-02 3264
## 2 2010-12-03 49
## 3 2010-12-05 96
## 4 2010-12-06 8
## 5 2010-12-07 50
## 6 2010-12-08 240
## 7 2010-12-09 52
## 8 2010-12-10 67
## 9 2010-12-12 96
## 10 2010-12-13 242
## # ℹ 225 more rows
avg <- mean(sales_mpproduct$total_quantity, na.rm = TRUE)
#because the plot looked too plain, I added an average line.t
max_val <- max(sales_mpproduct$total_quantity, na.rm = TRUE)
max_point <- subset(sales_mpproduct, total_quantity == max_val)
min_val <-min(sales_mpproduct$total_quantity,na.rm=TRUE)
min_point <- subset(sales_mpproduct,total_quantity==min_val)
ggplot(sales_mpproduct, aes(date, total_quantity)) +
geom_line(color = "steelblue", size = 1) +
geom_hline(yintercept = avg, color = "red",size=0.8) +
geom_point(data = max_point, color = "black", size = 2)+
geom_point(data = min_point, color = "black", size = 2)+
geom_text( data = max_point, aes(label = paste("Max:", total_quantity)),
vjust = -1,
color = "red",
size = 3
) +
geom_text( data = min_point, aes(label = paste("Min:", total_quantity)),
vjust = -1,
color = "red",
size = 3
) +
labs( title = "Daily sales of most popular products",
x = "date",
y = "Quantity"
) +
theme_minimal()
#From the line chart, it can be observed that sales increased significantly toward the end of the year. This rise may be attributed to the year-end shopping season, including holidays such as Christmas, or promotional activities typically held during that period.
#I wanted to segment customers into three groups—Frequent, Regular, and One-time buyers. However, I wasn’t sure how to calculate how many times each customer made a purchase. I asked AI, and it explained that I could use frequency = n() inside summarise() to count the number of transactions per customer. After getting the frequency, I used my own thresholds to classify the customers into the three segments.
#To enrich the analysis and better understand customer spending behavior, I asked AI which visualization would be most suitable, and based on the suggestion, I added a boxplot to compare total spending across customer segments.
## Question 2: Customer segmentation
library(xray)
library(dplyr)
library(ggplot2)
library(tidyverse)
data <- read.csv("C:/Users/jade9/Downloads/Online Retail.csv")
head(data, 6)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 01/12/2010 08:26 2.55 17850 United Kingdom
## 2 01/12/2010 08:26 3.39 17850 United Kingdom
## 3 01/12/2010 08:26 2.75 17850 United Kingdom
## 4 01/12/2010 08:26 3.39 17850 United Kingdom
## 5 01/12/2010 08:26 3.39 17850 United Kingdom
## 6 01/12/2010 08:26 7.65 17850 United Kingdom
newdata <- data %>%
filter(!is.na(CustomerID)) %>%
filter(Quantity > 0, UnitPrice > 0) %>%
mutate(total_price = Quantity * UnitPrice)
customer_summary <- newdata %>%
group_by(CustomerID) %>%
summarise(total_spend = sum(total_price),
frequency = n() # AI-Assisted I would like to calculate the number of the frequency
)
segment <- customer_summary %>%
mutate(segment = case_when(
frequency > 20 ~ "Frequent",
frequency > 5 & frequency <= 20 ~ "Regular",
TRUE ~ "One-time"
)) # AI-Assisted I want to classify according to frequency
ggplot(segment, aes(segment, fill = segment)) +
geom_bar() +
labs(
title = "Customer Segmentation by Purchase Frequency",
x = "Type",
y = "Number of Customers"
) +
theme_bw()
#The bar chart shows that most customers are “Frequent” buyers, with only a small number of one-time customers, indicating a relatively high level of customer loyalty
ggplot(segment, aes(x = segment, y =total_spend, fill = segment)) +
geom_boxplot(alpha = 0.7) +
scale_y_log10() + # AI-assisted: The original graph was compressed due to large outliers, so I applied a log scale to make the distribution clearer.
labs( title = "Customer Spending Distribution by Segment",
x = "Customer Segment",
y = "Total Spend"
) +
theme_minimal()
#I wanted to analyze the top-selling products in each country using a scatter plot, but I wasn’t sure which variables to use or how to identify the top products. I asked AI for guidance, and it suggested using Country and Description as the grouping variables. After that, I manually selected the top five countries based on total sales, and then extracted the top three best-selling products within each of those countries.
library(dplyr)
library(ggplot2)
library(readr)
library(ggrepel)
data <- read_csv("C:/Users/jade9/Downloads/Online Retail.csv")
data <- data %>%
mutate(Sales = Quantity * UnitPrice)
country_sales <- data %>%
group_by(Country) %>% # to arrange the order by country
summarise(total_sales = sum(Sales, na.rm = TRUE)) %>%
arrange(desc(total_sales)) # to see the highest
head(country_sales, 10)
## # A tibble: 10 Ă— 2
## Country total_sales
## <chr> <dbl>
## 1 United Kingdom 8187806.
## 2 Netherlands 284662.
## 3 EIRE 263277.
## 4 Germany 221698.
## 5 France 197404.
## 6 Australia 137077.
## 7 Switzerland 56385.
## 8 Spain 54775.
## 9 Belgium 40911.
## 10 Sweden 36596.
customer_prefer <- data %>%
group_by(Country, Description) %>%
summarise(total_quantity = sum(Quantity, na.rm = TRUE))
customer_prefer
## # A tibble: 20,155 Ă— 3
## # Groups: Country [38]
## Country Description total_quantity
## <chr> <chr> <dbl>
## 1 Australia 10 COLOUR SPACEBOY PEN 48
## 2 Australia 12 PENCIL SMALL TUBE WOODLAND 384
## 3 Australia 12 PENCILS TALL TUBE POSY 252
## 4 Australia 12 PENCILS TALL TUBE RED RETROSPOT 12
## 5 Australia 16 PIECE CUTLERY SET PANTRY DESIGN 24
## 6 Australia 20 DOLLY PEGS RETROSPOT 24
## 7 Australia 3 HOOK HANGER MAGIC GARDEN 12
## 8 Australia 3 STRIPEY MICE FELTCRAFT 200
## 9 Australia 3 TIER CAKE TIN GREEN AND CREAM 32
## 10 Australia 3 TIER CAKE TIN RED AND CREAM 48
## # ℹ 20,145 more rows
highest5 <- c("United Kingdom", "Netherlands", "EIRE", "Germany", "France")
chose_5 <- data %>%
filter(Country %in% highest5) %>% # AI-Assisted ,I would like to know the only 5 countries in the list
group_by(Country, Description) %>%
summarise(final = sum(Quantity, na.rm = TRUE))
chose_5
## # A tibble: 10,285 Ă— 3
## # Groups: Country [5]
## Country Description final
## <chr> <chr> <dbl>
## 1 EIRE 10 COLOUR SPACEBOY PEN 240
## 2 EIRE 12 COLOURED PARTY BALLOONS 20
## 3 EIRE 12 DAISY PEGS IN WOOD BOX 48
## 4 EIRE 12 EGG HOUSE PAINTED WOOD 25
## 5 EIRE 12 IVORY ROSE PEG PLACE SETTINGS 17
## 6 EIRE 12 MESSAGE CARDS WITH ENVELOPES 36
## 7 EIRE 12 PENCIL SMALL TUBE WOODLAND 48
## 8 EIRE 12 PENCILS SMALL TUBE RED RETROSPOT 24
## 9 EIRE 12 PENCILS TALL TUBE POSY 144
## 10 EIRE 12 PENCILS TALL TUBE RED RETROSPOT 216
## # ℹ 10,275 more rows
top3_in5 <- chose_5 %>%
arrange(Country, desc(final)) %>%
group_by(Country) %>%
slice_head(n = 3) # AI-Assisted I would like to pick up top 3 products in each country
ggplot(top3_in5, aes(x = Country, y = final, color = Country, label = Description)) +
geom_point(size = 1) +
geom_text_repel(size = 3, show.legend = FALSE) + # AI-Assisted I would like to show the product name in graph so I used GPT to help me add the label of Description
theme_bw() +
labs(title = "Top 3 Products in Each Country",
x = "Country",
y = "Quantity"
)
#from the chart, we can see the top three best-selling products in each of the five selected countries. Notably, sales in the United Kingdom are significantly higher than in the other four countries. This suggests that the company could focus more of its marketing strategy on the UK market.
ggplot(filter(country_sales, Country %in% highest5),
aes(x = reorder(Country, -total_sales), y = total_sales, fill = Country)) +
#I initially tried using desc() to sort the countries in descending order, but it didn’t work. After asking AI, I learned that desc() cannot be applied inside, and that using reorder and -total_sales is the correct way to achieve descending order.
geom_col() +
theme_minimal() +
labs(title = "Total Sales of Top 5 Countries", x = "Country", y = "Total Sales")
[If you used AI here: briefly describe what you asked and what you learned]
#In this analysis, I used GPT to assist me throughout the entire process. For the parts I wanted to focus on, I first expressed my own viewpoints, then let GPT check whether they matched the question. I also provided it with some basic formulas so it could make corrections, and afterward, I retyped everything myself.
#Through this analysis, I identified the most popular products in the OnlineSales dataset and examined customer purchasing frequency.
#Customers were classified into three categories: regular customers, one-time buyers, and returning customers.
#Using scatter plot analysis, I further identified the top three products preferred in each region.
#From a business perspective, these findings suggest that companies can optimize product offerings by region or country,
#focusing production and marketing efforts on the items most favored by local customers.
#(1) What story does your dashboard tell? The interactive dashboard focuses on analyzing sales performance across different countries and identifying regional differences in total sales and seasonal patterns.
#(2) Which 2-3 key insights did you choose to visualize? The visualization reveals that the United Kingdom contributes the largest share of sales, far exceeding other Western European countries. It also displays monthly sales trends and seasonal changes among the top ten popular products.
#(3) Why are these insights important for business decisions? These insights help businesses understand that the U.K. is the key export market, allowing them to allocate resources, adjust inventory, and plan marketing strategies more effectively toward this dominant region.