Understanding customer behavior and purchasing patterns is vital for online businesses. Your objective is to analyze transaction data from an online retailer to uncover sales trends, customer segments, and potential improvements.
The dataset captures transactions from December 2010 to December 2011 from a UK-based online retaile
library(tidyverse)
library(lubridate)
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(dlookr)
library(knitr)
data <- read.csv("C:/Users/jade9/Downloads/online Retail.csv")
library(xray)
data <- read.csv("C:/Users/jade9/Downloads/Online Retail.csv")
xray::anomalies(data)
## $variables
## Variable q qNA pNA qZero pZero qBlank pBlank qInf pInf
## 1 CustomerID 541909 135080 24.93% 0 - 0 - 0 -
## 2 UnitPrice 541909 0 - 2515 0.46% 0 - 0 -
## 3 Description 541909 0 - 0 - 1454 0.27% 0 -
## 4 Country 541909 0 - 0 - 0 - 0 -
## 5 Quantity 541909 0 - 0 - 0 - 0 -
## 6 StockCode 541909 0 - 0 - 0 - 0 -
## 7 InvoiceDate 541909 0 - 0 - 0 - 0 -
## 8 InvoiceNo 541909 0 - 0 - 0 - 0 -
## qDistinct type anomalous_percent
## 1 4373 Integer 24.93%
## 2 1630 Numeric 0.46%
## 3 4224 Character 0.27%
## 4 38 Character -
## 5 722 Integer -
## 6 4070 Character -
## 7 23260 Character -
## 8 25900 Character -
##
## $problem_variables
## [1] Variable q qNA pNA
## [5] qZero pZero qBlank pBlank
## [9] qInf pInf qDistinct type
## [13] anomalous_percent problems
## <0 rows> (or 0-length row.names)
#Key observation: We can see that only 75% of the CustomerID entries have values,
#UnitPrice has 2,515 rows with a value of 0,
#and the product description has 1,454 rows with a very small number of blank values.
diagnose(data) %>%
arrange(desc(missing_count)) %>%
select(variables, types, missing_count, missing_percent, unique_count) %>%
knitr::kable(caption = "Data Quality Overview")
| variables | types | missing_count | missing_percent | unique_count |
|---|---|---|---|---|
| CustomerID | integer | 135080 | 24.92669 | 4373 |
| InvoiceNo | character | 0 | 0.00000 | 25900 |
| StockCode | character | 0 | 0.00000 | 4070 |
| Description | character | 0 | 0.00000 | 4224 |
| Quantity | integer | 0 | 0.00000 | 722 |
| InvoiceDate | character | 0 | 0.00000 | 23260 |
| UnitPrice | numeric | 0 | 0.00000 | 1630 |
| Country | character | 0 | 0.00000 | 38 |
diagnose_outlier(data)
## variables outliers_cnt outliers_ratio outliers_mean with_mean
## 1 Quantity 58619 10.817130 50.59097 9.552250
## 2 UnitPrice 39627 7.312482 31.08475 4.611114
## 3 CustomerID 0 0.000000 NaN 15287.690570
## without_mean
## 1 4.574599
## 2 2.522504
## 3 15287.690570
#I did not initially know how to create a line chart showing the sales trend of the most popular product. When I tried to plot the data, I discovered that the InvoiceDate column was stored as text, so R could not treat it as a timeline. Therefore, I asked AI for help. The AI told me that I needed to convert the text date into a real datetime format (POSIXct) before plotting. After that, because I wanted to show the daily sales, I asked the AI how to summarise the quantity by date. The AI provided the code for converting the datetime to a Date object correctly.
#From this AI-assisted process, I learned that dates in R must be converted into a proper date/datetime format before they can be used for plotting. I also gained a clearer understanding of what summarise() is used for — it helps aggregate data, such as calculating the total sales for each day.
library(dplyr)
install.packages("ggplot2")
library(ggplot2)
data <- read.csv("C:/Users/jade9/Downloads/online Retail.csv")
mp <- data %>%
group_by(Description) %>%
summarise(total_quantity = sum(Quantity, na.rm = TRUE)) %>%
arrange(desc(total_quantity))
head(mp,10)
## # A tibble: 10 × 2
## Description total_quantity
## <chr> <int>
## 1 "WORLD WAR 2 GLIDERS ASSTD DESIGNS" 53847
## 2 "JUMBO BAG RED RETROSPOT" 47363
## 3 "ASSORTED COLOUR BIRD ORNAMENT" 36381
## 4 "POPCORN HOLDER" 36334
## 5 "PACK OF 72 RETROSPOT CAKE CASES" 36039
## 6 "WHITE HANGING HEART T-LIGHT HOLDER" 35317
## 7 "RABBIT NIGHT LIGHT" 30680
## 8 "MINI PAINT SET VINTAGE " 26437
## 9 "PACK OF 12 LONDON TISSUES " 26315
## 10 "PACK OF 60 PINK PAISLEY CAKE CASES" 24753
mostpopular <- "WORLD WAR 2 GLIDERS ASSTD DESIGNS"
sales_mpproduct <- data %>%
filter(Description == mostpopular) %>%
mutate(InvoiceDate = as.POSIXct(InvoiceDate, format = "%d/%m/%Y %H:%M")) %>%
mutate(date = as.Date(InvoiceDate)) %>%
group_by(date) %>% #AI-Assisted , i would like to arrange the date so i have to translate the format of date
summarise(total_quantity = sum(Quantity)) %>% #to show the product sales in one day
arrange(date)
sales_mpproduct
## # A tibble: 235 × 2
## date total_quantity
## <date> <int>
## 1 2010-12-02 3264
## 2 2010-12-03 49
## 3 2010-12-05 96
## 4 2010-12-06 8
## 5 2010-12-07 50
## 6 2010-12-08 240
## 7 2010-12-09 52
## 8 2010-12-10 67
## 9 2010-12-12 96
## 10 2010-12-13 242
## # ℹ 225 more rows
ggplot(sales_mpproduct, aes(date,total_quantity))+
geom_line(color="blue")
labs(title="Daily sales of most popular products",
x="date",
y="Total Quantity")+
theme_bw()
## NULL
#From the line chart, it can be observed that sales increased significantly toward the end of the year. This rise may be attributed to the year-end shopping season, including holidays such as Christmas, or promotional activities typically held during that period.
#I wanted to segment customers into three groups—Frequent, Regular, and One-time buyers. However, I wasn’t sure how to calculate how many times each customer made a purchase. I asked AI, and it explained that I could use frequency = n() inside summarise() to count the number of transactions per customer. After getting the frequency, I used my own thresholds to classify the customers into the three segments.
## Question 2: Customer segmentation
install.packages("xray")
install.packages("dplyr")
install.packages("ggplot2")
library(xray)
library(dplyr)
library(ggplot2)
library(tidyverse)
data <- read.csv("C:/Users/jade9/Downloads/Online Retail.csv")
head(data, 6)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2
## InvoiceDate UnitPrice CustomerID Country
## 1 01/12/2010 08:26 2.55 17850 United Kingdom
## 2 01/12/2010 08:26 3.39 17850 United Kingdom
## 3 01/12/2010 08:26 2.75 17850 United Kingdom
## 4 01/12/2010 08:26 3.39 17850 United Kingdom
## 5 01/12/2010 08:26 3.39 17850 United Kingdom
## 6 01/12/2010 08:26 7.65 17850 United Kingdom
newdata <- data %>%
filter(!is.na(CustomerID)) %>%
filter(Quantity > 0, UnitPrice > 0) %>%
mutate(total_price = Quantity * UnitPrice)
customer_summary <- newdata %>%
group_by(CustomerID) %>%
summarise(
total_spend = sum(total_price),
frequency = n() # AI-Assisted I would like to calculate the number of the frequency
)
segment <- customer_summary %>%
mutate(segment = case_when(
frequency > 20 ~ "Frequent",
frequency > 5 & frequency <= 20 ~ "Regular",
TRUE ~ "One-time"
)) # AI-Assisted I want to classify according to frequency
ggplot(segment, aes(segment, fill = segment)) +
geom_bar() +
labs(
title = "Customer Segmentation by Purchase Frequency",
x = "Type",
y = "Number of Customers"
) +
theme_bw()
#The bar chart shows that most customers are “Frequent” buyers, with only a small number of one-time customers, indicating a relatively high level of customer loyalty
#I wanted to analyze the top-selling products in each country using a scatter plot, but I wasn’t sure which variables to use or how to identify the top products. I asked AI for guidance, and it suggested using Country and Description as the grouping variables. After that, I manually selected the top five countries based on total sales, and then extracted the top three best-selling products within each of those countries.
library(dplyr)
library(ggplot2)
library(readr)
library(ggrepel)
data <- read_csv("C:/Users/jade9/Downloads/Online Retail.csv")
data <- data %>%
mutate(Sales = Quantity * UnitPrice)
country_sales <- data %>%
group_by(Country) %>% # to arrange the order by country
summarise(total_sales = sum(Sales, na.rm = TRUE)) %>%
arrange(desc(total_sales)) # to see the highest
head(country_sales, 10)
## # A tibble: 10 × 2
## Country total_sales
## <chr> <dbl>
## 1 United Kingdom 8187806.
## 2 Netherlands 284662.
## 3 EIRE 263277.
## 4 Germany 221698.
## 5 France 197404.
## 6 Australia 137077.
## 7 Switzerland 56385.
## 8 Spain 54775.
## 9 Belgium 40911.
## 10 Sweden 36596.
customer_prefer <- data %>%
group_by(Country, Description) %>%
summarise(total_quantity = sum(Quantity, na.rm = TRUE))
customer_prefer
## # A tibble: 20,155 × 3
## # Groups: Country [38]
## Country Description total_quantity
## <chr> <chr> <dbl>
## 1 Australia 10 COLOUR SPACEBOY PEN 48
## 2 Australia 12 PENCIL SMALL TUBE WOODLAND 384
## 3 Australia 12 PENCILS TALL TUBE POSY 252
## 4 Australia 12 PENCILS TALL TUBE RED RETROSPOT 12
## 5 Australia 16 PIECE CUTLERY SET PANTRY DESIGN 24
## 6 Australia 20 DOLLY PEGS RETROSPOT 24
## 7 Australia 3 HOOK HANGER MAGIC GARDEN 12
## 8 Australia 3 STRIPEY MICE FELTCRAFT 200
## 9 Australia 3 TIER CAKE TIN GREEN AND CREAM 32
## 10 Australia 3 TIER CAKE TIN RED AND CREAM 48
## # ℹ 20,145 more rows
highest5 <- c("United Kingdom", "Netherlands", "EIRE", "Germany", "France")
chose_5 <- data %>%
filter(Country %in% highest5) %>% # AI-Assisted ,I would like to know the only 5 countries in the list
group_by(Country, Description) %>%
summarise(final = sum(Quantity, na.rm = TRUE))
chose_5
## # A tibble: 10,285 × 3
## # Groups: Country [5]
## Country Description final
## <chr> <chr> <dbl>
## 1 EIRE 10 COLOUR SPACEBOY PEN 240
## 2 EIRE 12 COLOURED PARTY BALLOONS 20
## 3 EIRE 12 DAISY PEGS IN WOOD BOX 48
## 4 EIRE 12 EGG HOUSE PAINTED WOOD 25
## 5 EIRE 12 IVORY ROSE PEG PLACE SETTINGS 17
## 6 EIRE 12 MESSAGE CARDS WITH ENVELOPES 36
## 7 EIRE 12 PENCIL SMALL TUBE WOODLAND 48
## 8 EIRE 12 PENCILS SMALL TUBE RED RETROSPOT 24
## 9 EIRE 12 PENCILS TALL TUBE POSY 144
## 10 EIRE 12 PENCILS TALL TUBE RED RETROSPOT 216
## # ℹ 10,275 more rows
top3_in5 <- chose_5 %>%
arrange(Country, desc(final)) %>%
group_by(Country) %>%
slice_head(n = 3) # AI-Assisted I would like to pick up top 3 products in each country
ggplot(top3_in5, aes(x = Country, y = final, color = Country, label = Description)) +
geom_point(size = 3) +
geom_text_repel(size = 3, show.legend = FALSE) + # AI-Assisted I would like to show the product name in graph so I used GPT to help me add the label of Description
theme_bw() +
labs(
title = "Top 3 Products in Each Country",
x = "Country",
y = "Quantity"
)
#from the chart, we can see the top three best-selling products in each of the five selected countries. Notably, sales in the United Kingdom are significantly higher than in the other four countries. This suggests that the company could focus more of its marketing strategy on the UK market.
[If you used AI here: briefly describe what you asked and what you learned]
#In this analysis, I used GPT to assist me throughout the entire process. For the parts I wanted to focus on, I first expressed my own viewpoints, then let GPT check whether they matched the question. I also provided it with some basic formulas so it could make corrections, and afterward, I retyped everything myself.
#Through this analysis, I identified the most popular products in the OnlineSales dataset and examined customer purchasing frequency.
#Customers were classified into three categories: regular customers, one-time buyers, and returning customers.
#Using scatter plot analysis, I further identified the top three products preferred in each region.
#From a business perspective, these findings suggest that companies can optimize product offerings by region or country,
#focusing production and marketing efforts on the items most favored by local customers.
#(1) What story does your dashboard tell? The interactive dashboard focuses on analyzing sales performance across different countries and identifying regional differences in total sales and seasonal patterns.
#(2) Which 2-3 key insights did you choose to visualize? The visualization reveals that the United Kingdom contributes the largest share of sales, far exceeding other Western European countries. It also displays monthly sales trends and seasonal changes among the top ten popular products.
#(3) Why are these insights important for business decisions? These insights help businesses understand that the U.K. is the key export market, allowing them to allocate resources, adjust inventory, and plan marketing strategies more effectively toward this dominant region.