Context: The dataset AustraliaSales.csv
contains data on
orders made to a cycling equipment store in Australia. Each observation
represents an order and provides information regarding the customer,
products purchased as well as the cost, revenue and profit made. Each of
the column is defined as follows:
Date
: Date of orderDay
: Day of orderMonth
: Month of orderYear
: Year of orderCustomer_Age
: Customer ageAge_Group
: Customer age groupCustomer_Gender
: Customer genderState
: State (in Australia) of customer addressProduct_Category
: Product categorySub_Category
: Sub category of productProduct
: Product orderedOrder_Quantity
: Order quantityUnit_Cost
: Cost per unit of productUnit_Price
: Price per unit of productProfit
: Profit for the order (Difference between
Revenue and Cost)Cost
: Cost for the orderRevenue
: Revenue for the order#put in your working directory folder pathname
setwd("/Users/jeriellai/BT1101 Introduction to Business Analytics/Tutorial 3")
getwd()
## [1] "/Users/jeriellai/BT1101 Introduction to Business Analytics/Tutorial 3"
#import excel file into RStudio
dfBS <- read.csv("AustraliaSales.csv")
#import libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidyverse") #need to call the library before you use the packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.2 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("knitr")
library("readxl") # this is required for the read_excel function
library("RColorBrewer") # for color palette
Conduct an analyses to develop a profile of the customers who have made purchases of the bike equipment. For each of the following variables, create a table and chart to describe the frequency distributions. (6 marks)
Describe your observations from the above analyses. (1.5 marks)
BEGIN: YOUR ANSWER
# Type your codes
# 2ai
Freqi <- dfBS %>% count(Age_Group)
kable(Freqi, caption = "Frequency of Customers by Age Group", col.names= c("Age Group", "Frequency"))
Age Group | Frequency |
---|---|
Adults (35-64) | 10394 |
Seniors (64+) | 58 |
Young Adults (25-34) | 9102 |
Youth (<25) | 4382 |
par(mar = c(5, 10, 4, 2))
b1 <- barplot(Freqi$n,
names.arg = Freqi$Age_Group,
col = 'blue',
main = 'Frequency of Customers by Age Group',
cex.names = 0.55,
xlim = c(0, 12000),
xlab = 'No. of customers',
horiz = TRUE,
las = 1)
text(x = Freqi$n + 700,
y = b1,
labels = Freqi$n,
pos = 1
)
# Type your codes
# 2aii
Freqii <- dfBS %>% count(Customer_Gender)
kable(Freqii, caption = "Frequency of Customers by Gender", col.names= c("Gender", "Frequency"))
Gender | Frequency |
---|---|
F | 11506 |
M | 12430 |
slice.data <- Freqii$n
piepercent <- 100 * round(Freqii$n/sum(Freqii$n), 2)
label <- Freqii$Customer_Gender #change to the variable name
label <- paste(label, ",", piepercent, "%", sep = "")
pie(slice.data,
labels = label,
col = c("blue", "cyan", "dodgerblue"),
radius = 1,
main = "Frequency of Customers by Gender") # change caption accordingly
# Type your codes
# 2aiii
Freqiii <- dfBS %>% count(State)
kable(Freqiii, caption = "Frequency of Customers by State", col.names= c("State", "Frequency"))
State | Frequency |
---|---|
New South Wales | 10412 |
Queensland | 5220 |
South Australia | 1564 |
Tasmania | 724 |
Victoria | 6016 |
par(mar = c(5, 10, 4, 2))
b2 <- barplot(Freqiii$n,
names.arg = Freqiii$State,
col = 'blue',
main = 'Frequency of Customers by State',
cex.names = 0.55,
xlim = c(0, 12000),
xlab = 'No. of customers',
horiz = TRUE,
las = 1)
text(x = Freqiii$n + 700,
y = b2,
labels = Freqiii$n,
pos = 1)
Most customers are from the adults (35-64) and young adults (25-34) age groups, both contributing to around 81% of total transactions. Seniors (64+) form the smallest customer based with around 0.2% of total market share. The proportion of male and female customers are relatively even (52% and 48% respectively). By state, New South Wales is where the majority of customers are from (around 43%), followed by Victoria (around 25%) and Queensland (around 22%), whereas customers from smaller states such as Tasmania contribute very few transactions (around 3%).
END: YOUR ANSWER
Store manager is also interested in understanding the frequency of transactions involving different products. (2 marks)
Product_Category
Product_Category
and Sub_Category
Describe your insights from the charts. (1.5 marks)
BEGIN: YOUR ANSWER
# Type your codes
# 2bi
Freqiv <- dfBS %>% count(Product_Category)
kable(Freqiv, caption = "Frequency of Transactions by Product Category", col.names= c("Product Category", "Frequency"))
Product Category | Frequency |
---|---|
Accessories | 13498 |
Bikes | 7064 |
Clothing | 3374 |
Accessories comprises the majority of total transactions (around 57%), making up more than half of all sales. Bikes follow with 7064 transactions (around 30%), while clothing is the smallest category with around 14% of total transactions. This table shows that smaller items (accessories) are purchased more often than larger items (bicycles and clothing).
# Type your codes
# 2bii
Freqv <- dfBS %>% group_by(Product_Category) %>% count(Sub_Category)
kable(Freqv, caption = "Frequency of Transactions by Sub Category of Product Category", col.names= c("Product Category", "Sub Category", "Frequency"))
Product Category | Sub Category | Frequency |
---|---|---|
Accessories | Bike Racks | 90 |
Accessories | Bike Stands | 120 |
Accessories | Bottles and Cages | 3144 |
Accessories | Cleaners | 430 |
Accessories | Fenders | 628 |
Accessories | Helmets | 2346 |
Accessories | Hydration Packs | 344 |
Accessories | Tires and Tubes | 6396 |
Bikes | Mountain Bikes | 2156 |
Bikes | Road Bikes | 4104 |
Bikes | Touring Bikes | 804 |
Clothing | Caps | 844 |
Clothing | Gloves | 578 |
Clothing | Jerseys | 1268 |
Clothing | Shorts | 264 |
Clothing | Socks | 224 |
Clothing | Vests | 196 |
Within accessories, tires and tubes collectively account for the majority of total transactions (around 47%), while hydration packs comprise the smallest share of total transactions within accessories (around 2.5%). For clothing, jerseys dominate (around 38% of total clothing transactions), while shorts, socks and vests comprise the smallest market share for clothing (around 6-8% for each subcategory). For bikes, road bikes are the most popular, comprising around 60% of total transactions for bikes. Overall, tires and tubes are the most popular subcategory of products, followed by road bikes with bike racks being the least popular.
END: YOUR ANSWER
Describe the trends observed from the above analyses. (2 marks)
# Type your codes
# 2ci
Freq2ci <- dfBS %>% count(Year)
b4 <- barplot(Freq2ci$n,
names.arg = Freq2ci$Year,
col = 'blue',
main = 'Frequency of Transactions over the Years',
cex.names = 0.55,
xlim = c(0, 8000),
xlab = 'No. of Transactions',
horiz = TRUE,
las = 1)
text(x = Freq2ci$n + 400,
y = b4,
labels = Freq2ci$n,
pos = 1)
The number of transactions remain constant in 2011 and 2012 but in 2013, the number of transactions increased to around 6 times (5335) of that of the previous year (859) and remained relatively constant after at the 5300-5800 total yearly transactions level.
# Type your codes
# 2cii
Freq2cii <- dfBS %>% group_by(Year, Product_Category) %>% tally()
Freq2cii.spread <- Freq2cii %>% tidyr::spread(key = Product_Category, value = n)
Freq2cii.spread[is.na(Freq2cii.spread)] <- 0
barmat <- as.matrix(Freq2cii.spread[, -1])
rownames(barmat) <- Freq2cii.spread$Year
par(mar = c(5, 6, 5, 8), xpd = TRUE)
bp <- barplot(t(barmat),
beside = TRUE,
col = rainbow(ncol(barmat)),
main = "Frequency of Each Product Category Across Years",
xlab = "Year",
ylab = "No. of Transactions",
names.arg = rownames(barmat),
ylim = c(0, max(barmat) * 1.3))
legend("topright",
inset = c(-0.15, 0),
legend = colnames(barmat),
fill = rainbow(ncol(barmat)),
cex = 0.7,
title = "Product Category")
text(x = bp,
y = t(barmat) + 200,
labels = t(barmat),
cex = 0.7)
The number of transactions increased significantly in 2013 compared to 2012 mainly because the store started selling accessories (accounting for the additional 2855 number of transactions) and clothing (accounting for the additional 763 number of transactions). Moreover, the total number of transactions for bikes doubled (1687) compared to 2012 (859). The total number of transactions from 2013 to 2016 remain relatively constant but the number of transactions for bikes, clothing and accessories fluctuate yearly.
END: YOUR ANSWER
Revenue
. You may use the default break points but make
sure the tic marks on the chart is aligned to each bar on the chart. (3
marks)Revenue
and Profit
. (1 mark)Describe your observations from each chart. Is there any further investigations that you would like to explore? (3 marks)
# Type your codes
# 2di
hist_input1 <- dfBS$Revenue
hist_rev <- hist(dfBS$Revenue,
main = "Histogram of Revenue",
xlab = "Revenue",
ylab = "Frequency",
col = "darkorange",
breaks = 20,
labels = TRUE,
ylim = c(0, max(hist(dfBS$Revenue, breaks = 20, plot = FALSE)$counts) * 1.2))
rev_bins <- cut(hist_input1, hist_rev$breaks, include.lowest = TRUE, dig.lab = 6)
rev_table <- table(rev_bins)
kable(rev_table, caption = "Frequency Distribution of Revenue")
rev_bins | Freq |
---|---|
[0,5000] | 23308 |
(5000,10000] | 618 |
(10000,15000] | 8 |
(15000,20000] | 0 |
(20000,25000] | 0 |
(25000,30000] | 0 |
(30000,35000] | 0 |
(35000,40000] | 0 |
(40000,45000] | 0 |
(45000,50000] | 0 |
(50000,55000] | 1 |
(55000,60000] | 1 |
The revenue distribution is extremely right-skewed. A vast majority of transactions (23,308, or more than 97%) fall within the lowest band of 0–5,000 AUD. Only 618 transactions fall between 5,001–10,000 AUD, and beyond 10,000 AUD there are just 10 transactions in total. Notably, there is a single transaction each in the 50,000–55,000 AUD and 55,000–60,000 AUD ranges. This shows that the business is overwhelmingly driven by small orders, with only a handful of very high-value transactions forming a long tail.
# Type your codes
# 2dii
# 2dii
plot(dfBS$Revenue, dfBS$Profit,
main = "Scatterplot of Profit vs Revenue",
xlab = "Revenue",
ylab = "Profit")
abline(lm(Profit ~ Revenue, data = dfBS))
The scatterplot shows that revenue and profit have a strong positive linear correlation. Transactions with higher revenue generally correspond to higher profit. However, profit levels, especially for transactions below 10000 AUD, are also spread out for a given revenue, pointing to differences in margins across products. The few very high-revenue transactions (more than 50000 AUD) deviate from the trend, with unusually low or irregular profit outcomes (lie below the linear line of best fit), likely due to bulk discounts, contractual pricing, or data errors, which warrant further investigation.
END: YOUR ANSWER
Conduct pareto analyses on Revenue
. Report the number
and percentage of transactions that contribute most, amounting to 80% of
the total revenue.
For tutorial discussion: what follow up questions do you have after performing the Pareto Analyses?
BEGIN: YOUR ANSWER
# Type your codes
# 2e
pareto_dataset <- dfBS
pareto <- pareto_dataset %>%
select(Revenue) %>%
arrange(desc(Revenue))
pareto$percentage <- pareto$Revenue / sum(pareto$Revenue)
pareto$cumulative <- cumsum(pareto$percentage)
num_indi <- which(pareto$cumulative > 0.8)[1]
num_indi
## [1] 6360
percent_indi <- num_indi / nrow(pareto)
percent_indi
## [1] 0.2657086
The Pareto analysis shows that about 6,360 transactions, only around 26.6% of all orders, contribute to 80% of the store’s total revenue. This confirms the Pareto principle that a relatively small proportion of high-value orders drive the majority of revenue. A logical next step would be to profile these transactions by product category, state, and customer segment to see which groups are most responsible for this revenue concentration.
END: YOUR ANSWER