BT1101 AY2526S1-Tutorial 3 (Part 2 Deadline: 15 Sep 9am)

Tutorial 3 Part 2 (30 marks - To be submitted by 15 Sep 9am)

Context: The dataset AustraliaSales.csv contains data on orders made to a cycling equipment store in Australia. Each observation represents an order and provides information regarding the customer, products purchased as well as the cost, revenue and profit made. Each of the column is defined as follows:

Date : Date of order
Day : Day of order
Month : Month of order
Year: Year of order
Customer_Age: Customer age
Age_Group: Customer age group
Customer_Gender: Customer gender
State: State (in Australia) of customer address
Product_Category: Product category
Sub_Category: Sub category of product
Product: Product ordered
Order_Quantity: Order quantity
Unit_Cost: Cost per unit of product
Unit_Price: Price per unit of product
Profit: Profit for the order (Difference between Revenue and Cost)
Cost: Cost for the order
Revenue: Revenue for the order

#put in your working directory folder pathname 
setwd("/Users/jeriellai/BT1101 Introduction to Business Analytics/Tutorial 3")
getwd()

## [1] "/Users/jeriellai/BT1101 Introduction to Business Analytics/Tutorial 3"

#import excel file into RStudio
dfBS <- read.csv("AustraliaSales.csv")

#import libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("tidyverse") #need to call the library before you use the packages

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.2     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("knitr")
library("readxl") # this is required for the read_excel function
library("RColorBrewer") # for color palette

Q2.(a) Customer Profile Dashboard (7.5 marks)

Conduct an analyses to develop a profile of the customers who have made purchases of the bike equipment. For each of the following variables, create a table and chart to describe the frequency distributions. (6 marks)

1. Age_Group
1. Customer_Gender
1. State

Describe your observations from the above analyses. (1.5 marks)

BEGIN: YOUR ANSWER

# Type your codes
# 2ai 
Freqi <- dfBS %>% count(Age_Group)
kable(Freqi, caption = "Frequency of Customers by Age Group", col.names= c("Age Group", "Frequency"))

Frequency of Customers by Age Group
Age Group	Frequency
Adults (35-64)	10394
Seniors (64+)	58
Young Adults (25-34)	9102
Youth (<25)	4382

par(mar = c(5, 10, 4, 2))

b1 <- barplot(Freqi$n,
        names.arg = Freqi$Age_Group,
        col = 'blue', 
        main = 'Frequency of Customers by Age Group',
        cex.names = 0.55,
        xlim = c(0, 12000), 
        xlab = 'No. of customers', 
        horiz = TRUE,
        las = 1)

text(x = Freqi$n + 700,
     y = b1,
     labels = Freqi$n,
     pos = 1
       )

# Type your codes
# 2aii 
Freqii <- dfBS %>% count(Customer_Gender)
kable(Freqii, caption = "Frequency of Customers by Gender", col.names= c("Gender", "Frequency"))

Frequency of Customers by Gender
Gender	Frequency
F	11506
M	12430

slice.data <- Freqii$n
piepercent <- 100 * round(Freqii$n/sum(Freqii$n), 2)
label <- Freqii$Customer_Gender #change to the variable name
label <- paste(label, ",", piepercent, "%", sep = "")
pie(slice.data,
      labels = label,
      col = c("blue", "cyan", "dodgerblue"),
      radius = 1,
      main = "Frequency of Customers by Gender") # change caption accordingly

# Type your codes
# 2aiii 
Freqiii <- dfBS %>% count(State)
kable(Freqiii, caption = "Frequency of Customers by State", col.names= c("State", "Frequency"))

Frequency of Customers by State
State	Frequency
New South Wales	10412
Queensland	5220
South Australia	1564
Tasmania	724
Victoria	6016

par(mar = c(5, 10, 4, 2))

b2 <- barplot(Freqiii$n,
        names.arg = Freqiii$State,
        col = 'blue', 
        main = 'Frequency of Customers by State',
        cex.names = 0.55,
        xlim = c(0, 12000), 
        xlab = 'No. of customers', 
        horiz = TRUE,
        las = 1)

text(x = Freqiii$n + 700,  
     y = b2,
     labels = Freqiii$n,
     pos = 1)

Most customers are from the adults (35-64) and young adults (25-34) age groups, both contributing to around 81% of total transactions. Seniors (64+) form the smallest customer based with around 0.2% of total market share. The proportion of male and female customers are relatively even (52% and 48% respectively). By state, New South Wales is where the majority of customers are from (around 43%), followed by Victoria (around 25%) and Queensland (around 22%), whereas customers from smaller states such as Tasmania contribute very few transactions (around 3%).

END: YOUR ANSWER

Q2.(b) Product Sales Analyses Dashboard (3.5 marks)

Store manager is also interested in understanding the frequency of transactions involving different products. (2 marks)

1. Create a table to display frequency of transactions by Product_Category
1. Create a table to display frequency of transactions by Product_Category and Sub_Category

Describe your insights from the charts. (1.5 marks)

BEGIN: YOUR ANSWER

# Type your codes
# 2bi 
Freqiv <- dfBS %>% count(Product_Category)
kable(Freqiv, caption = "Frequency of Transactions by Product Category", col.names= c("Product Category", "Frequency"))

Frequency of Transactions by Product Category
Product Category	Frequency
Accessories	13498
Bikes	7064
Clothing	3374

Accessories comprises the majority of total transactions (around 57%), making up more than half of all sales. Bikes follow with 7064 transactions (around 30%), while clothing is the smallest category with around 14% of total transactions. This table shows that smaller items (accessories) are purchased more often than larger items (bicycles and clothing).

# Type your codes
# 2bii 
Freqv <- dfBS %>% group_by(Product_Category) %>% count(Sub_Category)
kable(Freqv, caption = "Frequency of Transactions by Sub Category of Product Category", col.names= c("Product Category", "Sub Category", "Frequency"))

Frequency of Transactions by Sub Category of Product Category
Product Category	Sub Category	Frequency
Accessories	Bike Racks	90
Accessories	Bike Stands	120
Accessories	Bottles and Cages	3144
Accessories	Cleaners	430
Accessories	Fenders	628
Accessories	Helmets	2346
Accessories	Hydration Packs	344
Accessories	Tires and Tubes	6396
Bikes	Mountain Bikes	2156
Bikes	Road Bikes	4104
Bikes	Touring Bikes	804
Clothing	Caps	844
Clothing	Gloves	578
Clothing	Jerseys	1268
Clothing	Shorts	264
Clothing	Socks	224
Clothing	Vests	196

Within accessories, tires and tubes collectively account for the majority of total transactions (around 47%), while hydration packs comprise the smallest share of total transactions within accessories (around 2.5%). For clothing, jerseys dominate (around 38% of total clothing transactions), while shorts, socks and vests comprise the smallest market share for clothing (around 6-8% for each subcategory). For bikes, road bikes are the most popular, comprising around 60% of total transactions for bikes. Overall, tires and tubes are the most popular subcategory of products, followed by road bikes with bike racks being the least popular.

END: YOUR ANSWER

Q2.(c) Transactions by Time Analyses Dashboard (7 marks)

1. Plot a chart to display how the frequency of transactions vary over the years from. (2 marks)
1. Plot a chart to compare the frequency of each product category across years. (3 marks)

Describe the trends observed from the above analyses. (2 marks)

# Type your codes
# 2ci 

Freq2ci <- dfBS %>% count(Year)

b4 <- barplot(Freq2ci$n,
        names.arg = Freq2ci$Year,
        col = 'blue', 
        main = 'Frequency of Transactions over the Years',
        cex.names = 0.55,
        xlim = c(0, 8000), 
        xlab = 'No. of Transactions', 
        horiz = TRUE,
        las = 1)

text(x = Freq2ci$n + 400,  
     y = b4,
     labels = Freq2ci$n,
     pos = 1)

The number of transactions remain constant in 2011 and 2012 but in 2013, the number of transactions increased to around 6 times (5335) of that of the previous year (859) and remained relatively constant after at the 5300-5800 total yearly transactions level.

# Type your codes
# 2cii 

Freq2cii <- dfBS %>% group_by(Year, Product_Category) %>% tally()
Freq2cii.spread <- Freq2cii %>% tidyr::spread(key = Product_Category, value = n)
Freq2cii.spread[is.na(Freq2cii.spread)] <- 0

barmat <- as.matrix(Freq2cii.spread[, -1]) 
rownames(barmat) <- Freq2cii.spread$Year   

par(mar = c(5, 6, 5, 8), xpd = TRUE)
bp <- barplot(t(barmat),
              beside = TRUE,
              col = rainbow(ncol(barmat)),
              main = "Frequency of Each Product Category Across Years",
              xlab = "Year",
              ylab = "No. of Transactions",
              names.arg = rownames(barmat),
              ylim = c(0, max(barmat) * 1.3))

legend("topright",
       inset = c(-0.15, 0), 
       legend = colnames(barmat),
       fill = rainbow(ncol(barmat)),
       cex = 0.7,
       title = "Product Category")

text(x = bp,
     y = t(barmat) + 200,              
     labels = t(barmat),
     cex = 0.7)

The number of transactions increased significantly in 2013 compared to 2012 mainly because the store started selling accessories (accounting for the additional 2855 number of transactions) and clothing (accounting for the additional 763 number of transactions). Moreover, the total number of transactions for bikes doubled (1687) compared to 2012 (859). The total number of transactions from 2013 to 2016 remain relatively constant but the number of transactions for bikes, clothing and accessories fluctuate yearly.

END: YOUR ANSWER

Q2.(d) Revenue Analyses Dashboard (7 marks)

1. Create a table and chart to display for the frequency distribution for Revenue. You may use the default break points but make sure the tic marks on the chart is aligned to each bar on the chart. (3 marks)
1. Create a chart to display the relationship between Revenue and Profit. (1 mark)

Describe your observations from each chart. Is there any further investigations that you would like to explore? (3 marks)

# Type your codes
# 2di 

hist_input1 <- dfBS$Revenue
hist_rev <- hist(dfBS$Revenue,
                 main = "Histogram of Revenue",
                 xlab = "Revenue",
                 ylab = "Frequency",
                 col  = "darkorange",
                 breaks = 20,
                 labels = TRUE,
                 ylim = c(0, max(hist(dfBS$Revenue, breaks = 20, plot = FALSE)$counts) * 1.2))

rev_bins <- cut(hist_input1, hist_rev$breaks, include.lowest = TRUE, dig.lab = 6)
rev_table <- table(rev_bins)
kable(rev_table, caption = "Frequency Distribution of Revenue")

Frequency Distribution of Revenue
rev_bins	Freq
[0,5000]	23308
(5000,10000]	618
(10000,15000]	8
(15000,20000]	0
(20000,25000]	0
(25000,30000]	0
(30000,35000]	0
(35000,40000]	0
(40000,45000]	0
(45000,50000]	0
(50000,55000]	1
(55000,60000]	1

The revenue distribution is extremely right-skewed. A vast majority of transactions (23,308, or more than 97%) fall within the lowest band of 0–5,000 AUD. Only 618 transactions fall between 5,001–10,000 AUD, and beyond 10,000 AUD there are just 10 transactions in total. Notably, there is a single transaction each in the 50,000–55,000 AUD and 55,000–60,000 AUD ranges. This shows that the business is overwhelmingly driven by small orders, with only a handful of very high-value transactions forming a long tail.

# Type your codes
# 2dii 
# 2dii
plot(dfBS$Revenue, dfBS$Profit,
     main = "Scatterplot of Profit vs Revenue",
     xlab = "Revenue",
     ylab = "Profit")

abline(lm(Profit ~ Revenue, data = dfBS))

The scatterplot shows that revenue and profit have a strong positive linear correlation. Transactions with higher revenue generally correspond to higher profit. However, profit levels, especially for transactions below 10000 AUD, are also spread out for a given revenue, pointing to differences in margins across products. The few very high-revenue transactions (more than 50000 AUD) deviate from the trend, with unusually low or irregular profit outcomes (lie below the linear line of best fit), likely due to bulk discounts, contractual pricing, or data errors, which warrant further investigation.

END: YOUR ANSWER

Q2.(e) Pareto Analyses Dashboard (5 marks)

Conduct pareto analyses on Revenue. Report the number and percentage of transactions that contribute most, amounting to 80% of the total revenue.

For tutorial discussion: what follow up questions do you have after performing the Pareto Analyses?

BEGIN: YOUR ANSWER

# Type your codes
# 2e
pareto_dataset <- dfBS
pareto <- pareto_dataset %>%
  select(Revenue) %>%
  arrange(desc(Revenue))

pareto$percentage <- pareto$Revenue / sum(pareto$Revenue)
pareto$cumulative <- cumsum(pareto$percentage)

num_indi <- which(pareto$cumulative > 0.8)[1]
num_indi

## [1] 6360

percent_indi <- num_indi / nrow(pareto)
percent_indi

## [1] 0.2657086

The Pareto analysis shows that about 6,360 transactions, only around 26.6% of all orders, contribute to 80% of the store’s total revenue. This confirms the Pareto principle that a relatively small proportion of high-value orders drive the majority of revenue. A logical next step would be to profile these transactions by product category, state, and customer segment to see which groups are most responsible for this revenue concentration.

END: YOUR ANSWER