Tutorial 3 Part 2 (30 marks - To be submitted by 15 Sep 9am)

Context: The dataset AustraliaSales.csv contains data on orders made to a cycling equipment store in Australia. Each observation represents an order and provides information regarding the customer, products purchased as well as the cost, revenue and profit made. Each of the column is defined as follows:

#put in your working directory folder pathname 
setwd("/Users/jeriellai/BT1101 Introduction to Business Analytics/Tutorial 3")
getwd()
## [1] "/Users/jeriellai/BT1101 Introduction to Business Analytics/Tutorial 3"
#import excel file into RStudio
dfBS <- read.csv("AustraliaSales.csv")

#import libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("tidyverse") #need to call the library before you use the packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.2     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("knitr")
library("readxl") # this is required for the read_excel function
library("RColorBrewer") # for color palette 

Q2.(a) Customer Profile Dashboard (7.5 marks)

Conduct an analyses to develop a profile of the customers who have made purchases of the bike equipment. For each of the following variables, create a table and chart to describe the frequency distributions. (6 marks)

Describe your observations from the above analyses. (1.5 marks)

BEGIN: YOUR ANSWER

# Type your codes
# 2ai 
Freqi <- dfBS %>% count(Age_Group)
kable(Freqi, caption = "Frequency of Customers by Age Group", col.names= c("Age Group", "Frequency"))
Frequency of Customers by Age Group
Age Group Frequency
Adults (35-64) 10394
Seniors (64+) 58
Young Adults (25-34) 9102
Youth (<25) 4382
par(mar = c(5, 10, 4, 2))

b1 <- barplot(Freqi$n,
        names.arg = Freqi$Age_Group,
        col = 'blue', 
        main = 'Frequency of Customers by Age Group',
        cex.names = 0.55,
        xlim = c(0, 12000), 
        xlab = 'No. of customers', 
        horiz = TRUE,
        las = 1)

text(x = Freqi$n + 700,
     y = b1,
     labels = Freqi$n,
     pos = 1
       )

# Type your codes
# 2aii 
Freqii <- dfBS %>% count(Customer_Gender)
kable(Freqii, caption = "Frequency of Customers by Gender", col.names= c("Gender", "Frequency"))
Frequency of Customers by Gender
Gender Frequency
F 11506
M 12430
slice.data <- Freqii$n
piepercent <- 100 * round(Freqii$n/sum(Freqii$n), 2)
label <- Freqii$Customer_Gender #change to the variable name
label <- paste(label, ",", piepercent, "%", sep = "")
pie(slice.data,
      labels = label,
      col = c("blue", "cyan", "dodgerblue"),
      radius = 1,
      main = "Frequency of Customers by Gender") # change caption accordingly

# Type your codes
# 2aiii 
Freqiii <- dfBS %>% count(State)
kable(Freqiii, caption = "Frequency of Customers by State", col.names= c("State", "Frequency"))
Frequency of Customers by State
State Frequency
New South Wales 10412
Queensland 5220
South Australia 1564
Tasmania 724
Victoria 6016
par(mar = c(5, 10, 4, 2))

b2 <- barplot(Freqiii$n,
        names.arg = Freqiii$State,
        col = 'blue', 
        main = 'Frequency of Customers by State',
        cex.names = 0.55,
        xlim = c(0, 12000), 
        xlab = 'No. of customers', 
        horiz = TRUE,
        las = 1)

text(x = Freqiii$n + 700,  
     y = b2,
     labels = Freqiii$n,
     pos = 1)              

Most customers are from the adults (35-64) and young adults (25-34) age groups, both contributing to around 81% of total transactions. Seniors (64+) form the smallest customer based with around 0.2% of total market share. The proportion of male and female customers are relatively even (52% and 48% respectively). By state, New South Wales is where the majority of customers are from (around 43%), followed by Victoria (around 25%) and Queensland (around 22%), whereas customers from smaller states such as Tasmania contribute very few transactions (around 3%).

END: YOUR ANSWER

Q2.(b) Product Sales Analyses Dashboard (3.5 marks)

Store manager is also interested in understanding the frequency of transactions involving different products. (2 marks)

Describe your insights from the charts. (1.5 marks)

BEGIN: YOUR ANSWER

# Type your codes
# 2bi 
Freqiv <- dfBS %>% count(Product_Category)
kable(Freqiv, caption = "Frequency of Transactions by Product Category", col.names= c("Product Category", "Frequency"))
Frequency of Transactions by Product Category
Product Category Frequency
Accessories 13498
Bikes 7064
Clothing 3374

Accessories comprises the majority of total transactions (around 57%), making up more than half of all sales. Bikes follow with 7064 transactions (around 30%), while clothing is the smallest category with around 14% of total transactions. This table shows that smaller items (accessories) are purchased more often than larger items (bicycles and clothing).

# Type your codes
# 2bii 
Freqv <- dfBS %>% group_by(Product_Category) %>% count(Sub_Category)
kable(Freqv, caption = "Frequency of Transactions by Sub Category of Product Category", col.names= c("Product Category", "Sub Category", "Frequency"))
Frequency of Transactions by Sub Category of Product Category
Product Category Sub Category Frequency
Accessories Bike Racks 90
Accessories Bike Stands 120
Accessories Bottles and Cages 3144
Accessories Cleaners 430
Accessories Fenders 628
Accessories Helmets 2346
Accessories Hydration Packs 344
Accessories Tires and Tubes 6396
Bikes Mountain Bikes 2156
Bikes Road Bikes 4104
Bikes Touring Bikes 804
Clothing Caps 844
Clothing Gloves 578
Clothing Jerseys 1268
Clothing Shorts 264
Clothing Socks 224
Clothing Vests 196

Within accessories, tires and tubes collectively account for the majority of total transactions (around 47%), while hydration packs comprise the smallest share of total transactions within accessories (around 2.5%). For clothing, jerseys dominate (around 38% of total clothing transactions), while shorts, socks and vests comprise the smallest market share for clothing (around 6-8% for each subcategory). For bikes, road bikes are the most popular, comprising around 60% of total transactions for bikes. Overall, tires and tubes are the most popular subcategory of products, followed by road bikes with bike racks being the least popular.

END: YOUR ANSWER

Q2.(c) Transactions by Time Analyses Dashboard (7 marks)

Describe the trends observed from the above analyses. (2 marks)

# Type your codes
# 2ci 

Freq2ci <- dfBS %>% count(Year)

b4 <- barplot(Freq2ci$n,
        names.arg = Freq2ci$Year,
        col = 'blue', 
        main = 'Frequency of Transactions over the Years',
        cex.names = 0.55,
        xlim = c(0, 8000), 
        xlab = 'No. of Transactions', 
        horiz = TRUE,
        las = 1)

text(x = Freq2ci$n + 400,  
     y = b4,
     labels = Freq2ci$n,
     pos = 1) 

The number of transactions remain constant in 2011 and 2012 but in 2013, the number of transactions increased to around 6 times (5335) of that of the previous year (859) and remained relatively constant after at the 5300-5800 total yearly transactions level.

# Type your codes
# 2cii 

Freq2cii <- dfBS %>% group_by(Year, Product_Category) %>% tally()
Freq2cii.spread <- Freq2cii %>% tidyr::spread(key = Product_Category, value = n)
Freq2cii.spread[is.na(Freq2cii.spread)] <- 0

barmat <- as.matrix(Freq2cii.spread[, -1]) 
rownames(barmat) <- Freq2cii.spread$Year   

par(mar = c(5, 6, 5, 8), xpd = TRUE)
bp <- barplot(t(barmat),
              beside = TRUE,
              col = rainbow(ncol(barmat)),
              main = "Frequency of Each Product Category Across Years",
              xlab = "Year",
              ylab = "No. of Transactions",
              names.arg = rownames(barmat),
              ylim = c(0, max(barmat) * 1.3))

legend("topright",
       inset = c(-0.15, 0), 
       legend = colnames(barmat),
       fill = rainbow(ncol(barmat)),
       cex = 0.7,
       title = "Product Category")

text(x = bp,
     y = t(barmat) + 200,              
     labels = t(barmat),
     cex = 0.7)

The number of transactions increased significantly in 2013 compared to 2012 mainly because the store started selling accessories (accounting for the additional 2855 number of transactions) and clothing (accounting for the additional 763 number of transactions). Moreover, the total number of transactions for bikes doubled (1687) compared to 2012 (859). The total number of transactions from 2013 to 2016 remain relatively constant but the number of transactions for bikes, clothing and accessories fluctuate yearly.

END: YOUR ANSWER

Q2.(d) Revenue Analyses Dashboard (7 marks)

Describe your observations from each chart. Is there any further investigations that you would like to explore? (3 marks)

# Type your codes
# 2di 

hist_input1 <- dfBS$Revenue
hist_rev <- hist(dfBS$Revenue,
                 main = "Histogram of Revenue",
                 xlab = "Revenue",
                 ylab = "Frequency",
                 col  = "darkorange",
                 breaks = 20,
                 labels = TRUE,
                 ylim = c(0, max(hist(dfBS$Revenue, breaks = 20, plot = FALSE)$counts) * 1.2))

rev_bins <- cut(hist_input1, hist_rev$breaks, include.lowest = TRUE, dig.lab = 6)
rev_table <- table(rev_bins)
kable(rev_table, caption = "Frequency Distribution of Revenue")
Frequency Distribution of Revenue
rev_bins Freq
[0,5000] 23308
(5000,10000] 618
(10000,15000] 8
(15000,20000] 0
(20000,25000] 0
(25000,30000] 0
(30000,35000] 0
(35000,40000] 0
(40000,45000] 0
(45000,50000] 0
(50000,55000] 1
(55000,60000] 1

The revenue distribution is extremely right-skewed. A vast majority of transactions (23,308, or more than 97%) fall within the lowest band of 0–5,000 AUD. Only 618 transactions fall between 5,001–10,000 AUD, and beyond 10,000 AUD there are just 10 transactions in total. Notably, there is a single transaction each in the 50,000–55,000 AUD and 55,000–60,000 AUD ranges. This shows that the business is overwhelmingly driven by small orders, with only a handful of very high-value transactions forming a long tail.

# Type your codes
# 2dii 
# 2dii
plot(dfBS$Revenue, dfBS$Profit,
     main = "Scatterplot of Profit vs Revenue",
     xlab = "Revenue",
     ylab = "Profit")

abline(lm(Profit ~ Revenue, data = dfBS))

The scatterplot shows that revenue and profit have a strong positive linear correlation. Transactions with higher revenue generally correspond to higher profit. However, profit levels, especially for transactions below 10000 AUD, are also spread out for a given revenue, pointing to differences in margins across products. The few very high-revenue transactions (more than 50000 AUD) deviate from the trend, with unusually low or irregular profit outcomes (lie below the linear line of best fit), likely due to bulk discounts, contractual pricing, or data errors, which warrant further investigation.

END: YOUR ANSWER

Q2.(e) Pareto Analyses Dashboard (5 marks)

Conduct pareto analyses on Revenue. Report the number and percentage of transactions that contribute most, amounting to 80% of the total revenue.

For tutorial discussion: what follow up questions do you have after performing the Pareto Analyses?

BEGIN: YOUR ANSWER

# Type your codes
# 2e
pareto_dataset <- dfBS
pareto <- pareto_dataset %>%
  select(Revenue) %>%
  arrange(desc(Revenue))

pareto$percentage <- pareto$Revenue / sum(pareto$Revenue)
pareto$cumulative <- cumsum(pareto$percentage)

num_indi <- which(pareto$cumulative > 0.8)[1]
num_indi
## [1] 6360
percent_indi <- num_indi / nrow(pareto)
percent_indi
## [1] 0.2657086

The Pareto analysis shows that about 6,360 transactions, only around 26.6% of all orders, contribute to 80% of the store’s total revenue. This confirms the Pareto principle that a relatively small proportion of high-value orders drive the majority of revenue. A logical next step would be to profile these transactions by product category, state, and customer segment to see which groups are most responsible for this revenue concentration.

END: YOUR ANSWER