Code
library(tidyverse)
library(sweep)
library(scales)
library(DT)library(tidyverse)
library(sweep)
library(scales)
library(DT)This analysis aims to understand bicycle sales behavior using a dataset that includes information on prices, quantities sold, product categories, frame materials, and bike shops. The main goal is to identify patterns and relationships that provide useful insights for marketing decision-making.
To begin the analysis, the bike_sales.csv database was imported using the code obtained from Report 1 in HTML format. This process made it possible to load the original information into the working environment, verify its structure, and prepare the necessary variables for statistical analysis.
# Import the data from the sweep package
bike_sales <- bike_sales |>
mutate(order.id = factor(x = order.id,
ordered = TRUE),
order.line = factor(x = order.line,
ordered = TRUE),
customer.id = factor(x = customer.id,
ordered = FALSE),
bikeshop.name = factor(x = bikeshop.name,
ordered = FALSE),
bikeshop.city = factor(x = bikeshop.city,
ordered = FALSE),
bikeshop.state = factor(x = bikeshop.name,
ordered = FALSE),
product.id = factor(x = product.id,
ordered = FALSE),
model = factor(x = model,
ordered = FALSE),
category.primary = factor(category.primary,
ordered = FALSE),
category.secondary = factor(x = category.secondary,
ordered = FALSE),
frame = factor(x = frame,
ordered = FALSE))The following figure shows the relationship between the price and the quantity sold of bicycles, distinguishing the frame material (Aluminum and Carbon). The dispersion of the points indicates that there is no clear trend between the two variables, suggesting a weak or non-existent relationship between price and the number of units sold Figure 1
bike_sales |>
ggplot() +
geom_point(aes(x = price,
y = quantity,
color = frame)) +
scale_color_manual(values = c("#043074",
"#FDC600")) +
labs(x = "Price (US Dollars / 1 frame)",
y = "Quantity",
color = "Frame",
subtitle = "Relationship between quantity and price")In Figure 1 it is possible to identify that there is not a relationship between the variable price and quantity. For example if you calculate the correlation the result is 0.000023.
The following figure shows the relationship between price and quantity sold using a base-10 logarithmic scale. This adjustment allows a clearer view of data dispersion and confirms the absence of a relationship between the two variables.
bike_sales |>
ggplot() +
geom_point(aes(x = price,
y = quantity,
color = frame)) +
scale_x_continuous(transform = "log10") +
scale_y_continuous(transform = "log10") +
scale_color_manual(values = c("#043074",
"#FDC600")) +
labs(x = "Price (US Dollars / 1 frame)",
y = "Quantity",
color = "Frame",
subtitle = "Relationship between quantity and price using a logarithmic scale in base 10")In Figure 2 it is also possible to identify that there is not a relationship between the variable log 10 price and log 10 quantity. For example if you calculate the correlation the result is 0.000584.
The following figure shows the product gap by secondary category and frame material, highlighting the differences in performance between aluminum and carbon bicycles.
bike_sales |>
count(frame, category.secondary) |>
ggplot() +
geom_label(aes(x = frame,
y = category.secondary,
label = n)) +
labs(x = "Frame",
y = "Category secondary",
subtitle = "Frame vs Category secondary")Some categories, such as Cross Country Race and Endurance Road, perform better with carbon frames, while others favor aluminum. This indicates variations in customer preferences depending on the product type and material Figure 3
This section presents the main statistical measures that describe revenue behavior across different variable combinations. These indicators help identify patterns and differences between categories, materials, and stores, providing a clearer view of overall sales performance.
(category.secondary, frame)The following table shows the mean, median, and standard deviation of revenue grouped by secondary category and frame type. These values allow comparing the average performance of different product combinations and observing the variability among them.
It can be seen that some categories, such as Elite Road and Cross Country Race with carbon frames, have higher average revenues and greater dispersion, indicating a stronger contribution to total sales Table 1
bike_sales_cat2_frame_stats_rev <- bike_sales |>
group_by(category.secondary, frame) |>
summarise(
mean_rev = mean(x = price.ext),
median_rev = median(x = price.ext),
sd_rev = sd(x = price.ext)
) |>
ungroup()
bike_sales_cat2_frame_stats_rev |>
datatable(
colnames = c("Category secondary",
"Frame",
"Mean revenue",
"Median revenue",
"Standard deviation revenue")
) |>
formatRound(
columns = c("mean_rev",
"sd_rev"),
digits = 2
)(category.secondary, frame)This section shows the percentage contributed by each combination of secondary category and frame type to the total revenue. This analysis helps identify which products generate a greater share of total sales.
It is observed that carbon bicycles in categories such as Elite Road and Cross Country Race account for the highest percentages, indicating a market preference for these models due to their better performance and higher perceived value.
bike_sales_cat2_frame_total_pct_rev <- bike_sales |>
group_by(category.secondary, frame) |>
summarise(total_rev = sum(price.ext)) |>
ungroup() |>
mutate(total_rev_pct = (total_rev / sum(total_rev)) * 100)
bike_sales_cat2_frame_total_pct_rev |>
datatable(
colnames = c("Category secondary",
"Frame",
"Total revenue",
"Percentage of total revenue")
) |>
formatRound(
columns = c("total_rev_pct") ,
digits = 2
)bikeshop.nameThis section presents the mean, median, and standard deviation of revenue grouped by bike shop. These indicators allow comparing the average performance of each store and analyzing variability in sales results.
The data show that some stores, such as Denver Bike Shop and Ann Arbor Speed, report the highest average revenues, suggesting greater sales capacity or stronger market positioning in their local areas.
bike_sales_bikeshop_stats_rev <- bike_sales |>
group_by(bikeshop.name) |>
summarise(
mean_rev = mean(x = price.ext),
median_rev = median(x = price.ext),
sd_rev = sd(x = price.ext)
)
bike_sales_bikeshop_stats_rev |>
datatable(
colnames = c("Bikeshop",
"Mean revenue",
"Median revenue",
"Standard deviation revenue")
) |>
formatRound(
columns = c("mean_rev",
"sd_rev"),
digits = 2
)bikeshop.nameThis section shows the percentage contribution to total revenue made by each bike shop. This analysis helps identify which stores contribute the most to the company’s overall performance.
The results show that shops such as Denver Bike Shop and Ithaca Mountain Climbers account for the highest revenue percentages, standing out as the main sales drivers within the distribution network.
bike_sales_bikeshop_total_rev_pct <- bike_sales |>
group_by(bikeshop.name) |>
summarise(total_rev = sum(price.ext)) |>
mutate(total_rev_pct = (total_rev / sum(total_rev)) * 100)
bike_sales_bikeshop_total_rev_pct |>
datatable(
colnames = c("Bikeshop",
"Total revenue",
"Percentage of total revenue")
) |>
formatRound(
columns = c("total_rev_pct"),
digits = 2
)chisq.test()Inspect, using an statistical test, if the total revenue is distributed in a homogeneous way by category.secondary and frame using a \(\chi^2\) test for a given vector of probabilities.
Taking into account that there are 13 available models offer by the corporation use \(p_i = \frac{1}{13}\) for \(i = 1, 2, \ldots, 13\) in the \(\chi^2\) test
Once you apply the test point out your conclusion taking into account the p-value of the test
# To attack the problem you need to really understand
# the chi squared test for given probabilities
## First, prepare the data in a similar or in the following
## way
### table_category_secondary_frame <- bike_sales |>
### group_by(category.secondary, frame) |>
### summarise(total_revenue = sum(price.ext),
### .groups = "drop") |>
### mutate(total_revenue_pct = (total_revenue / sum(total_revenue))*100) |>
### arrange(desc(total_revenue_pct))
## Second, create a new variable to identify
## the models in a similar or in the following way
### table_models <- table_category_secondary_frame |>
### mutate(model = str_c(category.secondary, frame, sep = " ")) |>
### select(model, total_revenue)
#### For more information about str_c check out the examples in
#### https://stringr.tidyverse.org/reference/str_c.html
## Third, create a named vector to apply the chi test
## taking into account what is mention in ?chisq.test
## in the section related to the Examples and the part
## about Testing for population probabilities
### To create a named vector check out the deframe
### function from the tidyverse
#### https://tibble.tidyverse.org/reference/enframe.html
##### deframe(table_models)
## Fourth, apply the chi squared test
### deframe(table_models) |>
### chisq.test(p = rep.int(x = 1/13, times = 13))
#### rep.int(x = 1/13, times = 13) is just a way
#### to repeat 1/13 thirteen times
##### Check out ?rep.int to understand it
## Fifth, use the tidymodels library:
## library(tidymodels) using tidy
### Remember to load the library in the first code
### chunk: library(tidymodels)
#### For more information check out
#### https://broom.tidymodels.org/reference/tidy.htest.html
##### deframe(table_models) |>
##### chisq.test(p = rep.int(x = 1/13, times = 13)) |>
##### tidy()Inspect, using an statistical test, if the total revenue is distributed in a homogeneous way by bikeshop.name using a \(\chi^2\) test for a given vector of probabilities.
Taking into account that there are 30 bike shops offer use \(p_i = \frac{1}{30}\) for \(i = 1, 2, \ldots, 30\) in the \(\chi^2\) test
Once you apply the test point out your conclusion taking into account the p-value of the test
# Apply a similar process as in the case of
# applying the chi squared test for
# category.secondary and frame in relation to
# the total revenue
## See the comments in that section for more
## information about how to apply the testt.test()Inspect, using a two sample t-tests, if the mean revenue between the Aluminum and Carbon bicycle models are the same in relation to the frame variable
Once you apply the test point out your conclusion taking into account the p-value of the test
# If you use base R you can apply t.test
## Check out more information in ?t.test
###
# If you use the tidyverse framework you can
# use t_test from tidymodels
## For more information check out
### https://infer.tidymodels.org/reference/t_test.htmlApply an analysis of variance (ANOVA) using price as a response variable
First begin with a simple model
price ~ category.secondaryThen add another variable
price ~ category.secondary + frameExpand the model in the following way where you can use the following equivalent expressions
price ~ category.secondary + frame + category.secondary:frameprice ~ category.secondary*frame1# Use the functions aov and anova from base R
# to apply the analysis of variance (ANOVA)
## As an example you can begin with a simple model
### aov(formula = price ~ category.secondary,
### data = bike_sales) |>
### anova() |>
### tidy()
## Then change the formula argument with the
## different alternative modelsTaking into account the analysis of variance, point out if it make sense to add category.secondary, frame and its interaction to the model.
Depending on the answer to the above question point out if it make sense to try the following model:
price ~ category.secondary*frame*bikeshop.state2category.secondary*frame is a short way to specify all the possible interactions between these 2 categorical variables which is equivalent to category.secondary + frame + category.secondary:frame↩︎
category.secondary*frame*bikeshop.state is a way to specify all the possible interactions between these 3 categorical variables. In this case the model will have \(2^3 - 1 = 7\) terms↩︎