Marketing report part 2

Authors

Marla Ivonne López

Chomsky

William

Published

November 8, 2025

Code
library(tidyverse)
library(sweep)
library(scales)
library(DT)

1 Exploring data, visualizing it and applying statistical tests

1.1 Introduction

This analysis aims to understand bicycle sales behavior using a dataset that includes information on prices, quantities sold, product categories, frame materials, and bike shops. The main goal is to identify patterns and relationships that provide useful insights for marketing decision-making.

1.2 Import data

To begin the analysis, the bike_sales.csv database was imported using the code obtained from Report 1 in HTML format. This process made it possible to load the original information into the working environment, verify its structure, and prepare the necessary variables for statistical analysis.

Code
# Import the data from the sweep package
bike_sales <- bike_sales |> 
  mutate(order.id = factor(x = order.id, 
                           ordered = TRUE),
         order.line = factor(x = order.line,
                             ordered = TRUE),
         customer.id = factor(x = customer.id,
                              ordered = FALSE),
         bikeshop.name = factor(x = bikeshop.name,
                                ordered = FALSE),
         bikeshop.city = factor(x = bikeshop.city,
                                ordered = FALSE),
         bikeshop.state = factor(x = bikeshop.name,
                                 ordered = FALSE),
         product.id = factor(x = product.id,
                             ordered = FALSE),
         model = factor(x = model,
                        ordered = FALSE),
         category.primary = factor(category.primary,
                                   ordered = FALSE),
         category.secondary = factor(x = category.secondary,
                                     ordered = FALSE),
         frame = factor(x = frame,
                        ordered = FALSE))

1.3 Relation between continuous variables

The following figure shows the relationship between the price and the quantity sold of bicycles, distinguishing the frame material (Aluminum and Carbon). The dispersion of the points indicates that there is no clear trend between the two variables, suggesting a weak or non-existent relationship between price and the number of units sold Figure 1

Code
bike_sales |> 
  ggplot() + 
  geom_point(aes(x = price,
                 y = quantity,
                 color = frame)) +
  scale_color_manual(values = c("#043074",
                                "#FDC600")) +
  labs(x = "Price (US Dollars / 1 frame)",
       y = "Quantity",
       color = "Frame",
       subtitle = "Relationship between quantity and price")
Figure 1: Relationship between frames sold and its price

In Figure 1 it is possible to identify that there is not a relationship between the variable price and quantity. For example if you calculate the correlation the result is 0.000023.

The following figure shows the relationship between price and quantity sold using a base-10 logarithmic scale. This adjustment allows a clearer view of data dispersion and confirms the absence of a relationship between the two variables.

Code
bike_sales |> 
  ggplot() +
  geom_point(aes(x = price,
                 y = quantity,
                 color = frame)) +
  scale_x_continuous(transform = "log10") +
  scale_y_continuous(transform = "log10") +
  scale_color_manual(values = c("#043074",
                                "#FDC600")) +
  labs(x = "Price (US Dollars / 1 frame)",
       y = "Quantity",
       color = "Frame",
       subtitle = "Relationship between quantity and price using a logarithmic scale in base 10")
Figure 2: Relationship between frames sold and its price using a logarithmic scale in base 10

In Figure 2 it is also possible to identify that there is not a relationship between the variable log 10 price and log 10 quantity. For example if you calculate the correlation the result is 0.000584.

1.4 Relation between categorical variables

The following figure shows the product gap by secondary category and frame material, highlighting the differences in performance between aluminum and carbon bicycles.

Code
bike_sales |> 
  count(frame, category.secondary) |> 
  ggplot() +
  geom_label(aes(x = frame,
                 y = category.secondary,
                 label = n)) +
  labs(x = "Frame",
       y = "Category secondary",
       subtitle = "Frame vs Category secondary")
Figure 3: Graphical representation of the product gap

Some categories, such as Cross Country Race and Endurance Road, perform better with carbon frames, while others favor aluminum. This indicates variations in customer preferences depending on the product type and material Figure 3

1.5 Descriptive statistics by group

This section presents the main statistical measures that describe revenue behavior across different variable combinations. These indicators help identify patterns and differences between categories, materials, and stores, providing a clearer view of overall sales performance.

1.5.1 Mean, median and standard deviation for revenue by (category.secondary, frame)

The following table shows the mean, median, and standard deviation of revenue grouped by secondary category and frame type. These values allow comparing the average performance of different product combinations and observing the variability among them.

It can be seen that some categories, such as Elite Road and Cross Country Race with carbon frames, have higher average revenues and greater dispersion, indicating a stronger contribution to total sales Table 1

Code
bike_sales_cat2_frame_stats_rev <- bike_sales |> 
  group_by(category.secondary, frame) |> 
  summarise(
    mean_rev = mean(x = price.ext),
    median_rev = median(x = price.ext),
    sd_rev = sd(x = price.ext)
  ) |> 
  ungroup()

bike_sales_cat2_frame_stats_rev |> 
  datatable(
    colnames = c("Category secondary",
                 "Frame",
                 "Mean revenue",
                 "Median revenue",
                 "Standard deviation revenue") 
      ) |> 
  formatRound(
    columns = c("mean_rev",
                "sd_rev"), 
    digits = 2  
  )
Table 1: Mean, median and standard deviation of revenue by category secondary and frame

1.5.2 Percentage of total revenue generated by (category.secondary, frame)

This section shows the percentage contributed by each combination of secondary category and frame type to the total revenue. This analysis helps identify which products generate a greater share of total sales.

It is observed that carbon bicycles in categories such as Elite Road and Cross Country Race account for the highest percentages, indicating a market preference for these models due to their better performance and higher perceived value.

Code
bike_sales_cat2_frame_total_pct_rev <- bike_sales |> 
  group_by(category.secondary, frame) |> 
  summarise(total_rev = sum(price.ext)) |> 
  ungroup() |> 
  mutate(total_rev_pct = (total_rev / sum(total_rev)) * 100)
  
bike_sales_cat2_frame_total_pct_rev |> 
  datatable(
    colnames = c("Category secondary",
                 "Frame",
                 "Total revenue",
                 "Percentage of total revenue") 
  ) |> 
  formatRound(
    columns = c("total_rev_pct") ,
    digits = 2
      )
Table 2: Total and percentage revenue by category secondary and frame

1.5.3 Mean, median and standard deviation for revenue by bikeshop.name

This section presents the mean, median, and standard deviation of revenue grouped by bike shop. These indicators allow comparing the average performance of each store and analyzing variability in sales results.

The data show that some stores, such as Denver Bike Shop and Ann Arbor Speed, report the highest average revenues, suggesting greater sales capacity or stronger market positioning in their local areas.

Code
bike_sales_bikeshop_stats_rev <- bike_sales |> 
  group_by(bikeshop.name) |> 
  summarise(
    mean_rev = mean(x = price.ext),
    median_rev = median(x = price.ext),
    sd_rev = sd(x = price.ext)
  )

bike_sales_bikeshop_stats_rev |> 
  datatable(
    colnames = c("Bikeshop",
                 "Mean revenue",
                 "Median revenue",
                 "Standard deviation revenue")
  ) |> 
  formatRound(
    columns = c("mean_rev",
                "sd_rev"),
    digits = 2
  )
Table 3: Mean, median and standard deviation of revenue by bikeshops

1.5.4 Percentage of total revenue generated by each bikeshop.name

This section shows the percentage contribution to total revenue made by each bike shop. This analysis helps identify which stores contribute the most to the company’s overall performance.

The results show that shops such as Denver Bike Shop and Ithaca Mountain Climbers account for the highest revenue percentages, standing out as the main sales drivers within the distribution network.

Code
bike_sales_bikeshop_total_rev_pct <- bike_sales |> 
  group_by(bikeshop.name) |> 
  summarise(total_rev = sum(price.ext)) |> 
  mutate(total_rev_pct = (total_rev / sum(total_rev)) * 100)

bike_sales_bikeshop_total_rev_pct |> 
  datatable(
    colnames = c("Bikeshop",
                 "Total revenue",
                 "Percentage of total revenue")
  ) |> 
  formatRound(
    columns = c("total_rev_pct"),
    digits = 2
  )
Table 4: Total and percentage revenue by bikeshop

1.6 Statistical tests

1.6.1 Testing group frequencies: chisq.test()

Inspect, using an statistical test, if the total revenue is distributed in a homogeneous way by category.secondary and frame using a \(\chi^2\) test for a given vector of probabilities.

Taking into account that there are 13 available models offer by the corporation use \(p_i = \frac{1}{13}\) for \(i = 1, 2, \ldots, 13\) in the \(\chi^2\) test

Once you apply the test point out your conclusion taking into account the p-value of the test

Code
# To attack the problem you need to really understand 
# the chi squared test for given probabilities

## First, prepare the data in a similar or in the following
## way
### table_category_secondary_frame <- bike_sales |>
###   group_by(category.secondary, frame) |>
###   summarise(total_revenue = sum(price.ext),
###             .groups = "drop") |>
###   mutate(total_revenue_pct = (total_revenue / sum(total_revenue))*100) |>
###   arrange(desc(total_revenue_pct))

## Second, create a new variable to identify
## the models in a similar or in the following way
### table_models <- table_category_secondary_frame |> 
###  mutate(model = str_c(category.secondary, frame, sep = " ")) |> 
###  select(model, total_revenue)
#### For more information about str_c check out the examples in
#### https://stringr.tidyverse.org/reference/str_c.html

## Third, create a named vector to apply the chi test
## taking into account what is mention in ?chisq.test
## in the section related to the Examples and the part
## about Testing for population probabilities
### To create a named vector check out the deframe
### function from the tidyverse
#### https://tibble.tidyverse.org/reference/enframe.html
##### deframe(table_models)

## Fourth, apply the chi squared test
### deframe(table_models) |> 
###  chisq.test(p = rep.int(x = 1/13, times = 13))
#### rep.int(x = 1/13, times = 13) is just a way
#### to repeat 1/13 thirteen times
##### Check out ?rep.int to understand it

## Fifth, use the tidymodels library:
## library(tidymodels) using tidy
### Remember to load the library in the first code 
### chunk: library(tidymodels)
#### For more information check out
#### https://broom.tidymodels.org/reference/tidy.htest.html
##### deframe(table_models) |> 
#####  chisq.test(p = rep.int(x = 1/13, times = 13)) |> 
#####  tidy()

Inspect, using an statistical test, if the total revenue is distributed in a homogeneous way by bikeshop.name using a \(\chi^2\) test for a given vector of probabilities.

Taking into account that there are 30 bike shops offer use \(p_i = \frac{1}{30}\) for \(i = 1, 2, \ldots, 30\) in the \(\chi^2\) test

Once you apply the test point out your conclusion taking into account the p-value of the test

Code
# Apply a similar process as in the case of
# applying the chi squared test for 
# category.secondary and frame in relation to
# the total revenue
## See the comments in that section for more 
## information about how to apply the test

1.6.2 Testing group means: t.test()

Inspect, using a two sample t-tests, if the mean revenue between the Aluminum and Carbon bicycle models are the same in relation to the frame variable

Once you apply the test point out your conclusion taking into account the p-value of the test

Code
# If you use base R you can apply t.test
## Check out more information in ?t.test
### 
# If you use the tidyverse framework you can 
# use t_test from tidymodels
## For more information check out 
### https://infer.tidymodels.org/reference/t_test.html

1.6.3 Analysis of Variance (ANOVA)

Apply an analysis of variance (ANOVA) using price as a response variable

  • First begin with a simple model

    • price ~ category.secondary
  • Then add another variable

    • price ~ category.secondary + frame
  • Expand the model in the following way where you can use the following equivalent expressions

    • price ~ category.secondary + frame + category.secondary:frame
    • price ~ category.secondary*frame1
Code
# Use the functions aov and anova from base R
# to apply the analysis of variance (ANOVA)
## As an example you can begin with a simple model
### aov(formula = price ~ category.secondary, 
###    data = bike_sales) |> 
###  anova() |> 
###  tidy()

## Then change the formula argument with the 
## different alternative models

Taking into account the analysis of variance, point out if it make sense to add category.secondary, frame and its interaction to the model.

Depending on the answer to the above question point out if it make sense to try the following model:

  • price ~ category.secondary*frame*bikeshop.state2

Footnotes

  1. category.secondary*frame is a short way to specify all the possible interactions between these 2 categorical variables which is equivalent to category.secondary + frame + category.secondary:frame↩︎

  2. category.secondary*frame*bikeshop.state is a way to specify all the possible interactions between these 3 categorical variables. In this case the model will have \(2^3 - 1 = 7\) terms↩︎