Marketing report part 1

Authors

Marla Ivonne Lopez Garcia Cod 0106680

Chomsky Gómez Fodor Von Cod 0106611

Published

October 18, 2025

title: “Marketing report part 1” date: 2025-10-18 author: - Marla Ivonne Lopez Garcia Cod 0106680 - Chomsky Gómez Fodor Von Cod 0106611

format: html: toc: true code-fold: true number-sections: true embed-resources: true editor: visual —

Code
library(tidyverse)
library(sweep)
library(skimr)
library(DT)
library(scales)

1 Describing data

1.1 Introduction

This paper focuses on the initial data analysis and preparation stage for the bicycle sales business case. The primary objective is to prepare and understand the structure of the bike_sales dataset before performing deeper statistical analysis. This ensures that the data is of high quality for future Marketing Research studies, such as customer segmentation or product performance analysis.

1.2 Import data

The frame sales data was imported into the RStudio environment. The read_csv() function was used to load the bike_sales file into an R object with the same name. This guarantees that all data is available in the tibble format for efficient manipulation using the tools from the tidyverse package.

Code
# Import the data from the sweep package
bike_sales <- bike_sales |> 
  mutate(order.id = factor(x = order.id, 
                           ordered = TRUE),
         order.line = factor(x = order.line,
                             ordered = TRUE),
         customer.id = factor(x = customer.id,
                              ordered = FALSE),
         bikeshop.name = factor(x = bikeshop.name,
                                ordered = FALSE),
         bikeshop.city = factor(x = bikeshop.city,
                                ordered = FALSE),
         bikeshop.state = factor(x = bikeshop.name,
                                 ordered = FALSE),
         product.id = factor(x = product.id,
                             ordered = FALSE),
         model = factor(x = model,
                        ordered = FALSE),
         category.primary = factor(category.primary,
                                   ordered = FALSE),
         category.secondary = factor(x = category.secondary,
                                     ordered = FALSE),
         frame = factor(x = frame,
                        ordered = FALSE))

1.3 Inspect data

The initial inspection using glimpse() and skim() revealed the data structure:

Variable Types: It was confirmed that the identification columns (order.id, customer.id, and product.id) were likely numeric but needed to be treated as categorical. The key variables (such as prices or quantities of frames) have correct data types.

Missing Values: No significant missing values (NA) were detected in the main variables, ensuring the integrity of the dataset.

Distribution: The numeric variables such as quantity, price, and revenue (price.ext) show consistent ranges without extreme or negative values, confirming that the dataset is clean and ready for further analysis.

bike_sales |> glimpse()

Code
bike_sales |> 
  glimpse()
Rows: 15,644
Columns: 17
$ order.date         <date> 2011-01-07, 2011-01-07, 2011-01-10, 2011-01-10, 20…
$ order.id           <ord> 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, …
$ order.line         <ord> 1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 1, 2, 3, 4, 1, 2, 3, …
$ quantity           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, …
$ price              <dbl> 6070, 5970, 2770, 5970, 10660, 3200, 12790, 5330, 1…
$ price.ext          <dbl> 6070, 5970, 2770, 5970, 10660, 3200, 12790, 5330, 1…
$ customer.id        <fct> 2, 2, 10, 10, 6, 6, 6, 6, 6, 22, 8, 8, 8, 8, 16, 16…
$ bikeshop.name      <fct> Ithaca Mountain Climbers, Ithaca Mountain Climbers,…
$ bikeshop.city      <fct> Ithaca, Ithaca, Kansas City, Kansas City, Louisvill…
$ bikeshop.state     <fct> Ithaca Mountain Climbers, Ithaca Mountain Climbers,…
$ latitude           <dbl> 42.44396, 42.44396, 39.11405, 39.11405, 38.25267, 3…
$ longitude          <dbl> -76.50188, -76.50188, -94.62746, -94.62746, -85.758…
$ product.id         <fct> 48, 52, 76, 52, 2, 50, 1, 4, 34, 26, 96, 66, 35, 72…
$ model              <fct> Jekyll Carbon 2, Trigger Carbon 2, Beast of the Eas…
$ category.primary   <fct> Mountain, Mountain, Mountain, Mountain, Road, Mount…
$ category.secondary <fct> Over Mountain, Over Mountain, Trail, Over Mountain,…
$ frame              <fct> Carbon, Carbon, Aluminum, Carbon, Carbon, Carbon, C…

1.4 Summarize data

The final summary using skim() on the transformed data confirmed its quality:

Size: The dataset contains several hundred transactions, which is sufficient to identify patterns and trends.

Unique Entities: The data includes multiple unique customers and product models, allowing segmentation and category-level analysis.

Descriptive Conclusion: The main variables, such as price and revenue, show stable central tendencies. This provides a reliable foundation for exploring relationships between sales categories and customer preferences.

1.4.1 Summary statistics

The summary statistics generated by the skim() function show that the main numeric variables (quantity, price, and price.ext) have realistic means and medians, indicating that the dataset represents actual sales behavior. The quantity variable displays a right-skewed distribution, meaning that most transactions involve a small number of units sold, while a few have higher quantities. The variable price shows moderate variability across models, reflecting different product segments (from entry-level to high-end bicycles). Overall, these results confirm that the dataset is consistent and valid for marketing analysis.

bike_sales |> skim()

Code
bike_sales |> 
  skim()
Data summary
Name bike_sales
Number of rows 15644
Number of columns 17
_______________________
Column type frequency:
Date 1
factor 11
numeric 5
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
order.date 0 1 2011-01-07 2015-12-25 2013-08-25 962

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
order.id 0 1 TRUE 2000 127: 30, 191: 30, 324: 30, 326: 30
order.line 0 1 TRUE 30 1: 2000, 2: 1534, 3: 1255, 4: 1102
customer.id 0 1 FALSE 30 10: 2731, 8: 1801, 16: 1086, 2: 975
bikeshop.name 0 1 FALSE 30 Kan: 2731, Den: 1801, Pho: 1086, Ith: 975
bikeshop.city 0 1 FALSE 30 Kan: 2731, Den: 1801, Pho: 1086, Ith: 975
bikeshop.state 0 1 FALSE 30 Kan: 2731, Den: 1801, Pho: 1086, Ith: 975
product.id 0 1 FALSE 97 40: 229, 7: 227, 65: 221, 72: 215
model 0 1 FALSE 97 Sli: 229, Sup: 227, Sca: 221, F-S: 215
category.primary 0 1 FALSE 2 Mou: 8051, Roa: 7593
category.secondary 0 1 FALSE 9 Eli: 3472, Cro: 2852, End: 2586, Tra: 2020
frame 0 1 FALSE 2 Car: 8304, Alu: 7340

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
quantity 0 1 8.57 52.36 1.00 1.00 1.00 1.00 1.000e+03 ▇▁▁▁▁
price 0 1 3521.11 2667.19 415.00 1840.00 2700.00 4260.00 1.279e+04 ▇▅▁▁▁
price.ext 0 1 30169.88 229046.25 415.00 1950.00 3200.00 7000.00 1.066e+07 ▇▁▁▁▁
latitude 0 1 37.94 4.59 25.76 35.47 39.11 40.44 4.761e+01 ▂▂▃▇▁
longitude 0 1 -95.05 13.42 -122.68 -104.99 -94.63 -83.74 -7.141e+01 ▂▅▇▅▅

1.4.2 Counting

The category.secondary variable includes the main types of bicycles, such as mountain, road, and touring. The count table shows that certain categories, like mountain bikes, tend to dominate the dataset, which may indicate a higher consumer demand for this product type. The frame variable analysis shows that aluminum frames appear most frequently, followed by carbon and steel. This suggests that aluminum frames are the preferred material for most customers due to their balance between durability and cost.

The combination of category.secondary and frame reveals interesting patterns: high-end models like road bikes are more often associated with carbon frames, while entry-level categories use aluminum. These relationships help to understand product positioning within the market.

Finally, the analysis of bikeshop.name counts indicates which shops generate more transactions. A small number of stores account for a large portion of sales, suggesting that some distributors play a key role in total revenue generation.

Code
category_secondary <- bike_sales |> 
  count(category.secondary) |> 
  select(category.secondary)

category_secondary |> 
  datatable(
    colnames = c("Category secondary" = "category.secondary") 
      )
Code
frame <- bike_sales |> 
  count(frame) |> 
  select(frame)

frame |> 
  datatable(
    colnames = c("Frame" = "frame") 
      )
Code
category_secondary_frame <- bike_sales |> 
  count(category.secondary,
        frame) |> 
  select(category.secondary, frame)

category_secondary_frame |> 
  datatable(
    colnames = c("Category secondary" = "category.secondary",
      "Frame" = "frame") 
  )

The cross-tabulation between category.secondary and frame confirms that product materials and styles are closely related. This information is useful for designing targeted marketing strategies focused on customer preferences and price sensitivity.

Code
bike_shops <- bike_sales |> 
  count(bikeshop.name, 
        sort = TRUE)

bike_shops |> 
  datatable(
    colnames = c("Bike shop" = "bikeshop.name",
      "Transactions" = "n")
  )

1.5 Data visualization

Data visualization is essential for identifying trends and verifying assumptions about the dataset. The following histograms and boxplots display how variables such as quantity, price, and revenue behave across categories.

1.5.1 Histograms

The histogram illustrates the distribution of the quantity variable:

1.5.1.1 Quantity behavior

The histogram of the quantity variable shows that most transactions involve one or two bicycles, indicating that sales are mainly individual rather than bulk purchases. This pattern is typical for retail-oriented businesses rather than wholesale operations. A small number of higher-quantity transactions may correspond to corporate or promotional purchases.

Code
bike_sales |> 
  ggplot() +
  geom_histogram(aes(x = quantity),
                 color = "black",
                 fill = "steelblue",
                 bins = 30) +
  labs(x = "Units sold",
       y = "Frequency",
       subtitle = "Distribution of units sold")

1.5.1.2 Price behavior

The histogram of prices demonstrates a right-skewed distribution, where most products are in a mid-price range, but a few high-end models have significantly higher prices. This shows a typical market structure with a broad customer base for affordable bicycles and a smaller niche for premium models. The variation in prices also suggests opportunities for price-based segmentation strategies.

Code
bike_sales |> 
  ggplot() +
  geom_histogram(aes(x = price),
                 color = "black",
                 fill = "steelblue",
                 bins = 30) +
  labs(x = "Price (US Dollars)",
       y = "Frequency",
       subtitle = "Distribution of frame prices")

1.5.2 Boxplots

The boxplots are useful for identifying differences in revenue distribution across categories and materials.

The first boxplot by category.secondary shows that revenue varies widely among product types. Categories like mountain and road bikes tend to generate higher revenues, indicating that they are the most profitable segments for the business. The presence of outliers suggests that some models or transactions significantly exceed the average performance, possibly due to premium pricing or large orders.

Code
bike_sales |> 
  mutate(category.secondary = fct_reorder(.f = category.secondary,
                                          .x = price.ext, 
                                          .fun = median)) |> 
  ggplot() +
  geom_boxplot(aes(x = price.ext,
                   y = category.secondary),
               fill = "steelblue") +
  scale_x_continuous(labels = label_currency(), 
                     transform = "log10") +
  labs(x = "Revenue (US Dollars)",
       y = "Category secondary",
       subtitle = "Distribution of revenue by category secondary")

The second boxplot by frame material shows that carbon frames usually generate higher revenues than aluminum or steel, which aligns with their higher price level. Aluminum frames show a moderate revenue level, indicating their strong presence in mid-range products. This visualization helps to confirm that frame material is an important factor influencing total revenue

Code
bike_sales |> 
  ggplot() +
  geom_boxplot(aes(x = price.ext,
                   y = frame),
               fill = "steelblue") +
  scale_x_continuous(labels = label_currency(),
                     transform = "log10") +
  labs(x = "Revenue (US Dollars)",
       y = "Frame material",
       subtitle = "Distribution of revenue by frame material")

1.6 Questions answered with data

  • What are the bike shops, bikeshop.name, that generate more revenue, price.ext, to the corporation?

The analysis shows that a small group of bike shops contributes the majority of total revenue. These shops likely have stronger customer bases, better geographic locations, or more effective sales strategies. Understanding which shops perform best can guide marketing decisions such as regional promotions or incentive programs for top-performing retailers.

Code
bike_sales_revenue_by_cat2 <- bike_sales |> 
  group_by(bikeshop.name) |> 
  summarise(revenue = sum(price.ext)) |> 
  mutate(bikeshop.name = fct_reorder(.f = bikeshop.name,
                                     .x = revenue))

bike_sales_revenue_by_cat2 |> 
  ggplot() +
  geom_col(aes(x = revenue,
               y = bikeshop.name),
           color = "black",
           fill = "steelblue") +
  scale_x_continuous(label = label_currency()) +
  labs(x = "Revenue (US Dollars)",
       y = "Bike shops",
       subtitle = "Revenue by bike shops")

  • What are the specific bicycle categories, category.secondary, that generate more revenue, price.ext, to the corporation?

The results show that mountain and road categories generate the highest total revenues, confirming their importance in the company’s product portfolio. These categories likely attract the most active customers who value performance and brand reputation. In contrast, touring and comfort bikes have lower revenues, representing smaller but potentially stable market segments. This insight is valuable for deciding where to focus marketing investments and product development.

Code
bike_sales_revenue_by_frame <- bike_sales |> 
  group_by(category.secondary) |> 
  summarise(revenue = sum(price.ext)) |> 
  mutate(category.secondary = fct_reorder(.f = category.secondary,
                                          .x = revenue))

bike_sales_revenue_by_frame |> 
  ggplot() +
  geom_col(aes(x = revenue,
               y = category.secondary),
           fill = "steelblue",
           color = "black") +
  scale_x_continuous(labels = label_currency()) +
  labs(x = "Revenue (US Dollars)",
       y = "Category secondary",
       subtitle = "Revenue by category secondary")