Code
library(tidyverse)
library(sweep)
library(skimr)
library(DT)
library(scales)title: “Marketing report part 1” date: 2025-10-18 author: - Marla Ivonne Lopez Garcia Cod 0106680 - Chomsky Gómez Fodor Von Cod 0106611
format: html: toc: true code-fold: true number-sections: true embed-resources: true editor: visual —
library(tidyverse)
library(sweep)
library(skimr)
library(DT)
library(scales)This paper focuses on the initial data analysis and preparation stage for the bicycle sales business case. The primary objective is to prepare and understand the structure of the bike_sales dataset before performing deeper statistical analysis. This ensures that the data is of high quality for future Marketing Research studies, such as customer segmentation or product performance analysis.
The frame sales data was imported into the RStudio environment. The read_csv() function was used to load the bike_sales file into an R object with the same name. This guarantees that all data is available in the tibble format for efficient manipulation using the tools from the tidyverse package.
# Import the data from the sweep package
bike_sales <- bike_sales |>
mutate(order.id = factor(x = order.id,
ordered = TRUE),
order.line = factor(x = order.line,
ordered = TRUE),
customer.id = factor(x = customer.id,
ordered = FALSE),
bikeshop.name = factor(x = bikeshop.name,
ordered = FALSE),
bikeshop.city = factor(x = bikeshop.city,
ordered = FALSE),
bikeshop.state = factor(x = bikeshop.name,
ordered = FALSE),
product.id = factor(x = product.id,
ordered = FALSE),
model = factor(x = model,
ordered = FALSE),
category.primary = factor(category.primary,
ordered = FALSE),
category.secondary = factor(x = category.secondary,
ordered = FALSE),
frame = factor(x = frame,
ordered = FALSE))The initial inspection using glimpse() and skim() revealed the data structure:
Variable Types: It was confirmed that the identification columns (order.id, customer.id, and product.id) were likely numeric but needed to be treated as categorical. The key variables (such as prices or quantities of frames) have correct data types.
Missing Values: No significant missing values (NA) were detected in the main variables, ensuring the integrity of the dataset.
Distribution: The numeric variables such as quantity, price, and revenue (price.ext) show consistent ranges without extreme or negative values, confirming that the dataset is clean and ready for further analysis.
bike_sales |> glimpse()
bike_sales |>
glimpse()Rows: 15,644
Columns: 17
$ order.date <date> 2011-01-07, 2011-01-07, 2011-01-10, 2011-01-10, 20…
$ order.id <ord> 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, …
$ order.line <ord> 1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 1, 2, 3, 4, 1, 2, 3, …
$ quantity <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1, …
$ price <dbl> 6070, 5970, 2770, 5970, 10660, 3200, 12790, 5330, 1…
$ price.ext <dbl> 6070, 5970, 2770, 5970, 10660, 3200, 12790, 5330, 1…
$ customer.id <fct> 2, 2, 10, 10, 6, 6, 6, 6, 6, 22, 8, 8, 8, 8, 16, 16…
$ bikeshop.name <fct> Ithaca Mountain Climbers, Ithaca Mountain Climbers,…
$ bikeshop.city <fct> Ithaca, Ithaca, Kansas City, Kansas City, Louisvill…
$ bikeshop.state <fct> Ithaca Mountain Climbers, Ithaca Mountain Climbers,…
$ latitude <dbl> 42.44396, 42.44396, 39.11405, 39.11405, 38.25267, 3…
$ longitude <dbl> -76.50188, -76.50188, -94.62746, -94.62746, -85.758…
$ product.id <fct> 48, 52, 76, 52, 2, 50, 1, 4, 34, 26, 96, 66, 35, 72…
$ model <fct> Jekyll Carbon 2, Trigger Carbon 2, Beast of the Eas…
$ category.primary <fct> Mountain, Mountain, Mountain, Mountain, Road, Mount…
$ category.secondary <fct> Over Mountain, Over Mountain, Trail, Over Mountain,…
$ frame <fct> Carbon, Carbon, Aluminum, Carbon, Carbon, Carbon, C…
The final summary using skim() on the transformed data confirmed its quality:
Size: The dataset contains several hundred transactions, which is sufficient to identify patterns and trends.
Unique Entities: The data includes multiple unique customers and product models, allowing segmentation and category-level analysis.
Descriptive Conclusion: The main variables, such as price and revenue, show stable central tendencies. This provides a reliable foundation for exploring relationships between sales categories and customer preferences.
The summary statistics generated by the skim() function show that the main numeric variables (quantity, price, and price.ext) have realistic means and medians, indicating that the dataset represents actual sales behavior. The quantity variable displays a right-skewed distribution, meaning that most transactions involve a small number of units sold, while a few have higher quantities. The variable price shows moderate variability across models, reflecting different product segments (from entry-level to high-end bicycles). Overall, these results confirm that the dataset is consistent and valid for marketing analysis.
bike_sales |> skim()
bike_sales |>
skim()| Name | bike_sales |
| Number of rows | 15644 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| factor | 11 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| order.date | 0 | 1 | 2011-01-07 | 2015-12-25 | 2013-08-25 | 962 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| order.id | 0 | 1 | TRUE | 2000 | 127: 30, 191: 30, 324: 30, 326: 30 |
| order.line | 0 | 1 | TRUE | 30 | 1: 2000, 2: 1534, 3: 1255, 4: 1102 |
| customer.id | 0 | 1 | FALSE | 30 | 10: 2731, 8: 1801, 16: 1086, 2: 975 |
| bikeshop.name | 0 | 1 | FALSE | 30 | Kan: 2731, Den: 1801, Pho: 1086, Ith: 975 |
| bikeshop.city | 0 | 1 | FALSE | 30 | Kan: 2731, Den: 1801, Pho: 1086, Ith: 975 |
| bikeshop.state | 0 | 1 | FALSE | 30 | Kan: 2731, Den: 1801, Pho: 1086, Ith: 975 |
| product.id | 0 | 1 | FALSE | 97 | 40: 229, 7: 227, 65: 221, 72: 215 |
| model | 0 | 1 | FALSE | 97 | Sli: 229, Sup: 227, Sca: 221, F-S: 215 |
| category.primary | 0 | 1 | FALSE | 2 | Mou: 8051, Roa: 7593 |
| category.secondary | 0 | 1 | FALSE | 9 | Eli: 3472, Cro: 2852, End: 2586, Tra: 2020 |
| frame | 0 | 1 | FALSE | 2 | Car: 8304, Alu: 7340 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| quantity | 0 | 1 | 8.57 | 52.36 | 1.00 | 1.00 | 1.00 | 1.00 | 1.000e+03 | ▇▁▁▁▁ |
| price | 0 | 1 | 3521.11 | 2667.19 | 415.00 | 1840.00 | 2700.00 | 4260.00 | 1.279e+04 | ▇▅▁▁▁ |
| price.ext | 0 | 1 | 30169.88 | 229046.25 | 415.00 | 1950.00 | 3200.00 | 7000.00 | 1.066e+07 | ▇▁▁▁▁ |
| latitude | 0 | 1 | 37.94 | 4.59 | 25.76 | 35.47 | 39.11 | 40.44 | 4.761e+01 | ▂▂▃▇▁ |
| longitude | 0 | 1 | -95.05 | 13.42 | -122.68 | -104.99 | -94.63 | -83.74 | -7.141e+01 | ▂▅▇▅▅ |
The category.secondary variable includes the main types of bicycles, such as mountain, road, and touring. The count table shows that certain categories, like mountain bikes, tend to dominate the dataset, which may indicate a higher consumer demand for this product type. The frame variable analysis shows that aluminum frames appear most frequently, followed by carbon and steel. This suggests that aluminum frames are the preferred material for most customers due to their balance between durability and cost.
The combination of category.secondary and frame reveals interesting patterns: high-end models like road bikes are more often associated with carbon frames, while entry-level categories use aluminum. These relationships help to understand product positioning within the market.
Finally, the analysis of bikeshop.name counts indicates which shops generate more transactions. A small number of stores account for a large portion of sales, suggesting that some distributors play a key role in total revenue generation.
category_secondary <- bike_sales |>
count(category.secondary) |>
select(category.secondary)
category_secondary |>
datatable(
colnames = c("Category secondary" = "category.secondary")
)frame <- bike_sales |>
count(frame) |>
select(frame)
frame |>
datatable(
colnames = c("Frame" = "frame")
)category_secondary_frame <- bike_sales |>
count(category.secondary,
frame) |>
select(category.secondary, frame)
category_secondary_frame |>
datatable(
colnames = c("Category secondary" = "category.secondary",
"Frame" = "frame")
)The cross-tabulation between category.secondary and frame confirms that product materials and styles are closely related. This information is useful for designing targeted marketing strategies focused on customer preferences and price sensitivity.
bike_shops <- bike_sales |>
count(bikeshop.name,
sort = TRUE)
bike_shops |>
datatable(
colnames = c("Bike shop" = "bikeshop.name",
"Transactions" = "n")
)Data visualization is essential for identifying trends and verifying assumptions about the dataset. The following histograms and boxplots display how variables such as quantity, price, and revenue behave across categories.
The histogram illustrates the distribution of the quantity variable:
The histogram of the quantity variable shows that most transactions involve one or two bicycles, indicating that sales are mainly individual rather than bulk purchases. This pattern is typical for retail-oriented businesses rather than wholesale operations. A small number of higher-quantity transactions may correspond to corporate or promotional purchases.
bike_sales |>
ggplot() +
geom_histogram(aes(x = quantity),
color = "black",
fill = "steelblue",
bins = 30) +
labs(x = "Units sold",
y = "Frequency",
subtitle = "Distribution of units sold")The histogram of prices demonstrates a right-skewed distribution, where most products are in a mid-price range, but a few high-end models have significantly higher prices. This shows a typical market structure with a broad customer base for affordable bicycles and a smaller niche for premium models. The variation in prices also suggests opportunities for price-based segmentation strategies.
bike_sales |>
ggplot() +
geom_histogram(aes(x = price),
color = "black",
fill = "steelblue",
bins = 30) +
labs(x = "Price (US Dollars)",
y = "Frequency",
subtitle = "Distribution of frame prices")The boxplots are useful for identifying differences in revenue distribution across categories and materials.
The first boxplot by category.secondary shows that revenue varies widely among product types. Categories like mountain and road bikes tend to generate higher revenues, indicating that they are the most profitable segments for the business. The presence of outliers suggests that some models or transactions significantly exceed the average performance, possibly due to premium pricing or large orders.
bike_sales |>
mutate(category.secondary = fct_reorder(.f = category.secondary,
.x = price.ext,
.fun = median)) |>
ggplot() +
geom_boxplot(aes(x = price.ext,
y = category.secondary),
fill = "steelblue") +
scale_x_continuous(labels = label_currency(),
transform = "log10") +
labs(x = "Revenue (US Dollars)",
y = "Category secondary",
subtitle = "Distribution of revenue by category secondary")The second boxplot by frame material shows that carbon frames usually generate higher revenues than aluminum or steel, which aligns with their higher price level. Aluminum frames show a moderate revenue level, indicating their strong presence in mid-range products. This visualization helps to confirm that frame material is an important factor influencing total revenue
bike_sales |>
ggplot() +
geom_boxplot(aes(x = price.ext,
y = frame),
fill = "steelblue") +
scale_x_continuous(labels = label_currency(),
transform = "log10") +
labs(x = "Revenue (US Dollars)",
y = "Frame material",
subtitle = "Distribution of revenue by frame material")bikeshop.name, that generate more revenue, price.ext, to the corporation?The analysis shows that a small group of bike shops contributes the majority of total revenue. These shops likely have stronger customer bases, better geographic locations, or more effective sales strategies. Understanding which shops perform best can guide marketing decisions such as regional promotions or incentive programs for top-performing retailers.
bike_sales_revenue_by_cat2 <- bike_sales |>
group_by(bikeshop.name) |>
summarise(revenue = sum(price.ext)) |>
mutate(bikeshop.name = fct_reorder(.f = bikeshop.name,
.x = revenue))
bike_sales_revenue_by_cat2 |>
ggplot() +
geom_col(aes(x = revenue,
y = bikeshop.name),
color = "black",
fill = "steelblue") +
scale_x_continuous(label = label_currency()) +
labs(x = "Revenue (US Dollars)",
y = "Bike shops",
subtitle = "Revenue by bike shops")category.secondary, that generate more revenue, price.ext, to the corporation?The results show that mountain and road categories generate the highest total revenues, confirming their importance in the company’s product portfolio. These categories likely attract the most active customers who value performance and brand reputation. In contrast, touring and comfort bikes have lower revenues, representing smaller but potentially stable market segments. This insight is valuable for deciding where to focus marketing investments and product development.
bike_sales_revenue_by_frame <- bike_sales |>
group_by(category.secondary) |>
summarise(revenue = sum(price.ext)) |>
mutate(category.secondary = fct_reorder(.f = category.secondary,
.x = revenue))
bike_sales_revenue_by_frame |>
ggplot() +
geom_col(aes(x = revenue,
y = category.secondary),
fill = "steelblue",
color = "black") +
scale_x_continuous(labels = label_currency()) +
labs(x = "Revenue (US Dollars)",
y = "Category secondary",
subtitle = "Revenue by category secondary")