First, I loaded the CSV data into a tibble called bikes.
library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2
bikes <- read_csv("bike_data.csv")
## Rows: 1000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Marital Status, Gender, Income, Education, Occupation, Home Owner,...
## dbl (4): ID, Children, Cars, Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Then, I used the skim() function to summarize the numeric and categorical variables:
library(skimr)  # provides skim()
skim(bikes)
| Name | bikes |
| Number of rows | 1000 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Marital Status | 0 | 1 | 6 | 7 | 0 | 2 | 0 |
| Gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Income | 0 | 1 | 7 | 8 | 0 | 16 | 0 |
| Education | 0 | 1 | 9 | 19 | 0 | 5 | 0 |
| Occupation | 0 | 1 | 6 | 14 | 0 | 5 | 0 |
| Home Owner | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Commute Distance | 0 | 1 | 9 | 13 | 0 | 5 | 0 |
| Region | 0 | 1 | 6 | 13 | 0 | 3 | 0 |
| Age Brackets | 0 | 1 | 3 | 10 | 0 | 3 | 0 |
| Purchased Bike | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 19965.99 | 5347.33 | 11000 | 15290.75 | 19744 | 24470.75 | 29447 | ▇▇▇▇▇ |
| Children | 0 | 1 | 1.90 | 1.63 | 0 | 0.00 | 2 | 3.00 | 5 | ▇▃▂▂▂ |
| Cars | 0 | 1 | 1.44 | 1.13 | 0 | 1.00 | 1 | 2.00 | 4 | ▆▆▇▂▂ |
| Age | 0 | 1 | 44.16 | 11.36 | 25 | 35.00 | 43 | 52.00 | 89 | ▆▇▅▁▁ |
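Next, I counted the distinct values of Gender and Children, tabulated Gender, and totaled Children:
bikes %>%
  # reframe() permits multi-row results; table(Gender) returns one count per gender
  # (Children is already read as a double, so no conversion is needed)
  reframe(
    gender_n = n_distinct(Gender),
    gender_counts = table(Gender),
    children_n = n_distinct(Children),
    children_total = sum(Children)
  )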
## # A tibble: 2 × 4
## gender_n gender_counts children_n children_total
## <int> <table[1d]> <int> <dbl>
## 1 2 489 6 1898
## 2 2 511 6 1898
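This shows the number of distinct values and the per-gender counts for Gender, plus the number of distinct values and the total for Children (which is numeric rather than categorical). Because table(Gender) returns two counts, listed in alphabetical order of the gender labels, reframe() produces two rows and repeats the single-value summaries in each.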
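The repeated values in the two-row result can be avoided by separating the per-group counts from the single-value summaries. The following is a minimal sketch on a toy tibble (the data and the names toy, gender_counts, and children_summary are made up for illustration, not taken from the original analysis):

```r
library(dplyr)
library(tibble)

# Toy stand-in for the bikes tibble; values are invented for illustration
toy <- tibble(
  Gender   = c("Male", "Female", "Female", "Male", "Female"),
  Children = c(0, 2, 1, 3, 2)
)

# count() returns one row per gender and keeps the label next to its count
gender_counts <- toy %>%
  count(Gender, name = "n_customers")

# summarise() is the natural fit for summaries that yield a single value each
children_summary <- toy %>%
  summarise(
    children_n     = n_distinct(Children),
    children_total = sum(Children)
  )

gender_counts
children_summary
```

Keeping the two results separate means each table has one clear meaning per row, at the cost of producing two objects instead of one.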
Data Documentation Insights
The data tracks customer demographics (age, income, number of children, education level, etc.) along with whether each customer purchased a bike. This suggests the goal is to understand the target customer base.
It includes detailed segmentation data: region, commute distance ranges, and occupation categories. This indicates a desire to understand behavioral differences between subgroups.
Unique IDs are provided for each customer, which means the data could be joined with additional customer data from other systems, like service history or web analytics.
No direct marketing costs or channel details are included. To calculate ROI, connecting purchase events back to marketing touchpoints would be valuable.
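The point about unique IDs can be sketched with a join. This is only an illustration: the service_history table and its service_visits column are hypothetical, and the ID values are made up.

```r
library(dplyr)
library(tibble)

# Toy customer rows; IDs and values are illustrative only
customers <- tibble(
  ID = c(11000, 11001, 11002),
  `Purchased Bike` = c("Yes", "No", "Yes")
)

# Hypothetical table from another system, keyed by the same ID
service_history <- tibble(
  ID = c(11000, 11002),
  service_visits = c(2, 5)
)

# left_join() keeps every customer; customers without a match get NA
customers_enriched <- customers %>%
  left_join(service_history, by = "ID")

customers_enriched
```

A left join is used here so that no customer is dropped when the other system has no record for them.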
Project Goals
The data documentation shows this data was collected to analyze bike purchasing decisions. For example, we could look at how family status (marital status and number of children) affects the likelihood of purchase.
As noted above, a core goal seems to be understanding the target customer base and aligning product offerings with it, which means finding actionable customer sub-segments.
Differentiating variables such as occupation and commute distance suggest a desire to tailor messaging and products to consumer needs, so the analysis should support customization.
The extensive demographics indicate potential interest in propensity modeling, that is, predicting purchase probability per segment in order to prioritize high-probability targets.
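One common way to approach propensity modeling is logistic regression. The sketch below is an assumption about how that could look, not the method used in this analysis: the data is simulated, and the column names (income_k, cars, occupation, purchased) and the simulated relationship are invented for illustration.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in data; the columns and the relationship are invented
n <- 300
toy <- data.frame(
  income_k   = runif(n, 10, 170),            # income in thousands
  cars       = sample(0:4, n, replace = TRUE),
  occupation = sample(c("Clerical", "Management", "Manual",
                        "Professional", "Skilled Manual"),
                      n, replace = TRUE)
)
toy$purchased <- rbinom(
  n, 1, plogis(0.2 + 0.01 * toy$income_k - 0.5 * toy$cars)
)

# Logistic regression: models the log-odds of a purchase
fit <- glm(purchased ~ income_k + cars + occupation,
           data = toy, family = binomial())

# Predicted purchase probability per customer, then averaged per segment
toy$purchase_prob <- predict(fit, type = "response")
segment_scores <- aggregate(purchase_prob ~ occupation, data = toy, FUN = mean)

# Segments sorted from highest to lowest average predicted probability
segment_scores[order(-segment_scores$purchase_prob), ]
```

On the real data, the predictors would come from the bikes tibble, and the model should be checked on held-out rows before any segment is prioritized.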
Based on the initial overview, here are some questions to investigate further:
What factors predict whether someone purchases a bike? Income may be related, but are there other predictors, such as occupation?
Does bike purchasing behavior differ by region? The data has customers in Europe, North America, and the Pacific.
Is age correlated with income level? Age is unimodal but right-skewed (median 43, maximum 89), so a rank-based correlation is a reasonable starting point.
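Two of these questions translate directly into short computations. The sketch below uses a toy tibble with invented values, and it assumes that Purchased Bike is coded as "Yes"/"No" and that Income has already been converted to a number; both are assumptions rather than facts confirmed by the output above.

```r
library(dplyr)
library(tibble)

# Toy stand-in data; all values are invented for illustration
toy <- tibble(
  Region = c("Europe", "Europe", "Pacific", "Pacific",
             "North America", "North America"),
  `Purchased Bike` = c("Yes", "No", "Yes", "Yes", "No", "No"),
  Age    = c(30, 45, 52, 38, 61, 27),
  Income = c(40000, 60000, 90000, 70000, 80000, 30000)
)

# Share of customers in each region who purchased a bike
region_rates <- toy %>%
  group_by(Region) %>%
  summarise(purchase_rate = mean(`Purchased Bike` == "Yes"))

# Spearman's rank correlation is less sensitive to skew than Pearson's
age_income_rho <- cor(toy$Age, toy$Income, method = "spearman")

region_rates
age_income_rho
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why mean(`Purchased Bike` == "Yes") works as a purchase rate.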
I used aggregation to find the average income by occupation:
bikes %>%
  group_by(Occupation) %>%
  # Income is text; strip "$" and "," so it can be converted to numeric
  summarize(avg_income = mean(as.numeric(gsub("[$,]", "", Income))))
## # A tibble: 5 × 2
## Occupation avg_income
## <chr> <dbl>
## 1 Clerical 31073.
## 2 Management 86647.
## 3 Manual 16723.
## 4 Professional 75072.
## 5 Skilled Manual 51608.
Management has the highest average income (about $86,600), roughly $11,600 above Professional (about $75,100). Skilled Manual follows at about $51,600, while Clerical (about $31,100) and Manual (about $16,700) have the lowest averages.
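Income is read as a character column, so the aggregation has to strip the currency formatting before averaging. As a hedged alternative to a hand-written regular expression, readr's parse_number() handles this kind of cleanup. The example strings below are assumed formats (a leading "$" and comma grouping), not values copied from the file.

```r
library(readr)

# Assumed examples of how Income might be formatted in the CSV
income_raw <- c("$40,000", "$120,000", "$10,000")

# parse_number() drops the currency symbol and the grouping commas
income_num <- parse_number(income_raw)

# Equivalent base R approach: inside [], "$" and "," are matched literally
income_num_base <- as.numeric(gsub("[$,]", "", income_raw))

income_num
income_num_base
```

Both approaches give the same numbers here; parse_number() is a little more forgiving if other stray symbols appear around the digits.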
Next, I plotted the income distribution by bike purchase:
bikes %>%
  # Convert Income from text such as currency strings to numeric
  mutate(Income = as.numeric(gsub("[$,]", "", Income))) %>%
  ggplot(aes(Income, fill = `Purchased Bike`)) +
  geom_density(alpha = 0.5)
The densities show that customers who purchased bikes tend to have higher incomes. I then plotted average income by age bracket:
bikes %>%
  mutate(Income = as.numeric(gsub("[$,]", "", Income))) %>%
  group_by(`Age Brackets`) %>%
  summarize(avg_income = mean(Income)) %>%
  # fill colors the bars themselves; color would only change their outlines
  ggplot(aes(`Age Brackets`, avg_income, fill = `Age Brackets`)) +
  geom_col()
This shows that income tends to rise through middle age and then decline at retirement ages.
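Visual comparisons like the ones above are easier to trust when backed by a numeric summary. The following sketch computes mean and median income per purchase group on a toy tibble; the values are invented, and Income is assumed to be numeric already.

```r
library(dplyr)
library(tibble)

# Toy stand-in data; values are invented for illustration
toy <- tibble(
  `Purchased Bike` = c("Yes", "Yes", "Yes", "No", "No", "No"),
  Income = c(60000, 80000, 40000, 30000, 70000, 20000)
)

# Mean, median, and group size for buyers and non-buyers
income_by_purchase <- toy %>%
  group_by(`Purchased Bike`) %>%
  summarise(
    mean_income   = mean(Income),
    median_income = median(Income),
    n             = n()
  )

income_by_purchase
```

Reporting the median next to the mean helps show whether a few high incomes are driving the difference between groups.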
There are still many relationships left to analyze in this data, including occupation type, region, commute distance, and other factors that may influence bike purchasing. Still, this initial analysis has provided some useful insights.