Loading and Summarizing the Data:

First, I loaded the CSV data into a tibble called bikes.

bikes <- read_csv("bike_data.csv")
## Rows: 1000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Marital Status, Gender, Income, Education, Occupation, Home Owner,...
## dbl  (4): ID, Children, Cars, Age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
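The column-specification message above can be avoided by declaring the column types when loading. The following is a minimal sketch, assuming readr 2.0 or later (needed for literal data via `I()`). The inline CSV text and its values are invented for illustration, and only four of the fourteen columns are shown.

```r
library(readr)

# Hypothetical two-row extract; the values are invented for illustration.
csv_text <- paste(
  "ID,Marital Status,Income,Children",
  "1,Married,\"$40,000\",1",
  "2,Single,\"$30,000\",0",
  sep = "\n"
)

# Declaring col_types up front documents the expected schema and
# suppresses the column-specification message.
toy <- read_csv(
  I(csv_text),
  col_types = cols(
    ID = col_double(),
    `Marital Status` = col_character(),
    Income = col_character(),
    Children = col_double()
  )
)

toy
```

Keeping Income as character here mirrors how the real file was read; the conversion to numeric happens later in the analysis.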

Then, I used the skim() function to summarize the numeric and categorical variables:

skim(bikes)
Data summary
Name bikes
Number of rows 1000
Number of columns 14
_______________________
Column type frequency:
character 10
numeric 4
________________________
Group variables None

Variable type: character

skim_variable     n_missing  complete_rate  min  max  empty  n_unique  whitespace
Marital Status            0              1    6    7      0         2           0
Gender                    0              1    4    6      0         2           0
Income                    0              1    7    8      0        16           0
Education                 0              1    9   19      0         5           0
Occupation                0              1    6   14      0         5           0
Home Owner                0              1    2    3      0         2           0
Commute Distance          0              1    9   13      0         5           0
Region                    0              1    6   13      0         3           0
Age Brackets              0              1    3   10      0         3           0
Purchased Bike            0              1    2    3      0         2           0

Variable type: numeric

skim_variable  n_missing  complete_rate      mean       sd     p0       p25    p50       p75   p100  hist
ID                     0              1  19965.99  5347.33  11000  15290.75  19744  24470.75  29447  ▇▇▇▇▇
Children               0              1      1.90     1.63      0      0.00      2      3.00      5  ▇▃▂▂▂
Cars                   0              1      1.44     1.13      0      1.00      1      2.00      4  ▆▆▇▂▂
Age                    0              1     44.16    11.36     25     35.00     43     52.00     89  ▆▇▅▁▁

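The key insights from the skim: all 14 columns are complete (no missing values), Income is stored as text (16 distinct values) and needs conversion before any arithmetic, and Age ranges from 25 to 89 with a median of 43.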

A numeric summary - Gender and Number of Children:

bikes %>%
  mutate(Children = as.numeric(Children)) %>%
  reframe(
    gender_n = n_distinct(Gender),
    gender_counts = table(Gender),
    children_n = n_distinct(Children),
    children_total = sum(Children)
  )
## # A tibble: 2 × 4
##   gender_n gender_counts children_n children_total
##      <int> <table[1d]>        <int>          <dbl>
## 1        2 489                    6           1898
## 2        2 511                    6           1898

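This shows the counts for Gender, which is categorical (489 and 511 respondents, listed in `table()`'s alphabetical ordering of the gender labels), alongside Children, which is numeric (6 distinct values, 1,898 children in total). The single-value summaries repeat on both rows because `reframe()` recycles them to match the two gender counts.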
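A variation that keeps the gender labels attached to their counts is to use `count()` for the categorical column and a separate one-row `summarize()` for the numeric one. This is only a sketch on an invented toy tibble standing in for `bikes`, not the code used in this report.

```r
library(dplyr)

# Toy data standing in for `bikes`; the values are invented.
toy <- tibble(
  Gender = c("Female", "Male", "Male", "Female", "Male"),
  Children = c(0, 2, 1, 3, 0)
)

# One labelled row per gender.
gender_counts <- toy %>% count(Gender)

# A single-row summary for the numeric column.
children_summary <- toy %>%
  summarize(
    children_n = n_distinct(Children),
    children_total = sum(Children)
  )

gender_counts
children_summary
```

Splitting the two summaries avoids mixing a two-row result with one-row results in the same table.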

Data Documentation Insights

Project Goals

Questions for Further Analysis:

Based on the initial overview, here are some questions to investigate further:

Using Aggregation to Summarize Data:

I used aggregation to find the average income by occupation:

bikes %>%
  group_by(Occupation) %>%
  # Strip "$" and "," so Income can be converted to numeric
  summarize(avg_income = mean(as.numeric(gsub("[$,]", "", Income))))
## # A tibble: 5 × 2
##   Occupation     avg_income
##   <chr>               <dbl>
## 1 Clerical           31073.
## 2 Management         86647.
## 3 Manual             16723.
## 4 Professional       75072.
## 5 Skilled Manual     51608.

Management has the highest average income (about $86,600), roughly $11,600 above Professional (about $75,100). Skilled Manual follows at about $51,600, while Clerical (about $31,100) and Manual (about $16,700) have the lowest averages.
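The aggregation relies on converting the text Income column to numbers first. Here is a small base R sketch of that step, using made-up strings in the same "$40,000" style. Inside a regular-expression character class, `|` is a literal pipe rather than "or", so `[$,]` is sufficient to remove dollar signs and commas.

```r
# Hypothetical income strings in the same style as the Income column.
income_chr <- c("$40,000", "$120,000", "$10,000")

# Remove "$" and "," and then convert the remaining digits to numeric.
income_num <- as.numeric(gsub("[$,]", "", income_chr))

income_num
mean(income_num)
```

If a string contained any other non-numeric character, `as.numeric()` would return `NA` with a warning, which is a useful signal that the cleaning pattern needs extending.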

Visualizations:

First, I plotted income distribution by bike purchase:

bikes %>% 
  # Strip "$" and "," so Income can be plotted as a numeric variable
  mutate(Income = as.numeric(gsub("[$,]", "", Income))) %>%
  ggplot(aes(Income, fill = `Purchased Bike`)) +
  geom_density(alpha = 0.5)
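A density plot shows the shape of each group's income distribution but not its mean directly. A numeric companion to the plot could be sketched as follows, on an invented toy tibble that reuses the `Income` and `Purchased Bike` column names; the values are made up and the result says nothing about the real data.

```r
library(dplyr)

# Toy data standing in for `bikes`; the values are invented.
toy <- tibble(
  Income = c("$40,000", "$80,000", "$30,000", "$90,000"),
  `Purchased Bike` = c("No", "Yes", "No", "Yes")
)

# Mean and median income per purchase group, plus group size.
income_by_purchase <- toy %>%
  mutate(Income = as.numeric(gsub("[$,]", "", Income))) %>%
  group_by(`Purchased Bike`) %>%
  summarize(
    mean_income = mean(Income),
    median_income = median(Income),
    n = n()
  )

income_by_purchase
```

Reporting the median next to the mean helps when the income distribution is skewed.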

The density plot suggests that respondents who purchased bikes tend to have higher incomes.

Next, I plotted age brackets versus average income:

bikes %>%
  mutate(Income = as.numeric(gsub("[$,]", "", Income))) %>%
  group_by(`Age Brackets`) %>%
  summarize(avg_income = mean(Income)) %>%
  # fill colors the bars themselves; color would only change their outline
  ggplot(aes(`Age Brackets`, avg_income, fill = `Age Brackets`)) +
  geom_col()

This shows that average income rises through middle age and then declines in the retirement-age bracket.
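Reading a trend across age brackets depends on the bars appearing in age order, and ggplot2 orders a character x-axis alphabetically by default. One way to control the order is to convert the bracket column to a factor with explicit levels. The sketch below uses hypothetical bracket labels ("Young", "Middle", "Old") and invented incomes, since the actual labels are not shown in this report.

```r
library(dplyr)
library(ggplot2)

# Hypothetical bracket labels and incomes; both are invented.
toy <- tibble(
  bracket = c("Old", "Young", "Middle", "Young", "Middle", "Old"),
  income = c(50000, 30000, 70000, 40000, 80000, 60000)
)

plot_data <- toy %>%
  # Explicit levels put the bars in age order instead of alphabetical order.
  mutate(bracket = factor(bracket, levels = c("Young", "Middle", "Old"))) %>%
  group_by(bracket) %>%
  summarize(avg_income = mean(income))

p <- ggplot(plot_data, aes(bracket, avg_income, fill = bracket)) +
  geom_col()

p
```

Without the `factor()` step, these toy labels would be drawn as Middle, Old, Young, which makes a rise-then-fall pattern harder to read.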

Many relationships in this data remain to be analyzed, including occupation type, region, commute distance, and other factors that may influence bike purchasing. Still, this initial analysis has provided me with some useful insights.
