This week’s data dive analyzes the factors that influence bike purchases. I explore several relationships within the bike sales dataset, with an emphasis on data documentation, detailed analysis, and referring back to the documentation for the data I am using. As in last week’s data dive, the aim is to uncover insights that can guide strategic decisions and to show the value of rigorous data examination.
As usual, I will load the dataset and perform the necessary pre-analysis steps to prepare the data for the data dive. The bike sales dataset is loaded below; the libraries used in this analysis were already loaded in the setup code chunk above.
# Dataset loading
bike_data <- read.csv("bike_data.csv")
This step involves preparing the dataset for analysis by converting
data types and creating new variables. The Income column is
transformed from a string to a numeric type after removing the dollar
sign and commas. The Commute Distance is converted into
an ordered factor to reflect the ordinal nature of commute lengths.
Purchased Bike is re-factored from a categorical to a
numeric binary variable to facilitate the correlation analysis.
Lastly, Income Level is derived from income quartiles,
creating an ordered factor that categorizes income into four levels:
Low, Medium, High, and
Very High. These transformations are crucial for enabling
our statistical analysis and ensuring data types align with the analytic
methods used.
# Converting Income to numeric by removing the dollar sign and commas
bike_data$Income <- as.numeric(gsub("\\$", "", gsub(",", "", bike_data$Income)))
# Categorize 'Commute Distance' into ordered factor
bike_data$`Commute.Distance` <- factor(bike_data$`Commute.Distance`,
levels = c("0-1 Miles", "2-5 Miles", "5-10 Miles", "10+ Miles"),
ordered = TRUE)
# Convert 'Purchased Bike' to numeric (1 for Yes, 0 for No)
bike_data$`Purchased.Bike_Numeric` <- ifelse(bike_data$`Purchased.Bike` == "Yes", 1, 0)
# Create an ordered Income Level variable based on quartiles
bike_data$Income_Level <- cut(bike_data$Income,
breaks = quantile(bike_data$Income, probs = c(0, 0.25, 0.5, 0.75, 1)),
labels = c("Low", "Medium", "High", "Very High"),
include.lowest = TRUE,
ordered_result = TRUE) # return an ordered factor, as described above
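As a quick sanity check on the transformations above, the sketch below applies the same income parsing and quartile binning to a small made-up data frame. The `toy` data frame and its values are assumptions for illustration only; they are not rows from the bike sales dataset.

```r
# Toy data standing in for bike_data (values are made up for illustration)
toy <- data.frame(
  Income = c("$40,000", "$30,000", "$80,000", "$70,000",
             "$120,000", "$10,000", "$60,000", "$90,000"),
  stringsAsFactors = FALSE
)

# Strip "$" and "," in a single pass, then convert to numeric
toy$Income <- as.numeric(gsub("[$,]", "", toy$Income))

# Quartile-based income levels, kept as an ordered factor
toy$Income_Level <- cut(
  toy$Income,
  breaks = quantile(toy$Income, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
  labels = c("Low", "Medium", "High", "Very High"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

# Sanity checks: no NAs were introduced by the conversion or the binning
stopifnot(!anyNA(toy$Income), !anyNA(toy$Income_Level))

# Count of rows per income level
table(toy$Income_Level)
```

Running the same `stopifnot()` checks against `bike_data` after the real transformations would catch income strings that fail to parse or values that fall outside the quartile breaks.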
I will now proceed to analyze the three pairs of variables, incorporating both created and existing variables, to explore their relationships and implications on bike purchasing behavior from our Bike sales dataset.
# Plotting Age vs. Income
ggplot(bike_data, aes(x = Age, y = Income)) +
geom_point() +
theme_minimal() +
labs(title = "Age vs. Income Level", x = "Age", y = "Income")
The Age vs. Income Level scatter plot above shows no clear linear
relationship between Age and Income in this dataset. Given this, we can
make the following observations:
In terms of statistical analysis, we might need to delve deeper, possibly segmenting the data by other factors or conducting regression analysis while controlling for other variables to understand the complexities of the relationship between age and income.
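To illustrate what "controlling for other variables" could look like, here is a minimal sketch using simulated data. The `sim` data frame, the `Education` column, and the data-generating formula are all assumptions made for this example; nothing here is drawn from the bike sales dataset or its results.

```r
set.seed(42)  # simulated data only; not the bike sales dataset
n <- 200
sim <- data.frame(
  Age = sample(25:80, n, replace = TRUE),
  Education = factor(sample(c("High School", "Bachelors", "Graduate"),
                            n, replace = TRUE))
)

# Hypothetical data-generating process, chosen only so the model has something to fit
sim$Income <- 20000 + 300 * sim$Age +
  15000 * (sim$Education == "Graduate") + rnorm(n, sd = 10000)

# Linear regression of Income on Age while controlling for Education
fit <- lm(Income ~ Age + Education, data = sim)

# Coefficient table: the Age row is the age effect with Education held fixed
summary(fit)$coefficients
```

With the real data, the same call would take the form `lm(Income ~ Age + ..., data = bike_data)`, with the control variables chosen from the columns documented for the dataset.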
OUTLIERS: The observations on
Income vs Age Group suggest a general increase in income
with age. Outliers are present; they start to appear close to the 80
age mark (before and after), indicating individuals with significantly
higher or lower incomes within age groups.
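The outliers above were identified by eye. One common way to flag them numerically is Tukey's 1.5 × IQR rule, sketched below. The helper `flag_outliers()` and the `income` vector are hypothetical and exist only for this illustration.

```r
# Flag values lying more than 1.5 * IQR beyond the quartiles (Tukey's rule)
flag_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up incomes with one obviously extreme value
income <- c(30000, 40000, 45000, 50000, 55000, 60000, 170000)

# Values flagged as outliers
income[flag_outliers(income)]
```

On the real data this could be applied as `bike_data$Income[flag_outliers(bike_data$Income)]`, or separately within age bands to match the "within age groups" reading above.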
# Plotting Number of Children vs Bike Purchase
ggplot(bike_data, aes(x = Children, y = `Purchased.Bike_Numeric`)) +
geom_jitter(alpha=0.5) +
theme_minimal() +
labs(title = "Number of Children vs Bike Purchase", x = "Number of Children", y = "Bike Purchase")
The plot Number of Children vs Bike Purchase displays
the relationship between the number of children in each family and
the decision to purchase a bike, with 1 (Yes) indicating a purchase and
0 (No) indicating no purchase.
Moving on, we can gather the following from the
Household Size and Bike Purchase Relationship
plot:
No Strong Pattern: There is no strong visual pattern suggesting a clear relationship between the number of children and bike purchase decisions. Individuals with any number of children appear just as likely to purchase a bike as not.
Data Distribution: The spread of data points
across the Number of Children axis is fairly consistent,
indicating that bike purchasing decisions are similarly distributed
regardless of family size.
Variability in Purchases: There is considerable overlap in bike purchase decisions for individuals with 0 to around 4 children, suggesting that the number of children is not a determining factor in the decision to purchase a bike.
However, we cannot reach a firm conclusion without examining the data more closely, as this plot does not provide clear evidence of a direct or simple relationship between the number of children and the likelihood of purchasing a bike.
OUTLIERS: In this plot, there do not appear to be any clear outliers in terms of the number of children; the data points are spread across a range but cluster around certain counts of children. There is also no clear visual indication that the number of children in a household significantly influences the likelihood of purchasing a bike; the distribution of purchases versus non-purchases seems relatively similar across different numbers of children.
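Because a jittered 0/1 plot makes small differences hard to see, a table of purchase rates per number of children can complement it. The sketch below uses a made-up `toy` data frame whose column names mirror the ones used above; the values are illustrative and not taken from the dataset.

```r
# Toy data: number of children and a 0/1 purchase indicator (made-up values)
toy <- data.frame(
  Children = c(0, 0, 1, 1, 2, 2, 3, 3),
  Purchased.Bike_Numeric = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Mean of a 0/1 variable is the share of purchasers within each group
rate_by_children <- tapply(toy$Purchased.Bike_Numeric, toy$Children, mean)
rate_by_children
```

Applied to `bike_data`, the same `tapply()` call would show whether the purchase rate drifts up or down as the number of children grows.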
# Removing rows with missing values in 'Commute Distance' or 'Purchased Bike'
bike_data_clean <- bike_data %>%
filter(!is.na(`Commute.Distance`) & !is.na(`Purchased.Bike_Numeric`))
# Plotting with the cleaned data (This will be temporary)
ggplot(bike_data_clean, aes(x = `Commute.Distance`, y = `Purchased.Bike_Numeric`)) +
geom_jitter(width = 0.2, height = 0.1) +
theme_minimal() +
labs(title = "Commute Distance vs Bike Purchase Decision", x = "Commute Distance", y = "Bike Purchase")
The plot above displays the distribution of individuals’ decisions to purchase a bike against their commute distances. Each point represents an individual’s decision within one of the commute distance categories (0-1 miles, 2-5 miles, 5-10 miles, and 10+ miles).
We can draw the following insights from the plot:
Bike Purchase Distribution: There appears to be a roughly equal distribution of bike purchase decisions across all commute distances. This indicates that the decision to purchase a bike might not be strongly dependent on how far an individual commutes.
No Clear Trend: There is no clear trend indicating that a particular commute distance category has a higher or lower likelihood of bike purchases. This lack of trend suggests that factors other than commute distance may have a more significant influence on the decision to purchase a bike.
Data Spread: The jitter in the plot helps to avoid over-plotting and shows the spread of data points within each commute distance category, indicating variability in bike purchase decisions among individuals with similar commutes.
Commute Distance Ranges: The
Commute Distance variable has distinct groups, possibly
discrete distances at which the data was recorded.
OUTLIERS: Before discussing outliers, I noticed that, on closer inspection, the plot appears to show the number of bike purchases decreasing as commute distance increases. As for outliers, none appear in this plot.
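Since the commute categories are ordered, one way to check whether purchases really fall off with distance is to compare purchase shares per category and test for a trend. The sketch below uses made-up counts (`purchases`, `totals`) purely for illustration; `prop.trend.test()` from base R's stats package is one possible choice here, not necessarily the method the analysis should settle on.

```r
# Made-up counts per commute category (not taken from the bike sales dataset)
commute_levels <- c("0-1 Miles", "2-5 Miles", "5-10 Miles", "10+ Miles")
purchases <- c(60, 50, 40, 30)      # buyers in each category
totals    <- c(100, 100, 100, 100)  # respondents in each category

# Purchase share per category, in commute order
shares <- setNames(purchases / totals, commute_levels)
shares

# Chi-squared test for a linear trend in proportions across the ordered groups
trend <- prop.trend.test(purchases, totals)
trend$p.value
```

With the real data, the counts could be obtained from `table(bike_data_clean$Commute.Distance, bike_data_clean$Purchased.Bike)`.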
# Age vs Income Level
suppressWarnings({
age_income_corr <- cor.test(~ Age + as.numeric(Income_Level), data = bike_data, method = "spearman")
print(age_income_corr)
})
##
## Spearman's rank correlation rho
##
## data: Age and as.numeric(Income_Level)
## S = 135833042, p-value = 3.772e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1850009
# Household Size and Bike Purchase Relationship
suppressWarnings({
children_bike_corr <- cor.test(bike_data$Children, bike_data$`Purchased.Bike_Numeric`, method = "spearman")
print(children_bike_corr)
})
##
## Spearman's rank correlation rho
##
## data: bike_data$Children and bike_data$Purchased.Bike_Numeric
## S = 184149783, p-value = 0.000893
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1048998
# Make 'Commute Distance' an ordered factor
bike_data$`Commute.Distance` <- factor(bike_data$`Commute.Distance`, ordered = TRUE,
levels = c("0-1 Miles", "2-5 Miles", "5-10 Miles", "10+ Miles"))
# Convert 'Commute Distance' to a numeric vector
bike_data$Commute.Distance_Numeric <- as.numeric(bike_data$`Commute.Distance`)
suppressWarnings({
commute_bike_corr <- cor.test(bike_data$Commute.Distance_Numeric, bike_data$`Purchased.Bike_Numeric`, method = "spearman")
print(commute_bike_corr)
})
##
## Spearman's rank correlation rho
##
## data: bike_data$Commute.Distance_Numeric and bike_data$Purchased.Bike_Numeric
## S = 68631608, p-value = 0.005547
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1032623
Age vs Income Level: The positive Spearman’s rank correlation coefficient of approximately 0.185 indicates a weak positive relationship between age and income level. This suggests that as age increases, there is a slight tendency for income level to increase as well. The extremely low p-value (3.772e-09) indicates that the relationship is statistically significant: a correlation of this size would be very unlikely to arise by chance if age and income level were unrelated.
It is quite reasonable to expect that individuals’ income levels might increase with age due to factors like career progression, accumulation of experience, and advancements in positions that typically come with time in the workforce. However, the correlation is weak, suggesting that while there’s a general trend of increasing income with age, the relationship is not strong. This implies other factors also play significant roles in determining an individual’s income level beyond just age.
Household Size and Bike Purchase Relationship: The negative rank correlation coefficient of approximately -0.105 suggests a weak inverse relationship between household size, measured here by the number of children, and the decision to purchase a bike. Specifically, as the number of children increases, the likelihood of purchasing a bike is slightly lower. The p-value (0.000893) indicates this finding is statistically significant, suggesting the observed relationship is unlikely to be due to random chance.
Why might this make sense? Households with more children might face different transportation needs or financial priorities, possibly leading to a reduced likelihood of investing in bicycles. The visualization showed a scattered distribution of bike purchases across different numbers of children, with no strong pattern but a slight trend indicating fewer bike purchases in larger families.
Commute Distance and Bike Purchase Decision: The negative correlation coefficient of approximately -0.103 suggests a weak inverse relationship between commute distance (converted to a numeric scale) and the decision to purchase a bike. Individuals with longer commutes are slightly less likely to purchase a bike. The p-value (0.005547) indicates that this relationship is statistically significant, albeit weak.
This suggests that longer commutes might be less conducive to biking, either due to the inconvenience of biking long distances or because longer commutes often involve highway travel or areas less accessible by bike. The visualization shows a broad spread of bike purchases across commute lengths, with a subtle decrease in bike purchasing as commute length increases.
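For readers less familiar with the coefficients interpreted above, Spearman's rho is the Pearson correlation computed on ranks, with ties given average ranks. The sketch below shows this on made-up vectors `x` and `y` (illustrative values, not from the dataset). It also passes `exact = FALSE` to `cor.test()`, which requests the approximate p-value directly and so avoids the ties warning that would otherwise need `suppressWarnings()`.

```r
# Made-up values: ages and income levels coded 1-4 (with ties in y)
x <- c(25, 32, 47, 51, 62, 70)
y <- c(1, 2, 2, 3, 3, 4)

# Built-in Spearman correlation
rho_builtin <- cor(x, y, method = "spearman")

# Same quantity computed by hand: Pearson correlation of the average ranks
rho_manual <- cor(rank(x), rank(y))

all.equal(rho_builtin, rho_manual)

# Approximate (non-exact) p-value, appropriate when the data contain ties
res <- cor.test(x, y, method = "spearman", exact = FALSE)
res$estimate
```

Because the purchase indicator has only two values and the other variables have few distinct levels, ties are heavy in this dataset, which is worth keeping in mind when reading the rho values.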
Suppose we are interested in a 95% confidence interval for the proportion of individuals who purchased a bike. We can obtain it by running the analysis below:
p_hat <- mean(bike_data$`Purchased.Bike_Numeric`)
n <- length(bike_data$`Purchased.Bike_Numeric`)
z <- qnorm(0.975) # Z-score for 95% confidence
# Calculate standard error
se <- sqrt(p_hat * (1 - p_hat) / n)
# Confidence interval
ci_lower <- p_hat - z * se
ci_upper <- p_hat + z * se
# Display the confidence interval
ci_lower
## [1] 0.4500326
ci_upper
## [1] 0.5119674
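The interval computed above is the normal-approximation (Wald) interval. As a cross-check, the sketch below wraps the same formula in a hypothetical helper, `wald_ci()`, and compares it with two alternatives from base R: the Wilson score interval from `prop.test()` and the exact Clopper-Pearson interval from `binom.test()`. The counts `x = 481` and `n = 1000` are an assumption chosen because they reproduce the printed bounds; they were not read from the dataset itself.

```r
# Wald (normal-approximation) confidence interval for a proportion
wald_ci <- function(x, n, level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - level) / 2)
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Assumed counts (illustrative): 481 purchasers out of 1000 respondents
x <- 481
n <- 1000

wald   <- wald_ci(x, n)
wilson <- prop.test(x, n, correct = FALSE)$conf.int  # Wilson score interval
exact  <- binom.test(x, n)$conf.int                  # Clopper-Pearson interval

# Side-by-side comparison of the three intervals
rbind(wald = wald, wilson = as.numeric(wilson), exact = as.numeric(exact))
```

With a sample of this size and a proportion near 0.5, the three intervals should be very close, so the choice of method is unlikely to change the conclusion.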
Based on the calculated confidence interval, we can be 95% confident that the true proportion of individuals in the population who purchased a bike lies between approximately 45.0% and 51.2%.
To close this week’s data dive: the confidence interval provides insight into the population from which the sample was drawn by giving a range within which the true proportion likely falls. This information is useful for making informed decisions, understanding the population’s tendencies, and guiding further research or policy decisions related to bike purchasing behavior.