This week’s data dive applies hypothesis testing to the bike sales dataset. After exploring this dataset over the past few weeks, I have sometimes come to a halt, confused and lost in my analysis process, and I came across various questions that I could not answer immediately. I will try to address some of those questions in this week’s data dive.
This week’s data dive covers the following tasks:
As required, I devised two null hypotheses based on two different aspects (i.e., columns) of my data. For each hypothesis, I chose an alpha level, a power level, and a minimum effect size, and I explained why I chose each value.
I then determined whether I have enough data to perform a Neyman-Pearson hypothesis test by calculating the required sample size, performed the test, and interpreted the results. I also performed a Fisher-style significance test on the same hypothesis and interpreted the p-value. In the end, I had two tests for each hypothesis, for a total of four tests.
Finally, I built two visualizations that best illustrate my results, one for each null hypothesis, explained the insights, and made my recommendations. For each task, I described the insights gathered, their significance, and any further questions that might need to be investigated in a subsequent data dive.
Before we can generate the hypotheses, it’s crucial to understand the dataset structure, variables, and potential relationships within the bike sales dataset. I started by loading the dataset and performing an initial exploration.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
bike_data <- read_csv("bike_data.csv", show_col_types = FALSE)
head(bike_data)
## # A tibble: 6 × 15
## ID `Marital Status` Gender Income Children Education Occupation
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 12496 Married Female 40000 1 Bachelors Skilled Manual
## 2 24107 Married Male 30000 3 Partial College Clerical
## 3 14177 Married Male 80000 5 Partial College Professional
## 4 24381 Single Male 70000 0 Bachelors Professional
## 5 25597 Single Male 30000 0 Bachelors Clerical
## 6 13507 Married Female 10000 2 Partial College Manual
## # ℹ 8 more variables: `Home Owner` <chr>, Cars <dbl>, `Commute Distance` <chr>,
## # Region <chr>, Age <dbl>, `Age Brackets` <chr>, `Purchased Bike` <chr>,
## # `Bike Sold` <dbl>
summary(bike_data)
## ID Marital Status Gender Income
## Min. :11000 Length:1000 Length:1000 Min. : 10000
## 1st Qu.:15291 Class :character Class :character 1st Qu.: 30000
## Median :19744 Mode :character Mode :character Median : 60000
## Mean :19966 Mean : 56360
## 3rd Qu.:24471 3rd Qu.: 70000
## Max. :29447 Max. :170000
## Children Education Occupation Home Owner
## Min. :0.000 Length:1000 Length:1000 Length:1000
## 1st Qu.:0.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.898
## 3rd Qu.:3.000
## Max. :5.000
## Cars Commute Distance Region Age
## Min. :0.000 Length:1000 Length:1000 Min. :25.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:35.00
## Median :1.000 Mode :character Mode :character Median :43.00
## Mean :1.442 Mean :44.16
## 3rd Qu.:2.000 3rd Qu.:52.00
## Max. :4.000 Max. :89.00
## Age Brackets Purchased Bike Bike Sold
## Length:1000 Length:1000 Min. :0.000
## Class :character Class :character 1st Qu.:0.000
## Mode :character Mode :character Median :0.000
## Mean :0.481
## 3rd Qu.:1.000
## Max. :1.000
sum(is.na(bike_data))
## [1] 0
# Exploring my distribution of key variables
ggplot(bike_data, aes(x = Age)) +
geom_bar()
ggplot(bike_data, aes(x = Income)) +
geom_bar()
hist(bike_data$Age)
hist(bike_data$Income)
# Correlation between variables if at all applicable
cor(bike_data$Age, bike_data$Income)
## [1] 0.1700767
Both plots of age (the ggplot bar chart and the base R histogram) show a similar pattern: most of the data points are concentrated around the middle of the age range, with counts falling off as age increases or decreases, indicating a roughly bell-shaped distribution. The bar chart appears to have multiple peaks, which suggests that some age groups or cohorts occur more frequently than others.
The distribution is approximately normal, but with some irregularities that might suggest multiple modes.
There is a sharp drop in counts past the age of around 60, which indicates that there are fewer customers in that age range in this dataset.
The income distribution histograms display a right-skewed distribution, indicating that a majority of the individuals have an income in the lower to middle range, with very few high-income earners.
The distribution of income is not normal as it is skewed to the right.
There are significantly fewer individuals with high incomes.
Based on the exploration of the dataset and the visualizations above, here are two null hypotheses to test in this data dive to fulfill the requirements of this assignment:
Null Hypothesis (H0): There is no difference in bike sale frequency across different age groups. Testing this requires bike sales to be recorded alongside age, which the Bike Sold and Age columns provide.
Alpha level: 0.05. This is the conventional threshold, and it caps the probability of a Type I error (rejecting a true null hypothesis) at 5%.
Power level: 0.80. This keeps the probability of a Type II error (missing a real difference) at 20%, a common balance between Type I and Type II error rates.
Minimum effect size: This should be based on what difference in bike sales would be considered meaningful. For the purpose of this hypothesis, I will assume that a difference of 50 bikes sold between groups is meaningful for our analysis.
# Assuming my dataset has a variable 'Bike Sold' for the number of bikes sold.
age_groups <- cut(bike_data$Age, breaks = c(0, 30, 60, 90), include.lowest = TRUE, right = FALSE)
age_sales_anova <- aov(`Bike Sold` ~ age_groups, data = bike_data)
summary(age_sales_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## age_groups 2 3.88 1.9411 7.875 0.000404 ***
## Residuals 997 245.76 0.2465
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Key Insights
Statistical Significance: The p-value of 0.000404 is less than the common significance level of 0.05. This means the result is statistically significant. I reject the null hypothesis that there’s no difference in bike sales across different age groups.
Effect of Age: The ANOVA suggests that a person’s age group has a significant effect on the likelihood of purchasing a bike.
Unexplained Variation: While age is important, a large portion of the variation in bike sales is still unexplained, as seen in the Residuals row. Other factors might also influence bike purchasing decisions.
In conclusion for this first hypothesis, we can see that ANOVA tells us there is a difference, but it does not tell us the direction of that difference in this hypothesis test.
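One way to recover the direction of the difference is to look at the group means and run a post-hoc comparison. The sketch below uses Tukey's HSD from base R on a synthetic stand-in for the dataset. The data frame `toy`, its column names, and the purchase probabilities are all hypothetical and exist only so the example runs without the CSV; they are not results from the bike sales data.

```r
# Synthetic stand-in for bike_data: the purchase probabilities are
# hypothetical and chosen only for illustration.
set.seed(42)
n_per_group <- 200
toy <- data.frame(
  age_group = factor(rep(c("[0,30)", "[30,60)", "[60,90]"), each = n_per_group)),
  bike_sold = c(rbinom(n_per_group, 1, 0.40),
                rbinom(n_per_group, 1, 0.50),
                rbinom(n_per_group, 1, 0.30))
)

# Group means show the direction of any difference
# (here: the proportion of customers who bought a bike)
group_means <- tapply(toy$bike_sold, toy$age_group, mean)

# Tukey's HSD reports each pairwise difference with an adjusted
# confidence interval and p-value
toy_anova <- aov(bike_sold ~ age_group, data = toy)
posthoc <- TukeyHSD(toy_anova)

print(group_means)
print(posthoc$age_group)
```

On the real data, the equivalent call would presumably be `TukeyHSD(age_sales_anova)`, since that object is a fitted `aov` model with a single factor.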
Null Hypothesis (H0): Bike sales are independent of the customer’s income level.
Alpha level: 0.05, to maintain consistency.
Power level: 0.80, to ensure that the hypothesis test is robust against Type II errors.
Minimum effect size: As with age, I assume that a difference of 50 bikes sold between groups is meaningful for the income level analysis.
# Assuming my dataset has a variable 'Bike Sold' for the number of bikes sold.
income_groups <- cut(bike_data$Income, breaks = c(0, 50000, 100000, 150000, Inf), include.lowest = TRUE, right = FALSE)
income_sales_anova <- aov(`Bike Sold` ~ income_groups, data = bike_data)
summary(income_sales_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## income_groups 3 0.7 0.2325 0.93 0.425
## Residuals 996 248.9 0.2499
Key Insights
No Statistical Significance: The p-value of 0.425 is greater than the typical significance level of 0.05. This means that I fail to reject the null hypothesis. There is not enough evidence to conclude that there is a statistically significant difference in bike sales across different income groups.
Unexplained Variation: Most of the variation in bike sales remains unexplained by income, as seen in the large Sum of Squares for Residuals. This suggests that other factors are likely more important in determining bike purchases.
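The "unexplained variation" point can be made concrete with eta-squared, the share of total variation attributed to the grouping factor. The sketch below computes it from the sums of squares printed in the two ANOVA tables above. The helper `eta_sq` is a name introduced here for illustration, and because the inputs are the rounded values from the printed output, the results are approximate.

```r
# Sums of squares copied from the ANOVA tables above (rounded as printed)
ss_age_between    <- 3.88
ss_age_resid      <- 245.76
ss_income_between <- 0.7
ss_income_resid   <- 248.9

# Eta-squared: between-group sum of squares over total sum of squares
eta_sq <- function(ss_between, ss_resid) {
  ss_between / (ss_between + ss_resid)
}

eta_age    <- eta_sq(ss_age_between, ss_age_resid)
eta_income <- eta_sq(ss_income_between, ss_income_resid)

print(round(c(age = eta_age, income = eta_income), 4))
```

Both values are small: age groups account for roughly 1.6% of the variation in Bike Sold and income groups for roughly 0.3%, which is consistent with the large residual sums of squares in both tables.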
Now let me elaborate on the limitations of ANOVA for this particular scenario. As mentioned earlier, it only tells us whether there is a difference, not the nature of that difference. That leads to the next question: further analysis is needed to explore whether there are any trends within the different income groups, before I end with some visualizations.
Furthermore, other factors likely play a larger role in influencing bike purchases, e.g., weather, lifestyle, location (urban vs. rural), and interest in cycling.
To determine whether I have enough data, I will use a power analysis tool, the pwr package. The following code snippet calculates the sample size needed in each group to achieve the desired power level.
library(pwr)
# Hypothesis 1: Comparing two groups' means
effect_size <- 0.5
pwr_result <- pwr.t.test(d = effect_size, power = 0.80, sig.level = 0.05, type = "two.sample")
pwr_result
##
## Two-sample t test power calculation
##
## n = 63.76561
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
If the required sample size is less than or equal to the number of observations available in each group, then I have enough data. If it is larger, then I do not have enough data.
So, to have an 80% chance (power of 0.80) of detecting a medium-sized difference (Cohen’s d of 0.5) between the means of two groups at a significance level of 0.05, I need approximately 64 observations per group. The dataset contains 1,000 rows in total, so this requirement is met as long as each group being compared has at least 64 customers.
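The same calculation can be cross-checked with base R, and extended to the three-group case that matches the ANOVA on age groups. This is a sketch under stated assumptions: it swaps in `power.t.test()` and `power.anova.test()` from the `stats` package instead of pwr, and the three-group effect size (Cohen's f = 0.25, a conventional "medium" value) is my assumption, not a value chosen in the analysis above.

```r
# Two-group case: same inputs as the pwr.t.test() call above.
# d = 0.5 is a standardized difference, so delta = 0.5 with sd = 1.
two_group <- power.t.test(delta = 0.5, sd = 1,
                          sig.level = 0.05, power = 0.80,
                          type = "two.sample", alternative = "two.sided")
n_two_group <- ceiling(two_group$n)

# Three-group case (assumption: Cohen's f = 0.25).
# power.anova.test() takes the variance of the group means, which
# relates to Cohen's f as between.var = f^2 * k / (k - 1) * within.var.
k <- 3
f <- 0.25
three_group <- power.anova.test(groups = k,
                                between.var = f^2 * k / (k - 1),
                                within.var = 1,
                                sig.level = 0.05, power = 0.80)
n_three_group <- ceiling(three_group$n)

print(c(two_group = n_two_group, three_group = n_three_group))
```

Comparing these per-group requirements against `table(age_groups)` would show whether every age group, including the smallest one, is large enough.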
Fisher’s significance testing is a slightly different philosophical approach to hypothesis testing: instead of pre-specifying an alpha level and making a reject or fail-to-reject decision, we interpret the p-value directly. The steps and code to perform the test are the same as for the Neyman-Pearson test.
However, the interpretation focuses on the p-value obtained and what it suggests about the strength of evidence against the null hypothesis, without a strict cutoff for significance.
The p-value is the probability of observing data at least as extreme as what was observed if the null hypothesis were true. A low p-value indicates that such an extreme result would be very unlikely under the null hypothesis, thus providing evidence against it. For the age hypothesis, p = 0.000404 is strong evidence against the null hypothesis of equal bike sales across age groups. For the income hypothesis, p = 0.425 provides little evidence against the null hypothesis that bike sales are independent of income.
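To make the definition concrete, the p-values in the two ANOVA tables can be recovered as upper-tail probabilities of the F distribution. The sketch below plugs in the F statistics and degrees of freedom as printed above, so small rounding differences from the reported p-values are expected.

```r
# F statistics and degrees of freedom copied from the ANOVA tables above
p_age    <- pf(7.875, df1 = 2, df2 = 997, lower.tail = FALSE)
p_income <- pf(0.93,  df1 = 3, df2 = 996, lower.tail = FALSE)

# Probability of an F statistic at least this large if the null were true
print(signif(c(age = p_age, income = p_income), 3))
```

Read in Fisher's sense, the first value says an F statistic this large would be very rare if age group made no difference, while the second says an F statistic of 0.93 is entirely ordinary when income makes no difference.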
library(ggplot2)
ggplot(bike_data, aes(x = age_groups, y = `Bike Sold`)) +
geom_bar(stat = "identity") +
labs(title = "Number of Bikes Sold by Age Group", x = "Age Group", y = "Bikes Sold")
This chart illustrates the number of bikes sold in three age groups: [0, 30), [30, 60), and [60, 90].
Observations:
The [30, 60) age group shows the highest number of bikes sold, substantially more than the other age groups.
The youngest age group, [0, 30), has far fewer sales than the [30, 60) group but more than the [60, 90] group.
The oldest age group, [60, 90], has the fewest sales, markedly lower than the other groups.
This could indicate that the middle-aged group is the primary market for bike sales, possibly due to a combination of disposable income and lifestyle choices that include fitness or environmentally friendly transportation options. The lower sales in the youngest age group might reflect financial constraints or alternative preferences for transportation or recreation. The decrease in the oldest age group could be related to reduced mobility or a different set of recreational preferences. Because the chart shows totals, part of the gap also reflects how many customers fall into each age group, not only how likely each group is to buy.
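Since the chart plots totals while the ANOVA compares group means, it can help to tabulate both side by side. The sketch below does this in base R on a tiny hand-made data frame. The data frame `toy` and every value in it are hypothetical, picked only to show that the group with the most sales need not have the highest purchase rate.

```r
# Hypothetical toy data: 4, 10, and 2 customers in the three age groups
toy <- data.frame(
  age_group = factor(rep(c("[0,30)", "[30,60)", "[60,90]"), times = c(4, 10, 2))),
  bike_sold = c(1, 1, 0, 1,
                1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
                0, 1)
)

# Customers, total bikes sold, and purchase rate per group
by_group <- data.frame(
  customers  = as.vector(table(toy$age_group)),
  total_sold = as.vector(tapply(toy$bike_sold, toy$age_group, sum)),
  row.names  = levels(toy$age_group)
)
by_group$purchase_rate <- by_group$total_sold / by_group$customers

print(by_group)
```

In this toy example the middle group has the most sales but the lowest purchase rate, which is why a rate column is worth checking before reading a totals chart as a statement about who is most likely to buy.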
library(ggplot2)
# Assumes the data is in a data frame called 'bike_data' with columns 'Income' and 'Bike Sold'
# Create a new factor variable in the dataframe with readable income level labels
bike_data$income_groups <- cut(bike_data$Income,
breaks = c(0, 5e+04, 1e+05, 1.5e+05, Inf),
labels = c("0-50K", "50K-100K", "100K-150K", "150K+"),
include.lowest = TRUE, right = FALSE)
ggplot(bike_data, aes(x = income_groups, y = `Bike Sold`)) +
geom_col(fill = "steelblue") +
labs(title = "Number of Bikes Sold by Income Level",
x = "Income Level",
y = "Bikes Sold") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart provided shows the number of bikes sold categorized by different income levels: 0-50K, 50K-100K, 100K-150K, and 150K+.
Observations:
The number of bikes sold to individuals in the 50K-100K income bracket is the highest among the categories displayed.
The 0-50K and 100K-150K income brackets show a lower number of bikes sold than the 50K-100K bracket, but the sales are still substantial.
Individuals in the highest income bracket (150K+) show the lowest number of bikes sold of all the groups.
Insights:
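In absolute terms, the data suggests that middle-income earners (specifically those in the 50K-100K bracket) are the primary purchasers of bikes. This may reflect a combination of having disposable income and a lifestyle that includes biking, whether for commuting or recreation. However, the ANOVA above found no statistically significant difference in bike sales across income groups (p = 0.425), so this pattern may largely reflect how many customers fall into each bracket.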
Despite presumably having more disposable income, the highest earners (150K+) are buying fewer bikes, which could indicate that this demographic may have access to or preference for other modes of transportation or leisure activities.
The lowest income group (0-50K) also represents a significant market segment, suggesting that bikes may be an essential mode of transportation for this group.
Further Analysis:
It would be useful to understand the types of bikes purchased across these income levels. For instance, are higher-income individuals buying more expensive bikes less frequently?
I will need to investigate if there are any regional or demographic factors that may influence these trends.
Also, I need to consider the impact of marketing efforts and sales channels on these figures, as there might be different preferences for how and where bikes are purchased across income levels.
The bar chart effectively communicates where the bulk of bike sales are occurring by income level, which can inform targeted marketing and sales strategies. It’s also important to consider the broader economic context, such as the cost of living or average income in the regions where these sales are taking place, to draw more nuanced conclusions.