Hypothesis Testing

In this data dive I will run two hypothesis tests. The first will utilize the Neyman-Pearson framework, and the second will utilize the Fisher’s Significance Testing framework. Visualizations will be constructed for each to illustrate the results.

Bike share systems consists of a network of bike docking stations spread throughout a region. For the system to function properly, each station must neither be empty of bicycles nor full; an empty station is a lost rider, and a full station prevents a rider from docking at their destination. Organizations manage their bike share system actively to ensure an optimal distribution of bikes across their network. This involves staffing personnel to use a vehicle to transport bikes around the network to meet demand. Based on demand for a particular day, staffing levels can fluctuate.

First, I will take the daily sum of rented bikes and inspect the distribution by day of the week. Only days where bike share is functioning during all hours will be included.

df <- SeoulBikeData |>
  group_by(date) |>
  summarize(all_functioning = all(functioning_day == "Yes"),
            rented_bikes = sum(rented_bikes[functioning_day == "Yes"])) |>
  filter(all_functioning) |>
  select(date, rented_bikes)
    
df$day_of_week <- weekdays(as.Date(df$date))

df |>
  ggplot(aes(x = day_of_week, y = rented_bikes)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Rented Bikes by Day of Week",
    x = "Day of Week",
    y = "Number of Rented Bikes"
  ) +
  theme_minimal()

Except Sundays, each day of the week has similar distributions of total bikes rented. Now let’s examine means by day of week.

#New column where value is 1 if Sunday
df <- df |>
  mutate(is_sunday = wday(date) == 1)

df_sun = df |>
  filter(day_of_week == "Sunday")
df_not_sun = df |>
  filter(day_of_week != "Sunday")

df |>
  group_by(day_of_week) |>
  summarize(mean = mean(rented_bikes))
## # A tibble: 7 × 2
##   day_of_week   mean
##   <chr>        <dbl>
## 1 Friday      18634 
## 2 Monday      17534.
## 3 Saturday    17389.
## 4 Sunday      15298.
## 5 Thursday    17240.
## 6 Tuesday     17887.
## 7 Wednesday   18479.

The average bikes rented on Sunday is atleast 12% lower than other days of the week. A bike share organization could reduce staff on Sunday if demand is truly lower, but is this just noise?

H0: The daily sum of rented bikes Sunday is the same as every other day of the week.
HA: The daily sum of rented bikes Sunday is less than every other day of the week.

To test this, I will use first evaluate the sample sizes then choose a test, alpha level, Type 2 Error, and minimum effect size.

Alpha Value

I chose an alpha value of 0.05. If a Type I error is made, where the null hypothesis is rejected but is actually true, then the organization would quickly tell due to changes in key performance indicators i.e. bike spread and availability. This means the risk is not very high, so a 0.05 alpha value is acceptable.

Beta Value

A beta value of 0.8/power of 0.2 seems acceptable. If a Type II error is made, the worst case is having more staff that day. This would bring higher costs in the long term, but there would also be benefits to the bike share network by staff having more bandwidth on Sunday.

Effect Size

A minimum effect size of 0.6 would be a large enough to justify lowering staffing on Sundays.

cohen.d(df$rented_bikes[df$day_of_week != "Sunday"],
        df$rented_bikes[df$day_of_week == "Sunday"])
## 
## Cohen's d
## 
## d estimate: 0.2582428 (small)
## 95 percent confidence interval:
##       lower       upper 
## -0.04019281  0.55667849

Cohen’s d is 0.26, which indicates only a small difference exists between the number of rented bikes on Sunday and other days. This shows a small practical significance if the null hypothesis is proven false.

Sample Size

Using the chosen parameters, I computed the minimum sample size for a two sample t-test at the given power level.

alpha <- 0.05
power <- 0.8
effect_size <- 0.6

# Calculate required sample size
pwr.t.test(d = effect_size, sig.level = alpha, power = power, type = "two.sample")$n
## [1] 44.58577

Each sample needs to have at least 44 observations to detect a statistically significant effect if such exists.

df |>
group_by(day_of_week)|>
  summarise(Row_Count=n())
## # A tibble: 7 × 2
##   day_of_week Row_Count
##   <chr>           <int>
## 1 Friday             51
## 2 Monday             52
## 3 Saturday           50
## 4 Sunday             51
## 5 Thursday           50
## 6 Tuesday            48
## 7 Wednesday          50

Test for Normality and Variance

Normality and variance were tested to choose the appropriate statistical test.

shapiro.test(df$rented_bikes[df$day_of_week == "Sunday"])
## 
##  Shapiro-Wilk normality test
## 
## data:  df$rented_bikes[df$day_of_week == "Sunday"]
## W = 0.91635, p-value = 0.001549
shapiro.test(df$rented_bikes[df$day_of_week != "Sunday"])
## 
##  Shapiro-Wilk normality test
## 
## data:  df$rented_bikes[df$day_of_week != "Sunday"]
## W = 0.92408, p-value = 3.04e-11

Both tests returned a value smaller than the alpha value of 0.5, which means that the null hypothesis that the distributions are normal is rejected. Rented bikes does not follow a normal distribution.

Next, I used the Shaprio-Wilkes test to see if the variances of rented bikes is the same between Sunday and the other days.

var.test(rented_bikes ~ is_sunday, data = df)
## 
##  F test to compare two variances
## 
## data:  rented_bikes by is_sunday
## F = 1.0652, num df = 300, denom df = 50, p-value = 0.8117
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6723938 1.5803614
## sample estimates:
## ratio of variances 
##           1.065228

The p-value is equal to 0.81, so we retain the null hypothesis that the variances are the same.

Wilcoxon rank-sum test

The small sample size and non-normality is why I chose the Wilcoxon rank-sum to test my null hypothesis.

wilcox.test(df$rented_bikes[df$day_of_week == "Sunday"],
            df$rented_bikes[df$day_of_week != "Sunday"])
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df$rented_bikes[df$day_of_week == "Sunday"] and df$rented_bikes[df$day_of_week != "Sunday"]
## W = 6482, p-value = 0.07585
## alternative hypothesis: true location shift is not equal to 0

The p-value is greater than the alpha value, so we retain the null hypothesis. Sunday’s sum of bikes rented is neither statistically nor practically significant. This means that the bikeshare organization should not adjust staffing on Sunday, and maintain level staffing throughout the week.

Fisher’s Significance Testing

Precipitation is negatively correlated with ridership. Does Spring get more rain than Autumn?

H0: Spring and Autumn receive the same amount of rain
H1: Spring and Autumn receive different amounts of rain.
Alpha: 0.05

Randomization was used to compare rain in the Spring and Autumn.

#Sum of rainfall by day on functioning days
rain_df <- SeoulBikeData |>
  group_by(date) |>
  summarize(all_functioning = all(functioning_day == "Yes"),
            rain = sum(rainfall_mm[functioning_day == "Yes"])) |>
  filter(all_functioning)

#Assign season to date
get_season <- function(dates) {
  # Ensure dates are Date objects
  dates <- as.Date(dates)

  # Get month and day
  month <- month(dates)
  day <- day(dates)

  # Determine season
  season <- case_when(
    (month == 12 & day >= 21) | (month == 1) | (month == 2) | (month == 3 & day < 20) ~ "Winter",
    (month == 3 & day >= 20) | (month == 4) | (month == 5) | (month == 6 & day < 21) ~ "Spring",
    (month == 6 & day >= 21) | (month == 7) | (month == 8) | (month == 9 & day < 22) ~ "Summer",
    (month == 9 & day >= 22) | (month == 10) | (month == 11) | (month == 12 & day < 21) ~ "Autumn",
    TRUE ~ NA_character_ # Handle any unexpected cases
  )
  return(season)
}

rain_df$season <- get_season(rain_df$date)

spring_rain <- rain_df$rain[rain_df$season == "Spring"]
autumn_rain <- rain_df$rain[rain_df$season == "Autumn"]

combined_rain <- c(spring_rain, autumn_rain)
observed_diff <- mean(spring_rain) - mean(autumn_rain)
###Randomization
# Number of random permutations
num_permutations <- 10000
random_diffs <- numeric(num_permutations)

# Randomization
for (i in 1:num_permutations) {
  random_perm <- sample(combined_rain)

  # Split the shuffled data into two groups
  random_spring <- random_perm[1:length(spring_rain)]
  random_autumn <- random_perm[(length(spring_rain) + 1):length(combined_rain)]

  # Calculate the difference in means
  random_diffs[i] <- mean(random_spring) - mean(random_autumn)
}

# Calculate the p-value
p_value <- sum(random_diffs >= observed_diff) / num_permutations

The p-value is 0.0434, which is below our alpha value of 0.05. Therefore, we reject the null hypothesis. Spring has more rainfall than Autumn.

ggplot(data.frame(diffs = random_diffs), aes(x = diffs)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  geom_vline(xintercept = observed_diff, color = "red", linetype = "dashed") +
  labs(
    title = "Randomization Test: Distribution of Differences",
    x = "Difference in Means (Spring - Autumn)",
    y = "Frequency"
  ) +
  theme_minimal()

This histogram shows the observed difference of means between Spring and Autumn rainfall compared to our randomization test. By operating under the assumption that Spring and Autumn have the same rainfall, we randomized the variable assignments and attained our distribution. The observed difference is represeneted by the red dotted line, and the p-value shows it is statistically significant.