Sampling and Drawing Conclusions

In this data dive, I demonstrate sampling and drawing conclusions from Seoul Bike Share data. For the exercise, I pretended that the data is the population. Five random samples were taken that consisted of roughly 50% of the 8760 observations. These samples were combined into one dataframe. The five samples were:

  1. A random sample using sample()
  2. A random sample where data is not collected on holidays due to researcher’s union agreements.
  3. A random sample where there is a non-zero solar radiation value (assuming the data collection sensor is solar powered).
  4. A random sample of observations that began 2018/1/1
  5. A random sample where there is no snow, assuming the data collection machine can’t function in the snow.
df_1 = SeoulBikeData[sample(nrow(SeoulBikeData), size = 0.5 * nrow(SeoulBikeData), replace = TRUE), ] |>
  mutate(sample_num = 1) #Random Sample using the sample() function.

df_2 = SeoulBikeData |>
  filter(holiday =="No Holiday") |>
  sample_n(size = 0.5 * nrow(SeoulBikeData), replace = TRUE) |>
  mutate(sample_num = 2) #The researchers did not work on holidays 

df_3 <- SeoulBikeData |>
  filter(solar_radiation != 0) |>
  sample_n(size = 0.5 * nrow(SeoulBikeData), replace = TRUE) |>
  mutate(sample_num = 3) #The data collection machine was solar powered

df_4 <- SeoulBikeData |>
  filter(year >= 2018) |>
  sample_n(size = 0.5 * nrow(SeoulBikeData), replace = TRUE) |>
  mutate(sample_num = 4) #The data collection started 2018/1/1

df_5 <- SeoulBikeData |>
  filter(snowfall_cm == 0) |>
  sample_n(size = 0.5 * nrow(SeoulBikeData), replace = TRUE) |>
  mutate(sample_num = 5)   #The data collection machine doesn't work during the snow

samples = rbind(df_1,df_2,df_3,df_4,df_5) #concatenate 5 samples vertically
head(samples)
##      day month year rented_bikes hour temp_c humid_pct wind_ms visibility_10m
## 5979   7     8 2018          443    2   27.9        68     0.9           2000
## 2061  24     2 2018          233   20    1.1        32     1.4           1785
## 132    6    12 2017          381   11    1.6        57     1.2           1261
## 1222  20     1 2018          220   21    2.1        48     2.8            346
## 7282  30     9 2018            0    9   18.5        62     0.5           1847
## 2657  21     3 2018          241   16    2.4        79     1.9           1995
##      dew_point_temp_c solar_radiation rainfall_mm snowfall_cm seasons
## 5979             21.4            0.00           0           0  Summer
## 2061            -13.8            0.00           0           0  Winter
## 132              -6.0            0.93           0           0  Winter
## 1222             -7.7            0.00           0           0  Winter
## 7282             11.0            1.00           0           0  Autumn
## 2657             -0.8            0.37           0           0  Spring
##         holiday functioning_day month_name       date sample_num
## 5979 No Holiday             Yes     August 2018-08-07          1
## 2061 No Holiday             Yes   February 2018-02-24          1
## 132  No Holiday             Yes   December 2017-12-06          1
## 1222 No Holiday             Yes    January 2018-01-20          1
## 7282 No Holiday              No  September 2018-09-30          1
## 2657 No Holiday             Yes      March 2018-03-21          1

Inspection

I inspected the five samples to understand their similarities and differences. The first attribute I wanted to understand was Bike Rentals. I computed summary statistics and found that the maximum, minimum, standard deviation, and mean bikes rented is similar across samples, except for Sample 3 which is excludes times when the sun is down.

samples |>
  group_by(sample_num) |>
  summarise(mean = mean(rented_bikes), min(rented_bikes), max(rented_bikes), sd(rented_bikes))
## # A tibble: 5 × 5
##   sample_num  mean `min(rented_bikes)` `max(rented_bikes)` `sd(rented_bikes)`
##        <dbl> <dbl>               <int>               <int>              <dbl>
## 1          1  690.                   0                3418               627.
## 2          2  736.                   0                3556               660.
## 3          3  904.                   0                3556               680.
## 4          4  750.                   0                3309               658.
## 5          5  710.                   0                3298               624.

Bike rental counts vary significantly by hour, so I controlled for this variable to inspect deeper into average bikes rented. Again, the average are similar with the exception of Sample three in hours 6, 7, 19, and 20. Having sunlight at these hours may have influenced more people to ride.

samples |>
  group_by(sample_num, hour) |>
  summarise(mean = mean(rented_bikes),.groups = "drop") |>
  ggplot(aes(x = hour, y = mean, color = as.factor(sample_num))) +
  geom_point() +
  labs(title = "Heatmap of Average Hourly Rented Bikes by Sample",
       x = "Hour",
       y = "Average Rented Bikes per Hour",
       color = "Sample Number") +
  scale_color_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = 0:23) +
  theme_minimal()

Since the time of year has a strong effect on cycling as well, I inspected the count of observations by month for each sample and compared the result. All samples but three show a pattern based on number of days in a month, while sample three peaks in the summer months and trails off to a low in the winter months. This causes a significant skew in the data towards summer conditions.

samples |>
  group_by(sample_num, month) |>
  summarise(count = n(), .groups = "drop") |>
  ggplot(aes(x = as.factor(month), y = count, fill = as.factor(sample_num))) +
  geom_col(position = "dodge") +
  labs(title = "Grouped Bar Chart of Sample Counts by Month",
       x = "Month",
       y = "Count",
       fill = "Sample Number") +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

samples |>
  group_by(sample_num, month) |>
  summarise(count = n(), .groups = "drop") |>
  ggplot(aes(x = as.factor(month), y = sample_num, fill = count, label=count)) +
  geom_tile() +
  geom_text(color = "black", size = 4) +
  labs(title = "Heatmap of Sample Counts by Month",
       x = "Month",
       y = "Count",
       fill = "Sample Number") +
  theme_minimal()

I compared the ridership with temperature across the samples.

samples |>
  ggplot(aes(x=temp_c, y=rented_bikes, color=as.factor(sample_num))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot of Bike Rentals per Hour by Temperature",
       x = "Temperature (Celsius)", 
       y = "Hourly Bikes Rented") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

As expected, the coefficient of the linear regression of temperature and hourly bikes rented in sample three is higher than the rest of the samples.

Drawing Conclusions

If I were to draw conclusions from this data I would likely cut sample three from the study unless I was testing a hypothesis about ridership during certain times of the day. It would be important to not allow make inferences on nighttime ridership using sample 3.

If the study was about winter riding, I would not want to include sample 5 because the presence of snow would be an important variable.