In this data dive, I demonstrate sampling and drawing conclusions from Seoul Bike Share data. For the exercise, I pretended that the data is the population. Five random samples were taken that consisted of roughly 50% of the 8760 observations. These samples were combined into one dataframe. The five samples were:
df_1 = SeoulBikeData[sample(nrow(SeoulBikeData), size = 0.5 * nrow(SeoulBikeData), replace = TRUE), ] |>
mutate(sample_num = 1) #Random Sample using the sample() function.
df_2 = SeoulBikeData |>
filter(holiday =="No Holiday") |>
sample_n(size = 0.5 * nrow(SeoulBikeData), replace = TRUE) |>
mutate(sample_num = 2) #The researchers did not work on holidays
df_3 <- SeoulBikeData |>
filter(solar_radiation != 0) |>
sample_n(size = 0.5 * nrow(SeoulBikeData), replace = TRUE) |>
mutate(sample_num = 3) #The data collection machine was solar powered
df_4 <- SeoulBikeData |>
filter(year >= 2018) |>
sample_n(size = 0.5 * nrow(SeoulBikeData), replace = TRUE) |>
mutate(sample_num = 4) #The data collection started 2018/1/1
df_5 <- SeoulBikeData |>
filter(snowfall_cm == 0) |>
sample_n(size = 0.5 * nrow(SeoulBikeData), replace = TRUE) |>
mutate(sample_num = 5) #The data collection machine doesn't work during the snow
samples = rbind(df_1,df_2,df_3,df_4,df_5) #concatenate 5 samples vertically
head(samples)
## day month year rented_bikes hour temp_c humid_pct wind_ms visibility_10m
## 5979 7 8 2018 443 2 27.9 68 0.9 2000
## 2061 24 2 2018 233 20 1.1 32 1.4 1785
## 132 6 12 2017 381 11 1.6 57 1.2 1261
## 1222 20 1 2018 220 21 2.1 48 2.8 346
## 7282 30 9 2018 0 9 18.5 62 0.5 1847
## 2657 21 3 2018 241 16 2.4 79 1.9 1995
## dew_point_temp_c solar_radiation rainfall_mm snowfall_cm seasons
## 5979 21.4 0.00 0 0 Summer
## 2061 -13.8 0.00 0 0 Winter
## 132 -6.0 0.93 0 0 Winter
## 1222 -7.7 0.00 0 0 Winter
## 7282 11.0 1.00 0 0 Autumn
## 2657 -0.8 0.37 0 0 Spring
## holiday functioning_day month_name date sample_num
## 5979 No Holiday Yes August 2018-08-07 1
## 2061 No Holiday Yes February 2018-02-24 1
## 132 No Holiday Yes December 2017-12-06 1
## 1222 No Holiday Yes January 2018-01-20 1
## 7282 No Holiday No September 2018-09-30 1
## 2657 No Holiday Yes March 2018-03-21 1
I inspected the five samples to understand their similarities and differences. The first attribute I wanted to understand was Bike Rentals. I computed summary statistics and found that the maximum, minimum, standard deviation, and mean bikes rented is similar across samples, except for Sample 3 which is excludes times when the sun is down.
samples |>
group_by(sample_num) |>
summarise(mean = mean(rented_bikes), min(rented_bikes), max(rented_bikes), sd(rented_bikes))
## # A tibble: 5 × 5
## sample_num mean `min(rented_bikes)` `max(rented_bikes)` `sd(rented_bikes)`
## <dbl> <dbl> <int> <int> <dbl>
## 1 1 690. 0 3418 627.
## 2 2 736. 0 3556 660.
## 3 3 904. 0 3556 680.
## 4 4 750. 0 3309 658.
## 5 5 710. 0 3298 624.
Bike rental counts vary significantly by hour, so I controlled for this variable to inspect deeper into average bikes rented. Again, the average are similar with the exception of Sample three in hours 6, 7, 19, and 20. Having sunlight at these hours may have influenced more people to ride.
samples |>
group_by(sample_num, hour) |>
summarise(mean = mean(rented_bikes),.groups = "drop") |>
ggplot(aes(x = hour, y = mean, color = as.factor(sample_num))) +
geom_point() +
labs(title = "Heatmap of Average Hourly Rented Bikes by Sample",
x = "Hour",
y = "Average Rented Bikes per Hour",
color = "Sample Number") +
scale_color_brewer(palette = "Dark2") +
scale_x_continuous(breaks = 0:23) +
theme_minimal()
Since the time of year has a strong effect on cycling as well, I
inspected the count of observations by month for each sample and
compared the result. All samples but three show a pattern based on
number of days in a month, while sample three peaks in the summer months
and trails off to a low in the winter months. This causes a significant
skew in the data towards summer conditions.
samples |>
group_by(sample_num, month) |>
summarise(count = n(), .groups = "drop") |>
ggplot(aes(x = as.factor(month), y = count, fill = as.factor(sample_num))) +
geom_col(position = "dodge") +
labs(title = "Grouped Bar Chart of Sample Counts by Month",
x = "Month",
y = "Count",
fill = "Sample Number") +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
samples |>
group_by(sample_num, month) |>
summarise(count = n(), .groups = "drop") |>
ggplot(aes(x = as.factor(month), y = sample_num, fill = count, label=count)) +
geom_tile() +
geom_text(color = "black", size = 4) +
labs(title = "Heatmap of Sample Counts by Month",
x = "Month",
y = "Count",
fill = "Sample Number") +
theme_minimal()
I compared the ridership with temperature across the samples.
samples |>
ggplot(aes(x=temp_c, y=rented_bikes, color=as.factor(sample_num))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Scatterplot of Bike Rentals per Hour by Temperature",
x = "Temperature (Celsius)",
y = "Hourly Bikes Rented") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
As expected, the coefficient of the linear regression of temperature and
hourly bikes rented in sample three is higher than the rest of the
samples.
If I were to draw conclusions from this data I would likely cut sample three from the study unless I was testing a hypothesis about ridership during certain times of the day. It would be important to not allow make inferences on nighttime ridership using sample 3.
If the study was about winter riding, I would not want to include sample 5 because the presence of snow would be an important variable.