In this data dive, I pair the count of rented bikes with two created fields: a numeric column that estimated desirability to bike based on weighted weather conditions, and an ordered categorical column based on time of day. These relationships were visualized and conclusions were made based on the plot. Finally, confidence intervals were built for the response variables.
Cycling is an outdoor activity, thus certain weather conditions would disincentive people in Seoul from renting from bikeshare. Temperature, humidity, wind speed, visibility, solar radiation, rain, and snow were weighted based on personal opinion about ideal conditions.
# Compute desirability score
df <- filter(SeoulBikeData, functioning_day=="Yes" & hour == 18)
df <- df |>
mutate(
desirability = round(
# Assign weights to favorable conditions
0.25 * (temp_c + 20) + # Prefer temperatures around 20 degrees celsius
0.1 * (50 - abs(humid_pct - 50)) + # Favor moderate humidity
0.05 * (5 - wind_ms) + # Favor low wind speed
0.05 * (visibility_10m / 2000)+ # Favor high visibility
0.1 * (solar_radiation * 50) - # Favor higher solar radiation
0.25 * (rainfall_mm * 30) - # Strong penalty for rain
0.2 * (snowfall_cm * 40), # Strong penalty for snow
)
) |>
mutate(desirability = pmax(pmin(desirability, 100), 0))
cor_value <- cor(df$desirability, df$rented_bikes, use = "complete.obs")
ggplot(aes(x = desirability, y = rented_bikes), data = df) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Desirability to Ride a Bike Based on Weather vs. Rented Bikes at 6pm",
x = "Desirability Score (0-100)",
y = "Number of Rented Bikes")
## `geom_smooth()` using formula = 'y ~ x'
Ridership fluctuates throughout the day, so hour was fixed at 6pm which typically exhibits the highest ridership. A scatterplot of desirability score (explanatory variable) and hourly bikes rented (response variable) was created.
sprintf("Desirability and Rented Bikes have an R^2 value of %.2f.", cor_value)
## [1] "Desirability and Rented Bikes have an R^2 value of 0.78."
This is a high R^2 value and indicates that the majority of variance in bikes rented can be explained by our desirable weather column.
Next, a 95% confidence interval for the mean rented was calculated.
rented_bikes <- df$rented_bikes
# Sample statistics
n <- length(rented_bikes)
mean_rented <- mean(rented_bikes, na.rm = TRUE)
sd_rented <- sd(rented_bikes, na.rm = TRUE)
confidence_level <- 0.95
alpha <- 1 - confidence_level
t_value <- qt(1 - alpha/2, df = n - 1) # Critical t-value
margin_error <- t_value * (sd_rented / sqrt(n))
# Confidence Interval
lower_bound <- mean_rented - margin_error
upper_bound <- mean_rented + margin_error
print(paste("95% Confidence Interval for Mean Rented Bikes: [", round(lower_bound, 2), ",", round(upper_bound, 2), "]"))
## [1] "95% Confidence Interval for Mean Rented Bikes: [ 1448.51 , 1659.53 ]"
The confidence interval suggests we are 95% confident that at 6pm, the population mean of average rides falls between 1448 and 1659 rides per hour. In other words, if we sampled the the population 100 times, 95 samples would contain the true population mean of average rides at 6pm.
The second created field is an ordered categorical column that abstracts the hour column into 5 groups based on time of day:
df2 <- filter(SeoulBikeData, functioning_day=="Yes")
df2 <- df2 |>
mutate(time_of_day = case_when(
hour >= 0 & hour <= 5 ~ "1. Late Night",
hour >= 6 & hour <= 9 ~ "2. Morning Peak",
hour >= 10 & hour <= 15 ~ "3. Midday",
hour >= 16 & hour <= 19 ~ "4. Evening Peak",
hour >= 20 & hour <= 23 ~ "5. Night"
))
Time of day was compared with hourly rented bikes again. While the assignment asked for two different columns, this was not feasible with Seoul Bikeshare data set.
A boxplot was created that showed the relationship between time of day (explanatory variable) and rented bikes (response variable).
ggplot(df2, aes(x = time_of_day, y = rented_bikes, fill = time_of_day)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Bike Rentals by Time of Day",
x = "Time of Day",
y = "Number of Rented Bikes") +
theme(legend.position = "none")
The boxplots show that ridership is highest during evening peak and lowest in the late night. Evening peak has the highest variability as well, while late night has the lowest variability. Interestingly, Late night and and morning peak have many outliers, midday has some outliers, and evening peak and night don’t have any outliers. This may be because of how the hours are grouped and actual demand. The middle of the night does not experience many riders at all, so higher demand at 5am could cause many outliers. The same concept likely holds for the time groupings, which in the case of morning peak, can have large hourly spikes in demand.
A histogram was created for the categorical time of day groupings with hourly bikes rented. Overlaid on the plot was the averages for the group, connect with a dotted line to show trend.
# Summarize mean rentals per category
df_summary <- df2 %>%
group_by(time_of_day) %>%
summarise(mean_rentals = mean(rented_bikes))
# Scatterplot with line connecting mean values
ggplot(df2, aes(x = time_of_day, y = rented_bikes)) +
geom_jitter(alpha = 0.5, color = "blue", width = 0.2) + # Scatter points with slight jitter
geom_point(data = df_summary, aes(x = time_of_day, y = mean_rentals),
color = "red", size = 3) + # Mean rental points
geom_line(data = df_summary, aes(x = time_of_day, y = mean_rentals, group = 1),
color = "red", linetype = "dashed", linewidth = 1) + # Connect means
theme_minimal() +
labs(title = "Bike Rentals by Time of Day",
x = "Time of Day Category",
y = "Number of Rented Bikes")
As we have gathered earlier data dives, demand grows throughout the day
until evening peak. Interestingly, while the morning peak has more
values that fall higher than midday, the averages are about the
same.
Before running a correlation test, the normality and linearity must be determined.
ggplot(df2, aes(x = hour, y = rented_bikes)) +
geom_jitter(alpha = 0.5, color = "black", width = 0.5) +
geom_point() +
theme_minimal() +
labs(title = "Bike Rentals by Hour (with Jitter)",
x = "Hour",
y = "Number of Rented Bikes")
The relationship between hour and bikes rented is non-linear and not
normal. Therefore, the Spearman correlation is a better choice than the
Pearson correlation.
The Spearman correlation coefficient was calculated between the ordinal categorical variable time of day and the hourly rented bikes. The rho value was found to be 0.43, which indicates a postive, moderate association between our time of day categories and bikeshare demand. The P-Value was 2.2e-16, which indicates the rho value is statistically significant.
cor.test(as.numeric(substr(df2$time_of_day,1,1)), df2$rented_bikes, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: as.numeric(substr(df2$time_of_day, 1, 1)) and df2$rented_bikes
## S = 5.7404e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4321735
In conclusion, both weather desirability and time of day are positively correlated with hourly bikeshare rentals.