Data Dive 14

Jakob Morales

Import Libraries and Data

Goal 1: Business Scenario

A cycling apparel company is launching a new line of rain gear and wants to identify the urban markets where wet weather causes the smallest and the largest drops in bikeshare ridership. Without this insight, the marketing team may overlook untapped markets (largest drops) or markets where many riders already own rain gear and may eventually need more (smallest drops). Understanding which cities see the shallowest or steepest ridership declines during rain events can help target advertising to the cyclists most likely to purchase the new line.

Scope

This analysis would use hourly bikeshare ridership data and hourly weather data (specifically rainfall levels) across several cities. Key variables include:

Hourly bikes rented
Rainfall (mm/hr)
Time of day, day of week, and season
Temperature, humidity, solar radiation, and wind (as control variables)
City or region identifier

This information would be used to calculate the percentage change in ridership during rainy vs. dry hours, controlling for time of day and seasonality. A multivariate regression analysis could help isolate the effect of rain on ridership by city. The cities with the lowest and highest coefficients would be selected for further research.
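
A minimal sketch of the regression described above, using simulated data because no multi-city data set is part of this report. The column names (city, hour, rain, rentals), the three cities, and the effect sizes are all made up for illustration. Modeling log rentals with a city-by-rain interaction gives each city its own rain coefficient, which converts to an approximate percentage change in ridership.

```r
set.seed(42)

# Simulated hourly data for three hypothetical cities (not real data)
n_per_city <- 2000
cities <- c("A", "B", "C")
true_rain_effect <- c(A = -0.2, B = -0.6, C = -1.0)  # assumed log-scale effects

sim <- do.call(rbind, lapply(cities, function(ct) {
  hour <- sample(0:23, n_per_city, replace = TRUE)
  rain <- rbinom(n_per_city, 1, 0.2)  # 1 = rainy hour, 0 = dry hour
  log_rentals <- 5 + 0.5 * sin(2 * pi * hour / 24) +
    true_rain_effect[[ct]] * rain + rnorm(n_per_city, sd = 0.3)
  data.frame(city = ct,
             hour = factor(hour, levels = 0:23),
             rain = rain,
             rentals = exp(log_rentals))
}))

# One rain coefficient per city, controlling for hour of day
fit <- lm(log(rentals) ~ 0 + city + city:rain + hour, data = sim)

# Convert each log-scale coefficient to a percentage change and rank
rain_coef <- coef(fit)[paste0("city", cities, ":rain")]
pct_change <- 100 * (exp(rain_coef) - 1)
names(pct_change) <- cities
ranking <- sort(pct_change)  # steepest drop first
print(round(ranking, 1))
```

With real data, the control variables listed above (season, temperature, humidity, and so on) would be added to the formula, and summary(fit) would supply standard errors for each city's rain coefficient.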

Assumptions:

Rainfall is accurately recorded and aligns in time and location with bikeshare data.
Ridership decline is due to rain, not confounding variables like holidays or system outages.
The way bikeshare customers respond to rain is a reasonable proxy for how local cyclists in general are affected by rain.

Objective

The objective is to identify the top urban markets where rain causes the largest and smallest relative decreases in hourly bikeshare ridership, suggesting either unmet demand for functional rain gear among cyclists or markets with more riders accustomed to the rain. Success will be defined by:

Producing a ranked list of cities by rain-related ridership drop, and
Identifying statistically significant correlations between rainfall and ridership decline, controlling for other variables.

Goal 2: Model Critique

Improvement 1

My first improvement would be to account for autocorrelation in my models. The Seoul Bike Share data is a time series. As people use bike share more, they may be influenced by others riding or by their own habits forming around transportation choice. Rain leaves the ground wet until it dries, and a single warm day in winter may not be enough to convince people to cycle. Thus, the total bikes rented in any given hour partly depends on the hours before it. One solution is to use an ARIMA model.

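# adf.test() comes from the tseries package; auto.arima() from forecast
# Hourly series over all rows of df, with a 24-hour seasonal period
ts_data <- ts(df$rented_bikes, frequency = 24)
#Test whether the series is non-stationary (ADF null hypothesis: unit root)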
adf.test(ts_data)

## Warning in adf.test(ts_data): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -7.8352, Lag order = 20, p-value = 0.01
## alternative hypothesis: stationary

#p-value < 0.05, so reject the null hypothesis that the series is non-stationary.
model <- auto.arima(ts_data,
                    seasonal = TRUE,
                    stepwise = TRUE,
                    approximation = TRUE)
summary(model)

## Series: ts_data 
## ARIMA(5,0,1)(2,1,0)[24] 
## 
## Coefficients:
##          ar1     ar2      ar3     ar4      ar5     ma1     sar1     sar2
##       0.2067  0.5994  -0.1355  0.1231  -0.0961  0.8638  -0.4315  -0.3006
## s.e.  0.0226  0.0236   0.0138  0.0110   0.0110  0.0203   0.0105   0.0104
## 
## sigma^2 = 38887:  log likelihood = -56581.69
## AIC=113181.4   AICc=113181.4   BIC=113244.8
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.2349593 196.8254 115.6206 -26.14823 55.07473 0.4752417
##                       ACF1
## Training set -0.0001631722
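
One way to check whether a fitted ARIMA model has absorbed the autocorrelation is a Ljung-Box test on its residuals. The sketch below is not run on the Seoul data: it simulates an AR(1) series and uses base R's arima() in place of auto.arima() so that it runs without extra packages. The simulated series, its coefficient, and the choice of 24 lags are assumptions.

```r
set.seed(1)

# Simulated AR(1) series standing in for hourly rentals (illustrative only)
y <- arima.sim(model = list(ar = 0.7), n = 1000)

# The raw series is strongly autocorrelated
raw_test <- Box.test(y, lag = 24, type = "Ljung-Box")

# Fit an AR(1) model and test its residuals;
# fitdf = 1 accounts for the one estimated AR coefficient
fit <- arima(y, order = c(1, 0, 0))
resid_test <- Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = 1)

cat(sprintf("Raw series p-value: %.4g\n", raw_test$p.value))
cat(sprintf("Residual p-value:   %.4g\n", resid_test$p.value))
```

A large residual p-value suggests no remaining autocorrelation up to the tested lag. For the model in this report, the analogous call would be Box.test(residuals(model), lag = 24, type = "Ljung-Box"); the ACF1 value near zero in the training-set error measures points the same way for lag 1.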

Improvement 2

In Data Dive 6, I calculated a desirability score based on variables that I believed influenced ridership. In the calculation, I made arbitrary changes to the variables to reflect their impact before applying a weight. Instead of transforming them like this, I should have scaled each variable.

#Scaled variables with the original hand-picked weights
df <- filter(SeoulBikeData, functioning_day=="Yes")

df <- df |>
  mutate(
    desirability = round(
      # Assign weights to favorable conditions
      0.5 * scale(temp_c) +
      0.05 * scale(humid_pct) +
      0.025 * scale(wind_ms) +
      0.025 * scale(visibility_10m) +
      0.1 * scale(solar_radiation) -
      0.3 * scale(rainfall_mm))
  ) |>
  mutate(desirability = pmax(pmin(desirability, 100), 0))

#Calculate the Pearson correlation (r); square it to compare with an R-squared value
cor_value <- cor(df$desirability, df$rented_bikes, use = "complete.obs")
sprintf("Desirability and Rented Bikes have a correlation (r) of %.2f.", cor_value)

## [1] "Desirability and Rented Bikes have a correlation (r) of 0.42."
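
For reference, scale() with its default arguments centers each variable on its mean and divides by its standard deviation, so variables measured in different units become comparable before weights are applied. A small sketch with made-up numbers:

```r
# Made-up values in very different units (illustrative only)
temp_c      <- c(-5, 0, 10, 20, 30)
rainfall_mm <- c(0, 0, 0.5, 2, 10)

# scale() returns a one-column matrix; as.numeric() flattens it
z_temp <- as.numeric(scale(temp_c))
z_rain <- as.numeric(scale(rainfall_mm))

# Same result computed by hand
z_temp_manual <- (temp_c - mean(temp_c)) / sd(temp_c)

print(round(z_temp, 3))
print(all.equal(z_temp, z_temp_manual))
```

Because z-scores mostly fall between about -3 and 3, rounding the weighted sum to whole numbers and clamping it to [0, 100], as the code above does, may discard much of the score's variation.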

Improvement 3

The weights were chosen arbitrarily. An optimizer could instead search for the weights that maximize R-squared.

# Objective function to maximize R^2
objective <- function(weights, df) {
  with(df, {
    desirability <- weights[1] * scale(temp_c) +
                    weights[2] * scale(humid_pct) +
                    weights[3] * scale(wind_ms) +
                    weights[4] * scale(visibility_10m) +
                    weights[5] * scale(solar_radiation) -
                    weights[6] * scale(rainfall_mm)
    desirability <- pmax(pmin(desirability, 100), 0)
    
    r2 <- cor(desirability, rented_bikes, use = "complete.obs")^2
    return(-r2)  # we minimize negative R²
  })
}

# Initial weights guess: (temp, humid, wind, visibility, solar, rain)
init_weights <- c(0.25, 0.1, 0.05, 0.05, 0.1, 0.25)

# Optimize the weights
opt_result <- optim(par = init_weights, fn = objective, df = df, method = "BFGS")

# Get optimal weights and R²
best_weights <- opt_result$par
max_r2 <- -opt_result$value

cat(sprintf("Optimized R²: %.4f\n", max_r2))

## Optimized R²: 0.4593

cat("Optimized weights:\n")

## Optimized weights:

print(setNames(round(best_weights, 4), c("temp", "humid", "wind", "visib", "solar", "rain")))

##    temp   humid    wind   visib   solar    rain 
##  0.2102 -0.0911  0.0327 -0.0049 -0.0519  1.8644
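
As a cross-check on the optimizer: without the clamping step, the weights that maximize R² for a weighted sum of predictors are proportional to the ordinary least squares coefficients. The sketch below illustrates this on simulated data. The variable names x1 and x2 and the data-generating weights are assumptions, and the clamp is deliberately left out.

```r
set.seed(7)

# Simulated predictors and response (illustrative only)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 - 1 * x2 + rnorm(n)

# Negative R^2 of a weighted sum, to be minimized by optim()
neg_r2 <- function(w) -cor(w[1] * x1 + w[2] * x2, y)^2
opt <- optim(par = c(0.5, 0.5), fn = neg_r2, method = "BFGS")

# Ordinary least squares on the same predictors
ols <- lm(y ~ x1 + x2)
ols_r2 <- summary(ols)$r.squared

cat(sprintf("optim R^2: %.4f, lm R^2: %.4f\n", -opt$value, ols_r2))
# R^2 is unchanged by rescaling the weights, so compare their ratio
cat(sprintf("weight ratio: optim %.3f, lm %.3f\n",
            opt$par[1] / opt$par[2],
            coef(ols)[["x1"]] / coef(ols)[["x2"]]))
```

In the report's objective the rain term is subtracted and the score is clamped at 0, so the optimized weights need not match lm() coefficients exactly.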

Further analysis of individual variables is necessary to make corrections before scaling. For example, bikes rented is positively correlated with temperature only up to about 86 degrees Fahrenheit (30 degrees Celsius). Corrections like this would likely improve the R-squared score.
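
One hedged way to encode such a correction is to let the temperature effect change slope at a breakpoint. The sketch below uses simulated data with an assumed breakpoint of 30 degrees Celsius (86 degrees Fahrenheit); the hinge term and all numbers are illustrative, not estimates from the Seoul data.

```r
set.seed(3)

# Simulated temperatures and rentals (illustrative only)
n <- 1000
temp_c <- runif(n, -10, 38)
# Assumed truth: rentals rise with temperature up to 30 C, then fall
rentals <- 200 + 25 * pmin(temp_c, 30) - 40 * pmax(temp_c - 30, 0) +
  rnorm(n, sd = 60)

# A single straight line versus a line with a hinge at 30 C
linear_fit <- lm(rentals ~ temp_c)
hinge_fit  <- lm(rentals ~ temp_c + pmax(temp_c - 30, 0))

r2_linear <- summary(linear_fit)$r.squared
r2_hinge  <- summary(hinge_fit)$r.squared
cat(sprintf("R^2 linear: %.3f, with hinge: %.3f\n", r2_linear, r2_hinge))
```

A negative coefficient on the hinge term indicates that the temperature slope flattens or reverses above the breakpoint. The breakpoint itself would need to be chosen from the data rather than assumed.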

Goal 3: Ethical and Epistemological Concerns

This analysis raises several important ethical and epistemological considerations that relate to data representation, potential biases, broader societal impacts, and the inherent limitations of what the data can and cannot reveal.

A primary concern is representativeness. Bikeshare systems tend to be concentrated in denser, more affluent urban areas, potentially excluding lower-income neighborhoods or cities with less-developed cycling infrastructure. As a result, the data may not reflect the behaviors or needs of a broad spectrum of cyclists. Additionally, the analysis assumes that bikeshare users are an appropriate proxy for the broader cycling population. However, people who own and ride their own bicycles may have different behaviors and levels of resilience to inclement weather. This assumption introduces potential bias in interpreting rain-related ridership changes. Temporal misalignment between weather data and bikeshare trip data is another risk. If rainfall data does not precisely match the location and timing of bike rentals, the estimates may either understate or overstate the impact of rain on ridership.

Some of these biases can be mitigated through thoughtful modeling. Including controls for time of day, day of week, and seasonality helps isolate the effect of rainfall. Incorporating contextual data, such as the commuting mode share and quality of local cycling infrastructure, can improve the accuracy and fairness of the analysis. To enhance validity, the analysis could be supplemented with additional data sources, such as mobility app usage (e.g., Strava) or surveys from local cycling organizations, to capture a more holistic picture of urban cycling behavior.

There are also potential unintended consequences. If the analysis leads to marketing efforts that only focus on cities with high rain resilience, it may inadvertently neglect regions where demand for rain gear is high but current ridership is low due to safety concerns or insufficient infrastructure.

Some important factors influencing ridership are not easily quantifiable. For example, we cannot directly measure an individual’s motivation for riding in the rain. Cultural attitudes toward cycling in adverse weather also vary by region and are difficult to capture through quantitative analysis. Similarly, perceptions of safety and personal risk tolerance strongly influence riding behavior and are not reflected in the data.

The primary stakeholders affected by this analysis are urban cyclists, whose behaviors and preferences may influence product development and marketing decisions. However, the analysis may also inform city planners, policymakers, and transit agencies by highlighting disparities in infrastructure or accessibility.

Ultimately, this analysis should not only serve commercial interests but also contribute to a more inclusive understanding of urban cycling patterns, especially in the context of weather resilience and equitable access to gear and infrastructure.