Regression Modeling

In this week’s data dive, I will explore correlations between variables in the Seoul Bikeshare data set. The documentation lists rented bikes as a feature and treats the variable that denotes whether the bikeshare system was functioning as the target variable. This does not align with our interest in learning which factors influence ridership most, so the hourly count of bikes rented will be the response variable.

For the hypotheses within this data dive, I will use an alpha value of 0.05. The risk is minimal if an error is made, and there are no specific reasons for me to choose a different alpha value.

ANOVA

I’d like to know if ridership differs by day of the week. This information could be important to an organization for marketing, staffing, and maintenance purposes.

H₀: Mean demand for bike rentals is the same on every day of the week.

H_A: Mean demand for bike rentals differs for at least one day of the week.

Before running ANOVA, I must check that three conditions hold:

“• the observations are independent within and across groups,
• the data within each group are nearly normal, and
• the variability across the groups is about equal.” (Diez et al., 2024)

Independence and Variability

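To check for similar variance across groups, I constructed a boxplot of rented bikes by day of week. Independence cannot be verified from a plot; hourly counts within the same day are likely correlated, which is a caveat for the results that follow.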

library(ggplot2)

# Boxplot of hourly rentals, one box per day of the week
ggplot(df, aes(x = day_of_week, y = rented_bikes)) +
  geom_boxplot() +
  labs(title = "Bike Rentals by Day of Week",
       x = "Day",
       y = "Rented Bikes") +
  theme_minimal()

The boxplots show that the median number of rented bikes is similar from day to day, with the exception of Sunday. Is the difference due to chance, or does Sunday have lower demand than the other days? Knowing this is important for an organization. For example, it would make sense to do more repairs on Sunday since fewer bikes are being used, and the network would have freshly maintained bikes to start the work week.

The distributions vary slightly between weekdays and weekends because weekends have fewer outliers.
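The spread within each group can also be checked numerically. The sketch below is a minimal example on simulated data (the `sim` data frame and its values are assumptions, not the Seoul data). It computes the standard deviation for each day and runs a Fligner-Killeen test from base R's stats package, which compares variances across groups without relying heavily on normality.

```r
# Simulated stand-in for the bikeshare data (hypothetical values).
set.seed(42)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
sim <- data.frame(
  day_of_week  = factor(rep(days, each = 200), levels = days),
  rented_bikes = rexp(7 * 200, rate = 1 / 700)  # right-skewed values
)

# Standard deviation of rentals within each day.
sd_by_day <- tapply(sim$rented_bikes, sim$day_of_week, sd)
print(round(sd_by_day, 1))

# Ratio of the largest to the smallest group SD; values near 1 suggest similar spread.
sd_ratio <- max(sd_by_day) / min(sd_by_day)
print(sd_ratio)

# Fligner-Killeen test of equal variances across groups.
fk <- fligner.test(rented_bikes ~ day_of_week, data = sim)
print(fk$p.value)
```

On the real data, `sim` would be replaced by `df`; a small p-value would suggest that the equal-variance condition is doubtful.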

Normality

library(dplyr)  # provides filter()

# Shapiro-Wilk Test for Normality, run separately for each day
shapiro.test(filter(df, day_of_week == "Monday")$rented_bikes)

## 
##  Shapiro-Wilk normality test
## 
## data:  filter(df, day_of_week == "Monday")$rented_bikes
## W = 0.87921, p-value < 2.2e-16

shapiro.test(filter(df, day_of_week == "Tuesday")$rented_bikes)

## 
##  Shapiro-Wilk normality test
## 
## data:  filter(df, day_of_week == "Tuesday")$rented_bikes
## W = 0.88308, p-value < 2.2e-16

shapiro.test(filter(df, day_of_week == "Wednesday")$rented_bikes)

## 
##  Shapiro-Wilk normality test
## 
## data:  filter(df, day_of_week == "Wednesday")$rented_bikes
## W = 0.88271, p-value < 2.2e-16

shapiro.test(filter(df, day_of_week == "Thursday")$rented_bikes)

## 
##  Shapiro-Wilk normality test
## 
## data:  filter(df, day_of_week == "Thursday")$rented_bikes
## W = 0.88057, p-value < 2.2e-16

shapiro.test(filter(df, day_of_week == "Friday")$rented_bikes)

## 
##  Shapiro-Wilk normality test
## 
## data:  filter(df, day_of_week == "Friday")$rented_bikes
## W = 0.89292, p-value < 2.2e-16

shapiro.test(filter(df, day_of_week == "Saturday")$rented_bikes)

## 
##  Shapiro-Wilk normality test
## 
## data:  filter(df, day_of_week == "Saturday")$rented_bikes
## W = 0.90318, p-value < 2.2e-16

shapiro.test(filter(df, day_of_week == "Sunday")$rented_bikes)

## 
##  Shapiro-Wilk normality test
## 
## data:  filter(df, day_of_week == "Sunday")$rented_bikes
## W = 0.85793, p-value < 2.2e-16

The p-value for each day of the week in the Shapiro-Wilk tests was less than the alpha value, so we reject the null hypothesis of normality for every day.
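The seven separate calls can be collapsed into one grouped call. This is a minimal sketch on simulated data (the `sim` data frame is an assumption standing in for `df`); it uses `tapply()` to run `shapiro.test()` once per day and collect the p-values. Note that `shapiro.test()` only accepts between 3 and 5000 observations per call, which each single-day group here satisfies.

```r
# Simulated stand-in for the bikeshare data (hypothetical values).
set.seed(1)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
sim <- data.frame(
  day_of_week  = factor(rep(days, each = 200), levels = days),
  rented_bikes = rexp(7 * 200, rate = 1 / 700)  # right-skewed values
)

# Run the Shapiro-Wilk test once per day and keep only the p-value.
shapiro_p <- tapply(sim$rented_bikes, sim$day_of_week,
                    function(x) shapiro.test(x)$p.value)
print(signif(shapiro_p, 3))

# Days for which normality is rejected at alpha = 0.05.
alpha <- 0.05
print(names(shapiro_p)[shapiro_p < alpha])
```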

ANOVA

Although the data fail the assumption of normality, I continued with the ANOVA test to see whether the differences between the days of the week are greater than the differences within each day.

m <- aov(df$rented_bikes ~ df$day_of_week, data=df)
summary(m)

##                  Df    Sum Sq Mean Sq F value   Pr(>F)    
## df$day_of_week    6 1.548e+07 2580132   6.277 1.35e-06 ***
## Residuals      8458 3.477e+09  411077                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the output, I can conclude that the mean number of rented bikes differs significantly for at least one day of the week. The F-value of 6.277 shows that the mean square between days is about six times the mean square within days. The p-value of 1.35e-06 (flagged “***”, meaning it is below 0.001) is less than our alpha value of 0.05, which indicates that the differences are very unlikely to be due to chance alone.
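The F-value and p-value can be reproduced by hand from the ANOVA table above. The sketch below copies the mean squares and degrees of freedom from that table (the variable names are mine, chosen for illustration) and uses base R's `pf()` for the upper tail of the F distribution.

```r
# Values copied from the ANOVA table above.
ms_between <- 2580132   # Mean Sq for day_of_week
ms_within  <- 411077    # Mean Sq for Residuals
df_between <- 6
df_within  <- 8458

# F is the ratio of between-group to within-group mean squares.
f_stat <- ms_between / ms_within
print(round(f_stat, 3))  # 6.277

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)
print(p_value)
```

The printed p-value should be on the order of 1e-06, in line with the `Pr(>F)` column.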

Next, I used pairwise t-tests with a Bonferroni adjustment to check my work and see which days differ.

pairwise.t.test(df$rented_bikes, df$day_of_week, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  df$rented_bikes and df$day_of_week 
## 
##           Friday  Monday  Saturday Sunday  Thursday Tuesday
## Monday    1.00000 -       -        -       -        -      
## Saturday  1.00000 1.00000 -        -       -        -      
## Sunday    1.8e-06 0.00643 0.01079  -       -        -      
## Thursday  0.54120 1.00000 1.00000  0.03986 -        -      
## Tuesday   1.00000 1.00000 1.00000  0.00088 1.00000  -      
## Wednesday 1.00000 1.00000 1.00000  7.7e-06 1.00000  1.00000
## 
## P value adjustment method: bonferroni

The test shows that the adjusted p-values for Sunday are below our alpha value of 0.05 against every other day of the week. No other pair of days differs significantly; the smallest remaining p-value is 0.54, for Thursday compared to Friday. For the Sunday comparisons, we reject the null hypothesis that demand is the same. This provides further evidence in favor of the result from the ANOVA test.
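Many entries in the table are exactly 1 because of how the Bonferroni adjustment works: each raw p-value is multiplied by the number of comparisons and capped at 1. A minimal sketch, using made-up raw p-values purely for illustration:

```r
# Seven days give choose(7, 2) pairwise comparisons.
n_days <- 7
n_comparisons <- choose(n_days, 2)
print(n_comparisons)  # 21

# Hypothetical unadjusted p-values, for illustration only.
raw_p <- c(0.001, 0.03, 0.2)

# Bonferroni: multiply by the number of comparisons, cap at 1.
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)
print(adj_p)  # 0.021, 0.63, 1
```

A raw p-value of 0.03 would look significant on its own but becomes 0.63 after adjusting for 21 comparisons.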

This result shows differences in travel behavior based on the day of the week. Interestingly, when the hour is held constant at 8pm, there are no significant differences.

pairwise.t.test(df20$rented_bikes, df20$day_of_week, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  df20$rented_bikes and df20$day_of_week 
## 
##           Friday Monday Saturday Sunday Thursday Tuesday
## Monday    1      -      -        -      -        -      
## Saturday  1      1      -        -      -        -      
## Sunday    1      1      1        -      -        -      
## Thursday  1      1      1        1      -        -      
## Tuesday   1      1      1        1      1        -      
## Wednesday 1      1      1        1      1        1      
## 
## P value adjustment method: bonferroni

Every adjusted p-value is 1, so we fail to reject the null hypothesis for every pair of days. This suggests that travel behavior around 8pm is very similar no matter what day it is. Perhaps travel behavior differs across days of the week at other hours.
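The 8pm subset `df20` could be built by filtering on the hour of day. The sketch below shows one plausible way on simulated data; the `hour` column name and its 0 to 23 coding are assumptions, as is the `sim` data frame itself.

```r
# Simulated stand-in: 14 days of hourly counts (hypothetical values).
# The `hour` column name and 0-23 coding are assumptions.
set.seed(7)
sim <- expand.grid(hour = 0:23, day = 1:14)
sim$rented_bikes <- rpois(nrow(sim), lambda = 500)

# Keep only the 8pm (hour 20) observations.
sim20 <- subset(sim, hour == 20)
print(nrow(sim20))  # 14, one row per simulated day
```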

Regression

Intuitively, temperature is one of the most likely reasons people would choose not to ride a bicycle. Another intuitive reason is time of day; unless someone needs to travel, is an athlete, or is woken from their sleep with the overwhelming urge to ride a bike, most people won’t be riding at 3am. Time should be held constant to view the relationship between temperature and bikes rented. Day of week and time of day may also interact, based on people’s travel behaviors for work and leisure. 8pm was chosen as the time of day to hold constant because it is outside normal 9-5 working hours, but not so late that people behave as if they need to rest for work the next day.

I created a scatter plot of temperature and bikes rented over the year at 8pm, and overlaid the line of best fit from a linear model.

df20 |>
  ggplot(aes(x=temp_c,y=rented_bikes)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +
  labs(title = 'Temperature vs Hourly Rented Bikes',
       x = "Temperature (Celsius)",
       y = "Rented Bikes")

## `geom_smooth()` using formula = 'y ~ x'

Temperature and bikes rented at 8pm share a positive linear relationship.

library(broom)  # provides tidy()

# Simple linear regression of 8pm rentals on temperature
model <- lm(df20$rented_bikes ~ df20$temp_c)

tidy(model) |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 1))

## # A tibble: 2 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    416. 
## 2 df20$temp_c     51.9

Based on this result, I expect about 416 bikes rented at 8pm when it is 0 degrees Celsius (32 degrees Fahrenheit). I also expect about 52 additional bikes rented for each one-degree (Celsius) increase in temperature.
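The fitted line can be turned into predictions by plugging a temperature into the equation. This minimal sketch uses the rounded coefficients from the output above; `predict_rentals` is a helper name introduced here for illustration, and predictions are only meaningful within the range of temperatures seen in the data.

```r
# Rounded coefficients copied from the model output above.
intercept <- 416    # expected rentals at 0 degrees Celsius
slope     <- 51.9   # additional rentals per degree Celsius

# Hypothetical helper: predicted 8pm rentals for a given temperature.
predict_rentals <- function(temp_c) intercept + slope * temp_c

print(predict_rentals(c(0, 10, 20)))  # 416, 935, 1454
```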

References

Diez, D. M., Çetinkaya-Rundel, M., & Barr, C. D. (2024). OpenIntro Statistics (4th ed.). OpenIntro.

W8 Data Dive: Seoul Bike Sharing

Jakob Morales

2025-03-10
