A Slight Adjustment

Intro

A confounding variable is one which influences both independent and dependent variables within a regression model. The omission of a confounder is a concern as it causes the model to incorrectly attribute its effects onto the remaining variables within a model. This can create situations where causal relationships are incorrectly identified, or relationships remain hidden.

An adjustment is the introduction of the additional variable to the model to observe its effects on the existing variables. Below is an example of an adjustment using data from the public bike sharing program in Seoul, South Korea.

Data

The dataset is available from the UCI Machine Learning Repository ¹, and originally sourced from Seoul Government’s Open Data Portal, which contains the number of bikes rented over time and a variety of weather statistics.

# Required Packages 
library(readr)
library(dplyr)
library(ggplot2)
 
# Load Data
bike_rides <- read_csv("./data/SeoulBikeData.csv")

# Clean
bike_rides <- bike_rides %>% 
  mutate(date = as.POSIXct(date, format = "%d/%m/%Y"), 
         season = as.factor(season))

head(bike_rides)

## # A tibble: 6 x 14
##   date                rented_bike_cou~  hour temperature_C humidity_perc
##   <dttm>                         <dbl> <dbl>         <dbl>         <dbl>
## 1 2017-12-01 00:00:00              254     0          -5.2            37
## 2 2017-12-01 00:00:00              204     1          -5.5            38
## 3 2017-12-01 00:00:00              173     2          -6              39
## 4 2017-12-01 00:00:00              107     3          -6.2            40
## 5 2017-12-01 00:00:00               78     4          -6              36
## 6 2017-12-01 00:00:00              100     5          -6.4            37
## # ... with 9 more variables: wind_speed_ms <dbl>, visibility_10m <dbl>,
## #   dew_point_temp_C <dbl>, solar_radiation_MJ_m2 <dbl>, rainfall_mm <dbl>,
## #   snowfall_cm <dbl>, season <fct>, holiday <chr>, functioning_day <chr>

The data set provides stats per hour. Let’s aggregate to daily statistics:

# Aggregate to daily intervals 
daily_df <- bike_rides %>% 
  filter(functioning_day == "Yes") %>% 
  group_by(date) %>% 
  summarise(count = sum(rented_bike_count), 
            temp = mean(temperature_C), 
            rainfall = sum(rainfall_mm), 
            season = last(season)) %>% 
  ungroup()

head(daily_df)

## # A tibble: 6 x 5
##   date                count    temp rainfall season
##   <dttm>              <dbl>   <dbl>    <dbl> <fct> 
## 1 2017-12-01 00:00:00  9539 -2.45        0   Winter
## 2 2017-12-02 00:00:00  8523  1.32        0   Winter
## 3 2017-12-03 00:00:00  7222  4.88        4   Winter
## 4 2017-12-04 00:00:00  8729 -0.304       0.1 Winter
## 5 2017-12-05 00:00:00  8307 -4.46        0   Winter
## 6 2017-12-06 00:00:00  6669  0.0458      1.3 Winter

Model

One might hypothesize that good weather may encourage a higher demand for rides. Plotting daily average temperatures against the number of rides seems to support this relationship. A nice upward sloping trend of increasing rides as temperatures increase. A simple regression model returns an R² of 56.6% and a strong statistical significance for the coefficient for Temperature. An increase of 1°C on average will increase the number of bike rentals by approx. 639.

Nice. Submit assignment. Close several StackOverflow tabs…

# Simple Linear Regression
# Plot ride counts vs temperature 
daily_df %>% 
  ggplot(aes(x = temp, y = count, colour = "steelblue")) + 
  geom_point(colour = "steelblue") + 
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") + 
  geom_hline(yintercept = mean(daily_df$count), linetype = "dashed") + 
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Seoul Bike Rentals", 
       x = "Temperature (C)", 
       y = "No. of Rides")

# Fit model
lm.simple <- lm(count ~ temp, data = daily_df)
summary(lm.simple)

## 
## Call:
## lm(formula = count ~ temp, data = daily_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20160.9  -4330.8    -34.5   5580.6  12789.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9326.57     515.93   18.08   <2e-16 ***
## temp          638.61      29.78   21.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6547 on 351 degrees of freedom
## Multiple R-squared:  0.5671, Adjusted R-squared:  0.5659 
## F-statistic: 459.9 on 1 and 351 DF,  p-value: < 2.2e-16

Adjustment

However, the more observant may notice small clusters of points to the lower right corner of the graph, where increased temperature does not increase the number of rides. Those familiar with the weather in Seoul may also theorise when this relationship breaks down.

As with many other countries in the region, South Korea is impacted by East Asian Monsoon winds, driven by the difference in temperatures from the ocean and land. This creates a much higher level of rainfall within the Summer, which might be counter-intuitive to some people depending on where they live. By contrast, San Francisco is another major city of similar latitude but different continent, which experiences the majority of its rainfall in the winter. Below is a comparison of the levels of rainfall over the year, seasons coloured for emphasis. San Francisco data has been taken from NCEI/NOAA ².

# Seoul rainfall per month 
seoul_rain <- daily_df %>% 
  mutate(month = format(date, "%m")) %>% 
  group_by(month) %>% 
  summarise(rainfall = sum(rainfall), season = last(season)) %>% 
  mutate(location = "Seoul") %>% 
  ungroup() 

# Compare with San Francisco 
# Load data
sf_data <- read_csv("./data/USW00023272.csv")

# Average monthly rainfall 
sf_rain <- sf_data %>% 
  select(DATE, PRCP, TAVG) %>% 
  filter(DATE >= "2008-12", DATE <= "2018-11") %>% 
  mutate(month = substr(DATE, 6, 8)) %>% 
  group_by(month) %>% 
  summarise(rainfall = mean(PRCP)) %>% 
  ungroup()

sf_rain$season <- seoul_rain$season 
sf_rain$location <- "San Francisco"

# Merge with Seoul data
rain_df <- rbind(seoul_rain, sf_rain)
rain_df$location <- factor(rain_df$location, levels = c("Seoul", "San Francisco"))

# Plot monthly rainfall of two cities 
rain_df %>% 
  ggplot(aes(x = month, y = rainfall, fill = season)) + 
  geom_bar(stat = "identity") + 
  scale_fill_manual(values = c("lightblue", "tan1", "indianred2", "steelblue2")) + 
  labs(title = "Comparative Rainfall per Month", 
       x = "Month", 
       y = "Total Rainfall (mm)", 
       fill = "Season") + 
  theme() + 
  facet_wrap(~location, ncol = 1, scales = "free")

Mutliple Regression

Revisiting the previous regression, we’ll create a conditional variable to identify rainy days (say 2mm of rainfall or more). Re-plotting we see a clear downward bias on ride numbers for rainy days. The mean of each group is denoted by the coloured dashed lines. Running a regression on each group separately would yield very different results, as depicted by the solid coloured trends. The black represents the previous simple regression for comparison.

The results of this model show that the temperature actually has a larger positive effect, increasing from 639 to 672 additional bike rentals per 1°C change once adjusting for rainy days. Meanwhile a rainy day will on average decrease the number of rentals for the day by 8,509. The Adjusted R² also increases to 67.3%.

# Mutliple Regression 

# Flag rainy days (min. 2mm rainfall)
daily_df <- daily_df %>% 
  mutate(rain = ifelse(rainfall > 2, "Y", "N"))

# Group mean for plot
group_mean <- daily_df %>% 
  group_by(rain) %>% 
  summarise(mean = mean(count)) %>% 
  ungroup()

# Redo plot with rainy days identified 
daily_df %>% 
  ggplot(aes(x = temp, y = count)) + 
  geom_point(aes(colour = rain)) + 
  geom_smooth(aes(colour = rain), method = "lm", formula = y ~ x, se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") +
  scale_color_manual(values = c("tan1", "steelblue2")) + 
  scale_y_continuous(labels = scales::comma) +
  geom_hline(yintercept = unlist(group_mean[,2]), 
             colour = c("tan1", "steelblue2"), 
             linetype = "dashed") + 
  labs(title = "Seoul Bike Rentals", 
       x = "Temperature (C)", 
       y = "No. of Rides", 
       colour = "Rain") +
  theme(legend.position = c(0.925, 0.15))

# Fit model 
lm.multi <- lm(count ~ temp + rain, data = daily_df)
summary(lm.multi)

## 
## Call:
## lm(formula = count ~ temp + rain, data = daily_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13498.9  -3855.0    -80.3   4524.9  16874.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10448.85     459.83   22.72   <2e-16 ***
## temp          671.52      26.03   25.80   <2e-16 ***
## rainY       -8509.13     790.61  -10.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5683 on 350 degrees of freedom
## Multiple R-squared:  0.6748, Adjusted R-squared:  0.6729 
## F-statistic: 363.1 on 2 and 350 DF,  p-value: < 2.2e-16

Breakdown

Let’s dig into these results a bit more closely and discuss the marginal and conditional associations. We revisit the previous graph, with emphasis on the means of each group. The marginal association, ignoring the x-axis (i.e. Temperature), differs quite significantly with a large downward bias for rainy days. Since the non-rainy days represent the majority of the sample, its mean is much closer to the total group mean (black). So the simple regression model looks OK at first glance because of the influence from the larger group.

# Plot with emphasis on means
daily_df %>% 
  ggplot(aes(x = temp, y = count)) + 
  geom_point(aes(colour = rain), alpha = 0.2) + 
  scale_color_manual(values = c("tan1", "steelblue2")) + 
  scale_y_continuous(labels = scales::comma) +
  geom_hline(yintercept = unlist(group_mean[,2]), 
             colour = c("tan1", "steelblue2"), 
             linetype = "dashed", 
             size = 1) + 
  geom_hline(yintercept = mean(daily_df$count), linetype = "dashed", size = 1) + 
  labs(title = "Seoul Bike Rentals",
       x = "Temperature (C)", 
       y = "No. of Rides", 
       colour = "Rain") +
  theme(legend.position = c(0.925, 0.15))

Now with emphasis on the conditional association, observing the change in temperature along the x-axis, we see the disparity between the groups grows larger as temperature increases. Hence the adjustment provides significantly more explanatory power to the overall model, and changes the coefficient of temperature.

# Plot with emphasis on slope
daily_df %>% 
  ggplot(aes(x = temp, y = count)) + 
  geom_point(aes(colour = rain), alpha = 0.2) + 
  geom_smooth(aes(colour = rain), method = "lm", formula = y ~ x, se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") +
  scale_color_manual(values = c("tan1", "steelblue2")) + 
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Seoul Bike Rentals",
       x = "Temperature (C)", 
       y = "No. of Rides", 
       colour = "Rain") +
  theme(legend.position = c(0.925, 0.15))

UCI data file headings may cause issues. Data reshared on github ↩︎
Also reshared at github ↩︎