In this project, we analyze data from bicycle counters in Rostock, Germany. Data is kindly provided in open form from several counters in Rostock, via the web site http://www.opendata-hro.de/dataset/radmonitore
Not being familiar with this city, we choose (rather arbitrarily) a single bike counter in the center.
Importantly, in this project, we follow the recently published book “R for data science”, in order to get familiar with the suite of packages commonly referred to as “tidyverse”
This document is written in R markdown, in order to provide maximal reproducibility.
We are working in the tidyverse:
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Data from the Rostock open data portal was downloaded on 2016-10-03 and saved locally. Providing the locale to read_csv circumvents a nasty issue with daylight saving time and the one- or two-hour time shift from UTC.
data <- read_csv('data/external/radmonitore_daten.csv',
locale = locale(tz="Europe/Berlin"))
## Parsed with column specification:
## cols(
## standort_id = col_integer(),
## zeitpunkt = col_datetime(format = ""),
## summe = col_integer()
## )
data.radmonitore <- read_csv('data/external/radmonitore_standorte.csv')
## Parsed with column specification:
## cols(
## geometrie_wkt = col_character(),
## uuid = col_character(),
## kreis_name = col_character(),
## kreis_schluessel = col_integer(),
## gemeindeverband_name = col_character(),
## gemeindeverband_schluessel = col_integer(),
## gemeinde_name = col_character(),
## gemeinde_schluessel = col_double(),
## gemeindeteil_name = col_character(),
## gemeindeteil_schluessel = col_character(),
## strasse_name = col_character(),
## strasse_schluessel = col_character(),
## hausnummer = col_integer(),
## hausnummer_zusatz = col_character(),
## postleitzahl = col_integer(),
## id = col_integer(),
## bezeichnung = col_character(),
## website = col_character()
## )
id.bezeichung <- data.radmonitore %>%
select(id, bezeichnung) %>%
rename(standort_id=id)
data <- data %>% left_join(id.bezeichung, by = "standort_id") %>%
select(-standort_id) %>%
mutate(bezeichnung = factor(bezeichnung)) %>%
filter(bezeichnung == 'Am Strande/Holzhalbinsel') %>%
select(-bezeichnung)
train <- data%>% filter(year(zeitpunkt)<2015)
test <- data %>% filter(year(zeitpunkt)>=2015)
daily <- train %>%
mutate(date = date(zeitpunkt)) %>%
group_by(date) %>%
summarise(n = sum(summe)) %>%
mutate(weekday = wday(date, label = TRUE))
Historical weather data is kindly provided by the DWD. The corresponding weather station for Rostock has the ID 04271. Weather data was accessed on 2016-10-04.
The closest weather station from the DWD is presumably 04271 19911101 20161002 4 54.1803 12.0808 Rostock-Warnemünde Mecklenburg-Vorpommern, as revealed by the list of weather stations provided at ftp://ftp-cdc.dwd.de/pub/CDC/help/EB_Stundenwerte_Beschreibung_Stationen.txt, where the first number is the station id .
Data for this weather station was downloaded from ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/daily/kl/historical/tageswerte_04271_19470101_20151231_hist.zip on 2016-10-04.
This zip file contains, along with descriptions, the central datafile data/external/dwd_rostock. With annoying whitespaces, delimited by;` and an unparsed date. No problem for readr.
For solar data, it is important to know that missing values are encoded as -999.
# unfortunately, this file ncontains neither valid sunshine time nor precipitation for our date range
weather <- read_delim('data/external/dwd_rostock/produkt_klima_Tageswerte_19470101_20151231_04271.txt',';', trim_ws = TRUE)
## Parsed with column specification:
## cols(
## STATIONS_ID = col_integer(),
## MESS_DATUM = col_integer(),
## QUALITAETS_NIVEAU = col_integer(),
## LUFTTEMPERATUR = col_double(),
## DAMPFDRUCK = col_double(),
## BEDECKUNGSGRAD = col_integer(),
## LUFTDRUCK_STATIONSHOEHE = col_double(),
## REL_FEUCHTE = col_double(),
## WINDGESCHWINDIGKEIT = col_integer(),
## LUFTTEMPERATUR_MAXIMUM = col_double(),
## LUFTTEMPERATUR_MINIMUM = col_double(),
## LUFTTEMP_AM_ERDB_MINIMUM = col_double(),
## WINDSPITZE_MAXIMUM = col_integer(),
## NIEDERSCHLAGSHOEHE = col_integer(),
## NIEDERSCHLAGSHOEHE_IND = col_integer(),
## SONNENSCHEINDAUER = col_integer(),
## SCHNEEHOEHE = col_integer(),
## eor = col_character()
## )
## Warning: 110325 parsing failures.
## row col expected actual
## 1462 BEDECKUNGSGRAD no trailing characters .0
## 1462 NIEDERSCHLAGSHOEHE no trailing characters .8
## 1462 SONNENSCHEINDAUER no trailing characters .500
## 1463 BEDECKUNGSGRAD no trailing characters .2
## 1463 NIEDERSCHLAGSHOEHE no trailing characters .0
## .... .................. ...................... ......
## See problems(...) for more details.
weather$date <- ymd(weather$`MESS_DATUM`)
weather <- weather %>%
select(date, temperature = LUFTTEMPERATUR)
# sunshine: ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/daily/solar/
solar <- read_delim('data/external/dwd_rostock/produkt_strahlung_Tageswerte_19980101_20160831_04271.txt', ';', trim_ws = TRUE, na = "-999")
## Parsed with column specification:
## cols(
## STATIONS_ID = col_integer(),
## MESS_DATUM = col_integer(),
## QUALITAETS_NIVEAU = col_integer(),
## SONNENSCHEINDAUER = col_double(),
## GLOBAL_KW_J = col_double(),
## DIFFUS_HIMMEL_KW_J = col_double(),
## ATMOSPHAERE_LW_J = col_character(),
## eor = col_character()
## )
solar <- solar %>%
mutate(date = ymd(MESS_DATUM)) %>%
select(date, sunshinetime = SONNENSCHEINDAUER)
# precipitation: ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/daily/more_precip/historical/
precip <- read_delim('data/external/dwd_rostock/precip_klima_Tageswerte_19010101_20151231_04271.txt', ';', trim_ws=TRUE) %>%
mutate(date = ymd(MESS_DATUM)) %>%
select(date, precipitation = NIEDERSCHLAGSHOEHE)
## Parsed with column specification:
## cols(
## STATIONS_ID = col_integer(),
## MESS_DATUM = col_integer(),
## QUALITAETS_NIVEAU = col_integer(),
## NIEDERSCHLAGSHOEHE = col_double(),
## NIEDERSCHLAGSHOEHE_IND = col_integer(),
## SCHNEEHOEHE = col_integer(),
## eor = col_character()
## )
## Warning: 2 parsing failures.
## row col expected actual
## 23742 STATIONS_ID an integer
## 23742 NA 7 columns 1 columns
daily.weather <- daily %>%
left_join(weather, by="date") %>%
left_join(solar, by="date") %>%
left_join(precip, by = "date")
We have split our data into training and testing data, in order to have an untouched dataset for testing possible predictive models in a later stage. Also, we have aggregated bicycle counts for each day.
First, we investigate the bicycle counts over the entire time of our testing data:
ggplot(daily, aes(x=date, y=n)) +
geom_line() +
labs(title="Bicycle counts show strong variation over time")
We observe strong seasonal changes, summer in rostock sees many more cyclists than winter (no surprise here). Also, we observe faster fluctuations, maybe on a weekly scale.
Of course, we can enhance our plot by incorporating information about the day of the week:
ggplot(daily, aes(x=date, y=n)) +
geom_line(aes(alpha = 0.3)) +
geom_point(aes(color = weekday, size = 2, alpha = 0.3)) +
labs(title="Weekends tend to see fewer cyclists")
ggplot(daily.weather, aes(x=date, y=n)) +
geom_line(alpha=0.3) +
geom_point(aes(alpha = 0.4, size = 1.5, color=weekday)) +
labs(title='Summer sees far more cyclists')
ggplot(daily.weather, aes(x=temperature, y=n)) +
geom_point(aes(alpha=0.2, size = 1.5,color = weekday)) +
labs(title='On warmer days, cyclist counts are higher')
ggplot(daily.weather, aes(x=precipitation)) +
geom_histogram() +
labs(title="Most days in Rostock are dry")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(daily.weather, aes(x=precipitation, y=n)) +
geom_point(aes(alpha=0.2, size = 1.5,color = weekday)) +
scale_x_log10() +
labs(title="No strong relation of precipitation to cyclist counts")
ggplot(daily.weather, aes(x=sunshinetime, y=n)) +
geom_point() +
labs(title = "Many cyclists ride even when the sun is not shining")
## Warning: Removed 63 rows containing missing values (geom_point).
From our exploratory data analysis, we know that our count data is dominated by three main effects:
seasonal changes: during summertime, we observe a lot more cyclists
changes over weekdays: on sundays, we observe reduced bike usage
changes over daytime: several counters show a bimodal distribution of bicycle counts
In this section, we try to capture these observations in a regression model, starting with the most obvious effect of seasonal changes.
In order to keep things simple, we further reduce our data set to the single counter with the most counts (‘Am Strande/Holzhalbinsel’).
Moreover, we work with daily counts and thus discard changes over daytime.
We can use two well-known effects of summer for capturing the increase in bikers during the summer months:
Obviously, the two effects are strongly correlated. However, appropriate data is not available in the original dataframe. Whereas the sunshine time is easier to access, we can also try to capture weather data - this has the additional benefit of providing precipitation data for the observed time - and it is not unlikely that rain has a strong effect on bike counts.
The relation between daily cyclist count and temperature looks roughly linear. Nice. Nevertheless, there is a lot of variation in the data, which indicates the influence of additional factors. A likely factor is the day of the week - Saturdays and Sundays appear to be associated with ower overall counts.
We try to capture the influence of the temperature on the cyclist counts:
library(modelr)
model.daily.temperature <- lm(n~ temperature, data = daily.weather)
# and overlay the model fit
grid <- daily.weather %>%
data_grid(temperature=seq_range(temperature, 50)) %>%
add_predictions((model.daily.temperature), "n_pred")
ggplot(daily.weather, aes(x=temperature, y=n)) +
geom_point(aes(size=2, alpha=0.2, color = weekday)) +
geom_line(data=grid, aes(x=temperature, y=n_pred,size=1)) +
labs(title = "Temperature appears to be linearly related to cyclist counts")
daily.weather <- daily.weather %>%
add_residuals(model.daily.temperature, var = "temperature.detrended")
rms.daily.temp <- sqrt(mean(daily.weather$temperature.detrended^2))
ggplot(daily.weather, aes(date, temperature.detrended)) +
geom_line(alpha=0.2) +
geom_point(aes(alpha=0.5, size = 2, color=weekday)) +
geom_ref_line(h=0) +
labs(title=sprintf("After accounting for temperature, a lot of structure remains; RMS = %.1e", rms.daily.temp))
After accounting for the influence of air temperature, we observe that there is still some structure in the residuals: in particular, we observe that weekdays see more cyclists than weekends, and we observe some remaining seasonal patterns.
We see that weekend cyclist counts are much lower than anticipated by our simple temperature model. In the next iteration, we include this effect as an additional variable. (later on, we could try to differentiate only the two level factor business day or not)
model.daily.temperature.wday <- lm(n~ temperature + weekday, data = daily.weather)
# and overlay the model fit
grid <- daily.weather %>%
data_grid(temperature=seq_range(temperature, 50), weekday) %>%
add_predictions((model.daily.temperature.wday), "n.pred.temp.wday")
daily.weather <- daily.weather %>%
add_residuals(model.daily.temperature.wday, var = "temperature.wday.detrended")
rms.daily.temp.weekday <- sqrt(mean(daily.weather$temperature.wday.detrended^2))
ggplot(daily.weather, aes(date, temperature.wday.detrended)) +
geom_line(alpha=0.2) +
geom_point(aes(alpha=0.5, size = 2, color=weekday)) +
geom_ref_line(h=0) +
labs(title=sprintf("Correcting for temperature and weekdays removes a lot of structure; RMS = %.1e", rms.daily.temp.weekday))
This looks very reasonable. Root mean square error (RMS) has reduced considerably, and the weekend/workday pattern is much less pronounced - if visible at all.
We observe:
model.daily.temperature.wday.sunshine <- lm(n~ temperature + weekday + sunshinetime, data = daily.weather)
daily.weather <- daily.weather %>%
add_residuals(model.daily.temperature.wday.sunshine, var = "temperature.wday.sunshine.detrended")
rms.daily.temp.weekday.sunshine <- sqrt(mean(daily.weather$temperature.wday.sunshine.detrended^2,na.rm = TRUE))
ggplot(daily.weather, aes(date, temperature.wday.sunshine.detrended)) +
geom_line(alpha=0.2) +
geom_point(aes(alpha=0.5, size = 2, color=weekday)) +
geom_ref_line(h=0) +
labs(title=sprintf("Correcting for temperature and weekdays removes a lot of structure; RMS = %.1e", rms.daily.temp.weekday.sunshine))
## Warning: Removed 59 rows containing missing values (geom_path).
## Warning: Removed 63 rows containing missing values (geom_point).
This was a very successfull step in modeling our cyclist counts: The remaining seasonal patterns have reduced considerably (on account of sunshine time in summer being potentially much longer). Also, the root mean square error has reduced again notably.
We have not seen an apparent influence of precipitation on cyclist counts. Let’s see whether this holds up in univariate analysis and in the final model:
model.daily.precipiation <- lm(n ~ precipitation, data = daily.weather)
model.daily.temperature.wday.sunshine.precipitation <- lm(n~ temperature + weekday + sunshinetime + precipitation, data = daily.weather)
daily.weather <- daily.weather %>%
add_residuals(model.daily.temperature.wday.sunshine.precipitation, var = "temperature.wday.sunshine.precipitation.detrended")
rms.daily.temp.weekday.sunshine.precipitation <- sqrt(mean(daily.weather$temperature.wday.sunshine.precipitation.detrended^2,na.rm = TRUE))
ggplot(daily.weather, aes(date, temperature.wday.sunshine.precipitation.detrended)) +
geom_line(alpha=0.2) +
geom_point(aes(alpha=0.5, size = 2, color=weekday)) +
geom_ref_line(h=0) +
labs(title=sprintf("Accounting precipitation further reduces RMSE to %.1e", rms.daily.temp.weekday.sunshine.precipitation))
## Warning: Removed 59 rows containing missing values (geom_path).
## Warning: Removed 63 rows containing missing values (geom_point).
Again, accounting for the additional variable has reduced the root mean square error. However, the reduction was not as pronounced as previously. We will have to be careful with interpretation of this result; and maybe we will have to have a look at parameters like the Akaike information criterion.
Focusing on the outliers in the residuals, we observe:
three days, approximately in August, see far more cyclists than our model expects.
around the end of December, our model predicts more cyclists than we actually see.
Approximately in October 2014, we see a strong dip, which is not visible in October 2016.
We proceed with a close inspection of the outliers:
outliers <- daily.weather %>%
filter(abs(temperature.wday.sunshine.precipitation.detrended) > 900)
ggplot(outliers, aes(x=date, y=temperature.wday.sunshine.precipitation.detrended)) +
geom_point(aes(color=weekday)) +
geom_text(aes(label=date, colour = weekday, vjust = -0.5)) +
labs(title="Negative outliers tend to be public holidays")
Labeling the outliers with dates permits the following insights:
Public holidays, such as Christmas, October 3rd, possibly Easter, are associated with much fewer cyclists than our model accounts for. This indicates that we really should use a variable ‘public holiday’. Alternatively, we can pick up our previous notion of discrimination only ‘workdays’ and ’holidays.
The days with more cyclists are more difficult to interpret. Most notably, this affects some days in August. We need to check local events in Rostock.
We grab a csv file with official holidays in Mecklenburg-Vorpommern. Google was very helpful: http://www.feiertage.net/frei-tage.php
# we need to grab public holidays:
holidays.2013 <- read_delim('http://www.feiertage.net/csvfile.php?state=MV&year=2013&type=csv',';', trim_ws = TRUE) %>%
mutate(date = dmy(Tag)) %>%
select(date, Feiertage)
## Parsed with column specification:
## cols(
## Tag = col_character(),
## Feiertage = col_character(),
## Bundesland = col_character()
## )
holidays.2014 <- read_delim('http://www.feiertage.net/csvfile.php?state=MV&year=2014&type=csv',';', trim_ws = TRUE) %>%
mutate(date = dmy(Tag))%>%
select(date, Feiertage)
## Parsed with column specification:
## cols(
## Tag = col_character(),
## Feiertage = col_character(),
## Bundesland = col_character()
## )
holidays.2015 <- read_delim('http://www.feiertage.net/csvfile.php?state=MV&year=2015&type=csv',';', trim_ws = TRUE) %>% mutate(date = dmy(Tag))%>%
select(date, Feiertage)
## Parsed with column specification:
## cols(
## Tag = col_character(),
## Feiertage = col_character(),
## Bundesland = col_character()
## )
# we concatenate these frames
holidays <- bind_rows(holidays.2013, holidays.2014, holidays.2015)
# and create a column 'holiday' in daily.weather, which is true when a
# corresponding element exsits in the holiday table
daily.weather <- left_join(daily.weather, holidays, by = 'date') %>%
mutate(holiday = is.na(Feiertage)) %>%
mutate(holiday = !holiday)
model.daily.temperature.wday.sunshine.precipitation.holiday <- lm(n~ temperature + weekday + sunshinetime + precipitation + holiday, data = daily.weather)
daily.weather <- daily.weather %>%
add_residuals(model.daily.temperature.wday.sunshine.precipitation.holiday, var = "temperature.wday.sunshine.precipitation.holiday.detrended")
rms.daily.temp.weekday.sunshine.precipitation.holiday <- sqrt(mean(daily.weather$temperature.wday.sunshine.precipitation.holiday.detrended^2,na.rm = TRUE))
ggplot(daily.weather, aes(date, temperature.wday.sunshine.precipitation.holiday.detrended)) +
geom_line(alpha=0.2) +
geom_point(aes(alpha=0.5, size = 2, color=weekday)) +
geom_ref_line(h=0) +
labs(title=sprintf("Now with holidays: RMSE %.1e", rms.daily.temp.weekday.sunshine.precipitation.holiday))
## Warning: Removed 59 rows containing missing values (geom_path).
## Warning: Removed 63 rows containing missing values (geom_point).
outliers <- daily.weather %>%
filter(abs(temperature.wday.sunshine.precipitation.holiday.detrended) > 900)
ggplot(outliers, aes(x=date, y=temperature.wday.sunshine.precipitation.holiday.detrended)) +
geom_point(aes(color=weekday)) +
geom_text(aes(label=date, colour = weekday, vjust = -0.5)) +
labs(title="Some negative outliers remain")
Accounting for public holidays has further reduced the root mean square error. Some negative outliers remain, such as christmas, where many people are presumably on holiday.
The days with cyclist counts that are much higher than our models expect still need further investigation. Google to the rescue:
Check.
Our tentative final model has the following parameters:
summary(model.daily.temperature.wday.sunshine.precipitation)
##
## Call:
## lm(formula = n ~ temperature + weekday + sunshinetime + precipitation,
## data = daily.weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1622.55 -210.21 16.18 221.62 1283.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 674.782 31.570 21.374 < 2e-16 ***
## temperature 81.832 2.664 30.718 < 2e-16 ***
## weekday.L 95.921 40.251 2.383 0.0175 *
## weekday.Q -955.043 40.120 -23.805 < 2e-16 ***
## weekday.C 167.704 40.242 4.167 3.50e-05 ***
## weekday^4 -249.670 40.270 -6.200 1.01e-09 ***
## weekday^5 43.557 40.202 1.083 0.2790
## weekday^6 57.519 40.031 1.437 0.1512
## sunshinetime 66.894 3.754 17.822 < 2e-16 ***
## precipitation -31.954 4.587 -6.966 8.14e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 386.6 on 639 degrees of freedom
## (63 observations deleted due to missingness)
## Multiple R-squared: 0.8272, Adjusted R-squared: 0.8248
## F-statistic: 340 on 9 and 639 DF, p-value: < 2.2e-16
Modelling cyclist counts in Rostock thus permits the following insights:
Currently, this model has the following limitations:
We cannot account for sudden drops in cyclists when temperature is freezing. (is there an influence? we need to explore this)
We have not investigated changes due to holidays. Given the strong weekend/workday pattern, we would expect to see holidays as suspicious outliers in the residuals.
On a related matter, we should try to model the workday/weekend thing with a binary factor. This may help to reduce potential overfitting.