In this assignment, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.

Two data tables are available: Stations and Trips.

You can access the data like this.

Stations <- mosaic::read.file("http://tiny.cc/dcf/DC-Stations.csv")
data_site <- "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds" 
Trips <- readRDS(gzcon(url(data_site)))

The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you can access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
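As a small sketch of that last step, the full-data address can be built from the small one by deleting the -Small suffix. The name full_data_site is made up for this illustration; the download itself is left as a comment because the file is large.

```r
data_site <- "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds"

# Drop the "-Small" suffix to point at the full quarterly file.
# full_data_site is a hypothetical name used only in this sketch.
full_data_site <- sub("-Small", "", data_site, fixed = TRUE)
full_data_site

# Reading works the same way as for the small file (large download):
# Trips <- readRDS(gzcon(url(full_data_site)))
```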

Time of day

It’s natural to expect that bikes are rented more at some times of day than others. The variable sdate gives the time (including the date) that the rental started.

Make these plots and interpret them:

  1. A density plot of the events versus sdate. A command like this produces it:
Trips %>% 
  ggplot(aes(x = sdate)) + 
  geom_density()

  2. A density plot of the events versus time of day. You can use lubridate::hour() and lubridate::minute() to extract the hour of the day and the minute within the hour from sdate, e.g.

    Trips %>% 
      mutate(time_of_day = 
               lubridate::hour(sdate) + lubridate::minute(sdate) / 60) %>%
      ... further processing ...
  3. Facet (2) by day of the week. (Use lubridate::wday() to generate day of the week.)

  4. Set the fill aesthetic for geom_density() to the client variable (see footnote 1). You may also want to set the alpha for transparency and color=NA to suppress the outline of the density function.

  5. Same as (4) but using geom_density() with the argument position = position_stack().

  6. Rather than faceting on day of the week, consider creating a new faceting variable like this:

    mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday"))

    [Figure: the expected plot, not reproduced here.]
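The pieces from the list above (time of day, the weekend/weekday variable, fill by client, and faceting) can be tried together on a few made-up timestamps before running them on Trips. This is only a sketch: ToyTrips and its four rows are invented for illustration, and the code assumes lubridate's default week numbering, in which Sunday is 1 and Saturday is 7.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for Trips (illustration only, not real rentals).
ToyTrips <- data.frame(
  sdate = as.POSIXct(c("2014-10-04 09:30:00",   # Saturday
                       "2014-10-05 14:15:00",   # Sunday
                       "2014-10-06 08:00:00",   # Monday
                       "2014-10-08 17:45:00"),  # Wednesday
                     tz = "UTC"),
  client = c("Casual", "Casual", "Registered", "Registered")
)

ToyTrips <- ToyTrips %>%
  mutate(
    # Hour of the day as a decimal number, e.g. 09:30 becomes 9.5
    time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate) / 60,
    # Assumes the default week start: Sunday = 1, Saturday = 7
    wday = ifelse(lubridate::wday(sdate) %in% c(1, 7), "weekend", "weekday")
  )

# Density of time of day, filled by client, one panel per wday value
p <- ggplot(ToyTrips, aes(x = time_of_day)) +
  geom_density(aes(fill = client), alpha = 0.5, color = NA) +
  facet_wrap(~ wday)
```

Calling print(p) draws the plot. With the real Trips table, replace ToyTrips with Trips and keep the rest unchanged.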

Trip distance

How does the start-to-end trip distance depend on time of day, day of the week, and client?

To answer this, you first need to compute the distance covered by each trip. As a start, compute a table like the following from the Stations data.

##         sstation      lat      long            estation     lat2     long2
## 1  5th & F St NW 38.89722 -77.01935      19th & L St NW 38.90341 -77.04365
## 2  5th & K St NW 38.90304 -77.01903 12th & Army Navy Dr 38.86290 -77.05280
## 3  6th & H St NE 38.89997 -76.99835 14th & Upshur St NW 38.94202 -77.03265
## 4 14th & G St NW 38.89807 -77.03182      24th & N St NW 38.90660 -77.05152

How to do this?

  1. Make two copies of Stations, which we’ll call Left and Right. Left will have names sstation, lat, and long. Right will have names estation, lat2, and long2. The other variables, nbBikes and nbEmptyDocks, should be dropped. Use the function dplyr::rename() to rename name, lat, and long (e.g. dplyr::rename(sstation = name)).
  2. Join Left and Right with a full outer join. Because Left and Right share no variable names, this matches every case in Left to every case in Right. You can accomplish the join with Left %>% merge(Right, all = TRUE).
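The two steps above can be sketched on a tiny made-up table. ToyStations, its station names, and its numbers are invented for illustration; only the column names come from the description of Stations. Pairs is likewise a hypothetical name for the merged result.

```r
library(dplyr)

# Made-up stand-in for Stations (names and coordinates are illustrative only).
ToyStations <- data.frame(
  name = c("Station A", "Station B", "Station C"),
  lat = c(38.90, 38.91, 38.92),
  long = c(-77.02, -77.03, -77.04),
  nbBikes = c(5, 3, 8),
  nbEmptyDocks = c(6, 8, 3)
)

# Left keeps lat and long as they are and renames only the station name.
Left <- ToyStations %>%
  select(name, lat, long) %>%
  rename(sstation = name)

# Right renames all three variables so nothing is shared with Left.
Right <- ToyStations %>%
  select(name, lat, long) %>%
  rename(estation = name, lat2 = lat, long2 = long)

# With no shared column names, merge() pairs every row of Left
# with every row of Right.
Pairs <- Left %>% merge(Right, all = TRUE)
nrow(Pairs)  # 3 stations give 3 * 3 = 9 pairs
```

Because every station is paired with every station, including itself, the merged table grows with the square of the number of stations.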

Of course, with the latitude and longitude of each station, you have enough information to calculate the distance between stations. This calculation is provided by the haversine() function, which you can load with

source("http://tiny.cc/dcf/haversine.R")
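If you are curious what such a function computes, here is a minimal sketch of the standard haversine (great-circle) formula. It is not the course's haversine.R: the name haversine_km, the Earth radius of 6371 km, and the assumption that inputs are in degrees and the result is in kilometers are all choices made for this illustration.

```r
# Hypothetical stand-in for the course's haversine(); assumes degrees in, km out.
haversine_km <- function(lat1, long1, lat2, long2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  # Haversine of the central angle between the two points
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# 5th & F St NW to 19th & L St NW, coordinates from the first row
# of the station-pair table above
d <- haversine_km(38.89722, -77.01935, 38.90341, -77.04365)
d
```

For this pair the sketch gives a little over 2.2 km. The function is vectorized, so it can be used inside mutate() on whole columns.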

Then …

  3. Using the merged table, add a variable dist like this:

    mutate(dist = haversine(lat, long, lat2, long2))

The end result, which we’ll call Distances, should look like this:

##         sstation            estation     dist
## 1  5th & F St NW      19th & L St NW 2.212522
## 2  5th & K St NW 12th & Army Navy Dr 5.335478
## 3  6th & H St NE 14th & Upshur St NW 5.537507
## 4 14th & G St NW      24th & N St NW 1.950647

Join Trips to Distances to add a dist value for each trip.
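One way to do this join is sketched below on made-up rows. It assumes that Trips identifies the start and end stations with variables named sstation and estation, matching the names chosen for Distances; check names(Trips) before relying on that. ToyTrips, ToyDistances, and WithDist are hypothetical names, and the values are invented.

```r
library(dplyr)

# Made-up rows shaped like Trips and Distances (illustration only).
ToyTrips <- data.frame(
  sstation = c("Station A", "Station B", "Station A"),
  estation = c("Station B", "Station C", "Station A"),
  client = c("Registered", "Casual", "Registered")
)
ToyDistances <- data.frame(
  sstation = c("Station A", "Station B", "Station A"),
  estation = c("Station B", "Station C", "Station A"),
  dist = c(1.4, 1.4, 0)
)

# Keep every trip; a trip whose station pair is missing from
# ToyDistances would get dist = NA.
WithDist <- ToyTrips %>%
  left_join(ToyDistances, by = c("sstation", "estation"))
WithDist
```

A left join keeps trips even when a station name has no match, which makes unmatched names easy to spot as NA values in dist. An inner join would silently drop them.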

Distributions of distances

  1. Make a density plot of dist broken down by weekend vs weekday and by client.

  2. Show the distribution of dist in a compact way with a violin plot or a box-and-whisker plot. For both geom_violin() and geom_boxplot(), you will want to map the group aesthetic to hour. For geom_boxplot(), you may prefer to set outlier.size = 1. You might also want to add a stat_smooth() layer.
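A sketch of the second plot, using randomly generated numbers in place of real trips, shows how the group aesthetic, outlier.size, and stat_smooth() fit together. ToyTrips and every value in it are made up, so the shape of the resulting plot says nothing about actual bike use. Coloring the smoother by client and faceting by wday are choices made for this sketch.

```r
library(dplyr)
library(ggplot2)

set.seed(1)

# Randomly generated stand-in for Trips after dist has been joined on.
ToyTrips <- data.frame(
  sdate = as.POSIXct("2014-10-06 00:00:00", tz = "UTC") +
    runif(200, 0, 7 * 24 * 3600),
  dist = rexp(200, rate = 0.5),
  client = sample(c("Registered", "Casual"), 200, replace = TRUE)
) %>%
  mutate(
    hour = lubridate::hour(sdate),
    wday = ifelse(lubridate::wday(sdate) %in% c(1, 7), "weekend", "weekday")
  )

# One box per hour of the day, small outlier points, and a smoother
# per client drawn over the boxes.
p <- ggplot(ToyTrips, aes(x = hour, y = dist)) +
  geom_boxplot(aes(group = hour), outlier.size = 1) +
  stat_smooth(aes(color = client), se = FALSE) +
  facet_wrap(~ wday)
```

Calling print(p) draws the plot. Swapping geom_boxplot() for geom_violin(aes(group = hour)) gives the violin version.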

[Figure: an example of a possible plot, not reproduced here.]


  1. client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (level Casual).