Source file ⇒ Assignment_7.Rmd

In this assignment, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.

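Two data tables are available: Stations, which describes each docking station (name, location, and bike and dock counts), and Trips, which records individual rentals.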

You can access the data like this:

head(Stations)
##                                         name      lat      long nbBikes
## 1                             20th & Bell St 38.85610 -77.05120       7
## 2                            18th & Eads St. 38.85725 -77.05332       8
## 3                          20th & Crystal Dr 38.85640 -77.04920       8
## 4                          15th & Crystal Dr 38.86017 -77.04959       9
## 5 Aurora Hills Community Ctr/18th & Hayes St 38.85787 -77.05949       7
## 6    Pentagon City Metro / 12th & S Hayes St 38.86230 -77.05994       7
##   nbEmptyDocks
## 1            4
## 2            3
## 3            7
## 4            2
## 5            4
## 6           12
head(data_site)
## [1] "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds"
head(Trips)
##          duration               sdate                            sstation
## 344758  0h 9m 15s 2014-11-06 16:26:00                      15th & L St NW
## 113251 0h 47m 21s 2014-10-12 11:30:00                       3rd & D St SE
## 633756 2h 46m 22s 2014-12-27 14:24:00                      10th & E St NW
## 466862 0h 15m 15s 2014-11-23 16:42:00                       4th & M St SW
## 474332 0h 18m 33s 2014-11-24 17:29:00 1st & Washington Hospital Center NW
## 581597  0h 2m 36s 2014-12-15 13:11:00                 11th & Kenyon St NW
##                      edate                        estation bikeno
## 344758 2014-11-06 16:35:00                  15th & L St NW W00169
## 113251 2014-10-12 12:17:00       Jefferson Dr & 14th St SW W01482
## 633756 2014-12-27 17:10:00                  10th & E St NW W21346
## 466862 2014-11-23 16:57:00                   5th & K St NW W00647
## 474332 2014-11-24 17:47:00 Columbus Circle / Union Station W21580
## 581597 2014-12-15 13:14:00         Park Rd & Holmead Pl NW W21286
##            client
## 344758 Registered
## 113251 Registered
## 633756     Casual
## 466862     Casual
## 474332 Registered
## 581597 Registered

The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you can access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
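One way to build the address of the full data set is to strip -Small from data_site in code rather than by hand. This is only a sketch: the name full_site is introduced here for illustration, and because the source does not show how Trips was loaded, the read step is an assumption and is left commented out.

```r
# Address of the small sample, as printed by head(data_site)
data_site <- "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds"

# Drop the "-Small" suffix to point at the full quarterly file
full_site <- sub("-Small", "", data_site, fixed = TRUE)
full_site

# Assumption: the .rds file can be read directly from the URL like this
# Trips <- readRDS(gzcon(url(full_site)))
```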

Time of Day

It’s natural to expect that bikes are rented more at some times of day than others. The variable sdate gives the time (including the date) that the rental started.

Make these plots and interpret them:

  1. A density plot of the events versus sdate. Your plot should look like:
Trips %>% 
  ggplot(aes(x = sdate)) + 
  geom_density()

This density plot shows how frequently trips started on a given date, relative to all dates in the quarter.

  2. A density plot of the events versus time of day. You can use lubridate::hour() and lubridate::minute() to extract the hour of the day and minute within the hour from sdate, e.g.
Trips2 <- Trips %>% 
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate) / 60)

Trips2 %>%
  ggplot(aes(x = time_of_day)) +
  geom_density()
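To see what time_of_day holds, the calculation can be tried on a single timestamp. The sketch below uses the start time of the first trip shown in head(Trips); the UTC time zone is an assumption made only so the example is reproducible.

```r
library(lubridate)

# Start time of the first trip shown in head(Trips); time zone is an assumption
sdate <- as.POSIXct("2014-11-06 16:26:00", tz = "UTC")

# Hour of the day plus the fraction of the hour already elapsed: 16 + 26/60
time_of_day <- hour(sdate) + minute(sdate) / 60
time_of_day
```

The result is a decimal hour between 0 and 24, which is why the density plot has a continuous x axis.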

  3. Facet (2) by day of the week. (Use lubridate::wday() to generate day of the week.)
Trips3 <- Trips %>% 
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate) / 60, day_of_week = lubridate::wday(sdate))

Trips3 %>% 
  ggplot(aes(x = time_of_day)) +
  geom_density() +
  facet_wrap(~day_of_week)

  4. Set the fill aesthetic for geom_density() to the client variable. You may also want to set the alpha for transparency and color=NA to suppress the outline of the density function.
Trips4 <- Trips %>% 
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate) / 60, day_of_week = lubridate::wday(sdate))

Trips4 %>% 
  ggplot(aes(x = time_of_day)) +
  # alpha and color are fixed settings, not data mappings, so they go outside aes()
  geom_density(aes(fill = client), alpha = 0.2, color = NA) +
  facet_wrap(~day_of_week)

NOTE: client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (level Casual).

  5. Same as (4) but using geom_density() with the argument position = position_stack().
Trips5 <- Trips %>% 
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate) / 60, day_of_week = lubridate::wday(sdate))

Trips5 %>% 
  ggplot(aes(x = time_of_day)) +
  # alpha and color are fixed settings, not data mappings, so they go outside aes()
  geom_density(aes(fill = client), alpha = 0.5, color = NA, position = position_stack()) +
  facet_wrap(~day_of_week)

  6. Rather than faceting on day of the week, consider creating a new faceting variable like this:
Trips6 <- Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate) / 60) %>%
  mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday"))

Trips6 %>% 
  ggplot(aes(x = time_of_day)) +
  # alpha and color are fixed settings, not data mappings, so they go outside aes()
  geom_density(aes(fill = client), alpha = 0.1, color = NA, position = position_stack()) +
  facet_wrap(~wday)
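The weekend test relies on how lubridate::wday() numbers the days. As a quick check, the sketch below applies the same ifelse() rule to three start times taken from head(Trips); the UTC time zone and the names day_number and day_type are assumptions made for this illustration.

```r
library(lubridate)

# Start times taken from head(Trips); the time zone is an assumption
sdate <- as.POSIXct(
  c("2014-11-06 16:26:00", "2014-10-12 11:30:00", "2014-12-27 14:24:00"),
  tz = "UTC"
)

# With lubridate's default week start, wday() gives 1 for Sunday and 7 for Saturday
day_number <- wday(sdate)
day_type <- ifelse(day_number %in% c(1, 7), "weekend", "weekday")

data.frame(sdate, day_number, day_type)
```

If the lubridate.week.start option has been changed, the numbers 1 and 7 no longer mean Sunday and Saturday, so it may be safer to compare against wday(sdate, label = TRUE) in that case.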

Trip distance

How does the start-to-end trip distance depend on time of day, day of the week, and client?

To answer this, you need first to compute the distance in each trip. As a start, compute a table like the following from the Stations data.

How to do this?

  1. Make two copies of Stations, which we’ll call Left and Right. Left will have names sstation, lat, and long. Right will have names estation, lat2, and long2. The other variables, nbBikes and nbEmptyDocks, should be dropped. Use the function dplyr::rename() to rename name, lat, and long (e.g. dplyr::rename(sstation = name)).
  2. Join Left and Right with a full outer join. Because the two tables share no column names, this matches every case in Left to every case in Right (a cross join). You can accomplish this with Left %>% merge(Right, all = TRUE).
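The effect of the join in step 2 can be checked on a toy example. The sketch below uses two made-up two-row tables (left_demo and right_demo are hypothetical names, and the values are invented) to show that merge() pairs every row with every row when there are no shared columns.

```r
# Two toy tables with no column names in common (values are made up)
left_demo  <- data.frame(sstation = c("A", "B"), lat = c(38.85, 38.86))
right_demo <- data.frame(estation = c("A", "B"), lat2 = c(38.85, 38.86))

# With nothing to match on, merge() returns the Cartesian product of the rows
pairs <- merge(left_demo, right_demo, all = TRUE)

nrow(pairs)  # one row per (sstation, estation) combination
pairs
```

For the real Stations table this means the result has one row per ordered pair of stations, including each station paired with itself.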

With the latitude and longitude of each station, you have enough information to calculate the distance between stations. This calculation is provided by the haversine() function, which you can load with source("http://tiny.cc/dcf/haversine.R").

left <- mosaic::read.file("http://tiny.cc/dcf/DC-Stations.csv")
## Reading data with read.csv()
right <- mosaic::read.file("http://tiny.cc/dcf/DC-Stations.csv")
## Reading data with read.csv()
left1 <- left %>%
  select(name,lat,long) %>%
  dplyr::rename(sstation=name)
head(left1)
##                                     sstation      lat      long
## 1                             20th & Bell St 38.85610 -77.05120
## 2                            18th & Eads St. 38.85725 -77.05332
## 3                          20th & Crystal Dr 38.85640 -77.04920
## 4                          15th & Crystal Dr 38.86017 -77.04959
## 5 Aurora Hills Community Ctr/18th & Hayes St 38.85787 -77.05949
## 6    Pentagon City Metro / 12th & S Hayes St 38.86230 -77.05994
right2 <- right %>%
  select(name,lat,long) %>%
  dplyr::rename(estation=name,lat2=lat,long2=long) 
head(right2)
##                                     estation     lat2     long2
## 1                             20th & Bell St 38.85610 -77.05120
## 2                            18th & Eads St. 38.85725 -77.05332
## 3                          20th & Crystal Dr 38.85640 -77.04920
## 4                          15th & Crystal Dr 38.86017 -77.04959
## 5 Aurora Hills Community Ctr/18th & Hayes St 38.85787 -77.05949
## 6    Pentagon City Metro / 12th & S Hayes St 38.86230 -77.05994
new <- left1 %>%  
  merge(right2,all=TRUE) 
  3. Using the merged table, add a variable dist like this:
source("http://tiny.cc/dcf/haversine.R")

Stations2 <- new %>% 
  mutate(dist= haversine(lat,long,lat2,long2)) %>%  
  select(sstation,estation,dist) 

head(Stations2)
##                                     sstation       estation      dist
## 1                             20th & Bell St 20th & Bell St 0.0000000
## 2                            18th & Eads St. 20th & Bell St 0.2237177
## 3                          20th & Crystal Dr 20th & Bell St 0.1763635
## 4                          15th & Crystal Dr 20th & Bell St 0.4734716
## 5 Aurora Hills Community Ctr/18th & Hayes St 20th & Bell St 0.7441989
## 6    Pentagon City Metro / 12th & S Hayes St 20th & Bell St 1.0236764
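The contents of haversine.R are not shown here, so the following is only a sketch of the standard haversine formula, not the course's implementation. The function name haversine_km, the inputs in degrees, and the Earth radius of 6371 km are all assumptions; with them, the result for the first two stations comes out close to the dist value listed above, which suggests the distances are in kilometres.

```r
# Great-circle distance in kilometres between points given in degrees
# (hypothetical stand-in for the sourced haversine(); radius is an assumption)
haversine_km <- function(lat1, long1, lat2, long2, radius = 6371) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius * asin(pmin(1, sqrt(a)))
}

# 20th & Bell St to 18th & Eads St., coordinates from head(Stations)
d <- haversine_km(38.85610, -77.05120, 38.85725, -77.05332)
d
```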
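# Record the hour of day at which each trip started
Trips$hours <- lubridate::hour(Trips$sdate)
# Match trips to station pairs on the shared sstation and estation columns
finaltable <- Stations2 %>%
  merge(Trips, all = TRUE)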

ff <- finaltable[complete.cases(finaltable),]
head(ff)
##         sstation                      estation      dist   duration
## 1 10th & E St NW                10th & E St NW 0.0000000 2h 46m 22s
## 2 10th & E St NW                10th & E St NW 0.0000000 0h 43m 10s
## 5 10th & E St NW                10th & U St NW 2.3669377 0h 14m 19s
## 6 10th & E St NW 10th St & Constitution Ave NW 0.3209389 1h 41m 53s
## 7 10th & E St NW 10th St & Constitution Ave NW 0.3209389  0h 3m 24s
## 8 10th & E St NW 10th St & Constitution Ave NW 0.3209389 0h 23m 46s
##                 sdate               edate bikeno     client hours
## 1 2014-12-27 14:24:00 2014-12-27 17:10:00 W21346     Casual    14
## 2 2014-10-31 18:57:00 2014-10-31 19:40:00 W01048     Casual    18
## 5 2014-10-12 13:57:00 2014-10-12 14:11:00 W20237 Registered    13
## 6 2014-10-19 09:34:00 2014-10-19 11:16:00 W01330     Casual     9
## 7 2014-10-04 15:49:00 2014-10-04 15:55:00 W21458     Casual    15
## 8 2014-11-23 11:34:00 2014-11-23 11:58:00 W21957     Casual    11
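With dist, hours, and client in one table, a first look at the question can be a grouped summary. The sketch below rebuilds only the six rows printed by head(ff) (ff_head and mean_dist are names introduced here), so the numbers it produces say nothing about the full data; the same aggregate() call applied to ff is the intended use.

```r
# The six rows printed by head(ff), keeping only the columns needed here
ff_head <- data.frame(
  dist   = c(0.0000000, 0.0000000, 2.3669377, 0.3209389, 0.3209389, 0.3209389),
  client = c("Casual", "Casual", "Registered", "Casual", "Casual", "Casual"),
  hours  = c(14, 18, 13, 9, 15, 11)
)

# Mean start-to-end distance for each client type
mean_dist <- aggregate(dist ~ client, data = ff_head, FUN = mean)
mean_dist
```

Replacing the formula with dist ~ client + hours gives one mean per client type and starting hour, which can then be plotted against hours.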