In this assignment, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:
Stations gives the locations of the bike rental stations.
Trips contains records of individual rentals.
You can access the data like this.
Stations <- mosaic::read.file("http://tiny.cc/dcf/DC-Stations.csv")
data_site <- "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds"
Trips <- readRDS(gzcon(url(data_site)))
The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you can access the full data set of more than 600,000 events by removing -Small from the URL stored in data_site.
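As a small sketch of that switch, the full-data URL can be derived from the sample URL instead of being retyped. The name full_data_site is introduced here for illustration only, and the download itself is left commented out because the file is large.

```r
# URL of the 10,000-trip sample, as given in the assignment
data_site <- "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds"

# Drop the literal "-Small" suffix to point at the full quarterly file
# (full_data_site is a hypothetical name used only in this sketch)
full_data_site <- sub("-Small", "", data_site, fixed = TRUE)
print(full_data_site)

# Reading works the same way as for the sample (not run here: large download)
# Trips <- readRDS(gzcon(url(full_data_site)))
```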
It’s natural to expect that bikes are rented more at some times of day than others. The variable sdate gives the time (including the date) that the rental started.
Make these plots and interpret them:
1. A density plot of the events versus sdate. Your plot should look like:
Trips %>%
  ggplot(aes(x = sdate)) +
  geom_density()
2. A density plot of the events versus time of day. You can use lubridate::hour() and lubridate::minute() to extract the hour of the day and minute within the hour from sdate, e.g.
Trips %>%
  mutate(time_of_day =
           lubridate::hour(sdate) + lubridate::minute(sdate) / 60) %>%
  ... further processing ...
3. Facet (2) by day of the week. (Use lubridate::wday() to generate day of the week.)
4. Set the fill aesthetic for geom_density() to the client variable.[1] You may also want to set the alpha for transparency and color = NA to suppress the outline of the density function.
5. Same as (4) but using geom_density() with the argument position = position_stack().
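The time-of-day, faceting, fill, and stacking steps above might be combined along the following lines. This is only a sketch on made-up data: ToyTrips, ByTime, day_of_week, p_overlap, and p_stacked are names invented here, and the random start times will not reproduce the patterns in the real Trips table.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for Trips: only the two columns used in this sketch
set.seed(1)
n <- 200
ToyTrips <- data.frame(
  sdate = as.POSIXct("2014-10-01 00:00:00", tz = "UTC") +
    runif(n, min = 0, max = 14 * 24 * 3600),  # two weeks of random start times
  client = sample(c("Registered", "Casual"), n, replace = TRUE)
)

# Time of day in fractional hours, plus a labelled day of the week
ByTime <- ToyTrips %>%
  mutate(
    time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate) / 60,
    day_of_week = lubridate::wday(sdate, label = TRUE)
  )

# Density vs. time of day, faceted by day, filled by client (overlapping)
p_overlap <- ggplot(ByTime, aes(x = time_of_day)) +
  geom_density(aes(fill = client), alpha = 0.5, color = NA) +
  facet_wrap(~ day_of_week)

# The same plot with the client densities stacked on top of each other
p_stacked <- ggplot(ByTime, aes(x = time_of_day)) +
  geom_density(aes(fill = client), alpha = 0.5, color = NA,
               position = position_stack()) +
  facet_wrap(~ day_of_week)

print(p_overlap)
print(p_stacked)
```

Comparing the two plots side by side is one way to judge whether overlapping or stacked densities make the client groups easier to read.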
Rather than faceting on day of the week, consider creating a new faceting variable like this:
mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday"))
Your plot should look like:
Is it better to facet on wday and fill with client, or vice versa?
How does the start-to-end trip distance depend on time of day, day of the week, and client?
To answer this, you need first to compute the distance in each trip. As a start, compute a table like the following from the Stations data.
##         sstation      lat      long            estation     lat2     long2
## 1  5th & F St NW 38.89722 -77.01935      19th & L St NW 38.90341 -77.04365
## 2  5th & K St NW 38.90304 -77.01903 12th & Army Navy Dr 38.86290 -77.05280
## 3  6th & H St NE 38.89997 -76.99835 14th & Upshur St NW 38.94202 -77.03265
## 4 14th & G St NW 38.89807 -77.03182      24th & N St NW 38.90660 -77.05152
How to do this?
Create two copies of Stations, which we’ll call Left and Right. Left will have names sstation, lat, and long. Right will have names estation, lat2, and long2. The other variables, nbBikes and nbEmptyDocks, should be dropped. Use the function dplyr::rename() to do the renaming of name, lat, and long (e.g. dplyr::rename(sstation = name)).
Join Left and Right with a full outer join. This is a join in which every case in Left is matched to every case in Right. You can accomplish the full outer join with Left %>% merge(Right, all = TRUE). Because Left and Right share no column names after the renaming, merge() pairs every row of Left with every row of Right.
Of course, with the latitude and longitude of each station, you have enough information to calculate the distance between stations. This calculation is provided by the haversine() function, which you can load with
source("http://tiny.cc/dcf/haversine.R")
Then …
Using the merged table, add a variable dist like this:
mutate(dist = haversine(lat, long, lat2, long2))
The end result, which we’ll call Distances, should look like this:
##         sstation            estation     dist
## 1  5th & F St NW      19th & L St NW 2.212522
## 2  5th & K St NW 12th & Army Navy Dr 5.335478
## 3  6th & H St NE 14th & Upshur St NW 5.537507
## 4 14th & G St NW      24th & N St NW 1.950647
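The whole route from Stations to Distances might be sketched as follows on a three-station toy table. Several things here are assumptions rather than part of the assignment: ToyStations and its nbBikes and nbEmptyDocks values are invented (only the coordinates are copied from the sample table above), and haversine_km() is a hypothetical local stand-in for the course's haversine(), assuming degrees in, kilometres out, and an Earth radius of 6371 km.

```r
library(dplyr)

# Toy stand-in for Stations; coordinates copied from the sample table,
# nbBikes and nbEmptyDocks are invented placeholder values
ToyStations <- data.frame(
  name = c("5th & F St NW", "19th & L St NW", "5th & K St NW"),
  lat  = c(38.89722, 38.90341, 38.90304),
  long = c(-77.01935, -77.04365, -77.01903),
  nbBikes = c(5, 8, 3),
  nbEmptyDocks = c(6, 3, 12),
  stringsAsFactors = FALSE
)

# Hypothetical replacement for the course's haversine():
# great-circle distance in km between points given in degrees
haversine_km <- function(lat, long, lat2, long2, radius = 6371) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat) * to_rad
  dlong <- (long2 - long) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * radius * asin(sqrt(a))
}

# Two renamed copies with the unneeded variables dropped
Left <- ToyStations %>%
  select(name, lat, long) %>%
  rename(sstation = name)

Right <- ToyStations %>%
  select(name, lat, long) %>%
  rename(estation = name, lat2 = lat, long2 = long)

# No shared column names, so merge() returns every Left/Right pairing
Distances <- Left %>%
  merge(Right, all = TRUE) %>%
  mutate(dist = haversine_km(lat, long, lat2, long2)) %>%
  select(sstation, estation, dist)

print(Distances)
```

With three stations the result has nine rows, including three zero-distance rows where a station is paired with itself.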
Join Trips to Distances to add a dist value for each trip.
Make a density plot of dist broken down by weekend vs weekday and by client.
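One way the join and the density plot might fit together is sketched below on invented data. It assumes that Trips carries sstation and estation columns matching those in Distances; ToyDistances, ToyTrips, and TripDist are hypothetical names, and the distances are made up.

```r
library(dplyr)
library(ggplot2)

# Invented distances (km) between three made-up stations
ToyDistances <- data.frame(
  sstation = c("A", "A", "B", "B", "C", "C"),
  estation = c("B", "C", "A", "C", "A", "B"),
  dist     = c(1.2, 3.4, 1.2, 2.5, 3.4, 2.5),
  stringsAsFactors = FALSE
)

# Invented trips: each one picks a random station pair and start time
set.seed(2)
n <- 300
pair <- sample(nrow(ToyDistances), n, replace = TRUE)
ToyTrips <- data.frame(
  sdate = as.POSIXct("2014-10-01", tz = "UTC") + runif(n, 0, 28 * 24 * 3600),
  client = sample(c("Registered", "Casual"), n, replace = TRUE),
  sstation = ToyDistances$sstation[pair],
  estation = ToyDistances$estation[pair],
  stringsAsFactors = FALSE
)

# Attach dist to each trip, then label weekend vs. weekday
# (lubridate::wday() returns 1 for Sunday and 7 for Saturday by default)
TripDist <- ToyTrips %>%
  left_join(ToyDistances, by = c("sstation", "estation")) %>%
  mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1, 7),
                       "weekend", "weekday"))

# Density of trip distance, filled by client, one panel per wday value
p <- ggplot(TripDist, aes(x = dist)) +
  geom_density(aes(fill = client), alpha = 0.5, color = NA) +
  facet_wrap(~ wday)
print(p)
```

A left join keeps every trip, so any trip whose station pair is missing from the distance table would show up with an NA in dist rather than silently disappearing.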
Show the distribution of dist in a compact way with a violin plot or box and whiskers plot. For both geom_violin() and geom_boxplot(), you will want to map the group aesthetic to hour. For geom_boxplot(), you may prefer to set outlier.size = 1. You might also want to add a stat_smooth() layer.
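The layering described above might look roughly like this. It is a sketch on invented data, not the code behind the assignment's figure: ToyTripDist, p_box, and p_violin are hypothetical names, and the distances are drawn at random, so they show no real time-of-day pattern.

```r
library(dplyr)
library(ggplot2)

# Invented trips: random start times over one week, random distances in km
set.seed(3)
n <- 500
ToyTripDist <- data.frame(
  sdate = as.POSIXct("2014-10-01", tz = "UTC") + runif(n, 0, 7 * 24 * 3600),
  dist  = rexp(n, rate = 0.5)
) %>%
  mutate(hour = lubridate::hour(sdate))

# One box per hour of the day, small outlier points, plus a smoothed trend
p_box <- ggplot(ToyTripDist, aes(x = hour, y = dist)) +
  geom_boxplot(aes(group = hour), outlier.size = 1) +
  stat_smooth(method = "loess", formula = y ~ x, se = FALSE)

# The violin version uses the same grouping by hour
p_violin <- ggplot(ToyTripDist, aes(x = hour, y = dist)) +
  geom_violin(aes(group = hour))

print(p_box)
print(p_violin)
```

Mapping group to hour is what produces one box or violin per hour; without it, a numeric x variable would be treated as a single group.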
For example, this is a possible plot:
[1] client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (level Casual).