Introduction
My project examines bicycle sharing in the Washington, D.C. area. A bike can be taken from any station and returned to any station in the area. A user is either a “Casual” or a “Registered” user. “Casual” users pay a five-day fee and can use the bikes for an unlimited amount of time within those five days. “Registered” users pay a one-year fee and can use the bikes for an unlimited amount of time within that year.
For the system to work, every station needs enough bikes to meet demand and enough empty spaces so that bikes can be returned at any time without hassle. Recording trip data helps the operators achieve this. When customers can count on finding a bike or an empty space at any station, they trust the service and keep using it. Bike sharing systems rely on regular customers to pay their workers and to repair and maintain the bikes and the stations. To do all of this, the operators must first gather data and analyze which stations need more or fewer bikes on any given day, at any given time. The data that I used in this study were collected between October 1, 2014 and December 31, 2014, a three-month period.
When I looked at this data set, some questions that I had were:
- What time of day are the bikes used the most often?
- Which days have more bike usage?
- Which type of user uses the bikes more during the week/weekend?
- How far apart are the stations from each other?
- How dense are the stations in the more popular areas of D.C. (downtown, outer part of downtown)?
The Stations data set includes the latitude and longitude of each bicycle sharing station (lat and long), along with the station’s name (name). The other data set I used, called Trips, has one row for each individual trip. It includes the station the bike was checked out from (sstation), the station the bike was returned to (estation), the type of client (client, which is either “Registered” or “Casual”), the starting date and time of the trip (sdate), and the ending date and time of the trip (edate).
Analysis
Finding the Most Popular Starting and Ending Stations:
Here I group the trips by starting station and by ending station, and count the number of bikes taken from and returned to each. Then I join the two summaries by station.
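library(dplyr)  # provides %>%, group_by(), summarize(), inner_join(), rename()
# Count the trips that start at each station
MostPopStart <-
  Trips %>%
  group_by(sstation) %>%
  summarize(BikesTaken = n())
# Count the trips that end at each station
MostPopEnd <-
  Trips %>%
  group_by(estation) %>%
  summarize(BikesReturned = n())
# Keep the stations that appear as both a start and an end
MostPop <-
  inner_join(MostPopStart, MostPopEnd, by = c("sstation" = "estation")) %>%
  rename(Station = sstation)
Table 1: Trying to Find the Most Popular Bike Stations: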
DT::datatable(MostPop, rownames = FALSE,
              options = list(pageLength = 9, scrollX = TRUE,
                             lengthMenu = c(5, 10, 15, 20)))
Table 2: The Most Popular Stations that Bikes are Taken From:
MostPop1 <-
  MostPop %>%
  select(Station, BikesTaken) %>%
  arrange(desc(BikesTaken))
DT::datatable(MostPop1, rownames = FALSE,
              options = list(pageLength = 5, scrollX = TRUE,
                             lengthMenu = c(5, 10, 15, 20)))
This data table shows the stations from which the most bikes were taken. It tells the people who manage these stations where more bikes need to be available, so that everyone who wants to ride from these locations is able to.
Table 3: The Most Popular Stations that Bikes are Taken To:
MostPop2 <-
  MostPop %>%
  select(Station, BikesReturned) %>%
  arrange(desc(BikesReturned))
DT::datatable(MostPop2, rownames = FALSE,
              options = list(pageLength = 5, scrollX = TRUE,
                             lengthMenu = c(5, 10, 15, 20)))
This data table shows the stations at which the most bikes were dropped off. This is also important because managers need to know how many open spots each station needs, so users can drop their bikes off without hassle.
By comparing these two tables, we can see that “Columbus Circle/Union Station” and “Massachusetts Ave & Dupont Circle NW” are the top two stations both for bikes taken and for bikes returned, which makes them the two busiest stations in the data. This is useful to management: these two stations need a lot of bikes available, yet they also need a lot of places open for bikes to be dropped off.
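Comparing bikes taken with bikes returned can also be done numerically. The sketch below is only an illustration: MostPopToy is a made-up stand-in for the MostPop table, and the NetFlow column is a hypothetical name, not something computed in the analysis above.

```r
library(dplyr)

# Made-up stand-in for MostPop (station names and counts are invented)
MostPopToy <- tibble(
  Station       = c("A St", "B St", "C St"),
  BikesTaken    = c(120, 80, 40),
  BikesReturned = c(100, 95, 45)
)

# NetFlow < 0: more bikes leave than arrive, so the station tends to run out of bikes.
# NetFlow > 0: more bikes arrive than leave, so the station tends to run out of empty spaces.
NetFlow <-
  MostPopToy %>%
  mutate(NetFlow = BikesReturned - BikesTaken) %>%
  arrange(NetFlow)

NetFlow
```

A station with a large negative or positive value would be a candidate for having bikes moved in or out, even if its total traffic is modest.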
Graph 1: Client Type and Duration of Ride
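library(ggplot2)  # provides ggplot(), geom_boxplot(), labs()
Trips %>%
  # Trip length in minutes; explicit units avoid relying on the unit R picks automatically
  mutate(durMin = as.numeric(difftime(edate, sdate, units = "mins"))) %>%
  ggplot(aes(x = client, y = durMin)) +
  geom_boxplot(fill = "burlywood") +
  ylim(c(0, 60)) +  # trips longer than 60 minutes are dropped from the plot
  labs(x = "Type of User",
       y = "Duration of Trip (Minutes)",
       title = "Duration of Trip in Minutes Based on Type of User")
This boxplot compares Casual and Registered users on how long they use a bike.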
This graph shows us that “Casual” clients usually use a bike longer than “Registered” clients. One plausible reason is that most registered users ride to commute to work, while most casual users take leisurely rides. It also helps to remember that registered users pay a yearly membership fee, while casual users pay for a five-day membership. Registered users also appear to have more varied trip durations, perhaps because some have longer trips to and from work, or because some also ride for leisure for long periods of time. Overall, registered users seem to take more trips than casual users.
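The comparison in the boxplot can also be summarized as a number per client type. Here is a minimal sketch, assuming a table shaped like Trips; TripsToy and its timestamps are made up for illustration, and medianMin is a hypothetical column name.

```r
library(dplyr)

# Made-up stand-in for Trips: two casual trips and three registered trips
TripsToy <- tibble(
  client = c("Casual", "Casual", "Registered", "Registered", "Registered"),
  sdate  = as.POSIXct("2014-10-01 08:00:00", tz = "UTC") + c(0, 600, 1200, 1800, 2400),
  edate  = sdate + c(1800, 3000, 300, 600, 900)  # trip lengths in seconds
)

# Median trip length in minutes and number of trips for each client type
DurationSummary <-
  TripsToy %>%
  mutate(durMin = as.numeric(difftime(edate, sdate, units = "mins"))) %>%
  group_by(client) %>%
  summarize(medianMin = median(durMin), trips = n())

DurationSummary
```

The median is used here because a few very long rides would pull a mean upward.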
Table 4: Who Uses the Bikes More?
Trips %>%
  group_by(client) %>%
  summarize(totalType = n()) %>%
  select(client, totalType) %>%
  arrange(desc(totalType)) %>%
  as.data.frame()
##       client totalType
## 1 Registered      8552
## 2     Casual      1448
Since registered users can use the bikes all year, they take shorter rides but many more of them. Casual users only have the bikes for five days, so their trip count is lower, and they may take longer rides to make the most of those five days. After all, they have paid to use the bikes for five days, whether they take advantage of them or not.
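To put the counts from Table 4 in proportion, here is a small sketch that turns them into percentages. The two numbers are copied from the output above; the variable names counts and shares are only for illustration.

```r
# Trip counts copied from Table 4
counts <- c(Registered = 8552, Casual = 1448)

# Share of all trips by client type, in percent
shares <- round(100 * counts / sum(counts), 1)
shares
```

Out of the 10,000 trips in the table, registered users account for about 85.5% and casual users for about 14.5%.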
Graph 2: Client Type and Day of the Week
library(lubridate)  # provides wday() and hour()
Temp <-
  Trips %>%
  mutate(DayofWeek = wday(sdate),
         H = hour(sdate)) %>%
  group_by(H, DayofWeek, client) %>%
  summarize(count = n()) %>%
  # Turn the day numbers 1-7 into short labels, kept in calendar order
  mutate(dw = plyr::mapvalues(DayofWeek, from = 1:7,
                              to = c("Su", "Mo", "Tu", "We", "Th", "Fr", "Sa"))) %>%
  mutate(dw = factor(dw, levels = c("Su", "Mo", "Tu", "We", "Th", "Fr", "Sa")))
# Note: Temp has one row per hour/day/client group, so this density is not weighted by count
Temp %>%
  ggplot(aes(x = H)) +
  geom_density(fill = "burlywood") +
  facet_grid(dw ~ client) +
  labs(title = "Client Usage on Different Days and Hours of the Week",
       x = "Hour")
From this graph, we can see that “Casual” users tend to use the bikes during the afternoon and evening hours on weekdays, and during the late morning and evening on Saturday. On Sunday, there is a slight decrease in use during the usually popular hours, but the bikes are used throughout the day. “Registered” users, on the other hand, show an increase in bike activity on weekdays at about 7-8am that stays fairly consistent throughout the day, followed by a decrease at around 5-6pm. Unsurprisingly, 7-8am is usually when people are going to work, and 5-6pm is usually when people are going home from work. On the weekends, “Registered” users seem to have lower overall usage; the increase starts a little earlier in the day, and the decrease comes at about the same time as on weekdays.
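The day-of-week and hour columns in the code above come from the wday() and hour() functions in lubridate. Here is a minimal sketch of what they return, using two made-up timestamps rather than the Trips data; the variable names are only for illustration.

```r
library(lubridate)

# Two made-up timestamps: a Sunday morning and a Monday evening
when <- as.POSIXct(c("2014-10-05 09:00:00", "2014-10-06 17:30:00"), tz = "UTC")

day_num  <- wday(when)                # day of week as a number, 1 = Sunday by default
day_name <- wday(when, label = TRUE)  # same information as an ordered factor of day names
hr       <- hour(when)                # hour of the day, 0 to 23

data.frame(when, day_num, day_name, hr)
```

Because wday() numbers the days starting with Sunday by default, the labels in the code above start with "Su". Passing label = TRUE is one possible alternative to relabeling the numbers by hand.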
Graph 3: Hour of Day and Amount of Users
Trips %>%
  mutate(H = hour(sdate)) %>%
  ggplot(aes(x = H)) +
  geom_density(fill = "burlywood") +
  labs(x = "Hour of Day",
       title = "Amount of Bike Users Based on the Hour of the Day")
Amount of bike users based on hour of day
This density plot shows how bike use is distributed across the hours of the day. There are two main spikes: one around 7-8am and the other around 5-6pm. Many people go to work around 7 or 8am and leave work around 5 or 6pm, which would explain the spikes at these times. Between these times, usage is lower and steadier. This could have several causes: people going to lunch, people leaving work early for various reasons, or casual users riding during the day. Usage does increase quite a bit at around 12:30pm, which leads me to assume that people are going out for lunch around this time. Below I have included a graph that shows the actual number of trips started in each hour:
Graph 4: Number of Events per Hour of Day
Trips %>%
  mutate(H = hour(sdate)) %>%
  group_by(H) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = H, y = count)) +
  geom_point() + geom_line() +
  labs(title = "Number of Events per Hour of Day",
       x = "Hour of Day",
       y = "Number of Bike Usage Occurrences")
Number of Times Activity was Incurred by Hour of Day
This graph shows the same curve as the density plot above, except that it shows the number of trips that started in each hour rather than a density. It was interesting to see the actual number of times a bike was taken at each hour of the day. We can see that there are more occurrences around 7-8am and 5-6pm, and a moderate amount of use spread between those times. This could be because people are going to or coming from work during these time frames.
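Graph 4 counts each trip once, at the hour it started. If returns should be counted as well, one possible approach is to stack the start and end times before counting. This is only a sketch on a made-up three-trip table (TripsToy and the event column are hypothetical), not the method used for the graph above.

```r
library(dplyr)
library(lubridate)

# Made-up stand-in for Trips with three trips
TripsToy <- tibble(
  sdate = as.POSIXct(c("2014-10-01 07:50:00", "2014-10-01 08:10:00",
                       "2014-10-01 17:05:00"), tz = "UTC"),
  edate = as.POSIXct(c("2014-10-01 08:05:00", "2014-10-01 08:30:00",
                       "2014-10-01 17:40:00"), tz = "UTC")
)

# One row per event: a bike being taken (trip start) or returned (trip end)
Events <-
  bind_rows(
    TripsToy %>% transmute(H = hour(sdate), event = "taken"),
    TripsToy %>% transmute(H = hour(edate), event = "returned")
  ) %>%
  count(H, event, name = "count")

Events
```

Keeping the event column separate makes it possible to plot bikes taken and bikes returned as two lines on the same hourly axis.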
Finding the Distance Between Stations:
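# Station names and coordinates, labeled as the start of a trip
Simple <-
  Stations %>%
  select(name, lat, long) %>%
  rename(sstation = name)
# The same stations, labeled as the end of a trip
Simple2 <-
  Simple %>%
  rename(estation = sstation, lat2 = lat, long2 = long)
# Cross join: every start station paired with every end station
StationPairs <- merge(Simple, Simple2, by = NULL)
# haversine() is an additional helper function that is not defined in this report
PairDistances <-
  StationPairs %>%
  mutate(distance = haversine(lat, long, lat2, long2)) %>%
  select(sstation, estation, distance)
# Attach the station-to-station distance to each trip
RideDists <-
  Trips %>%
  inner_join(PairDistances, by = c("sstation", "estation"))
Graph 5: Average Distance Between Stations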
# One row per distinct distance value; the density below is not weighted by NumofTrips
RideDists %>%
  group_by(distance) %>%
  summarize(NumofTrips = n()) %>%
  ggplot(aes(x = distance)) +
  geom_density(fill = "burlywood") +
  labs(title = "Distance between Stations",
       x = "Distance (miles)")
Most bike trips are around 1.25 miles long. Or, at least, that is how far apart the stations at the two ends of most trips are, since the data record only the starting and ending stations of each trip. Very few of the station pairs that riders traveled between were more than 5.0 miles apart, which makes sense if most registered users are using the stations during the week to go to work, presumably.
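The haversine() function used above is not shown in this report. A minimal sketch of how such a function could be written is below. The name haversine_miles, its argument order, and the Earth radius of 3958.8 miles are assumptions for illustration, and the original helper may differ (for example, in the units it returns).

```r
# Great-circle distance between points given in decimal degrees.
# radius defaults to an approximate mean Earth radius in miles (assumption).
haversine_miles <- function(lat1, long1, lat2, long2, radius = 3958.8) {
  to_rad <- pi / 180
  dlat   <- (lat2 - lat1) * to_rad
  dlong  <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Works on vectors, so it can be used inside mutate()
haversine_miles(c(38.90, 38.90), c(-77.03, -77.03),
                c(38.90, 38.91), c(-77.03, -77.03))
```

The first pair in the example is the same point twice, so its distance is zero. The second pair differs by one hundredth of a degree of latitude.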
Graph 6: Mapping the Stations
stationMap <-
  leaflet(Stations) %>%
  addTiles() %>%
  addCircleMarkers(radius = 2) %>%
  addCircles(radius = ~ mid, color = "blue", opacity = 0.001)
stationMap
This map shows where all the bike sharing stations are in the Washington, D.C. area, placed by the longitude and latitude found in the Stations data set. We can see that most of the stations are in the center of Washington, D.C., most likely where a lot of people work. There is also a large cluster of stations near Arlington and Ronald Reagan Washington National Airport, two popular locations in the Washington, D.C. area. Because the stations are fairly close together, people who work and live around Washington, D.C. can get around quickly.
Conclusion
In this study, I concluded that many of the people who use the bicycle sharing system in Washington, D.C. during the week are people who work from about 8 to 5, while weekend riders are mostly casual users. I reached this conclusion because usage peaks at commuting hours, Washington, D.C. has a lot of jobs, and most trips are between stations less than 5.0 miles apart, which makes it easy for those who live and work in D.C. to travel to work by bike. The bicycle sharing system is clearly well used in Washington, D.C., as shown by the amount of usage not only during the week but on the weekend as well. One problem I ran into was finding the distances between bicycle sharing stations. I had to bring in an additional function called haversine to compute those distances from the longitude and latitude of each station. Using the haversine function made this part of the analysis a lot easier.
Some other questions raised by this study are:
- How does the bicycle sharing in Washington, D.C. compare to bicycle sharing in places like Chicago, Orlando, or Nashville?
- What types of places (e.g., restaurants, parks, shopping centers) do users usually go to on the bicycles?
- Why do some people choose to travel by bicycle instead of a car or bus?