Introduction

This is my first stab at a case study. I have chosen to work with the Cyclistic dataset after completing the Googlr professional certificate. I will also focus on doing some illustrative data visualizations along the way. I am new to data analysis and hoping to learn a lot, so feedback is very welcome!

There are four parts to my script as follows:

Preparing the Environment

The first thing done was to install all the necessary packages and libraries for data cleaning, analysis and visualization. To do this, the tidyverse, here, skimr, janitor, dplyr, lubridate, geodist, ggplot2 and ggmap packages were installed. Also their, libraries were loaded as well.

#Load packages
library("tidyverse") # General analysis
library("here") # data cleaning
library("skimr") # data cleaning
library("janitor") # data cleaning
library("dplyr") #  data manipulation
library("lubridate") # dates
library("geodist") # distances
library("ggmap") # maps
library("ggplot2") # visualizations
library(scales) # visualizations

Now that the packages are loaded, let’s read the data in and take a peek.

# Load the data
cyclistic_tripdata_2020_09 <- read_csv("data_sources/202009-divvy-tripdata.csv")
cyclistic_tripdata_2020_10 <- read_csv("data_sources/202010-divvy-tripdata.csv")
cyclistic_tripdata_2020_11 <- read_csv("data_sources/202011-divvy-tripdata.csv")
cyclistic_tripdata_2020_12 <- read_csv("data_sources/202012-divvy-tripdata.csv")
cyclistic_tripdata_2021_01 <- read_csv("data_sources/202101-divvy-tripdata.csv")
cyclistic_tripdata_2021_02 <- read_csv("data_sources/202102-divvy-tripdata.csv")
cyclistic_tripdata_2021_03 <- read_csv("data_sources/202103-divvy-tripdata.csv")
cyclistic_tripdata_2021_04 <- read_csv("data_sources/202104-divvy-tripdata.csv")
cyclistic_tripdata_2021_05 <- read_csv("data_sources/202105-divvy-tripdata.csv")
cyclistic_tripdata_2021_06 <- read_csv("data_sources/202106-divvy-tripdata.csv")
cyclistic_tripdata_2021_07 <- read_csv("data_sources/202107-divvy-tripdata.csv")
cyclistic_tripdata_2021_08 <- read_csv("data_sources/202108-divvy-tripdata.csv")

# combine data
bike_rides <- rbind(
  cyclistic_tripdata_2020_09, cyclistic_tripdata_2020_10,
  cyclistic_tripdata_2020_11, cyclistic_tripdata_2020_12, cyclistic_tripdata_2021_01,
  cyclistic_tripdata_2021_02, cyclistic_tripdata_2021_03, cyclistic_tripdata_2021_04,
  cyclistic_tripdata_2021_05, cyclistic_tripdata_2021_06, cyclistic_tripdata_2021_07,
  cyclistic_tripdata_2021_08
)

# check data
bike_rides %>% 
  skim_without_charts()
Data summary
Name Piped data
Number of rows 4913072
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 4912863 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 450045 0.91 3 53 0 757 0
start_station_id 450571 0.91 1 36 0 1293 0
end_station_name 491380 0.90 10 53 0 756 0
end_station_id 491764 0.90 1 36 0 1293 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.04 41.64 41.88 41.90 41.93 42.08
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -87.52
end_lat 5015 1 41.90 0.04 41.51 41.88 41.90 41.93 42.15
end_lng 5015 1 -87.65 0.03 -88.07 -87.66 -87.64 -87.63 -87.44

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2020-09-01 00:00:07 2021-08-31 23:59:35 2021-05-26 16:51:05 4137017
ended_at 0 1 2020-09-01 00:04:43 2021-09-01 17:37:35 2021-05-26 17:12:45 4124534

Data Cleaning

Now that we have an overview of the data, we can proceed to perform some data cleaning. Also, some variables are renamed to improve comprehensiveness and some new variables are added to the data frame for analysis. The ride_length and day_of_week variables are created to improve analysis.

bike_rides <- bike_rides%>%
  clean_names() %>% 
  remove_empty(which = c("rows","cols")) %>% 
  distinct() %>% 
  rename(rider_type = member_casual) %>% 
  rename(bike_type = rideable_type) %>% 
  mutate(ride_length = difftime(ended_at,started_at), #, units = "secs"
         day_of_week = weekdays(started_at)
         )

To perform ride duration calculations, ride_length has to be converted to a numeric variable.

bike_rides$ride_length <- as.numeric(as.character(bike_rides$ride_length))
is.numeric(bike_rides$ride_length)
## [1] TRUE

Next, separating the started_at and ended_at variables.

bike_rides <- bike_rides %>% 
  separate(started_at, into = c('start_date','start_time'), sep = ' ') %>%
  separate(ended_at, into = c('end_date','end_time'), sep = ' ')

Just like ride_length, start_date and end_date need to be converted from a character to date format. Also, start_time and end_time are converted from character to period and the variable start_hour is created.

# convert chr to date
bike_rides$start_date<- ymd(bike_rides$start_date)
bike_rides$end_date<- ymd(bike_rides$end_date)

# convert chr to period
bike_rides$start_time <- hms(bike_rides$start_time)
bike_rides$end_time <- hms(bike_rides$end_time)

# get hour from period
bike_rides$start_hour <- hour(bike_rides$start_time)

The day, month and year should also be added to the data.frame for a more precise analysis.

#Get day, month and year of the rides
bike_rides<-bike_rides %>% 
  mutate(
    day = day(start_date),
    month = month(start_date),
    year = year(start_date)
  )

The same process will be repeated to add season and ride ristance. The latter is converted to miles to form ride_dist_mile. bike_rides is then filtered for only positive distances and the initial distance variable ’ride_distance` is removed from the data.frame.

# Add season to data frame
bike_rides<-bike_rides %>% 
  mutate(
    season = ifelse(
      month %in% 9:11, "Fall",
      ifelse(month %in% 1:2, "Winter",
             ifelse(month %in% 12, "Winter",
                    ifelse(month %in% 3:5, "Spring",
                           "Summer")
             )
      )
         )
    )

# Add distance covered during rides in miles
bike_rides<- bike_rides %>% 
  mutate(ride_distance = geodist_vec(start_lng,
                              start_lat,
                              end_lng,
                              end_lat,
                              paired = TRUE,
                              sequential = TRUE, 
                              measure = "cheap"),
  ride_dist_mile = ride_distance/1609,
  
  ) %>% 
  filter(ride_dist_mile > 0) %>% 
  select(-ride_distance)

Not quite done yet. It’s frustrating I know. But trust me, all that’s left is converting ride_length from seconds to minutes, making an abbreviated month column and filtering out the data for rides with positive times.

#Make a minutes column for ride length
bike_rides$ride_length_m <-
  minute(seconds_to_period(bike_rides$ride_length))

# Make variable abbreviated month variable
bike_rides$month_abb <- month.abb[bike_rides$month]

#remove rows with negative ride_length
bike_rides<-bike_rides %>% 
  
  filter(ride_length_m > 0) %>% 
  select(-ride_length)

Now cleaning is done, time to head into data analysis.

Data Analysis

The first thing is to perform some descriptive analysis to get a feel of the data. Lets’ take a look at ride length.

mean(bike_rides$ride_length_m) # straight average 
## [1] 15.27478
median(bike_rides$ride_length_m) # midpoint number in the ascending array of ride lengths
## [1] 12
max(bike_rides$ride_length_m) # longest ride
## [1] 59
min(bike_rides$ride_length_m) # shortest ride
## [1] 1

On average, a bike ride lasts a little over 15 minutes. Let’s examine ride distance.

mean(bike_rides$ride_dist_mile) # straight average 
## [1] 1.491295
median(bike_rides$ride_dist_mile) # midpoint number in the ascending array of ride distances
## [1] 1.132573
max(bike_rides$ride_dist_mile) # farthest rides
## [1] 30.04696
min(bike_rides$ride_dist_mile) # closest rides
## [1] 1.150712e-06

Interestingly, the average distance covered during a bike ride is about a mile and a half.

Next, looking at rider behavior based on weekdays, months and seasons.However, they all have to be put in the right order first.

# Order days of the week
bike_rides$day_of_week <- ordered(bike_rides$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

# Order months
bike_rides$month_abb <- ordered(bike_rides$month_abb, levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

# Order seasons
bike_rides$season <- ordered(bike_rides$season, levels=c("Spring", "Summer", "Fall", "Winter"))

First, Lets look at user activity.

bike_rides %>% 
  group_by(rider_type) %>% 
  summarise(number_rides = n(), average_ride_length = mean(ride_length_m),average_ride_distance = mean(ride_dist_mile)) %>% 
  arrange(desc(number_rides))
## # A tibble: 2 × 4
##   rider_type number_rides average_ride_length average_ride_distance
##   <chr>             <int>               <dbl>                 <dbl>
## 1 member          2554106                12.7                  1.44
## 2 casual          1987215                18.5                  1.56

Annual members go on more bike rides that casual riders but the latter spend more time during rides and ride longer distances.Now that’s done, let’s count the number of rides by the different user types and weekdays.

#Count the number of rides by weekday and by member
bike_rides %>% 
  group_by(rider_type, day_of_week) %>% 
  summarise(number_rides = n(), average_ride_length = mean(ride_length_m), 
            average_ride_distance =mean(ride_dist_mile)) %>% 
  arrange(desc(number_rides))
## # A tibble: 14 × 5
## # Groups:   rider_type [2]
##    rider_type day_of_week number_rides average_ride_length average_ride_distance
##    <chr>      <ord>              <int>               <dbl>                 <dbl>
##  1 casual     Saturday          447630                19.9                  1.62
##  2 member     Wednesday         388019                12.2                  1.42
##  3 member     Tuesday           381340                12.1                  1.41
##  4 member     Friday            377555                12.5                  1.42
##  5 member     Thursday          372578                12.1                  1.41
##  6 member     Saturday          370056                14.1                  1.51
##  7 casual     Sunday            368957                20.2                  1.61
##  8 member     Monday            346296                12.3                  1.40
##  9 member     Sunday            318262                14.2                  1.51
## 10 casual     Friday            291931                17.9                  1.54
## 11 casual     Thursday          224404                16.8                  1.52
## 12 casual     Monday            220769                18.3                  1.51
## 13 casual     Wednesday         217332                16.9                  1.53
## 14 casual     Tuesday           216192                17.2                  1.52

It’s apparent that weekends are the busiest part of the week, especially for casual riders. The longest and farthest bike rides also happen on the weekends. Let’s see how this pans out for time of day.

#Count the number of rides and average ride length at every start hour
bike_rides %>% 
  group_by(start_hour) %>% 
  summarise(number_rides = n(), average_ride_length = mean(ride_length_m), 
            average_ride_distance = mean(ride_dist_mile)) %>% 
  arrange(desc(number_rides))
## # A tibble: 24 × 4
##    start_hour number_rides average_ride_length average_ride_distance
##         <dbl>        <int>               <dbl>                 <dbl>
##  1         17       465359                15.5                  1.54
##  2         18       403888                15.2                  1.47
##  3         16       385624                16.0                  1.54
##  4         15       325356                16.5                  1.54
##  5         14       297153                17.0                  1.54
##  6         19       293319                15.0                  1.42
##  7         13       292989                16.6                  1.52
##  8         12       284692                16.1                  1.49
##  9         11       239380                15.9                  1.51
## 10         20       201095                15.0                  1.39
## # … with 14 more rows

Woah! seems peak rush hour happens from 4:00 pm to 6:00pm but the longest trips happen a bit earlier in the day. Finally, let’s see how seasons affect rider activity.

bike_rides %>% 
  group_by(rider_type, season) %>% 
  summarise(number_rides = n(), average_ride_length = mean(ride_length_m),
            average_ride_distance = mean(ride_dist_mile)) %>% 
  arrange(desc(number_rides))
## `summarise()` has grouped output by 'rider_type'. You can override using the `.groups` argument.
## # A tibble: 8 × 5
## # Groups:   rider_type [2]
##   rider_type season number_rides average_ride_length average_ride_distance
##   <chr>      <ord>         <int>               <dbl>                 <dbl>
## 1 casual     Summer      1107139                18.3                  1.58
## 2 member     Summer      1076649                12.8                  1.46
## 3 member     Fall         681606                12.9                  1.46
## 4 member     Spring       587660                12.8                  1.42
## 5 casual     Spring       414363                19.1                  1.52
## 6 casual     Fall         413820                18.8                  1.59
## 7 member     Winter       208191                11.5                  1.28
## 8 casual     Winter        51893                15.7                  1.41

Annual member are somewhat active throughout the year but it’s clear both users are most active during summer. Winter is the least active season for both users- especially casual riders.

That’s the end of the descriptive analysis. Time for the best part - Visualizations.

Data Visualizations

I decided to approach this on three levels. First, we’ll look at visualizations around the number of rides, second ride duration and last, ride location. But before all that, I want to make values to represent the time period of the data used. These values will be used in the plots.

mindate <- min(bike_rides$start_date)
maxdate <- max(bike_rides$start_date)
mindate
## [1] "2020-09-01"
maxdate
## [1] "2021-08-31"

Number of Rides

We can start by looking at the total number of bike rides. For example, the total number of rides by users and bike type.

bike_rides %>% 
  group_by(bike_type,rider_type) %>% 
  summarise(number_of_rides = n()) %>% 
  ggplot(mapping = aes(x = bike_type, y = number_of_rides, fill = rider_type))+
  geom_col(position = "dodge")+
  scale_y_continuous(labels = comma)+
  labs(title="Number of Rides for Users by Bike Type",
       subtitle=paste0("Data from: ", mindate, " to ", maxdate), 
       x = "Bike Type",
       y = "Number of Rides")

Classic bikes seem to be the most popular choice followed by electric bikes and then docked bikes. Let’s see how activity varies over the time of the day.

# Rider Type by Time of Day
bike_rides %>%
  filter(rider_type %in% c('casual','member')) %>%
  count(rider_type, start_hour) %>%
  ggplot(aes(x=start_hour,y=n,color=rider_type,group=rider_type)) + 
  geom_point() + 
  geom_line(linetype='dotted')+
  scale_y_continuous(labels = comma)+
  theme(axis.text.x  = element_text(size=8,angle=90)) + 
  labs(title="Number of Rides for Users by Time of Day", 
       subtitle=paste0("Data from: ", mindate, " to ", maxdate), 
       x = "Time of Day", 
       y = "Number of Rides")

Casual member are usually active late at night and in the afternon while annual members are active in the morning and in the afternoon. Rush hour is from 4 to 6 pm. What about activity during the week?

#Rider Type by Weekday
bike_rides %>%
  group_by(day, rider_type) %>%
  summarize(number_of_rides = n(),
            wday = day_of_week[1]) %>%
  group_by(wday, rider_type) %>%
  summarize(average_num_rides=mean(number_of_rides)) %>%
  ggplot(aes(x=wday,y=average_num_rides, fill = rider_type)) + 
  geom_bar(stat='identity', position = "dodge")+
  scale_y_continuous(labels = comma)+
  labs(title=" Average Number of Rides for Users by Weekday", 
       subtitle=paste0("Data from: ", mindate, " to ", maxdate), 
       x = "Day of Week", 
       y = "Average Number of Rides")

The average number of rides was used to cater of the difference in the number of each day of the week.

Sunday is the busiest day of the week for both annual member and casual riders but generally the mid-week is busier than the weekend. But how does activity vary throughout the entire day, every day of the week?

#number of rides for users by time of day for each day of week
bike_rides %>%
  filter(rider_type %in% c('casual','member')) %>%
  group_by(day_of_week, rider_type, start_hour) %>%
  summarize(number_of_rides = n(),
            wday = day_of_week[1]) %>%
  group_by(wday, rider_type, start_hour) %>%
  summarize(average_num_rides=mean(number_of_rides)) %>%
  ggplot(aes(x=start_hour,y=average_num_rides, group = rider_type,
             color = rider_type, linetype = rider_type, shape = rider_type)) + 
  geom_point(size=2) + 
  geom_line(size=0.5) + 
  facet_wrap(~wday,nrow=1)+
  scale_y_continuous(labels = comma)+
  labs(title=" Average Number of Rides for Users by Time of Day and Weekday", 
       subtitle=paste0("Data from: ", mindate, " to ", maxdate), 
       x = "Time of Day", 
       y = "Average Number of Rides")+
  theme(axis.text.x  = element_text(size=8,angle=90),
        legend.position="none")

The rush hour trend and the pattern of annual member and casual rider activity seems to be consistent throughout the mid-week. On weekends it appears both users have the same pattern of activity. How does this translate over different seasons?

#total number of rides by season
bike_rides %>%
  filter(rider_type %in% c('casual','member')) %>%
  count(rider_type, season) %>%
  ggplot(aes(x=season,y=n,fill=rider_type,group=rider_type)) + 
  geom_col(position = "dodge") + 
  scale_y_continuous(labels = comma)+
  labs(title="Number of Rides for Users by Season", 
       subtitle=paste0("Data from: ", mindate, " to ", maxdate), 
       x = "Season", 
       y = "Number of Rides")

As mentioned earlier, summer is the most active season and it’s the only season casual riders are more active that annual members. However, casual riders are barely even active during the winter.

Ride Duration

First, let’s determine the distribution of the ride lengths. A histogram would be best suited for this.

par(mar = c(4, 4, .1, .1))

#Histogram to see distribution of ride length
bike_rides %>%
  ggplot(aes(x=ride_length_m))+ 
  geom_histogram()+
  scale_y_continuous(labels = comma)+
  labs(
    title="Distribution of Ride Duration", 
    subtitle=paste0("Data from: ", mindate, " to ", maxdate), 
    x = "Ride Duration (Mins.)", 
    y = "Number of Rides")

This is a skewed distribution with a long right tail. Most trips are less than 30 minutes. Do the two users have similar duration distributions?

#Density plot
bike_rides %>%
  ggplot(aes(x=ride_length_m, fill = rider_type))+ 
  geom_density(alpha = 0.5)+
  scale_y_continuous(labels = comma)+
  labs(
    title="Distribution of Ride Duration for Users", 
    subtitle=paste0("Data from: ", mindate, " to ", maxdate), 
    x = "Ride Duration (Mins.)", 
    y = "Number of Rides") 

These distributions are different - the casual rider distribution is much less skewed with more weight on longer trips. Also, we clearly see that casual riders take longer trips than annual members.

Since this is a skewed distribution with a long right tail, if we want to characterize the “typical” ride duration we should probably not use the average. In the following we will focus on the median ride duration.

#Median Trip duration by day of month
bike_rides %>%
  group_by(day) %>%
  summarize(med.duration=median(ride_length_m),
            weekday=day_of_week[1]) %>%
  ggplot(aes(x=day,y=med.duration,group=1)) + 
  geom_point(aes(color=weekday),size=5) + 
  geom_line(linetype='dotted')+
  labs(x='Day of Month',
       y='Median Ride Duration (Mins.)',
       title='Median Ride Duration by Day of Month',
       subtitle=paste0("Data from: ", mindate, " to ", maxdate)
       )

In terms of duration, the longest trips happen on weekends, while the shortest happen mid-week.

Now let’s look at median trip duration by time of day for each weekday and for each type of user.

bike_rides %>%
  filter(rider_type %in% c('casual','member')) %>%
  group_by(day_of_week,start_hour,rider_type) %>%
  summarize(med.duration=median(ride_length_m)) %>%
  ggplot(aes(x=start_hour,y=med.duration,group=rider_type,
             color=rider_type,linetype=rider_type,shape=rider_type)) + 
  geom_point(size=2) + 
  geom_line(size=0.5) + 
  facet_wrap(~day_of_week,nrow=1) + 
  labs(x='Time of Day',
       y='Median Ride Duration (Mins.)',
       title='Ride Duration for Users by Time of Day and Weekday',
       subtitle=paste0("Data from: ", mindate, " to ", maxdate)
  )+
  scale_x_discrete(breaks=c(0,6,12,18))
## `summarise()` has grouped output by 'day_of_week', 'start_hour'. You can override using the `.groups` argument.

It’s clear that throughout the week casual riders take longer bike rides - especially on weekends. Let’s take a look at how seasons play a part on ride duration.

#Rider type by season
bike_rides %>% 
  group_by(season,rider_type) %>% 
  filter(ride_length_m > 0) %>% 
  summarise(med.duration=median(ride_length_m)) %>% 
  ggplot(mapping = aes(x = season, y = med.duration, fill = rider_type))+
  geom_col(position = "dodge")+
  scale_y_continuous(labels = comma)+
  labs(title="Ride Duration for Users by Season",
       subtitle=paste0("Data from: ", mindate, " to ", maxdate),
       x = "Season", 
       y = "Average Ride Duration (Mins.)")

It appears ride duration is fairly consistent throughout the year for both casual riders and annual members. However, bike rides are shorter during winter for both users.

Ride Location

This was actually the trickiest and most interesting part of the project for me mainly because I have never worked with digital maps before. It was tricky because I had to figure out how to get a Google API that would give me access to the map static API and a geocoding API.

Now let’s begin by getting the API key and registering it with Google.

# Get API key
api<- "AIzaSyAEPZ0uPNpowo4CkbpSa25yJBpSwRUVjUg"

# Register API with Google
register_google(key = api)

Now that’s done, we can start by mapping the bike stations in the system. First, load the data, then extract latitude and longitude for each station (along with number of rides originating from the station). Then plot station locations:

## get station info
station.info <- bike_rides %>%
  drop_na() %>% 
  group_by(start_station_id) %>%
  summarise(lat=as.numeric(start_lat[1]),
            long=as.numeric(start_lng[1]),
            name=start_station_name[1],
            n.trips=n())


## get map and plot station locations 
chicago.map <- get_map(location= 'Chicago', 
                       maptype='roadmap', color='bw',source='google',zoom=13)

#Mapping station activity
ggmap(chicago.map) + 
  geom_point(data=station.info,aes(x=long,y=lat,color=n.trips),size=3,alpha=0.75)+
  scale_colour_gradient(high="red",low='green')+ 
  theme(axis.ticks = element_blank(),axis.text = element_blank())+
  xlab('')+ylab('')+
  ggtitle("Cyclistic Bike Ride Activity")

It appears majority of bike rides happen near the coast. Let’s take a look at where casual riders and annual members are active.

#Mapping Rider location
rider.location<- bike_rides %>%
  drop_na() %>% 
  distinct() %>% 
  filter(ride_dist_mile>0 & ride_length_m > 0) %>% 
  select(start_station_id,start_station_name, start_lat, start_lng,rider_type)


ggmap(chicago.map) + 
  geom_point(data = rider.location,
             aes(x=start_lng,y=start_lat,color=rider_type),size=.10,alpha=0.5) + 
  theme(axis.ticks = element_blank(),
        axis.text = element_blank(),
        legend.position="none")+
  ggtitle("Rider Activity Locations")

For the most part, annual members and casual riders bike in the same areas but the former is more spread out throughout Chicago and rely on Cyclistic bikes more often. However, casual riders are not as spread out and are more active near the coastal areas.

Finally let’s finish by taking the busiest station in the system (in terms of starting trips). Then visualize to where the most frequently occurring trips are.

We need to find the busiest station first

#to find busiest station
top.station <- bike_rides %>%
  drop_na() %>% 
  group_by(start_station_id) %>%
  summarise(n.trips=n(),
            name=start_station_name[1],
            lat=start_lat[1],
            lon=start_lng[1]) %>%            
  arrange(desc(n.trips)) %>%
  slice(1)

top.station
## # A tibble: 1 × 5
##   start_station_id n.trips name                      lat   lon
##   <chr>              <int> <chr>                   <dbl> <dbl>
## 1 13022              51111 Streeter Dr & Grand Ave  41.9 -87.6

So the most active station is station 13022. Now extract trips originating here and find the top 20 trips:

#to find top 20 trip from busiest station
busy.station.out <- bike_rides %>%
  drop_na() %>% 
  filter(start_station_id== as.numeric(top.station$start_station_id)) %>%
  group_by(end_station_id) %>%
  summarise(n.trips=n(),
            name=end_station_name[1],
            start.lat = as.numeric(start_lat[1]),
            start.lon = as.numeric(start_lng[1]),
            end.lat = as.numeric(end_lat[1]),
            end.lon = as.numeric(end_lng[1])) %>%
  arrange(desc(n.trips)) %>% 
  slice(1:20)

Now plot the extracted routes:

map_busiest <- get_map(location = c(lon = top.station$lon, 
                               lat = top.station$lat), color='bw',source='google',zoom=14)


ggmap(map_busiest) + 
  geom_segment(data=busy.station.out,aes(x=start.lon,y=start.lat,
                                         xend=end.lon,yend=end.lat,
                                         color=n.trips),size=1,alpha=0.75) +
  geom_point(data=busy.station.out,aes(x=end.lon,y=end.lat,color=n.trips), size=3,alpha=0.75) + 
  geom_point(data=top.station, aes(x=lon,y=lat), size=4, alpha=0.5) +
  scale_colour_gradient(high="red",low='green') + 
  theme(axis.ticks = element_blank(),
        axis.text = element_blank()) +
  xlab('')+ylab('') +
  ggtitle(paste0('Top 20 Trips starting at ', top.station$name))

Conclusion

Thank you for taking the time to read through my first exploration of a case study. I look forward to doing more. Again, I welcomes comments and suggestions!