Introduction

In this paper, we will be analyzing made-up data from a fictional company called Cyclistic. This is a bike sharing company in Chicago, Illinois. This company offers a subscription/membership, but it also offers bike sharing for casual customers. These two groups of people will be referred to as “casual riders” and “members”.

This company’s goal is to maximize the number of annual memberships. Therefore, the research question we will be attempting to answer is: how can Cyclistic maximize the number of memberships? To answer this, we need to know how casual riders and members differ from one another. Using this data, we will then be able to come up with business strategies and marketing tactics for the company.


Data

We are given two sets of cross-sectional data. One is from the first quarter of 2019 and the second is from the first quarter of 2020. Both of these datasets share variables such as starting time, ending time, start station name, end station name, and type of customer (i.e. member or casual). However, only the first dataset includes variables such as gender and birth year, and only the second dataset includes the latitude and longitude of the start and end stations.


Quick Look at the Data

Here is a quick look at the 2019 and 2020 data, respectively:


Here are all the column names in each dataset:

Both datasets have similar data, but ‘trips_2020’ has latitude and longitude of the stations, and ‘trips_2019’ has gender and birth year data. Both contain data on the first quarter on each respective year


Data Cleaning

To clean the data, we check for outliers, misspellings, missing values, and inconsistencies in variable formats.

Getting rid of outliers

As we can see in the graph below, there are many trips that are lasting longer than 24 hours long. These are mistakes in the data.

Though messy due to all the data, the points in red on this plot show all the bike rides in the dataset that were over 24 hours long. Therefore, we are getting rid of all the data points that are above the third quartile.

Q3 <- quantile(trips_2019$tripduration,0.75,na.rm=TRUE)
trips_2019 <- trips_2019 %>%
  filter(tripduration <= Q3) 


Now we are going to do the same thing with the second dataset, but first we need to add a trip duration variable:

trips_2020 <- trips_2020 %>%
  mutate(
    start_time_dt = as.POSIXct(started_at),
    end_time_dt   = as.POSIXct(ended_at),
    tripduration = as.numeric(difftime(end_time_dt, start_time_dt, units = "secs")), 
    tripduration_hrs = tripduration/60/60,
    start_time_only = format(start_time_dt, "%H:%M:%S"), 
    end_time_only   = format(end_time_dt, "%H:%M:%S"))

Just as we did with the first dataset, we will filter out all the values that are above the third quartile.

Q3 <- quantile(trips_2020$tripduration,0.75,na.rm=TRUE)
trips_2020 <- trips_2020 %>%
  filter(tripduration <= Q3) 


Joining the Two Datasets

To join these datasets, we will need to change some variables names:

trips_2020 <- trips_2020 %>%
  rename(
    trip_id = ride_id,
    start_time = started_at,
    end_time = ended_at,
    usertype = member_casual,
    from_station_name = start_station_name,
    from_station_id=start_station_id,
    to_station_name = end_station_name,
    to_station_id = end_station_id)

Now we can join the two datasets together into a new one:

combined_trips <- bind_rows(trips_2019, trips_2020)

Since many of the station names are the same across the two datasets, we can use the latitude and longitude from the 2020 data and fill in many of the blanks of the 2019 data.

combined_trips <- combined_trips %>%
  group_by(from_station_name) %>%
  mutate(
    start_lat = first(na.omit(start_lat)),
    start_lng = first(na.omit(start_lng))
  ) %>%
  ungroup() %>%
  group_by(to_station_name) %>%
  mutate(
    end_lat = first(na.omit(end_lat)),
    end_lng = first(na.omit(end_lng))
  ) %>%
  ungroup()

We need to change the variable names in the second dataset to match those in the first dataset.

combined_trips <- combined_trips %>%
  mutate(usertype = recode(usertype, "Customer" = "casual", "Subscriber" = "member"))


Analyzing the data

In this section, we will be performing regressions using the logit model. The dependent variable will be ‘usertype’. We are going to make usertype into a dummy variable.

combined_trips <- combined_trips %>% 
  mutate(usertype_dummy=ifelse(usertype=="member",1,0))


Starting and Ending Coordinates

Here is a map of the coordinates of where each person first picked up the rental bike. Casual customers are shows in blue and subscribers in red.

It shows that there may be a relationship between starting location and usertype. However, when we divide this map in half horizontally, we find that the percentage of trips started in the top half and bottom half are roughly the same across the two groups.

prop.table(table(combined_trips$start_zone, combined_trips$usertype), margin = 2)
##    
##        casual    member
##   0 0.2347928 0.2427287
##   1 0.7652072 0.7572713

After this investigation, we concluded that starting zone has little to no effect on usertype and should not be included in the regression.


We also suspect that people who are members of this bike sharing system are commuting to work by bike. So we are creating a dummy variable that is equal to 1 when it is around the typical time to commute to and from work, and it is equal to 0 otherwise.

combined_trips <- combined_trips %>%
  mutate(start_hour = hour(start_time),
    commute_time = ifelse((start_hour >= 6 & start_hour < 8) |  
                            (start_hour >= 17 & start_hour < 19), 
                          1, 0))

The commute hours are chosen roughly based on the graph showing how many departures there are per hour across the data:

## Saving 7 x 5 in image


Here is the model we will be running. The independent variables include trip duration, birth year, commute time, and starting location. I am using robust standard errors to control for heteroscedasticity.

combined_model <- glm(usertype_dummy ~ tripduration + birthyear + commute_time, family = binomial,
  data = combined_trips)
robust_se <- coeftest(combined_model, vcov = sandwich)
robust_se
## 
## z test of coefficients:
## 
##                 Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept)   2.0785e+02  8.6513e+00  24.026 < 2.2e-16 ***
## tripduration -2.1355e-03  1.1045e-04 -19.334 < 2.2e-16 ***
## birthyear    -1.0178e-01  4.3484e-03 -23.407 < 2.2e-16 ***
## commute_time  3.1878e-01  5.0737e-02   6.283 3.321e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Logistic Regression Results with Robust Standard Errors: Predicting Membership
Term Estimate Robust.SE Pr…z.. Odds.Ratio
(Intercept) (Intercept) 207.854 8.651 0 1.861963e+90
tripduration tripduration -0.002 0.000 0 9.980000e-01
birthyear birthyear -0.102 0.004 0 9.030000e-01
commute_time commute_time 0.319 0.051 0 1.375000e+00


Interpretation of Results:

  • Each additional second of trip duration reduces the odds of being a member of Cyclistic by 0.15%
  • Every year younger a customer is, the odds of being a member decrease by 11%
  • If the customer is using the bike within the typical commute time hours, they are 49% more likely to be a member


Conclusions and Recommendations

The biggest impact on whether someone is a member or not appears to be the variables commute time and birth year. Our recommendation for this company would be to make ads targeting younger people and possibly encouraging them to use this bike system to commute to work and school.

I think this study could be extended and improved by having more information. For example the company can collect more data from the whole year instead of the first quarter. More data points, such as income, could be collected.