In this paper, we will be analyzing made-up data from a fictional company called Cyclistic. This is a bike sharing company in Chicago, Illinois. This company offers a subscription/membership, but it also offers bike sharing for casual customers. These two groups of people will be referred to as “casual riders” and “members”.
This company’s goal is to maximize the number of annual memberships. Therefore, the research question we will be attempting to answer is: how can Cyclistic maximize the number of memberships? To answer this, we need to know how casual riders and members differ from one another. Using this data, we will then be able to come up with business strategies and marketing tactics for the company.
We are given two sets of cross-sectional data. One is from the first quarter of 2019 and the second is from the first quarter of 2020. Both of these datasets share variables such as starting time, ending time, start station name, end station name, and type of customer (i.e. member or casual). However, only the first dataset includes variables such as gender and birth year, and only the second dataset includes the latitude and longitude of the start and end stations.
Here is a quick look at the 2019 and 2020 data, respectively:
Here are all the column names in each dataset:
Both datasets have similar data, but ‘trips_2020’ has latitude and longitude of the stations, and ‘trips_2019’ has gender and birth year data. Both contain data on the first quarter on each respective year
To clean the data, we check for outliers, misspellings, missing values, and inconsistencies in variable formats.
As we can see in the graph below, there are many trips that are lasting longer than 24 hours long. These are mistakes in the data.
Though messy due to all the data, the points in red on this plot show all the bike rides in the dataset that were over 24 hours long. Therefore, we are getting rid of all the data points that are above the third quartile.
Q3 <- quantile(trips_2019$tripduration,0.75,na.rm=TRUE)
trips_2019 <- trips_2019 %>%
filter(tripduration <= Q3)
Now we are going to do the same thing with the second dataset, but first we need to add a trip duration variable:
trips_2020 <- trips_2020 %>%
mutate(
start_time_dt = as.POSIXct(started_at),
end_time_dt = as.POSIXct(ended_at),
tripduration = as.numeric(difftime(end_time_dt, start_time_dt, units = "secs")),
tripduration_hrs = tripduration/60/60,
start_time_only = format(start_time_dt, "%H:%M:%S"),
end_time_only = format(end_time_dt, "%H:%M:%S"))
Just as we did with the first dataset, we will filter out all the values that are above the third quartile.
Q3 <- quantile(trips_2020$tripduration,0.75,na.rm=TRUE)
trips_2020 <- trips_2020 %>%
filter(tripduration <= Q3)
To join these datasets, we will need to change some variables names:
trips_2020 <- trips_2020 %>%
rename(
trip_id = ride_id,
start_time = started_at,
end_time = ended_at,
usertype = member_casual,
from_station_name = start_station_name,
from_station_id=start_station_id,
to_station_name = end_station_name,
to_station_id = end_station_id)
Now we can join the two datasets together into a new one:
combined_trips <- bind_rows(trips_2019, trips_2020)
Since many of the station names are the same across the two datasets, we can use the latitude and longitude from the 2020 data and fill in many of the blanks of the 2019 data.
combined_trips <- combined_trips %>%
group_by(from_station_name) %>%
mutate(
start_lat = first(na.omit(start_lat)),
start_lng = first(na.omit(start_lng))
) %>%
ungroup() %>%
group_by(to_station_name) %>%
mutate(
end_lat = first(na.omit(end_lat)),
end_lng = first(na.omit(end_lng))
) %>%
ungroup()
We need to change the variable names in the second dataset to match those in the first dataset.
combined_trips <- combined_trips %>%
mutate(usertype = recode(usertype, "Customer" = "casual", "Subscriber" = "member"))
In this section, we will be performing regressions using the logit model. The dependent variable will be ‘usertype’. We are going to make usertype into a dummy variable.
combined_trips <- combined_trips %>%
mutate(usertype_dummy=ifelse(usertype=="member",1,0))
Here is a map of the coordinates of where each person first picked up the rental bike. Casual customers are shows in blue and subscribers in red.
It shows that there may be a relationship between starting location and usertype. However, when we divide this map in half horizontally, we find that the percentage of trips started in the top half and bottom half are roughly the same across the two groups.
prop.table(table(combined_trips$start_zone, combined_trips$usertype), margin = 2)
##
## casual member
## 0 0.2347928 0.2427287
## 1 0.7652072 0.7572713
After this investigation, we concluded that starting zone has little to no effect on usertype and should not be included in the regression.
We also suspect that people who are members of this bike sharing system are commuting to work by bike. So we are creating a dummy variable that is equal to 1 when it is around the typical time to commute to and from work, and it is equal to 0 otherwise.
combined_trips <- combined_trips %>%
mutate(start_hour = hour(start_time),
commute_time = ifelse((start_hour >= 6 & start_hour < 8) |
(start_hour >= 17 & start_hour < 19),
1, 0))
The commute hours are chosen roughly based on the graph showing how many departures there are per hour across the data:
## Saving 7 x 5 in image
Here is the model we will be running. The independent variables include trip duration, birth year, commute time, and starting location. I am using robust standard errors to control for heteroscedasticity.
combined_model <- glm(usertype_dummy ~ tripduration + birthyear + commute_time, family = binomial,
data = combined_trips)
robust_se <- coeftest(combined_model, vcov = sandwich)
robust_se
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0785e+02 8.6513e+00 24.026 < 2.2e-16 ***
## tripduration -2.1355e-03 1.1045e-04 -19.334 < 2.2e-16 ***
## birthyear -1.0178e-01 4.3484e-03 -23.407 < 2.2e-16 ***
## commute_time 3.1878e-01 5.0737e-02 6.283 3.321e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Term | Estimate | Robust.SE | Pr…z.. | Odds.Ratio | |
|---|---|---|---|---|---|
| (Intercept) | (Intercept) | 207.854 | 8.651 | 0 | 1.861963e+90 |
| tripduration | tripduration | -0.002 | 0.000 | 0 | 9.980000e-01 |
| birthyear | birthyear | -0.102 | 0.004 | 0 | 9.030000e-01 |
| commute_time | commute_time | 0.319 | 0.051 | 0 | 1.375000e+00 |
The biggest impact on whether someone is a member or not appears to be the variables commute time and birth year. Our recommendation for this company would be to make ads targeting younger people and possibly encouraging them to use this bike system to commute to work and school.
I think this study could be extended and improved by having more information. For example the company can collect more data from the whole year instead of the first quarter. More data points, such as income, could be collected.