This is my first Data Analysis Project. I did it as part of Google Data Analytics course. There are three documents, Prepare and Process Data, Analysis, and Findings and Recommendations.
The hypothetical situation is Divvy want to convert casual riders to annual members. The Marketing Executive want a new marketing strategy to do that.
Before we can formulate the strategy, we need to understand our target market first. We have annual members who purchase annual membership and enjoy member’s benefit. Casual riders who purchase single-ride or full-day passes. In addition, there are potential riders who never use Divvy Bikes before.
The RACE model (Reach, Act, Convert, Engage) will be used as a framework to deliver the recommendation. However, we fore we can do that we need to know about our customers. The main goal of this analysis is to answer the following questions:
How do annual members and casual riders use Divvy bikes differently?
From the behaviors, can we imply who are the casual riders?
“Divvy is the bicycle sharing system in the Chicago metropolitan
area, currently serving the cities of Chicago and Evanston. The system
is owned by the Chicago Department of Transportation and has been
operated by Lyft since 2019.”
Source: Wikipedia
To understand the problem context, I used web search to find as much as information about Divvy Bikes. I also, gathered some general information about Chicago to better understand the city.
Data download from https://divvy-tripdata.s3.amazonaws.com/index.html.
(202310-divvy-tripdata.csv, 202311-divvy-tripdata.csv,
202312-divvy-tripdata.csv, 202401-divvy-tripdata.csv,
202402-divvy-tripdata.csv, 202403-divvy-tripdata.csv,
202404-divvy-tripdata.csv, 202405-divvy-tripdata.csv,
202406-divvy-tripdata.csv, 202407-divvy-tripdata.csv,
202408-divvy-tripdata.csv, 202409-divvy-tripdata.csv,
202410-divvy-tripdata.csv, 202411-divvy-tripdata.csv,
202412-divvy-tripdata.csv, 202501-divvy-tripdata.csv,
202502-divvy-tripdata.csv, 202503-divvy-tripdata.csv,
202504-divvy-tripdata.csv, 202505-divvy-tripdata.csv,
202506-divvy-tripdata.csv, 202507-divvy-tripdata.csv,
202508-divvy-tripdata.csv, 202509-divvy-tripdata.csv,
202510-divvy-tripdata.csv, 202511-divvy-tripdata.csv,
202512-divvy-tripdata.csv)
More information about the data and the Data License Agreement on Divvy Data page on Divvy’s website.
The website mentions that the data has been processed to remove service and maintanance trips as well as trips that were consider false start (less than 60 seconds).
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:11394065 | Length:11394065 | Min. :2023-10-01 00:00:05 | Min. :2023-10-01 00:02:02 | Length:11394065 | Length:11394065 | Length:11394065 | Length:11394065 | Min. :41.64 | Min. :-87.94 | Min. :16.06 | Min. :-144.05 | Length:11394065 | |
| Class :character | Class :character | 1st Qu.:2024-06-02 19:57:52 | 1st Qu.:2024-06-02 20:20:05 | Class :character | Class :character | Class :character | Class :character | 1st Qu.:41.88 | 1st Qu.:-87.66 | 1st Qu.:41.88 | 1st Qu.: -87.66 | Class :character | |
| Mode :character | Mode :character | Median :2024-09-24 23:56:53 | Median :2024-09-25 00:14:10 | Mode :character | Mode :character | Mode :character | Mode :character | Median :41.90 | Median :-87.64 | Median :41.90 | Median : -87.64 | Mode :character | |
| NA | NA | Mean :2024-11-01 11:34:51 | Mean :2024-11-01 11:51:33 | NA | NA | NA | NA | Mean :41.90 | Mean :-87.65 | Mean :41.90 | Mean : -87.65 | NA | |
| NA | NA | 3rd Qu.:2025-06-06 12:34:50 | 3rd Qu.:2025-06-06 12:54:26 | NA | NA | NA | NA | 3rd Qu.:41.93 | 3rd Qu.:-87.63 | 3rd Qu.:41.93 | 3rd Qu.: -87.63 | NA | |
| NA | NA | Max. :2025-09-30 23:57:30 | Max. :2025-09-30 23:59:57 | NA | NA | NA | NA | Max. :42.07 | Max. :-87.52 | Max. :87.96 | Max. : 152.53 | NA | |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :12883 | NA’s :12883 | NA |
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54B0151D028CEF39 | classic_bike | 2024-11-12 19:35:48 | 2024-11-12 19:45:34 | Franklin St & Jackson Blvd | TA1305000025 | LaSalle St & Illinois St | 13430 | 41.87771 | -87.63532 | 41.89076 | -87.63170 | casual |
| 4C8D5876B7CA6317 | classic_bike | 2025-06-13 09:10:16 | 2025-06-13 09:12:37 | Dearborn St & Adams St | CHI00438 | LaSalle St & Jackson Blvd | CHI00447 | 41.87936 | -87.62979 | 41.87817 | -87.63193 | member |
| A57D8045A89C7DC4 | electric_bike | 2024-12-06 14:02:33 | 2024-12-06 14:11:56 | Lincoln Ave & Belle Plaine Ave | TA1309000026 | Clark St & Berwyn Ave | KA1504000146 | 41.95600 | -87.68016 | 41.97803 | -87.66856 | member |
| F881EF3B4CD18D3C | classic_bike | 2024-04-15 15:59:21 | 2024-04-15 16:20:45 | New St & Illinois St | TA1306000013 | Michigan Ave & Pearson St | 13034 | 41.89085 | -87.61862 | 41.89766 | -87.62351 | casual |
| 6DAFEEDF713A6BDF | electric_bike | 2024-12-30 19:20:18 | 2024-12-30 19:41:48 | Clark St & Lake St | KA1503000012 | Logan Blvd & Elston Ave | TA1308000031 | 41.88602 | -87.63088 | 41.92947 | -87.68416 | member |
| 560171AA17AC124E | classic_bike | 2023-10-10 07:37:06 | 2023-10-10 07:50:03 | Clybourn Ave & Division St | TA1307000115 | Halsted St & Fulton St | 23003 | 41.90461 | -87.64055 | 41.88687 | -87.64809 | member |
| Rider Type | No of Rides |
|---|---|
| member | 7,267,014 |
| casual | 4,127,051 |
| Total | 11,394,065 |
The number of rides by member riders were significant higher than casual riders. The ratio of casual:member was about 1:1.76.
There were three type of vehicle avaliable for rent, classic bicycle, electric bicycle, and electric scooter.
| Bike Type | No of Rides |
|---|---|
| electric_bike | 6,387,197 |
| classic_bike | 4,862,531 |
| electric_scooter | 144,337 |
| Total | 11,394,065 |
The number of rides for electric scooter seem to be very low compared to other types. The daily used plot show that elecric scooter has data only from August 31, 2024 to September 30, 2024.
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:144337 | Length:144337 | Min. :2024-08-31 21:07:36 | Min. :2024-09-01 00:00:05 | Length:144337 | Length:144337 | Length:144337 | Length:144337 | Min. :41.74 | Min. :-87.81 | Min. :41.72 | Min. :-87.81 | Length:144337 | |
| Class :character | Class :character | 1st Qu.:2024-09-08 11:39:37 | 1st Qu.:2024-09-08 11:47:59 | Class :character | Class :character | Class :character | Class :character | 1st Qu.:41.88 | 1st Qu.:-87.66 | 1st Qu.:41.88 | 1st Qu.:-87.66 | Class :character | |
| Mode :character | Mode :character | Median :2024-09-15 12:45:19 | Median :2024-09-15 12:56:13 | Mode :character | Mode :character | Mode :character | Mode :character | Median :41.89 | Median :-87.64 | Median :41.89 | Median :-87.64 | Mode :character | |
| NA | NA | Mean :2024-09-15 15:45:26 | Mean :2024-09-15 15:55:51 | NA | NA | NA | NA | Mean :41.90 | Mean :-87.65 | Mean :41.90 | Mean :-87.65 | NA | |
| NA | NA | 3rd Qu.:2024-09-22 11:03:11 | 3rd Qu.:2024-09-22 11:09:26 | NA | NA | NA | NA | 3rd Qu.:41.91 | 3rd Qu.:-87.63 | 3rd Qu.:41.91 | 3rd Qu.:-87.63 | NA | |
| NA | NA | Max. :2024-09-30 23:54:05 | Max. :2024-09-30 23:57:13 | NA | NA | NA | NA | Max. :42.02 | Max. :-87.55 | Max. :42.02 | Max. :-87.55 | NA |
There are missing data on 6 variables out of total 13 variables. The station name and id for both start and end have almost 20% of missing. The end latitude and longtitude have about 0.11% missing data.
| Missing | Percent | |
|---|---|---|
| ride_id | 0 | 0.00% |
| rideable_type | 0 | 0.00% |
| started_at | 0 | 0.00% |
| ended_at | 0 | 0.00% |
| start_station_name | 2,185,564 | 19.18% |
| start_station_id | 2,185,564 | 19.18% |
| end_station_name | 2,267,627 | 19.90% |
| end_station_id | 2,267,627 | 19.90% |
| start_lat | 0 | 0.00% |
| start_lng | 0 | 0.00% |
| end_lat | 12,883 | 0.11% |
| end_lng | 12,883 | 0.11% |
| member_casual | 0 | 0.00% |
The percentages of missing start_station_name are about the same for member rider (~19%) and casual rider (~20%).
| Rider Type | Total | Percent |
|---|---|---|
| casual | 819,247 | 19.85% |
| member | 1,366,317 | 18.80% |
The missing start_station_name almost never happen on classic bike (less than 0.001%). However, they happen a lot more on electric bike (33%) and electric scooter (47%).
| Bike Type | Total | Percent |
|---|---|---|
| classic_bike | 34 | 0.00% |
| electric_bike | 2,117,881 | 33.16% |
| electric_scooter | 67,649 | 46.87% |
The table below show than all missing start_station_id are missing togather with start_station_name. For the end_station_name and end_station_id, around 47% missing togather with start_station_name. None of end_lat and end_long missing togather with start_station_name.
| Missing | |
|---|---|
| ride_id | 0 |
| rideable_type | 0 |
| started_at | 0 |
| ended_at | 0 |
| start_station_name | 2,185,564 |
| start_station_id | 2,185,564 |
| end_station_name | 1,057,366 |
| end_station_id | 1,057,366 |
| start_lat | 0 |
| start_lng | 0 |
| end_lat | 0 |
| end_lng | 0 |
| member_casual | 0 |
The percentages of missing end_station_name are about the same for member rider (19%) and casual rider (22%).
| Rider Type | Total | Percent |
|---|---|---|
| casual | 913,966 | 22.15% |
| member | 1,353,661 | 18.63% |
The missing end_station_name are very low for classic bike (0.28%). However, they happen a lot more on electric bike (34%) and electric scooter (49%).
| Bike Type | Total | Percent |
|---|---|---|
| classic_bike | 13,467 | 0.28% |
| electric_bike | 2,183,820 | 34.19% |
| electric_scooter | 70,340 | 48.73% |
All missing end_station_id are missing togather with end_station_name. This also comfirm that the end_station_name and start_station_name share around 47% missing data. All of end_lat and end_long missing when the end_station_name missing.
| Missing | |
|---|---|
| ride_id | 0 |
| rideable_type | 0 |
| started_at | 0 |
| ended_at | 0 |
| start_station_name | 1,057,366 |
| start_station_id | 1,057,366 |
| end_station_name | 2,267,627 |
| end_station_id | 2,267,627 |
| start_lat | 0 |
| start_lng | 0 |
| end_lat | 12,883 |
| end_lng | 12,883 |
| member_casual | 0 |
The percentages of missing end_station_name are low for both member rider (~0.03%) and casual rider (~0.26%).
| Rider Type | Total | Percent |
|---|---|---|
| casual | 10,614 | 0.26% |
| member | 2,269 | 0.03% |
Missing end_lat happen exclusively on classic bike (0.26%)
| Bike Type | Total | Percent |
|---|---|---|
| classic_bike | 12,883 | 0.26% |
Yes. The station name and station id can be match with longitude and latitude from other rows. However, longtitude and latitude data is inconsistant as show in the next section. The station name and id also have the same problem. The best way is to check with Devvy Data page. It has the station infomation that show station name, id, longitude and latitude for every station in the system. The current number of station and public rack on Divvy Data is 1,906 stations (November 11, 2025).
However, the missing station name and id will not be filled here, as they will not be used in the analysis to answer the main question “How do annual members and casual riders use Divvy bikes differently?”.
| Column | No. of Unique Value |
|---|---|
| ride_id | 11,393,854 |
| rideable_type | 3 |
| started_at | 11,043,801 |
| ended_at | 11,048,022 |
| start_station_name | 1,989 |
| start_station_id | 3,521 |
| end_station_name | 1,993 |
| end_station_id | 3,534 |
| start_lat | 690,494 |
| start_lng | 662,456 |
| end_lat | 3,066 |
| end_lng | 3,086 |
| member_casual | 2 |
The ride_id should be unique. However, the number of unique values for ride_id is less than the nuber of records. This indicates there is duplicate data. Upon inspect other column of this duplicated ids, all value are the same for each pair of duplcated rows. The reason that the simple duplicate fuction cannot detect these duplication is the date formats of duplication pair are different. The duplcated rows will be removed before analysis.
rideable_type correctly has 3 unique values.
NUmber of unique values do not indicate anything for time scale. However, the different between started_at and ended_at (riding time) have some problems. This issue will be investigated later.
member_casual correctly has 2 unique values.
start_station_name (1,989) and end_station_name (1,993) have unique values a little higher than actual number of stations (1,906). This may happen because some relocation of stations and inconsistant data entry.
start_station_id (3,521) and end_station_id (3,534) have unique values almost double actual number of stations (1,906). After closely inspect the unique value, it shows that the station_id system were change overtime result in inconsistant station_id.
start_Lat and start_lng should have unique value about the same as actual number of stations. However, start_lat and start_lng have large number of unique value. These happen because both columns do not have consistant format. The decimal points of the value are vary from 0 to 16 decimal points. However, these will not create any problems for the analysis.
The approximate boundary of Divvy service area are lat 41.644519 to 42.071768 and long -87.683825 to -87.86225. The maximum values and minimum values for both start_lat (41.64, 42.07) and start_lng (-87.94,-87.52) are with in Divvy dervice area.
end_Lat and end_lng should have unique value about the same as actual number of stations. The end_lat and end_lng has the same problem as the start counterparts. These happen because they do not have consistant format. The decimal points of the value are vary from 0 to 16 decimal points. However, these will not create any problems for the analysis.
The maximum values and minimum values for both end_lat (16.06, 87.96) and end_lng (-144.05, 152.53) are outside Divvy dervice area. The trips that out of bound values will be removed when do location based analysis.
The ride duration is calculated from interval between started_at and ended_at.
| Rider Type | min | max | mean | IQR | sd | Total |
|---|---|---|---|---|---|---|
| casual | -16,656.52 | 1,574.9 | 24.09 | 15.52 | 87.39 | 4,127,051 |
| member | -16,635.22 | 1,559.8 | 12.50 | 9.57 | 34.76 | 7,267,014 |
| Total | -16,656.52 | 1,574.9 | 16.70 | 11.43 | 59.73 | 11,394,065 |
The ride duration show that there were some trip that had negative duration. The ended_at was less than started_at. This should mean there are errors in data.
The Divvy Data website indicates that they consider any rides that shorter than 1 miniute to be false start. The total number of rides that have duration less that 1 minute is 272,265 (2.39% of all data). These records will be filter out.
After rides that less than a minute (137,427 casual and 134,838 member) wwere removed, the mean and standard deviation change just a little.
| Rider Type | min | max | mean | IQR | sd | Total |
|---|---|---|---|---|---|---|
| casual | 1 | 1,574.9 | 24.92 | 15.64 | 87.84 | 3,989,624 |
| member | 1 | 1,559.8 | 12.74 | 9.56 | 33.68 | 7,132,176 |
| Total | 1 | 1,574.9 | 17.11 | 11.45 | 59.41 | 11,121,800 |
At the higher end, there are a lot of outliner. There are some ridings that took longer than a day. The maximum ride duration is 1,574.9 minute (1 day 2 hours 15 minutes). The distribution of riding time is very skew to the right with very long tail.
The IQR method (upper_bound = Q3 + 1.5IQR) has a cutting point at 34.35 minute. This will remove 7.45% of Data. However, it seem has negative bias toward casual rider as it remove 13.87% from casual compare to just 3.86 from member.
| Quantile | Minute |
|---|---|
| 0% | 1.000000 |
| 25% | 5.729783 |
| 50% | 9.805817 |
| 75% | 17.177688 |
| 100% | 1574.900183 |
| Rider Type | min | max | mean | IQR | sd | Total |
|---|---|---|---|---|---|---|
| casual | 1 | 34.35 | 12.47 | 10.69 | 7.77 | 3,436,283 |
| member | 1 | 34.35 | 10.37 | 8.69 | 6.98 | 6,856,714 |
| Total | 1 | 34.35 | 11.07 | 9.43 | 7.32 | 10,292,997 |
According to Divvy policy, classic bike that not return by a day is consider stolen. Another point to consider is the day pass has maximum 3 hour of usage per 24 hours. The 1 day cutting point seem to be too high. Therefore the 3 hours cutting point for outliner will be used. This will remove 0.33% of data (0.69% from casual and 0.14% from member).
| Rider Type | min | max | mean | IQR | sd | Total |
|---|---|---|---|---|---|---|
| casual | 1 | 180.00 | 19.04 | 15.32 | 21.12 | 3,962,088 |
| member | 1 | 179.97 | 11.79 | 9.53 | 10.68 | 7,122,043 |
| Total | 1 | 180.00 | 14.38 | 11.34 | 15.64 | 11,084,131 |
The following data will be removed.
1. Record with duplicate ride_id.
2. Record with riding time less than 1 minute.
3. Record with riding time more than 3 hours.
Final data has 11,083,982 records (97.28% of original data).
New columns will be added to the Data to facilitate data analysis:
Season - Spring, Summer, Fall and Winter computed from started_at.
Day of Week (wday) Sunday, Monday, Tuesday, Wednesday, Thursday,
Friday and Saturday. This column is computed from started_at.
Riding Time in minute (length_min) computed from interval between started_at and ended_at.
The following columns will be removed because they will not be used:
The folowing columns will be transform to new data type:
| ride_id | started_at | ended_at | member_casual | rideable_type | start_lat | start_lng | end_lat | end_lng | start_station_name | end_station_name | season | wday | length_min | week_day | morning_afternoon | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:11083982 | Min. :2023-10-01 00:00:05 | Min. :2023-10-01 00:02:02 | casual:3962010 | classic_bike :4808871 | Min. :41.64 | Min. :-87.91 | Min. :16.06 | Min. :-144.05 | Length:11083982 | Length:11083982 | Spring:2901652 | Sun:1469951 | Min. : 1.000 | weekday:7912152 | morning :3307662 | |
| Class :character | 1st Qu.:2024-06-02 18:38:44 | 1st Qu.:2024-06-02 18:57:36 | member:7121972 | electric_bike :6137575 | 1st Qu.:41.88 | 1st Qu.:-87.66 | 1st Qu.:41.88 | 1st Qu.: -87.66 | Class :character | Class :character | Summer:4563105 | Mon:1498388 | 1st Qu.: 5.717 | weekend:3171830 | afternoon:7776320 | |
| Mode :character | Median :2024-09-24 20:55:02 | Median :2024-09-24 21:04:22 | NA | electric_scooter: 137536 | Median :41.90 | Median :-87.64 | Median :41.90 | Median : -87.64 | Mode :character | Mode :character | Fall :2531749 | Tue:1576856 | Median : 9.772 | NA | NA | |
| NA | Mean :2024-11-01 03:15:12 | Mean :2024-11-01 03:29:35 | NA | NA | Mean :41.90 | Mean :-87.65 | Mean :41.90 | Mean : -87.65 | NA | NA | Winter:1087476 | Wed:1605352 | Mean : 14.383 | NA | NA | |
| NA | 3rd Qu.:2025-06-05 17:33:38 | 3rd Qu.:2025-06-05 17:48:10 | NA | NA | 3rd Qu.:41.93 | 3rd Qu.:-87.63 | 3rd Qu.:41.93 | 3rd Qu.: -87.63 | NA | NA | NA | Thu:1619773 | 3rd Qu.: 17.060 | NA | NA | |
| NA | Max. :2025-09-30 23:57:30 | Max. :2025-09-30 23:59:57 | NA | NA | Max. :42.07 | Max. :-87.52 | Max. :87.96 | Max. : 152.53 | NA | NA | NA | Fri:1611783 | Max. :180.000 | NA | NA | |
| NA | NA | NA | NA | NA | NA | NA | NA’s :53 | NA’s :53 | NA | NA | NA | Sat:1701879 | NA | NA | NA |
| ride_id | started_at | ended_at | member_casual | rideable_type | start_lat | start_lng | end_lat | end_lng | start_station_name | end_station_name | season | wday | length_min | week_day | morning_afternoon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A6C0954494524AC5 | 2024-08-17 18:33:00 | 2024-08-17 19:31:22 | casual | classic_bike | 41.90940 | -87.67769 | 41.94707 | -87.74001 | Damen Ave & Pierce Ave | Kilbourn Ave & Milwaukee Ave | Summer | Sat | 58.372983 | weekend | afternoon |
| 062932ADBEF9EB9F | 2023-10-06 10:28:31 | 2023-10-06 10:33:09 | member | classic_bike | 41.93625 | -87.65266 | 41.92955 | -87.64312 | Sheffield Ave & Wellington Ave | Clark St & Wrightwood Ave | Fall | Fri | 4.633333 | weekday | morning |
| 598FA2D89B4F197D | 2025-09-23 13:25:08 | 2025-09-23 13:41:10 | member | electric_bike | 41.71971 | -87.64302 | 41.72554 | -87.68197 | Halsted St & 96th St | Public Rack - Western Ave & 92nd Pl | Fall | Tue | 16.025717 | weekday | afternoon |
| D3D4564815380968 | 2024-11-25 08:42:27 | 2024-11-25 08:58:21 | casual | classic_bike | 41.80241 | -87.58692 | 41.78938 | -87.59648 | Cornell Ave & Hyde Park Blvd | Woodlawn Ave & 58th St | Fall | Mon | 15.902483 | weekday | morning |
| 6913F4A83D0B8A67 | 2025-06-24 16:13:41 | 2025-06-24 16:23:34 | member | classic_bike | 41.87926 | -87.63990 | 41.89391 | -87.64174 | Canal St & Adams St | Kingsbury St & Erie St | Summer | Tue | 9.892450 | weekday | afternoon |
| 2393928A587E74B9 | 2025-08-09 12:05:45 | 2025-08-09 12:18:49 | casual | electric_bike | 41.92676 | -87.63443 | 41.91468 | -87.64332 | Cannon Dr & Fullerton Ave | Larrabee St & Menomonee St | Summer | Sat | 13.067050 | weekend | afternoon |
| Rider Type | min | max | mean | IQR | sd | Total |
|---|---|---|---|---|---|---|
| casual | 1 | 180.00 | 19.04 | 15.32 | 21.12 | 3,962,010 |
| member | 1 | 179.97 | 11.79 | 9.53 | 10.68 | 7,121,972 |
| Total | 1 | 180.00 | 14.38 | 11.34 | 15.64 | 11,083,982 |