This is my first Data Analysis Project. I did it as part of Google Data Analytics course. There are three documents, Prepare and Process Data, Analysis, and Findings and Recommendations.

Business Task

The hypothetical situation is Divvy want to convert casual riders to annual members. The Marketing Executive want a new marketing strategy to do that.

Before we can formulate the strategy, we need to understand our target market first. We have annual members who purchase annual membership and enjoy member’s benefit. Casual riders who purchase single-ride or full-day passes. In addition, there are potential riders who never use Divvy Bikes before.

Marketing Strategy

The RACE model (Reach, Act, Convert, Engage) will be used as a framework to deliver the recommendation. However, we fore we can do that we need to know about our customers. The main goal of this analysis is to answer the following questions:

  • How do annual members and casual riders use Divvy bikes differently?

  • From the behaviors, can we imply who are the casual riders?

Divvy Bikes

“Divvy is the bicycle sharing system in the Chicago metropolitan area, currently serving the cities of Chicago and Evanston. The system is owned by the Chicago Department of Transportation and has been operated by Lyft since 2019.”
Source: Wikipedia

To understand the problem context, I used web search to find as much as information about Divvy Bikes. I also, gathered some general information about Chicago to better understand the city.

Divvy Bike Trip Data

October 2023 to September 2025

Data download from https://divvy-tripdata.s3.amazonaws.com/index.html.
(202310-divvy-tripdata.csv, 202311-divvy-tripdata.csv, 202312-divvy-tripdata.csv, 202401-divvy-tripdata.csv, 202402-divvy-tripdata.csv, 202403-divvy-tripdata.csv, 202404-divvy-tripdata.csv, 202405-divvy-tripdata.csv, 202406-divvy-tripdata.csv, 202407-divvy-tripdata.csv, 202408-divvy-tripdata.csv, 202409-divvy-tripdata.csv, 202410-divvy-tripdata.csv, 202411-divvy-tripdata.csv, 202412-divvy-tripdata.csv, 202501-divvy-tripdata.csv, 202502-divvy-tripdata.csv, 202503-divvy-tripdata.csv, 202504-divvy-tripdata.csv, 202505-divvy-tripdata.csv, 202506-divvy-tripdata.csv, 202507-divvy-tripdata.csv, 202508-divvy-tripdata.csv, 202509-divvy-tripdata.csv, 202510-divvy-tripdata.csv, 202511-divvy-tripdata.csv, 202512-divvy-tripdata.csv)

More information about the data and the Data License Agreement on Divvy Data page on Divvy’s website.

The website mentions that the data has been processed to remove service and maintanance trips as well as trips that were consider false start (less than 60 seconds).

How Does the Data Look Like?

ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
Length:11394065 Length:11394065 Min. :2023-10-01 00:00:05 Min. :2023-10-01 00:02:02 Length:11394065 Length:11394065 Length:11394065 Length:11394065 Min. :41.64 Min. :-87.94 Min. :16.06 Min. :-144.05 Length:11394065
Class :character Class :character 1st Qu.:2024-06-02 19:57:52 1st Qu.:2024-06-02 20:20:05 Class :character Class :character Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.: -87.66 Class :character
Mode :character Mode :character Median :2024-09-24 23:56:53 Median :2024-09-25 00:14:10 Mode :character Mode :character Mode :character Mode :character Median :41.90 Median :-87.64 Median :41.90 Median : -87.64 Mode :character
NA NA Mean :2024-11-01 11:34:51 Mean :2024-11-01 11:51:33 NA NA NA NA Mean :41.90 Mean :-87.65 Mean :41.90 Mean : -87.65 NA
NA NA 3rd Qu.:2025-06-06 12:34:50 3rd Qu.:2025-06-06 12:54:26 NA NA NA NA 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.: -87.63 NA
NA NA Max. :2025-09-30 23:57:30 Max. :2025-09-30 23:59:57 NA NA NA NA Max. :42.07 Max. :-87.52 Max. :87.96 Max. : 152.53 NA
NA NA NA NA NA NA NA NA NA NA NA’s :12883 NA’s :12883 NA
Random Sample of Raw Data
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
54B0151D028CEF39 classic_bike 2024-11-12 19:35:48 2024-11-12 19:45:34 Franklin St & Jackson Blvd TA1305000025 LaSalle St & Illinois St 13430 41.87771 -87.63532 41.89076 -87.63170 casual
4C8D5876B7CA6317 classic_bike 2025-06-13 09:10:16 2025-06-13 09:12:37 Dearborn St & Adams St CHI00438 LaSalle St & Jackson Blvd CHI00447 41.87936 -87.62979 41.87817 -87.63193 member
A57D8045A89C7DC4 electric_bike 2024-12-06 14:02:33 2024-12-06 14:11:56 Lincoln Ave & Belle Plaine Ave TA1309000026 Clark St & Berwyn Ave KA1504000146 41.95600 -87.68016 41.97803 -87.66856 member
F881EF3B4CD18D3C classic_bike 2024-04-15 15:59:21 2024-04-15 16:20:45 New St & Illinois St TA1306000013 Michigan Ave & Pearson St 13034 41.89085 -87.61862 41.89766 -87.62351 casual
6DAFEEDF713A6BDF electric_bike 2024-12-30 19:20:18 2024-12-30 19:41:48 Clark St & Lake St KA1503000012 Logan Blvd & Elston Ave TA1308000031 41.88602 -87.63088 41.92947 -87.68416 member
560171AA17AC124E classic_bike 2023-10-10 07:37:06 2023-10-10 07:50:03 Clybourn Ave & Division St TA1307000115 Halsted St & Fulton St 23003 41.90461 -87.64055 41.88687 -87.64809 member

Total Number of Trips by Rider Type

Total Number of Rides by Each Rider Type
Rider Type No of Rides
member 7,267,014
casual 4,127,051
Total 11,394,065

The number of rides by member riders were significant higher than casual riders. The ratio of casual:member was about 1:1.76.

Bicycle Type

There were three type of vehicle avaliable for rent, classic bicycle, electric bicycle, and electric scooter.

Total Number of Rides for Each Bike Type
Bike Type No of Rides
electric_bike 6,387,197
classic_bike 4,862,531
electric_scooter 144,337
Total 11,394,065

The number of rides for electric scooter seem to be very low compared to other types. The daily used plot show that elecric scooter has data only from August 31, 2024 to September 30, 2024.

ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
Length:144337 Length:144337 Min. :2024-08-31 21:07:36 Min. :2024-09-01 00:00:05 Length:144337 Length:144337 Length:144337 Length:144337 Min. :41.74 Min. :-87.81 Min. :41.72 Min. :-87.81 Length:144337
Class :character Class :character 1st Qu.:2024-09-08 11:39:37 1st Qu.:2024-09-08 11:47:59 Class :character Class :character Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66 Class :character
Mode :character Mode :character Median :2024-09-15 12:45:19 Median :2024-09-15 12:56:13 Mode :character Mode :character Mode :character Mode :character Median :41.89 Median :-87.64 Median :41.89 Median :-87.64 Mode :character
NA NA Mean :2024-09-15 15:45:26 Mean :2024-09-15 15:55:51 NA NA NA NA Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65 NA
NA NA 3rd Qu.:2024-09-22 11:03:11 3rd Qu.:2024-09-22 11:09:26 NA NA NA NA 3rd Qu.:41.91 3rd Qu.:-87.63 3rd Qu.:41.91 3rd Qu.:-87.63 NA
NA NA Max. :2024-09-30 23:54:05 Max. :2024-09-30 23:57:13 NA NA NA NA Max. :42.02 Max. :-87.55 Max. :42.02 Max. :-87.55 NA

Any Missing Value?

There are missing data on 6 variables out of total 13 variables. The station name and id for both start and end have almost 20% of missing. The end latitude and longtitude have about 0.11% missing data.

Number of Missing Data on Each Column
Missing Percent
ride_id 0 0.00%
rideable_type 0 0.00%
started_at 0 0.00%
ended_at 0 0.00%
start_station_name 2,185,564 19.18%
start_station_id 2,185,564 19.18%
end_station_name 2,267,627 19.90%
end_station_id 2,267,627 19.90%
start_lat 0 0.00%
start_lng 0 0.00%
end_lat 12,883 0.11%
end_lng 12,883 0.11%
member_casual 0 0.00%

Missing Start Station Name

The percentages of missing start_station_name are about the same for member rider (~19%) and casual rider (~20%).

Compare Bike Type Data That Has Missing Start Station Name
Rider Type Total Percent
casual 819,247 19.85%
member 1,366,317 18.80%

The missing start_station_name almost never happen on classic bike (less than 0.001%). However, they happen a lot more on electric bike (33%) and electric scooter (47%).

Compare Bike Type Data That Has Missing Start Station Name
Bike Type Total Percent
classic_bike 34 0.00%
electric_bike 2,117,881 33.16%
electric_scooter 67,649 46.87%

Missing Start Station Name and Other Missing Values

The table below show than all missing start_station_id are missing togather with start_station_name. For the end_station_name and end_station_id, around 47% missing togather with start_station_name. None of end_lat and end_long missing togather with start_station_name.

Missing Data: When Start Station Name is Missing
Missing
ride_id 0
rideable_type 0
started_at 0
ended_at 0
start_station_name 2,185,564
start_station_id 2,185,564
end_station_name 1,057,366
end_station_id 1,057,366
start_lat 0
start_lng 0
end_lat 0
end_lng 0
member_casual 0

Missing End Station Name

The percentages of missing end_station_name are about the same for member rider (19%) and casual rider (22%).

Compare Bike Type Data That Has Missing End Station Name
Rider Type Total Percent
casual 913,966 22.15%
member 1,353,661 18.63%

The missing end_station_name are very low for classic bike (0.28%). However, they happen a lot more on electric bike (34%) and electric scooter (49%).

Compare Bike Type Data That Has Missing Start Station Name
Bike Type Total Percent
classic_bike 13,467 0.28%
electric_bike 2,183,820 34.19%
electric_scooter 70,340 48.73%

All missing end_station_id are missing togather with end_station_name. This also comfirm that the end_station_name and start_station_name share around 47% missing data. All of end_lat and end_long missing when the end_station_name missing.

Missing Data: When End Station Name is Missing
Missing
ride_id 0
rideable_type 0
started_at 0
ended_at 0
start_station_name 1,057,366
start_station_id 1,057,366
end_station_name 2,267,627
end_station_id 2,267,627
start_lat 0
start_lng 0
end_lat 12,883
end_lng 12,883
member_casual 0

Missing End Lat and End Lng

The percentages of missing end_station_name are low for both member rider (~0.03%) and casual rider (~0.26%).

Compare Bike Type Data That Has Missing End Lat
Rider Type Total Percent
casual 10,614 0.26%
member 2,269 0.03%

Missing end_lat happen exclusively on classic bike (0.26%)

Compare Bike Type Data That Has Missing End Lat
Bike Type Total Percent
classic_bike 12,883 0.26%

Can We Fill Missing Data?

Yes. The station name and station id can be match with longitude and latitude from other rows. However, longtitude and latitude data is inconsistant as show in the next section. The station name and id also have the same problem. The best way is to check with Devvy Data page. It has the station infomation that show station name, id, longitude and latitude for every station in the system. The current number of station and public rack on Divvy Data is 1,906 stations (November 11, 2025).

However, the missing station name and id will not be filled here, as they will not be used in the analysis to answer the main question “How do annual members and casual riders use Divvy bikes differently?”.

Quality of Data

Column No. of Unique Value
ride_id 11,393,854
rideable_type 3
started_at 11,043,801
ended_at 11,048,022
start_station_name 1,989
start_station_id 3,521
end_station_name 1,993
end_station_id 3,534
start_lat 690,494
start_lng 662,456
end_lat 3,066
end_lng 3,086
member_casual 2

ride_id

The ride_id should be unique. However, the number of unique values for ride_id is less than the nuber of records. This indicates there is duplicate data. Upon inspect other column of this duplicated ids, all value are the same for each pair of duplcated rows. The reason that the simple duplicate fuction cannot detect these duplication is the date formats of duplication pair are different. The duplcated rows will be removed before analysis.

rideable_type

rideable_type correctly has 3 unique values.

started_at and ended_at

NUmber of unique values do not indicate anything for time scale. However, the different between started_at and ended_at (riding time) have some problems. This issue will be investigated later.

member_casual

member_casual correctly has 2 unique values.

start_station_name and end_station_name

start_station_name (1,989) and end_station_name (1,993) have unique values a little higher than actual number of stations (1,906). This may happen because some relocation of stations and inconsistant data entry.

start_station_id and end_station_id

start_station_id (3,521) and end_station_id (3,534) have unique values almost double actual number of stations (1,906). After closely inspect the unique value, it shows that the station_id system were change overtime result in inconsistant station_id.

start_lat and start_lng

start_Lat and start_lng should have unique value about the same as actual number of stations. However, start_lat and start_lng have large number of unique value. These happen because both columns do not have consistant format. The decimal points of the value are vary from 0 to 16 decimal points. However, these will not create any problems for the analysis.

The approximate boundary of Divvy service area are lat 41.644519 to 42.071768 and long -87.683825 to -87.86225. The maximum values and minimum values for both start_lat (41.64, 42.07) and start_lng (-87.94,-87.52) are with in Divvy dervice area.

end_lat and end_lng

end_Lat and end_lng should have unique value about the same as actual number of stations. The end_lat and end_lng has the same problem as the start counterparts. These happen because they do not have consistant format. The decimal points of the value are vary from 0 to 16 decimal points. However, these will not create any problems for the analysis.

The maximum values and minimum values for both end_lat (16.06, 87.96) and end_lng (-144.05, 152.53) are outside Divvy dervice area. The trips that out of bound values will be removed when do location based analysis.

Ride Duration

All Data

The ride duration is calculated from interval between started_at and ended_at.

Ride Duration (miniute) by Each Rider Type
Rider Type min max mean IQR sd Total
casual -16,656.52 1,574.9 24.09 15.52 87.39 4,127,051
member -16,635.22 1,559.8 12.50 9.57 34.76 7,267,014
Total -16,656.52 1,574.9 16.70 11.43 59.73 11,394,065

The ride duration show that there were some trip that had negative duration. The ended_at was less than started_at. This should mean there are errors in data.

Removed False Start (Ride Shorter Than 1 Minute)

The Divvy Data website indicates that they consider any rides that shorter than 1 miniute to be false start. The total number of rides that have duration less that 1 minute is 272,265 (2.39% of all data). These records will be filter out.

After rides that less than a minute (137,427 casual and 134,838 member) wwere removed, the mean and standard deviation change just a little.

Ride Duration Statistic After Removed False Start
Rider Type min max mean IQR sd Total
casual 1 1,574.9 24.92 15.64 87.84 3,989,624
member 1 1,559.8 12.74 9.56 33.68 7,132,176
Total 1 1,574.9 17.11 11.45 59.41 11,121,800

Outliner Data

At the higher end, there are a lot of outliner. There are some ridings that took longer than a day. The maximum ride duration is 1,574.9 minute (1 day 2 hours 15 minutes). The distribution of riding time is very skew to the right with very long tail.

The IQR method (upper_bound = Q3 + 1.5IQR) has a cutting point at 34.35 minute. This will remove 7.45% of Data. However, it seem has negative bias toward casual rider as it remove 13.87% from casual compare to just 3.86 from member.

Ride Duration Statistic After Removed Outliner with IQR Method
Quantile Minute
0% 1.000000
25% 5.729783
50% 9.805817
75% 17.177688
100% 1574.900183
Rider Type min max mean IQR sd Total
casual 1 34.35 12.47 10.69 7.77 3,436,283
member 1 34.35 10.37 8.69 6.98 6,856,714
Total 1 34.35 11.07 9.43 7.32 10,292,997

According to Divvy policy, classic bike that not return by a day is consider stolen. Another point to consider is the day pass has maximum 3 hour of usage per 24 hours. The 1 day cutting point seem to be too high. Therefore the 3 hours cutting point for outliner will be used. This will remove 0.33% of data (0.69% from casual and 0.14% from member).

Ride Duration Statistic After Removed Time More Than 3 Hours
Rider Type min max mean IQR sd Total
casual 1 180.00 19.04 15.32 21.12 3,962,088
member 1 179.97 11.79 9.53 10.68 7,122,043
Total 1 180.00 14.38 11.34 15.64 11,084,131

Data Cleaning

The following data will be removed.
1. Record with duplicate ride_id.
2. Record with riding time less than 1 minute.
3. Record with riding time more than 3 hours.

Final data has 11,083,982 records (97.28% of original data).

Data Tranformation

New columns will be added to the Data to facilitate data analysis:

  1. Season - Spring, Summer, Fall and Winter computed from started_at.

    • Spring started March 19, 2024 and March 20, 2025
    • Summer started June 20, 2024 and June 20, 2025
    • Fall started September 20, 2023, September 22, 2024, and September 22, 2025
    • Winter started December 21, 2023, and December 21, 2024
  2. Day of Week (wday) Sunday, Monday, Tuesday, Wednesday, Thursday, Friday and Saturday. This column is computed from started_at.

  3. Riding Time in minute (length_min) computed from interval between started_at and ended_at.

The following columns will be removed because they will not be used:

  1. start_station_id
  2. end_station_id

The folowing columns will be transform to new data type:

  1. member_casual from string to factor
  2. rideable_type from string to factor
  3. wday from string to factor
  4. season from string to factor
  5. length_min form interval to numeric

Summary of Final Data

ride_id started_at ended_at member_casual rideable_type start_lat start_lng end_lat end_lng start_station_name end_station_name season wday length_min week_day morning_afternoon
Length:11083982 Min. :2023-10-01 00:00:05 Min. :2023-10-01 00:02:02 casual:3962010 classic_bike :4808871 Min. :41.64 Min. :-87.91 Min. :16.06 Min. :-144.05 Length:11083982 Length:11083982 Spring:2901652 Sun:1469951 Min. : 1.000 weekday:7912152 morning :3307662
Class :character 1st Qu.:2024-06-02 18:38:44 1st Qu.:2024-06-02 18:57:36 member:7121972 electric_bike :6137575 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.: -87.66 Class :character Class :character Summer:4563105 Mon:1498388 1st Qu.: 5.717 weekend:3171830 afternoon:7776320
Mode :character Median :2024-09-24 20:55:02 Median :2024-09-24 21:04:22 NA electric_scooter: 137536 Median :41.90 Median :-87.64 Median :41.90 Median : -87.64 Mode :character Mode :character Fall :2531749 Tue:1576856 Median : 9.772 NA NA
NA Mean :2024-11-01 03:15:12 Mean :2024-11-01 03:29:35 NA NA Mean :41.90 Mean :-87.65 Mean :41.90 Mean : -87.65 NA NA Winter:1087476 Wed:1605352 Mean : 14.383 NA NA
NA 3rd Qu.:2025-06-05 17:33:38 3rd Qu.:2025-06-05 17:48:10 NA NA 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.: -87.63 NA NA NA Thu:1619773 3rd Qu.: 17.060 NA NA
NA Max. :2025-09-30 23:57:30 Max. :2025-09-30 23:59:57 NA NA Max. :42.07 Max. :-87.52 Max. :87.96 Max. : 152.53 NA NA NA Fri:1611783 Max. :180.000 NA NA
NA NA NA NA NA NA NA NA’s :53 NA’s :53 NA NA NA Sat:1701879 NA NA NA

Sample of Final Data

ride_id started_at ended_at member_casual rideable_type start_lat start_lng end_lat end_lng start_station_name end_station_name season wday length_min week_day morning_afternoon
A6C0954494524AC5 2024-08-17 18:33:00 2024-08-17 19:31:22 casual classic_bike 41.90940 -87.67769 41.94707 -87.74001 Damen Ave & Pierce Ave Kilbourn Ave & Milwaukee Ave Summer Sat 58.372983 weekend afternoon
062932ADBEF9EB9F 2023-10-06 10:28:31 2023-10-06 10:33:09 member classic_bike 41.93625 -87.65266 41.92955 -87.64312 Sheffield Ave & Wellington Ave Clark St & Wrightwood Ave Fall Fri 4.633333 weekday morning
598FA2D89B4F197D 2025-09-23 13:25:08 2025-09-23 13:41:10 member electric_bike 41.71971 -87.64302 41.72554 -87.68197 Halsted St & 96th St Public Rack - Western Ave & 92nd Pl Fall Tue 16.025717 weekday afternoon
D3D4564815380968 2024-11-25 08:42:27 2024-11-25 08:58:21 casual classic_bike 41.80241 -87.58692 41.78938 -87.59648 Cornell Ave & Hyde Park Blvd Woodlawn Ave & 58th St Fall Mon 15.902483 weekday morning
6913F4A83D0B8A67 2025-06-24 16:13:41 2025-06-24 16:23:34 member classic_bike 41.87926 -87.63990 41.89391 -87.64174 Canal St & Adams St Kingsbury St & Erie St Summer Tue 9.892450 weekday afternoon
2393928A587E74B9 2025-08-09 12:05:45 2025-08-09 12:18:49 casual electric_bike 41.92676 -87.63443 41.91468 -87.64332 Cannon Dr & Fullerton Ave Larrabee St & Menomonee St Summer Sat 13.067050 weekend afternoon

Rider Type Statistic

Ride Duration Statistic After Removed Time More Than 3 Hours
Rider Type min max mean IQR sd Total
casual 1 180.00 19.04 15.32 21.12 3,962,010
member 1 179.97 11.79 9.53 10.68 7,121,972
Total 1 180.00 14.38 11.34 15.64 11,083,982

Density of Riding Duration

Next: Analysis