Background

This is the final Capstone project for the Google Careers Certification in Data Analytics, offered by Coursera. This will be a great opportunity for me to apply the knowledge and skills gained throughout my study of 7 previous courses. I will work to demonstrate my ability to handle real-life scenarios as a junior data analyst.

Scenario

In this hypothetical scenario I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From the insights gained, the team will design a new marketing strategy to convert casual riders into annual members.

Following the structure of the certification program, I will utilize Google’s data analysis process known as APPASA (Ask, Prepare, Process, Analyze, Share, and Act) in order to methodically work through the case study and provide data-driven recommendations that answer the business task. Each step in the process is detailed below.

Ask

First I need to identify the stakeholders for this project. The stakeholders include the project manager, who will coordinate the team and help me resolve any minor issues throughout my analysis. They also include Cyclistic company executives, who will have the final say on any recommendations I come up with. I must make sure that these recommendations are backed up with compelling data insights and professional data visualizations. Now that I understand the stakeholders for the project I will interpret their desired outcomes for this project by establishing a business task that I will refer to often throughout the project.

Business Task:
“Recommend digital marketing strategies aimed at converting casual Cyclistic riders into annual members, and to understand how member and casual riders use Cyclistic bikes differently.”

I will look to provide 3 to 6 KPI’s (key performance indicators) that will help to answer the business task for stakeholders.

Prepare

The data is provided through this link https://divvy-tripdata.s3.amazonaws.com/index.html, under this license https://www.divvybikes.com/data-license-agreement. I felt that analyzing a full 12 months of data (I picked data from 2021) would give me a large sample size in order to reinforce my insights. I downloaded the data from January 2021 to December 2021, 12 Excel files in total, totaling close to 5.6 million records of collected data. I saved them in a separate new folder named “Raw Data” to keep the original dataset as a backup in case anything should happen while working with the data throughout the analysis.

I’m working on a PC with Microsoft Excel 2013 installed, I plan to use it to clean all 12 files, and then import them into SQL Big Query in CSV format. Then I will write a query to combine all 12 files into one considering it will be too large to handle as a regular spreadsheet in Excel. Once combined, I will export it from the Big Query database into RStudio, where R can handle the volume of my analysis and build basic visuals. I will add more visuals using Tableau. And finally, I will create this final documentation using RMarkdown, where I will present my recommendations that address the business task. But first, in order to combine all files into one all of the spreadsheets must be cleaned, have identical column names and matching data types.

Process

I followed the data cleaning checklist below to clean all 12 spreadsheets.

Data Cleaning Checklist: I checked for duplicate data, null data, misspelled words, incorrectly typed numbers, extra spaces and characters, mismatched data types, ‘messy’ strings or date formats, misleading column labels, and truncated data using Excel.
1. Removed irrelevant data by deleting middle columns.

2. Removed duplicate data with the Remove Duplicates feature in Excel.

3. Checked for null data by creating a formula using the COUNTBLANK function, run for each column A-E and reported in column F.

4. Checked for misspelled words and messy strings using column filtering for any incorrect string spellings. I discovered a few IDs in the “ride_id” column that were of a different format than normal. For instance, in this image the ride ID is the normal 16 digit ID with numbers and letters.

This next image shows the filtered IDs that have the format of a very large number used for its ID.

In this case, I would consult with the project manager on if these IDs are formatted correctly, and whether I can use the rows they represent in my analysis or not. For the purposes of this case study, so long as these IDs are unique, I will leave them in.

5. Checked for incorrectly typed numbers using column filtering for any records not matching the MM/DD/YYYY format for dates.

I found several dates that had the wrong month (March dates were found in February data). I plan to join all 12 datasets into one table using SQL later, where the different months will be grouped together in any case, so for now I will leave them. For time stamps, I split them away from the date-time values using the INT function, shown here,

then used an IF-AND statement with a LEN condition to check that each time stamp fell into the correct time formats of hh:mm and h:mm (military time). I filtered column J to check for any records that reported the “WRONG” result.

For ride IDs I used a LEN function to check that each ID matched the correct 16 digit format (even the numbered IDs I found earlier should only be 16 digits long. I filtered Column G to check for any records that reported the “Too many characters” result.

6. Used the TRIM function to remove any extra spaces and characters for each column.

7. Assigned each column with the correct corresponding data type by formatting each column using the “Format Cells” feature.

8. Made sure date formats were of the correct data type (m/d/yyyy h:mm) using the “Format Cells” feature.

9. Changed misleading column labels to more appropriate names (“rideable_type” to “bike_type”, “member_casual” to “customer_type”).

10. Checked for truncated data. The time datatypes in each dataset were in a truncated date-time format, but I decided to not to split out the time from the date, as aggregations I plan to run on the data later are simple ones that can work despite the format.

Now that my data is clean, I converted the .xlsx files into .csv files, and then imported them into SQL in order to join the 12 datasets into one final table. I ran into an issue while importing datasets where the resulting tables would appear with many null records as well as extra null columns along with the dataset. I wrote a SQL query to each table to remove the nulls and save the results to a new, ready to join table.

Next, I wrote the query that would join all 12 tables together into a single one. Though normally I would use a JOIN clause to accomplish this, I only need to join the tables vertically, as they all have the same 5 columns within them. In this situation, I decided a UNION ALL clause would be simplest to combine them.

After the tables were combined, I exported the final result from Big Query into an R dataframe cleaned and ready for analysis in RStudio.

Analyze

Once the data was imported into R, I did a final check on the data frame to make sure there were no data cleaning errors using the Skim_without_charts function.

Next, I referred back to the business task and decided to identify several KPIs to look for within the data that could help accomplish the business task. I came up with these questions to answer with the data:

1. What months of the year and times of the day do most casual riders use Cyclistic bikes, when do the least people use them? How does this apply to members?

To answer this first question, I wanted to create a detailed visual that looked at bike demand throughout 2021. First, I wrote a script that would split up the date-time column “started_at” into its various parts (year, month, day, time).


This resulted in an updated dataframe.

I then created a plot in R to show an aggregated count of the number of riders (broken up for member/casual riders) using Cyclistic bikes, for each month. The script below first performs the aggregation and then creates the following plot.


From looking at the above bar graph, I concluded that in the spring and winter seasons members dominate the demand for bikes, but in the summer and fall not only do casual members equal or exceed member demand, bike ride demand in general skyrockets.
While this is a useful visualization, I decided to create any future visuals using Tableau Public, as that software is far more intuitive than R for creating them.

Now that I have a visual for total bike ride demand throughout the year, I will create a visual that shows bike ride demand throughout different times of the day, this time using Tableau to build it. After using the “Save” function in R to create an .Rdata file of my data frame, I imported it into Tableau. Within the “Data Source” page, I checked each of my data types to ensure they were being interpreted correctly by Tableau; several weren’t and needed to be updated.

Once done, I chose a histogram to visualize bike demand for each hour of the day.
From this visual, I gathered two insights: First, most riders regardless of the customer type enjoy bike rides in the afternoon and early evening (around 60% of all rides). Second, in the morning hours (5am-9am) there were far more bike rides made by members than casual riders.

2. How long, on average, are casual and member bike rides throughout the year?
For this question, I needed to preform another aggregation on the dataframe to add a “ride_length” column totaling the duration in minutes that a bike ride took. I wrote this script to accomplish the task.

Around 80 records of the new column came back as negative time lengths. Here I would ask a project manager the meaning of these negative lengths and if these records can still be used, but in this case considering only 80 records out of 5.6 million had this error, I wrote a second script to remove them.

This resulted in a final updated dataframe.

I decided that 2 side by side bar graphs for each customer type would best help to answer this question. I then built the visual in Tableau using an average aggregation of the “Ride Length” column of my data. I realized that when I added all 3 bike types to the plot that docked bikes (6% of bikes that were reported as having quality issues and were taken out of circulation by Cyclistic) had much larger average ride lengths than the other 2. Since including bikes that were never ridden in this visual would skew my data, I created a filter to exclude them. Here is the resulting plot.

From these graphs, I quickly discerned that casual riders (25 minute average) always took longer bike trips than members (14 minute average), no matter what time of year.

3. What types of bikes do casual riders and members use for their trips? What percentage of total bike rides does each bike type and customer type command?

I picked a pie chart this time, as I believed it would best help visualize these questions. Again building the visualization in Tableau, I used a count of all total bike rides broken up into pie slices containing the 2 different bike types (classic/electric) as well as the 2 customer types. In actuality there are 3 bike types, the third being named “docked bike,” but as I discussed earlier, these bikes had quality issues and were not truly used for bike trips, and so I filtered them out of this visual.

From this chart I learned that there is a significant portion of casual customers that use electric bikes (about 18%).

Share

Building visuals in RStudio and Tableau for the purpose of answering specific questions about the data helped lead me to 5 insightful KPIs that will combine to address the business task. In this way, much of the “Share” phase of the data analysis process was done in tandem with the “Analyze” phase. But I will still need to compose a final report that will document my entire process, work, and results. Up until this point I have been using Microsoft Word to keep a rough draft of this documentation for my own convenience, but now I will use it to create a polished final report with RMarkdown, a document writing software housed within RStudio.

The report you are reading now is that finished final report. To view any of the Tableau visuals with interactive numerical data follow this link to the project on my Tableau Public Page here.

Act

To recap, the 5 insights that I uncovered throughout my analysis on how member and casual riders use Cyclistic bikes differently are as follows:

• During the spring and winter months members dominate the demand for Cyclistic bikes, but in the summer and fall not only do casual riders equal or exceed member demand, bike ride demand in general skyrockets.
• The majority of riders, regardless of the customer type enjoy bike rides in the afternoon and early evening (around 60% of all rides).
• In the mornings (5am-9am) there were far more bike rides made by members than casual riders.
• Casual riders on average took longer bike rides (25 minutes) than members (14 minutes), no matter what time of year.
• There is a significant portion (950,000 or so) of casual customers that use electric bikes, about 18% of all customers.

Keeping these differences in mind, here are my recommendations for digital marketing strategies that will help to encourage casual Cyclistic riders to sign up for annual memberships:

1. Advertisements for Cyclistic bike memberships will have the largest audience during the summer and fall seasons, during those seasons offer sign up bonuses to attract new members (lower fees, rewards bonuses, etc.).
2. Offer new exclusive member deals for bike rides that take place in the afternoon and evenings, and advertise these deals in Cyclistic ads. Target marketing campaigns to run during these times, when the majority of casual riders are looking to start a new trip.
3. Offer membership benefits for leisure bike rides (around 20 minutes or longer) throughout the year.
4. Offer exclusive member deals targeted at electric bike use. There are nearly a million casual riders that enjoy using them.

In conclusion, the goal of these recommendations is to show casual Cyclistic riders that many of the aspects they enjoy about bike rides (long afternoon rides in the summer and fall) can return valuable benefits if done with a membership. In this way I am confident that the implementation of these recommendations, in whatever form Cyclistic chooses, will encourage membership growth and enjoyment, and by extension, a net profit increase for Cyclistic bikes.
If I was to improve on this case study, I think I would find a way to use the latitude and longitude columns of the raw data in some sort of heat map. It may be worth looking into what geographical locations are most popular among member and casual riders.