This is part of the Capstone project for the Google Data Analytics Professional Certificate offered on Coursera. It is a great chance to apply the knowledge and new skills gained in the seven previous courses. I’ll try to demonstrate my ability to handle a real-life scenario as a junior data analyst, and this will be the first piece of my online portfolio.
I chose the provided template for my capstone project, Case Study 1: “How Does a Bike-Share Navigate Speedy Success?” The fictional company “Cyclistic” runs a bike-share business in Chicago. The marketing team believes that profits can be maximized by converting more one-time customers (casual riders) into regular customers (annual members), and my task is to analyze the provided data to answer the question: “How do annual members and casual riders use Cyclistic bikes differently?” Full details of the case study can be found by clicking here: Case Study 1
The data is provided at https://divvy-tripdata.s3.amazonaws.com/index.html, under the license at https://www.divvybikes.com/data-license-agreement. I need to analyze the past 12 months of data, so I downloaded the files from August 2022 to July 2023, 12 zip files in total. I saved them in a separate new folder as a backup of the original data, in case anything goes missing during the analysis.
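The files can of course be downloaded manually from the index page, which is what I did, but for completeness here is a minimal R sketch of how the download could be scripted. It assumes the files keep the “YYYYMM-divvy-tripdata.zip” naming visible on the index page, and the folder name is just an example:
# Sketch: download the 12 monthly archives (Aug 2022 - Jul 2023) into a backup folder
months <- format(seq(as.Date("2022-08-01"), as.Date("2023-07-01"), by = "month"), "%Y%m")
dir.create("original_data", showWarnings = FALSE)
for (m in months) {
  zip_name <- paste0(m, "-divvy-tripdata.zip")  # assumed naming convention
  download.file(paste0("https://divvy-tripdata.s3.amazonaws.com/", zip_name),
                destfile = file.path("original_data", zip_name), mode = "wb")
}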
I’m using a PC running Windows 11 with Excel installed; my plan is to use Excel to inspect and clean each file saved in CSV format. Then I’d like to use RStudio to combine all 12 files into one, since the result will be too big to handle as a regular spreadsheet in Excel, and to run my analysis and build the visuals there. I’ll create the presentation in R Markdown too. In order to combine all files into one, every document must be cleaned and have identical column names with matching data types.
In case my PC is not able to handle the volume needed to produce the graphs, I also plan to use Tableau Public (Desktop version) as a backup.
I open the CSV files one by one to check that the data is ROCCC (reliable, original, comprehensive, current, cited). The data is open source, comes straight from the first-party source (created within the company), and is already anonymized, de-identified and unbiased. I checked for null values, misspelled words, mistyped numbers, extra spaces and characters, mismatched data types, ‘messy’ strings or date formats, misleading column labels and truncated data. I also checked for duplicated data with Excel’s very easy ‘Remove Duplicates’ function.
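For files that are too heavy for Excel, the same basic checks can be reproduced in R. A minimal sketch (the file name is just an example, and ‘ride_id’ is only checked if that column is present in the raw file):
library(tidyverse)

# Sketch: basic cleanliness checks on one monthly file
trips <- read_csv("202210-divvy-tripdata.csv")

colSums(is.na(trips))        # null values per column
sum(duplicated(trips))       # fully duplicated rows
n_distinct(trips$ride_id)    # should equal nrow(trips) if every ride ID is unique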
For the purpose of this exercise, I’ll only work with the data I actually need and delete the remaining columns. This will make the files easier to work with and will also decrease the total data volume, which should speed up processing on my PC.
I’ll keep four columns at this stage. I’ll also add a fifth column calculating the ride length in seconds (the easiest unit to process and work with later), as shown in the screenshot below. From my experience, the final combined document will contain 5M+ rows, and RStudio on my PC might not handle that volume of calculations reliably, which is why I decided to insert the calculations at this stage. It is also important that the column names are exactly the same in every file and that the data types are double-checked. For example, I found that the data for October 2022 in the ‘started_at’ and ‘ended_at’ columns was saved partially as dates and partially as text strings. In Excel this is a very simple fix with the ‘Format Cells’ function.
After creating the last column and performing the calculations, I need to ensure that the results are saved as plain numbers without any formulas, which is also a very simple step in Excel: copy the entire column and paste it back as values.
At this stage I’ll also clean the data further by removing rows with a negative ride length, as they can distort the calculations and there are only a few such rows in each file. This is easily done by adding filters to the columns and sorting ‘ride_length’ in ascending order, as shown below.
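If you prefer to do these steps in R instead of Excel, here is a hedged sketch of the equivalent: keep only the needed columns, compute ride_length in seconds and drop the negative values. It assumes the timestamps parse as day-month-year date-time strings; the parsing function would need to match the actual format saved in the files, and the file name is only an example.
library(tidyverse)
library(lubridate)

# Sketch: column selection, ride_length calculation and negative-value removal in R
trips <- read_csv("202210-divvy-tripdata.csv") %>%                # example file name
  select(rideable_type, started_at, ended_at, member_casual) %>%  # the four columns kept
  mutate(ride_length = as.numeric(difftime(dmy_hm(ended_at),      # dmy_hm() assumes 'dd-mm-yyyy hh:mm'
                                            dmy_hm(started_at),
                                            units = "secs"))) %>%
  filter(ride_length >= 0)                                        # drop negative ride lengths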
A question that came up after checking the longest rides: ‘docked_bike’ shows the biggest usage time, but does that represent real rides, or a bike that is simply not in use and therefore ‘docked’? There are also many entries showing that the service was used for zero to just a few seconds; do I need to include this data too? In a real-life situation I’d clarify these points with the manager or the team responsible for generating the data. For the purpose of this exercise I’ll keep this data, as there were no instructions to ignore it, but I’ll keep it in mind during my further analysis.
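Before deciding, it is easy to quantify how many of these questionable records there are; a quick sketch (the one-minute threshold is just an arbitrary example):
# Sketch: count the suspicious records flagged above
sum(trips$ride_length == 0)    # rides of exactly zero seconds
sum(trips$ride_length < 60)    # rides shorter than one minute
table(trips$rideable_type)     # how many rides are 'docked_bike'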
And this is it, all of our 12 files are properly cleaned and ready to be processed further.
At this stage I’ll be processing the data in RStudio using R, as the total number of rows in the combined document should exceed 5.7M, which Excel cannot handle. I need to make sure that all the necessary packages (tidyverse, skimr) are installed before starting. I’ll use the following code to combine all twelve files into one. It is important to save all the cleaned CSV files into a separate subfolder and copy its path.
library(tidyverse)
library(skimr)

# Read every cleaned CSV file in the subfolder and bind them into one data frame
combined_data_v3 <- list.files(path = "E:/Education/Coursera/Google Data Analytics Professional Certificate/8 Capstone/Case Study/Extracted Data Working Copy/Cleaned Data",
                               pattern = "\\.csv$", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
The R code above binds all twelve files into a single data frame. To view the parameters of the combined data, we’ll run the following function from the ‘skimr’ package to double-check whether we missed any ‘dirty’ data.
skim_without_charts(combined_data_v3)
The result of the code above is as follows:
── Data Summary ────────────────────────
Values
Name combined_data_v3
Number of rows 5723485
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 rideable_type 0 1 11 13 0 3 0
2 started_at 0 1 13 17 0 462330 0
3 ended_at 0 1 13 17 0 463371 0
4 member_casual 0 1 6 6 0 2 0
── Variable type: numeric ──────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
1 ride_length 0 1 1102. 10422. 0 327 577 1022 3087684
This is a great way to ensure that your final document contains only relevant, clean data; anything you missed will show up in this summary. As you can tell, this was my third try (hence ‘combined_data_v3’), and it was successful, so the data is now ready for further analysis.
In order to create compelling visuals I need to add a few more columns to provide more parameters for the graphs. I’ll add the following columns: date (in R’s Date format), month (as a number), day, year, and also day of the week, with the help of the following code:
library(lubridate)

# Derive a proper Date column from 'started_at'; the format string must match
# how the timestamps were saved in the cleaned files
combined_data_v3$date <- as.Date(combined_data_v3$started_at, format = "%d-%m-%y")

# Extract month, day, year and the full weekday name from the new date column
combined_data_v3$month <- format(as.Date(combined_data_v3$date), "%m")
combined_data_v3$day <- format(as.Date(combined_data_v3$date), "%d")
combined_data_v3$year <- format(as.Date(combined_data_v3$date), "%Y")
combined_data_v3$day_of_week <- format(as.Date(combined_data_v3$date), "%A")
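Since lubridate is already loaded, the same columns could also be derived with its helper functions. A sketch, again assuming ‘started_at’ is stored as a day-month-year date-time string:
# Sketch: the same new columns built with lubridate helpers
combined_data_v3 <- combined_data_v3 %>%
  mutate(date        = as.Date(dmy_hm(started_at)),
         month       = month(date),
         day         = day(date),
         year        = year(date),
         day_of_week = wday(date, label = TRUE, abbr = FALSE))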
Our data looks as below and is ready for the visuals:
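As a quick console alternative to a screenshot, glimpse() shows the new columns and their types at a glance:
glimpse(combined_data_v3)   # column names, types and a preview of the values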
It’s important not to forget to save the file to make sure our data stays safe. This is also done in R with one simple line of code:
write.csv(combined_data_v3, file = "E:/Education/Coursera/Google Data Analytics Professional Certificate/8 Capstone/Case Study/Extracted Data Working Copy/Cleaned Data/combined_data_v3.csv")
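The graphs themselves are built in RStudio (with Tableau as a backup). As an illustration, here is a sketch of the kind of summary and ggplot call behind the day-of-week comparison discussed below, using the columns created earlier:
# Sketch: number of rides and average ride length by rider type and day of week
rides_by_day <- combined_data_v3 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            avg_ride_length = mean(ride_length),
            .groups = "drop")

ggplot(rides_by_day, aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Number of rides by day of the week",
       x = NULL, y = "Number of rides", fill = "Rider type")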
A lot of interesting insights were uncovered while analyzing the data for the case study. Casual riders and members use the Cyclistic service in different ways. Overall, members ride more often but for shorter periods of time, more on weekdays and during the colder seasons (rain or shine, just like when you need to show up for work or for classes at the university), and more often in the busiest parts of Chicago, in the Downtown and Hyde Park areas. All of this tells us that members use the service regularly for specific purposes, most likely to commute to work or study, or as a form of physical exercise (or both, why not?), since many of them use classic bicycles.
Casual riders use the service more for leisure: their usage increases on weekends and during the warmer seasons. They ride less often but for longer, probably because many of them live in the outer city or the suburbs, and many are likely tourists, since it makes sense to buy single-use rides for sightseeing, which also explains the increased usage in the Downtown area. They are also likely to use the service alongside Chicago’s well-developed public transportation system. Based on all of the above, here are my recommendations:
The marketing campaign is likely to reach more casual users if it is run on weekends in the areas with the docking stations most used by one-time riders. It is also possible to filter the locations by weekend usage to pinpoint the busiest bike stations.
The data clearly shows the seasonality of usage among casual riders, so it makes sense to run promotional campaigns during the warmer months when usage peaks. This way it is possible to reach more potential customers who could be converted into members.
We can increase the conversion of casual customers if we promote the idea of ‘green’, eco-friendly and healthy transportation used together with the city’s public transport system. The last graph with the geo data suggests that the busiest docking stations sit along the subway and other busy public transportation routes; most likely this combination is already happening, and more casual riders just need to know about it. Many other big cities run ‘Park and Ride’ systems successfully to help decongest city-center traffic and improve air quality. For example, suburban dwellers can use the service to get from home to the subway station, and from the subway station to their workplace. In that case it also becomes economically reasonable to buy a membership and ride an unlimited number of times instead of paying for several single rides.
This was a fun exercise where I had the opportunity to apply the data analytics skills taught in the Coursera Google Data Analytics Professional Certificate program and to use different tools like Excel, RStudio and Tableau. This was my first ever data analysis project, and of course in hindsight it can always be improved, but that is all part of the valuable learning experience. It is always encouraging to know that learning data analysis can be intellectually challenging and fun at the same time.