This is part of the Capstone project for the Google Data Analytics Professional Certificate offered on Coursera. It is a great chance to apply the knowledge and new skills gained in the seven previous courses. I’ll try to demonstrate my ability to handle a real-life scenario as a junior data analyst, and this will be the first piece of my online portfolio.
I chose the provided template for my capstone project, Case Study 1: “How Does a Bike-Share Navigate Speedy Success?” The fictional company “Cyclistic” runs a bike-share business in Chicago. The marketing team believes that profits can be maximized by converting more one-time customers (casual riders) into regular customers (annual members), and my task is to analyze the provided data to answer the question: “How do annual members and casual riders use Cyclistic bikes differently?” Full details of the case study can be found by clicking here: Case Study 1
The data is provided at https://divvy-tripdata.s3.amazonaws.com/index.html, under the license at https://www.divvybikes.com/data-license-agreement. I need to analyze the past 12 months of data, so I downloaded the files from August 2022 to July 2023, 12 zip files in total. I saved them in a separate new folder as a backup of the original data, in case anything goes missing during the analysis.
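The files can of course be downloaded manually from the index page, which is what I did, but for completeness here is a minimal R sketch of how the download could be scripted. It assumes the files keep the “YYYYMM-divvy-tripdata.zip” naming visible on the index page, and the folder name is just an example:
# Sketch: download the 12 monthly archives (Aug 2022 - Jul 2023) into a backup folder
months <- format(seq(as.Date("2022-08-01"), as.Date("2023-07-01"), by = "month"), "%Y%m")
dir.create("original_data", showWarnings = FALSE)
for (m in months) {
  zip_name <- paste0(m, "-divvy-tripdata.zip")  # assumed naming convention
  download.file(paste0("https://divvy-tripdata.s3.amazonaws.com/", zip_name),
                destfile = file.path("original_data", zip_name), mode = "wb")
}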
I’m using a PC running Windows 11 with Excel installed; my plan is to use Excel to inspect and clean each file saved in CSV format. Then I’d like to use RStudio to combine all 12 files into one, since the result will be too big to handle as a regular spreadsheet in Excel, and to run my analysis and build the visuals there. I’ll create the presentation in R Markdown too. In order to combine all files into one, every document must be cleaned and have identical column names with matching data types.
In case my PC is not able to handle the volume needed to produce the graphs, I also plan to use Tableau Public (Desktop version) as a backup.
I open the CSV files one by one to check that the data is ROCCC (reliable, original, comprehensive, current, cited). The data is open source, comes straight from the first-party source (created within the company), and is already anonymized, de-identified and unbiased. I checked for null values, misspelled words, mistyped numbers, extra spaces and characters, mismatched data types, ‘messy’ strings or date formats, misleading column labels and truncated data. I also checked for duplicated data with Excel’s very easy ‘Remove Duplicates’ function.
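For files that are too heavy for Excel, the same basic checks can be reproduced in R. A minimal sketch (the file name is just an example, and ‘ride_id’ is only checked if that column is present in the raw file):
library(tidyverse)

# Sketch: basic cleanliness checks on one monthly file
trips <- read_csv("202210-divvy-tripdata.csv")

colSums(is.na(trips))        # null values per column
sum(duplicated(trips))       # fully duplicated rows
n_distinct(trips$ride_id)    # should equal nrow(trips) if every ride ID is unique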
For the purpose of this exercise, I’ll only work with the data I actually need and delete the remaining columns. This will make the files easier to work with and will also decrease the total data volume, which should speed up processing on my PC.
I’ll keep four columns at this stage. I’ll also add a fifth column calculating the ride length in seconds (the easiest unit to process and work with later), as shown in the screenshot below. From my experience, the final combined document will contain 5M+ rows, and RStudio on my PC might not handle that volume of calculations reliably, which is why I decided to insert the calculations at this stage. It is also important that the column names are exactly the same in every file and that the data types are double-checked. For example, I found that the data for October 2022 in the ‘started_at’ and ‘ended_at’ columns was saved partially as dates and partially as text strings. In Excel this is a very simple fix with the ‘Format Cells’ function.
After creating the last column and performing the calculations, I need to ensure that the results are saved as plain numbers without any formulas, which is also a very simple step in Excel: copy the entire column and paste it back as values.
At this stage I’ll also clean the data further by removing rows with a negative ride length, as they can distort the calculations and there are only a few such rows in each file. This is easily done by adding filters to the columns and sorting ‘ride_length’ in ascending order, as shown below.
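If you prefer to do these steps in R instead of Excel, here is a hedged sketch of the equivalent: keep only the needed columns, compute ride_length in seconds and drop the negative values. It assumes the timestamps parse as day-month-year date-time strings; the parsing function would need to match the actual format saved in the files, and the file name is only an example.
library(tidyverse)
library(lubridate)

# Sketch: column selection, ride_length calculation and negative-value removal in R
trips <- read_csv("202210-divvy-tripdata.csv") %>%                # example file name
  select(rideable_type, started_at, ended_at, member_casual) %>%  # the four columns kept
  mutate(ride_length = as.numeric(difftime(dmy_hm(ended_at),      # dmy_hm() assumes 'dd-mm-yyyy hh:mm'
                                            dmy_hm(started_at),
                                            units = "secs"))) %>%
  filter(ride_length >= 0)                                        # drop negative ride lengths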
A question that came up after checking the longest rides: ‘docked_bike’ shows the biggest usage time, but does that represent real rides, or a bike that is simply not in use and therefore ‘docked’? There are also many entries showing that the service was used for zero to just a few seconds; do I need to include this data too? In a real-life situation I’d clarify these points with the manager or the team responsible for generating the data. For the purpose of this exercise I’ll keep this data, as there were no instructions to ignore it, but I’ll keep it in mind during my further analysis.
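Before deciding, it is easy to quantify how many of these questionable records there are; a quick sketch (the one-minute threshold is just an arbitrary example):
# Sketch: count the suspicious records flagged above
sum(trips$ride_length == 0)    # rides of exactly zero seconds
sum(trips$ride_length < 60)    # rides shorter than one minute
table(trips$rideable_type)     # how many rides are 'docked_bike'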
And this is it, all of our 12 files are properly cleaned and ready to be processed further.
At this stage I’ll be processing the data in RStudio using R, as the total number of rows in the combined document should exceed 5.7M, which Excel cannot handle. I need to make sure that all the necessary packages (tidyverse, skimr) are installed before starting. I’ll use the following code to combine all twelve files into one. It is important to save all the cleaned CSV files into a separate subfolder and copy its path.
library(tidyverse)
library(skimr)

# Read every cleaned CSV file in the subfolder and bind them into one data frame
combined_data_v3 <- list.files(path = "E:/Education/Coursera/Google Data Analytics Professional Certificate/8 Capstone/Case Study/Extracted Data Working Copy/Cleaned Data",
                               pattern = "\\.csv$", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
The R code above binds all twelve files into a single data frame. To view the parameters of the combined data, we’ll run the following function from the ‘skimr’ package to double-check whether we missed any ‘dirty’ data.
skim_without_charts(combined_data_v3)
The result of the code above is as follows:
── Data Summary ────────────────────────
Values
Name combined_data_v3
Number of rows 5723485
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 rideable_type 0 1 11 13 0 3 0
2 started_at 0 1 13 17 0 462330 0
3 ended_at 0 1 13 17 0 463371 0
4 member_casual 0 1 6 6 0 2 0
── Variable type: numeric ──────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
1 ride_length 0 1 1102. 10422. 0 327 577 1022 3087684
This is a great way to ensure that your final document contains only relevant, clean data; anything you missed will show up in this summary. As you can tell, this was my third try (hence ‘combined_data_v3’), and it was successful, so the data is now ready for further analysis.
In order to create compelling visuals I need to add a few more columns to provide more parameters for the graphs. I’ll add the following columns: date (in R’s Date format), month (as a number), day, year, and also day of the week, with the help of the following code:
library(lubridate)

# Derive a proper Date column from 'started_at'; the format string must match
# how the timestamps were saved in the cleaned files
combined_data_v3$date <- as.Date(combined_data_v3$started_at, format = "%d-%m-%y")

# Extract month, day, year and the full weekday name from the new date column
combined_data_v3$month <- format(as.Date(combined_data_v3$date), "%m")
combined_data_v3$day <- format(as.Date(combined_data_v3$date), "%d")
combined_data_v3$year <- format(as.Date(combined_data_v3$date), "%Y")
combined_data_v3$day_of_week <- format(as.Date(combined_data_v3$date), "%A")
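Since lubridate is already loaded, the same columns could also be derived with its helper functions. A sketch, again assuming ‘started_at’ is stored as a day-month-year date-time string:
# Sketch: the same new columns built with lubridate helpers
combined_data_v3 <- combined_data_v3 %>%
  mutate(date        = as.Date(dmy_hm(started_at)),
         month       = month(date),
         day         = day(date),
         year        = year(date),
         day_of_week = wday(date, label = TRUE, abbr = FALSE))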
Our data looks as below and is ready for the visuals:
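As a quick console alternative to a screenshot, glimpse() shows the new columns and their types at a glance:
glimpse(combined_data_v3)   # column names, types and a preview of the values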
It’s important not to forget to save the file to make sure our data stays safe. This is also done in R with one simple line of code:
write.csv(combined_data_v3, file = "E:/Education/Coursera/Google Data Analytics Professional Certificate/8 Capstone/Case Study/Extracted Data Working Copy/Cleaned Data/combined_data_v3.csv")
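The graphs themselves are built in RStudio (with Tableau as a backup). As an illustration, here is a sketch of the kind of summary and ggplot call behind the day-of-week comparison discussed below, using the columns created earlier:
# Sketch: number of rides and average ride length by rider type and day of week
rides_by_day <- combined_data_v3 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            avg_ride_length = mean(ride_length),
            .groups = "drop")

ggplot(rides_by_day, aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Number of rides by day of the week",
       x = NULL, y = "Number of rides", fill = "Rider type")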
A lot of interesting insights were uncovered while analyzing the data for the case study. Casual riders and members use the Cyclistic service in different ways. Overall, members ride more often but for shorter periods of time, more on weekdays and during the colder seasons (rain or shine, just like when you need to show up for work or for classes at the university), and more often in the busiest parts of Chicago, in the Downtown and Hyde Park areas. All of this tells us that members use the service regularly for specific purposes, most likely to commute to work or study, or as a form of physical exercise (or both, why not?), since many of them use classic bicycles.
Casual riders use the service more for leisure: their usage increases on weekends and during the warmer seasons. They ride less often but for longer, probably because many of them live in the outer city or the suburbs, and many are likely tourists, since it makes sense to buy single-use rides for sightseeing, which also explains the increased usage in the Downtown area. They are also likely to use the service alongside Chicago’s well-developed public transportation system. Based on all of the above, here are my recommendations:
The marketing campaign is likely to reach more casual users if it is run on weekends in the areas with the docking stations most used by one-time riders. It is also possible to filter the locations by weekend usage to pinpoint the busiest bike stations.
The data clearly shows the seasonality of usage among casual riders, so it makes sense to run promotional campaigns during the warmer months when usage peaks. This way it is possible to reach more potential customers who could be converted into members.
We can increase the conversion of casual customers if we promote the idea of ‘green’, eco-friendly and healthy transportation used together with the city’s public transport system. The last graph with the geo data suggests that the busiest docking stations sit along the subway and other busy public transportation routes; most likely this combination is already happening, and more casual riders just need to know about it. Many other big cities run ‘Park and Ride’ systems successfully to help decongest city-center traffic and improve air quality. For example, suburban dwellers can use the service to get from home to the subway station, and from the subway station to their workplace. In that case it also becomes economically reasonable to buy a membership and ride an unlimited number of times instead of paying for several single rides.
This was a fun exercise where I had the opportunity to apply the data analytics skills taught in the Coursera Google Data Analytics Professional Certificate program and to use different tools like Excel, RStudio and Tableau. This was my first ever data analysis project, and of course in hindsight it can always be improved, but that is all part of the valuable learning experience. It is always encouraging to know that learning data analysis can be intellectually challenging and fun at the same time.