data analytics cyclistic project

Cyclistic project from the Google Data Analytics course.

After completing the Google Data Analytics certificate, we were recommended to do a capstone project. This project was done using R and RStudio.

1- How do annual (users) members and casual riders use Cyclistic bikes differently?

This question will be expanded as follows:

How usage differs by user type
How the different user types use the different bikes
The duration (time) the bikes are used for by each user type
Some visualisations that I think add value for the stakeholder

Table of contents:

Importing libraries
Importing data
Creating Data frames
Preparing and Processing Data
Analysing and creating Visualisations
Suggestions and recommendations
Personal takeaways/conclusions

Prepare and Process stages of Analysis:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

# base is always attached; dplyr and ggplot2 were already attached by tidyverse
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

# lubridate and readr are already attached above
library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

Importing the tables from Divvy. The imported data must be consistent, so we make sure that every table is in the correct format.

c202201 <- read_csv("1_year_of_cyclistic_data/202201-divvy-tripdata.csv")
c202202 <- read_csv("1_year_of_cyclistic_data/202202-divvy-tripdata.csv")
c202203 <- read_csv("1_year_of_cyclistic_data/202203-divvy-tripdata.csv")
c202204 <- read_csv("1_year_of_cyclistic_data/202204-divvy-tripdata.csv")
c202205 <- read_csv("1_year_of_cyclistic_data/202205-divvy-tripdata.csv")
c202206 <- read_csv("1_year_of_cyclistic_data/202206-divvy-tripdata.csv")
c202207 <- read_csv("1_year_of_cyclistic_data/202207-divvy-tripdata.csv")
c202208 <- read_csv("1_year_of_cyclistic_data/202208-divvy-tripdata.csv")
c202209 <- read_csv("1_year_of_cyclistic_data/202209-divvy-publictripdata.csv")
c202210 <- read_csv("1_year_of_cyclistic_data/202210-divvy-tripdata.csv")
c202211 <- read_csv("1_year_of_cyclistic_data/202211-divvy-tripdata.csv")
c202212 <- read_csv("1_year_of_cyclistic_data/202212-divvy-tripdata.csv")
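The twelve calls above differ only in the file name, so the import could also be sketched as a loop. This is only an illustration under stated assumptions: it uses base R's read.csv rather than read_csv, and the demo folder, the two tiny files and the `trips` list are made up so that the snippet runs on its own.

```r
# Hypothetical demo folder with two tiny CSV files, so the sketch runs on its own
demo_dir <- file.path(tempdir(), "cyclistic_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202201-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(demo_dir, "202209-divvy-publictripdata.csv"), row.names = FALSE)

# Find every monthly file; the pattern also matches the "publictripdata" name
files <- list.files(demo_dir, pattern = "divvy.*tripdata\\.csv$", full.names = TRUE)

# Read each file into one element of a named list (names like "c202201")
trips <- lapply(files, read.csv, stringsAsFactors = FALSE)
names(trips) <- paste0("c", substr(basename(files), 1, 6))

names(trips)
sapply(trips, nrow)
```

Keeping the tables in a list rather than binding them straight away leaves room to inspect each month before combining them.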

Each table can be viewed individually by running its name, for example:

c202201

## # A tibble: 103,770 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 C2F7DD78E82E… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
##  2 A6CF8980A652… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
##  3 BD0F91DFF741… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
##  4 CBB80ED41910… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
##  5 DDC963BFDDA5… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
##  6 A39C6F6CC058… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
##  7 BDC4AB637EDF… classi… 2022-01-30 18:32:52 2022-01-30 18:49:26 Oakley… KA1504…
##  8 81751A3186E5… classi… 2022-01-22 12:20:02 2022-01-22 12:32:06 Sheffi… TA1306…
##  9 154222B86A33… electr… 2022-01-17 07:34:41 2022-01-17 08:00:08 Racine… 13304  
## 10 72DC25B2DD46… classi… 2022-01-28 15:27:53 2022-01-28 15:35:16 LaSall… TA1309…
## # … with 103,760 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹rideable_type, ²start_station_name, ³start_station_id

The issue with c202211, which I will call “november”, is that its columns are correctly named, but some of their formats (data types) are incorrect. The columns started_at and ended_at are in chr, not date/datetime format. We will have to change this.
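A type mismatch like this can be spotted without opening every table by hand. The sketch below is an assumption-laden illustration: the two toy data frames, the `tables` list and the values in them are made up and only stand in for the real monthly files.

```r
# Two toy tables standing in for the monthly files (values are made up)
jan <- data.frame(started_at = as.POSIXct("2022-01-13 11:59:47", tz = "UTC"))
nov <- data.frame(started_at = "13/11/2022 11:59", stringsAsFactors = FALSE)
tables <- list(c202201 = jan, c202211 = nov)

# First class of the started_at column in each table
started_class <- sapply(tables, function(df) class(df$started_at)[1])
started_class

# Names of the tables whose started_at is not a date-time
odd_tables <- names(started_class)[started_class != "POSIXct"]
odd_tables
```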

So for now I will combine the other trip data tables into tripDataCombi and format it to the final version I want. Then I will format the november data frame, and once it is consistent with the tripDataCombi data I will bind them together with rbind.

To start with tripDataCombi: bring all the tables together. As we have seen, they have the same column order and the same formats:

tripDataCombi <- rbind(c202201,c202202,c202203,c202204,c202205,c202206,c202207,c202208,c202209,c202210,c202212)

Clean and format tripDataCombi. We will add 8 columns before binding november to the combined data: Start_date, Start_time, End_time, Duration, Hour, Day, Phase, Month.

Start by adding the columns. The first step is to split started_at into separate date and time variables, then make sure the date and time columns are in the correct data type, and finally remove the non-conforming rows.

# Start_date
tripDataCombi$Start_date <- as.Date(tripDataCombi$started_at)
# Start_time
tripDataCombi$Start_time <- format(as.POSIXct(tripDataCombi$started_at), format = "%H:%M:%S")

# End_time
tripDataCombi$End_time <- format(as.POSIXct(tripDataCombi$ended_at), format = "%H:%M:%S")

# Duration
tripDataCombi$Duration <- difftime(tripDataCombi$ended_at, tripDataCombi$started_at)

# Hour
tripDataCombi$Hour <- as.numeric(substr(tripDataCombi$Start_time,1,2))

# Day
tripDataCombi$Day <- format(as.Date(tripDataCombi$Start_date), "%A")

# Phase
tripDataCombi$Phase <- cut(tripDataCombi$Hour,breaks = c(0,6,12,18,24),include.lowest =TRUE,labels = c("Night","Morning","Afternoon","Evening"))

# Month
tripDataCombi$Month<- as.numeric(format(tripDataCombi$Start_date,"%m"))
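A note on how the Phase bins behave: with include.lowest = TRUE and the default right-closed intervals, hours 6, 12 and 18 fall into the earlier phase. The tables in this document follow that rule (the Night count equals the sum of hours 0 through 6 in the hourly table). The sketch below compares this with a left-closed alternative; the alternative is an assumption about one possible intent, not the method used in this analysis.

```r
hours <- 0:23

# Same call as above: intervals are closed on the right, so 6, 12 and 18
# fall into the earlier phase
phase_right <- cut(hours, breaks = c(0, 6, 12, 18, 24), include.lowest = TRUE,
                   labels = c("Night", "Morning", "Afternoon", "Evening"))

# Alternative (hypothetical): left-closed intervals, so Morning starts at 06:00
phase_left <- cut(hours, breaks = c(0, 6, 12, 18, 24), right = FALSE,
                  labels = c("Night", "Morning", "Afternoon", "Evening"))

# Compare the two labellings around the boundaries (hours 5, 6, 7, 12 and 18)
data.frame(hours, phase_right, phase_left)[c(6, 7, 8, 13, 19), ]
```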

Change start and end time to time class format:

tripDataCombi$Start_time <- as.ITime(tripDataCombi$Start_time)
tripDataCombi$End_time <- as.ITime(tripDataCombi$End_time)

We have finished reformatting the tripDataCombi dataframe; now we will remove the NA, null or blank values. Note: I would ask the owner what it means to have a 1-second trip. Is this a real-life use of the service? Do they charge by the second? But I will keep every result with a duration greater than 0 seconds, so as not to impose my rationale on the stakeholders.

Another note: I removed all rows containing NA values. I tried to reconcile them using the information I have (the longitude and latitude data, the station names and station IDs), but there was no way of verifying this. The reason is that the latitude and longitude values are too imprecise. The precision is roughly 111,111 m × 10^(-n), n being the number of decimal places. The dataframe uses only 1 or 2 decimal places, which means we are accurate to within about 11,100 m (11.1 km) to 1,100 m (1.1 km), which is ridiculously bad. So removing them is the only option. http://wiki.gis.com/wiki/index.php/Decimal_degrees
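The rule of thumb above can be checked with a few lines of arithmetic. This is a rough sketch: 111,111 m per degree is an approximation for latitude, and a degree of longitude is shorter away from the equator. The variable names are made up for the example.

```r
# Approximate length of one degree of latitude in metres (rule of thumb)
metres_per_degree <- 111111

# Resolution implied by n decimal places: 111,111 m * 10^(-n)
decimal_places <- 1:5
resolution_m <- metres_per_degree * 10^(-decimal_places)

data.frame(decimal_places, resolution_m)
```

With 1 or 2 decimal places the resolution is in the kilometre range, which is too coarse to tell neighbouring stations apart.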

remove:

# remove na values
tripDataCombi_cleaned<- na.omit(tripDataCombi)

# find durations less than 0, i.e. negative durations (they are removed further down)
filter(tripDataCombi_cleaned,Duration < 0)

## # A tibble: 37 × 21
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 2D97E3C98E16… classi… 2022-03-05 11:00:57 2022-03-05 10:55:01 DuSabl… TA1307…
##  2 7407049C5D89… electr… 2022-03-05 11:38:04 2022-03-05 11:37:57 Sheffi… TA1307…
##  3 072E947E156D… electr… 2022-06-07 19:14:46 2022-06-07 17:07:45 W Armi… 20254.0
##  4 BF114472ABA0… electr… 2022-06-07 19:14:47 2022-06-07 17:05:42 Base -… Hubbar…
##  5 029D853B5C38… classi… 2022-07-26 20:07:33 2022-07-26 19:59:34 Lincol… chargi…
##  6 C1D6D749139C… classi… 2022-07-26 20:08:04 2022-07-26 19:59:34 Lincol… chargi…
##  7 D3E7C0B68EFE… classi… 2022-07-26 20:20:31 2022-07-26 19:59:34 Lincol… chargi…
##  8 48EA91B86A42… classi… 2022-07-26 18:35:57 2022-07-26 18:32:30 Lincol… chargi…
##  9 035C91D5B31A… electr… 2022-07-30 09:36:02 2022-07-30 09:35:53 Southp… 13229  
## 10 461CC55C9B00… electr… 2022-07-09 20:31:40 2022-07-09 20:30:17 Leavit… 18058  
## # … with 27 more rows, 15 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, Start_date <date>, Start_time <ITime>,
## #   End_time <ITime>, Duration <drtn>, Hour <dbl>, Day <chr>, Phase <fct>,
## #   Month <dbl>, and abbreviated variable names ¹rideable_type,
## #   ²start_station_name, ³start_station_id

# keep only rides with a duration greater than 0 seconds (drops zero and negative durations)
tripDataCombi_cleaned <- tripDataCombi_cleaned %>% filter(Duration > 0)

We are now finished cleaning tripDataCombi. The cleaned version is called tripDataCombi_cleaned.

Now we will reformat and clean the “november” dataset and make it conform with tripDataCombi_cleaned. Once it is consistent, we will merge them together. We follow the same steps as above: add the correct columns, reformat the columns to the correct data type, then remove any NA/null/blank values and any durations of 0 or less.

november <- c202211

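# change the started_at and ended_at columns to the correct format <dttm> first,
# so that Start_date is derived from a real date-time and not from the raw text
november$started_at <- dmy_hm(november$started_at,tz=Sys.timezone())
november$ended_at <- dmy_hm(november$ended_at,tz=Sys.timezone())

# Start_date (same time zone as the parsed date-times)
november$Start_date <- as.Date(november$started_at, tz = Sys.timezone())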

# Start_time
november$Start_time <- format(as.POSIXct(november$started_at), format = "%H:%M:%S")

# End_time
november$End_time <- format(as.POSIXct(november$ended_at), format = "%H:%M:%S")

# Duration
november$Duration <- difftime(november$ended_at, november$started_at)

# Hour
november$Hour <- as.numeric(substr(november$Start_time,1,2))

# Day
november$Day <- format(as.Date(november$Start_date), "%A")

# Phase
november$Phase <- cut(november$Hour,breaks = c(0,6,12,18,24),include.lowest =TRUE,labels = c("Night","Morning","Afternoon","Evening"))

# Month
november$Month<- as.numeric(format(november$Start_date,"%m"))

Change the start and end times to the time class format again:

# change the times into the time data type
november$Start_time<- as.ITime(november$Start_time)
november$End_time<- as.ITime(november$End_time)
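When converting text dates like the ones in november, it helps to parse the strings first and then derive the date, weekday and month from the parsed values, because as.Date() applied directly to a day/month/year string may not read it as intended. A minimal sketch, assuming day/month/year hour:minute strings; the example values are made up, and base R's as.POSIXct with an explicit format is used here in place of lubridate's dmy_hm().

```r
# Made-up example strings in day/month/year hour:minute form
raw <- c("01/11/2022 08:15", "30/11/2022 23:59")

# Parse with an explicit format; lubridate's dmy_hm() does the same job
parsed <- as.POSIXct(raw, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Derive the date, weekday and month from the parsed values
start_date <- as.Date(parsed, tz = "UTC")
data.frame(raw, start_date,
           day = format(start_date, "%A"),
           month = as.numeric(format(start_date, "%m")))
```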

Next we remove all the NA, null or blank values in the november dataframe. The rationale is the same as given above for removing the NAs from the tripDataCombi dataframe.

# remove na values
november_cleaned <- na.omit(november)

# keep only durations greater than 0 (removes zero and negative durations)
november_cleaned <- filter(november_cleaned,Duration > 0)

Now that everything has been corrected in the november dataframe, we can combine the 2.

tdcc <- rbind(tripDataCombi_cleaned,november_cleaned)

Quick check to see that all the values have been combined nicely.

colSums(!is.na(tdcc))

##            ride_id      rideable_type         started_at           ended_at 
##            4366265            4366265            4366265            4366265 
## start_station_name   start_station_id   end_station_name     end_station_id 
##            4366265            4366265            4366265            4366265 
##          start_lat          start_lng            end_lat            end_lng 
##            4366265            4366265            4366265            4366265 
##      member_casual         Start_date         Start_time           End_time 
##            4366265            4366265            4366265            4366265 
##           Duration               Hour                Day              Phase 
##            4366265            4366265            4366265            4366265 
##              Month 
##            4366265

colSums(is.na(tdcc)) #this should be 0

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##                  0                  0                  0                  0 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                  0                  0 
##      member_casual         Start_date         Start_time           End_time 
##                  0                  0                  0                  0 
##           Duration               Hour                Day              Phase 
##                  0                  0                  0                  0 
##              Month 
##                  0

Analyse phase and Visualisation

A quick view of our table tdcc and how the users interact with Cyclistic.

table(tdcc$member_casual)  # how many rides there are for each user type, member and casual

## 
##  casual  member 
## 1757371 2608894

table(tdcc$rideable_type) # how many rides used each of the 3 bike types

## 
##  classic_bike   docked_bike electric_bike 
##       2596134        174827       1595304

table(tdcc$Phase) # the phases of the day in which the bikes are used

## 
##     Night   Morning Afternoon   Evening 
##    271800   1218519   2037328    838618

tabyl(tdcc$Day) # the days the bikes are most used

##   tdcc$Day      n   percent
##     Friday 602905 0.1380825
##     Monday 587505 0.1345555
##   Saturday 720990 0.1651274
##     Sunday 605020 0.1385669
##   Thursday 642800 0.1472196
##    Tuesday 605921 0.1387733
##  Wednesday 601124 0.1376746

tabyl(tdcc, Hour) # the hourly spread: number of rides per hour and their share of all rides

##  Hour      n     percent
##     0  58494 0.013396805
##     1  36854 0.008440624
##     2  21250 0.004866860
##     3  12433 0.002847514
##     4  10814 0.002476716
##     5  34484 0.007897826
##     6  97471 0.022323656
##     7 180976 0.041448698
##     8 219093 0.050178585
##     9 166937 0.038233364
##    10 176408 0.040402495
##    11 219528 0.050278213
##    12 255577 0.058534468
##    13 257382 0.058947865
##    14 263776 0.060412275
##    15 307357 0.070393574
##    16 384290 0.088013439
##    17 450803 0.103246825
##    18 373720 0.085592606
##    19 273504 0.062640266
##    20 193426 0.044300105
##    21 156520 0.035847572
##    22 127707 0.029248568
##    23  87461 0.020031079

count(tdcc, Month,member_casual) # the number of rides per month by user type

## # A tibble: 24 × 3
##    Month member_casual      n
##    <dbl> <chr>          <int>
##  1     1 casual         12605
##  2     1 member         67523
##  3     2 casual         15143
##  4     2 member         74031
##  5     3 casual         67150
##  6     3 member        148821
##  7     4 casual         91889
##  8     4 member        180657
##  9     5 casual        220232
## 10     5 member        282284
## # … with 14 more rows

# put the week days in order.
tdcc$Day <- ordered(tdcc$Day, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

# Some descriptive analysis

# Descriptive analysis of users by day of the week.(average, median, max, min )
tdcc %>%
  group_by(Day,member_casual) %>%
  summarise(Mean = mean(Duration), Median=median(Duration),Longest = max(Duration),Shortest = min(Duration))

## `summarise()` has grouped output by 'Day'. You can override using the `.groups`
## argument.

## # A tibble: 14 × 6
## # Groups:   Day [7]
##    Day       member_casual Mean           Median   Longest      Shortest
##    <ord>     <chr>         <drtn>         <drtn>   <drtn>       <drtn>  
##  1 Monday    casual        1494.0764 secs 835 secs 1922127 secs 1 secs  
##  2 Monday    member         724.2126 secs 519 secs   89575 secs 1 secs  
##  3 Tuesday   casual        1288.1142 secs 730 secs  654358 secs 1 secs  
##  4 Tuesday   member         709.1724 secs 517 secs   89640 secs 1 secs  
##  5 Wednesday casual        1253.8955 secs 730 secs  835693 secs 1 secs  
##  6 Wednesday member         714.2554 secs 526 secs   88277 secs 1 secs  
##  7 Thursday  casual        1285.0835 secs 742 secs 1624968 secs 1 secs  
##  8 Thursday  member         721.9301 secs 527 secs   88538 secs 1 secs  
##  9 Friday    casual        1344.7004 secs 795 secs  597741 secs 1 secs  
## 10 Friday    member         735.1320 secs 533 secs   87318 secs 1 secs  
## 11 Saturday  casual        1598.7870 secs 954 secs 2061244 secs 1 secs  
## 12 Saturday  member         834.6596 secs 600 secs   86997 secs 1 secs  
## 13 Sunday    casual        1626.4804 secs 958 secs  648433 secs 1 secs  
## 14 Sunday    member         822.4934 secs 583 secs   88602 secs 1 secs

# the overall average duration length:
mean(tdcc$Duration)

## Time difference of 1026.465 secs

# means by the different usership.
aggregate(tdcc$Duration ~ tdcc$member_casual, FUN = mean)

##   tdcc$member_casual  tdcc$Duration
## 1             casual 1440.2300 secs
## 2             member  747.7504 secs

# this shows the average duration per phase in each month, so the morning phase of the whole month of June is the row with Group.1 = Morning and Group.2 = 6.
aggregate(tdcc$Duration, list(tdcc$Phase,tdcc$Month), FUN = mean)

##      Group.1 Group.2              x
## 1      Night       1  945.5807 secs
## 2    Morning       1  691.2455 secs
## 3  Afternoon       1  797.1180 secs
## 4    Evening       1  822.6566 secs
## 5      Night       2  792.8829 secs
## 6    Morning       2  737.5689 secs
## 7  Afternoon       2  815.8888 secs
## 8    Evening       2  776.0862 secs
## 9      Night       3  817.5433 secs
## 10   Morning       3  894.4439 secs
## 11 Afternoon       3 1113.1935 secs
## 12   Evening       3 1019.2609 secs
## 13     Night       4  870.0239 secs
## 14   Morning       4  912.5134 secs
## 15 Afternoon       4 1055.3329 secs
## 16   Evening       4  968.3434 secs
## 17     Night       5 1069.9311 secs
## 18   Morning       5 1141.5598 secs
## 19 Afternoon       5 1218.9223 secs
## 20   Evening       5 1155.3714 secs
## 21     Night       6 1035.5252 secs
## 22   Morning       6 1092.9324 secs
## 23 Afternoon       6 1187.7302 secs
## 24   Evening       6 1130.1517 secs
## 25     Night       7  988.7555 secs
## 26   Morning       7 1133.2006 secs
## 27 Afternoon       7 1198.2024 secs
## 28   Evening       7 1105.5433 secs
## 29     Night       8  925.4655 secs
## 30   Morning       8 1047.2581 secs
## 31 Afternoon       8 1099.7598 secs
## 32   Evening       8 1026.6595 secs
## 33     Night       9  897.3660 secs
## 34   Morning       9  984.3258 secs
## 35 Afternoon       9 1023.8696 secs
## 36   Evening       9  919.2509 secs
## 37     Night      10  731.9523 secs
## 38   Morning      10  912.1244 secs
## 39 Afternoon      10  949.4316 secs
## 40   Evening      10  749.0787 secs
## 41     Night      11  648.3066 secs
## 42   Morning      11  745.3525 secs
## 43 Afternoon      11  813.5900 secs
## 44   Evening      11  730.7207 secs
## 45     Night      12  617.5877 secs
## 46   Morning      12  641.0157 secs
## 47 Afternoon      12  702.3938 secs
## 48   Evening      12  708.0837 secs

Plots and identification of trends.

We can see which bikes the users ride, and we can clearly see that only a tiny portion (about 4% of rides) use the docked bikes. A suggestion would be to look at the “cost of sale” for the docked bikes, which should include management and maintenance fees and any other expense incurred in the process of making a sale, and then see whether they are profitable for the company.

 ggplot(data = tdcc,mapping= aes(x= rideable_type, fill=rideable_type)) +geom_bar() + labs(title="Bike Types used")
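To put a number on the bars, the counts from the table(tdcc$rideable_type) output earlier can be turned into shares. A small sketch that only re-uses those printed counts; the `rides` and `share_pct` names are made up for the example.

```r
# Ride counts per bike type, copied from the table(tdcc$rideable_type) output
rides <- c(classic_bike = 2596134, docked_bike = 174827, electric_bike = 1595304)

# Share of all rides, in percent, rounded to one decimal place
share_pct <- round(100 * rides / sum(rides), 1)
share_pct
```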

Here we can see a breakdown of the rides by user type (member or casual) and the bike type used (electric, docked or classic bike).

# The type of bike used per user type.

ggplot(data = as.data.frame(tdcc%>%group_by(member_casual,rideable_type) %>% 
                              summarise(n=n())),
       mapping= aes(x= member_casual, y=n, fill =rideable_type)) +
  geom_bar(stat = 'identity') + labs(title="Type of bike used by users",
                                     y ="Number of rides",x="User type", caption = "By: Nabeel el Habbash")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Here we have the different users by day. We can clearly see that members use Cyclistic services more often and more consistently across the week. We can also see that members mostly ride during the work week.

  ggplot(data = tdcc) + geom_bar(mapping = aes(x = Day )) +facet_wrap(~member_casual)+ 
     labs(title = "Usertype by Day ", subtitle = ("use of bikes by day by the different users"), 
          caption = "By: Nabeel el Habbash")

I’m feeling generous: here’s another way of representing the previous plot, which is much clearer, I think.

  ggplot(data= as.data.frame(tdcc %>% group_by(member_casual, Day) %>%  arrange(member_casual, Day) %>% 
                               summarise(number_of_rides = n())),  
         mapping= aes(x = Day, y = number_of_rides , fill = member_casual)) +
  geom_col(position = "dodge")+ labs(title = "Rides per day by each user ", subtitle = ("The number of rides by the different users"),
                                     y ="Number of rides",x="Day", caption = "By: Nabeel el Habbash")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Next we can see how long the 2 user types use the service for on average per ride. We can clearly see that casual riders use the bikes for much longer than members. One limitation is that we cannot see the users’ account details, so we cannot see whether there are repeat users in either user type. It would be very useful to be able to identify the number of uses per customer over different periods (day, week, month etc.) and also to see how frequency relates to duration, but this information isn’t available to us.

tdcc %>% 
  group_by(member_casual, Month) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(Duration)) %>% 
  arrange(member_casual, Month)  %>% 
  ggplot(aes(x = Month, y = average_duration , fill = member_casual)) +
  geom_col(position = "dodge")+ labs(title = "Analytic Nabeeling", subtitle = ("The average ride length(Duration) by the different Users per month"),
                                     y ="Duration in seconds",x="Month", caption = "By: Nabeel el Habbash") +scale_x_continuous(breaks=seq(1,12,1))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
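The message about the <difftime> scale appears because Duration is a difftime column. One way to avoid it (an optional tweak, not what was run for this report) is to convert the durations to plain numbers before plotting. The toy data frame below re-uses the two group means reported above; the `avg` and `duration_mins` names are made up.

```r
# Toy summary table using the two group means reported above (in seconds)
avg <- data.frame(
  member_casual = c("casual", "member"),
  Duration = as.difftime(c(1440.23, 747.7504), units = "secs")
)

# Convert the difftime column to plain numbers, here in minutes
avg$duration_mins <- as.numeric(avg$Duration, units = "mins")
avg
```

A numeric column like duration_mins can be mapped to the y axis directly, and minutes may be easier for a stakeholder to read than seconds.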

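One way to partly work around the limitation mentioned above is to aggregate all the rides per day by user type. The plot below shows the average ride duration per day for each user type. Together with the rides-per-day plot above, it shows that members ride more often while casual riders ride for longer.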

  tdcc %>% 
  group_by(member_casual, Day) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(Duration)) %>% 
  arrange(member_casual, Day)  %>% 
  ggplot(aes(x = Day, y = average_duration , fill = member_casual)) +
  geom_col(position = "dodge")+ labs(title = "Analytic Nabeeling", subtitle = ("The average ride length(Duration) by the different Users per day"),
                                     y ="Duration in seconds",x="Day", caption = "By: Nabeel el Habbash")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

By month, we find that casual users are on average using the bikes for longer in every month.

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

To see exact numbers we can run this simple table, which shows the rides per month by user type.

  tabyl(tdcc,Month, member_casual)

##  Month casual member
##      1  12605  67523
##      2  15143  74031
##      3  67150 148821
##      4  91889 180657
##      5 220232 282284
##      6 292053 328258
##      7 311649 330980
##      8 270074 335201
##      9 220905 314214
##     10 151312 262926
##     11  72857 180108
##     12  31502 103891

And finally I want to add 1 more plot, which plots the months against the number of rides, with the phases displayed on separate lines. I’m not sure why, but I had many issues trying to get this displayed correctly.. I’ll get it though.. sooner or later it will come. 1 day later, it came!

tdcc_final_G<- count(tdcc, Phase, Month, member_casual) 

ggplot(data=tdcc_final_G, mapping = aes(x= Month, y= n, fill=Phase, color= Phase))+
  scale_x_continuous(breaks=seq(1,12,1))+
  geom_point()+geom_line(linewidth=1)+ facet_grid(~member_casual) + 
  labs(title = "Phase by users per month" , caption="By: Nabeel el Habbash")

Answering the questions initially posed:

How do annual (users) members and casual riders use Cyclistic bikes differently?

From the graphs we can see a few trends:

Casual users use bikes for a lot longer on average. We can see in the above graphs titled “Analytic Nabeeling” that casual riders ride for longer on every day of the week and in every month of the year. In fact, from the code below we can see that their average ride is roughly 93% longer than a member’s, i.e. almost double.

aggregate(tdcc$Duration ~ tdcc$member_casual, FUN = mean)

##   tdcc$member_casual  tdcc$Duration
## 1             casual 1440.2300 secs
## 2             member  747.7504 secs

Members use the bikes more frequently, as can be seen in the graph “Rides per day by each user” and in the rides-per-month table above. We can see from the table below that they make about 48% more rides than casual users do.

count(tdcc, member_casual)

## # A tibble: 2 × 2
##   member_casual       n
##   <chr>           <int>
## 1 casual        1757371
## 2 member        2608894
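The two headline comparisons can be reproduced from the printed figures alone. A short sketch of the arithmetic, using only numbers copied from the outputs above; the variable names are made up for the example.

```r
# Figures copied from the outputs above
mean_secs <- c(casual = 1440.23, member = 747.7504)
rides     <- c(casual = 1757371, member = 2608894)

# How much longer the average casual ride is, in percent
longer_pct <- 100 * (mean_secs[["casual"]] / mean_secs[["member"]] - 1)

# How many more rides members make, in percent
more_rides_pct <- 100 * (rides[["member"]] / rides[["casual"]] - 1)

round(c(longer_pct = longer_pct, more_rides_pct = more_rides_pct), 1)
```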

Suggestions/ Recommendations to stakeholders:

The data supports the stakeholder’s aim of converting casual riders into members, so it is a wise decision to focus resources on this business task.
We can see that casuals are over-represented on the weekends, so perhaps this is something the company could use to convert casual users into members.
Perhaps changing the pricing, offering weekday discounts, or letting each casual rider see how much the service would have cost them if they were a member.
Charging casuals by ride duration, if this is not already being done, is essential. It would capitalise on their average ride being almost twice as long as a member’s, and be a strong case for them to become members instead of abusing their ride length as they currently do.
A method of saving money would be to perhaps phase out the docked bikes, as they appear not to be widely used.
Given the massive drop-off in rides in the winter months, perhaps they could introduce 2 pricings for the users: peak and off-peak (January, February, March, November, December) months.

Lessons/Takeaways from this project:

You don’t have to know everything before going in. Persist, chip away at it continuously and you will overcome the obstacles.
Rely on Google and learn how to write questions on Stack Overflow, because the users there have no chill and will down-vote you mercilessly! Here’s my account https://stackoverflow.com/users/5805610/nabeel-el-habbash , I asked 2 questions related to this project.
Decide on a naming convention before you start making changes to dataframes and making new data sets, variables etc. It will save you time and make your work more legible later.
Always keep an original file that you don’t alter. If you make mistakes, delete something or mess up, this will save you later.
Making the R Markdown was the most difficult part, as it had to be clearly organised, free of mistakes and consistent. The aim isn’t to show off the coding, but to show the insights gained from the analysis.
Over 500 lines of code were written, only about 100 made it to the R Markdown. Again, the process isn’t the main concern for the stakeholder. The hours put in, the Stack Overflow questions, the googling, the research, the trial and error and the personal achievements are personal to you and in the long run will stand to you, but the results are what is seen and spoken about. “The tip of the iceberg is what notifies the sailor of the iceberg.” - nabeel

Nabeel El Habbash

2023-09-12

Finally: Thanks for reading ;)