The R Markdown

This is the RStudio portion of the Cyclistic Capstone project in R Markdown format. This includes the additional processing and analysis prior to graphical visualization of the findings for the study.

library(tidyverse)
library(ggplot2)
library(stringr)
library(lubridate)

Adding the Data

The study imports the previously aggregated Cyclistic table with the new columns month, day, and ride_length_minutes.

setwd('/Users/shasy/OneDrive/Documents/Google Capstone Case Study')
cyclistic_r <- read.csv("cyclistic_r.csv")

Reformatting Data types and Fixing Column Names

Additional processing was completed on the dataframe to convert the date columns to date types and also to fix the “ride_id” column name.

cyclistic_r$started_at <- as.POSIXlt(cyclistic_r$started_at, tz = "US/Central", format ="%Y-%m-%d %H:%M:%OS")
cyclistic_r$ended_at <- as.POSIXlt(cyclistic_r$ended_at, tz = "US/Central", format ="%Y-%m-%d %H:%M:%OS")
cyclistic_r <- cyclistic_r %>% 
  rename(ride_id = ï..ride_id)

Data Validation Checks

Intel from the company alerted us that a few hundred entries of bikes were taken out of docks and checked for quality. The study checks for the station in question and ride length of less than 0 seconds. It was also observed that the rider type column as NULL values, besides the two “casual” and “member” types. These observations were removed prior to analysis.

cyclistic_r %>% 
  filter(start_station_name == "HQ QR" | (cyclistic_r$ride_length_minutes * 60) < 0)
##  [1] ride_id             rideable_type       member_casual      
##  [4] month               day                 year               
##  [7] started_at          ended_at            ride_length_minutes
## [10] start_station_name  start_station_id    end_station_name   
## [13] end_station_id     
## <0 rows> (or 0-length row.names)
cyclistic_r <- cyclistic_r %>% 
  filter(member_casual != "NULL")
nrow(cyclistic_r)
## [1] 5428696

Viewing the Dataframe

Viewing the column names and dimensions of the dataframe after completing additional processing.

colnames(cyclistic_r)
##  [1] "ride_id"             "rideable_type"       "member_casual"      
##  [4] "month"               "day"                 "year"               
##  [7] "started_at"          "ended_at"            "ride_length_minutes"
## [10] "start_station_name"  "start_station_id"    "end_station_name"   
## [13] "end_station_id"
dim(cyclistic_r)
## [1] 5428696      13
head(cyclistic_r,5)
##            ride_id rideable_type member_casual month       day year
## 1 45C3CC5CE03D28D4  classic_bike        member March  Thursday 2021
## 2 CA87D1880FADBA7B  classic_bike        member March  Thursday 2021
## 3 4316389A8645B70E  classic_bike        member March Wednesday 2021
## 4 DE6B73DE99EFEF95  classic_bike        member March    Sunday 2021
## 5 D63F84C50C094DB5  classic_bike        member March Wednesday 2021
##            started_at            ended_at ride_length_minutes
## 1 2021-03-11 17:24:00 2021-03-11 17:32:00                   8
## 2 2021-03-25 08:50:00 2021-03-25 08:59:00                   9
## 3 2021-03-03 11:28:00 2021-03-03 11:35:00                   7
## 4 2021-03-07 17:20:00 2021-03-07 17:44:00                  24
## 5 2021-03-24 20:42:00 2021-03-24 20:51:00                   9
##               start_station_name start_station_id       end_station_name
## 1       Wells St & Evergreen Ave     TA1308000049  Wells St & Hubbard St
## 2       Wells St & Evergreen Ave     TA1308000049  Wells St & Hubbard St
## 3       Wells St & Evergreen Ave     TA1308000049  Wells St & Hubbard St
## 4 Lake Shore Dr & Wellington Ave     TA1307000041  Wells St & Hubbard St
## 5  Lakeview Ave & Fullerton Pkwy     TA1309000019 Clark St & Schiller St
##   end_station_id
## 1   TA1307000151
## 2   TA1307000151
## 3   TA1307000151
## 4   TA1307000151
## 5   TA1309000024

Descriptive Analysis

The analysis finds the mean, median, maximum and minimum of the ride_length_minutes column. These statistics are then compared across both rider types.

aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual, FUN = mean)
##   cyclistic_r$member_casual cyclistic_r$ride_length_minutes
## 1                    casual                        31.97685
## 2                    member                        13.65379
aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual, FUN = median)
##   cyclistic_r$member_casual cyclistic_r$ride_length_minutes
## 1                    casual                              16
## 2                    member                              10
aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual, FUN = max)
##   cyclistic_r$member_casual cyclistic_r$ride_length_minutes
## 1                    casual                           55692
## 2                    member                            1560
aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual, FUN = min)
##   cyclistic_r$member_casual cyclistic_r$ride_length_minutes
## 1                    casual                               0
## 2                    member                               0

Number of Rides by Day of Week

The analysis examines the number of rides by rider type and weekday. This table was imported from previous work in Microsoft SQL Server.

rider_by_day <- read.csv("rider by weekday.csv")
rider_by_day <- rider_by_day %>% 
  rename(member_casual = ï..member_casual, number_of_rides = riders_by_type)

The following is a bar plot illustrating the “rider_by_day” dataframe:

ggplot(data = rider_by_day, aes(x = day, y = number_of_rides, fill = member_casual)) + 
  geom_col(position = "dodge") +
  labs(title = "Number of Rides by Weekday", subtitle = "From the Cyclistic Bike Trips data set", x = "Weekdays", y = 
         "Number of Rides")

Average Ride Length by Weekday and Ride Type

The analysis examines the average ride length by weekday and sorted by rider type.

ride_averages <- cyclistic_r %>% 
  select(member_casual, day, ride_length_minutes) %>% 
  group_by(member_casual, day) %>% 
  summarise(number_of_rides = n(), mean_rides = mean(ride_length_minutes)) %>%
  arrange(member_casual, desc(number_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.

The following is a bar plot illustrating the “rider_averages” dataframe:

ggplot(data = ride_averages, aes(x = day, y = mean_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Ride Length Averages by Weekday", subtitle = "From the Cyclistic Bike Trips data set", x = "Weekdays", y = "Average Length of Rides")

The “counts” object was created to write a csv file for further visualization in Tableau

counts <- aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual + cyclistic_r$day, FUN = mean)
write.csv(x = counts, file = "avg_ride_length.csv")