This is the RStudio portion of the Cyclistic Capstone project in R Markdown format. This includes the additional processing and analysis prior to graphical visualization of the findings for the study.
library(tidyverse)
library(ggplot2)
library(stringr)
library(lubridate)
The study imports the previously aggregated Cyclistic table with the new columns month, day, and ride_length_minutes.
setwd('/Users/shasy/OneDrive/Documents/Google Capstone Case Study')
cyclistic_r <- read.csv("cyclistic_r.csv")
Additional processing was completed on the dataframe to convert the date columns to date types and also to fix the “ride_id” column name.
cyclistic_r$started_at <- as.POSIXlt(cyclistic_r$started_at, tz = "US/Central", format ="%Y-%m-%d %H:%M:%OS")
cyclistic_r$ended_at <- as.POSIXlt(cyclistic_r$ended_at, tz = "US/Central", format ="%Y-%m-%d %H:%M:%OS")
cyclistic_r <- cyclistic_r %>%
rename(ride_id = ï..ride_id)
Intel from the company alerted us that a few hundred entries of bikes were taken out of docks and checked for quality. The study checks for the station in question and ride length of less than 0 seconds. It was also observed that the rider type column as NULL values, besides the two “casual” and “member” types. These observations were removed prior to analysis.
cyclistic_r %>%
filter(start_station_name == "HQ QR" | (cyclistic_r$ride_length_minutes * 60) < 0)
## [1] ride_id rideable_type member_casual
## [4] month day year
## [7] started_at ended_at ride_length_minutes
## [10] start_station_name start_station_id end_station_name
## [13] end_station_id
## <0 rows> (or 0-length row.names)
cyclistic_r <- cyclistic_r %>%
filter(member_casual != "NULL")
nrow(cyclistic_r)
## [1] 5428696
Viewing the column names and dimensions of the dataframe after completing additional processing.
colnames(cyclistic_r)
## [1] "ride_id" "rideable_type" "member_casual"
## [4] "month" "day" "year"
## [7] "started_at" "ended_at" "ride_length_minutes"
## [10] "start_station_name" "start_station_id" "end_station_name"
## [13] "end_station_id"
dim(cyclistic_r)
## [1] 5428696 13
head(cyclistic_r,5)
## ride_id rideable_type member_casual month day year
## 1 45C3CC5CE03D28D4 classic_bike member March Thursday 2021
## 2 CA87D1880FADBA7B classic_bike member March Thursday 2021
## 3 4316389A8645B70E classic_bike member March Wednesday 2021
## 4 DE6B73DE99EFEF95 classic_bike member March Sunday 2021
## 5 D63F84C50C094DB5 classic_bike member March Wednesday 2021
## started_at ended_at ride_length_minutes
## 1 2021-03-11 17:24:00 2021-03-11 17:32:00 8
## 2 2021-03-25 08:50:00 2021-03-25 08:59:00 9
## 3 2021-03-03 11:28:00 2021-03-03 11:35:00 7
## 4 2021-03-07 17:20:00 2021-03-07 17:44:00 24
## 5 2021-03-24 20:42:00 2021-03-24 20:51:00 9
## start_station_name start_station_id end_station_name
## 1 Wells St & Evergreen Ave TA1308000049 Wells St & Hubbard St
## 2 Wells St & Evergreen Ave TA1308000049 Wells St & Hubbard St
## 3 Wells St & Evergreen Ave TA1308000049 Wells St & Hubbard St
## 4 Lake Shore Dr & Wellington Ave TA1307000041 Wells St & Hubbard St
## 5 Lakeview Ave & Fullerton Pkwy TA1309000019 Clark St & Schiller St
## end_station_id
## 1 TA1307000151
## 2 TA1307000151
## 3 TA1307000151
## 4 TA1307000151
## 5 TA1309000024
The analysis finds the mean, median, maximum and minimum of the ride_length_minutes column. These statistics are then compared across both rider types.
aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual, FUN = mean)
## cyclistic_r$member_casual cyclistic_r$ride_length_minutes
## 1 casual 31.97685
## 2 member 13.65379
aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual, FUN = median)
## cyclistic_r$member_casual cyclistic_r$ride_length_minutes
## 1 casual 16
## 2 member 10
aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual, FUN = max)
## cyclistic_r$member_casual cyclistic_r$ride_length_minutes
## 1 casual 55692
## 2 member 1560
aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual, FUN = min)
## cyclistic_r$member_casual cyclistic_r$ride_length_minutes
## 1 casual 0
## 2 member 0
The analysis examines the number of rides by rider type and weekday. This table was imported from previous work in Microsoft SQL Server.
rider_by_day <- read.csv("rider by weekday.csv")
rider_by_day <- rider_by_day %>%
rename(member_casual = ï..member_casual, number_of_rides = riders_by_type)
The following is a bar plot illustrating the “rider_by_day” dataframe:
ggplot(data = rider_by_day, aes(x = day, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Number of Rides by Weekday", subtitle = "From the Cyclistic Bike Trips data set", x = "Weekdays", y =
"Number of Rides")
The analysis examines the average ride length by weekday and sorted by rider type.
ride_averages <- cyclistic_r %>%
select(member_casual, day, ride_length_minutes) %>%
group_by(member_casual, day) %>%
summarise(number_of_rides = n(), mean_rides = mean(ride_length_minutes)) %>%
arrange(member_casual, desc(number_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
The following is a bar plot illustrating the “rider_averages” dataframe:
ggplot(data = ride_averages, aes(x = day, y = mean_rides, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Ride Length Averages by Weekday", subtitle = "From the Cyclistic Bike Trips data set", x = "Weekdays", y = "Average Length of Rides")
The “counts” object was created to write a csv file for further visualization in Tableau
counts <- aggregate(cyclistic_r$ride_length_minutes ~ cyclistic_r$member_casual + cyclistic_r$day, FUN = mean)
write.csv(x = counts, file = "avg_ride_length.csv")