Citi Bike is the nation’s largest bike share program, with 20,000 bikes and over 1,300 stations across Manhattan, Brooklyn, Queens, the Bronx and Jersey City. It was designed for quick trips with convenience in mind, and it’s a fun and affordable way to get around town. With more than 1,200 miles of bike lanes in NYC, pedaling around is more convenient than ever, and biking lets you sightsee along the way.
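Citi Bike publishes downloadable files of Citi Bike trip data. The data includes trip duration (in seconds), start and stop times, start and end station ids, names, latitudes and longitudes, bike id, user type, birth year and gender.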
my_path <- getwd()
.libPaths(my_path)
install.packages("dplyr",repos = "http://cran.us.r-project.org")
install.packages("tidyr",repos = "http://cran.us.r-project.org")
install.packages("readr",repos = "http://cran.us.r-project.org")
install.packages("ggplot2",repos = "http://cran.us.r-project.org")
install.packages("tidyverse",repos = "http://cran.us.r-project.org")
install.packages("rlang",repos = "http://cran.us.r-project.org")
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(tidyverse)
library(rlang)
citi_bikes <- read_csv('202001-citibike-tripdata.csv')
dim(citi_bikes)
## [1] 1240596      15
# Replacing 0, 1 and 2 in 'gender' column with 'UNKNOWN', 'MALE' and 'FEMALE' respectively
citi_bikes$gender[citi_bikes$gender == 0]<- "UNKNOWN"
citi_bikes$gender[citi_bikes$gender == 1]<- "MALE"
citi_bikes$gender[citi_bikes$gender == 2]<- "FEMALE"
# Converting 'tripduration' from seconds to hours
citi_bikes$tripduration <- citi_bikes$tripduration / (60*60)
# Adding column 'age', computed from 'birth year' (taken as January 1 of that year) and the current date
citi_bikes$`birth year` <- as.Date(ISOdate(citi_bikes$`birth year`, 1, 1))
# Dividing days by 365.25 gives years (52.25 weeks per year slightly understates age)
citi_bikes$age <- as.numeric(difftime(Sys.Date(), citi_bikes$`birth year`, units = "days")) / 365.25
# Keeping only rows with 'age' below 100 years (drops implausible birth years)
citi_bikes <- citi_bikes %>% filter(age < 100)
# Adding column 'age_group' to partition customers according to their age
partition <-c(-1,18,30,60,100)
classifier_tags<-c('UNDER 18','ADULT (18+)','MIDDLE AGE(30+)', 'SENIOR CITIZEN (60+)')
citi_bikes$age_group<-cut(citi_bikes$age, breaks=partition, labels=classifier_tags)
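The same cleaning steps can also be written as a single dplyr pipeline. The following is a minimal sketch on a small made-up table, not the real trip file: the rows in `toy_trips`, the fixed `reference_year`, and the whole-year age calculation are all assumptions chosen so that the result is reproducible.

```r
library(dplyr)

# Made-up rows for illustration; column names follow the trip file
toy_trips <- tibble(
  tripduration = c(600, 5400, 1800, 900),      # seconds
  `birth year` = c(2005, 1995, 1980, 1950),
  gender       = c(0, 1, 2, 1)
)

reference_year <- 2020  # assumption: fixed year instead of Sys.Date()

toy_clean <- toy_trips %>%
  mutate(
    # Map the numeric gender codes to labels; unmatched codes become NA
    gender = case_when(
      gender == 0 ~ "UNKNOWN",
      gender == 1 ~ "MALE",
      gender == 2 ~ "FEMALE"
    ),
    # Seconds to hours
    tripduration = tripduration / (60 * 60),
    # Simplified age in whole years
    age = reference_year - `birth year`,
    # cut() uses right-closed intervals by default: (-1,18], (18,30], (30,60], (60,100]
    age_group = cut(
      age,
      breaks = c(-1, 18, 30, 60, 100),
      labels = c("UNDER 18", "ADULT (18+)", "MIDDLE AGE(30+)", "SENIOR CITIZEN (60+)")
    )
  ) %>%
  filter(age < 100)

toy_clean
```

Keeping every transformation inside one `mutate()` call leaves the original table untouched, which makes it easier to rerun the steps after changing a break point or a label.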
# Checking if there are any missing values in the dataset
colSums(is.na(citi_bikes))
##            tripduration               starttime                stoptime
##                       0                       0                       0
##        start station id      start station name  start station latitude
##                       0                       0                       0
## start station longitude          end station id        end station name
##                       0                       0                       0
##    end station latitude   end station longitude                  bikeid
##                       0                       0                       0
##                usertype              birth year                  gender
##                       0                       0                       0
##                     age               age_group
##                       0                       0
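The per-column count of missing values can also be computed with dplyr. This is a small sketch on a made-up table (`toy` and its values are assumptions for illustration). It also shows that rows labelled "UNKNOWN" are ordinary strings, so `is.na()` does not count them as missing.

```r
library(dplyr)

# Made-up table with a few missing values
toy <- tibble(
  tripduration = c(0.2, NA, 1.5),
  usertype     = c("Subscriber", "Customer", NA),
  gender       = c("MALE", "FEMALE", "UNKNOWN")
)

# One row holding the number of NA values in each column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# "UNKNOWN" is a label, not NA, so it has to be counted separately
unknown_gender <- sum(toy$gender == "UNKNOWN")
unknown_gender
```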
Here I visualize the Citi Bike trip data. The plot shows trip duration by gender, with one panel per user type; the colour of each point indicates the rider's age group.
ggplot(data = citi_bikes, mapping = aes(x = tripduration, y = gender, colour = age_group)) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 300, 50),
                     limits = c(0, 300)) +
  scale_size(range = c(1, 4)) +
  facet_wrap(~ usertype) +
  labs(title = "CITI BIKES ",
       subtitle = "Trip Rides in NYC",
       y = "Gender",
       x = "Trip Duration (hours)",
       colour = "Age Group") +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.text = element_text(color = "black", size = 6))
From the above visualization, I gathered the following insights.