BANA7025 Data Wrangling

CITIBIKES - EXPERIENCE NYC IN A WHOLE NEW WAY

Citi Bike is the nation’s largest bike share program, with 20,000 bikes and over 1,300 stations across Manhattan, Brooklyn, Queens, the Bronx and Jersey City. It was designed for quick trips with convenience in mind, and it’s a fun and affordable way to get around town. Riding a bike in NYC has never been better! With more than 1,200 miles of bike lanes, pedaling around is more convenient than ever. Biking is the best way to see NYC. It’s a quick and affordable way to get all around the city, and it even allows you to sightsee along the way.

Citi Bike publishes downloadable files of Citi Bike trip data. The data includes:

Trip Duration (seconds)
Start Time and Date
Stop Time and Date
Start Station Name
End Station Name
Station ID
Station Lat/Long
Bike ID
User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
Gender (Zero=unknown; 1=male; 2=female)
Year of Birth

INSTALLING PACKAGES

my_path <- getwd()
.libPaths(my_path)

# Installing all required packages in one call, over HTTPS
install.packages(c("dplyr", "tidyr", "readr", "ggplot2", "tidyverse", "rlang"),
                 repos = "https://cran.us.r-project.org")
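
Running the install step on every execution reinstalls packages that are already present. The following is a minimal sketch of one way to find out which packages still need installing, using base R's `requireNamespace()`. The helper name `missing_packages` is hypothetical and not part of the original homework, and the install call is left commented out so the sketch has no side effects.

```r
# Hypothetical helper: return the entries of `pkgs` whose namespace
# cannot be loaded (typically because the package is not installed)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("dplyr", "tidyr", "readr", "ggplot2", "tidyverse", "rlang")
to_install <- missing_packages(needed)

# Uncomment to install only what is missing
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = "https://cran.us.r-project.org")
# }
```

If everything is already installed, `to_install` is an empty character vector and nothing would be downloaded.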

USING LIBRARIES

library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(tidyverse)
library(rlang)

IMPORTING DATA

citi_bikes <- read_csv('202001-citibike-tripdata.csv')
dim(citi_bikes)

## [1] 1240596      15
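
`read_csv()` guesses each column's type from the data. As a hedged sketch, the types can instead be pinned with `col_types`, so that the later steps always see numeric codes. The two rows below are made up, only three of the fifteen columns are shown, and passing literal text through `I()` assumes readr 2.0 or later.

```r
library(readr)

# Two made-up rows using three of the column names from the trip file
csv_text <- "tripduration,birth year,gender\n600,1995,1\n3600,1980,2\n"

trips <- read_csv(
  I(csv_text),
  col_types = cols(
    tripduration = col_double(),   # seconds
    `birth year` = col_integer(),
    gender       = col_integer()   # 0, 1 or 2
  )
)

dim(trips)  # number of rows, then number of columns
```

The same `col_types` argument would apply when reading the real file by name.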

DATA CLEANING AND MANIPULATION

# Recoding 0, 1 and 2 in 'gender' column as 'UNKNOWN', 'MALE' and 'FEMALE' respectively
citi_bikes$gender <- factor(citi_bikes$gender, levels = c(0, 1, 2),
                            labels = c("UNKNOWN", "MALE", "FEMALE"))

# Converting 'tripduration' from seconds to hours
citi_bikes$tripduration <- citi_bikes$tripduration / (60*60)

# Adding column 'age', calculated from 'birth year' (taken as 1 January of that year) and the current date
citi_bikes$`birth year` <- as.Date(ISOdate(citi_bikes$`birth year`, 1, 1))
citi_bikes$age <- as.numeric(difftime(Sys.Date(), citi_bikes$`birth year`, units = "days")) / 365.25

# Keeping only rows with 'age' below 100 years
citi_bikes <- citi_bikes %>% filter(age < 100)

# Adding column 'age_group' to partition customers according to their age
# right = FALSE makes each bin include its lower bound, so age 18 falls in 'ADULT (18+)'
partition <- c(0, 18, 30, 60, 100)
classifier_tags <- c('UNDER 18', 'ADULT (18+)', 'MIDDLE AGE(30+)', 'SENIOR CITIZEN (60+)')
citi_bikes$age_group <- cut(citi_bikes$age, breaks = partition, labels = classifier_tags, right = FALSE)

# Checking if there are any missing values in the dataset
colSums(is.na(citi_bikes))

##            tripduration               starttime                stoptime 
##                       0                       0                       0 
##        start station id      start station name  start station latitude 
##                       0                       0                       0 
## start station longitude          end station id        end station name 
##                       0                       0                       0 
##    end station latitude   end station longitude                  bikeid 
##                       0                       0                       0 
##                usertype              birth year                  gender 
##                       0                       0                       0 
##                     age               age_group 
##                       0                       0
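
The cleaning steps above can be traced on a handful of rows. This is a minimal sketch in base R, not the code that produced the output above: the data frame `toy` and its values are made up, `birth_year` stands in for the real `birth year` column, a fixed `ref_year` replaces `Sys.Date()` so the result is reproducible, and `right = FALSE` is an assumption about how the age bins are meant to include their lower bound.

```r
# Made-up rows standing in for the trip data
toy <- data.frame(
  tripduration = c(600, 3600, 5400, 900),   # seconds
  birth_year   = c(1995, 1980, 1950, 2004),
  gender       = c(1, 2, 0, 1)
)

# Recode the gender codes
toy$gender <- factor(toy$gender, levels = c(0, 1, 2),
                     labels = c("UNKNOWN", "MALE", "FEMALE"))

# Seconds to hours
toy$tripduration_hours <- toy$tripduration / (60 * 60)

# Age as a whole-year difference from a fixed reference year (assumption)
ref_year <- 2022
toy$age <- ref_year - toy$birth_year

# Keep plausible ages, then bin them; each bin includes its lower bound
toy <- toy[toy$age < 100, ]
toy$age_group <- cut(toy$age,
                     breaks = c(0, 18, 30, 60, 100),
                     labels = c("UNDER 18", "ADULT (18+)",
                                "MIDDLE AGE(30+)", "SENIOR CITIZEN (60+)"),
                     right = FALSE)

# Count missing values per column
colSums(is.na(toy))
```

With these rows the ages are 27, 42, 72 and 18, so the last rider lands in 'ADULT (18+)' rather than 'UNDER 18'.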

VISUALIZATION

Here I visualize the Citi Bike trip data. The plot shows trip duration by gender, with one panel per user type. The colour of each point indicates the age group of the rider.

ggplot(data = citi_bikes, mapping = aes(x = tripduration, y = gender, colour = age_group)) +
  geom_point() +
  # Show trips of up to 300 hours; longer trips are dropped from the plot
  scale_x_continuous(breaks = seq(0, 300, 50),
                     limits = c(0, 300)) +
  # One panel per user type
  facet_wrap(~ usertype) +
  labs(title = "CITI BIKES",
       subtitle = "Trip Rides in NYC",
       y = "Gender",
       x = "Trip Duration (hours)",
       colour = "Age Group") +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.text = element_text(color = "black", size = 6))

INSIGHTS

From the visualization above, I gathered the following insights.

Few Senior Citizens (60+ years) ride Citi Bike in NYC.
Trip duration does not appear to depend on the gender of the user. Male and female riders take trips of similar length.
Adults (18 - 30 years) tend to be Customers who buy a 24-hour pass or 3-day pass, whereas Middle Age riders (30 - 60 years) tend to subscribe to Citi Bike as Annual Members.
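
Insights like these are read off the plot, and a summary table is one way to check them numerically. The sketch below assumes dplyr and uses a made-up tibble `toy` with the same column names as the cleaned data. Its counts and medians are illustrative only and are not results from the Citi Bike file.

```r
library(dplyr)

# Made-up rides with the same column names as the cleaned data
toy <- tibble(
  usertype     = c("Customer", "Customer", "Subscriber",
                   "Subscriber", "Subscriber", "Customer"),
  gender       = c("MALE", "FEMALE", "MALE", "FEMALE", "MALE", "UNKNOWN"),
  age_group    = c("ADULT (18+)", "ADULT (18+)", "MIDDLE AGE(30+)",
                   "MIDDLE AGE(30+)", "SENIOR CITIZEN (60+)", "ADULT (18+)"),
  tripduration = c(0.5, 0.4, 0.2, 0.3, 0.25, 1.0)   # hours
)

# Number of rides per user type and age group
rides_by_group <- toy %>%
  count(usertype, age_group, name = "rides")

# Median trip duration per gender
duration_by_gender <- toy %>%
  group_by(gender) %>%
  summarise(median_hours = median(tripduration), rides = n())

rides_by_group
duration_by_gender
```

Run against `citi_bikes`, the first table would show how riders in each age group split between Customers and Subscribers, and the second would show whether typical trip length differs by gender.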

BANA7025 Data Wrangling - Homework 4

SURABHI KULKARNI

02/18/2022