This analysis is based on the Divvy case study “‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization” written by Kevin Hartman (found here: https://artscience.blog/home/divvy-dataviz-case-study). Its purpose is to consolidate the downloaded Divvy data (found here: https://divvy-tripdata.s3.amazonaws.com/index.html), wrangle it, and conduct a simple analysis to help answer the key question: “In what ways do members and casual riders use Divvy bikes differently?”
knitr::include_graphics("C:/Users/Sam/Desktop/Capstone/Pivot_202004.PNG")
The summary shown above was then loaded into R for further analysis and visualization.
Setting up the working environment
Loaded the tidyverse and readxl libraries
library("tidyverse")
library(readxl)
Set the working directory
setwd("C:/Users/Sam/Desktop/Capstone")
getwd()
## [1] "C:/Users/Sam/Desktop/Capstone"
Imported the four quarterly summary tables from the Summary_Data.xlsx Excel file into R
Q2_2020 <- read_excel("Summary_Data.xlsx", sheet = "Annual", range = "A2:E9")
Q3_2020 <- read_excel("Summary_Data.xlsx", sheet = "Annual", range = "G2:K9")
Q4_2020 <- read_excel("Summary_Data.xlsx", sheet = "Annual", range = "A14:E21")
Q1_2021 <- read_excel("Summary_Data.xlsx", sheet = "Annual", range = "G14:K21")
Combined Q2_2020, Q3_2020, Q4_2020, and Q1_2021 to form the consolidated annual data
annual_data <- rbind(Q2_2020, Q3_2020, Q4_2020, Q1_2021)
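The stacking step above can also be sketched with dplyr's bind_rows(), which can record which table each row came from. This is only an illustration on made-up toy tables; the names q2, q3, combined, and the quarter column are assumptions and are not part of the original analysis.

```r
library(dplyr)

# Toy stand-ins for two quarterly tables (values are made up for illustration).
q2 <- data.frame(day_of_week = c("Sunday", "Monday"),
                 number_of_rides_casual = c(10, 5))
q3 <- data.frame(day_of_week = c("Sunday", "Monday"),
                 number_of_rides_casual = c(20, 8))

# bind_rows() stacks the tables like rbind(); .id adds a column holding
# the list name of the table each row came from.
combined <- bind_rows(list(Q2_2020 = q2, Q3_2020 = q3), .id = "quarter")
print(combined)
```

Keeping a quarter label is optional here, but it makes it possible to check later which quarter contributed which rows.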
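Converted day_of_week to a factor with its levels in calendar order (Sunday through Saturday), so that tables and charts list the days in that order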
annual_data$day_of_week <- factor(annual_data$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
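A small sketch of why the explicit levels matter: without them, R orders factor levels alphabetically, which would scramble the weekdays on a chart axis. The vectors days, default_factor, and calendar_factor are hypothetical names used only for this illustration.

```r
days <- c("Monday", "Saturday", "Sunday")

# Default behaviour: levels are sorted alphabetically.
default_factor <- factor(days)

# Explicit levels: the days keep their calendar order.
calendar_factor <- factor(days, levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                           "Thursday", "Friday", "Saturday"))

levels(default_factor)              # "Monday" "Saturday" "Sunday"
as.character(sort(calendar_factor)) # "Sunday" "Monday" "Saturday"

# Note: factor(..., levels = ) fixes the level order but does not create an
# ordered factor in R's sense; that would require ordered = TRUE.
is.ordered(calendar_factor)         # FALSE
```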
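Grouped by day_of_week and summed the columns number_of_rides_casual, average_duration_casual, number_of_rides_member, and average_duration_member. Note that the duration columns are summed across the four quarterly tables, so the resulting values are sums of quarterly averages rather than true annual averages.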
all_trips <- annual_data %>%
  group_by(day_of_week) %>%
  summarize(Annual_number_of_rides_casual = sum(number_of_rides_casual),
            Annual_number_of_rides_member = sum(number_of_rides_member),
            Annual_average_duration_casual = sum(average_duration_casual),
            Annual_average_duration_member = sum(average_duration_member))
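If a true annual average duration is wanted instead of a sum of quarterly averages, one possible approach is to weight each quarterly average by its ride count. The sketch below is not the method used in this analysis; it runs on a made-up toy table (toy, weighted) and shows the casual columns only, on the assumption that the member columns would be handled the same way.

```r
library(dplyr)

# Toy data: two quarters per weekday (values are made up for illustration).
toy <- data.frame(
  day_of_week             = c("Sunday", "Sunday", "Monday", "Monday"),
  number_of_rides_casual  = c(100, 300, 50, 150),
  average_duration_casual = c(2000, 1000, 1200, 800)
)

weighted <- toy %>%
  group_by(day_of_week) %>%
  summarize(
    Annual_number_of_rides_casual = sum(number_of_rides_casual),
    # Weight each quarterly average by the number of rides behind it.
    Annual_average_duration_casual = weighted.mean(average_duration_casual,
                                                   number_of_rides_casual)
  )
print(weighted)
```

With these toy numbers, Sunday's weighted average is (2000 × 100 + 1000 × 300) / 400 = 1250 seconds, whereas summing the two quarterly averages would give 3000.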
Created the clustered column chart using the ggplot2 library to visualize day_of_week vs. number of rides.
Reshaped the data using the gather() function from the tidyr library to convert the wide format to a long format.
all_trips_v2 <- gather(all_trips, key = "Column", value = "Number_of_rides", Annual_number_of_rides_casual, Annual_number_of_rides_member)
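The tidyr documentation describes gather() as superseded by pivot_longer(). A rough equivalent of the reshaping step might look like the sketch below, shown on a made-up toy table; toy_trips and long_trips are hypothetical names. The rows may come out in a different order than with gather(), which does not affect the chart.

```r
library(tidyr)

# Toy wide table with one row per weekday (values are made up for illustration).
toy_trips <- data.frame(
  day_of_week                   = c("Sunday", "Monday"),
  Annual_number_of_rides_casual = c(400, 200),
  Annual_number_of_rides_member = c(300, 500)
)

# names_to plays the role of gather()'s key, values_to the role of value.
long_trips <- pivot_longer(
  toy_trips,
  cols      = c(Annual_number_of_rides_casual, Annual_number_of_rides_member),
  names_to  = "Column",
  values_to = "Number_of_rides"
)
print(long_trips)
```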
Created clustered column chart using the ggplot() function and day_of_week as x-axis, Number_of_rides as y-axis, fill color, and other chart labels.
ggplot(all_trips_v2, aes(x = day_of_week, y = Number_of_rides, fill = Column)) +
  geom_col(position = "dodge") +
  labs(x = "day_of_week", y = "Number_of_rides", fill = "") +
  scale_fill_manual(values = c("#C40000", "#29C6D7")) +
  theme_classic() +
  scale_y_continuous(labels = scales::comma)
Created the clustered column chart using the ggplot2 library to visualize day_of_week vs. annual average duration of rides.
Reshaped the data using the gather() function from the tidyr library to convert the wide format to a long format.
all_trips_v3 <- gather(all_trips, key = "Column", value = "Average_duration", Annual_average_duration_casual, Annual_average_duration_member)
Created the clustered column chart using the ggplot() function with day_of_week as the x-axis, Average_duration (in seconds) as the y-axis, a fill color per rider type, and other chart labels.
ggplot(all_trips_v3, aes(x = day_of_week, y = Average_duration, fill = Column)) +
  geom_col(position = "dodge") +
  labs(x = "day_of_week", y = "Average_duration(sec)", fill = "") +
  scale_fill_manual(values = c("#ED7D31", "#799AD5")) +
  theme_classic() +
  scale_y_continuous(labels = scales::comma)