Bookdown contribution by S. Tinapunan
`tidyverse` to process data

This workflow tutorial demonstrates how to process data from start to finish using `tidyverse`.

In this tutorial, we will do the following:

- Use the `readr` package to read a CSV file.
- Use the `tidyr` package to transform data into tidy data.
- Use the `dplyr` package to group and summarize data.
- Use the `ggplot2` package to visualize data.

The data file was taken from https://data.fivethirtyeight.com/ under the data set `uber-tlc-foil-response`. This specific data comes from an Excel file named `Uber Weekday-Hour AverageTrips.xlsx`, from the sheet entitled `Trips Per Hour and Weekday`.

file <- "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Bookdown/Uber%20Weekday-Hour%20AverageTrips.csv"
library(tidyverse)
library(knitr)

`readr` package

The code below demonstrates how to use `readr::read_csv`, which is part of the `tidyverse`.

- `file` is a variable that contains the file path of the data set.
- `skip` indicates the number of lines to skip before reading data.
- `col_names` is either TRUE or FALSE or a character vector of column names.

The code below skips the first two rows of the file. In the raw data file, the first row describes the five columns in two groups: `Time` and `Average trips per hour and day of week`. The second row provides the column names; however, the code below supplies the column names explicitly, so both rows are skipped.

col_names <- c("weekday", "hour", "other_8_bases", "uber", "lyft")
data <- readr::read_csv(file, skip = 2, col_names = col_names)
## Parsed with column specification:
## cols(
## weekday = col_character(),
## hour = col_integer(),
## other_8_bases = col_number(),
## uber = col_number(),
## lyft = col_integer()
## )
Below is a preview of the data set.
weekday | hour | other_8_bases | uber | lyft |
---|---|---|---|---|
1 | 0 | 769 | 1458 | 323 |
1 | 1 | 645 | 1078 | 344 |
1 | 2 | 497 | 757 | 362 |
1 | 3 | 385 | 513 | 362 |
1 | 4 | 435 | 308 | 333 |
1 | 5 | 438 | 308 | 297 |
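
The effect of `skip` and `col_names` can be checked on a small in-memory sample rather than the full file. The sketch below is illustrative only: the two header lines in `csv_text` are a guess at the layout of the raw file, the three data rows are copied from the preview above, and passing literal text with `I()` assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical miniature of the raw file: two header lines, then data rows.
csv_text <- paste(
  "Time,,Average trips per hour and day of week,,",
  "Weekday,Hour,Other 8 bases,Uber,Lyft",
  "1,0,769,1458,323",
  "1,1,645,1078,344",
  "1,2,497,757,362",
  sep = "\n"
)

col_names <- c("weekday", "hour", "other_8_bases", "uber", "lyft")

# skip = 2 drops both header lines; col_names supplies the names instead.
toy <- read_csv(I(csv_text), skip = 2, col_names = col_names,
                show_col_types = FALSE)

print(toy)
```

If `skip` were set to 1 instead, the second header line would be read as a data row and the numeric columns would most likely be parsed as character.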

The code below uses the `tidyr` package to reshape the data so that each row holds one observation: the average number of trips for one car service at a given weekday and hour. The transformation gathers the columns `uber`, `lyft`, and `other_8_bases` into a key column named `car_service`, which stores the original column name. It also creates a column called `average_trip_per_hrday` to store the values that used to be spread across those three columns. The last argument (`3:5`) selects, by position, the columns to gather.

data_transform <- tidyr::gather(data, "car_service", "average_trip_per_hrday", 3:5)
Below is a preview of the data that has been transformed.
weekday | hour | car_service | average_trip_per_hrday |
---|---|---|---|
1 | 0 | other_8_bases | 769 |
1 | 1 | other_8_bases | 645 |
1 | 2 | other_8_bases | 497 |
1 | 3 | other_8_bases | 385 |
1 | 4 | other_8_bases | 435 |
1 | 5 | other_8_bases | 438 |
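
To make the reshaping concrete, the sketch below applies the same `gather` call to a two-row toy data frame built from the first rows of the original preview. It also shows `tidyr::pivot_longer`, which the tidyr documentation recommends over `gather` in newer releases. That alternative is not what this tutorial uses, and it assumes tidyr 1.0.0 or later.

```r
library(tidyr)

# Toy data: the first two rows of the wide preview table.
toy <- data.frame(
  weekday = c(1, 1),
  hour = c(0, 1),
  other_8_bases = c(769, 645),
  uber = c(1458, 1078),
  lyft = c(323, 344)
)

# Same call as in the tutorial: columns 3 to 5 become key/value pairs.
long_gather <- gather(toy, "car_service", "average_trip_per_hrday", 3:5)

# Assumed equivalent using the newer interface (same columns, same names).
long_pivot <- pivot_longer(
  toy,
  cols = 3:5,
  names_to = "car_service",
  values_to = "average_trip_per_hrday"
)

print(long_gather)
print(long_pivot)
```

Both results hold the same six observations. They differ only in row order: `gather` stacks one source column after another, while `pivot_longer` keeps the rows of each original record together.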

`dplyr` to calculate the average number of trips per hour for each car service

The `group_by` function groups the rows by `car_service`, and then by `hour`. The `summarise` function then calculates the mean of `average_trip_per_hrday` within each group. In effect, this averages each hour across all days of the week for each car service. The `arrange` function then orders the result by `car_service` and `hour`.

average_trips_perhour <-
  data_transform %>%
  dplyr::group_by(car_service, hour) %>%
  dplyr::summarise(average_trips = mean(average_trip_per_hrday)) %>%
  dplyr::arrange(car_service, hour)
average_trips_perhour
Below is a preview of the summarized data in `average_trips_perhour`: the average number of trips for each hour of the day (0 to 23, on a 24-hour clock) for each car service.
car_service | hour | average_trips |
---|---|---|
lyft | 0 | 299.1429 |
lyft | 1 | 296.0000 |
lyft | 2 | 287.0000 |
lyft | 3 | 263.1429 |
lyft | 4 | 206.0000 |
lyft | 5 | 152.5714 |
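
The grouping logic is easier to follow on a handful of rows. In the sketch below, the weekday 1 values come from the preview tables, while the weekday 2 values are made up for illustration. The `.groups = "drop"` argument is an addition that assumes dplyr 1.0.0 or later; it returns an ungrouped result and silences the regrouping message.

```r
library(dplyr)

# Toy long-format data: one hour, two weekdays, two car services.
# Weekday 2 values are hypothetical.
toy_long <- data.frame(
  weekday = c(1, 2, 1, 2),
  hour = c(0, 0, 0, 0),
  car_service = c("uber", "uber", "lyft", "lyft"),
  average_trip_per_hrday = c(1458, 1000, 323, 277)
)

toy_summary <- toy_long %>%
  group_by(car_service, hour) %>%
  # One output row per (car_service, hour) pair, averaged over weekdays.
  summarise(average_trips = mean(average_trip_per_hrday), .groups = "drop") %>%
  arrange(car_service, hour)

print(toy_summary)
# lyft, hour 0: (323 + 277) / 2 = 300
# uber, hour 0: (1458 + 1000) / 2 = 1229
```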
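
`ggplot2` to visualize data

The plot below shows the average number of trips per hour for each car service. The `ggplot` function sets up the plot, `geom_point` draws the scatter plot, and `geom_smooth` adds a loess smoothing curve.

Arguments for `geom_smooth`:

- `method` is the smoothing method; `"loess"` fits a local regression.
- `se` indicates whether to display the confidence interval around the curve.
- `fullrange` indicates whether the fit spans the full range of the plot or only the data.
- `level` is the confidence level of the interval.
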
ggplot(average_trips_perhour, aes(x = hour, y = average_trips, color = car_service)) +
  geom_point() +
  geom_smooth(method = "loess", se = TRUE, fullrange = FALSE, level = 0.95) +
  ggtitle("Average number of trips for each hour of the day")
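
For readers curious about what `geom_smooth(method = "loess")` draws, the sketch below fits a comparable curve directly with `loess` from base R. The hourly values are synthetic, and the band calculation is an approximation of what ggplot2 does internally. It is an assumption made for illustration, not part of this tutorial's workflow.

```r
# Synthetic hourly averages (not from the Uber data set).
toy <- data.frame(hour = 0:23)
toy$average_trips <- 300 + 100 * sin(toy$hour * pi / 12) + 10 * (toy$hour %% 3)

# Local regression with the default span and degree.
fit <- loess(average_trips ~ hour, data = toy)

# Fitted values plus standard errors at each observed hour.
pred <- predict(fit, newdata = toy, se = TRUE)

# Approximate 95% band: fit +/- t quantile * standard error.
margin <- qt(0.975, pred$df) * pred$se.fit
band <- data.frame(
  hour = toy$hour,
  fit = pred$fit,
  lower = pred$fit - margin,
  upper = pred$fit + margin
)

head(band)
```

In the tutorial's plot, `color = car_service` makes `geom_smooth` fit one such curve per car service, so the equivalent manual approach would repeat this fit for each group.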