Bookdown contribution by S. Tinapunan
tidyverse to process data
This workflow tutorial demonstrates how to process data from start to finish using the tidyverse.
In this tutorial, we will use the following:
- readr package to read a csv file.
- tidyr package to transform data into tidy data.
- dplyr package to group and summarize data.
- ggplot2 package to visualize data.

This data file was taken from https://data.fivethirtyeight.com/ under the data set uber-tlc-foil-response. This specific data comes from an Excel file named Uber Weekday-Hour AverageTrips.xlsx, from the sheet entitled Trips Per Hour and Weekday.
file <- "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/Bookdown/Uber%20Weekday-Hour%20AverageTrips.csv"
library(tidyverse)
library(knitr)
readr package
The code below demonstrates how to use readr::read_csv, which is part of the tidyverse.
- file is a variable that contains the file path of the data set.
- skip indicates the number of lines to skip before reading data.
- col_names is either TRUE or FALSE or a character vector of column names.

The code below skips the first two rows of the file. If you take a look at the data file, the first row describes the five columns in two groups: Time and Average trips per hour and day of week. The second row provides the column names; however, the code below supplies the column names explicitly, so both of these rows from the raw data file are skipped.
col_names <- c("weekday", "hour", "other_8_bases", "uber", "lyft")
data <- readr::read_csv(file, skip = 2, col_names = col_names)
## Parsed with column specification:
## cols(
## weekday = col_character(),
## hour = col_integer(),
## other_8_bases = col_number(),
## uber = col_number(),
## lyft = col_integer()
## )
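The effect of skip and col_names can be checked without downloading anything. The sketch below is only an illustration: it writes a small stand-in file whose two header rows are hypothetical (the exact header text of the real file is not reproduced here), followed by two data rows copied from the preview, and reads it back with the same arguments. The explicit col_types is an addition so that no column type is left to guessing.

```r
library(readr)

# Hypothetical miniature of the raw file: a description row, a header row,
# then two data rows taken from the preview table.
raw_lines <- c(
  "Time,,Average trips per hour and day of week,,",
  "Weekday,Hour,Other 8 bases,Uber,Lyft",
  "1,0,769,1458,323",
  "1,1,645,1078,344"
)
tmp <- tempfile(fileext = ".csv")
writeLines(raw_lines, tmp)

col_names <- c("weekday", "hour", "other_8_bases", "uber", "lyft")

# skip = 2 drops both header rows; col_names supplies the names instead.
demo <- read_csv(
  tmp,
  skip = 2,
  col_names = col_names,
  col_types = cols(weekday = col_character(), .default = col_double())
)
print(demo)
```

If skip were set to 1 instead, the second header row would be read as data, which is why both rows are skipped when the names are supplied by hand.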
Below is a preview of the data set.
| weekday | hour | other_8_bases | uber | lyft |
|---|---|---|---|---|
| 1 | 0 | 769 | 1458 | 323 |
| 1 | 1 | 645 | 1078 | 344 |
| 1 | 2 | 497 | 757 | 362 |
| 1 | 3 | 385 | 513 | 362 |
| 1 | 4 | 435 | 308 | 333 |
| 1 | 5 | 438 | 308 | 297 |
The code below uses the tidyr package to transform the data so that each row holds the value for one car service at a given weekday and hour. The transformation takes the columns other_8_bases, uber, and lyft and stores their names in a new column named car_service. It also creates a column called average_trip_per_hrday to store the values that used to be in those three columns. The last argument (3:5) selects, by position, the columns to gather; their names become the values of car_service.
data_transform <- tidyr::gather(data, "car_service", "average_trip_per_hrday", 3:5)
Below is a preview of the data that has been transformed.
| weekday | hour | car_service | average_trip_per_hrday |
|---|---|---|---|
| 1 | 0 | other_8_bases | 769 |
| 1 | 1 | other_8_bases | 645 |
| 1 | 2 | other_8_bases | 497 |
| 1 | 3 | other_8_bases | 385 |
| 1 | 4 | other_8_bases | 435 |
| 1 | 5 | other_8_bases | 438 |
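Recent versions of tidyr (1.0.0 and later) describe gather as superseded and offer pivot_longer for the same reshaping. The sketch below is an alternative, not the code used in this tutorial, and it runs on a two-row stand-in for the data built from the preview rows. It selects the columns by name rather than by position.

```r
library(tibble)
library(tidyr)

# Two rows copied from the preview of the raw data.
wide <- tribble(
  ~weekday, ~hour, ~other_8_bases, ~uber, ~lyft,
  "1",      0,     769,            1458,  323,
  "1",      1,     645,            1078,  344
)

# Column names go to car_service, cell values to average_trip_per_hrday.
long <- pivot_longer(
  wide,
  cols = c(other_8_bases, uber, lyft),
  names_to = "car_service",
  values_to = "average_trip_per_hrday"
)
print(long)
```

One visible difference: pivot_longer emits the three services for the first input row, then the three for the second, whereas gather stacks all rows of one column before moving to the next. The set of rows is the same.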
dplyr to calculate the average number of trips per hour for each car service
The group_by function groups the rows by car_service, and then by hour. The summarise function calculates the mean of average_trip_per_hrday within each group. In other words, for each car service and each hour of the day, the values for that hour are averaged across all days of the week. The arrange function then orders the result by car_service and hour.
average_trips_perhour <-
  data_transform %>%
  dplyr::group_by(car_service, hour) %>%
  dplyr::summarise(average_trips = mean(average_trip_per_hrday)) %>%
  dplyr::arrange(car_service, hour)
average_trips_perhour
Below is a preview of the summarized data in average_trips_perhour, which holds the average number of trips for each hour (in 24-hour, or military, time) for each car service.
| car_service | hour | average_trips |
|---|---|---|
| lyft | 0 | 299.1429 |
| lyft | 1 | 296.0000 |
| lyft | 2 | 287.0000 |
| lyft | 3 | 263.1429 |
| lyft | 4 | 206.0000 |
| lyft | 5 | 152.5714 |
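To see what the grouping does, it helps to run the same pipeline on a table small enough to check by hand. The values below are made up for illustration and are not from the Uber data set. The .groups = "drop" argument is an addition that assumes dplyr 1.0.0 or later; it returns an ungrouped result.

```r
library(dplyr)
library(tibble)

# Made-up values: two weekdays, one hour, two services.
toy <- tribble(
  ~weekday, ~hour, ~car_service, ~average_trip_per_hrday,
  "1",      0,     "lyft",       300,
  "2",      0,     "lyft",       200,
  "1",      0,     "uber",       1400,
  "2",      0,     "uber",       1000
)

# One output row per (car_service, hour) pair, averaged over weekdays.
toy_summary <- toy %>%
  group_by(car_service, hour) %>%
  summarise(average_trips = mean(average_trip_per_hrday), .groups = "drop") %>%
  arrange(car_service, hour)
print(toy_summary)
```

The lyft row is (300 + 200) / 2 = 250 and the uber row is (1400 + 1000) / 2 = 1200, so the weekday column has been averaged away while car_service and hour remain.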
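ggplot2 to visualize data
The plot below shows the average number of trips per hour for each car service.
The ggplot function sets up the plot, geom_point draws the scatter plot, and geom_smooth adds a loess regression line for each car service.
Arguments for geom_smooth:
- method is the smoothing method to use, here "loess".
- se indicates whether to display a confidence interval around the smooth line.
- fullrange indicates whether the fit should span the full range of the plot (TRUE) or only the range of the data (FALSE).
- level is the confidence level used for the interval, here 0.95.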
ggplot(average_trips_perhour, aes(x=hour, y=average_trips, color=car_service)) + geom_point() +
geom_smooth(method="loess", se=TRUE, fullrange=FALSE, level=0.95) +
ggtitle("Average number of trips for each hour of the day")
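As a variation, the sketch below builds the same kind of plot on made-up data, with explicit axis labels and the confidence band turned off (se = FALSE). The data frame, the service names service_a and service_b, and the label text are all hypothetical; only the layer structure mirrors the plot above.

```r
library(ggplot2)

# Made-up data: 24 hours for each of two hypothetical services.
set.seed(1)
toy_plot_data <- data.frame(
  hour = rep(0:23, times = 2),
  car_service = rep(c("service_a", "service_b"), each = 24),
  average_trips = c(200 + 10 * (0:23), 500 + 5 * (0:23)) + rnorm(48, sd = 20)
)

p <- ggplot(toy_plot_data, aes(x = hour, y = average_trips, color = car_service)) +
  geom_point() +                                                # scatter layer
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +  # one loess curve per service
  labs(
    title = "Average number of trips for each hour of the day",
    x = "Hour of day",
    y = "Average trips",
    color = "Car service"
  )
print(p)
```

Because color is mapped to car_service inside aes, both geom_point and geom_smooth are split by service, which is what gives one curve per service.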