In this tutorial I will use NYC CitiBike data to show how to define different factors: deriving month, hour, and day of the week from starttime, then converting them to factors with properly ordered levels.
Reading in the data and inspecting the column names. I set stringsAsFactors = FALSE so I can control the factor creation manually.
bike <- read.csv("NYC-CitiBike-2016.csv", stringsAsFactors = FALSE)
names(bike)
## [1] "tripduration" "starttime"
## [3] "stoptime" "start.station.id"
## [5] "start.station.name" "start.station.latitude"
## [7] "start.station.longitude" "end.station.id"
## [9] "end.station.name" "end.station.latitude"
## [11] "end.station.longitude" "bikeid"
## [13] "usertype" "birth.year"
## [15] "gender"
Converting starttime to POSIXct so its components can be extracted, then creating new character columns for month, hour, and day of the week.
bike$starttime <- as.POSIXct(bike$starttime, format = "%m/%d/%Y %H:%M:%S")
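A quick sanity check is worth doing here: as.POSIXct returns NA for any string that does not match the format, so counting NAs after the conversion catches parsing problems early. A minimal sketch on a made-up vector (the sample values and tz = "UTC" are my assumptions, not taken from the dataset):

```r
# Toy input: one timestamp in the expected format and one malformed value
x <- c("1/1/2016 00:00:41", "not a date")

# Same format string as above; tz is set explicitly only to make the sketch reproducible
parsed <- as.POSIXct(x, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Number of values that failed to parse
n_failed <- sum(is.na(parsed))
n_failed  # 1
```

On the real data the same check would be sum(is.na(bike$starttime)), which should be 0 if every row follows the format.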
Deriving new columns
bike$month <- format(bike$starttime, "%m")
bike$hour <- format(bike$starttime, "%H")
bike$day_of_the_week <- weekdays(bike$starttime)
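One caveat: weekdays() returns day names in the current locale, so on a non-English system the names would not match English factor levels and would turn into NA. A possible locale-independent alternative, sketched under the assumption that format() supports "%u" (ISO weekday number, Monday = 1) on your platform, is to index into a fixed vector of names:

```r
# Hypothetical example date: 1 January 2016 was a Friday
d <- as.POSIXct("2016-01-01 00:00:41", tz = "UTC")

# Fixed English names, independent of the system locale
dow_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# "%u" gives the weekday as "1".."7" with Monday = 1
dow_num <- as.integer(format(d, "%u"))

dow <- factor(dow_names[dow_num], levels = dow_names, ordered = TRUE)
as.character(dow)  # "Friday"
```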
Converting the derived columns to ordered factors: month (3-letter abbreviations), hour (time-of-day labels), and day of the week (Monday through Sunday).
bike$month <- factor(bike$month,
                     levels = sprintf("%02d", 1:12),
                     labels = month.abb,
                     ordered = TRUE)
bike$hour <- factor(bike$hour,
                    levels = sprintf("%02d", 0:23),
                    labels = sprintf("%02d:00", 0:23),
                    ordered = TRUE)
dow_order <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
bike$day_of_the_week <- factor(bike$day_of_the_week,
                               levels = dow_order,
                               ordered = TRUE)
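To illustrate why the ordering matters, here is a small sketch on a made-up vector of month codes (not the bike data). With ordered = TRUE, comparisons and sorting follow the calendar order of the levels rather than alphabetical order:

```r
# Toy month codes, deliberately out of order
m <- factor(c("03", "01", "12"),
            levels = sprintf("%02d", 1:12),
            labels = month.abb,
            ordered = TRUE)

# Comparison uses level order: Mar < Jun, Jan < Jun, Dec is not < Jun
before_june <- m < "Jun"
before_june  # TRUE TRUE FALSE

# Sorting follows level order, not alphabetical order
sorted_months <- as.character(sort(m))
sorted_months  # "Jan" "Mar" "Dec"
```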
Converting usertype to an unordered factor (Subscriber/Customer)
bike$usertype <- factor(bike$usertype)
Verifying all factor levels are correct and ordered.
cat("Month levels:\n"); print(levels(bike$month))
## Month levels:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
cat("Hour levels:\n"); print(levels(bike$hour))
## Hour levels:
## [1] "00:00" "01:00" "02:00" "03:00" "04:00" "05:00" "06:00" "07:00" "08:00"
## [10] "09:00" "10:00" "11:00" "12:00" "13:00" "14:00" "15:00" "16:00" "17:00"
## [19] "18:00" "19:00" "20:00" "21:00" "22:00" "23:00"
cat("Day-of-the-Week:\n"); print(levels(bike$day_of_the_week))
## Day-of-the-Week:
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
## [7] "Sunday"
cat("User type levels:\n"); print(levels(bike$usertype))
## User type levels:
## [1] "" "Customer" "Subscriber"
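The output above shows an empty-string level alongside Customer and Subscriber, which suggests some rows have no recorded user type. If that level is not wanted, one option is to recode the blanks as NA before building the factor. A minimal sketch on a made-up vector (whether blanks should be treated as missing is my assumption, not something the dataset documents here):

```r
# Toy values mimicking the three cases seen in the levels above
u <- c("Customer", "", "Subscriber", "Subscriber")

# Treat blank strings as missing
u[u == ""] <- NA

# Only the two meaningful levels remain
f <- factor(u, levels = c("Customer", "Subscriber"))
levels(f)       # "Customer" "Subscriber"
n_missing <- sum(is.na(f))
n_missing       # 1
```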
Examining the structure of the key columns
str(bike[, c("month", "hour", "day_of_the_week", "usertype")])
## 'data.frame': 276798 obs. of 4 variables:
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : Ord.factor w/ 24 levels "00:00"<"01:00"<..: 1 1 1 1 1 1 1 2 2 2 ...
## $ day_of_the_week: Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 5 5 5 5 5 5 5 5 5 5 ...
## $ usertype : Factor w/ 3 levels "","Customer",..: 2 3 3 3 2 3 3 3 3 2 ...
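As a further check, tabulating a factor is a handy way to see the level order in practice, because table() reports counts in level order and includes levels with zero observations. A small sketch on made-up day names (not the bike data):

```r
dow_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# Toy observations, deliberately starting with Sunday
days <- c("Sunday", "Monday", "Monday", "Friday")

counts <- table(factor(days, levels = dow_order, ordered = TRUE))

# Columns come out Monday through Sunday, with 0 for unseen days
names(counts)
counts[["Monday"]]   # 2
counts[["Tuesday"]]  # 0
```

The equivalent call on the real data would be table(bike$day_of_the_week).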
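The CitiBike 2016 dataset now has month, hour, and day_of_the_week stored as ordered factors with the intended level order, and usertype stored as an unordered factor. Note that usertype keeps an empty-string level in addition to Customer and Subscriber.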