Introduction

In this tutorial I use NYC Citi Bike trip data to show how to define factors: I derive month, hour, and day of the week from starttime, then convert each one to a factor with correctly ordered levels.

Reading data

Reading the data and listing its column names. stringsAsFactors = FALSE is set so that I can control factor creation manually.

bike <- read.csv("NYC-CitiBike-2016.csv", stringsAsFactors = FALSE)

names(bike)
##  [1] "tripduration"            "starttime"              
##  [3] "stoptime"                "start.station.id"       
##  [5] "start.station.name"      "start.station.latitude" 
##  [7] "start.station.longitude" "end.station.id"         
##  [9] "end.station.name"        "end.station.latitude"   
## [11] "end.station.longitude"   "bikeid"                 
## [13] "usertype"                "birth.year"             
## [15] "gender"
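To illustrate what stringsAsFactors changes, here is a minimal sketch on a made-up two-row sample (the csv_text string is hypothetical and stands in for the real file). With TRUE, text columns become factors with alphabetical levels; with FALSE, which is also the default since R 4.0.0, they stay character.

```r
# Hypothetical two-row sample standing in for the real CSV file
csv_text <- "usertype,gender\nSubscriber,1\nCustomer,2"

as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$usertype)     # character: left for us to convert later
class(as_factor$usertype)   # factor: created automatically
levels(as_factor$usertype)  # levels are sorted alphabetically, not by meaning
```

Keeping the columns as character means the level order is chosen explicitly in the later steps rather than inherited from alphabetical sorting.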

– 1. Parsing datetime and deriving new columns

Converting starttime to POSIXct so that its components can be extracted, then creating new character columns for month, hour, and day of the week.

bike$starttime <- as.POSIXct(bike$starttime, format = "%m/%d/%Y %H:%M:%S")
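as.POSIXct returns NA for any string that does not match the format, without raising an error, so it may be worth counting NAs after parsing. The sketch below uses a made-up vector (raw is hypothetical, and tz = "UTC" is an assumption added only to keep the example deterministic).

```r
# Hypothetical sample: the second string uses a different layout
raw <- c("1/1/2016 00:00:41", "2016-01-01 00:05:00", "12/31/2016 23:59:59")

parsed <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

sum(is.na(parsed))   # number of strings that failed to parse
raw[is.na(parsed)]   # the offending input strings
```

On the real data the same check would be sum(is.na(bike$starttime)), run right after the conversion.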

Deriving new columns

bike$month <- format(bike$starttime, "%m")
bike$hour <- format(bike$starttime, "%H")
bike$day_of_the_week <- weekdays(bike$starttime)
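weekdays() returns names in the current locale, so on a non-English system they would not match English level names. A possible locale-independent alternative, sketched here on two made-up timestamps, is to take the ISO weekday number from format(x, "%u") and map it to names yourself. The names x, dow_num, dow_names, and dow are illustrative only.

```r
# Two hypothetical timestamps: 2016-01-01 was a Friday, 2016-01-03 a Sunday
x <- as.POSIXct(c("2016-01-01 08:00:00", "2016-01-03 09:30:00"), tz = "UTC")

# "%u" gives the ISO weekday number: 1 = Monday ... 7 = Sunday
dow_num <- as.integer(format(x, "%u"))

dow_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# Index into the English names, so the result does not depend on the locale
dow <- factor(dow_names[dow_num], levels = dow_names, ordered = TRUE)
dow
```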

– 2a. Converting to factors with correct order

Month (ordered factor with 3-letter abbreviations), hour (ordered factor with time-of-day labels), and day of the week (ordered factor, Monday through Sunday).

bike$month <- factor(bike$month,
                     levels = sprintf("%02d", 1:12),
                     labels = month.abb, 
                     ordered = TRUE)

bike$hour <- factor(bike$hour,
                    levels = sprintf("%02d", 0:23),
                    labels = sprintf("%02d:00", 0:23),
                    ordered = TRUE)

dow_order <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
bike$day_of_the_week <- factor(bike$day_of_the_week,
                               levels = dow_order,
                               ordered = TRUE)
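To see why the explicit levels matter, the sketch below compares a default factor with an ordered one on a small made-up vector (days is hypothetical). Without levels, R sorts them alphabetically; with levels and ordered = TRUE, sorting and comparisons follow the calendar.

```r
days <- c("Wednesday", "Monday", "Sunday", "Friday")  # hypothetical sample

# Default: levels are the sorted unique values, i.e. alphabetical
default_f <- factor(days)
levels(default_f)

dow_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
ordered_f <- factor(days, levels = dow_order, ordered = TRUE)

sort(ordered_f)       # sorted in calendar order, Monday first
ordered_f < "Friday"  # comparisons against a level name are meaningful
```

The same ordering carries over to table() and to plot axes, which list levels in the order given.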

– 2b. User type

Converting to an unordered factor (Subscriber/Customer). Blank entries in the raw data end up in their own "" level.

bike$usertype <- factor(bike$usertype)

– 3. Echo levels to verify

Verifying all factor levels are correct and ordered.

cat("Month levels:\n"); print(levels(bike$month))
## Month levels:
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
cat("Hour levels:\n"); print(levels(bike$hour))
## Hour levels:
##  [1] "00:00" "01:00" "02:00" "03:00" "04:00" "05:00" "06:00" "07:00" "08:00"
## [10] "09:00" "10:00" "11:00" "12:00" "13:00" "14:00" "15:00" "16:00" "17:00"
## [19] "18:00" "19:00" "20:00" "21:00" "22:00" "23:00"
cat("Day-of-the-Week:\n"); print(levels(bike$day_of_the_week))
## Day-of-the-Week:
## [1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
## [7] "Sunday"
cat("User type levels:\n"); print(levels(bike$usertype))
## User type levels:
## [1] ""           "Customer"   "Subscriber"
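The "" level above comes from blank usertype entries. If those should count as missing rather than as a third category, one option is to recode them to NA before building the factor. This is a sketch on a made-up vector, not a step the tutorial applies to bike.

```r
usertype <- c("Customer", "Subscriber", "", "Subscriber")  # hypothetical sample

# Treat blank strings as missing values
usertype[usertype == ""] <- NA

# Naming the levels explicitly keeps only the two real categories
usertype_f <- factor(usertype, levels = c("Customer", "Subscriber"))

levels(usertype_f)
table(usertype_f, useNA = "ifany")  # NA is counted separately
```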

– 4. Sanity Check

Examining the structure of the key columns

str(bike[, c("month", "hour", "day_of_the_week", "usertype")])
## 'data.frame':    276798 obs. of  4 variables:
##  $ month          : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ hour           : Ord.factor w/ 24 levels "00:00"<"01:00"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ day_of_the_week: Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ usertype       : Factor w/ 3 levels "","Customer",..: 2 3 3 3 2 3 3 3 3 2 ...
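str() confirms the types but not that every value matched a level: factor() silently turns any value missing from levels into NA. A further check, sketched on a made-up vector (hour_chr is hypothetical), is to look for NAs that appeared only after the conversion.

```r
# Hypothetical sample: "7" lacks the leading zero that the levels expect
hour_chr <- c("00", "07", "7", "23")

hour_f <- factor(hour_chr,
                 levels = sprintf("%02d", 0:23),
                 labels = sprintf("%02d:00", 0:23),
                 ordered = TRUE)

# Values that were present before conversion but are NA afterwards
unmatched <- hour_chr[is.na(hour_f) & !is.na(hour_chr)]
unmatched
```

An empty result means every input value was mapped to a level.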

Conclusion

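The CitiBike 2016 dataset now stores month, hour, and day_of_the_week as ordered factors with the intended level order, and usertype as an unordered factor. Note that usertype still carries a blank "" level alongside Customer and Subscriber, which comes from blank entries in the raw data.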