This notebook illustrates the use of Tidyverse libraries (dplyr, tidyr, etc.) for feature engineering. The data used in this notebook comes from the San Francisco Crime Classification competition on Kaggle. Let us first load all the libraries that we will use:

# Load libraries
library(plyr)
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(purrr)
library(lubridate)
library(Matrix)
library(MatrixModels)
library(caret)

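The libraries dplyr, tidyr, readr, stringr, and purrr are part of the tidyverse collection. We load the data using the read_csv function of readr, which is typically faster than the corresponding read.csv of base R. The col_types string gives one letter per column: T for date-time, c for character, i for integer, and n for number.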

train <- read_csv("./data/train.csv", col_types = "Tccccccnn")
test <- read_csv("./data/test.csv", col_types = "iTcccnn")
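The compact col_types string can also be written out column by column with cols(). The following is a minimal sketch of that equivalence, assuming a small made-up CSV written to a temporary file; the rows and the reduced four-column layout are hypothetical and only stand in for the real competition files.

```r
library(readr)

# Hypothetical sample file: made-up rows, not taken from the real data
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Dates,Category,X,Y",
  "2015-01-01 10:00:00,EXAMPLE A,-122.40,37.77",
  "2015-01-02 11:30:00,EXAMPLE B,-122.41,37.78"
), csv_path)

# Compact form: one letter per column, in column order
compact <- read_csv(csv_path, col_types = "Tcnn")

# Explicit form: the same types, named per column
explicit <- read_csv(csv_path, col_types = cols(
  Dates    = col_datetime(),
  Category = col_character(),
  X        = col_number(),
  Y        = col_number()
))

# Both calls should yield the same column classes
identical(lapply(compact, class), lapply(explicit, class))
```

The explicit form is longer but does not depend on column order, which can make it easier to maintain if the file layout changes.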

The variables train and test are tibble objects, the tidyverse's modern take on the base R data.frame, with stricter subsetting and more compact printing. Let’s print train and test to see what they look like:

train

and

test
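To see how a tibble differs from a base R data.frame in practice, here is a small sketch using a toy table (the columns x and y are hypothetical and unrelated to the crime data). It shows one behavioural difference, single-column subsetting, and uses glimpse() as an alternative way to inspect wide tables.

```r
library(tibble)

# Toy data, purely illustrative
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# A data.frame drops a single selected column to a plain vector
class(df[, "x"])

# A tibble keeps its tabular shape for the same operation
class(tb[, "x"])

# glimpse() prints one line per column, which helps with wide tables
glimpse(tb)
```

The same glimpse() call could be applied to train and test to view every column's type and first values at once.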