1 Goal


The goal of this tutorial is to learn fast functions for reading datasets, for cases where our code needs to run faster, for example when we run it on rented servers.


2 Reading the data


# In this tutorial we are going to use the MBTI dataset from Kaggle:
# https://www.kaggle.com/datasnaek/mbti-type
# This dataset contains 8675 rows and two character columns (type and posts)

dataset <- read.csv("mbti_1.csv", stringsAsFactors = FALSE)
str(dataset)
## 'data.frame':    8675 obs. of  2 variables:
##  $ type : chr  "INFJ" "ENTP" "INTP" "INTJ" ...
##  $ posts: chr  "'http://www.youtube.com/watch?v=qsXHcwe3krw|||http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg|||enfp and intj mom"| __truncated__ "'I'm finding the lack of me in these posts very alarming.|||Sex can be boring if it's in the same position often. For example m"| __truncated__ "'Good one  _____   https://www.youtube.com/watch?v=fHiGbolFFGw|||Of course, to which I say I know; that's my blessing and my cu"| __truncated__ "'Dear INTP,   I enjoyed our conversation the other day.  Esoteric gabbing about the nature of the universe and the idea that ev"| __truncated__ ...

3 Benchmarking different functions for reading tables

3.1 The read.csv function


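# Let's measure the time it takes to read the table 10 times
# A for loop is used because rep() would evaluate the read only once and just repeat its result
time_readcsv <- system.time(for (i in 1:10) read.csv("mbti_1.csv", stringsAsFactors = FALSE))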
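
When timing repeated reads, the file has to be read again on every repetition. rep() evaluates its argument once and only repeats the resulting value, whereas a loop re-evaluates the call each time. The sketch below is only an illustration: the temporary file, its contents, and the helper read_counted() are made up for this example and are not part of the benchmark.

```r
# Hypothetical toy file, created only for this sketch
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(type = c("INFJ", "ENTP"), posts = c("a", "b")),
          tmp, row.names = FALSE)

n_reads <- 0
read_counted <- function(path) {
  n_reads <<- n_reads + 1            # count how often the file is really read
  read.csv(path, stringsAsFactors = FALSE)
}

invisible(rep(read_counted(tmp), 10))  # the call is evaluated once, its value is repeated
reads_with_rep <- n_reads

n_reads <- 0
for (i in 1:10) read_counted(tmp)      # the call is evaluated on every iteration
reads_with_loop <- n_reads

c(rep = reads_with_rep, loop = reads_with_loop)
##  rep loop 
##    1   10

unlink(tmp)
```

replicate(10, expr, simplify = FALSE) would also re-evaluate the expression each time; the loop is used here only because it is the shortest form.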

3.2 The read_csv function


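# We are going to use the read_csv function from the readr package
# All columns are read as character, and the loop re-reads the file on each of the 10 iterations

library(readr)
time_read_csv <- system.time(for (i in 1:10) read_csv(file = "mbti_1.csv", col_names = TRUE, col_types = cols(.default = "c")))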

3.3 The fread function


# We are going to use the fread function from the data.table package
# fread returns an object that is both a data.table and a data.frame; if a plain data frame is needed, use as.data.frame(fread(...)) or fread(..., data.table = FALSE)
# Credit: Aoife


library(data.table)
time_fread <- system.time(for (i in 1:10) fread("mbti_1.csv", stringsAsFactors = FALSE))
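
Since fread() returns an object of class data.table as well as data.frame, it can help to check the class of the result before passing it to code that expects a plain data frame. A minimal sketch, assuming the data.table package is installed; the small temporary file and the variable names posts_dt and posts_df are hypothetical and only used for illustration.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  # Hypothetical toy file, created only for this sketch
  tmp <- tempfile(fileext = ".csv")
  writeLines(c("type,posts", "INFJ,hello", "ENTP,world"), tmp)

  posts_dt <- data.table::fread(tmp)                      # default: data.table and data.frame
  posts_df <- data.table::fread(tmp, data.table = FALSE)  # plain data.frame

  print(class(posts_dt))
  print(class(posts_df))

  unlink(tmp)
}
```

The data.table = FALSE argument avoids a separate conversion step after reading.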

4 Time benchmarking


# Let's check the time difference between the three functions (elapsed time, in seconds)
time_readcsv[3]
## elapsed 
##   7.478
time_read_csv[3]
## elapsed 
##   0.337
time_fread[3]
## elapsed 
##   0.412
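
To make the comparison easier to read, the elapsed times can be turned into speed-up factors relative to read.csv. The sketch below simply copies the three values printed above into a vector; the names elapsed and speedup are introduced here for illustration.

```r
# Elapsed times (seconds) copied from the output above
elapsed <- c(read.csv = 7.478, read_csv = 0.337, fread = 0.412)

# How many times faster each function is than read.csv
speedup <- elapsed[["read.csv"]] / elapsed
round(speedup, 1)
```

These factors describe this particular run only; they will change with the machine, the file, and the package versions.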

5 Conclusion


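In this tutorial we have used three of the most widely used functions for reading tables. We have learnt that both fread and read_csv are considerably faster than read.csv: in the run shown here they took 0.412 and 0.337 seconds against 7.478 seconds, roughly 18 to 22 times faster. The two fast functions performed similarly, so the best choice depends on the user, for example on whether the rest of the code works with data.table objects (fread) or with readr tibbles (read_csv).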