The goal of this tutorial is to learn fast functions for reading datasets in R, for cases where the code needs to run faster, for example when it runs on rented servers.
# In this tutorial we use the MBTI dataset from Kaggle:
# https://www.kaggle.com/datasnaek/mbti-type
# The dataset contains 8675 entries in two character (string) columns
dataset <- read.csv("mbti_1.csv", stringsAsFactors = FALSE)
str(dataset)
## 'data.frame': 8675 obs. of 2 variables:
## $ type : chr "INFJ" "ENTP" "INTP" "INTJ" ...
## $ posts: chr "'http://www.youtube.com/watch?v=qsXHcwe3krw|||http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg|||enfp and intj mom"| __truncated__ "'I'm finding the lack of me in these posts very alarming.|||Sex can be boring if it's in the same position often. For example m"| __truncated__ "'Good one _____ https://www.youtube.com/watch?v=fHiGbolFFGw|||Of course, to which I say I know; that's my blessing and my cu"| __truncated__ "'Dear INTP, I enjoyed our conversation the other day. Esoteric gabbing about the nature of the universe and the idea that ev"| __truncated__ ...
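# Let's measure the time it takes to read the table.
# rep() evaluates its argument only once, so wrapping the call in rep(..., 10)
# would still read the file a single time; we therefore time one read directly.
time_readcsv <- system.time(read.csv("mbti_1.csv", stringsAsFactors = FALSE))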
# Next we time the read_csv function from the readr package,
# reading every column as character.
# This times a single read (rep() would not re-run the call).
library(readr)
time_read_csv <- system.time(read_csv(file = "mbti_1.csv", col_names = TRUE, col_types = cols(.default = "c")))
# Finally we time the fread function from the data.table package.
# fread returns an object that is both a data.table and a data.frame; if a plain
# data.frame is needed, use as.data.frame(fread(...)) or fread(..., data.table = FALSE).
# Credit: Aoife
# This times a single read (rep() would not re-run the call).
library(data.table)
time_fread <- system.time(fread("mbti_1.csv", stringsAsFactors = FALSE))
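The note about the class of the object returned by fread can be checked directly. The following is a minimal sketch, assuming the data.table package is installed; it uses a tiny temporary file as a stand-in for mbti_1.csv, and the variable names `tmp`, `dt` and `df` are only illustrative.

```r
# Sketch: compare the classes returned by fread() with and without
# data.table = FALSE. Runs only if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  # Stand-in file (an assumption, so the sketch runs without mbti_1.csv)
  tmp <- tempfile(fileext = ".csv")
  write.csv(data.frame(type = c("INFJ", "ENTP"), posts = c("a|||b", "c|||d")),
            tmp, row.names = FALSE)

  dt <- data.table::fread(tmp)                      # default return type
  df <- data.table::fread(tmp, data.table = FALSE)  # plain data.frame

  print(class(dt))  # "data.table" "data.frame"
  print(class(df))  # "data.frame"
}
```

Asking fread for a data.frame up front avoids a separate conversion step after the file has been read.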
# Let's check the time difference between the three functions (elapsed time, in seconds)
time_readcsv[3]
## elapsed
## 7.478
time_read_csv[3]
## elapsed
## 0.337
time_fread[3]
## elapsed
## 0.412
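A single call to system.time can be noisy, so it may help to repeat each read several times and summarise the results. Note that replicate() re-evaluates its expression on every iteration, whereas rep() evaluates its argument once and only copies the result. The sketch below is one possible way to do this, not the method used for the timings above: `time_reader` is a hypothetical helper, and a small temporary file stands in for mbti_1.csv so the code runs on its own. The readr and data.table readers are timed only if those packages are installed.

```r
# Hypothetical helper: time `n` independent reads of `path` with `reader`.
# replicate() re-runs the expression each time, so the file is read n times.
time_reader <- function(reader, path, n = 10) {
  elapsed <- replicate(n, system.time(reader(path))[["elapsed"]])
  c(total = sum(elapsed), median = median(elapsed))
}

# Small stand-in file (an assumption, so the sketch runs without mbti_1.csv)
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(type = rep(c("INFJ", "ENTP"), 500),
                  posts = rep("some text|||more text", 1000),
                  stringsAsFactors = FALSE)
write.csv(toy, tmp, row.names = FALSE)

timings <- list(
  read.csv = time_reader(function(p) read.csv(p, stringsAsFactors = FALSE), tmp)
)
# Add the other readers only if their packages are available
if (requireNamespace("readr", quietly = TRUE)) {
  timings$read_csv <- time_reader(
    function(p) readr::read_csv(p, col_types = readr::cols(.default = "c")), tmp)
}
if (requireNamespace("data.table", quietly = TRUE)) {
  timings$fread <- time_reader(function(p) data.table::fread(p), tmp)
}

# One row per reader: total and median elapsed seconds over the n reads
print(do.call(rbind, timings))
```

On a file this small the differences between readers will be tiny; pointing `tmp` at the real dataset gives a more meaningful comparison.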
In this tutorial we have used three of the most common functions for reading tables. In the run above, read_csv (0.337 s) and fread (0.412 s) were both roughly 20 times faster than read.csv (7.478 s). The difference between fread and read_csv is small here, so the choice between them is largely a matter of user preference.