Welcome to the second episode of our R from Zero to Hero journey. In
this section we will look at how to get data onto our screen, in front
of our eyes. In other words, we will see how to load or import data, and
also how to save data once some processing has been completed.
Data collection
There are various ways to get data onto our screens, depending on the
origin or source of the data. We will try to cover these methods
briefly.
External data sources
We call them external because they are not currently located in our
environment, and not necessarily on our own station.
Loading data from a folder
In the early days of your R journey, you are most likely to load data
from a local folder such as My Documents. We will look at various cases
here, depending on the format of the file.
Loading data from an Excel file
library(readxl)
my_dataframe_xlsx <- read_xlsx('./data/insider-purchases.xlsx')
print(head(my_dataframe_xlsx))
Loading data from a CSV file
my_dataframe_csv <- read.csv('./data/insider-purchases.csv')
print(head(my_dataframe_csv))
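As a quick, self-contained check of what read.csv() returns, the sketch below writes a tiny CSV to a temporary file and reads it back. The column names and values are made up for illustration and are not taken from insider-purchases.csv.

```r
# Hypothetical sample data, written to a temporary file so the example
# runs without the tutorial's data folder.
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("ticker,shares,price",
             "AAA,100,12.5",
             "BBB,250,NA"), tmp_csv)

# read.csv() returns a data frame; the string "NA" is read as a
# missing value by default.
sample_csv <- read.csv(tmp_csv, stringsAsFactors = FALSE)

str(sample_csv)                 # column types guessed by read.csv()
sum(is.na(sample_csv$price))    # number of missing prices
```

Calling str() right after loading is a cheap way to confirm that each column was read with the type you expect.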
Loading data from a PDF file
Depending on the industry or the subject that we are working on, we
might have to import data from a PDF file. While this is not usually
demonstrated in entry-level material, I chose to cover it here because
you will not always have a choice about the format in which data is made
available to you.
library(tabulapdf)
#path_to_file <- "data/insider-purchases-pdf.pdf"
#my_dataframe_pdf <- extract_tables(path_to_file)
#### WE WILL HAVE TO COME BACK HERE TO HANDLE MULTI PAGES PDF TABLE
Loading data from an RDS file
The RDS format in R is a file format used to store a single R object,
created using the saveRDS() function and read using the
readRDS() function. The primary advantages of RDS are:
- Serialization of Single Objects: RDS files store a
single R object, making it simple to save and load specific data
structures, models, or any R objects.
- Preservation of Object Structure: RDS preserves the
exact structure and attributes of the saved R object, ensuring
consistency when reloaded.
- Compact Storage: RDS files are often smaller due to
internal compression, which saves disk space.
- Flexibility in Naming: Unlike other formats, when
reading an RDS file, you can assign any name to the loaded object.
- Interoperability: RDS files can be easily shared
and used across different R sessions and environments, enhancing
reproducibility.
my_dataframe_rds <- readRDS('data/my_dataframe_rds.rds')
print(head(my_dataframe_rds))
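The points about structure preservation and free naming can be checked with a small round trip. This is a minimal sketch using a made-up data frame and a temporary file, not the tutorial's my_dataframe_rds.rds.

```r
# Hypothetical data frame with a factor and a Date column.
original <- data.frame(
  side = factor(c("buy", "sell", "buy"), levels = c("sell", "buy")),
  day  = as.Date(c("2024-01-02", "2024-01-03", "2024-01-04"))
)

tmp_rds <- tempfile(fileext = ".rds")
saveRDS(original, tmp_rds)

# The object can be assigned to any name when it is read back.
restored <- readRDS(tmp_rds)

identical(original, restored)   # compares values, classes and attributes
levels(restored$side)           # factor levels in their original order
class(restored$day)             # the Date class is kept
```

A CSV round trip of the same data would return the dates and the factor as plain text, which is why RDS is convenient for intermediate results.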
Loading data from a database
# here develop how to use Postgres etc
Data Export
After completing certain tasks on our data, we will most likely need
to keep some of the records, or all of them. For this purpose we will do
the exact opposite of the data collection process: we will export our
data. In this section we will use the counterparts of the ‘loading’
functions that were used previously. Before exporting our data, we will
print a preview as a sanity check. We will use the dataset that
was previously loaded from an RDS file; this is an arbitrary
choice.
dataset_to_be_exported <- my_dataframe_rds
print(head(dataset_to_be_exported))
Exporting to Excel file
xlsx::write.xlsx(dataset_to_be_exported, 'data/exports/dataset_to_be_exported.xlsx')
Because you will likely export the same dataset more than once, you
might want to include today's date in the file name as a key
differentiator, so we may as well learn how to do it now. We will rely
on R's paste function as well as the date formatting function. Note that
the same approach applies to any file name built prior to saving.
Exporting to an Excel file with a date inside the name
# Format today's date (%B is the full month name in the current locale)
today_date_formated <- format(Sys.Date(), format = "%Y-%B-%d")
# Create the file name string
file_name <- paste(today_date_formated, "dataset_to_be_exported.xlsx", sep = " ")
# Build the full export path, starting from the working directory
file_path_for_export <- paste(getwd(), '/data/exports/', file_name, sep = '')
# Export the dataset to the path created above
xlsx::write.xlsx(dataset_to_be_exported, file_path_for_export)
Ah, what just happened? We just used two functions that we have never
seen before.
1 - Sys.Date(): This function gives us the current date of our machine.
We could also use Sys.time(), which would return, for example:
“2020-05-31 14:49:47 HKT”
2 - getwd()….
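To make these functions more concrete, here is a small sketch that can be run on its own. The variable names and the underscore-separated file name are illustrative choices, not part of the export code above.

```r
today <- Sys.Date()          # current date, class "Date"
now   <- Sys.time()          # current date and time, class "POSIXct"

class(today)
class(now)

# getwd() returns the current working directory as a character string;
# relative paths such as 'data/exports/' are resolved against it.
wd <- getwd()

# A numeric month ("%m") keeps dated file names in chronological order
# when they are sorted alphabetically; "%B" gives the full month name.
dated_name <- paste0(format(today, "%Y-%m-%d"), "_dataset_to_be_exported.xlsx")
dated_path <- file.path(wd, "data", "exports", dated_name)
print(dated_path)
```

file.path() is used here instead of paste() only because it inserts the path separators for us; both build the same kind of string.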
Exporting to CSV file
write.csv(dataset_to_be_exported, 'data/exports/dataset_to_be_exported.csv')
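One detail worth knowing about write.csv() is that it writes the row names as an extra first column by default. The sketch below shows the difference with a made-up data frame and temporary files; whether you want the row names in your own export is up to you.

```r
# Hypothetical data frame standing in for dataset_to_be_exported.
demo_export <- data.frame(ticker = c("AAA", "BBB"), shares = c(100L, 250L))

with_rownames    <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

write.csv(demo_export, with_rownames)                        # default: row names written
write.csv(demo_export, without_rownames, row.names = FALSE)  # row names dropped

names(read.csv(with_rownames))      # includes an extra leading column named "X"
names(read.csv(without_rownames))   # only the original columns
```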
Exporting to RDS file
saveRDS(dataset_to_be_exported, 'data/exports/dataset_to_be_exported.rds')
Exporting to Database
We will leave the following section as it is for now and come back to
it later with the actual setup of a database; we still need to consider
what the best option for data exchange is here.
Database set up and connection
Once again, the following section assumes that you have already set up
a database.
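library(DBI)
library(RPostgres)
# Open a connection to a local Postgres database
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "market_data",
  host = "localhost",
  port = 5432,
  user = "myself",
  password = "123456"  # example credentials; replace with your own
)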
Updating Database record - with append
dbWriteTable(
  con,
  "Example1",
  dataset_to_be_exported,
  overwrite = FALSE,
  append = TRUE,  # add the rows to the existing table
  row.names = FALSE
)
Updating Database record - with replace
dbWriteTable(
  con,
  "Example1",
  dataset_to_be_exported,
  overwrite = TRUE,  # drop the existing table and write it again
  row.names = FALSE
)
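To see the difference between appending and replacing without a running Postgres server, the sketch below uses an in-memory SQLite database as a stand-in. This is an assumption made purely for illustration: it requires the DBI and RSQLite packages, and the table content is made up. The dbWriteTable() arguments shown are part of the DBI interface.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for the Postgres server.
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")

demo_rows <- data.frame(ticker = c("AAA", "BBB"), shares = c(100L, 250L))

dbWriteTable(demo_con, "Example1", demo_rows)                    # creates the table
dbWriteTable(demo_con, "Example1", demo_rows, append = TRUE)     # adds the rows again
rows_after_append <- dbGetQuery(demo_con, 'SELECT COUNT(*) AS n FROM "Example1"')$n

dbWriteTable(demo_con, "Example1", demo_rows, overwrite = TRUE)  # replaces the table
rows_after_overwrite <- dbGetQuery(demo_con, 'SELECT COUNT(*) AS n FROM "Example1"')$n

print(c(append = rows_after_append, overwrite = rows_after_overwrite))

# Close the connection once the work is done.
dbDisconnect(demo_con)
```

Counting the rows after each write is a simple way to confirm which mode was actually applied.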
Internal data sources - saving and loading an environment
Internal data sources include anything that is in our current
environment. Assume that we have done some work on multiple dataframes,
and we do not want to save each of them individually in an Excel,
CSV or RDS file, because there could be 100 of them. Well, R allows us to
save the entire environment as it is, with all the variables,
dataframes, etc. that we have been working on. Let’s have a look at the
commands for that purpose.
Saving an environment
save.image(file='myEnvironment.RData')
Loading an environment
load('myEnvironment.RData')
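If only a few objects are worth keeping, save() with explicit object names is a lighter alternative to save.image(). The sketch below is a minimal illustration with made-up objects and a temporary file; loading into a separate environment is an optional precaution, not a requirement.

```r
# Hypothetical objects standing in for the results of a working session.
prices <- data.frame(ticker = c("AAA", "BBB"), price = c(12.5, 8.1))
notes  <- "cleaned on import"

tmp_rdata <- tempfile(fileext = ".RData")

# save() stores only the objects that are named; save.image() stores
# everything in the global environment.
save(prices, notes, file = tmp_rdata)

# Loading into a fresh environment avoids silently overwriting objects
# that already exist under the same names.
restored_env <- new.env()
loaded_names <- load(tmp_rdata, envir = restored_env)

print(loaded_names)                       # names of the restored objects
identical(restored_env$prices, prices)    # compares the restored copy
```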
That is it for importing and exporting data for today. We will now
move on to Data Exploration, which is a critical step in our Machine
Learning Pipeline.
