---
title: "R, from Zero to Hero: Data Management (Importing, loading, saving)"
author: "Frantz Moudoute"
output: html_notebook
---

Welcome to the second episode of our R from Zero to Hero journey. In this section we will look at how to get data onto our screen, in front of our eyes. In other words, we will see how to load or import data, and also how to save it once some processing has been completed.

# Data collection
There are various ways to get data onto our screens, depending on where the data comes from. We will briefly cover the most common ones.

## External data sources
We call them external because they are not currently located in our R environment, and not necessarily on our own machine.

### Loading data from a folder
In the early days of your R journey, you are most likely to load data from a local folder such as My Documents. We will look at various cases here, depending on the format of the file.

#### Loading data from an Excel file
```{r}
library(readxl)
my_dataframe_xlsx <- read_xlsx('./data/insider-purchases.xlsx')
print(head(my_dataframe_xlsx))
```


#### Loading data from a CSV file
```{r}
my_dataframe_csv <- read.csv('./data/insider-purchases.csv')
print(head(my_dataframe_csv))

```


#### Loading data from a PDF file
Depending on the industry or the subject that we are working on, we might have to import data from a PDF file. While this is not usually demonstrated in entry-level material, I chose to cover it here because you will not always have a say in the format of the data that is made available to you.
```{r}
library(tabulapdf)
path_to_file <- "data/insider-purchases-pdf.pdf"
# extract_tables() returns a list with one element per table it detects
my_dataframe_pdf <- extract_tables(path_to_file)
# TODO: come back here to handle tables that span multiple PDF pages
```



#### Loading data from an RDS file
The RDS format in R is a file format used to store a single R object, created using the `saveRDS()` function and read using the `readRDS()` function. The primary advantages of RDS are:

1. **Serialization of Single Objects**: RDS files store a single R object, making it simple to save and load specific data structures, models, or any R objects.
2. **Preservation of Object Structure**: RDS preserves the exact structure and attributes of the saved R object, ensuring consistency when reloaded.
3. **Compact Storage**: RDS files are often smaller due to internal compression, which saves disk space.
4. **Flexibility in Naming**: Unlike `.RData` files restored with `load()`, which bring objects back under their original names, `readRDS()` returns the object, so you can assign it to any name you like.
5. **Interoperability**: RDS files can be easily shared and used across different R sessions and environments, enhancing reproducibility.
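
To make points 2 and 4 more concrete, here is a minimal sketch that saves a small data frame to a temporary RDS file and reads it back under a different name. The data frame `purchases`, its columns and the temporary file are invented for this illustration and are not part of the course data.

```{r}
# Hypothetical example data, invented for this illustration
purchases <- data.frame(
  ticker = factor(c("AAA", "BBB", "AAA")),
  trade_date = as.Date(c("2020-05-29", "2020-05-30", "2020-05-31")),
  shares = c(100L, 250L, 75L)
)

# Write to a temporary file so nothing in the project folder is touched
rds_path <- tempfile(fileext = ".rds")
saveRDS(purchases, rds_path)

# readRDS() returns the object, so we are free to pick a new name
restored_purchases <- readRDS(rds_path)

# Inspect the column classes (factor, Date, integer) after the round trip
str(restored_purchases)
identical(purchases, restored_purchases)
```

A CSV file would bring the same columns back as plain text and numbers, so the factor and date types would have to be rebuilt by hand.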


```{r}
my_dataframe_rds <- readRDS('data/my_dataframe_rds.rds')
print(head(my_dataframe_rds))

```
### Loading data from a database

```{r}
# TODO: show how to read data from a database such as Postgres
```
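
Until this part is written, here is a rough sketch of what reading from a database can look like with the DBI package. To keep it runnable without a server, it uses an in-memory SQLite database (through the RSQLite package) as a stand-in, which is an assumption made for illustration only. With Postgres, the connection would be created with `RPostgres::Postgres()` instead. The table name `insider_purchases` and its content are made up.

```{r}
library(DBI)

# Stand-in connection: an in-memory SQLite database (assumption for illustration;
# a Postgres connection would be created with RPostgres::Postgres() instead)
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table, created here only so that there is something to read
dbWriteTable(
  demo_con,
  "insider_purchases",
  data.frame(ticker = c("AAA", "BBB"), shares = c(100, 250))
)

# Read a whole table into a data frame
my_dataframe_db <- dbReadTable(demo_con, "insider_purchases")

# Or run a query and bring back only the rows we need
large_purchases <- dbGetQuery(
  demo_con,
  "SELECT ticker, shares FROM insider_purchases WHERE shares > 150"
)

print(my_dataframe_db)
print(large_purchases)

# Close the connection when done
dbDisconnect(demo_con)
```

`dbReadTable()` is convenient for small tables, while `dbGetQuery()` lets the database do the filtering before the data reaches R.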

# Data Export
After completing certain tasks on our data, we will most likely need to keep some of the records, or all of them. For this purpose we will do the exact opposite of data collection: we will export our data, using the writing counterparts of the loading functions seen previously.
Before exporting, we will print a preview of the data as a sanity check. We will use the dataset that was previously loaded from an RDS file; this is an arbitrary choice.

```{r}
dataset_to_be_exported <- my_dataframe_rds

print(head(dataset_to_be_exported))

```
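
The export chunks in this section write into the `data/exports` folder. Functions such as `write.csv()` and `saveRDS()` fail when the target folder does not exist, so a small safeguard like the sketch below can help. The variable name `export_dir` is only an illustration.

```{r}
# Folder used by the export examples in this section
export_dir <- file.path("data", "exports")

# Create the folder (and any missing parent folder) only if it is not there yet
if (!dir.exists(export_dir)) {
  dir.create(export_dir, recursive = TRUE)
}

# Confirm that the folder is now available
dir.exists(export_dir)
```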


## Exporting to an Excel file
```{r}
xlsx::write.xlsx(dataset_to_be_exported, 'data/exports/dataset_to_be_exported.xlsx')

```
You will often want to keep several versions of an export, so it is worth learning now how to put today's date in the file name as a differentiator. We will rely on R's `paste()` function together with the date formatting function `format()`. Note that the same approach works for naming any file before saving it.

### Exporting to an Excel file with a date inside the name
```{r}

# Format today's date; %B is the full month name in the current locale,
# e.g. "2020-May-31" in an English locale
today_date_formatted <- format(Sys.Date(), format = "%Y-%B-%d")

# Create the file name string
file_name <- paste(today_date_formatted, "dataset_to_be_exported.xlsx", sep = " ")

# Build the full path of the export file
file_path_for_export <- file.path(getwd(), "data", "exports", file_name)

# Export the data to the dated file
xlsx::write.xlsx(dataset_to_be_exported, file_path_for_export)
```

So, what just happened? We used two functions that we had not seen before:

1. `Sys.Date()`: returns the current date of our machine. We could also use `Sys.time()`, which returns the date and time, for example "2020-05-31 14:49:47 HKT".
2. `getwd()`: returns the path of the current working directory, which we used as the starting point of the export path.
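
Here is a short sketch of these functions in isolation. It uses a fixed, made-up date so that the result is the same on every run, and the numeric month code `%m` rather than `%B`. That swap is my own suggestion, not the choice made in the chunk above: numeric months do not depend on the locale and make file names sort chronologically. All variable names are hypothetical.

```{r}
# Current date and time of the machine (these values change on every run)
Sys.Date()
Sys.time()

# Current working directory, i.e. the folder that relative paths start from
getwd()

# A fixed, made-up date so that the formatting below is reproducible
example_date <- as.Date("2020-05-31")

# %Y = four-digit year, %m = two-digit month, %d = two-digit day
example_stamp <- format(example_date, "%Y-%m-%d")

# paste0() glues the pieces together without any separator
example_file_name <- paste0(example_stamp, "_dataset_to_be_exported.xlsx")

# file.path() joins folder and file names with a path separator
example_path <- file.path(getwd(), "data", "exports", example_file_name)

print(example_file_name)
```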






## Exporting to a CSV file
```{r}
write.csv(dataset_to_be_exported, 'data/exports/dataset_to_be_exported.csv')
```
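
One detail worth knowing: by default `write.csv()` also writes the row names as an extra first column, which comes back as an additional column when the file is read again. The sketch below compares the default with `row.names = FALSE`, using a small made-up data frame and temporary files.

```{r}
# Hypothetical example data, invented for this illustration
small_df <- data.frame(ticker = c("AAA", "BBB"), shares = c(100, 250))

csv_with_row_names <- tempfile(fileext = ".csv")
csv_without_row_names <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as a first, unnamed column
write.csv(small_df, csv_with_row_names)

# row.names = FALSE writes only the actual columns
write.csv(small_df, csv_without_row_names, row.names = FALSE)

# Compare the column names after reading both files back
names(read.csv(csv_with_row_names))
names(read.csv(csv_without_row_names))
```

If the row names carry no information, `row.names = FALSE` keeps the exported file identical in shape to the data frame.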


## Exporting to an RDS file
```{r}
saveRDS(dataset_to_be_exported, 'data/exports/dataset_to_be_exported.rds')
```



## Exporting to a database
We will only sketch this section for now and come back to it later with the actual setup of a database. We still need to consider the best option for data exchange here.

### Database set up and connection
Once again, the following section assumes that you have already set up a database.
```{r}
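library(DBI)
library(RPostgres)

# Open a connection to the local Postgres database
# (in real projects, read the password from an environment variable
# with Sys.getenv() instead of hard-coding it)
con <-
  dbConnect(
    RPostgres::Postgres(),
    dbname = "market_data",
    host = "localhost",
    port = 5432,
    user = "myself",
    password = "123456"
  )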
```

### Updating Database record - with append
```{r}
# append = TRUE adds the new rows to the existing "Example1" table
dbWriteTable(
  con,
  "Example1",
  dataset_to_be_exported,
  append = TRUE,
  row.names = FALSE
)
```


### Updating Database record - with replace
```{r}
# overwrite = TRUE replaces the existing "Example1" table with the new data
dbWriteTable(
  con,
  "Example1",
  dataset_to_be_exported,
  overwrite = TRUE,
  row.names = FALSE
)
```
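
To see the difference between appending and replacing without a running Postgres server, the sketch below repeats both calls against an in-memory SQLite database (through the RSQLite package). The SQLite connection is a stand-in chosen for illustration, and the data frame `batch` is made up.

```{r}
library(DBI)

# Stand-in connection (assumption for illustration, not the Postgres setup above)
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical data, invented for this illustration
batch <- data.frame(ticker = c("AAA", "BBB"), shares = c(100, 250))

# The first write creates the table
dbWriteTable(demo_con, "Example1", batch)

# append = TRUE adds the rows to the existing table
dbWriteTable(demo_con, "Example1", batch, append = TRUE)
rows_after_append <- nrow(dbReadTable(demo_con, "Example1"))

# overwrite = TRUE drops the existing table and writes the data afresh
dbWriteTable(demo_con, "Example1", batch, overwrite = TRUE)
rows_after_overwrite <- nrow(dbReadTable(demo_con, "Example1"))

c(after_append = rows_after_append, after_overwrite = rows_after_overwrite)

# Close the connection when done
dbDisconnect(demo_con)
```

Appending the same batch twice duplicates its rows, so in practice an append is usually paired with a check on what is already stored.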


## Internal data sources - saving and loading an environment
Internal data sources cover anything that lives in our current environment. Assume that we have done some work on multiple data frames and do not want to save each of them individually to an Excel, CSV or RDS file, because there might be a hundred of them. R allows us to save the entire environment as it is, with all the variables, data frames, etc. that we have been working on. Let's have a look at the commands for that purpose.

### Saving an environment
```{r}
save.image(file='myHeroEnvironment.RData')
```


### Loading an environment
```{r}
	
load('myHeroEnvironment.RData')
```
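
`save.image()` stores everything in the workspace, and `load()` puts it all back, overwriting any object that has the same name. As a more selective variant, the sketch below uses `save()` to store only named objects and loads them into a separate environment. The objects, the temporary file and the variable names are invented for this illustration.

```{r}
# Hypothetical objects, invented for this illustration
prices <- data.frame(ticker = c("AAA", "BBB"), close = c(10.5, 20.25))
notes <- "cleaned on a previous run"

# save() stores only the objects we name, here in a temporary .RData file
rdata_path <- tempfile(fileext = ".RData")
save(prices, notes, file = rdata_path)

# Load into a fresh environment so that nothing in the workspace is overwritten
restored_env <- new.env()
loaded_names <- load(rdata_path, envir = restored_env)

# load() returns the names of the objects it restored
print(loaded_names)
identical(restored_env$prices, prices)
```
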
That is it for importing and exporting data for today. We will now move on to Data Exploration, which is a critical step in our Machine Learning pipeline.



