In R, we can read data from files stored outside the R environment. We can also write data into files that will be stored and accessed by the operating system. R can read and write into various file formats like csv, excel, txt,rds, xml, json, etc.
Before we start working with data (interface data), first make sure your working directory in the right connection. You can check it by using the getwd()
function. You can also set a new working directory using setwd()
function
## [1] "C:/Users/USER/Documents/SEMESTER 3/ALGORITMA/R Vanessa"
## [1] "C:/Users/USER/Documents/SEMESTER 3/ALGORITMA/R Vanessa"
Here is a simple example of read.csv() function to read a CSV file available in your current working directory
# csv <- read.csv(file.choose()) # interface csv file without `setwd`
csv1 <- read.csv("csv1.csv",sep = ",") # this is read data for comma separator
csv2 <- read.csv("csv2.csv",sep = ";") # this is read data for semicolon separator
head(csv2,3) # print the result of `data1`
## id name salary start_date dept
## 1 1 Julian 623,3 2022-01-01 DS
## 2 2 Vanessa 515,2 2022-09-23 DS
## 3 3 Jeffry 611 2022-11-15 BA
R can create CSV file from existing data frame. The write.csv()
function is used to create the CSV file. This file gets created in the working directory.
Microsoft Excel is the most widely used spreadsheet program which stores data in the .xls
or .xlsx
format. R can read directly from these files using some excel specific packages. We will be using readxl package.
#xlsx <- read_excel(file.choose()) # interface xlsx file without `setwd`
#install.packages("readxl") # install `readxl` packages
library("readxl") # load `readxl` packages
xlsx1<-read_excel("xlsx1.xlsx",sheet=1) # read/import data xlsx from PC
head(xlsx1,3)
## # A tibble: 3 x 5
## id name salary start_date dept
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 Julian 623 44562 DS
## 2 2 Vanessa 515 44827 DS
## 3 3 Jeffry 611 44880 BA
In order to write existing data frame into excel file you have to install writexl
package
One of the most common tasks we perform is reading in data from CSV and XLSX files. However, for large CSV or XLSX files, this can be slow. One neat trick is to read in the data and save as a TXT or an R binary file (RDS). To import in the TXT file, we useread.table()
and to import the RDS file we can use readRDS()
.
# txt1 <- read.table(file.choose()) # interface TXT file without `setwd`
txt1 <- read.table("txt1.txt") # read/load TXT format (notepad)
txt1 <- source("txt1.Rdmpd") # read/load TXT format (Rdmpd)
rds1 <- readRDS("rds1.rds") # read/load binary RDS format
ascii1 <- readRDS("ascii1.rds") # read/load binary ASCII format
To save as a TXT file we can use “write.table()” function, and for R binary file (RDS) we can use saveRDS() function. They are widely used by R itself, for example, to store metadata for a package and to store the help.search databases: the “.rds” file extension is most often used. This format can be binary or ASCII. Binary is more compact, while ASCII will be more efficient with version control systems like Git
data <- read.csv("csv1.csv",sep = ",") # read/import data from your PC
write.table(data,"txt2.txt") # save in a TXT format (notepad)
dump("data", "txt2.Rdmpd") # save in a TXT format (Rdmpd)
saveRDS(data, "rds2.rds") # save a single object in binary RDS format
saveRDS(data, "ascii2.rds", ascii=TRUE) # save a single object in ASCII RDS format
XML is a file format that shares both the file format and the data on the World Wide Web, intranets, and elsewhere using standard ASCII text. It stands for Extensible Markup Language (XML). Similar to HTML it contains markup tags. But unlike HTML where the markup tag describes the structure of the page, in XML the markup tags describe the meaning of the data contained in the file. You can read an XML file in R using the “XML” package. The XML file is read by R using the function xmlParse()
. It is stored as a list in R.
library("XML") # load the package required to read XML files
library("methods") # also load the other required package.
result <- xmlParse(file = "xml1.xml") # give the input file name to the function
print(result) # Print the result
## <?xml version="1.0"?>
## <RECORDS>
## <EMPLOYEE>
## <id>1</id>
## <name>Julian</name>
## <salary>623.3</salary>
## <start_date>1/1/2022</start_date>
## <dept>DS</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>2</id>
## <name>Vanessa</name>
## <salary>515.2</salary>
## <start_date>9/23/2022</start_date>
## <dept>DS</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>3</id>
## <name>Jeffry</name>
## <salary>611</salary>
## <start_date>11/15/2022</start_date>
## <dept>BA</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>4</id>
## <name>Angel</name>
## <salary>729</salary>
## <start_date>5/11/2022</start_date>
## <dept>BA</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>5</id>
## <name>Nikki</name>
## <salary>843.25</salary>
## <start_date>3/27/2022</start_date>
## <dept>DS</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>6</id>
## <name>Ardifo</name>
## <salary>578</salary>
## <start_date>5/21/2022</start_date>
## <dept>Actuaries</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>7</id>
## <name>Irene</name>
## <salary>722.5</salary>
## <start_date>7/30/2022</start_date>
## <dept>Actuaries</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>8</id>
## <name>Kefas</name>
## <salary>632.8</salary>
## <start_date>6/17/2022</start_date>
## <dept>CA</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>9</id>
## <name>Sherly</name>
## <salary>632.8</salary>
## <start_date>7/30/2022</start_date>
## <dept>DE</dept>
## </EMPLOYEE>
## <EMPLOYEE>
## <id>10</id>
## <name>Bakti</name>
## <salary>NA</salary>
## <start_date>9/03/2018</start_date>
## <dept>Lecturer</dept>
## </EMPLOYEE>
## </RECORDS>
##
To handle the data effectively in large files we read the data in the XML file as a data frame. Then process the data frame for data analysis.
xmldataframe <- xmlToDataFrame(result) # convert the input xml file to a data frame
head(xmldataframe,3) # Print the result
## id name salary start_date dept
## 1 1 Julian 623.3 1/1/2022 DS
## 2 2 Vanessa 515.2 9/23/2022 DS
## 3 3 Jeffry 611 11/15/2022 BA
Create a XML file by copying this data into a text editor like notepad. Save the file with a .xml
extension and choosing the file type as all files(.).
JSON file stores data as text in human-readable format. Json stands for JavaScript Object Notation. R can read JSON files using the rjson
package. more about JSON.
library("rjson") # load the package to read JSON files
json1 <- fromJSON(file= "json1.json") # give the input file name to the function
print(json1) # print the result
## $id
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
##
## $name
## [1] "Julian" "Vanessa" "Jeffry" "Angel" "Nikki" "Ardifo" "Irene"
## [8] "Kefas" "Sherly" "Bakti"
##
## $salary
## [1] "623.3" "515.2" "611" "729" "843.25" "578" "722.5" "632.8"
## [9] "632.8" "NA"
##
## $start_date
## [1] "1/1/2022" "9/23/2022" "11/15/2022" "5/11/2022" "3/27/2022"
## [6] "5/21/2022" "7/30/2022" "6/17/2022" "7/30/2022" "9/3/2018"
##
## $dept
## [1] "DS" "DS" "BA" "DA" "DS" "Actuaries"
## [7] "Actuaries" "CA" "DE" "Lecturer"
We can convert the extracted data above to a R data frame for further analysis using the as.data.frame()
function.
json_data_frame <- as.data.frame(json1) # convert JSON file to a data frame
head(json_data_frame,3) # print the result
## id name salary start_date dept
## 1 1 Julian 623.3 1/1/2022 DS
## 2 2 Vanessa 515.2 9/23/2022 DS
## 3 3 Jeffry 611 11/15/2022 BA
Many websites provide data for consumption by its users. Using R programs, we can programmatically extract specific data from such websites. In this section, I provide examples of how to import data from the GitHub repository, but you can do the same thing to other websites or repository.
web_csv <- read.csv("https://github.com/Bakti-Siregar/dataset/raw/master/Bookdown-Data-Science-for-Beginners/csv1.csv")
head(web_csv,3)
## id name salary start_date dept
## 1 1 Julian 623,3 1/1/2022 DS
## 2 2 Vanessa 515,2 9/23/2022 DS
## 3 3 Jeffry 611 11/15/2022 BA
## [1] TRUE
web_xlsx <-rio::import("https://github.com/Bakti-Siregar/dataset/blob/master/Bookdown-Data-Science-for-Beginners/xlsx1.xlsx?raw=true")
head(web_xlsx,3)
## id name salary start_date dept
## 1 1 Julian 623 2022-01-01 DS
## 2 2 Vanessa 515 2022-09-23 DS
## 3 3 Jeffry 611 2022-11-15 BA
web_txt <- read.table("type URL/Web.txt here") # read/load TXT format (notepad) from web
web_rds <- readRDS("type URL/Web.rds here") # read/load RDS format from web
web_ascii <- readRDS("type URL/Web.ascii here") # read/load ASCII format from web
web_xml<- xmlParse("type URL/Web.xml here") # read/load XML format from web
xmlToDataFrame(web_xml) # convert the input xml file to a data frame
web_json <- fromJSON("type URL/Web.json here") # read/load JSON format from web
as.data.frame(web_json) # convert JSON file to a data frame
The data is Relational database systems are stored in a normalized format. So, to carry out statistical computing we will need very advanced and complex SQL queries. But R can connect easily to many relational databases like MySql, Oracle, SQL Server, etc. and fetch records from them as a data frame. Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using all the powerful packages and functions.
Note: We will learn this section in the specific section called “Database System with R”
Web scraping is the process of using bots to extract content and data from a website. Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts underlying HTML code and, with it, data stored in a database. The scraper can then replicate entire website content elsewhere.
Note: We will learn this section in the specific section called “Applied Data Science”