As an introduction, I would like to take up an interesting and important topic in any data journey: data extraction. The main aim of this vignette is to understand the various methods involved in extracting data, a step that typically consumes 30%-40% of any data project. R provides various ways to read data for processing: it supports reading data from CSV files, Excel files, databases, other statistical tools, binary files, and websites.
Some of the most common file formats, and the functions that read them, are listed below. Note the various libraries loaded in the library section of the code blocks that follow; these are required to tell R how to invoke the corresponding commands. Spotting the naming pattern in these commands helps us write our data-reading scripts quickly:
read_:
csv : "," separated files
csv2 : ";" separated files
excel : Excel files (.xls, .xlsx)
table : ".txt" files; use the sep parameter, e.g. "\t" or ";"
delim : files with an arbitrary delimiter such as "," or "\t"
html : .html files
fromJSON : JSON files (from jsonlite; the one exception to the read_ pattern)
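To see the pattern in action, here is a minimal sketch (the file names are hypothetical placeholders):

library(readr)
# Comma-separated values
df1 <- read_csv("data.csv")
# Semicolon-separated values (common where "," is the decimal mark)
df2 <- read_csv2("data_semicolon.csv")
# Arbitrary delimiter, here tab-separated
df3 <- read_delim("data.tsv", delim = "\t")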
Let’s begin by applying some basic functions to read Excel files (.xls, .xlsx). This requires the readxl and tidyverse libraries to be installed. I have used only the basic required parameters for reading an Excel file; please refer to ?read_excel for further variations. The sample code is as follows:
library("readxl")
library("tidyverse")
#> -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
#> v ggplot2 3.3.0 v purrr 0.3.3
#> v tibble 2.1.3 v dplyr 0.8.5
#> v tidyr 1.0.2 v stringr 1.4.0
#> v readr 1.3.1 v forcats 0.5.0
#> -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
Economics_Df <- read_excel('Economics.xlsx', sheet = 'economics')
ggplot(data = Economics_Df, aes(x = date, y = pop)) +
geom_line(color = "#CC79A7", size = 2)

As most file formats fall into the CSV category, either comma separated or semicolon separated, read_csv is the most widely used way to read data. For better visibility, I then fit an rpart decision tree on the PimaIndiansDiabetes data (which originates from the mlbench package) and plot it.
PimaIndiansDiabetes <- read_csv('PimaIndiansDiabetes.csv', col_names = TRUE)
#> Parsed with column specification:
#> cols(
#> pregnant = col_double(),
#> glucose = col_double(),
#> pressure = col_double(),
#> triceps = col_double(),
#> insulin = col_double(),
#> mass = col_double(),
#> pedigree = col_double(),
#> age = col_double(),
#> diabetes = col_double()
#> )
#> Warning: 768 parsing failures.
#> row col expected actual file
#> 1 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> 2 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> 3 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> 4 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> 5 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> ... ... ......... .......... .........................
#> See problems(...) for more details.
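The warning above tells us the file carries one more column than the header declares, most likely a leading row-number column. A quick way to investigate and work around it (assuming the extra column really is row numbers) is:

# Inspect the rows that failed to parse
problems(PimaIndiansDiabetes)
# If the extra column is just row numbers, base R's read.csv
# can absorb it as row names instead:
PimaIndiansDiabetes <- read.csv('PimaIndiansDiabetes.csv', row.names = 1)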
library(rpart)
library(rpart.plot)
library(mlbench)
library(readr)
rpart_model <- rpart(diabetes ~ glucose + insulin + mass + age, data = PimaIndiansDiabetes)
rpart.plot(rpart_model)

Another function for reading a file in table format is read.table. read.table accepts .csv, .tsv, and .txt files, with 'file', 'header', and 'sep' as its main arguments. It generally consumes a huge amount of memory when reading large files unless the colClasses argument is set to the appropriate atomic vector classes.
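As a rough sketch of that memory hint, pre-declaring column classes lets read.table skip its type-guessing pass (the file name and classes below are illustrative assumptions):

# Declaring colClasses up front avoids read.table guessing types,
# which saves memory on large files
big_df <- read.table("big_file.csv", header = TRUE, sep = ",",
                     colClasses = c("integer", "numeric", "character"))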
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
Employee_OV <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt", header = TRUE)
head(Employee_OV)
#> X1 X6 a
#> 1 2 7 b
#> 2 3 8 c
#> 3 4 9 d
#> 4 5 10 e

As a solution for reading large files, data.table offers an efficient way to read memory-intensive files. It provides functions for various operations that are specifically designed for handling large volumes of data. Please refer to the data.table package for detailed information; sample code follows:
library(data.table)
Ontime_marketing <- fread("915738537_T_ONTIME_MARKETING.csv", header = TRUE)
nrow(Ontime_marketing)
#> [1] 638649
# Time taken by data.table to import
system.time({Ontime_marketing <-
fread("915738537_T_ONTIME_MARKETING.csv", header= TRUE)})
#> user system elapsed
#> 0.08 0.01 0.06

A few analyses have highlighted vroom as delivering big-data read performance even better than data.table's. For more information, please refer to vroom here.
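A minimal sketch of the same read via vroom, reusing the file from the fread example above:

library(vroom)
# vroom indexes the file lazily, so the initial read is very fast
Ontime_marketing <- vroom("915738537_T_ONTIME_MARKETING.csv")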
One of the simplest ways of web scraping is through read_html, which comes as part of the rvest package. If we need to extract table data from HTML (look out for the table class in the HTML nodes), then pipe the read_html result into html_table(), which easily picks up the table data from the website.
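For non-table elements, the same pipeline works with a CSS selector and html_text(); a brief sketch (the page and selector here are illustrative assumptions):

library(rvest)
# Pull all second-level section headings from a page
page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
headings <- page %>% html_nodes("h2") %>% html_text()
head(headings)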
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
url <- "https://en.wikipedia.org/w/index.php?title=2020_coronavirus_outbreak_in_the_United_States&oldid=944107102"
cases <- read_html(url) %>% html_nodes("table") %>% .[[5]] %>%
html_table(fill = TRUE)
head(cases)
#> Date West West West West West West West West West West West West West
#> 1 Date AK AZ CA CO HI ID MT NM NV OR UT WA WY
#> 2 Jan 21 1
#> 3 Jan 24
#> 4 Jan 25 1
#> 5 Jan 26 1 1
#> 6 Jan 30
#> Midwest Midwest Midwest Midwest Midwest Midwest Midwest Midwest Midwest
#> 1 IA IL IN KS MI MN MO ND NE
#> 2
#> 3 1
#> 4
#> 5
#> 6 1
#> Midwest Midwest Midwest Midwest South South South South South South South
#> 1 OH OK SD WI AL AR FL GA KY LA MS
#> 2
#> 3
#> 4
#> 5
#> 6
#> South South South South South South Northeast Northeast Northeast Northeast
#> 1 NC SC TN TX VA WV CT DC DE MA
#> 2
#> 3
#> 4
#> 5
#> 6
#> Northeast Northeast Northeast Northeast Northeast Northeast Northeast
#> 1 MD ME NH NJ NY PA RI
#> 2
#> 3
#> 4
#> 5
#> 6
#> Northeast Territories Territories Territories Territories Confirmed Confirmed
#> 1 VT GU MP PR VI New Cml
#> 2 1 1
#> 3 1 2
#> 4 1 3
#> 5 2 5
#> 6 1 6
#> Deaths Deaths Recovered Recovered
#> 1 New Cml New Cml
#> 2
#> 3
#> 4
#> 5
#> 6

Another interesting method for retrieving a widely known data format (JSON) is the fromJSON function, which belongs to the jsonlite package. I am not covering an API interface that fetches data from an endpoint here; instead, I will assume we already have an API endpoint and use fromJSON to retrieve it.
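Before touching an API, note that fromJSON can also parse a JSON string directly; a minimal sketch:

library(jsonlite)
# fromJSON converts a JSON string into native R objects
fromJSON('{"name": "widget", "price": 9.99, "in_stock": true}')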
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
article_key <- "&api-key=b75da00e12d54774a2d362adddcc9bef"
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=obamacare+socialism"
req <- fromJSON(paste0(url, article_key))
a <- req$response$docs
colnames(a)
#> [1] "abstract" "web_url" "snippet" "lead_paragraph"
#> [5] "print_section" "print_page" "source" "multimedia"
#> [9] "headline" "keywords" "pub_date" "document_type"
#> [13] "news_desk" "section_name" "byline" "type_of_material"
#> [17] "_id" "word_count" "uri" "subsection_name"
a$abstract
#> [1] "Once a program begins protecting elderly people from poverty or letting cancer patients receive treatment, the politics are transformed."
#> [2] "For more than a century, we’ve managed to think socialism is both a dead letter and the wave of the future — at the same time."
#> [3] "As the health care initiative proves to be a success, Republicans are rolling back attack ads, and Democrats are staying quiet."
#> [4] "The practical effect of the decision is likely to be months of delays, pushing the final outcome of the case beyond the 2020 election."
#> [5] "The logical conclusion of the Republicans’ efforts is to cut off people who need care."
#> [6] "Republicans spent almost seven years waging a battle to repeal the Affordable Care Act. Finally, they are set up for victory — or a new kind of disappointment."
#> [7] "The head of the company loved by Republicans praised the health reform law."
#> [8] "Bad ideas just keep coming back."
#> [9] "Socialism = cutting government spending."
#> [10] "Republicans will never find it."

One important organisation that tracks economic progress and world trade among its member countries is the OECD. The OECD provides an API to access some of its important statistics through the OECD SDMX-JSON API. However, using the API directly is quite a hard task. As a solution, R provides a great package, OECD, containing various functions that act as a bridge to the OECD statistics data. The snippet of R code below helps kick off our journey of extracting OECD data.
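Before pulling a specific dataset, the OECD package can also list and search the available dataset ids; a brief sketch (the search term is an illustrative assumption):

library("OECD")
# Browse the full catalogue of dataset ids and descriptions
datasets <- get_datasets()
# Search the catalogue for employment-protection datasets
search_dataset("employment protection", data = datasets)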
## Example: reading data from the OECD API
library("OECD")
df <- get_dataset("EPL_OV",
filter = list(c("DEU", "FRA"),
c("EPRC_V1", "EPRC_V2")),
start_time = 2008, end_time = 2008)
head(df)
#> # A tibble: 4 x 5
#> COUNTRY SERIES TIME_FORMAT obsTime obsValue
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 FRA EPRC_V1 P1Y 2008 2.47
#> 2 DEU EPRC_V1 P1Y 2008 2.68
#> 3 FRA EPRC_V2 P1Y 2008 2.73
#> 4 DEU EPRC_V2 P1Y 2008 2.95

Of course, a week of exploration in R (at least 3 hours every day) has given me only a glimpse of how far I have to go before I can start a decent R project. There are plenty of packages out there that provide better functionality and statistical and graphical interfaces. Long way to go!!!!
References:
Practical Big Data Analytics: Hands-On Techniques to Implement Enterprise Analytics and Machine Learning Using Hadoop, Spark, NoSQL and R, by Nataraj Dasgupta: Link
RDocumentation on read.table: Read.Table
CRAN package page for data.table: Data.table
rvest: Web Scraping with R: Rvest
More on read_html, using the Wikipedia article on the COVID-19 outbreak in the United States: Covid-19