As an introduction, I would like to take up an interesting and important topic in any data journey: data extraction. The main aim of this vignette is to understand the various methods involved in extracting data, a step that typically consumes 30%-40% of any data project. R provides various ways to read data for processing: it supports reading data from CSV files, Excel files, databases, other statistical tools, binary files, and websites.
Some of the most common file formats, and the functions that read them, are listed below. Note the various libraries loaded in the library section of the code blocks that follow; these are required to tell R how to invoke the corresponding commands. Spotting the naming pattern in these commands helps us write our data-reading scripts quickly:
read_:
csv : "," separated files
csv2 : ";" separated files
excel : Excel files (.xls, .xlsx)
table : ".txt" files; use the sep parameter, e.g. "\t" or ";"
delim : files with an arbitrary delimiter such as "," or "\t"
html : .html files
fromJSON : JSON files (from jsonlite; the one exception to the read_ pattern)
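To see the pattern in action, here is a minimal sketch (the file names are hypothetical placeholders):

library(readr)
# Comma-separated values
df1 <- read_csv("data.csv")
# Semicolon-separated values (common where "," is the decimal mark)
df2 <- read_csv2("data_semicolon.csv")
# Arbitrary delimiter, here tab-separated
df3 <- read_delim("data.tsv", delim = "\t")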
Let’s begin by applying some basic functions to read Excel files (.xls, .xlsx). This requires the readxl and tidyverse libraries to be installed. I have used only the basic required parameters for reading an Excel file; please refer to ?read_excel for further variations. The sample code is as follows:
library("readxl")
library("tidyverse")
#> -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
#> v ggplot2 3.3.0 v purrr 0.3.3
#> v tibble 2.1.3 v dplyr 0.8.5
#> v tidyr 1.0.2 v stringr 1.4.0
#> v readr 1.3.1 v forcats 0.5.0
#> -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
Economics_Df <- read_excel('Economics.xlsx', sheet = 'economics')
ggplot(data = Economics_Df, aes(x = date, y = pop)) +
geom_line(color = "#CC79A7", size = 2)

As most file formats fall into the CSV category, either comma separated or semicolon separated, read_csv is the most widely used way to read data. For better visibility, I then fit an rpart decision tree on the PimaIndiansDiabetes data (which originates from the mlbench package) and plot it.
PimaIndiansDiabetes <- read_csv('PimaIndiansDiabetes.csv', col_names = TRUE)
#> Parsed with column specification:
#> cols(
#> pregnant = col_double(),
#> glucose = col_double(),
#> pressure = col_double(),
#> triceps = col_double(),
#> insulin = col_double(),
#> mass = col_double(),
#> pedigree = col_double(),
#> age = col_double(),
#> diabetes = col_double()
#> )
#> Warning: 768 parsing failures.
#> row col expected actual file
#> 1 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> 2 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> 3 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> 4 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> 5 -- 9 columns 10 columns 'PimaIndiansDiabetes.csv'
#> ... ... ......... .......... .........................
#> See problems(...) for more details.
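The warning above tells us the file carries one more column than the header declares, most likely a leading row-number column. A quick way to investigate and work around it (assuming the extra column really is row numbers) is:

# Inspect the rows that failed to parse
problems(PimaIndiansDiabetes)
# If the extra column is just row numbers, base R's read.csv
# can absorb it as row names instead:
PimaIndiansDiabetes <- read.csv('PimaIndiansDiabetes.csv', row.names = 1)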
library(rpart)
library(rpart.plot)
library(mlbench)
library(readr)
rpart_model <- rpart(diabetes ~ glucose + insulin + mass + age, data = PimaIndiansDiabetes)
rpart.plot(rpart_model)

Another function for reading a file in table format is read.table. read.table accepts .csv, .tsv, and .txt files, with 'file', 'header', and 'sep' as its main arguments. It generally consumes a huge amount of memory when reading large files unless the colClasses argument is set to the appropriate atomic vector classes.
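As a rough sketch of that memory hint, pre-declaring column classes lets read.table skip its type-guessing pass (the file name and classes below are illustrative assumptions):

# Declaring colClasses up front avoids read.table guessing types,
# which saves memory on large files
big_df <- read.table("big_file.csv", header = TRUE, sep = ",",
                     colClasses = c("integer", "numeric", "character"))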
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
Employee_OV <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt", header = TRUE)
head(Employee_OV)
#> X1 X6 a
#> 1 2 7 b
#> 2 3 8 c
#> 3 4 9 d
#> 4 5 10 e

As a solution for reading large files, data.table offers an efficient way to read memory-intensive files. It provides functions for various operations that are specifically designed for handling large volumes of data. Please refer to the data.table package for detailed information; sample code follows:
library(data.table)
Ontime_marketing <- fread("915738537_T_ONTIME_MARKETING.csv", header = TRUE)
nrow(Ontime_marketing)
#> [1] 638649
# Time taken by data.table to import
system.time({Ontime_marketing <-
fread("915738537_T_ONTIME_MARKETING.csv", header= TRUE)})
#> user system elapsed
#> 0.08 0.01 0.06

A few analyses have highlighted vroom as delivering big-data read performance even better than data.table's. For more information, please refer to vroom here.
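A minimal sketch of the same read via vroom, reusing the file from the fread example above:

library(vroom)
# vroom indexes the file lazily, so the initial read is very fast
Ontime_marketing <- vroom("915738537_T_ONTIME_MARKETING.csv")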
One of the simplest ways of web scraping is through read_html, which comes as part of the rvest package. If we need to extract table data from HTML (look out for the table class in the HTML nodes), then pipe the read_html result into html_table(), which easily picks up the table data from the website.
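For non-table elements, the same pipeline works with a CSS selector and html_text(); a brief sketch (the page and selector here are illustrative assumptions):

library(rvest)
# Pull all second-level section headings from a page
page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
headings <- page %>% html_nodes("h2") %>% html_text()
head(headings)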
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
url <- "https://en.wikipedia.org/w/index.php?title=2020_coronavirus_outbreak_in_the_United_States&oldid=944107102"
cases <- read_html(url) %>% html_nodes("table") %>% .[[5]] %>%
html_table(fill = TRUE)
head(cases)
#> Date West West West West West West West West West West West West West
#> 1 Date AK AZ CA CO HI ID MT NM NV OR UT WA WY
#> 2 Jan 21 1
#> 3 Jan 24
#> 4 Jan 25 1
#> 5 Jan 26 1 1
#> 6 Jan 30
#> Midwest Midwest Midwest Midwest Midwest Midwest Midwest Midwest Midwest
#> 1 IA IL IN KS MI MN MO ND NE
#> 2
#> 3 1
#> 4
#> 5
#> 6 1
#> Midwest Midwest Midwest Midwest South South South South South South South
#> 1 OH OK SD WI AL AR FL GA KY LA MS
#> 2
#> 3
#> 4
#> 5
#> 6
#> South South South South South South Northeast Northeast Northeast Northeast
#> 1 NC SC TN TX VA WV CT DC DE MA
#> 2
#> 3
#> 4
#> 5
#> 6
#> Northeast Northeast Northeast Northeast Northeast Northeast Northeast
#> 1 MD ME NH NJ NY PA RI
#> 2
#> 3
#> 4
#> 5
#> 6
#> Northeast Territories Territories Territories Territories Confirmed Confirmed
#> 1 VT GU MP PR VI New Cml
#> 2 1 1
#> 3 1 2
#> 4 1 3
#> 5 2 5
#> 6 1 6
#> Deaths Deaths Recovered Recovered
#> 1 New Cml New Cml
#> 2
#> 3
#> 4
#> 5
#> 6

Another interesting method for retrieving a widely known data format (JSON) is the fromJSON function, which belongs to the jsonlite package. I am not covering an API interface that fetches data from an endpoint here; instead, I will assume we already have an API endpoint and use fromJSON to retrieve it.
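Before touching an API, note that fromJSON can also parse a JSON string directly; a minimal sketch:

library(jsonlite)
# fromJSON converts a JSON string into native R objects
fromJSON('{"name": "widget", "price": 9.99, "in_stock": true}')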
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
article_key <- "&api-key=b75da00e12d54774a2d362adddcc9bef"
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=obamacare+socialism"
req <- fromJSON(paste0(url, article_key))
a <- req$response$docs
colnames(a)
#> [1] "abstract" "web_url" "snippet" "lead_paragraph"
#> [5] "print_section" "print_page" "source" "multimedia"
#> [9] "headline" "keywords" "pub_date" "document_type"
#> [13] "news_desk" "section_name" "byline" "type_of_material"
#> [17] "_id" "word_count" "uri" "subsection_name"
a$abstract
#> [1] "Once a program begins protecting elderly people from poverty or letting cancer patients receive treatment, the politics are transformed."
#> [2] "For more than a century, we’ve managed to think socialism is both a dead letter and the wave of the future — at the same time."
#> [3] "As the health care initiative proves to be a success, Republicans are rolling back attack ads, and Democrats are staying quiet."
#> [4] "The practical effect of the decision is likely to be months of delays, pushing the final outcome of the case beyond the 2020 election."
#> [5] "The logical conclusion of the Republicans’ efforts is to cut off people who need care."
#> [6] "Republicans spent almost seven years waging a battle to repeal the Affordable Care Act. Finally, they are set up for victory — or a new kind of disappointment."
#> [7] "The head of the company loved by Republicans praised the health reform law."
#> [8] "Bad ideas just keep coming back."
#> [9] "Socialism = cutting government spending."
#> [10] "Republicans will never find it."

One important organisation that tracks economic progress and world trade among its member countries is the OECD. The OECD provides an API to access some of its important statistics through the OECD SDMX-JSON API. However, using the API directly is quite a hard task. As a solution, R provides a great package, OECD, containing various functions that act as a bridge to the OECD statistics data. The snippet of R code below helps kick off our journey of extracting OECD data.
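Before pulling a specific dataset, the OECD package can also list and search the available dataset ids; a brief sketch (the search term is an illustrative assumption):

library("OECD")
# Browse the full catalogue of dataset ids and descriptions
datasets <- get_datasets()
# Search the catalogue for employment-protection datasets
search_dataset("employment protection", data = datasets)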
## Example: reading data from the OECD API
library("OECD")
df <- get_dataset("EPL_OV",
filter = list(c("DEU", "FRA"),
c("EPRC_V1", "EPRC_V2")),
start_time = 2008, end_time = 2008)
head(df)
#> # A tibble: 4 x 5
#> COUNTRY SERIES TIME_FORMAT obsTime obsValue
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 FRA EPRC_V1 P1Y 2008 2.47
#> 2 DEU EPRC_V1 P1Y 2008 2.68
#> 3 FRA EPRC_V2 P1Y 2008 2.73
#> 4 DEU EPRC_V2 P1Y 2008 2.95

Of course, a week of exploration in R (at least 3 hours every day) has given me only a glimpse of how far I have to go before I can start a decent R project. There are plenty of packages out there that provide better functionality and statistical and graphical interfaces. Long way to go!!!!
References:
Practical Big Data Analytics: Hands-On Techniques to Implement Enterprise Analytics and Machine Learning Using Hadoop, Spark, NoSQL and R, by Nataraj Dasgupta: Link
RDocumentation on read.table: Read.Table
CRAN package page for data.table: Data.table
rvest: Web Scraping with R: Rvest
More on read_html, using the Wikipedia article on the COVID-19 outbreak in the United States: Covid-19