Introduction

One of the most frustrating tasks in analytics is getting the data into R! It often takes more time than the analysis itself.

While there are many packages that make the task less daunting, the most commonly used ones are highlighted in this document.

While going through the examples and the various packages mentioned, keep in mind that R, like most stats systems, isn’t geared toward manipulating large-scale data. If you do have a lot, and I mean a lot, of data, rather work with a database management system to help save your hairline and calm the nerves.

If you absolutely have to work with “biggish-data” in R, there are several packages developed in languages such as Java, Perl and Python that can be directly integrated with R code.

e.g. rJava, SJava, RSPerl, RSPython

Remember though that R, like S, comes from the Unix tradition of small re-usable tools, so it makes sense to do most of the data manipulation using tools such as “awk” and “perl” before importing into R (or after exporting from R).
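
For example, a minimal sketch (assuming a Unix-like system and a hypothetical space-delimited log.txt) that lets awk do the filtering before the data ever reaches R:

# pre-filter with awk through a pipe: keep only rows whose third field exceeds 100
dat <- read.table(pipe("awk '$3 > 100' log.txt"))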

Data Imports

At all times, if you can, get your data as a simple text file (small to medium scale).

However, “real life” ain’t so pretty: at some stage, you’ll end up having to deal with data in some proprietary format, e.g. an Excel spreadsheet or an SPSS file. If you can, go back and export the data as a text file (or get someone to do it for you…)!

If you can’t, fear not, there are other ways of getting the data into R. While I won’t go into detail, there’s a lot of information on the web and in the R docs to help you out. Links to these are given at the end of this document.

The next most common issue you’ll probably face is with data encoding.

Encodings

Unless data is entirely ASCII, we need to figure out how it’s encoded, to help R out.

Some examples of what the file utility reports for different files:

  • text.Rd: UTF-8 Unicode English text

  • text2.dat: ISO-8859 English text

  • text3.dat: Little-endian UTF-16 Unicode English character data, with CRLF line terminators

  • intro.dat: UTF-8 Unicode text

  • intro.dat: UTF-8 Unicode (with BOM) text

Unix-like systems produce UTF-8 files, while Windows’ ‘Unicode’ is UCS-2LE or UTF-16LE. If the data’s from a Chinese/Japanese/Korean locale, it may well not be standardised and could be in some other 8-bit or multi-byte encoding.
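
If you know (or have worked out) the encoding, tell R about it when reading. A small sketch, reusing the hypothetical text2.dat from the list above:

# base R: open a connection that declares the encoding, then read through it
con <- file("text2.dat", encoding = "ISO-8859-1")
lines_base <- readLines(con)
close(con)
# readr: declare the encoding through a locale
library(readr)
lines_readr <- read_lines("text2.dat", locale = locale(encoding = "ISO-8859-1"))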

Windows Systems

If you’re dealing with data that comes from Windows-originated files, you’ll have a bit more work on your hands. Windows-based files with UCS-2/UTF-16 encoding have associated ‘BOMs’ (Byte Order Marks). These can cause problems, as some utilities and read functions choke on (or mis-handle) a file that starts with a BOM.

In this case, save yourself some trouble and revert to a command-line utility or a hex editor to work out the encoding.

Another typical issue encountered when running R on a Windows-based system is one where R says it cannot open the file, even though you know it exists!

What’s causing this? Probably backslashes in the file path.

Solution: change the backslashes to forward slashes or to double backslashes, e.g.:

  • path_to_file_original <- "C:\data\sample.txt" (this one fails)

  • path_to_file_forward_slashes <- "C:/data/sample.txt"

  • path_to_file_double_backslashes <- "C:\\data\\sample.txt"
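
A portable alternative (a small sketch, file name illustrative) is to build the path with file.path( ), which joins components with forward slashes:

path_to_file <- file.path("C:", "data", "sample.txt")   # gives "C:/data/sample.txt"
records <- read.csv(path_to_file)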

Best Practices for Reading Data into R:

While R and the packages we use to read data into R take a “best guess” at data types, when dealing with large data sets it’s best to have a strategy for pinning down the column types.

The standard practice is outlined below:

  1. Read in a small amount of data, letting the package take a guess at the column types,

  2. Call str(data) on the data read in,

  3. Check the column types returned,

  4. Create a vector for the correct/desired variable type,

  5. Pass that in as an argument to the read method.

Reading the data in with the correct types from the start speeds things up, because R doesn’t need to guess the format, and it reduces the amount of in-memory work further down the wrangling pipeline.
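
As a quick illustration of this strategy, here is a minimal base-R sketch (the file name and column classes are purely illustrative):

sample_rows <- read.csv("data.csv", nrows = 100)              # 1. read a small slice, letting R guess
str(sample_rows)                                              # 2./3. inspect the guessed column types
col_classes <- c("character", "numeric", "Date")              # 4. the desired column types
full_data <- read.csv("data.csv", colClasses = col_classes)   # 5. pass them to the read call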

Reading Data into R: Specific Formats

Comma, Tab Separated Files

These are amongst the “friendliest” files to read into R. For .csv files, we often use read.csv( ), although for improved speed you may want to consider readr (1.0.0 or later).

Column Guessing

Letting R take a guess, using a limited number of entries at first:

install.packages("readr")
## 
## The downloaded binary packages are in
##  /var/folders/x1/4tdr7dhx0pbfxvy01l8zfnfc0000gn/T//RtmpbVnXul/downloaded_packages
library(readr)
challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
## Parsed with column specification:
## cols(
##   x = col_double(),
##   y = col_date(format = "")
## )
head(challenge)
## # A tibble: 6 x 2
##       x      y
##   <dbl> <date>
## 1   404   <NA>
## 2  4172   <NA>
## 3  3004   <NA>
## 4   787   <NA>
## 5    37   <NA>
## 6  2332   <NA>

To suppress the printed spec, add col_types = cols( ) to the call. We can use guess_parser( ) to see which parser readr is going to select:

guess_parser("1,234")
## [1] "number"

Previously Guessed as Numbers:

guess_parser(c(".", "-"))
## [1] "character"
guess_parser(c("10E", "20N"))
## [1] "character"

Parsing Date-Time

For time-of-day values, the hms package (installed from GitHub below) provides parsing and conversion helpers:

install.packages("curl", quiet = TRUE)
library(curl)
## 
## Attaching package: 'curl'
## The following object is masked from 'package:readr':
## 
##     parse_date
install.packages("devtools", quiet = TRUE)
library(devtools)
devtools::install_github("rstats-db/hms")
## Skipping install of 'hms' from a github remote, the SHA1 (98d14cf5) has not changed since last install.
##   Use `force = TRUE` to force installation
library(hms)
hms(56, 34, 12)
## 12:34:56
as.hms(1)
## 00:00:01
as.hms("12:34:56")
## 12:34:56
as.hms(Sys.time())
## 23:18:54.006723
as.POSIXct(hms(1))
## [1] "1970-01-01 00:00:01 UTC"

Example Using mtcars:

mtcars2 <- read_csv(readr_example("mtcars.csv"))
## Parsed with column specification:
## cols(
##   mpg = col_double(),
##   cyl = col_integer(),
##   disp = col_double(),
##   hp = col_integer(),
##   drat = col_double(),
##   wt = col_double(),
##   qsec = col_double(),
##   vs = col_integer(),
##   am = col_integer(),
##   gear = col_integer(),
##   carb = col_integer()
## )

Given the output, we now figure out the correct/desired column types and make parsing strict.

Method: copy-paste the output or save the spec to disk:

mtcars_spec <- write_rds(spec(mtcars2), "mtcars2-spec.rds")

With every load thereafter, pass this into the read function. If using this in production, you can throw an error if there are any problems with parsing:

mtcars2 <- as.data.frame(read_csv(readr_example("mtcars.csv"), col_types = read_rds("mtcars2-spec.rds")))
stop_for_problems(mtcars2)
head(mtcars2)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Low-level Readers and Writers

In readr:

  • read_file( ): reads file into length-1 character vector

  • read_file_raw( ): reads file into single raw vector

  • read_lines( ): reads file into character vector with one entry per line

  • read_lines_raw( ): reads file into a list of raw vectors with one entry per line.

These are paired with write_lines( ) and write_file( ).

Other methods in readr:

  • read_fwf( ): can reliably read just a subset of columns; ideal for files with ragged final columns (set the final position/width to NA) and for skipping comments (set the comment argument)

  • read_csv_chunked( ), read_lines_chunked( ): an experimental API for reading a file in chunks, allowing you to work with files bigger than memory (a sketch follows below). NB: the API is not finalised yet, so use with caution.
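
A hedged sketch of chunked reading (the per-chunk filter is illustrative, and the API may have changed since this was written):

library(readr)
# hypothetical per-chunk filter: keep the fuel-efficient rows from each chunk
keep_efficient <- function(chunk, pos) chunk[chunk$mpg > 20, ]
efficient <- read_csv_chunked(readr_example("mtcars.csv"),
                              DataFrameCallback$new(keep_efficient),
                              chunk_size = 10)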

Fixed-Width-Format Files

Fixed-width format is exactly that: the data items occur at fixed column boundaries. For these, we use read.fwf( ). readr’s read_fwf( ) is reported to reliably read just a subset of columns and is well suited to files with ragged final columns (set the final position/width to NA) and to skipping comments (set the comment argument).

e.g. records <- read.fwf("filename", widths=c(w1, w2, ..., wn)), where wi is the width of column i and there are n columns in the data.

To read in data from a fixed-width text file, we need to know how wide the columns are.

fwftext_file <- "/Users/Tina/Projects/R_bootcamp_intermediate/fixed-width.txt"
records <- read.fwf(fwftext_file, widths=c(8,9,4,-1,4))
records
##         V1        V2   V3   V4
## 1 Fisher   R.A.      1890 1962
## 2 Pearson  Karl      1857 1936
## 3 Cox      Gertrude  1900 1978
## 4 Yates    Frank     1902 1994
## 5 Smith    Krisine   1878 1939

Here we have specified the first, second, third and last columns to contain 8, 9, 4 and 4 characters respectively; the -1 in the widths vector skips the single space between the years, since we don’t need it. Column names can be set at read time via col.names, just as with any other data frame/table:

records_renamed <- read.fwf(fwftext_file, widths=c(8,9,4,-1,4), col.names=c("Last", "First", "Born", "Died"))
records_renamed
##       Last     First Born Died
## 1 Fisher   R.A.      1890 1962
## 2 Pearson  Karl      1857 1936
## 3 Cox      Gertrude  1900 1978
## 4 Yates    Frank     1902 1994
## 5 Smith    Krisine   1878 1939
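
For comparison, a hedged readr sketch of the same read with read_fwf( ); here the NA width marks the ragged final column, and the one-character gap is simply absorbed into the Born field and trimmed during parsing:

library(readr)
records_readr <- read_fwf(fwftext_file,
                          fwf_widths(c(8, 9, 5, NA), c("Last", "First", "Born", "Died")))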

Reading in Tabular Data

Tabular data is data in a file with a simple format:

  1. Each line contains one record

  2. In each record, the fields are separated by a one-character delimiter (e.g. space, tab, comma, colon, semi-colon etc.).

  3. Each record has the same number of fields

Although similar to data in csv and tsv form, tabular data’s more free-form because the fields aren’t necessarily aligned by position. We use read.table( ) to read the data in as a table rather than row by row.

tabular_file <- "/Users/Tina/Projects/R_bootcamp_intermediate/statisticians.txt"
dfstats <- read.table(tabular_file)
print(dfstats)
##        V1       V2   V3   V4
## 1  Fisher     R.A. 1890 1962
## 2 Pearson     Karl 1857 1936
## 3     Cox Gertrude 1900 1978
## 4   Yates    Frank 1902 1994
## 5   Smith  Krisine 1878 1939

Other delimiters are dealt with by specifying them via the sep argument to read.table( ):

tabular_col_file <- "/Users/Tina/Projects/R_bootcamp_intermediate/statisticians_col.txt"
dfstats_col <- read.table(tabular_col_file, sep=":")
print(dfstats_col)
##        V1       V2   V3   V4
## 1  Fisher     R.A. 1890 1962
## 2 Pearson     Karl 1857 1936
## 3     Cox Gertrude 1900 1978
## 4   Yates    Frank 1902 1994
## 5   Smith  Krisine 1878 1939
str(dfstats_col)
## 'data.frame':    5 obs. of  4 variables:
##  $ V1: Factor w/ 5 levels "Cox","Fisher",..: 2 3 1 5 4
##  $ V2: Factor w/ 5 levels "Frank","Gertrude",..: 5 3 2 1 4
##  $ V3: int  1890 1857 1900 1902 1878
##  $ V4: int  1962 1936 1978 1994 1939

String Interpretation

Be careful with read.table( ): by default, it interprets strings as factors. Change this behavior by passing stringsAsFactors=FALSE.

tabular_col_file <- "/Users/Tina/Projects/R_bootcamp_intermediate/statisticians_col.txt"
dfstats_col_strings <- read.table(tabular_col_file, sep=":", stringsAsFactors = FALSE)
str(dfstats_col_strings)
## 'data.frame':    5 obs. of  4 variables:
##  $ V1: chr  "Fisher" "Pearson" "Cox" "Yates" ...
##  $ V2: chr  "R.A." "Karl" "Gertrude" "Frank" ...
##  $ V3: int  1890 1857 1900 1902 1878
##  $ V4: int  1962 1936 1978 1994 1939

Dealing with Missing Values

By default, read.table( ) interprets any “NA” string as NA. If missing values are indicated by some other string (e.g. SAS uses “.”), specify it via the na.strings argument:

mdfmissing <- "/Users/Tina/Projects/R_bootcamp_intermediate/SAS_format.txt"
dfmissing1 <- read.table(mdfmissing)
dfmissing <- read.table(mdfmissing, na.strings=".")
print(dfmissing1)
##       V1       V2   V3   V4
## 1 Fisher       RA 1890 1962
## 2      .     Karl 1857 1936
## 3    Cox Gertrude 1900 1978
## 4  Yates    Frank 1902    .
## 5  Smith  Krisine 1878 1939
print(dfmissing)
##       V1       V2   V3   V4
## 1 Fisher       RA 1890 1962
## 2   <NA>     Karl 1857 1936
## 3    Cox Gertrude 1900 1978
## 4  Yates    Frank 1902   NA
## 5  Smith  Krisine 1878 1939

Other useful functionality in read.table( ): if the data has a header line, you can read it in with the header=TRUE argument. In addition, read.table( ) ignores comment lines (assuming they begin with #).

file_with_header_and_comments <- "/Users/Tina/Projects/R_bootcamp_intermediate/head_comment.txt"
read.table(file_with_header_and_comments, header=TRUE, stringsAsFactors=FALSE)
##   Firstname Lastname Born Died
## 1    Fisher     R.A. 1890 1962
## 2   Pearson     Karl 1857 1936
## 3       Cox Gertrude 1900 1978
## 4     Yates    Frank 1902 1994
## 5     Smith  Krisine 1878 1939

Using Connections to Files

Output to Connections

zz <- file("ex.data", "w")  
# open an output file connection
cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
cat("One more line\n", file = zz)
close(zz)

Convert the decimal point to a comma in the output, using a pipe (Unix). Both R strings and (probably) the shell need the backslash doubled.

zz <- pipe(paste("sed s/\\\\./,/ >", "outfile"), "w")
cat(format(round(rnorm(100), 4)), sep = "\n", file = zz)
close(zz)

Now look at the output file with file.show("outfile", delete.file = TRUE); then capture some R output into a text connection:

zz <- textConnection("ex.lm.out", "w")
sink(zz)
example(lm)   # run something that produces output to capture
sink()
close(zz)
## now 'ex.lm.out' contains the output for further processing.
## Look at it by, e.g.,
cat(ex.lm.out, sep = "\n")

Input from Connections

## read in file created in last examples
readLines("ex.data")
## [1] "TITLE extra line" "2 3 5 7"          ""                
## [4] "11 13 17"         "One more line"
unlink("ex.data")
## read listing of current directory (Unix)
readLines(pipe("ls -1"))
##  [1] "Orders-updated_Aug_15.ods"                    
##  [2] "Orders_as_of_Aug_11.ods"                      
##  [3] "Orders_as_of_Aug_15.csv"                      
##  [4] "Orders_as_of_Aug_15_unchecked_against_new.ods"
##  [5] "Orders_as_of_Aug_8.xls"                       
##  [6] "Reading_html_tables_from_web.Rmd"             
##  [7] "SAS_format.txt"                               
##  [8] "Untitled.Rmd"                                 
##  [9] "attendees-26913915245.pdf"                    
## [10] "attendees-Aug15.pdf"                          
## [11] "fixed-width.txt"                              
## [12] "head_comment.txt"                             
## [13] "mtcars2-spec.rds"                             
## [14] "outfile"                                      
## [15] "reading_data_into_r.Rmd"                      
## [16] "reading_data_into_r.html"                     
## [17] "statisticians.txt"                            
## [18] "statisticians_col.txt"                        
## [19] "statisticians_missing.txt"
# remove trailing commas from an input file.

Suppose we are given a file ‘data’ containing the values 450, 390, 467, 654, 30, 542, 334, 432, 421, 357, 497, 493, 550, 549, 467, 575, 578, 342, 446, 547, 534, 495, 979, 479, with trailing commas at the ends of the lines. Then read it with the call below (the numeric(0) output simply reflects that the file isn’t present in this working directory):

scan(pipe("sed -e s/,$// data"), sep=",")
## numeric(0)

Pushback

Pushback lets you push lines of character data back onto a connection, so that the next read from that connection picks them up first.

zz <- textConnection(LETTERS)
readLines(zz, 2)
## [1] "A" "B"
scan(zz, "", 4)
## [1] "C" "D" "E" "F"
pushBack(c("aa", "bb"), zz)
scan(zz, "", 4)
## [1] "aa" "bb" "G"  "H"
close(zz)

Reading Data from a Database

R can read data from several relational database management systems (RDBMSs):

  1. MySQL

  2. SQLite

  3. Oracle

  4. Sybase

  5. PostgreSQL

Only MySQL is covered here, briefly and without a live connection to a database.

Package: RMySQL. The code in the block below is not live and assumes you’ve got the following parameters in the [client] section, so your config file looks something like:

[client]
user = userid
password = password
host = hostname

#all the code in this block is commented out on purpose.
#library(RMySQL)
#con <- dbConnect(MySQL(), client.flag=CLIENT_MULTI_RESULTS)

If your [client] config file doesn’t include the parameters, you can still connect as follows:

#all the code in this block is commented out on purpose.
#library(RMySQL)
#con <- dbConnect(MySQL(), user="userid", password="pswd", host="hostname", client.flag=CLIENT_MULTI_RESULTS)

You need an open database connection to submit your SQL to the db and read the result:

#sql <- "SELECT * from SurveyResults WHERE City = 'Chicago'"
#rows <- dbGetQuery(con, sql)

You construct your own SQL queries; any valid query works with RMySQL. Results are packaged into a data frame, making it easy to work with the data from there. For more information, see the RMySQL docs.
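
Putting it together, a hedged sketch of the full round trip (all connection details, including the dbname, are placeholders, and the code stays commented out because it needs a reachable MySQL server):

#library(RMySQL)
#con <- dbConnect(MySQL(), user="userid", password="pswd", host="hostname", dbname="surveydb")
#rows <- dbGetQuery(con, "SELECT * from SurveyResults WHERE City = 'Chicago'")
#dbDisconnect(con)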

Reading Data from the Web: HTML Tables

The web has a host of rich data sets and data that you probably don’t want to have to select, copy and paste from. In this section, I’ll briefly demonstrate how to get data from an html table.

The htmltab package is what seems to work for me, although there are many other approaches described in detail on Stack Overflow.

Example:

Reading a table from a web page:

install.packages("htmltab")
## 
## The downloaded binary packages are in
##  /var/folders/x1/4tdr7dhx0pbfxvy01l8zfnfc0000gn/T//RtmpbVnXul/downloaded_packages
install.packages("tidyr")
## 
## The downloaded binary packages are in
##  /var/folders/x1/4tdr7dhx0pbfxvy01l8zfnfc0000gn/T//RtmpbVnXul/downloaded_packages
library(htmltab)
library(tidyr)
url <- "http://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom"
ukLang <- htmltab(doc = url, which = "//th[text() = 'Ability']/ancestor::table")
str(ukLang)
## 'data.frame':    10 obs. of  11 variables:
##  $ Ability                                   : chr  "Understands but does not speak, read or write" "Speaks, reads and writes" "Speaks but does not read or write" "Speaks and reads but does not write" ...
##  $ Wales >> Welsh >> Number                  : chr  "157,792" "430,717" "80,429" "45,524" ...
##  $ Wales >> Welsh >> %                       : chr  "5.15%" "14.06%" "2.63%" "1.49%" ...
##  $ Scotland >> Scottish Gaelic >> Number     : chr  "23,357" "32,191" "18,966" "6,218" ...
##  $ Scotland >> Scottish Gaelic >> %          : chr  "0.46%" "0.63%" "0.37%" "0.12%" ...
##  $ Scotland >> Scots >> Number               : chr  "267,412" "1,225,622" "179,295" "132,709" ...
##  $ Scotland >> Scots >> %                    : chr  "5.22%" "23.95%" "3.50%" "2.59%" ...
##  $ Northern Ireland >> Irish >> Number       : chr  "70,501" "71,996" "24,677" "7,414" ...
##  $ Northern Ireland >> Irish >> %            : chr  "4.06%" "4.15%" "1.42%" "0.43%" ...
##  $ Northern Ireland >> Ulster-Scots >> Number: chr  "92,040" "17,228" "10,265" "7,801" ...
##  $ Northern Ireland >> Ulster-Scots >> %     : chr  "5.30%" "0.99%" "0.59%" "0.45%" ...
head(ukLang)
##                                         Ability Wales >> Welsh >> Number
## 4 Understands but does not speak, read or write                  157,792
## 5                      Speaks, reads and writes                  430,717
## 6             Speaks but does not read or write                   80,429
## 7           Speaks and reads but does not write                   45,524
## 8             Reads but does not speak or write                   44,327
## 9                   Other combination of skills                   40,692
##   Wales >> Welsh >> % Scotland >> Scottish Gaelic >> Number
## 4               5.15%                                23,357
## 5              14.06%                                32,191
## 6               2.63%                                18,966
## 7               1.49%                                 6,218
## 8               1.45%                                 4,646
## 9               1.33%                                 1,678
##   Scotland >> Scottish Gaelic >> % Scotland >> Scots >> Number
## 4                            0.46%                     267,412
## 5                            0.63%                   1,225,622
## 6                            0.37%                     179,295
## 7                            0.12%                     132,709
## 8                            0.09%                     107,025
## 9                            0.03%                      17,381
##   Scotland >> Scots >> % Northern Ireland >> Irish >> Number
## 4                  5.22%                              70,501
## 5                 23.95%                              71,996
## 6                  3.50%                              24,677
## 7                  2.59%                               7,414
## 8                  2.09%                               5,659
## 9                  0.34%                               4,651
##   Northern Ireland >> Irish >> %
## 4                          4.06%
## 5                          4.15%
## 6                          1.42%
## 7                          0.43%
## 8                          0.33%
## 9                          0.27%
##   Northern Ireland >> Ulster-Scots >> Number
## 4                                     92,040
## 5                                     17,228
## 6                                     10,265
## 7                                      7,801
## 8                                     11,911
## 9                                        959
##   Northern Ireland >> Ulster-Scots >> %
## 4                                 5.30%
## 5                                 0.99%
## 6                                 0.59%
## 7                                 0.45%
## 8                                 0.69%
## 9                                 0.06%

The header information has been recast into a format that respects the hierarchical order of the variables and yet only spans a single line in the R table.

Using functionality from the tidyr package, the next couple of data cleaning steps may be the following:

ukLang <- gather(ukLang, key, value, -Ability)
ukLang <- separate(ukLang, key, into = c("region", "language", "statistic"), sep = " >> ")
head(ukLang)
##                                         Ability region language statistic
## 1 Understands but does not speak, read or write  Wales    Welsh    Number
## 2                      Speaks, reads and writes  Wales    Welsh    Number
## 3             Speaks but does not read or write  Wales    Welsh    Number
## 4           Speaks and reads but does not write  Wales    Welsh    Number
## 5             Reads but does not speak or write  Wales    Welsh    Number
## 6                   Other combination of skills  Wales    Welsh    Number
##     value
## 1 157,792
## 2 430,717
## 3  80,429
## 4  45,524
## 5  44,327
## 6  40,692

More Control Over Value Conversion

Arguments to htmltab( ):

  • body

  • header

  • bodyFun

  • headerFun

  • rm_escape

  • rm_footnotes

  • rm_superscript

  • rm_nodata_cols

  • rm_invisible

  • rm_whitespace

Investigating Arguments:

htmltab employs heuristics for body and header element identification but they are no guarantee, hence the need to specify these.

url2 <- "http://en.wikipedia.org/wiki/New_Zealand_general_election,_2002"
xp <- "//caption[starts-with(text(), 'Electorate results')]/ancestor::table"
body_xp <- "//tr[./td[not(@colspan = '10')]]"

nz1 <- htmltab(doc = url2, which = xp, body = body_xp, encoding = "UTF-8")
## Warning: Columns [Incumbent] seem to have no data and are removed. Use
## rm_nodata_cols = F to suppress this behavior

The code will throw the following warning:

“Warning: Columns [Incumbent] seem to have no data and are removed.”

We use rm_nodata_cols = F to suppress this behavior:

nz1 <- htmltab(doc = url2, which = xp, body = body_xp, encoding = "UTF-8", rm_nodata_cols = FALSE)

Using Table Information that Intercepts Body Rows

Previously, we discarded two intercepting rows in the body that give variable/header information. What if we need these?

nz2 <- htmltab(doc = url2, which = xp, header = 1 + "//tr/td[@colspan = '10']", body = "//tr[./td[not(@colspan = '10')]]", encoding = "UTF-8")
## Warning: Columns [Incumbent] seem to have no data and are removed. Use
## rm_nodata_cols = F to suppress this behavior
tail(nz2)
##             Header_1      Electorate        Incumbent        Winner
## 67 Māori electorates          Tainui (new electorate)          <NA>
## 68 Māori electorates Tamaki Makaurau (new electorate)          <NA>
## 69 Māori electorates  Te Tai Hauāuru    Tariana Turia Tariana Turia
## 70 Māori electorates  Te Tai Tokerau    Dover Samuels Dover Samuels
## 71 Māori electorates    Te Tai Tonga    Mahara Okeroa Mahara Okeroa
## 72 Māori electorates        Waiariki     Mita Ririnui  Mita Ririnui
##         Winner.1 Majority                  Runner up
## 67 Nanaia Mahuta    3,430  Willie Jackson (Alliance)
## 68 John Tamihere    9,444      Metiria Turei (Green)
## 69 Tariana Turia    8,657      Ken Mair (Mana Māori)
## 70 Dover Samuels    5,336   Mere Mangu (Independent)
## 71 Mahara Okeroa    8,052 Bill Karaitiana (National)
## 72  Mita Ririnui    6,707   Rihi Vercoe (Mana Māori)
##                   Runner up.1                     Third place
## 67  Willie Jackson (Alliance) Angeline Greensill (Mana Māori)
## 68      Metiria Turei (Green)        George Ngatai (National)
## 69      Ken Mair (Mana Māori)           Greg White (National)
## 70   Mere Mangu (Independent)        Naida Glavish (Alliance)
## 71 Bill Karaitiana (National)    Jacqui Amohanga (Mana Māori)
## 72   Rihi Vercoe (Mana Māori)     Hamuera Mitchell (National)
##                      Third place.1
## 67 Angeline Greensill (Mana Māori)
## 68        George Ngatai (National)
## 69           Greg White (National)
## 70        Naida Glavish (Alliance)
## 71    Jacqui Amohanga (Mana Māori)
## 72     Hamuera Mitchell (National)
nz3 <- htmltab(doc = url2, which = xp, header = 1 + "//tr/td[@colspan = '10']", body = "//tr[./td[not(@colspan = '10')]]", encoding = "UTF-8", rm_nodata_cols = F)
tail(nz3)
##             Header_1      Electorate        Incumbent        Incumbent
## 67 Māori electorates          Tainui (new electorate) (new electorate)
## 68 Māori electorates Tamaki Makaurau (new electorate) (new electorate)
## 69 Māori electorates  Te Tai Hauāuru             <NA>    Tariana Turia
## 70 Māori electorates  Te Tai Tokerau             <NA>    Dover Samuels
## 71 Māori electorates    Te Tai Tonga             <NA>    Mahara Okeroa
## 72 Māori electorates        Waiariki             <NA>     Mita Ririnui
##           Winner        Winner Majority                  Runner up
## 67          <NA> Nanaia Mahuta    3,430  Willie Jackson (Alliance)
## 68          <NA> John Tamihere    9,444      Metiria Turei (Green)
## 69 Tariana Turia Tariana Turia    8,657      Ken Mair (Mana Māori)
## 70 Dover Samuels Dover Samuels    5,336   Mere Mangu (Independent)
## 71 Mahara Okeroa Mahara Okeroa    8,052 Bill Karaitiana (National)
## 72  Mita Ririnui  Mita Ririnui    6,707   Rihi Vercoe (Mana Māori)
##                     Runner up                     Third place
## 67  Willie Jackson (Alliance) Angeline Greensill (Mana Māori)
## 68      Metiria Turei (Green)        George Ngatai (National)
## 69      Ken Mair (Mana Māori)           Greg White (National)
## 70   Mere Mangu (Independent)        Naida Glavish (Alliance)
## 71 Bill Karaitiana (National)    Jacqui Amohanga (Mana Māori)
## 72   Rihi Vercoe (Mana Māori)     Hamuera Mitchell (National)
##                        Third place
## 67 Angeline Greensill (Mana Māori)
## 68        George Ngatai (National)
## 69           Greg White (National)
## 70        Naida Glavish (Alliance)
## 71    Jacqui Amohanga (Mana Māori)
## 72     Hamuera Mitchell (National)

Removing Unneeded Information

Many HTML tables include additional information that is of little interest to data analysts, such as information encoded in superscript and footnote tags, as well as escape sequences.

By default, htmltab( ) removes the superscript and footnote information and replaces all escape sequences with a whitespace.

You can change this behaviour through the following arguments:

  • rm_superscript

  • rm_footnotes

  • rm_escape

  • rm_nodata_cols

  • rm_invisible

  • rm_whitespace

Writing to File

While not the main topic here, we often need to write modified data back to file.

The main functions used are write( ), cat( ), write.table( ) and write.csv( ), discussed below.

Recommended method:

  1. Open a file connection for writing,

  2. Write the data to the connection with cat( ),

  3. Append further output as needed,

  4. Close the connection.
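
A minimal sketch of this pattern (the file name is illustrative):

zz <- file("results.txt", "w")                    # 1. open a writable connection
cat("x y", "1 2", "3 4", file = zz, sep = "\n")   # 2./3. write and append lines via cat( )
close(zz)                                         # 4. close the connection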

Take Care when Writing to File

Precision:

When using write( ), precision is governed by the options(digits) setting. Fine-tune it using format( ).

Header lines:

By default, write.table( ) writes a header line with no entry above the row names column; if the target application expects one (as spreadsheets do for CSV files), use col.names = NA together with row.names = TRUE.

Separator:

For CSV output, use write.csv( ). If you need “,” as the decimal separator, use write.csv2( ), which writes sep = ";" and dec = "," for the file entries.

Missing values:

Missing values are written as NA by default; NaN is treated as NA by write.table( ) but not by cat( ) or write( ).

Quoting strings:

Strings are quoted by default; this can be changed via the quote argument. For embedded quotes in strings:

df <- data.frame(a = I("a \" quote"))
write.table(df)
## "a"
## "1" "a \" quote"
write.table(df, qmethod = "double")
## "a"
## "1" "a "" quote"
write.table(df, quote = FALSE, sep = ",")
## a
## 1,a " quote

Encodings:

Our favorite friends…

Text files don’t contain metadata about their encoding, so for non-ASCII data, target the file at the application that’s going to read it. All of the functions that write files can write to a connection, and a connection lets you specify the encoding. write.table( ) also has a fileEncoding argument if you want to be specific.
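
For example, a small sketch (the output file name is illustrative), writing the statisticians table from earlier for an application that expects latin1:

write.table(records_renamed, "statisticians_latin1.txt", fileEncoding = "latin1")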

Functions worth noting:

  • write.matrix( ) (in MASS) special interface for writing matrices

  • write.foreign( ) (in foreign) uses write.table( ) and should be used to export data for SAS, SPSS and Stata.

Sources:

  1. CRAN HTML vignettes

  2. R-manuals, Import/Export

  3. R Cookbook, Paul Teetor, Pub: O’Reilly Media, Inc., ISBN: 978-0-596-80915-7, Available online

  4. Stack Overflow

  5. R-bloggers