One of the most frustrating tasks in analytics is getting the data into R! It often takes more time than the analysis itself.
While there are many packages that make the task less daunting, the most commonly used ones are highlighted in this document.
While going through the examples and various packages mentioned, keep in mind that R, like most stats systems, aren’t geared toward manipulating large-scale data. If you do have a lot, and I mean a lot of data, rather work with database manipulations systems, to help save your hairline and calm the nerves.
If you absolutely have to work with “biggish-data” in R, there are several packages developed in languages such as Java, perl and python that can be directly integrated with R code.
e.g.: rJava SJava, RSPerl, RSPython
Remember though that R, like S, comes from the Unix tradition of small re-usable tools, so it makes sense to do most of the data manipulation using tools such as “awk” and “perl” before importing into R (or after exporting from R).
At all times, if you can, get your data as a simple text file (small to medium scale).
However, “real life” ain’t so pretty, at some stage, you’ll end up having to deal with data in some proprietary data format, e.g. an Excel spreadsheet or an SPSS file. If you can, go back and export the data as a text file (or get someone to do it for you…)!
If you can’t, fear not, there are other ways of getting the data into R. While I won’t go into detail, there’s a lot of information on the web and in the R docs to help you out. Links to these are given at the end of this document.
The next most common issue you’ll probably face is with data encoding.
Unless data is entirely ASCII, we need to figure out how it’s encoded, to help R out.
File extensions and encoding:
text.Rd: UTF-8 Unicode English text
text2.dat: ISO-8859 English text
text3.dat: Little-endian UTF-16 Unicode English character data, with CRLF line terminators
intro.dat: UTF-8 Unicode text
intro.dat: UTF-8 Unicode (with BOM) text
Unix-like systems produce UTF-8 files, while Windows’ ‘Unicode’ is UCS-2LE or UTF-16LE. If the data’s from a Chinese/Japanese/Korean locale, it won’t be standardised and will have some other 8-bit encoding.
Windows Systems
If you’re dealing with data that comes from Windows-originated files, you’ll have a bit more work on your hands. Windows-based files with UCS-2/UTF-16 encoding have associated ‘BOMs’ (Byte Order Marks). These cause problems with Unicode files, where the file utility will flatly refuse to read files with a BOM!
In this case, save yourself some trouble and revert to a command-line utility or a hex editor to work out the encoding.
Another typical issue encountered when running R on a Windows based system, is one where R says it cannot open the file, even though you know it exists!
What’s causing this? Probably backslashes in the file path.
Solution: change the backslashes to forward slashes or to double backslashes. e.g. * path_to_file_original <- “C:.txt”
path_to_file_forward_slashes <- “C:/data/sample.txt”
path_to_file_double_backslashes <- “C:\data\sample.txt”
While R and the packages we use to read data into R take a “best guess” at data types, if dealing with large data sets, it’s best to have a strategy for guessing the column types.
The standard practice is outlined below:
Read in a small amount of data, letting the package take a guess at the column types,
Call str(data) on the data read in,
Check the column types returned,
Create a vector for the correct/desired variable type,
Pass that in as an argument to the read method.
Reading data in in the correct format to start with speeds things up because R doesn’t need to guess the format. It also helps reduce the amount of work you have to do in-memory later down the wrangling pipeline.
These are amongst the “friendliest” files to read into R. For .csv files, we often use read.csv( ), although for imporved speed, you may want to consider readr 1.0.0
Letting R take a guess, using a limited number of entries at first:
install.packages("readr")
##
## The downloaded binary packages are in
## /var/folders/x1/4tdr7dhx0pbfxvy01l8zfnfc0000gn/T//RtmpbVnXul/downloaded_packages
library(readr)
challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
## Parsed with column specification:
## cols(
## x = col_double(),
## y = col_date(format = "")
## )
head(challenge)
## # A tibble: 6 x 2
## x y
## <dbl> <date>
## 1 404 <NA>
## 2 4172 <NA>
## 3 3004 <NA>
## 4 787 <NA>
## 5 37 <NA>
## 6 2332 <NA>
To suppress the printed spec, add col_types = cols( ) to the argument. We can use guess_parser( ) to see which parser readr is going to select:
guess_parser("1,234")
## [1] "number"
guess_parser(c(".", "-"))
## [1] "character"
guess_parser(c("10E", "20N"))
## [1] "character"
Date time parsers recognise the following:
install.packages("curl", quiet = TRUE)
library(curl)
##
## Attaching package: 'curl'
## The following object is masked from 'package:readr':
##
## parse_date
install.packages("devtools", quiet = TRUE)
library(devtools)
devtools::install_github("rstats-db/hms")
## Skipping install of 'hms' from a github remote, the SHA1 (98d14cf5) has not changed since last install.
## Use `force = TRUE` to force installation
library(hms)
hms(56, 34, 12)
## 12:34:56
as.hms(1)
## 00:00:01
as.hms("12:34:56")
## 12:34:56
as.hms(Sys.time())
## 23:18:54.006723
as.POSIXct(hms(1))
## [1] "1970-01-01 00:00:01 UTC"
mtcars2 <- read_csv(readr_example("mtcars.csv"))
## Parsed with column specification:
## cols(
## mpg = col_double(),
## cyl = col_integer(),
## disp = col_double(),
## hp = col_integer(),
## drat = col_double(),
## wt = col_double(),
## qsec = col_double(),
## vs = col_integer(),
## am = col_integer(),
## gear = col_integer(),
## carb = col_integer()
## )
Given the output, we now figure out the correct/desired column types and make parsing strict.
Method: copy-paste the output or save the spec to disk:
mtcars_spec <- write_rds(spec(mtcars2), "mtcars2-spec.rds")
With every load thereafter, pass this into the read function. If using this in production, you can throw an error if there are any problems with parsing:
mtcars2 <- as.data.frame(read_csv(readr_example("mtcars.csv"), col_types = read_rds("mtcars2-spec.rds")))
stop_for_problems(mtcars2)
head(mtcars2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
In readr:
read_file( ): reads file into length-1 character vector
read_file_raw( ): reads file into single raw vector
read_lines( ): reads file into character vector with one entry per line
read_lines_raw( ): reads file into a list of raw vectors with one entry per line.
These are paried with wite_lines( ) and write_file( )
Other methods in readr:
read_fwf( ): reliably reads only a few columns, ideal for files with ragged final columns (set final position/width to NA) and to skip comments (set with the comments argument)
read_csv_chunked( ), read_lines_chunked: experimental API for reading a file in chunks, allowing you to work with files bigger than memory. NB: API not finalised yet!, use with caution
Fixed width format is exactly that: the data items occur at fixed boundaries. For these, we use read.fwf. read_fwf( ) is reported to reliably read only a few columns and is supposed to be ideal for files with ragged final columns (set final position/width to NA) and when you want to skip comments (set with the comments argument).
e.g: records <- read.fwf(“filename”, widths=c(w1, w2, …, wn)) where w is the width of the columns and there are n columns in the data.
To read in data from a fixed-width text file, we need to know how wide the columns are.
fwftext_file <- "/Users/Tina/Projects/R_bootcamp_intermediate/fixed-width.txt"
records <- read.fwf(fwftext_file, widths=c(8,9,4,-1,4))
records
## V1 V2 V3 V4
## 1 Fisher R.A. 1890 1962
## 2 Pearson Karl 1857 1936
## 3 Cox Gertrude 1900 1978
## 4 Yates Frank 1902 1994
## 5 Smith Krisine 1878 1939
Here we have specified the first, second, third and last columns to contain 8, 9, 4 and 4 characters, respectively and the fourth column is the single space between the years. We don’t need the space, so we leave the column out. Changing the column names is the same as with any other data frame/table etc:
records_renamed <- read.fwf(fwftext_file, widths=c(8,9,4,-1,4), col.names=c("Last", "First", "Born", "Died"))
records_renamed
## Last First Born Died
## 1 Fisher R.A. 1890 1962
## 2 Pearson Karl 1857 1936
## 3 Cox Gertrude 1900 1978
## 4 Yates Frank 1902 1994
## 5 Smith Krisine 1878 1939
Tabular data is data in a file with a simple format:
Each line contains one record
In each record, the fields are separated by a one-character delimeter (e.g. space, tab, comma, colon, semi-colon etc).
Each record has the same number of fields
Although similar to data in csv and tsv form, tabular data’s more free-form because the fields aren’t neccessarily aligned by position. We use a function to read the data in as a table and not row by row.
tabular_file <- "/Users/Tina/Projects/R_bootcamp_intermediate/statisticians.txt"
dfstats <- read.table(tabular_file)
print(dfstats)
## V1 V2 V3 V4
## 1 Fisher R.A. 1890 1962
## 2 Pearson Karl 1857 1936
## 3 Cox Gertrude 1900 1978
## 4 Yates Frank 1902 1994
## 5 Smith Krisine 1878 1939
Other delimeters are dealt with by specifying them as an argument to read.table( ):
tabular_col_file <- "/Users/Tina/Projects/R_bootcamp_intermediate/statisticians_col.txt"
dfstats_col <- read.table(tabular_col_file, sep=":")
print(dfstats_col)
## V1 V2 V3 V4
## 1 Fisher R.A. 1890 1962
## 2 Pearson Karl 1857 1936
## 3 Cox Gertrude 1900 1978
## 4 Yates Frank 1902 1994
## 5 Smith Krisine 1878 1939
str(dfstats_col)
## 'data.frame': 5 obs. of 4 variables:
## $ V1: Factor w/ 5 levels "Cox","Fisher",..: 2 3 1 5 4
## $ V2: Factor w/ 5 levels "Frank","Gertrude",..: 5 3 2 1 4
## $ V3: int 1890 1857 1900 1902 1878
## $ V4: int 1962 1936 1978 1994 1939
Be careful with read.table( ), as it often interprets strings as factors. Change this behavior by being specific with stringsAsFactors=FALSE.
tabular_col_file <- "/Users/Tina/Projects/R_bootcamp_intermediate/statisticians_col.txt"
dfstats_col_strings <- read.table(tabular_col_file, sep=":", stringsAsFactors = FALSE)
str(dfstats_col_strings)
## 'data.frame': 5 obs. of 4 variables:
## $ V1: chr "Fisher" "Pearson" "Cox" "Yates" ...
## $ V2: chr "R.A." "Karl" "Gertrude" "Frank" ...
## $ V3: int 1890 1857 1900 1902 1878
## $ V4: int 1962 1936 1978 1994 1939
By default, read.table( ) interprets any “NA” string as NA. If missing values are specified by some other string (e.g. SAS uses “.”), specify this in the read method.
mdfmissing <- "/Users/Tina/Projects/R_bootcamp_intermediate/SAS_format.txt"
dfmissing1 <- read.table(mdfmissing)
dfmissing <- read.table(mdfmissing, na.strings=".")
print(dfmissing1)
## V1 V2 V3 V4
## 1 Fisher RA 1890 1962
## 2 . Karl 1857 1936
## 3 Cox Gertrude 1900 1978
## 4 Yates Frank 1902 .
## 5 Smith Krisine 1878 1939
print(dfmissing)
## V1 V2 V3 V4
## 1 Fisher RA 1890 1962
## 2 <NA> Karl 1857 1936
## 3 Cox Gertrude 1900 1978
## 4 Yates Frank 1902 NA
## 5 Smith Krisine 1878 1939
Other useful functionality with read.table: If the data has a header, you can read that in with the argument. In addition, read.table ignores any comment lines (assuming they begin with #).
file_with_header_and_comments <- "/Users/Tina/Projects/R_bootcamp_intermediate/head_comment.txt"
read.table(file_with_header_and_comments, header=TRUE, stringsAsFactors=FALSE)
## Firstname Lastname Born Died
## 1 Fisher R.A. 1890 1962
## 2 Pearson Karl 1857 1936
## 3 Cox Gertrude 1900 1978
## 4 Yates Frank 1902 1994
## 5 Smith Krisine 1878 1939
zz <- file("ex.data", "w")
# open an output file connection
cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
cat("One more line\n", file = zz)
close(zz)
Convert decimal point to comma in output, using a pipe (Unix) Both R strings and (probably) the shell need doubled.
zz <- pipe(paste("sed s/\\\\./,/ >", "outfile"), "w")
cat(format(round(rnorm(100), 4)), sep = "\n", file = zz)
close(zz)
Now look at the output file using file.show(“outfile”, delete.file = TRUE), then capture the R output.
zz <- textConnection("ex.lm.out", "w")
sink(zz)
sink()
close(zz)
## now ‘ex.lm.out’ contains the output for futher processing.
## Look at it by, e.g.,
cat(ex.lm.out, sep = "\n")
## read in file created in last examples
readLines("ex.data")
## [1] "TITLE extra line" "2 3 5 7" ""
## [4] "11 13 17" "One more line"
unlink("ex.data")
## read listing of current directory (Unix)
readLines(pipe("ls -1"))
## [1] "Orders-updated_Aug_15.ods"
## [2] "Orders_as_of_Aug_11.ods"
## [3] "Orders_as_of_Aug_15.csv"
## [4] "Orders_as_of_Aug_15_unchecked_against_new.ods"
## [5] "Orders_as_of_Aug_8.xls"
## [6] "Reading_html_tables_from_web.Rmd"
## [7] "SAS_format.txt"
## [8] "Untitled.Rmd"
## [9] "attendees-26913915245.pdf"
## [10] "attendees-Aug15.pdf"
## [11] "fixed-width.txt"
## [12] "head_comment.txt"
## [13] "mtcars2-spec.rds"
## [14] "outfile"
## [15] "reading_data_into_r.Rmd"
## [16] "reading_data_into_r.html"
## [17] "statisticians.txt"
## [18] "statisticians_col.txt"
## [19] "statisticians_missing.txt"
# remove trailing commas from an input file.
Suppose we are given a file ‘data’ containing 450, 390, 467, 654, 30, 542, 334, 432, 421, 357, 497, 493, 550, 549, 467, 575, 578, 342, 446, 547, 534, 495, 979, 479 Then read this by:
scan(pipe("sed -e s/,$// data"), sep=",")
## numeric(0)
Pushback is a method to push data back to the source with which you are connected.
zz <- textConnection(LETTERS)
readLines(zz, 2)
## [1] "A" "B"
scan(zz, "", 4)
## [1] "C" "D" "E" "F"
pushBack(c("aa", "bb"), zz)
scan(zz, "", 4)
## [1] "aa" "bb" "G" "H"
close(zz)
R can read data from several RDBMS systems:
MySQL
SQLite
Oracle
Sybase
PostgreSQL
Only MySQL is covered here briefly, but without a connection to a db.
Package: RMySQL The code in the block below is not live and assumes you’ve got the following parameters in the [client] section, so your config file looks something like: [client] user = userid
password = password
host = hostname
#all the code in this block is commented out on purpose.
#library(RMySQL)
#con <- dbConnect(MySQL(), client.flag=CLIENT_MULTI_RESULTS)
#con <- dbConnect(MySQL(), user="userid", password="pswd", host="hostname", client.flag=CLIENT_MULTI_RESULTS)
If your [client] config file doesn’t include the parameters, you can still connect as follows:
#all the code in this block is commented out on purpose.
#library(RMySQL)
#con <- dbConnect(MySQL(), user="userid", password="pswd", host="hostname", client.flag=CLIENT_MULTI_RESULTS)
You need an open database connection to submit your SQL to the db and read the result:
#sql <- "SELECT * from SurveyResults WHERE City = 'Chicago'"
#rows <- dbGetQuery(con, sql)
You need to construct your own SQL queries, all of them work with RMySQL. Results are packaged into a data frame, making it easy to work with the data from there. For more information, see the RMySQL docs.
The web has a host of rich data sets and data that you probably don’t want to have to select, copy and paste from. In this section, I’ll briefly demonstrate how to get data from an html table.
The package htmltab, is what seems to work for me, although there are many other ways described in detail on SO.
Example:
Reading all the tables on the web page:
install.packages("htmltab")
##
## The downloaded binary packages are in
## /var/folders/x1/4tdr7dhx0pbfxvy01l8zfnfc0000gn/T//RtmpbVnXul/downloaded_packages
install.packages("tidyr")
##
## The downloaded binary packages are in
## /var/folders/x1/4tdr7dhx0pbfxvy01l8zfnfc0000gn/T//RtmpbVnXul/downloaded_packages
library(htmltab)
library(tidyr)
url <- "http://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom"
ukLang <- htmltab(doc = url, which = "//th[text() = 'Ability']/ancestor::table")
str(ukLang)
## 'data.frame': 10 obs. of 11 variables:
## $ Ability : chr "Understands but does not speak, read or write" "Speaks, reads and writes" "Speaks but does not read or write" "Speaks and reads but does not write" ...
## $ Wales >> Welsh >> Number : chr "157,792" "430,717" "80,429" "45,524" ...
## $ Wales >> Welsh >> % : chr "5.15%" "14.06%" "2.63%" "1.49%" ...
## $ Scotland >> Scottish Gaelic >> Number : chr "23,357" "32,191" "18,966" "6,218" ...
## $ Scotland >> Scottish Gaelic >> % : chr "0.46%" "0.63%" "0.37%" "0.12%" ...
## $ Scotland >> Scots >> Number : chr "267,412" "1,225,622" "179,295" "132,709" ...
## $ Scotland >> Scots >> % : chr "5.22%" "23.95%" "3.50%" "2.59%" ...
## $ Northern Ireland >> Irish >> Number : chr "70,501" "71,996" "24,677" "7,414" ...
## $ Northern Ireland >> Irish >> % : chr "4.06%" "4.15%" "1.42%" "0.43%" ...
## $ Northern Ireland >> Ulster-Scots >> Number: chr "92,040" "17,228" "10,265" "7,801" ...
## $ Northern Ireland >> Ulster-Scots >> % : chr "5.30%" "0.99%" "0.59%" "0.45%" ...
head(ukLang)
## Ability Wales >> Welsh >> Number
## 4 Understands but does not speak, read or write 157,792
## 5 Speaks, reads and writes 430,717
## 6 Speaks but does not read or write 80,429
## 7 Speaks and reads but does not write 45,524
## 8 Reads but does not speak or write 44,327
## 9 Other combination of skills 40,692
## Wales >> Welsh >> % Scotland >> Scottish Gaelic >> Number
## 4 5.15% 23,357
## 5 14.06% 32,191
## 6 2.63% 18,966
## 7 1.49% 6,218
## 8 1.45% 4,646
## 9 1.33% 1,678
## Scotland >> Scottish Gaelic >> % Scotland >> Scots >> Number
## 4 0.46% 267,412
## 5 0.63% 1,225,622
## 6 0.37% 179,295
## 7 0.12% 132,709
## 8 0.09% 107,025
## 9 0.03% 17,381
## Scotland >> Scots >> % Northern Ireland >> Irish >> Number
## 4 5.22% 70,501
## 5 23.95% 71,996
## 6 3.50% 24,677
## 7 2.59% 7,414
## 8 2.09% 5,659
## 9 0.34% 4,651
## Northern Ireland >> Irish >> %
## 4 4.06%
## 5 4.15%
## 6 1.42%
## 7 0.43%
## 8 0.33%
## 9 0.27%
## Northern Ireland >> Ulster-Scots >> Number
## 4 92,040
## 5 17,228
## 6 10,265
## 7 7,801
## 8 11,911
## 9 959
## Northern Ireland >> Ulster-Scots >> %
## 4 5.30%
## 5 0.99%
## 6 0.59%
## 7 0.45%
## 8 0.69%
## 9 0.06%
The header information has been recast into a format that respects the hierarchical order of the variables and yet only spans a single line in the R table.
Using functionality from the tidyr package, the next couple of data cleaning steps may be the following:
ukLang <- gather(ukLang, key, value, -Ability)
ukLang <- separate(ukLang, key, into = c("region", "language", "statistic"), sep = " >> ")
head(ukLang)
## Ability region language statistic
## 1 Understands but does not speak, read or write Wales Welsh Number
## 2 Speaks, reads and writes Wales Welsh Number
## 3 Speaks but does not read or write Wales Welsh Number
## 4 Speaks and reads but does not write Wales Welsh Number
## 5 Reads but does not speak or write Wales Welsh Number
## 6 Other combination of skills Wales Welsh Number
## value
## 1 157,792
## 2 430,717
## 3 80,429
## 4 45,524
## 5 44,327
## 6 40,692
Arguments to htmltab( ): - body - header - bodyFun - headerFun - rm_escape - rm_footnote - rm_superscript - rm_nodata_cols - rm_invisible - rm_whitespace
htmltab employs heuristics for body and header element identification but they are no guarantee, hence the need to specify these.
url2 <- "http://en.wikipedia.org/wiki/New_Zealand_general_election,_2002"
xp <- "//caption[starts-with(text(), 'Electorate results')]/ancestor::table"
body_xp <- "//tr[./td[not(@colspan = '10')]]"
nz1 <- htmltab(doc = url2, which = xp, body = body_xp, encoding = "UTF-8")
## Warning: Columns [Incumbent] seem to have no data and are removed. Use
## rm_nodata_cols = F to suppress this behavior
The code will throw the following warning:
“Warning: Columns [Incumbent,Majority] seem to have no data and are removed.”
We use rm_nodata_cols = F to suppress this behavior:
nz1 <- htmltab(doc = url2, which = xp, body = body_xp, encoding = "UTF-8", rm_nodata_cols = FALSE)
Previously, we disgarded two intercepting rows in the body that give variable/header information. What if we need these?
nz2 <- htmltab(doc = url2, which = xp, header = 1 + "//tr/td[@colspan = '10']", body = "//tr[./td[not(@colspan = '10')]]", encoding = "UTF-8")
## Warning: Columns [Incumbent] seem to have no data and are removed. Use
## rm_nodata_cols = F to suppress this behavior
tail(nz2)
## Header_1 Electorate Incumbent Winner
## 67 Māori electorates Tainui (new electorate) <NA>
## 68 Māori electorates Tamaki Makaurau (new electorate) <NA>
## 69 Māori electorates Te Tai Hauāuru Tariana Turia Tariana Turia
## 70 Māori electorates Te Tai Tokerau Dover Samuels Dover Samuels
## 71 Māori electorates Te Tai Tonga Mahara Okeroa Mahara Okeroa
## 72 Māori electorates Waiariki Mita Ririnui Mita Ririnui
## Winner.1 Majority Runner up
## 67 Nanaia Mahuta 3,430 Willie Jackson (Alliance)
## 68 John Tamihere 9,444 Metiria Turei (Green)
## 69 Tariana Turia 8,657 Ken Mair (Mana Māori)
## 70 Dover Samuels 5,336 Mere Mangu (Independent)
## 71 Mahara Okeroa 8,052 Bill Karaitiana (National)
## 72 Mita Ririnui 6,707 Rihi Vercoe (Mana Māori)
## Runner up.1 Third place
## 67 Willie Jackson (Alliance) Angeline Greensill (Mana Māori)
## 68 Metiria Turei (Green) George Ngatai (National)
## 69 Ken Mair (Mana Māori) Greg White (National)
## 70 Mere Mangu (Independent) Naida Glavish (Alliance)
## 71 Bill Karaitiana (National) Jacqui Amohanga (Mana Māori)
## 72 Rihi Vercoe (Mana Māori) Hamuera Mitchell (National)
## Third place.1
## 67 Angeline Greensill (Mana Māori)
## 68 George Ngatai (National)
## 69 Greg White (National)
## 70 Naida Glavish (Alliance)
## 71 Jacqui Amohanga (Mana Māori)
## 72 Hamuera Mitchell (National)
nz3 <- htmltab(doc = url2, which = xp, header = 1 + "//tr/td[@colspan = '10']", body = "//tr[./td[not(@colspan = '10')]]", encoding = "UTF-8", rm_nodata_cols = F)
tail(nz3)
## Header_1 Electorate Incumbent Incumbent
## 67 Māori electorates Tainui (new electorate) (new electorate)
## 68 Māori electorates Tamaki Makaurau (new electorate) (new electorate)
## 69 Māori electorates Te Tai Hauāuru <NA> Tariana Turia
## 70 Māori electorates Te Tai Tokerau <NA> Dover Samuels
## 71 Māori electorates Te Tai Tonga <NA> Mahara Okeroa
## 72 Māori electorates Waiariki <NA> Mita Ririnui
## Winner Winner Majority Runner up
## 67 <NA> Nanaia Mahuta 3,430 Willie Jackson (Alliance)
## 68 <NA> John Tamihere 9,444 Metiria Turei (Green)
## 69 Tariana Turia Tariana Turia 8,657 Ken Mair (Mana Māori)
## 70 Dover Samuels Dover Samuels 5,336 Mere Mangu (Independent)
## 71 Mahara Okeroa Mahara Okeroa 8,052 Bill Karaitiana (National)
## 72 Mita Ririnui Mita Ririnui 6,707 Rihi Vercoe (Mana Māori)
## Runner up Third place
## 67 Willie Jackson (Alliance) Angeline Greensill (Mana Māori)
## 68 Metiria Turei (Green) George Ngatai (National)
## 69 Ken Mair (Mana Māori) Greg White (National)
## 70 Mere Mangu (Independent) Naida Glavish (Alliance)
## 71 Bill Karaitiana (National) Jacqui Amohanga (Mana Māori)
## 72 Rihi Vercoe (Mana Māori) Hamuera Mitchell (National)
## Third place
## 67 Angeline Greensill (Mana Māori)
## 68 George Ngatai (National)
## 69 Greg White (National)
## 70 Naida Glavish (Alliance)
## 71 Jacqui Amohanga (Mana Māori)
## 72 Hamuera Mitchell (National)
Many HTML tables include additional information which are of little interest to data analysts such as information encoded in superscript and footnote tags, as well as escape sequences.
By default, htmltab() removes information from the first two and replaces all escape sequences by a whitespace.
You can change this behaviour through: - rm_superscript - rm_footnotes - rm_escape - rm_nodata_cols - rm_invisible - rm_whitespace arguments.
While not the main topic here, we often need to write modified data back to file.
Main functions used:
write.table( ): write outs a data frame (or something that can be coerced to a df)
write( ): just writes out a matrix/vector in given number of columns, transposing the matrix in the process.
Recommended method:
Open a file connection for writing,
Call on the function cat( ) to open the connection
Append the data
Close the connection
Precision:
When using write( ), governed by options(digits) setting. Fine tune using format( ).
Header lines:
R doesn’t like row name headers, achieved using write.table( ).
Seperator:
csv, use write.csv( ). If you need a “,” as a decimal seperator, use (sep = “;” or sep = “”) as a seperator for the file entries.
Missing values:
Default to NA, NaN’s treated as NA by write.table but not by cat( ) nor write( )
Quoting strings:
Default: quoted. Can be changed in quote argument. For embedded quotes in strings:
df <- data.frame(a = I("a \" quote"))
write.table(df)
## "a"
## "1" "a \" quote"
write.table(df, qmethod = "double")
## "a"
## "1" "a "" quote"
write.table(df, quote = FALSE, sep = ",")
## a
## 1,a " quote
Encodings:
Our favorite friends…
Text files don’t contain metadata on their encodings. For non-ASCII data, target the file to the application that’s going to read it. All the functions to write files can write to a connection, the connection lets you specify encoding. Write.table( ) has argument fileEncoding if you want to be specific.
Functions worth noting:
R Cookbook, Paul Teetor, Pub: O’Reilly Media, Inc., ISBN: 978-0-596-80915-7, Available online