Today

Importing CSV files into R (Chapter 15 book)
HTML Tables (Chapter 15 book)
Data Cleaning/Scraping example

1. Flat files such as CSV, tab delimited or text files

CSV stands for comma-separated values. It’s a text format that can be read with a huge variety of softare. It has a data table format, with the values of variables in each case separated by commas. Here’s an example of the first several lines of a CSV file:

"name","sex","count","year"
"Mary","F",7065,1880
"Anna","F",2604,1880
"Emma","F",2003,1880
"Elizabeth","F",1939,1880

The top row usually (but not always) contains the variable names. Quotation marks are often used at the start and end of character strings; these quotation marks are not part of the content of the string.

Although CSV files are often named with the .csv suffix, it’s also common for them to be named with .txt or other things. You will also see characters other than commas being used to delimit the fields, tabs and vertical bars are particularly common.

read.csv() is an old school function. It has no real advantage since it is slow and doesn’t read csv files on the web. Better functions for reading CSV files into R is read_csv() in the readr() package or fread() in the data.table package. fread() figures out the delimiter for you (from the filename ending) so can be used with different types of files. The problem with fread() is that it outputs a data.table instead of a data.frame (data.table is a data type of the data.table package). read.file() has all of the advantages of fread() but outputs a dataframe. Hence we will use read.file().

Here is a useful overview of functions to read CSV files:

Function	Package	WebURL	Fast
read.csv()	base	no	no
read_csv()	readr	yes	yes
fread()	data.table	yes	yes
read.file()	mosaic	yes	yes

Here’s a way to access a .csv file over the Internet.

houses

library(mosaic)

myURL <- "http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv"
my_dataTable <- myURL %>% 
  read.file() %>%
  head(3)

  class(my_dataTable)

## [1] "data.frame"

  my_dataTable

Price	Living.Area	Baths	Bedrooms	Fireplace	Acres	Age
142212	1982	1.0	3	N	2.00	133
134865	1676	1.5	3	Y	0.38	14
118007	1694	2.0	3	Y	0.96	15

Here is an example of text file whose delimiter are tabs.

potatoes

Use read_tsv for this:

library(readr)
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt"
potatoes <- read_tsv(url_delim)
potatoes %>% head(3)

area	temp	size	storage	method	texture	flavor	moistness
1	1	1	1	1	2.9	3.2	3.0
1	1	1	1	2	2.3	2.5	2.6
1	1	1	1	3	2.5	2.8	2.8

Now try it with read.file(). From file ending it knows to use tab delimiter.

library(mosaic)
myURL <- "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt"
myURL %>% 
  read.file() %>%
  head(3)

area	temp	size	storage	method	texture	flavor	moistness
1	1	1	1	1	2.9	3.2	3.0
1	1	1	1	2	2.3	2.5	2.6
1	1	1	1	3	2.5	2.8	2.8

Reading from a file on your own computer is even easier. You just need to have the file path, as can be found using file.choose(). For instance:

# Call file.choose() then copy the string with the file path below
fileName <- "~/Project1/Important.csv"
fileName %>%
  read.file()

Useful argument to read.file():

stringsAsFactors=FALSE is useful since we often don’t want a variable with character string values to be categorical.

With download.file() you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also RData files. An RData file is very efficient format to store R data.

url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/wine.RData"

# Download the wine file to your working directory
download.file(url_rdata,"~/Desktop/wine_local.RData")

# Load the wine data into your workspace using load()
load("~/Desktop/wine_local.RData")

Task for you

Download this file on your computer: http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/swimming_pools.csv using download.file()
Load this into R using file.choose() and read.file() and call the file pools.
Wrangle pools so your data table looks like:

Name	Address
Acacia Ridge Leisure Centre	1391 Beaudesert Road, Acacia Ridge
Bellbowrie Pool	Sugarwood Street, Bellbowrie
Carole Park	Cnr Boundary Road and Waterford Road Wacol

2. HTML Tables

Web pages are usually in HTML format. This format is much more general than the data table format; it’s used for text, page arrangement, etc. Sometimes, however, there is tabular data contained within the HTML page. This will often look in your browser like the following:

Part of a page on world records in the one-mile race

In Chrome you can see the html source code by clicking “View” -> “Developer” -> “view source”

Within HTML, such tables are represented with the HTML <\table> tag.

In Chrome you can see the html source code by clicking “View” -> “Developer” -> “view source”

When you have the URL of a page containing one or more tables, it can be easy to read them in to R as data tables. Here’s the pattern.

#install.packages(rvest)   Data Scraping package
library(rvest)
SetOfTables <- 
  "http://en.wikipedia.org/wiki/Mile_run_world_record_progression" %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
  html_table(fill=TRUE)

The result, SetOfTables, is not a table, it is a list of the tables found in the web page. You can access any of those tables like this:

length( SetOfTables )

## [1] 7

To look at the first one,

head( SetOfTables[[1]], 2 ) # Note: double square brackets

Time	Athlete	Nationality	Date	Venue
4:28	Charles Westhall	United Kingdom	26 July 1855	London
4:28	Thomas Horspool	United Kingdom	28 September 1857	Manchester

You can look at the second, third, and so on in the same way. When you have found the table or tables you want, assign them to an object name. For instance, the two tables shown in the figure above happen to be the third and fourth in SetOfTables.

MyTable <- SetOfTables[[3]]
MyOtherTable <- SetOfTables[[4]]

head( MyTable, 2 )

Time	Athlete	Nationality	Date	Venue
4:52	Cadet Marshall	United Kingdom	2 September 1852	Addiscome
4:45	Thomas Finch	United Kingdom	3 November 1858	Oxford

head( MyOtherTable, 2 )

Time	Auto	Athlete	Nationality	Date	Venue
4:14.4		John Paul Jones	United States	31 May 1913[5]	Allston, Mass.
4:12.6		Norman Taber	United States	16 July 1915[5]	Allston, Mass.

The third and fourth tables embedded in the Wikipedia page on records in the one-mile race.

3. Data Cleaning/Scraping Example:

Grab table 4 (or another smilar table) from the wiipedia page on world records in the mile (or some similar event). Make a plot of the record time versus the date in which it occurred. Mark each point with the name of the athlete written above the point.

Step 1 (load HTML table to R)

library(rvest)
require(lubridate)

web_page <- "http://en.wikipedia.org/wiki/Mile_run_world_record_progression"
SetOfTables <- 
  web_page %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
  html_table(fill=TRUE)
T4 <- SetOfTables[[4]]
head(T4)

Time	Athlete	Nationality	Date	Venue
4:14.4	John Paul Jones	United States	31 May 1913[5]	Allston, Mass.
4:12.6	Norman Taber	United States	16 July 1915[5]	Allston, Mass.
4:10.4	Paavo Nurmi	Finland	23 August 1923[5]	Stockholm
4:09.2	Jules Ladoumègue	France	4 October 1931[5]	Paris
4:07.6	Jack Lovelock	New Zealand	15 July 1933[5]	Princeton, N.J.
4:06.8	Glenn Cunningham	United States	16 June 1934[5]	Princeton, N.J.

str(T4)

## 'data.frame':    32 obs. of  6 variables:
##  $ Time       : chr  "4:14.4" "4:12.6" "4:10.4" "4:09.2" ...
##  $ Auto       : chr  "" "" "" "" ...
##  $ Athlete    : chr  "John Paul Jones" "Norman Taber" "Paavo Nurmi" "Jules Ladoumègue" ...
##  $ Nationality: chr  " United States" " United States" " Finland" " France" ...
##  $ Date       : chr  "31 May 1913[5]" "16 July 1915[5]" "23 August 1923[5]" "4 October 1931[5]" ...
##  $ Venue      : chr  "Allston, Mass." "Allston, Mass." "Stockholm" "Paris" ...

Step 2 (Make Glyph ready table)

Time in T4 is a character vector, with class period (change in clock time between two date). We need to change this to a numeric vector of durations in minutes. This is a two step process.

lubridate::ms() takes a character vector of minute second pairs, of class period, and creates a vector with the specified number of minutes and seconds.

head(ms(T4$Time))

## [1] "4M 14.4S" "4M 12.6S" "4M 10.4S" "4M 9.2S"  "4M 7.6S"  "4M 6.8S"

lubridate::as.duration() coerces an object to a duration (in seconds).

lubridate::as.duration(lubridate::ms(T4$Time)) %>% head()

## estimate only: convert periods to intervals for accuracy

## [1] "254.4s (~4.24 minutes)" "252.6s (~4.21 minutes)" "250.4s (~4.17 minutes)"
## [4] "249.2s (~4.15 minutes)" "247.6s (~4.13 minutes)" "246.8s (~4.11 minutes)"

This is a character vector so we convert to a numeric using as.numeric. lubridate::dmy() function converts the character-string date stored in the HTML table to a POSIX date-number.

new_T4 <- T4 %>%
  mutate(Date = dmy(gsub("\\[.\\]$", "", Date)),
         Time = as.numeric(lubridate::as.duration(lubridate::ms(Time))/60))

## estimate only: convert periods to intervals for accuracy

head(new_T4)

Time	Athlete	Nationality	Date	Venue
4.240000	John Paul Jones	United States	1913-05-31	Allston, Mass.
4.210000	Norman Taber	United States	1915-07-16	Allston, Mass.
4.173333	Paavo Nurmi	Finland	1923-08-23	Stockholm
4.153333	Jules Ladoumègue	France	1931-10-04	Paris
4.126667	Jack Lovelock	New Zealand	1933-07-15	Princeton, N.J.
4.113333	Glenn Cunningham	United States	1934-06-16	Princeton, N.J.

Step 3 (Visualize)

new_T4 %>%
  ggplot(aes(x=Date, y=Time)) +
  geom_point() +
  geom_text(aes(label=Athlete), size=3, angle=45,hjust=-0.1) +
  ylim(3.5,4.4) + ylab("Time (min)") +
  xlim(ymd("1910-01-01"), ymd("2000-01-01"))

lec25

Today

1. Flat files such as CSV, tab delimited or text files

Task for you

2. HTML Tables

3. Data Cleaning/Scraping Example:

Iclicker