read.csv

The utils package, which is automatically loaded in your R session on startup, can import flat files in different forms. The read.table() call is the base function and there are wrapper functions such as read.csv() and read.delim() to make your life easier.
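
The wrapper relationship can be checked on a small file. The sketch below is illustrative only: the file contents are made up and written to a temporary location, and it assumes the data contain no quotes or comment characters, where the defaults of the two functions differ.

```r
# Write a small comma-separated file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,lanes", "Pool A,8", "Pool B,6"), tmp)

# read.csv() calls read.table() with CSV-friendly defaults
via_wrapper <- read.csv(tmp)
via_base    <- read.table(tmp, header = TRUE, sep = ",")

# Compare the two results
all.equal(via_wrapper, via_base)

# Clean up the temporary file
unlink(tmp)
```

For simple files like this one, spelling out header and sep in read.table() should give the same data frame as the wrapper.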

In this exercise, you’ll be working with the file swimming_pools.csv; it contains data on swimming pools in Brisbane, Australia (Source: data.gov.au). The file contains the variable names in the first row. It uses a comma to separate values within rows. swimming_pools.csv is already available in your working directory; type dir() in the console to list the files in your working directory.

# List the files in your working directory
dir()
## [1] "01.1_Importing data from flat files.rmd"
## [2] "01.1_Importing_data_from_flat_files.rmd"
## [3] "01_Importing data from flat files.rmd"  
## [4] "01_Importing_data_from_flat_files.html" 
## [5] "ch1_slides.pdf"                         
## [6] "hotdogs"                                
## [7] "hotdogs.txt"                            
## [8] "swimming_pools.csv"
# Import swimming_pools.csv: pools
pools <- read.csv("swimming_pools.csv")

# Print the structure of pools
str(pools)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
##  $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...

read.delim

Besides .csv files, you will often encounter tab-delimited .txt files. The function to easily import this type of file is read.delim(). By default, it sets the sep argument to "\t" (fields in a record are delimited by tabs) and the header argument to TRUE (the first record contains the field names).
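
As a rough illustration of the header argument, the sketch below reads the same tab-delimited file twice. The file is a made-up stand-in written to a temporary location, not the exercise data.

```r
# Write a small tab-delimited file with a header row (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("type\tcalories", "Beef\t186", "Poultry\t86"), tmp)

# Default call: first line is used for the column names
with_header <- read.delim(tmp)

# header = FALSE: first line is treated as data, columns get default names
without_header <- read.delim(tmp, header = FALSE)

names(with_header)
nrow(with_header)
names(without_header)
nrow(without_header)

# Clean up the temporary file
unlink(tmp)
```

When a file has no header row, as in the exercise that follows, header = FALSE keeps the first record from being consumed as column names.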

In this exercise, you will import hotdogs.txt, containing information on sodium and calorie levels in different hotdogs (Source: UCLA).

The dataset has 3 variables: type, calories and sodium; the variable names are not included in the first line of the file, so think about which arguments to use! The file uses tabs as field separators.

# Import hotdogs.txt: hotdogs
hotdogs <- read.delim("hotdogs.txt", header=FALSE)

# Name the columns of hotdogs appropriately
names(hotdogs) <- c("type", "calories", "sodium")

# Summarize hotdogs
summary(hotdogs)
##       type       calories         sodium     
##  Beef   :20   Min.   : 86.0   Min.   :144.0  
##  Meat   :17   1st Qu.:132.0   1st Qu.:362.5  
##  Poultry:17   Median :145.0   Median :405.0  
##               Mean   :145.4   Mean   :424.8  
##               3rd Qu.:172.8   3rd Qu.:503.5  
##               Max.   :195.0   Max.   :645.0

read.table

If you’re dealing with more exotic flat file formats, it’s a good idea to resort to the read.table() function. It’s the most basic importing function; you can specify tons of different arguments in this function.

Its default behavior differs from that of the read.csv() and read.delim() wrappers. For example, the header argument defaults to FALSE and the sep argument is "" by default, which splits fields on any whitespace (spaces, tabs or newlines).
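
A minimal sketch of these defaults, using a made-up temporary file whose fields are separated by an irregular mix of spaces and tabs:

```r
# Fields separated by a mix of spaces and a tab, no header row (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beef   186\t495", "Poultry 86  358"), tmp)

# Default call: sep = "" splits on any run of whitespace
hd <- read.table(tmp)

names(hd)  # default column names, since no header row was read
dim(hd)

# Clean up the temporary file
unlink(tmp)
```

Because no header row is read here, the columns receive the default names V1, V2 and V3; col.names can supply better ones.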

Up to you again! The data is still hotdogs.txt. It has no column names in the first row, and the field separators are tabs. This time, though, the file is in the hotdogs folder which is in your current working directory (type dir() in the console to see it). You can build a path to it with:

file.path("hotdogs", "hotdogs.txt")

# Create a path to the hotdogs.txt file: path
path <- file.path("hotdogs", "hotdogs.txt")

# Import the hotdogs.txt file: hotdogs
hotdogs <- read.table(path, header = FALSE, sep="\t", col.names=c("type", "calories", "sodium"))

# Call head() on hotdogs
head(hotdogs)
##   type calories sodium
## 1 Beef      186    495
## 2 Beef      181    477
## 3 Beef      176    425
## 4 Beef      149    322
## 5 Beef      184    482
## 6 Beef      190    587

stringsAsFactors

By now you know how to use the header and sep arguments. In the video, Filip mentioned that stringsAsFactors can be used as an argument as well. It tells R whether it should convert strings in the flat file to factors.

For all importing functions in the utils package, this argument defaults to TRUE in R versions before 4.0.0 (since R 4.0.0 the default is FALSE). With TRUE, you import strings as factors, which only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be character.
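
The effect can be seen on a small made-up file. Passing the argument explicitly, as in this sketch, keeps the result independent of the session's default.

```r
# Write a small CSV with one text column and one numeric column (made-up data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Latitude", "Pool A,-27.6", "Pool B,-27.5"), tmp)

# Same file, imported with and without string-to-factor conversion
as_chr <- read.csv(tmp, stringsAsFactors = FALSE)
as_fct <- read.csv(tmp, stringsAsFactors = TRUE)

class(as_chr$Name)      # text column kept as character
class(as_fct$Name)      # text column converted to factor
class(as_fct$Latitude)  # numeric columns are not affected

# Clean up the temporary file
unlink(tmp)
```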

You’ll be working with the swimming_pools.csv file from the first exercise. The file has the column names in the first row and uses commas to separate values. It contains two columns (Name and Address), which shouldn’t be factors.

# Import swimming_pools.csv correctly: pools
pools <- read.csv("swimming_pools.csv", stringsAsFactors = FALSE)

# Check the structure of pools
str(pools)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...
# Import swimming_pools.csv with factors: pools_factor
pools_factor <- read.csv("swimming_pools.csv")

# Check the structure of pools_factor
str(pools_factor)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
##  $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...

Arguments

Lily and Tom are having an argument because they want to share a hot dog but they can’t seem to agree on which one to choose. After some time, they simply decide that they will have one each. Lily wants to have the one with the fewest calories while Tom wants to have the one with the most sodium.

Besides calories and sodium, the hotdogs have one more variable: type. This can be one of three things: Beef, Meat or Poultry. Because the type is not a unique string for each hotdog, it can be imported as a factor.

# Load in the hotdogs data set: hotdogs
hotdogs <- read.delim("hotdogs.txt", header=FALSE)
names(hotdogs) <- c("type", "calories", "sodium")

# Select the hot dog with the fewest calories: lily
lily <- hotdogs[which.min(hotdogs$calories), ]

# Select the observation with the most sodium: tom
tom <- hotdogs[which.max(hotdogs$sodium), ]

# Print lily and tom
lily
##       type calories sodium
## 50 Poultry       86    358
tom
##    type calories sodium
## 15 Beef      190    645
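
The selections above rely on which.min() and which.max(), which return the position of the first minimum or maximum in a vector. A small sketch on a made-up data frame (the values are hypothetical, not taken from hotdogs.txt) shows how ties are handled:

```r
# Hypothetical data with a tie for the lowest calorie count
dogs <- data.frame(
  type     = c("Beef", "Meat", "Poultry", "Poultry"),
  calories = c(186, 140, 86, 86),
  sodium   = c(495, 645, 358, 400)
)

# Position of the first minimum / maximum
i_min <- which.min(dogs$calories)
i_max <- which.max(dogs$sodium)

# Use the positions as row indices to pull out whole observations
dogs[i_min, ]
dogs[i_max, ]
```

When several rows tie, only the first is returned; something like dogs[dogs$calories == min(dogs$calories), ] would keep all of them.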

Column classes

In the previous exercises you’ve specified the column names of the hot dog data, but there’s more: you can also specify the column types or column classes of the resulting data frame.

You do this using the colClasses argument. Typically, you pass this argument a vector of classes:

read.delim("my_file.txt", colClasses = c("character", "numeric", "logical"))

This approach is very useful if you have some columns that are categorical and others that are not. You don’t have to bother with stringsAsFactors anymore; just state for each column what the class should be.

If a column is set to “NULL” in the colClasses vector, this column will be skipped and will not be loaded into the data frame.
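
Both uses of colClasses can be tried on a small made-up file. The column names and values in this sketch are hypothetical and only serve to show the resulting classes.

```r
# Write a small tab-delimited file with three columns (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tscore\tpassed", "a01\t7.5\tTRUE", "a02\t4\tFALSE"), tmp)

# One class per column, in file order
typed <- read.delim(tmp, colClasses = c("character", "numeric", "logical"))
sapply(typed, class)

# "NULL" (as a string) skips the second column entirely
dropped <- read.delim(tmp, colClasses = c("character", "NULL", "logical"))
names(dropped)

# Clean up the temporary file
unlink(tmp)
```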

The sample code contains a read.delim() call from before. Follow the instructions to import the data while specifying colClasses.

# Previous call to import hotdogs.txt
hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))

# Print a vector representing the classes of the columns
sapply(hotdogs, class)
##      type  calories    sodium 
##  "factor" "integer" "integer"
# Edit the colClasses argument to import the data correctly: hotdogs2
# The first column should be a "factor", the second column should be dropped ("NULL"),
# and the third column should be "numeric". Watch out for the double quotes!
hotdogs2 <- read.delim("hotdogs.txt", header = FALSE, 
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL", "numeric"))


# Display the structure of hotdogs2
str(hotdogs2)
## 'data.frame':    54 obs. of  2 variables:
##  $ type  : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sodium: num  495 477 425 322 482 587 370 322 479 375 ...