Learning Objectives:

At the end of the session, the participants are expected to:

  • learn how to import data to R from difference data sources.
  • learn how to export or write data into different data formats.
  • manage data in R in preparation for data analysis.

Importing Data into R

In this section, you will learn how to import data from different sources and at the same time to prepare your data in order to avoid errors during the importation of a file into R.

Preparing your Datasets

1.) Row and column names:

  • Use the first row as column headers (or column names). Generally, columns represent variables.

  • Use the first column as row names. Generally rows represent observations.

  • Each row name should be unique, so remove duplicated names.

Column names should be compatible with R naming conventions. As illustrated below, our data contains some issues that should be fixed before importing:

Figure 1

2.) Naming conventions

  • Avoid names with blank spaces. Good column names: Long_jump or Long.jump. Bad column name: Long jump.

  • Avoid names with special symbols: ?, $, *, +, #, (, ), -, /, }, {, |, >, < etc.

  • Avoid beginning variable names with a number. Use letter instead. Good column names: sport_100m or x100m. Bad column name: 100m

  • Column names must be unique. Duplicated names are not allowed.

  • R is case sensitive. This means that Name is different from Name or NAME.

  • Avoid blank rows in your data.

  • Delete any comments in your file

  • Replace missing values by NA (for not available)

  • If you have a column containing date, use the four digit format. Good format: 01/01/2016. Bad format: 01/01/16

3.) Final File:

Our finale file should look like this:

Figure 2

4.) Save the file.

Reading Data from TXT|CSV files

In this section, you’ll learn how to import data from .txt (tab-separated values) and .csv (comma-separated values) file formats into R.

R base functions for importing data

The R base function read.table() is a general function that can be used to read a file in table format. The data will be imported as a data frame.

Note that, depending on the format of your file, several variants of read.table() are available to make your life easier, including read.csv(), read.csv2(), read.delim() and read.delim2().

  • read.csv(): for reading “comma separated value” files (“.csv”).
  • read.csv2(): variant used in countries that use a comma “,” as decimal point and a semicolon “;” as field separators.
  • read.delim(): for reading “tab-separated value” files (“.txt”). By default, point (“.”) is used as decimal points.
  • read.delim2(): for reading “tab-separated value” files (“.txt”). By default, comma (“,”) is used as decimal points.

The simplified format of these functions are, as follow:

# Read tabular data into R
read.table(file, header = FALSE, sep = "", dec = ".")
# Read "comma separated value" files (".csv")
read.csv(file, header = TRUE, sep = ",", dec = ".", ...)
# Or use read.csv2: variant used in countries that 
# use a comma as decimal point and a semicolon as field separator.
read.csv2(file, header = TRUE, sep = ";", dec = ",", ...)
# Read TAB delimited files
read.delim(file, header = TRUE, sep = "\t", dec = ".", ...)
read.delim2(file, header = TRUE, sep = "\t", dec = ",", ...)
  • file: the path to the file containing the data to be imported into R.
  • sep: the field separator character. “ is used for tab-delimited file.
  • header: logical value. If TRUE, read.table() assumes that your file has a header row, so row 1 is the name of each column. If that’s not the case, you can add the argument header = FALSE.
  • dec: the character used in the file for decimal points.

Reading a local file

To import a local .txt or a .csv file, the syntax would be:

# Read a txt file, named "mtcars.txt"
test_data <- read.delim("test.txt")
test_data
# Read a csv file, named "mtcars.csv"
my_data <- read.csv("mtcars.csv")
my_data

The above R code assumes that the file “test.txt” or “mtcars.csv” is in your current working directory. To know your current working directory, type the function getwd() in R console.

It’s also possible to choose a file interactively using the function file.choose(), which I recommend if you’re a beginner in R programming:

or you can go to : Environment > Import Dataset > From text. That is,

Figure 3

Reading a file from internet

It’s possible to use the functions read.delim(), read.csv() and read.table() to import files from the web.

my_data <- read.delim("http://www.sthda.com/upload/boxplot_format.txt")
head(my_data)

Reading Data From Excel Files (xls|xlsx) into R

You can import data from excel though : Environment > Import Dataset > From Excel > Choose the file

Figure 4

  • In case where your excel file contains many sheets, you can specify the sheet name that contains the data that you want to import in R.

  • You can also specify the Range of cells to include in the import process.

Reading Data from SPSS, Stata and SAS

The process of importing data from SPSS, Stata and SAS can also be done through the Environment > Import Dataset > From SPSS or SAS or Stata.

Exporting Data from R

Here, you’ll learn how to export data from R to txt, csv, Excel (xls, xlsx) and R data file formats.

Figure 4

Writing Data from R to txt

You can use the base functions or the functions from the readr package to write data from R to txt.

  • R base functions for writing data: write.table(), write.csv(), write.csv2()

  • readr functions for writing data: write_tsv(), write_csv()

1.) Use the R base functions

# Loading mtcars data
data("mtcars")
# Write data to txt file: tab separated values
# sep = "\t"
write.table(mtcars, file = "mtcars.txt", sep = "\t",
            row.names = TRUE, col.names = NA)
# Write data to csv files:  
# decimal point = "." and value separators = comma (",")
write.csv(mtcars, file = "mtcars.csv")
# Write data to csv files: 
# decimal point = comma (",") and value separators = semicolon (";")
write.csv2(mtcars, file = "mtcars.csv")

2.) Using the readr package functions

# Loading mtcars data
data("mtcars")
library("readr")
# Writing mtcars data to a tsv file
write_tsv(mtcars, path = "mtcars.txt")
# Writing mtcars data to a csv file
write_csv(mtcars, path = "mtcars.csv")

Writing data from R to Excel files (xls|xlsx)

In order to do this, we need to install the package : writexl

install.packages(“writexl”)
library(writexl)
# Write the first data set in a new workbook
write_xlsx(USArrests,"myworkbook.xlsx")

You can also give it a named list of data frames, in which case each data frame becomes a sheet in the xlsx file:

write_xlsx(list(iris = iris, cars = cars, mtcars = mtcars), "mydata.xlsx")

Saving data into R data format: RDATA and RDS

  • Save one object to a file: saveRDS(object, file), readRDS(file)

  • Save multiple objects to a file: save(data1, data2, file), load(file)

  • Save your entire workspace: save.image(), load()

1.) Saving and restoring one single R object:

# Save a single object to a file
saveRDS(mtcars, "mtcars.rds")
# Restore it under a different name
my_data <- readRDS("mtcars.rds")

2.) Saving and restoring your entire work space:

# Save your workspace
save.image(file = "my_work_space.RData")
# Load the workspace again
load("my_work_space.RData")

Reshaping Data

In this section, you’ll learn modern conventions for preparing and reshaping data in order to facilitate analyses in R. The process is called tidying the data.

What is a tidy data set?

A data set is called tidy when:

  • each column represents a variable and

  • each row represents an observation

The opposite of tidy is messy data, which corresponds to any other arrangement of the data.

Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.

The R package tidyr, developed by Hadley Wickham, provides functions to help you organize (or reshape) your data set into tidy format. It’s particularly designed to work in combination with magrittr and dplyr to build a solid data analysis pipeline.

The tidyr package, provides four functions to help you change the layout of your data set:

  • gather(): gather (collapse) columns into rows

  • spread(): spread rows into columns

  • separate(): separate one column into multiple

  • unite(): unite multiple columns into one

Installing and loading tidyr

# Installing the tidyr packages
install.packages("tidyr")
# Loading
library("tidyr")

Example data sets

We’ll use the R built-in USArrests data sets. We start by subsetting a small data set, which will be used in the next sections as an example data set:

my_data <- USArrests[c(1, 10, 20, 30), ]
my_data

Row names are states, so let’s use the function cbind() to add a column named “state” in the data. This will make the data tidy and the analysis easier.

my_data <- cbind(state = rownames(my_data), my_data)
my_data

gather():

The function gather() collapses multiple columns into key-value pairs. It produces a “long” data format from a “wide” one.

Figure 5

1.) Simplified format:

gather(data, key, value, ...)
  • data: A data frame
  • key, value: Names of key and value columns to create in output
  • …: Specification of columns to gather. Allowed values are:
    • variable names
    • if you want to select all variables between a and e, use a:e
    • if you want to exclude a column name y use -y for more options, see: dplyr::select()

2.) Examples of usage:

Gather all columns except the column state

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   -state)
my_data2

Note that all column names (except state) have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”).

spread():

The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads into multiple columns. It produces a “wide” data format from a “long” one. It’s an alternative of the function cast() [in reshape2 package].

Figure 6

1.) Simplified format

spread(data, key, value)
  • data: A data frame
  • key: The (unquoted) name of the column whose values will be used as column headings.
  • value:The (unquoted) names of the column whose values will populate the cells.

2.) Examples of usage:

Spread “my_data2” to turn back to the original data:

my_data3 <- spread(my_data2, 
                   key = "arrest_attribute",
                   value = "arrest_estimate"
                   )
my_data3

unite()

The function unite() takes multiple columns and paste them together into one.

Figure 7

1.) Simplified format:

unite(data, col, ..., sep = "_")
  • data: A data frame
  • col: The new (unquoted) name of column to add.
  • sep: Separator to use between values

2.) Examples of usage:

The R code below uses the data set “my_data” and unites the columns Murder and Assault

my_data4 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
my_data4

separate()

The function sperate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns.

Figure 8

1.) Simplified format:

separate(data, col, into, sep = "[^[:alnum:]]+")
  • data: A data frame

  • col: Unquoted column names

  • into: Character vector specifying the names of new variables to be created.

  • sep: Separator between columns:

  • If character, is interpreted as a regular expression.

  • If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string.

2.) Examples of usage:

Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault:

separate(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")

Manipulate Data

In this section, you will learn how to easily perform data manipulation using R software. We’ll cover the following data manipulation techniques:

  • filtering and ordering rows,
  • renaming and adding columns,
install.packages("tidyverse")

Select Data Frame Columns

You will learn how to use the following functions:

  • pull(): Extract column values as a vector. The column of interest can be specified either by name or by index.
  • select(): Extract one or multiple columns as a data table. It can be also used to remove columns from the data frame.
  • select_if(): Select columns based on a particular condition. One can use this function to, for example, select columns if they are numeric.
  • Helper functions - starts_with(), ends_with(), contains(), matches(), one_of(): Select columns/variables based on their names

Figure 9

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data

Extract column values as a vector

my_data %>% pull(Species)
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

Extract columns as a data table**

Select column by position

  • Select columns 1 to 3:
my_data %>% select(1:3)
# or you can use
my_data[,c(1:3)]
  • Select column 1 and 3 but not 2:
my_data %>% select(1, 3)
# or you can use
my_data[,c(1:3)]

Select columns by names

Select columns by names: Sepal.Length and Petal.Length

my_data %>% select(Sepal.Length, Petal.Length)

Select all columns from Sepal.Length to Petal.Length

my_data %>% select(Sepal.Length:Petal.Length)

There are several special functions that can be used inside select(): starts_with(), ends_with(), contains(), matches(), one_of(), etc.

# Select column whose name starts with "Petal"
my_data %>% select(starts_with("Petal"))
# Select column whose name ends with "Width"
my_data %>% select(ends_with("Width"))
# Select columns whose names contains "etal"
my_data %>% select(contains("etal"))
# Select columns whose name matches a regular expression
my_data %>% select(matches(".t."))
# selects variables provided in a character vector.
my_data %>% select(one_of(c("Sepal.Length", "Petal.Length")))

Select column based on a condition

my_data %>% select_if(is.numeric)

Remove columns

Note that, to remove a column from a data frame, prepend its name by minus -.

Removing Sepal.Length and Petal.Length columns:

my_data %>% select(-Sepal.Length, -Petal.Length)

Removing all columns from Sepal.Length to Petal.Length:

my_data %>% select(-(Sepal.Length:Petal.Length))

Removing all columns whose name starts with “Petal”:

my_data %>% select(-starts_with("Petal"))

Note that, if you want to drop columns by position, the syntax is as follow.

# Drop column 1
my_data %>% select(-1)
# Drop columns 1 to 3
my_data %>% select(-(1:3))
# Drop columns 1 and 3 but not 2
my_data %>% select(-1, -3)

Subset Data Frame Rows

Figure 10

Extract rows by position

  • Key R function: slice() [dplyr package]
my_data %>% slice(1:6)

Filter rows by logical criteria

  • Key R function: filter() [dplyr package]. Used to filter rows that meet some logical criteria.

One-column based criteria: Extract rows where Sepal.Length > 7:

my_data %>% filter(Sepal.Length > 7)

Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:

my_data %>% filter(Sepal.Length > 6.7, Sepal.Width <= 3)

Test for equality (==): Extract rows where Sepal.Length > 6.5 and Species = “versicolor”:

my_data %>% filter(Sepal.Length > 6.7, Species == "versicolor")

Using OR operator (|): Extract rows where Sepal.Length > 6.5 and (Species = “versicolor” or Species = “virginica”):

my_data %>% filter(
  Sepal.Length > 6.7, 
  Species == "versicolor" | Species == "virginica"
  )

Or, equivalently, use this shortcut (%in% operator):

my_data %>% filter(
  Sepal.Length > 6.7, 
  Species %in% c("versicolor", "virginica" )
  )

Filter rows within a selection of variables

This section presents 3 functions - filter_all(), filter_if() and filter_at() - to filter rows within a selection of variables.

These functions replicate the logical criteria over all variables or a selection of variables.

Create a new demo data set from my_data by removing the grouping column “Species”:

my_data2 <- my_data %>% select(-Species)
  • Select rows where all variables are greater than 2.4:
my_data2 %>% filter_all(all_vars(.> 2.4))
  • Select rows when any of the variables are greater than 2.4:
my_data2 %>% filter_all(any_vars(.> 2.4))
  • Vary the selection of columns on which to apply the filtering criteria. filter_at() takes a vars() specification. The following R code apply the filtering criteria on the columns Sepal.Length and Sepal.Width:
my_data2 %>% filter_at(vars(starts_with("Sepal")), any_vars(. > 2.4))

Remove missing values

We start by creating a data frame with missing values. In R NA (Not Available) is used to represent missing values:

# Create a data frame with missing data
friends_data <- data_frame(
  name = c("A", "B", "C", "D"),
  age = c(27, 25, 29, 26),
  height = c(180, NA, NA, 169),
  married = c("yes", "yes", "no", "no")
)
# Print
friends_data

Extract rows where height is NA:

friends_data %>% filter(is.na(height))

Exclude (drop) rows where height is NA:

friends_data %>% filter(!is.na(height))

In the R code above, !is.na() means that “we don’t want” NAs.

Select random rows from a data frame

It’s possible to select either n random rows with the function sample_n() or a random fraction of rows with sample_frac(). We first use the function set.seed() to initiate random number generator engine. This important for users to reproduce the analysis.

set.seed(1234)
# Extract 5 random rows without replacement
my_data %>% sample_n(5, replace = FALSE)
# Extract 5% of rows, randomly without replacement
my_data %>% sample_frac(0.05, replace = FALSE)

If replace=FALSE is set, it means we want sampling without replacement

Select top n rows ordered by a variable

Select the top 5 rows ordered by Sepal.Length

my_data %>% top_n(5, Sepal.Length)

Group by the column Species and select the top 5 of each group ordered by Sepal.Length:

my_data %>% 
  group_by(Species) %>%
  top_n(5, Sepal.Length)

Summary

We have introduce how to filter a data frame rows using the dplyr package:

  • Filter rows by logical criteria: my_data %>% filter(Sepal.Length >7)
  • Select n random rows: my_data %>% sample_n(10)
  • Select a random fraction of rows: my_data %>% sample_frac(10)
  • Select top n rows by values: my_data %>% top_n(10, Sepal.Length)

Identify and Remove Duplicate Data

Find and drop duplicate elements

The R function duplicated() returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates.

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

To find the position of duplicate elements in x, use this:

duplicated(x)
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

Extract duplicate elements:

x[duplicated(x)]
## [1] 1 4

If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:

x[!duplicated(x)]
## [1] 1 4 5 6

Following this way, you can remove duplicate rows from a data frame based on a column values, as follow:

# Remove duplicates based on Sepal.Width columns
my_data[!duplicated(my_data$Sepal.Width), ]

! is a logical negation. !duplicated() means that we don’t want duplicate rows.

Extract unique elements

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

You can extract unique elements as follow:

unique(x)
## [1] 1 4 5 6

It’s also possible to apply unique() on a data frame, for removing duplicated rows as follow:

unique(my_data)

Remove duplicate rows in a data frame

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().

Remove duplicate rows based on all columns:

my_data %>% distinct()

Remove duplicate rows based on certain columns (variables):

# Remove duplicated rows based on Sepal.Length
my_data %>% distinct(Sepal.Length, .keep_all = TRUE)
# Remove duplicated rows based on 
# Sepal.Length and Petal.Width
my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)

The option .kep_all is used to keep all variables in the data.

Summary

In this chapter, we describe key functions for identifying and removing duplicate data:

  • Remove duplicate rows based on one or more column values: my_data %>% dplyr::distinct(Sepal.Length)
  • R base function to extract unique elements from vectors and data frames: unique(my_data)
  • R base function to determine duplicate elements: duplicated(my_data)

Renaming columns

To rename the column Sepal.Length to sepal_length, the procedure is as follow:

1.) Get column names using the function names() or colnames() 2.) Change column names where name = Sepal.Length

# get column names
colnames(my_data)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
# Rename column where names is "Sepal.Length"
names(my_data)[names(my_data) == "Sepal.Length"] <- "sepal_length"
names(my_data)[names(my_data) == "Sepal.Width"] <- "sepal_width"
my_data

It’s also possible to rename by index in names vector as follow:

names(my_data)[1] <- "sepal_length"
names(my_data)[2] <- "sepal_width"

Compute and Add new Variables to a Data Frame

mutate: Add new variables by preserving existing ones

Add new columns (sepal_by_petal_*) by preserving existing ones:

my_data <- as_tibble(iris)
my_data %>% mutate(sepal_by_petal_l=Sepal.Length/Petal.Length)

transmute: Make new variables by dropping existing ones

Add new columns (sepal_by_petal_*) by dropping existing ones:

my_data %>%
  transmute(
    sepal_by_petal_l = Sepal.Length/Petal.Length,
    sepal_by_petal_w = Sepal.Width/Petal.Width
    )

Modify multiple columns at once

We start by creating a demo data set, my_data2, which contains only numeric columns. To do so, we’ll remove the column Species as follow:

my_data2 <- my_data %>%
  select(-Species)

The functions mutate_all() / transmute_all(), mutate_at() / transmute_at() and mutate_if() / transmute_if() can be used to modify multiple columns at once.

The simplified formats are as follow:

# Mutate variants
mutate_all(.tbl, .funs, ...)
mutate_if(.tbl, .predicate, .funs, ...)
mutate_at(.tbl, .vars, .funs, ...)

# Transmute variants
transmute_all(.tbl, .funs, ...)
transmute_if(.tbl, .predicate, .funs, ...)
transmute_at(.tbl, .vars, .funs, ...)
  • .tbl: a tbl data frame
  • .funs: List of function calls generated by funs(), or a character vector of function names, or simply a function.
  • …: Additional arguments for the function calls in .funs.
  • .predicate: A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected.

In the following sections, we’ll present only the variants of mutate(). The transmute() variants can be used similarly.

Transform all column values

  • Divide all columns value by 2.54:
my_data2 %>%
  mutate_all(funs(./2.54))

Note that, the dot “.” represents any variables

  • Function names will be appended to column names if .funs has names or multiple inputs:
my_data2 %>%mutate_all(funs(cm = ./2.54))

Note that, the output variable name now includes the function name.

Transform specific columns

  • mutate_at(): transform specific columns selected by names:
my_data2 %>%
  mutate_at(
    c("Sepal.Length", "Petal.Width"),
    funs(cm = ./2.54)
    )
  • mutate_if(): transform specific columns selected by a predicate function.

    mutate_if() is particularly useful for transforming variables from one type to another.

my_data %>% mutate_if(is.factor, as.character)

In this case, the species variable is originally a factor. We convert it into a character type variable.

Round all numeric variables:

my_data %>% mutate_if(is.numeric, round, digits = 0)

Summary

This article describe how to add new variable columns into a data frame using the dplyr functions: mutate(), transmute() and variants.

  • mutate(iris, sepal = 2*Sepal.Length): Computes and appends new variable(s).
  • transmute(iris, sepal = 2*Sepal.Length): Makes new variable(s) and drops existing ones.