At the end of the session, the participants are expected to:
In this section, you will learn how to import data from different sources and at the same time to prepare your data in order to avoid errors during the importation of a file into R.
1.) Row and column names:
Use the first row as column headers (or column names). Generally, columns represent variables.
Use the first column as row names. Generally rows represent observations.
Each row name should be unique, so remove duplicated names.
Column names should be compatible with R naming conventions. As illustrated below, our data contains some issues that should be fixed before importing:
Figure 1
2.) Naming conventions
Avoid names with blank spaces. Good column names: Long_jump or Long.jump. Bad column name: Long jump.
Avoid names with special symbols: ?, $, *, +, #, (, ), -, /, }, {, |, >, < etc.
Avoid beginning variable names with a number. Use letter instead. Good column names: sport_100m or x100m. Bad column name: 100m
Column names must be unique. Duplicated names are not allowed.
R is case sensitive. This means that Name is different from Name or NAME.
Avoid blank rows in your data.
Delete any comments in your file
Replace missing values by NA (for not available)
If you have a column containing date, use the four digit format. Good format: 01/01/2016. Bad format: 01/01/16
3.) Final File:
Our finale file should look like this:
Figure 2
4.) Save the file.
In this section, you’ll learn how to import data from .txt (tab-separated values) and .csv (comma-separated values) file formats into R.
The R base function read.table() is a general function that can be used to read a file in table format. The data will be imported as a data frame.
Note that, depending on the format of your file, several variants of read.table() are available to make your life easier, including read.csv(), read.csv2(), read.delim() and read.delim2().
The simplified format of these functions are, as follow:
# Read tabular data into R
read.table(file, header = FALSE, sep = "", dec = ".")
# Read "comma separated value" files (".csv")
read.csv(file, header = TRUE, sep = ",", dec = ".", ...)
# Or use read.csv2: variant used in countries that
# use a comma as decimal point and a semicolon as field separator.
read.csv2(file, header = TRUE, sep = ";", dec = ",", ...)
# Read TAB delimited files
read.delim(file, header = TRUE, sep = "\t", dec = ".", ...)
read.delim2(file, header = TRUE, sep = "\t", dec = ",", ...)To import a local .txt or a .csv file, the syntax would be:
The above R code assumes that the file “test.txt” or “mtcars.csv” is in your current working directory. To know your current working directory, type the function getwd() in R console.
It’s also possible to choose a file interactively using the function file.choose(), which I recommend if you’re a beginner in R programming:
or you can go to : Environment > Import Dataset > From text. That is,
Figure 3
You can import data from excel though : Environment > Import Dataset > From Excel > Choose the file
Figure 4
In case where your excel file contains many sheets, you can specify the sheet name that contains the data that you want to import in R.
You can also specify the Range of cells to include in the import process.
The process of importing data from SPSS, Stata and SAS can also be done through the Environment > Import Dataset > From SPSS or SAS or Stata.
Here, you’ll learn how to export data from R to txt, csv, Excel (xls, xlsx) and R data file formats.
Figure 4
You can use the base functions or the functions from the readr package to write data from R to txt.
R base functions for writing data: write.table(), write.csv(), write.csv2()
readr functions for writing data: write_tsv(), write_csv()
1.) Use the R base functions
# Loading mtcars data
data("mtcars")
# Write data to txt file: tab separated values
# sep = "\t"
write.table(mtcars, file = "mtcars.txt", sep = "\t",
row.names = TRUE, col.names = NA)
# Write data to csv files:
# decimal point = "." and value separators = comma (",")
write.csv(mtcars, file = "mtcars.csv")
# Write data to csv files:
# decimal point = comma (",") and value separators = semicolon (";")
write.csv2(mtcars, file = "mtcars.csv")2.) Using the readr package functions
In order to do this, we need to install the package : writexl
library(writexl)
# Write the first data set in a new workbook
write_xlsx(USArrests,"myworkbook.xlsx")You can also give it a named list of data frames, in which case each data frame becomes a sheet in the xlsx file:
Save one object to a file: saveRDS(object, file), readRDS(file)
Save multiple objects to a file: save(data1, data2, file), load(file)
Save your entire workspace: save.image(), load()
1.) Saving and restoring one single R object:
# Save a single object to a file
saveRDS(mtcars, "mtcars.rds")
# Restore it under a different name
my_data <- readRDS("mtcars.rds")2.) Saving and restoring your entire work space:
In this section, you’ll learn modern conventions for preparing and reshaping data in order to facilitate analyses in R. The process is called tidying the data.
A data set is called tidy when:
each column represents a variable and
each row represents an observation
The opposite of tidy is messy data, which corresponds to any other arrangement of the data.
Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.
The R package tidyr, developed by Hadley Wickham, provides functions to help you organize (or reshape) your data set into tidy format. It’s particularly designed to work in combination with magrittr and dplyr to build a solid data analysis pipeline.
The tidyr package, provides four functions to help you change the layout of your data set:
gather(): gather (collapse) columns into rows
spread(): spread rows into columns
separate(): separate one column into multiple
unite(): unite multiple columns into one
Example data sets
We’ll use the R built-in USArrests data sets. We start by subsetting a small data set, which will be used in the next sections as an example data set:
Row names are states, so let’s use the function cbind() to add a column named “state” in the data. This will make the data tidy and the analysis easier.
The function gather() collapses multiple columns into key-value pairs. It produces a “long” data format from a “wide” one.
Figure 5
1.) Simplified format:
2.) Examples of usage:
Gather all columns except the column state
Note that all column names (except state) have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”).
The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads into multiple columns. It produces a “wide” data format from a “long” one. It’s an alternative of the function cast() [in reshape2 package].
Figure 6
1.) Simplified format
2.) Examples of usage:
Spread “my_data2” to turn back to the original data:
The function unite() takes multiple columns and paste them together into one.
Figure 7
1.) Simplified format:
2.) Examples of usage:
The R code below uses the data set “my_data” and unites the columns Murder and Assault
The function sperate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns.
Figure 8
1.) Simplified format:
data: A data frame
col: Unquoted column names
into: Character vector specifying the names of new variables to be created.
sep: Separator between columns:
If character, is interpreted as a regular expression.
If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string.
2.) Examples of usage:
Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault:
In this section, you will learn how to easily perform data manipulation using R software. We’ll cover the following data manipulation techniques:
You will learn how to use the following functions:
Figure 9
We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Select column by position
Select columns by names: Sepal.Length and Petal.Length
Select all columns from Sepal.Length to Petal.Length
There are several special functions that can be used inside select(): starts_with(), ends_with(), contains(), matches(), one_of(), etc.
# selects variables provided in a character vector.
my_data %>% select(one_of(c("Sepal.Length", "Petal.Length")))Note that, to remove a column from a data frame, prepend its name by minus -.
Removing Sepal.Length and Petal.Length columns:
Removing all columns from Sepal.Length to Petal.Length:
Removing all columns whose name starts with “Petal”:
Note that, if you want to drop columns by position, the syntax is as follow.
Figure 10
One-column based criteria: Extract rows where Sepal.Length > 7:
Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:
Test for equality (==): Extract rows where Sepal.Length > 6.5 and Species = “versicolor”:
Using OR operator (|): Extract rows where Sepal.Length > 6.5 and (Species = “versicolor” or Species = “virginica”):
Or, equivalently, use this shortcut (%in% operator):
This section presents 3 functions - filter_all(), filter_if() and filter_at() - to filter rows within a selection of variables.
These functions replicate the logical criteria over all variables or a selection of variables.
Create a new demo data set from my_data by removing the grouping column “Species”:
We start by creating a data frame with missing values. In R NA (Not Available) is used to represent missing values:
# Create a data frame with missing data
friends_data <- data_frame(
name = c("A", "B", "C", "D"),
age = c(27, 25, 29, 26),
height = c(180, NA, NA, 169),
married = c("yes", "yes", "no", "no")
)
# Print
friends_dataExtract rows where height is NA:
Exclude (drop) rows where height is NA:
In the R code above, !is.na() means that “we don’t want” NAs.
It’s possible to select either n random rows with the function sample_n() or a random fraction of rows with sample_frac(). We first use the function set.seed() to initiate random number generator engine. This important for users to reproduce the analysis.
If replace=FALSE is set, it means we want sampling without replacement
Select the top 5 rows ordered by Sepal.Length
Group by the column Species and select the top 5 of each group ordered by Sepal.Length:
We have introduce how to filter a data frame rows using the dplyr package:
The R function duplicated() returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates.
Given the following vector:
To find the position of duplicate elements in x, use this:
## [1] FALSE TRUE FALSE FALSE TRUE FALSE
Extract duplicate elements:
## [1] 1 4
If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:
## [1] 1 4 5 6
Following this way, you can remove duplicate rows from a data frame based on a column values, as follow:
! is a logical negation. !duplicated() means that we don’t want duplicate rows.
Given the following vector:
You can extract unique elements as follow:
## [1] 1 4 5 6
It’s also possible to apply unique() on a data frame, for removing duplicated rows as follow:
The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().
Remove duplicate rows based on all columns:
Remove duplicate rows based on certain columns (variables):
# Remove duplicated rows based on
# Sepal.Length and Petal.Width
my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)The option .kep_all is used to keep all variables in the data.
In this chapter, we describe key functions for identifying and removing duplicate data:
To rename the column Sepal.Length to sepal_length, the procedure is as follow:
1.) Get column names using the function names() or colnames() 2.) Change column names where name = Sepal.Length
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# Rename column where names is "Sepal.Length"
names(my_data)[names(my_data) == "Sepal.Length"] <- "sepal_length"
names(my_data)[names(my_data) == "Sepal.Width"] <- "sepal_width"
my_dataIt’s also possible to rename by index in names vector as follow:
Add new columns (sepal_by_petal_*) by preserving existing ones:
Add new columns (sepal_by_petal_*) by dropping existing ones:
my_data %>%
transmute(
sepal_by_petal_l = Sepal.Length/Petal.Length,
sepal_by_petal_w = Sepal.Width/Petal.Width
)We start by creating a demo data set, my_data2, which contains only numeric columns. To do so, we’ll remove the column Species as follow:
The functions mutate_all() / transmute_all(), mutate_at() / transmute_at() and mutate_if() / transmute_if() can be used to modify multiple columns at once.
The simplified formats are as follow:
# Mutate variants
mutate_all(.tbl, .funs, ...)
mutate_if(.tbl, .predicate, .funs, ...)
mutate_at(.tbl, .vars, .funs, ...)
# Transmute variants
transmute_all(.tbl, .funs, ...)
transmute_if(.tbl, .predicate, .funs, ...)
transmute_at(.tbl, .vars, .funs, ...)In the following sections, we’ll present only the variants of mutate(). The transmute() variants can be used similarly.
Note that, the dot “.” represents any variables
Note that, the output variable name now includes the function name.
mutate_if(): transform specific columns selected by a predicate function.
mutate_if() is particularly useful for transforming variables from one type to another.
In this case, the species variable is originally a factor. We convert it into a character type variable.
Round all numeric variables:
This article describe how to add new variable columns into a data frame using the dplyr functions: mutate(), transmute() and variants.