Data Loading and Data Wrangling with R
Lab Overview
- Setting up an R script
- Loading .RData and .csv data files
- Working with tibbles
- Introduction to data wrangling in R
1. Getting Started
Download Lab 1’s materials from Moodle:
Save provided R script in your
codefolder in BRM-Labs project folder.Save provided data sets in your
datafolder in BRM-Labs project folder.
Open the provided lab 1’s R script.
2. R Script Setup
The first thing on an R Script should be the setup of
the work environment. In the setup, we should:
2.1 Clean the work environment
Caution: this action cannot be undone!
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
2.2 Activate packages
Use the function library to activate/load R
packages.
# Load packages
library(tidyverse)
library(stargazer)
These two packages should now appear as “selected” in the
Packages tab (bottom right of the screen).
3. Loading Data
In order to load data into R you need to pay attention to the data format. Different data formats need to be loaded into R using different functions.
3.1 .RData
This “.RData” file format is specific to R. It is loaded
using the load() function. The following command loads the
ceosal2.RData file and automatically stores its contents in
the work environment. Note that when using the load()
function, we do not need to give a name to the file we are loading in
order for it to be saved in our Environment.
# Load data
load("data/ceosal2.RData")
“.RData” files can store multiple objects, so it is important to check what objects were loaded. You can see that two new objects appeared in your work environment:
Data: This is a table with data on CEO salaries.Desc: This is a table with the description of each variable in Data.
3.2 .csv Data
Since there are many software alternatives for data analysis
(e.g. Excel, SPSS, Stata, MatLab), data are often distributed in “open”
formats such as “.csv” (comma separated values). “.csv” files
are text files in which each line of text corresponds to a row of data
and each column of data is separated by a comma. To load a “.csv” file
we can use the function read_delim() specifying that the
delimiter is a comma (this function will also work for files that use
different delimiters such as tab or semicolon), or we can use the
function read_csv().
The following command reads the ceosal2.csv file and
stores its contents in an object. Note that when using the
read_delim() or the read_csv() functions, we
need to give a name to the file we are loading in order for it to be
saved in our Environment (otherwise the file is simply printed
to the console.)
# Read comma delimited file
tb.ceosal2 <- read_delim("data/ceosal2.csv", delim= ",")
# or
tb.ceosal2 <- read_csv("data/ceosal2.csv")
4. Tibbles
The read_delim() and read_csv() functions
automatically read the data into a tibble. “A
tibble is a more advanced and capable version of
R’s default format for data (data.frame).”1. This is
the data format we will use during this class.
For our own convenience, we include the prefix tb. in
the name of all objects of the tibble type. Also, it is
useful to give names to our objects or datasets that reflect their
contents. We should choose names that are clear and meaningful to us and
to others who might read our code.
We can create a tibble from an existing data object with
as_tibble(). This will be useful if, for instance, we used
the load() function (which does not automatically convert
the data into the tibble format).
# Load data and convert to tibble format
load("data/ceosal2.RData")
# Check if "data" tibble
is_tibble(data)
## [1] FALSE
# Convert to tibble and rename
tb.ceosal2 <- as_tibble(data)
# Check if is tibble
is_tibble(tb.ceosal2)
## [1] TRUE
The tb.ceosal2 just created is a duplicate of the
dataset data. As we don’t need the original dataset, we can
delete by using the rm() function. rm stands
for “remove” and it deletes objects from our work environment. This
action cannot be undone!
# Remove original data.frame "data"
rm(data)
Piping
One useful tool that we will be using to manipulate tibbles is the
pipe, %>%. Pipes allow us to express a sequence of
multiple operations. They can greatly simplify our code and make it more
intuitive2. A useful way to read the pipe operator is
to think of it as the expression “and then”.
To insert a pipe you can use the shortcut type: Ctrl +
Shift + M (Windows) or Cmd +
Shift + M (Mac).
5. Data Exploration and Wrangling
There are many options for exploring data in R.
5.1 Table info
We can get the names of the variables in a dataset by using the
function names():
# Print the names of variables in dataset
names(tb.ceosal2)
## [1] "salary" "age" "college" "grad" "comten" "ceoten"
## [7] "sales" "profits" "mktval" "lsalary" "lsales" "lmktval"
## [13] "comtensq" "ceotensq" "profmarg"
An easy way to change the name of a variable is using the
rename() function.
# Rename variables
tb.ceosal2 <- tb.ceosal2 %>%
rename(logsales=lsales)
Note that we are creating a new object tb.ceosal2 (that
is, we are overwriting our original data set) that contains the same
data as the original but with a new name logsales for the
variable previously called lsales. Note also that we have
used the pipe operator. We can read the code as: take
tb.ceosal2 “and then” rename the variable
lsales and store this changed data set as
tb.ceosal2 (overwrite the original).
We can count the number of columns in the data set using
ncol():
# Count number of columns/variables in dataset
ncol(tb.ceosal2)
## [1] 15
We can count the number of observations (rows) using
nrow() or summarise(). The
summarise() function “creates a new data frame. It will
have one (or more) rows for each combination of grouping variables; if
there are no grouping variables, the output will have a single row
summarizing all observations in the input.”3. The function
n() returns the number of observations in a current
group.
# Count the number of observations (rows)
nrow(tb.ceosal2)
## [1] 177
# or:
tb.ceosal2 %>%
summarize(n_row = n())
Below is a list of useful functions for exploring and interacting with your data in R:
| Function | Description | Example |
|---|---|---|
names(df) |
Returns the variable names in the dataset | names(tb.ceosal2) |
ncol(df) |
Returns the number of columns | ncol(tb.ceosal2) |
nrow(df) |
Returns the number of rows | nrow(tb.ceosal2) |
head(df, n) |
Shows the first n rows | head(tb.ceosal2, 3) |
slice_head(n = x) |
Selects the first x rows | tb.ceosal2 %>% slice_head(n = 3) |
tail(df, n) |
Shows the last n rows | tail(tb.ceosal2, 3) |
slice_tail(n = x) |
Selects the last x rows | tb.ceosal2 %>% slice_tail(n = 3) |
slice(a:b) |
Extracts rows a to b | tb.ceosal2 %>% slice(5:8) |
select(var) |
Selects a variable | tb.ceosal2 %>% select(salary) |
select(var1, var2) |
Selects multiple variables | tb.ceosal2 %>% select(salary, age) |
rename(new = old) |
Renames a variable | tb.ceosal2 %>% rename(logsales = lsales) |
df[row, col] |
Accesses a specific element | tb.ceosal2[1, 1] |
df[row, ] |
Accesses a full row | tb.ceosal2[1, ] |
df[, col] |
Accesses a full column | tb.ceosal2[, 1] |
summarise(n = n()) |
Counts number of rows | tb.ceosal2 %>% summarise(n = n()) |
View(df) |
Opens the dataset in spreadsheet view | View(tb.ceosal2) |
print(df) |
Prints the dataset | print(tb.ceosal2) |
⚠️ Note on Capitalization: R is case-sensitive. This means
Mean()will not work, butmean()will. If a function or variable doesn’t work as expected, double-check your spelling and capitalization first.
5.2 Ordering data
The function arrange() allows us to order the data.
# Ordering the data in ascending order (default) (output omitted)
tb.ceosal2 %>%
arrange(age)
# Ordering the data in descending order (output omitted)
tb.ceosal2 %>%
arrange(desc(age))
Ordering the data using multiple conditions:
# Order data by age (ascending) and then by grad (ascending) (output omitted)
tb.ceosal2 %>%
arrange(age, grad)
# Order data by grad (descending) and then by age (ascending) (output omitted)
tb.ceosal2 %>%
arrange(desc(grad), age)
Note that these commands do not permanently order the data
set — they only output the reordered set. To permanently order
the data we would have to replace the contents of the object, doing
something like
tb.ceosal2 <- tb.ceosal2 %>% arrange(age,grad).
5.3 Subsetting data
We can use the filter() function to subset the data
based on any criteria we may want. For instance, we may want to focus
our analysis on only young CEOs:
# Keep only CEOs who are 45 years old or younger
tb.ceosal2 %>%
filter(age <= 45)
# Create new tibble with the filtered data
tb.young.ceosal2 <- tb.ceosal2 %>%
filter(age <= 45)
Note that the first command simply lists all the rows for which CEO age is below the specified threshold.
Subsetting the data using multiple conditions
Use the symbol & for and and the symbol
| for or. Use >/<
for greater/less than and >=/<= for
greater or equal/less or equal to, and == for equal.
# Select CEOs who 45 years old or younger AND have a graduate degree.
tb.ceosal2 %>%
filter(age <= 45 & grad == 1)
# Select CEOs who are younger than 35 OR have a salary higher than 5000.
tb.ceosal2 %>%
filter(age < 35 | salary > 5000)
5.4 Creating new variables
To add a new variable to a tibble we use the function
mutate().
Create a new variable that transforms/is a function of an existing variable(s):
# Add new variables that are functions of existing variables
tb.ceosal2 <- tb.ceosal2 %>%
mutate(log_salary = log(salary))
tb.ceosal2 <- tb.ceosal2 %>%
mutate(age_squared = age^2)
Create new variables that contain values specified by us:
# Add new variables that take on specific values
tb.ceosal2 <- tb.ceosal2 %>%
mutate(one = 1)
Deleting a variable:
# Delete a variable
tb.ceosal2 <- tb.ceosal2 %>%
select(-log_salary)
We can use the function head() or names()
to verify the creation/deletion of variables.
5.5 Grouping Data
Oftentimes we want to summarize data by a given grouping variable or
set of variables. The function group_by allow us to group
data. The following operations will be performed to each group. For
instance:
# Count number of CEOs with and without graduate degree
tb.ceosal2 %>%
group_by(grad) %>%
summarize(n_ceo = n()) %>%
ungroup()
After performing the intended operation we can use the function
ungroup() to remove the grouping of the data.
Grouping can be done using multiple variables as follows:
# Count number of CEOs with and without graduate and colleges degree
tb.ceosal2 %>%
group_by(grad, college) %>%
summarize(n_ceo = n()) %>%
ungroup()
6. Recommended Assignment
- Complete DataCamp’s second chapter of the Introduction to the Tidyverse course: Data visualization