FSAR | Lab #1 | Fall 2025

Data Loading and Data Wrangling with R

Lab Overview

Setting up an R script
Loading .RData and .csv data files
Working with tibbles
Introduction to data wrangling in R

1. Getting Started

Download Lab 1’s materials from Moodle:
- Save provided R script in your code folder in BRM-Labs project folder.
- Save provided data sets in your data folder in BRM-Labs project folder.
Open the provided lab 1’s R script.

2. R Script Setup

The first thing on an R Script should be the setup of the work environment. In the setup, we should:

2.1 Clean the work environment

Caution: this action cannot be undone!

# Clean work environment 
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment

2.2 Activate packages

Use the function library to activate/load R packages.

# Load packages
library(tidyverse)
library(stargazer)

These two packages should now appear as “selected” in the Packages tab (bottom right of the screen).

3. Loading Data

In order to load data into R you need to pay attention to the data format. Different data formats need to be loaded into R using different functions.

3.1 .RData

This “.RData” file format is specific to R. It is loaded using the load() function. The following command loads the ceosal2.RData file and automatically stores its contents in the work environment. Note that when using the load() function, we do not need to give a name to the file we are loading in order for it to be saved in our Environment.

# Load data
load("data/ceosal2.RData")

“.RData” files can store multiple objects, so it is important to check what objects were loaded. You can see that two new objects appeared in your work environment:

Data: This is a table with data on CEO salaries.
Desc: This is a table with the description of each variable in Data.

3.2 .csv Data

Since there are many software alternatives for data analysis (e.g. Excel, SPSS, Stata, MatLab), data are often distributed in “open” formats such as “.csv” (comma separated values). “.csv” files are text files in which each line of text corresponds to a row of data and each column of data is separated by a comma. To load a “.csv” file we can use the function read_delim() specifying that the delimiter is a comma (this function will also work for files that use different delimiters such as tab or semicolon), or we can use the function read_csv().

The following command reads the ceosal2.csv file and stores its contents in an object. Note that when using the read_delim() or the read_csv() functions, we need to give a name to the file we are loading in order for it to be saved in our Environment (otherwise the file is simply printed to the console.)

# Read comma delimited file
tb.ceosal2 <- read_delim("data/ceosal2.csv", delim= ",")
# or
tb.ceosal2 <- read_csv("data/ceosal2.csv")

4. Tibbles

The read_delim() and read_csv() functions automatically read the data into a tibble. “A tibble is a more advanced and capable version of R’s default format for data (data.frame).”¹. This is the data format we will use during this class.

For our own convenience, we include the prefix tb. in the name of all objects of the tibble type. Also, it is useful to give names to our objects or datasets that reflect their contents. We should choose names that are clear and meaningful to us and to others who might read our code.

We can create a tibble from an existing data object with as_tibble(). This will be useful if, for instance, we used the load() function (which does not automatically convert the data into the tibble format).

# Load data and convert to tibble format
load("data/ceosal2.RData")

# Check if "data" tibble
is_tibble(data)

## [1] FALSE

# Convert to tibble and rename
tb.ceosal2 <- as_tibble(data)

# Check if is tibble
is_tibble(tb.ceosal2)

## [1] TRUE

The tb.ceosal2 just created is a duplicate of the dataset data. As we don’t need the original dataset, we can delete by using the rm() function. rm stands for “remove” and it deletes objects from our work environment. This action cannot be undone!

# Remove original data.frame "data"
rm(data)

Piping

One useful tool that we will be using to manipulate tibbles is the pipe, %>%. Pipes allow us to express a sequence of multiple operations. They can greatly simplify our code and make it more intuitive². A useful way to read the pipe operator is to think of it as the expression “and then”.

To insert a pipe you can use the shortcut type: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac).

5. Data Exploration and Wrangling

There are many options for exploring data in R.

5.1 Table info

We can get the names of the variables in a dataset by using the function names():

# Print the names of variables in dataset
names(tb.ceosal2)

##  [1] "salary"   "age"      "college"  "grad"     "comten"   "ceoten"  
##  [7] "sales"    "profits"  "mktval"   "lsalary"  "lsales"   "lmktval" 
## [13] "comtensq" "ceotensq" "profmarg"

An easy way to change the name of a variable is using the rename() function.

# Rename variables
tb.ceosal2 <- tb.ceosal2 %>%
  rename(logsales=lsales)

Note that we are creating a new object tb.ceosal2 (that is, we are overwriting our original data set) that contains the same data as the original but with a new name logsales for the variable previously called lsales. Note also that we have used the pipe operator. We can read the code as: take tb.ceosal2 “and then” rename the variable lsales and store this changed data set as tb.ceosal2 (overwrite the original).

We can count the number of columns in the data set using ncol():

# Count number of columns/variables in dataset
ncol(tb.ceosal2)

## [1] 15

We can count the number of observations (rows) using nrow() or summarise(). The summarise() function “creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input.”³. The function n() returns the number of observations in a current group.

# Count the number of observations (rows)
nrow(tb.ceosal2)

## [1] 177

# or:
tb.ceosal2 %>% 
  summarize(n_row = n())

Below is a list of useful functions for exploring and interacting with your data in R:

Function	Description	Example
`names(df)`	Returns the variable names in the dataset	`names(tb.ceosal2)`
`ncol(df)`	Returns the number of columns	`ncol(tb.ceosal2)`
`nrow(df)`	Returns the number of rows	`nrow(tb.ceosal2)`
`head(df, n)`	Shows the first n rows	`head(tb.ceosal2, 3)`
`slice_head(n = x)`	Selects the first x rows	`tb.ceosal2 %>% slice_head(n = 3)`
`tail(df, n)`	Shows the last n rows	`tail(tb.ceosal2, 3)`
`slice_tail(n = x)`	Selects the last x rows	`tb.ceosal2 %>% slice_tail(n = 3)`
`slice(a:b)`	Extracts rows a to b	`tb.ceosal2 %>% slice(5:8)`
`select(var)`	Selects a variable	`tb.ceosal2 %>% select(salary)`
`select(var1, var2)`	Selects multiple variables	`tb.ceosal2 %>% select(salary, age)`
`rename(new = old)`	Renames a variable	`tb.ceosal2 %>% rename(logsales = lsales)`
`df[row, col]`	Accesses a specific element	`tb.ceosal2[1, 1]`
`df[row, ]`	Accesses a full row	`tb.ceosal2[1, ]`
`df[, col]`	Accesses a full column	`tb.ceosal2[, 1]`
`summarise(n = n())`	Counts number of rows	`tb.ceosal2 %>% summarise(n = n())`
`View(df)`	Opens the dataset in spreadsheet view	`View(tb.ceosal2)`
`print(df)`	Prints the dataset	`print(tb.ceosal2)`

⚠️ Note on Capitalization: R is case-sensitive. This means Mean() will not work, but mean() will. If a function or variable doesn’t work as expected, double-check your spelling and capitalization first.

to top

5.2 Ordering data

The function arrange() allows us to order the data.

# Ordering the data in ascending order (default) (output omitted)
tb.ceosal2 %>% 
  arrange(age)

# Ordering the data in descending order (output omitted)
tb.ceosal2 %>% 
  arrange(desc(age))

Ordering the data using multiple conditions:

# Order data by age (ascending) and then by grad (ascending) (output omitted)
tb.ceosal2 %>% 
  arrange(age, grad)       

# Order data by grad (descending) and then by age (ascending) (output omitted)
tb.ceosal2 %>% 
  arrange(desc(grad), age)

Note that these commands do not permanently order the data set — they only output the reordered set. To permanently order the data we would have to replace the contents of the object, doing something like tb.ceosal2 <- tb.ceosal2 %>% arrange(age,grad).

to top

5.3 Subsetting data

We can use the filter() function to subset the data based on any criteria we may want. For instance, we may want to focus our analysis on only young CEOs:

# Keep only CEOs who are 45 years old or younger
tb.ceosal2 %>% 
  filter(age <= 45)   

# Create new tibble with the filtered data
tb.young.ceosal2 <-  tb.ceosal2 %>% 
  filter(age <= 45)

Note that the first command simply lists all the rows for which CEO age is below the specified threshold.

Subsetting the data using multiple conditions

Use the symbol & for and and the symbol | for or. Use >/< for greater/less than and >=/<= for greater or equal/less or equal to, and == for equal.

# Select CEOs who 45 years old or younger AND have a graduate degree.
tb.ceosal2 %>% 
  filter(age <= 45 & grad == 1)

# Select CEOs who are younger than 35 OR have a salary higher than 5000.
tb.ceosal2 %>% 
  filter(age < 35 | salary > 5000)

to top

5.4 Creating new variables

To add a new variable to a tibble we use the function mutate().

Create a new variable that transforms/is a function of an existing variable(s):

# Add new variables that are functions of existing variables
tb.ceosal2 <- tb.ceosal2 %>% 
  mutate(log_salary = log(salary))

tb.ceosal2 <- tb.ceosal2 %>% 
  mutate(age_squared = age^2)

Create new variables that contain values specified by us:

# Add new variables that take on specific values
tb.ceosal2 <- tb.ceosal2 %>% 
  mutate(one = 1)

Deleting a variable:

# Delete a variable
tb.ceosal2 <- tb.ceosal2 %>% 
  select(-log_salary)

We can use the function head() or names() to verify the creation/deletion of variables.

to top

5.5 Grouping Data

Oftentimes we want to summarize data by a given grouping variable or set of variables. The function group_by allow us to group data. The following operations will be performed to each group. For instance:

# Count number of CEOs with and without graduate degree
tb.ceosal2 %>% 
  group_by(grad) %>%  
  summarize(n_ceo = n()) %>% 
  ungroup()

After performing the intended operation we can use the function ungroup() to remove the grouping of the data.

Grouping can be done using multiple variables as follows:

# Count number of CEOs with and without graduate and colleges degree
tb.ceosal2 %>% 
  group_by(grad, college) %>%  
  summarize(n_ceo = n()) %>% 
  ungroup()

to top

6. Recommended Assignment

Complete DataCamp’s second chapter of the Introduction to the Tidyverse course: Data visualization

Resources

From tibble.tidyverse.org ↩︎
From: https://cfss.uchicago.edu/notes/pipes/↩︎
From: https://dplyr.tidyverse.org/reference/summarise.html ↩︎