Exploring A Dataset

Data come to us in a variety of formats, from pictures to text to numbers. Throughout this course, we’ll focus on datasets that are saved in a “spreadsheet”-type format. This is probably the most common way data are collected and saved in many fields, including linguistics. These “spreadsheet”-type datasets are called data frames in R. Data frames are like rectangular spreadsheets: they are representations of datasets in R where the rows correspond to observations and the columns correspond to variables that describe the observations.
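To make the rows-and-columns idea concrete, here is a minimal sketch that builds a tiny data frame by hand with base R’s data.frame() function. The object name words and its two columns are made up for illustration and are not part of the course data.

```r
# A toy data frame (hypothetical example, not the course dataset):
# each row is one observation (a word), each column is one variable.
words <- data.frame(
  word = c("cat", "house", "linguistics"),
  n_letters = c(3, 5, 11),
  stringsAsFactors = FALSE  # keep text as character in older R versions
)

words        # print the whole data frame
nrow(words)  # number of observations (rows)
ncol(words)  # number of variables (columns)
```

Here the data frame has three observations described by two variables.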

`ldt` data frame

The ldt dataset is a set of randomly selected words from the English Lexicon Project data (Balota et al. 2007). This dataset appears in Levshina’s (2015) textbook How to do Linguistics with R. Here’s how she describes it:

The English Lexicon Project provides behavioural and descriptive data from hundreds of native speakers for over forty thousand words in American English, as well as for non-words. (pp. 41-42)

Let’s begin by exploring the ldt data frame to get an idea of its structure. Run the following code in the console (type or paste it, then press Enter):

ldt

This displays the contents of the ldt data frame in your console. Note that the output may vary slightly depending on the size of your monitor.

Let’s unpack this output:

This particular data frame has
- 100 rows corresponding to different observations. Here, each observation is a different word.
- 4 columns corresponding to 4 variables describing each observation.
Word, Length, Freq, and Mean_RT are the different columns, in other words, the different variables of this dataset.
- Word is the word itself
- Length is the number of letters in the word
- Freq is a measure of how frequently the word is used (see Lund & Burgess 1996 for the details)
- Mean_RT is the average reaction time, measured in milliseconds, in a lexical decision task. (Participants have to decide whether the string of letters they see is a real word or not.)
The printed output is only a preview: R shows the first 10 rows, corresponding to the first 10 words.
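The row and column counts described above can also be checked with base R functions. The sketch below does not use the real ldt object: it builds a small stand-in called ldt_demo (a hypothetical name) from the first four entries of each column as they appear in the glimpse() output later in this section, so that the code runs on its own.

```r
# Stand-in for ldt: only the first four rows, copied from the
# glimpse() output in this section. The real ldt has 100 rows.
ldt_demo <- data.frame(
  Word = c("marveled", "persuaders", "midmost", "crutch"),
  Length = c(8, 10, 7, 6),
  Freq = c(131, 82, 0, 592),
  Mean_RT = c(819.19, 977.63, 908.22, 766.30),
  stringsAsFactors = FALSE
)

dim(ldt_demo)          # number of rows, then number of columns
names(ldt_demo)        # the variable (column) names
head(ldt_demo, n = 2)  # preview only the first two rows
```

With the real data, the same calls would be dim(ldt), names(ldt) and head(ldt).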

Unfortunately, this output does not allow us to explore the data very well. Let’s look at some different ways to explore data frames.

Exploring data frames

There are many ways to get a feel for the data contained in a data frame such as ldt. We present three functions that take as their “argument” (their input) the data frame in question. We also include a fourth method for exploring one particular column of a data frame:

1. Using the View() function (RStudio’s built-in spreadsheet viewer)
2. Using the glimpse() function (included in the dplyr package)
3. Using the kable() function (included in the knitr package)
4. Using the $ “extraction operator” (used to view a single variable/column in a data frame)

1. View():

Run View(ldt) in your console, either by typing it or by copying and pasting it into the console pane, and explore this data frame in the resulting pop-up viewer. You should get into the habit of always viewing any data frames you encounter. Note the uppercase V in View. R is case-sensitive, so you’ll get an error message if you run view(ldt) instead of View(ldt).

By running View(ldt), we can explore the different variables listed in the columns.

Note that if you look in the leftmost column of the View(ldt) output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row is representing. In other words, this will allow you to identify what object is being described in a given row. This is often called the observational unit. The observational unit in this example is an individual word. You can identify the observational unit by determining what “thing” is being measured or described by each of the variables.

2. glimpse():

The second way to explore a data frame is the glimpse() function, which offers a different perspective from the View() function. Running glimpse(ldt) in the console (after loading the dplyr package with library(dplyr)) produces the following output:

## Observations: 100
## Variables: 4
## $ Word    <chr> "marveled", "persuaders", "midmost", "crutch", "resuspen…
## $ Length  <dbl> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8, 1…
## $ Freq    <dbl> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523, 4…
## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.67,…

Observe that glimpse() gives you the first few entries of each variable in a row after the variable name. In addition, the data type of the variable is given immediately after each variable’s name inside < >. Here, dbl refers to “double”, which is computer coding terminology for a quantitative/numerical variable. In contrast, chr refers to “character”, which is computer terminology for text data.
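If you want to check the data types without glimpse(), base R offers typeof(), class() and str(). The sketch below is only an illustration: it uses a hypothetical stand-in called ldt_demo, built from the first four entries of each column in the output above, rather than the real ldt data frame.

```r
# Stand-in for ldt: first four rows copied from the glimpse() output above.
ldt_demo <- data.frame(
  Word = c("marveled", "persuaders", "midmost", "crutch"),
  Length = c(8, 10, 7, 6),
  Freq = c(131, 82, 0, 592),
  Mean_RT = c(819.19, 977.63, 908.22, 766.30),
  stringsAsFactors = FALSE
)

# Storage type of every column: text columns are "character",
# numeric columns like these are "double" (shown as <dbl> by glimpse()).
col_types <- sapply(ldt_demo, typeof)
col_types

# str() is a base R function that gives a compact overview
# similar in spirit to glimpse().
str(ldt_demo)
```

Note that class() reports "numeric" for the columns that typeof() reports as "double"; both describe the same kind of quantitative variable.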

3. kable():

The final way to explore the entirety of a data frame is the kable() function from the knitr package. Let’s display the ldt data frame this way. Run both of these lines of code in the console:

library(knitr)
kable(ldt)

At first glance, the output may not look very different from simply printing ldt. However, when using tools for producing reproducible reports such as R Markdown, kable() produces output that is much more legible and reader-friendly.

4. $ operator

Lastly, the $ operator allows us to extract and then explore a single variable within a data frame. For example, run the following in your console:

ldt$Word

We used the $ operator to extract only the Word variable and return it as a vector of length 100. We’ll only occasionally explore data frames using the $ operator, instead favoring the View() and glimpse() functions.
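Once a column has been extracted with $, it behaves like any other vector, so it can be passed to other functions. The following sketch shows this on a hypothetical four-row stand-in, ldt_demo, whose values are the first entries shown in the glimpse() output above; with the real data you would write ldt$Word, ldt$Mean_RT and so on.

```r
# Stand-in for ldt: first four rows copied from the glimpse() output above.
ldt_demo <- data.frame(
  Word = c("marveled", "persuaders", "midmost", "crutch"),
  Length = c(8, 10, 7, 6),
  Freq = c(131, 82, 0, 592),
  Mean_RT = c(819.19, 977.63, 908.22, 766.30),
  stringsAsFactors = FALSE
)

word_vec <- ldt_demo$Word  # extract one column as a vector
length(word_vec)           # how many entries the vector has

# An extracted numeric column can be summarised directly.
mean_rt <- mean(ldt_demo$Mean_RT)
mean_rt

# Length should equal the number of letters in each word.
length_matches <- all(nchar(ldt_demo$Word) == ldt_demo$Length)
length_matches
```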

Identification & measurement variables

There is a subtle difference between the kinds of variables that you will encounter in data frames: identification variables and measurement variables. For example, let’s explore the ldt data frame by showing the output of glimpse(ldt):

glimpse(ldt)

## Observations: 100
## Variables: 4
## $ Word    <chr> "marveled", "persuaders", "midmost", "crutch", "resuspen…
## $ Length  <dbl> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8, 1…
## $ Freq    <dbl> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523, 4…
## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.67,…

The variable Word is what we will call an identification variable, a variable that uniquely identifies each observational unit. In this case, the identification variable uniquely identifies words. Such variables are mainly used in practice to uniquely identify each row in a data frame. The remaining variables (Length, Freq, Mean_RT) are often called measurement or characteristic variables: variables that describe properties of each observational unit.

Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed. While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the left-most columns of your data frame.
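One way to check whether a variable (or a combination of variables) really identifies each row is to look for duplicates with base R’s anyDuplicated() function, which returns 0 when there are none. The sketch below is an illustration under stated assumptions: ldt_demo is a hypothetical four-row stand-in built from the glimpse() output above, and trials is an entirely invented data frame (its participant labels are made up) showing a case where two columns are needed.

```r
# Stand-in for ldt: first four words copied from the glimpse() output above.
ldt_demo <- data.frame(
  Word = c("marveled", "persuaders", "midmost", "crutch"),
  Length = c(8, 10, 7, 6),
  stringsAsFactors = FALSE
)

# Word identifies each row if no word occurs more than once.
word_is_id <- anyDuplicated(ldt_demo$Word) == 0
word_is_id

# Hypothetical design: the same words are shown to several participants,
# so neither column alone identifies a row, but the pair does.
trials <- data.frame(
  participant = c("p1", "p1", "p2", "p2"),
  word = c("crutch", "midmost", "crutch", "midmost"),
  stringsAsFactors = FALSE
)

word_alone_is_id <- anyDuplicated(trials$word) == 0
pair_is_id <- anyDuplicated(trials[, c("participant", "word")]) == 0
word_alone_is_id
pair_is_id
```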

Help files

Another nice feature of R is its help files, which provide documentation for various functions and datasets. You can bring up a help file by adding a ? before the name of a function or data frame and then running this in the console. You will then be presented with a page showing the corresponding documentation if it exists. For example, let’s look at the help file for the View() function.

?View

The help file should pop up in the Help pane of RStudio. If you have questions about a function or data frame included in an R package, you should get in the habit of consulting the help file right away.
