In this tutorial, we’ll work on reading in and describing data,
making use of some tidyverse packages and functions along
the way.
First order of business is to load the tidyverse:
library(tidyverse)
Reading in Data
There are a few options for reading in data in R. Most of the time,
we’ll be working with .csv data files (i.e., rectangular spreadsheets in
a universal plain-text format).
For such files, we can use:
read.csv(), from base R
read_csv(), a tidyverse option (from the readr package)
Let’s read in the Nettle (1999) data used in Winter Ch. 1 and 2:
nettle <- read_csv("nettle_1999_climate.csv")
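To try read_csv() without the course file, one option is to write a tiny CSV to a temporary file and read it back. This is only a sketch: the values and the names toy_path and toy are made up for illustration and are not taken from the Nettle data.

```r
library(readr)  # part of the tidyverse; provides read_csv()

# Hypothetical toy data written to a temporary file
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Country,Langs", "A,10", "B,250"), toy_path)

# read_csv() returns a tibble and guesses each column's type
toy <- read_csv(toy_path, show_col_types = FALSE)
toy
```

Here Country should come back as a character column and Langs as a numeric one, which is worth checking whenever you read in a new file.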
Examining Data
Let’s take a look at this data. We’ll use two functions that allow us
to check out the top few rows and the bottom few rows:
head(nettle)
tail(nettle)
You can also click the little spreadsheet icon next to the
nettle object in the top-right window of RStudio.
Now we’ll practice using a few dplyr commands that help
us focus or narrow down the dataset.
filter() identifies rows that meet some kind of logical
condition. We typically use >, <, >=, <=, ==, and != (not
equal) for logical conditions.
filter(nettle, Langs > 200)
filter(nettle, Country == "Botswana")
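Conditions can also be combined. The sketch below uses a small made-up tibble (not the real Nettle values) to show how filter() treats several conditions at once.

```r
library(dplyr)  # part of the tidyverse

# Hypothetical toy data for illustration only
toy <- tibble(Country = c("A", "B", "C", "D"),
              Langs   = c(10, 250, 300, 40))

# Rows where Langs exceeds 200
many <- filter(toy, Langs > 200)

# Conditions separated by commas must all be TRUE (logical AND)
some <- filter(toy, Langs > 20, Country != "C")

# The | operator keeps rows where either condition is TRUE (logical OR)
either <- filter(toy, Langs < 20 | Langs > 280)
```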
We might also want to narrow down the number of variables to look at,
using select():
select(nettle, Country, Langs)
A Little Data Cleaning
We may want to change some variable names in this data. Let’s use the
rename() function to do that:
nettle <- rename(nettle, Languages = Langs)
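Two details of rename() are easy to trip over: the new name goes on the left of the equals sign, and the result has to be assigned to keep it. A minimal sketch with made-up data:

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(Country = c("A", "B"), Langs = c(10, 250))

# rename() takes new_name = old_name and returns a new tibble
toy2 <- rename(toy, Languages = Langs)

names(toy2)  # the renamed copy
names(toy)   # the original is unchanged because we did not overwrite it
```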
And let’s get rid of a variable that we don’t need right now:
nettle <- nettle %>% select(Country, Population, Languages)
You’ll notice that I did something different here - I used what is
called the pipe operator (%>%). This has the effect of
taking the thing on the left side of the pipe and sticking it into the
first argument of the function on the right side. In this case, it meant
that I didn’t have to put nettle as the first argument in
select().
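To see that the pipe only changes how the code is written, not what it does, here is a small sketch comparing the two styles on made-up data (the Area column and all values are hypothetical).

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(Country = c("A", "B"),
              Area    = c(5, 6),
              Langs   = c(10, 250))

# Data frame passed as the first argument
nested <- select(toy, Country, Langs)

# Same call, with the data frame piped in from the left
piped <- toy %>% select(Country, Langs)

identical(nested, piped)

# Pipes can be chained, so each step reads left to right
chained <- toy %>%
  filter(Langs > 100) %>%
  select(Country)
```

Chaining is where the pipe pays off, since the nested equivalent would have to be read from the inside out.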
Explore the Data Visually
We’ll use ggplot() to get a sense of the data and some
relationships within it. Let’s start with a histogram:
nettle %>% ggplot(aes(x = Languages)) +
  geom_histogram() +
  theme_bw()

This tells us very quickly that the number of languages per country
is a variable with a very positive skew.
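The histogram calls in this section leave the number of bins at ggplot2's default. As a side sketch, the bins argument of geom_histogram() sets it explicitly; the toy values and the choice of 10 bins here are arbitrary assumptions, not a recommendation for the Nettle data.

```r
library(ggplot2)  # part of the tidyverse

# Hypothetical right-skewed toy data for illustration only
toy <- data.frame(Langs = c(1, 2, 2, 3, 3, 3, 5, 8, 20, 60))

# bins controls how many bars the range of x is divided into
p <- ggplot(toy, aes(x = Langs)) +
  geom_histogram(bins = 10) +
  theme_bw()

p
```

Trying a few different values is a quick way to check that an apparent skew is not just a product of the binning.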
nettle %>% ggplot(aes(x = Population)) +
  geom_histogram() +
  theme_bw()

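Population, on the other hand, has a fairly symmetrical distribution.
Most countries have a value around 4. The units are not obvious from the
plot (millions is only a guess), so check how the variable was recorded
before interpreting it.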
Let’s try a new plot to see if there’s any relation between our two
variables of interest - a scatterplot.
nettle %>% ggplot(aes(x = Population, y = Languages)) +
  geom_point() +
  theme_bw()

This is interesting - there’s not a really clear/clean pattern,
though it does seem like there are more languages in countries with
larger populations. But the country with the most languages is pretty
average in terms of population.
Some Descriptive Statistics
Last week we covered some basic functions for computing means, SDs,
and other statistics useful for describing data. This time, we’ll do
something a little more sophisticated using the summarize()
function (the code uses the spelling summarise(); dplyr accepts both).
nettle %>% summarise(pop_mean = mean(Population),
                     pop_sd = sd(Population),
                     pop_median = median(Population),
                     pop_min = min(Population),
                     pop_max = max(Population),
                     pop_range = pop_max - pop_min)
Neat, right? We can do the same thing for Languages - just copy and
paste the code, and then use ctrl-f to do some find-and-replace. Replace
“pop_” with “langs_” and “Population” with “Languages”.
nettle %>% summarise(langs_mean = mean(Languages),
                     langs_sd = sd(Languages),
                     langs_median = median(Languages),
                     langs_min = min(Languages),
                     langs_max = max(Languages),
                     langs_range = langs_max - langs_min)
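Both summaries rely on a handy feature of summarise(): a column created earlier in the call can be used by a later one, which is how the range is computed from the min and max. A minimal sketch on made-up numbers:

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(Langs = c(10, 40, 250, 300))

out <- toy %>% summarise(langs_min = min(Langs),
                         langs_max = max(Langs),
                         # uses the two columns defined just above
                         langs_range = langs_max - langs_min)
out
```

The order matters here: langs_range has to come after the columns it depends on.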
More Practice: Perry & Winter’s Iconicity data
Let’s read in some other data.
iconic <- read_csv("perry_winter_2017_iconicity.csv")
mod <- read_csv("lynott_connell_2009_modality.csv")
We just read in two sets of data. Each file contains information
about words (i.e., variables related to different semantic aspects of
words), but different types of information. Eventually we can combine
the two data sets.
Next, let’s do some more plotting.
iconic %>% ggplot(aes(x = Iconicity)) +
  geom_histogram(fill = "peachpuff3") +
  geom_vline(aes(xintercept = 0), linetype = 2) +
  theme_minimal()

Alright, let’s join the two data sets. We’ll use a type of join
called left_join():
both <- left_join(iconic, mod)
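A left join keeps every row of the first (left) table and fills in NA wherever the second table has no match. The sketch below shows this on two made-up tables; the words, the values, and the explicit by = "Word" are assumptions for illustration. When by is left out, left_join() matches on all columns the two tables share and prints a message saying which ones it used.

```r
library(dplyr)

# Hypothetical toy tables for illustration only
left  <- tibble(Word      = c("bang", "soft", "the"),
                Iconicity = c(3.1, 1.2, 0.1))
right <- tibble(Word     = c("bang", "soft"),
                Modality = c("Auditory", "Haptic"))

# All three rows of `left` are kept; "the" has no match in `right`
joined <- left_join(left, right, by = "Word")
joined
```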
We’ll also narrow down this combined dataset to just the words of
particular classes and rename a variable:
both <- filter(both, POS %in% c("Adjective", "Verb", "Noun")) %>%
  rename(Modality = DominantModality)
Now for some more plotting:
both %>% ggplot(aes(x = Modality, y = Iconicity, fill = Modality)) +
  geom_boxplot() +
  theme_minimal()

This looks neat! But we have a lot of NAs in the data - the modality
dataset contained many fewer words than the iconicity dataset. Let’s
focus on just the words that we have both iconicity and modality
information for.
both %>% filter(!is.na(Modality)) %>%
  ggplot(aes(x = Modality, y = Iconicity, fill = Modality)) +
  geom_boxplot() +
  theme_minimal()

Summarizing counts…
both %>% count(Modality)
Similarly, table() and xtabs() give counts, though by default they leave
out the NA values that count() reports:
table(both$Modality)
 Auditory Gustatory    Haptic Olfactory    Visual
       67        47        67        24       202
xtabs(~Modality, data = both)
Modality
 Auditory Gustatory    Haptic Olfactory    Visual
       67        47        67        24       202
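The counting functions differ in how they treat missing values, which matters for data like this with many NAs. A small sketch on made-up data (the values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(Modality = c("Visual", "Visual", "Haptic", NA))

# count() keeps NA as its own row
counts <- toy %>% count(Modality)

# table() drops NA by default ...
tab <- table(toy$Modality)

# ... unless asked to include it
tab_na <- table(toy$Modality, useNA = "ifany")
```

If the totals from count() and table() disagree, missing values are the first thing to check.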
What else can we do with this data?
