---
title: "Describing Data"
output: html_notebook
---

In this tutorial, we'll work on reading in and describing data, making use of some `tidyverse` packages and functions along the way.

First order of business: load the `tidyverse`.

```{r}
library(tidyverse)
```
# Reading in Data

There are a few options for reading in data in R. Most of the time, we'll be working with .csv data files (i.e., rectangular spreadsheets in a universal plain-text format).

For such files, we can use:

* `read.csv()`: the base R option
* `read_csv()`: the `tidyverse` option (from the `readr` package)
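
To see how the two functions differ in what they return, here is a minimal sketch. It writes a tiny made-up CSV to a temporary file (the file and its values are invented for illustration, not part of the course data) and reads it both ways.

```{r}
library(readr)

# A tiny made-up CSV written to a temporary file, so nothing here
# depends on the course data files
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country,Langs", "A,12", "B,250"), tmp)

base_df <- read.csv(tmp)                          # base R: returns a data.frame
tidy_df <- read_csv(tmp, show_col_types = FALSE)  # readr: returns a tibble

class(base_df)
class(tidy_df)
```

A tibble is still a data frame, but it prints more compactly and shows each column's type, which is handy when checking that a file was read in as expected.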

Let's read in the Nettle (1999) data used in Winter Ch. 1 and 2:

```{r}
nettle <- read_csv("nettle_1999_climate.csv")
```

# Examining Data

Let's take a look at this data. We'll use two functions that allow us to check out the top few rows and the bottom few rows:

```{r}
head(nettle)
```

```{r}
tail(nettle)
```

You can also click the little spreadsheet icon next to the `nettle` object in the top-right window of RStudio.

Now we'll practice using a few `dplyr` commands that help us focus or narrow down the dataset.

`filter()` identifies rows that meet some kind of logical condition. We typically use `>`, `<`, `>=`, `<=`, `==`, and `!=` (not equal) for logical conditions.

```{r}
filter(nettle, Langs > 200)
```

```{r}
filter(nettle, Country == "Botswana")
```
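
Conditions can also be combined. The sketch below uses a small made-up tibble called `toy` (hypothetical values, not the Nettle data) to show that separating conditions with a comma means "and", while `|` means "or".

```{r}
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  Country    = c("A", "B", "C", "D"),
  Langs      = c(12, 250, 300, 40),
  Population = c(3.5, 4.2, 5.1, 4.0)
)

# Comma: a row must satisfy BOTH conditions
both_conditions <- filter(toy, Langs > 200, Population > 5)

# |: a row must satisfy AT LEAST ONE condition
either_condition <- filter(toy, Langs > 200 | Population < 4)

both_conditions
either_condition
```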

We might also want to narrow down the number of variables to look at, using `select()`:

```{r}
select(nettle, Country, Langs)
```
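
`select()` can also drop columns or pick them by name pattern. This is a small sketch on a made-up tibble (`toy` and its values are hypothetical), assuming you would rather name the column to remove than list every column to keep.

```{r}
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  Country    = c("A", "B"),
  Population = c(3.5, 4.2),
  Langs      = c(12, 250)
)

# A minus sign drops the named column and keeps the rest
no_pop <- select(toy, -Population)

# Helper functions such as starts_with() select by name pattern
l_cols <- select(toy, starts_with("L"))

names(no_pop)
names(l_cols)
```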

# A Little Data Cleaning

We may want to change some variable names in this data. Let's use the `rename()` function to do that:

```{r}
nettle <- rename(nettle, Languages = Langs)
```

And let's get rid of a variable that we don't need right now:

```{r}
nettle <- nettle %>% select(Country, Population, Languages)
```

You'll notice that I did something different here - I used what is called the pipe operator (`%>%`). This has the effect of taking the thing on the left side of the pipe and sticking it into the first argument of the function on the right side. In this case, it meant that I didn't have to put `nettle` as the first argument in `select()`.
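
The description of the pipe above can be checked directly. In this sketch (again on a made-up tibble named `toy`), the nested call and the piped call produce the same result; the piped version simply reads left to right.

```{r}
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  Country = c("A", "B", "C"),
  Langs   = c(12, 250, 300)
)

# Without the pipe: read from the inside out
nested <- select(filter(toy, Langs > 200), Country)

# With the pipe: each result becomes the first argument of the next function
piped <- toy %>%
  filter(Langs > 200) %>%
  select(Country)

# Compare the two results
all.equal(nested, piped)
```

The benefit grows with the number of steps, since each additional nested call makes the first version harder to read.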

# Explore the Data Visually

We'll use `ggplot()` to get a sense of the data and some relationships within it. Let's start with a histogram:

```{r}
nettle %>% ggplot(aes(x = Languages))+
  geom_histogram()+
  theme_bw()
```

This tells us very quickly that the number of languages per country has a strong positive skew: most countries have relatively few languages, and a small number have very many.
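
By default `geom_histogram()` uses 30 bins and prints a message suggesting you pick a better value. A minimal sketch of setting the number of bins yourself, using simulated right-skewed values (the data and the choice of 15 bins are assumptions for illustration):

```{r}
library(ggplot2)

# Simulated, made-up values with a long right tail
set.seed(1)
toy <- data.frame(Languages = rexp(70, rate = 1 / 90))

# bins sets how many bars to draw; binwidth is the alternative
p <- ggplot(toy, aes(x = Languages)) +
  geom_histogram(bins = 15) +
  theme_bw()

p
```

It is worth trying a few values, since the apparent shape of a distribution can change with the bin count.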

```{r}
nettle %>% ggplot(aes(x = Population)) +
  geom_histogram()+
  theme_bw()
```

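Population, on the other hand, has a roughly symmetrical distribution, with most countries at values around 4. These values are far too small to be raw head counts, so the variable is presumably stored on a transformed scale (for example, a logarithm); check the data description in Winter before interpreting the units.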

Let's try a new plot to see if there's any relation between our two variables of interest - a scatterplot.

```{r}
nettle %>% ggplot(aes(x = Population, y = Languages))+
  geom_point()+
  theme_bw()
```

This is interesting. There isn't a clear pattern, though it does seem like there are more languages in countries with larger populations. But the country with the most languages is pretty average in terms of population.

# Some Descriptive Statistics

Last week we covered some basic functions for computing means, SDs, and other statistics useful for describing data. This time, we'll do something a little more sophisticated using the `summarize()` function (`summarise()` is the same function under its British spelling; the code here uses that spelling).

```{r}
nettle %>% summarise(pop_mean = mean(Population),
                     pop_sd = sd(Population),
                     pop_median = median(Population),
                     pop_min = min(Population),
                     pop_max = max(Population),
                     pop_range = pop_max - pop_min)
```

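Neat, right? Notice that `pop_range` reuses `pop_max` and `pop_min`, which were created earlier in the same call. We can do the same thing for Languages: just copy and paste the code, and then use Ctrl-F to do some find-and-replace. Replace "pop_" with "langs_" and "Population" with "Languages".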

```{r}
nettle %>% summarise(langs_mean = mean(Languages),
                     langs_sd = sd(Languages),
                     langs_median = median(Languages),
                     langs_min = min(Languages),
                     langs_max = max(Languages),
                     langs_range = langs_max - langs_min)
```
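
Copy-and-paste works, but `dplyr` can also apply the same set of functions to several columns at once with `across()`. The following is a sketch of that alternative on a made-up tibble (`toy` and its values are hypothetical), not a replacement for the code above.

```{r}
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  Population = c(3.5, 4.2, 5.1),
  Languages  = c(12, 250, 300)
)

# Apply each function in the named list to each selected column;
# results are named <column>_<function>
both_summaries <- toy %>%
  summarise(across(
    c(Population, Languages),
    list(mean = mean, sd = sd, median = median)
  ))

names(both_summaries)
```

The trade-off is that the output names follow the column names (`Population_mean` rather than `pop_mean`), so rename afterwards if you want shorter labels.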

# More Practice: Perry & Winter's Iconicity data

Let's read in some other data.

```{r}
iconic <- read_csv("perry_winter_2017_iconicity.csv")
mod <- read_csv("lynott_connell_2009_modality.csv")
```

We just read in two sets of data. Each file contains information about words (i.e., variables related to different semantic aspects of words), but different types of information. Eventually we can combine the two data sets.

Next, let's do some more plotting.

```{r}
iconic %>% ggplot(aes(x = Iconicity))+
  geom_histogram(fill = "peachpuff3")+
  geom_vline(aes(xintercept = 0), linetype = 2)+
  theme_minimal()
```

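Alright, let's join the two data sets. We'll use a type of join called `left_join()`, which keeps every row of the first (left) table and adds the matching columns from the second. Because we don't supply a `by` argument, it joins on all the columns the two tables share and prints a message saying which ones it used: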

```{r}
both <- left_join(iconic, mod)
```
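
To see what `left_join()` does with rows that have no match, here is a small sketch with two made-up tables. The key column name `Word` and all values are assumptions for illustration; check the message printed by the real join to see which column it actually used.

```{r}
library(dplyr)

# Made-up tables for illustration only
left_tbl <- tibble(
  Word      = c("buzz", "red", "idea"),
  Iconicity = c(2.1, 0.4, -0.2)
)
right_tbl <- tibble(
  Word             = c("buzz", "red"),
  DominantModality = c("Auditory", "Visual")
)

# Every row of left_tbl is kept; "idea" has no match, so it gets NA
joined <- left_join(left_tbl, right_tbl, by = "Word")

joined
```

Naming the key with `by` makes the join explicit and silences the message.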

We'll also narrow down this combined dataset to just the words of particular classes and rename a variable:

```{r}
both <- filter(both, POS %in% c("Adjective", "Verb", "Noun")) %>%
  rename(Modality = DominantModality)
```
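
The `%in%` operator used above tests each value for membership in a set, which is shorter than chaining `==` comparisons with `|`. A minimal sketch on a made-up character vector:

```{r}
# Made-up part-of-speech labels for illustration only
pos <- c("Adjective", "Adverb", "Noun", "Name")

# TRUE wherever the value is one of the listed classes
keep <- pos %in% c("Adjective", "Verb", "Noun")

# The equivalent written with == and |
keep_long <- pos == "Adjective" | pos == "Verb" | pos == "Noun"

keep
pos[keep]
identical(keep, keep_long)
```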

Now for some more plotting:

```{r}
both %>% ggplot(aes(x = Modality, y = Iconicity, fill = Modality))+
  geom_boxplot()+
  theme_minimal()
```

This looks neat! But we have a lot of NAs in the data, because the modality dataset contained far fewer words than the iconicity dataset. Let's focus on just the words that we have both iconicity and modality information for.

```{r}
both %>% filter(!is.na(Modality)) %>%
  ggplot(aes(x = Modality, y = Iconicity, fill = Modality))+
  geom_boxplot()+
  theme_minimal()
```
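
An alternative to `filter(!is.na(...))` is `drop_na()` from `tidyr`, which is part of the `tidyverse`. A small sketch on a made-up tibble (values are hypothetical) showing that the two approaches keep the same rows:

```{r}
library(dplyr)
library(tidyr)

# Made-up data for illustration only
toy <- tibble(
  Word     = c("buzz", "red", "idea"),
  Modality = c("Auditory", "Visual", NA)
)

# Two ways to keep only rows where Modality is not missing
via_filter  <- toy %>% filter(!is.na(Modality))
via_drop_na <- toy %>% drop_na(Modality)

all.equal(via_filter, via_drop_na)
```
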
Summarizing counts with `count()` (note that `NA` gets its own row):

```{r}
both %>% count(Modality)
```

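Similarly, `table()` and `xtabs()` return counts, leaving out the `NA` values by default. In the rendered notebook these are Auditory 67, Gustatory 47, Haptic 67, Olfactory 24, and Visual 202: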

```{r}
table(both$Modality)

xtabs(~Modality, data = both)
```
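
Counts are often easier to compare as proportions. The sketch below enters the five modality counts from the rendered `table()` output by hand into a small tibble (so it runs without the data files) and adds a proportion column. The names `counts` and `with_props` are hypothetical.

```{r}
library(dplyr)

# Counts copied from the rendered table() output
counts <- tibble(
  Modality = c("Auditory", "Gustatory", "Haptic", "Olfactory", "Visual"),
  n        = c(67, 47, 67, 24, 202)
)

# Proportion of words in each modality, largest group first
with_props <- counts %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(n))

with_props
```

With the real data, the same `mutate()` and `arrange()` steps could follow `count(Modality)` directly after filtering out the `NA` rows.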

What else can we do with this data?