---
title: "Describing Data"
output: html_notebook
---

In this tutorial, we'll work on reading in and describing data, making use of some `tidyverse` packages and functions along the way.

First order of business: load up the `tidyverse`.

```{r}
library(tidyverse)
```

# Reading in Data

There are a few options for reading in data in R. Most of the time, we'll be working with .csv data files (i.e., rectangular spreadsheets in a universal plain-text format).

For such files, we can use:

* `read.csv()`, which comes with base R
* `read_csv()`, the `tidyverse` option (from the `readr` package)

Let's read in the Nettle (1999) data used in Winter Ch. 1 and 2:

```{r}
nettle <- read_csv("nettle_1999_climate.csv")
```
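
To see how the two readers differ, here is a minimal sketch that writes a tiny CSV to a temporary file and reads it back both ways. The file contents and the names `toy_path`, `toy_tidy`, and `toy_base` are made up for illustration; they are not part of the Nettle data.

```{r}
library(readr)

# Write a tiny, invented CSV to a temporary file so the example is self-contained
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Country,Langs", "A,12", "B,340"), toy_path)

# read_csv() returns a tibble; show_col_types = FALSE silences the column report
toy_tidy <- read_csv(toy_path, show_col_types = FALSE)

# read.csv() returns a plain data.frame
toy_base <- read.csv(toy_path)

class(toy_tidy)
class(toy_base)
```

Both objects hold the same values; the main practical difference is that a tibble prints more compactly and reports column types.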

# Examining Data

Let's take a look at this data. We'll use two functions that allow us to check out the top few rows and the bottom few rows:

```{r}
head(nettle)
```

```{r}
tail(nettle)
```

You can also click the little spreadsheet icon next to the `nettle` object in the top-right window of RStudio.

Now we'll practice using a few `dplyr` commands that help us focus or narrow down the dataset.

`filter()` identifies rows that meet some kind of logical condition. We typically use `>`, `<`, `>=`, `<=`, `==`, and `!=` (not equal) for logical conditions.

```{r}
filter(nettle, Langs > 200)
```

```{r}
filter(nettle, Country == "Botswana")
```

We might also want to narrow down the number of variables to look at, using `select()`:

```{r}
select(nettle, Country, Langs)
```
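
Conditions can also be combined. The sketch below uses a small invented table (the column names mirror `nettle`, but the values are made up) to show AND and OR conditions in `filter()`, plus dropping a column with `select()`.

```{r}
library(dplyr)

# Invented data; only the column names resemble the nettle data
toy_countries <- tibble(
  Country = c("A", "B", "C", "D"),
  Langs   = c(12, 340, 95, 210),
  Area    = c(5.1, 6.3, 4.8, 5.9)
)

# Conditions separated by commas are combined with AND
big_both <- filter(toy_countries, Langs > 200, Area > 6)

# Use | when either condition is enough
either <- filter(toy_countries, Langs > 200 | Country == "A")

# A minus sign in select() drops a column instead of keeping it
no_area <- select(toy_countries, -Area)

big_both
either
no_area
```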

# A Little Data Cleaning

We may want to change some variable names in this data. Let's use the `rename()` function to do that:

```{r}
nettle <- rename(nettle, Languages = Langs)
```

And let's get rid of a variable that we don't need right now:

```{r}
nettle <- nettle %>% select(Country, Population, Languages)
```

You'll notice that I did something different here - I used what is called the pipe operator (`%>%`). This has the effect of taking the thing on the left side of the pipe and sticking it into the first argument of the function on the right side. In this case, it meant that I didn't have to put `nettle` as the first argument in `select()`.
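
The pipe really pays off when several steps are chained. As a small sketch on invented data (`toy_langs` is a made-up table), the nested call and the piped call below do the same thing, but the piped version reads top to bottom in the order the steps happen.

```{r}
library(dplyr)

# Invented data for illustration
toy_langs <- tibble(Country = c("A", "B", "C"), Langs = c(12, 340, 95))

# Nested: read from the inside out
nested <- select(filter(toy_langs, Langs > 50), Country)

# Piped: read from left to right
piped <- toy_langs %>%
  filter(Langs > 50) %>%
  select(Country)

all.equal(nested, piped)
```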

# Explore the Data Visually

We'll use `ggplot()` to get a sense of the data and some relationships within it. Let's start with a histogram:

```{r}
nettle %>% ggplot(aes(x = Languages))+
  geom_histogram()+
  theme_bw()
```

This tells us very quickly that the number of languages per country has a strong positive skew: most countries have relatively few languages, and a handful have very many.
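
By default `geom_histogram()` uses 30 bins and prints a message suggesting that you pick a value yourself. Here is a sketch of the two ways to do that, using invented right-skewed counts rather than the `nettle` values; the bin settings are arbitrary choices for illustration.

```{r}
library(ggplot2)

# Invented, right-skewed counts (not the nettle values)
toy_counts <- data.frame(
  Languages = c(3, 5, 8, 12, 15, 20, 27, 40, 65, 120, 300, 860)
)

# Option 1: fix the number of bins
p_bins <- ggplot(toy_counts, aes(x = Languages)) +
  geom_histogram(bins = 10) +
  theme_bw()

# Option 2: fix the width of each bin, in the units of the x variable
p_width <- ggplot(toy_counts, aes(x = Languages)) +
  geom_histogram(binwidth = 100) +
  theme_bw()

p_bins
p_width
```

With skewed data, it is worth trying a few settings, since the apparent shape can change quite a bit with the bin width.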

```{r}
nettle %>% ggplot(aes(x = Population)) +
  geom_histogram()+
  theme_bw()
```

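Population, on the other hand, has a fairly symmetrical distribution, with most countries around 4. These values are too small to be raw head counts; they appear to be log-transformed (if they are the base-10 log of population in thousands, a value of 4 corresponds to roughly 10 million people).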

Let's try a new plot to see if there's any relation between our two variables of interest - a scatterplot.

```{r}
nettle %>% ggplot(aes(x = Population, y = Languages))+
  geom_point()+
  theme_bw()
```

This is interesting - there's no really clean pattern, though it does seem like there are more languages in countries with larger populations. But the country with the most languages is pretty average in terms of population.

# Some Descriptive Statistics

Last week we covered some basic functions for computing means, SDs, and other statistics useful for describing data. This time, we'll do something a little more sophisticated using the `summarize()` function (spelled `summarise()` in the code below; the two spellings are interchangeable).

```{r}
nettle %>% summarise(pop_mean = mean(Population),
                     pop_sd = sd(Population),
                     pop_median = median(Population),
                     pop_min = min(Population),
                     pop_max = max(Population),
                     pop_range = pop_max - pop_min)
```

Neat, right? We can do the same thing for Languages - just copy and paste the code, and then use ctrl-f to do some find-and-replace. Replace "pop_" with "langs_" and "Population" with "Languages".

```{r}
nettle %>% summarise(langs_mean = mean(Languages),
                     langs_sd = sd(Languages),
                     langs_median = median(Languages),
                     langs_min = min(Languages),
                     langs_max = max(Languages),
                     langs_range = langs_max - langs_min)
```
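
Copy-and-paste works, but `dplyr` also offers `across()`, which applies the same set of functions to several columns in one call. The sketch below uses an invented table (`toy_stats`) so that it runs on its own. The output columns are named automatically as column name plus function name (for example `Languages_mean`), which differs from the `pop_`/`langs_` prefixes used above.

```{r}
library(dplyr)

# Invented values; only the column names match the nettle data
toy_stats <- tibble(
  Population = c(3.2, 4.1, 4.4, 5.0),
  Languages  = c(10, 25, 40, 125)
)

# Apply every function in the named list to both columns
toy_summary <- toy_stats %>%
  summarise(across(
    c(Population, Languages),
    list(mean = mean, sd = sd, median = median, min = min, max = max)
  ))

toy_summary
```

The range is left out here because it depends on two of the other results; it could be added afterwards with `mutate()`.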

# More Practice: Perry & Winter's Iconicity data

Let's read in some other data.

```{r}
iconic <- read_csv("perry_winter_2017_iconicity.csv")
mod <- read_csv("lynott_connell_2009_modality.csv")
```

We just read in two sets of data. Each file contains information about words (i.e., variables related to different semantic aspects of words), but different types of information. Eventually we can combine the two data sets.

Next, let's do some more plotting.

```{r}
iconic %>% ggplot(aes(x = Iconicity))+
  geom_histogram(fill = "peachpuff3")+
  geom_vline(aes(xintercept = 0), linetype = 2)+
  theme_minimal()
```

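Alright, let's join the two data sets. We'll use a type of join called `left_join()`, which keeps every row of the first data set (`iconic`) and adds the matching columns from the second (`mod`):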

```{r}
both <- left_join(iconic, mod)
```
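
To make the behavior of a left join concrete, here is a sketch with two tiny invented tables. The key column `Word` and all values are assumptions made for this toy example. When `by` is omitted, as in the call above, `left_join()` matches on every column name the two tables share and prints a message saying which ones it used.

```{r}
library(dplyr)

# Invented tables; "Word" as the shared key is an assumption for this example
toy_icon <- tibble(
  Word      = c("bang", "chair", "soft"),
  Iconicity = c(3.1, 0.2, 1.4)
)
toy_mod <- tibble(
  Word             = c("bang", "soft"),
  DominantModality = c("Auditory", "Haptic")
)

# Every row of toy_icon is kept; words missing from toy_mod get NA
toy_joined <- left_join(toy_icon, toy_mod, by = "Word")

toy_joined
```

Here "chair" has no row in `toy_mod`, so its `DominantModality` is `NA` after the join.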

We'll also narrow down this combined dataset to just the words of particular classes and rename a variable:

```{r}
both <- filter(both, POS %in% c("Adjective", "Verb", "Noun")) %>%
  rename(Modality = DominantModality)
```
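
The `%in%` operator checks whether each value appears anywhere in a set, which is shorter than chaining several `==` tests with `|`. A small sketch on invented words (`toy_pos` is made up for illustration):

```{r}
library(dplyr)

# Invented words and parts of speech
toy_pos <- tibble(
  Word = c("run", "red", "dog", "quickly"),
  POS  = c("Verb", "Adjective", "Noun", "Adverb")
)

# Set membership with %in%
kept_in <- filter(toy_pos, POS %in% c("Adjective", "Verb", "Noun"))

# The same filter written out with == and |
kept_or <- filter(toy_pos, POS == "Adjective" | POS == "Verb" | POS == "Noun")

kept_in
```

Writing `POS == c("Adjective", "Verb", "Noun")` instead would compare element by element with recycling, which is rarely what you want.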

Now for some more plotting:

```{r}
both %>% ggplot(aes(x = Modality, y = Iconicity, fill = Modality))+
  geom_boxplot()+
  theme_minimal()
```

This looks neat! But we have a lot of NAs in the data - the modality dataset contained far fewer words than the iconicity dataset, so many words have no modality value after the join. Let's focus on just the words that we have both iconicity and modality information for.

```{r}
both %>% filter(!is.na(Modality)) %>%
  ggplot(aes(x = Modality, y = Iconicity, fill = Modality))+
  geom_boxplot()+
  theme_minimal()
```

We can also summarize counts for each modality with `count()`:

```{r}
both %>% count(Modality)
```

Similarly, the base R functions `table()` and `xtabs()` produce counts. Unlike `count()`, they leave out NAs by default:

```{r}
table(both$Modality)

xtabs(~Modality, data = both)
```
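
The difference in how missing values are handled is easy to miss, so here is a sketch on a four-row invented column (`toy_modality`). `count()` gives `NA` its own row, `table()` drops it unless asked otherwise, and `useNA = "ifany"` brings it back.

```{r}
library(dplyr)

# Invented column with one missing value
toy_modality <- tibble(Modality = c("Visual", "Visual", "Haptic", NA))

# count() keeps NA as its own group
counted <- toy_modality %>% count(Modality)

# table() silently drops NA by default
tabled <- table(toy_modality$Modality)

# Ask table() to report NA when there is any
tabled_na <- table(toy_modality$Modality, useNA = "ifany")

counted
tabled
tabled_na
```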

What else can we do with this data?