1. Getting Started

So far, we have mostly worked with clean, tidy datasets. Often, however, we need to rework the data before we can extract the information we want. This process is called data cleaning, and it will inevitably be a large part of your project and of any analysis you do in the future.

Text data is one of the most common types of data you will need to work with.

Today, we’ll learn about text data by using a dataset of about 4,000 popular movies from The Movie Database (TMDb).[1]

Name          Description
title         Movie name
tagline       Tagline
release_date  Release date (YYYY-MM-DD)
genres        Genres (separated by comma)
budget        Budget (in dollars)
revenue       Revenue (in dollars)
runtime       Runtime (in minutes)
status        Released, rumored, post-production, etc.
keywords      List of keywords (separated by comma)
vote_average  Average quality vote on TMDb (from 0 to 10)
vote_count    Number of votes on TMDb
production    Production companies (separated by comma)
english       English is a spoken language (1) or not (0)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(patchwork)

movies <- read_csv("data/movies.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   title = col_character(),
##   tagline = col_character(),
##   release_date = col_date(format = ""),
##   genres = col_character(),
##   budget = col_double(),
##   revenue = col_double(),
##   runtime = col_double(),
##   status = col_character(),
##   keywords = col_character(),
##   vote_average = col_double(),
##   vote_count = col_double(),
##   production = col_character(),
##   countries = col_character(),
##   english = col_double()
## )

Exercise

  1. Before we dive in, let’s get accustomed to the data. Looking at the columns in the table above, develop a research question relating an explanatory variable (EV) to a dependent variable (DV). Develop a simple hypothesis and an explanation for why the hypothesis might be true. You might start by thinking about the simple numeric columns in the dataset and whether there could be a relationship between them (budget, revenue, vote_average, etc.).

  2. Next, create a simple plot to test your hypothesis. What do you find?

2. String Detection

So far, we have compared data in relatively simple ways - equality with ==, greater than or less than for numbers, etc. However, comparisons can be difficult with text data.

For example, you may simply want to filter the movies dataset to all Action movies. However, look at the genres column:

# look at first five entries of genres column
movies$genres[1:5]
## [1] "Action, Adventure, Fantasy, Science Fiction"
## [2] "Adventure, Fantasy, Action"                 
## [3] "Action, Adventure, Crime"                   
## [4] "Action, Crime, Drama, Thriller"             
## [5] "Action, Adventure, Science Fiction"

Here, you can see that filtering to genres == "Action" will not work. In plain words, what do we want to do here?

Well, we don’t care if the genres column is exactly equal to “Action”. Instead, we want to know whether the word “Action” is anywhere in the string. For example:

# this is an action movie!
"Action, Adventure, Crime"
## [1] "Action, Adventure, Crime"
# this is not
"Adventure, Fantasy, Drama"
## [1] "Adventure, Fantasy, Drama"

str_detect() is a function that will check whether or not a string (the genres column) contains a smaller string of our choice (“Action”). Let’s try it:

str_detect(string = "Action, Adventure, Crime",
           pattern = "Action")
## [1] TRUE
str_detect(string = "Adventure, Fantasy, Drama",
           pattern = "Action")
## [1] FALSE
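
Two details matter once you move from single strings to a whole column. The sketch below uses a small made-up vector (genres_example is a hypothetical name, not part of the movies data) to show that str_detect() returns one result per string, passes missing values through as NA, and is case sensitive unless told otherwise:

```r
library(stringr)

# hypothetical genre strings, including one missing value
genres_example <- c("Action, Adventure, Crime",
                    "Adventure, Fantasy, Drama",
                    NA)

# vectorised: one TRUE/FALSE (or NA) per input string
has_action <- str_detect(genres_example, "Action")
has_action
## [1]  TRUE FALSE    NA

# case sensitive by default, so lowercase "action" finds nothing
has_lower <- str_detect(genres_example, "action")
has_lower
## [1] FALSE FALSE    NA

# fixed() matches literal text; ignore_case = TRUE relaxes the case rule
has_any_case <- str_detect(genres_example, fixed("action", ignore_case = TRUE))
has_any_case
## [1]  TRUE FALSE    NA
```

Because the result has one entry per input string, it can be used directly inside mutate() or filter().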

Exercise

  1. Some letters are probably pretty rare in movie titles. Using str_detect(), create a column called hasQ that is TRUE if a movie title has a capital Q in the title. Then filter to those movies and save them to an object called q_movies.
q_movies <- movies %>%
  mutate(hasQ = str_detect(title, "Q")) %>%
  filter(hasQ)
  2. The arrange() function will sort a dataset by the values of a column. By default, arrange() will sort in ascending order (smallest to largest). Check the documentation for arrange to find the desc() function, which you can use together with arrange() to sort in descending order. Scroll down to the examples to see how this works with the %>%. Then, sort q_movies in descending order by budget to find the movie with a capital Q in the title that has the largest budget.
q_movies %>% 
  arrange(desc(budget))
## # A tibble: 21 x 15
##    title tagline release_date genres budget revenue runtime status keywords
##    <chr> <chr>   <date>       <chr>   <dbl>   <dbl>   <dbl> <chr>  <chr>   
##  1 Quan… For lo… 2008-10-30   Adven… 2.00e8  5.86e8     106 Relea… killing…
##  2 Gala… A come… 1999-12-23   Comed… 4.50e7  9.07e7     102 Relea… space b…
##  3 John… Give a… 2002-02-15   Drama… 3.60e7  1.02e8     116 Relea… father …
##  4 Quee… This t… 2002-02-10   Drama… 3.50e7  4.55e7     101 Relea… queen, …
##  5 The … Think … 1995-02-09   Actio… 3.20e7  1.86e7     107 Relea… gunslin…
##  6 The … <NA>    2002-11-22   Drama… 3.00e7  2.77e7     101 Relea… terror,…
##  7 Supe… Nuclea… 1987-07-23   Actio… 1.70e7  1.93e7      90 Relea… saving …
##  8 The … Our Le… 2006-09-15   Drama  1.50e7  1.23e8     103 Relea… upper c…
##  9 Conf… She's … 2004-02-17   Comedy 1.50e7  2.93e7      89 Relea… rock st…
## 10 Quil… There … 2000-11-22   Drama  1.35e7  7.06e6     124 Relea… asylum,…
## # … with 11 more rows, and 6 more variables: vote_average <dbl>,
## #   vote_count <dbl>, production <chr>, countries <chr>, english <dbl>,
## #   hasQ <lgl>

Plotting that information

Awesome, now let’s try str_detect() on our dataset. We will use it to make a new column called action, which will be TRUE if the movie is an Action movie and FALSE otherwise.

movies <- movies %>%
  mutate(action = str_detect(string = genres, pattern = "Action"))
movies %>%
  ggplot(aes(x = action, y = budget, fill = action)) + 
    geom_boxplot() + 
    coord_flip() + 
    theme_bw() + 
    labs(title = "Do action movies have larger budgets?")
## Warning: Removed 653 rows containing non-finite values (stat_boxplot).

movies %>%
  group_by(action) %>%
  summarise(mean_b = mean(budget, na.rm = TRUE),
            median_b = median(budget, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   action    mean_b median_b
##   <lgl>      <dbl>    <dbl>
## 1 FALSE  30322312. 20000000
## 2 TRUE   60849401. 45000000

Exercise

  1. Choose a film production company from the production column (you can call View(movies) to look through some examples). Create a column that is TRUE if the film was produced by that company and FALSE otherwise. Then, create a boxplot that compares the budget of films made by that company with all other companies. Describe what you find.
movies %>%
  mutate(disney = str_detect(production, "Walt Disney")) %>%
  ggplot(aes(x = disney, y = budget, fill = disney)) + 
    geom_boxplot() + 
    theme_bw()
## Warning: Removed 653 rows containing non-finite values (stat_boxplot).

## Alternate way to do this
movies %>%
  # label each film by studio; the column holds three categories, so call it studio
  mutate(studio = case_when(str_detect(production, "Walt Disney") ~ "Disney",
                            str_detect(production, "Marvel") ~ "Marvel",
                            TRUE ~ "Other")) %>%
  ggplot(aes(x = studio, y = budget, fill = studio)) + 
    geom_boxplot()
## Warning: Removed 653 rows containing non-finite values (stat_boxplot).
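
One thing to watch with case_when() is that conditions are checked in order and the first match wins. The sketch below illustrates this on a made-up tibble (toy and its production values are hypothetical, not rows from movies.csv): a film listing both companies is labelled "Disney" because that condition comes first.

```r
library(dplyr)
library(stringr)

# hypothetical production strings, not taken from movies.csv
toy <- tibble(production = c("Walt Disney Pictures",
                             "Marvel Studios, Walt Disney Pictures",
                             "Marvel Studios",
                             "Paramount Pictures"))

toy <- toy %>%
  mutate(studio = case_when(
    str_detect(production, "Walt Disney") ~ "Disney",  # checked first
    str_detect(production, "Marvel") ~ "Marvel",       # only reached if not Disney
    TRUE ~ "Other"))                                   # everything else

toy$studio
## [1] "Disney" "Disney" "Marvel" "Other"
```

If co-productions should count as Marvel instead, swapping the first two conditions would be one way to do it.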

3. Working with Dates

R also has built-in classes for dates, which allow us to manipulate and compare dates.

Here, read_csv() recognized release_date as a date column on its own (see the column specification above), so R can compare and plot the dates naturally. For example:

movies$release_date[1:5]
## [1] "2009-12-10" "2007-05-19" "2015-10-26" "2012-07-16" "2012-03-07"
movies %>%
  ggplot(aes(x = release_date, y = budget)) + 
    geom_point() + theme_bw()
## Warning: Removed 653 rows containing missing values (geom_point).
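
When dates are not recognized automatically, they arrive as plain text and comparisons go wrong. As a rough sketch with made-up month/day/year strings (raw and parsed are hypothetical names), lubridate's mdy() turns the text into real Date values:

```r
library(lubridate)

# hypothetical dates stored as month/day/year text
raw <- c("12/10/2009", "01/15/2015")

# compared as text, the 2015 date sorts first because "01" comes before "12"
raw[2] < raw[1]
## [1] TRUE

# mdy() parses month-day-year text into Date values
parsed <- mdy(raw)
class(parsed)
## [1] "Date"

# compared as dates, 2015 is correctly later than 2009
parsed[2] > parsed[1]
## [1] TRUE
```

lubridate also provides ymd() and dmy() for text written in other orders.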

However, what if you want to do more complex things with dates? For example, what if you want to pull out the day, month, or year of a date? This would be useful if you wanted to, for example, select every movie released in 2020, or every movie released in October.

Thankfully, the lubridate package (loaded above) has functions for this:

# get the first entry of release date
example <- movies$release_date[1]
year(example)
## [1] 2009
month(example)
## [1] 12
day(example)
## [1] 10

Once you’ve extracted a day, month, or year from release_date, you can use it just like any other numeric value. For example, you can filter to a range of years:

# create a column for year
movies %>%
  mutate(year = year(release_date)) %>%
  filter(year %in% 2000:2020) %>%
  ggplot(aes(x = budget, y = revenue)) + 
    geom_point()
## Warning: Removed 800 rows containing missing values (geom_point).
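
The same idea works for months, and the extraction can happen inside filter() without creating a column first. A minimal sketch on a made-up tibble (toy, its titles, and its dates are hypothetical):

```r
library(dplyr)
library(lubridate)

# hypothetical releases, not rows from movies.csv
toy <- tibble(
  title = c("A", "B", "C", "D"),
  release_date = as.Date(c("1999-10-29", "2005-10-07",
                           "2005-06-15", "2012-10-26"))
)

# keep films released in October of any year
october <- toy %>%
  filter(month(release_date) == 10)

# combine conditions: October releases from 2000 onward
october_2000s <- toy %>%
  filter(month(release_date) == 10, year(release_date) >= 2000)

october$title
## [1] "A" "B" "D"
october_2000s$title
## [1] "B" "D"
```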

Exercise

  1. With your group, figure out a way to add up the number of horror movies released in each month of the year. Which month has the most? Use the functions you know so far to calculate these numbers.
month_sum <- movies %>%
  mutate(month = month(release_date),
         horror = str_detect(string = genres, pattern = "Horror")
         ) %>%
  group_by(month) %>%
  summarise(h_count = sum(horror))
## `summarise()` ungrouping output (override with `.groups` argument)
month_sum
## # A tibble: 12 x 2
##    month h_count
##    <dbl>   <int>
##  1     1      49
##  2     2      35
##  3     3      32
##  4     4      38
##  5     5      29
##  6     6      30
##  7     7      36
##  8     8      53
##  9     9      50
## 10    10      76
## 11    11      19
## 12    12      17
  2. Save the values you calculated in question 1 to an object. Use that object to create a plot visualizing your results (a geom_col() might look nice, but feel free to be creative).
month_sum %>% 
  ggplot(aes(x = factor(month), y = h_count)) +
    geom_col() + 
    theme_linedraw()
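
There is more than one way to get these counts. The sketch below compares the group_by() and summarise() approach with a filter() and count() alternative on a made-up tibble (toy and its values are hypothetical). The two agree on the counts, but the second drops months that have no horror movies at all:

```r
library(dplyr)
library(lubridate)
library(stringr)

# hypothetical films, not rows from movies.csv
toy <- tibble(
  release_date = as.Date(c("2001-10-05", "2003-10-31",
                           "2004-02-13", "2010-07-16")),
  genres = c("Horror, Thriller", "Horror", "Comedy, Romance", "Horror, Mystery")
)

# approach 1: flag horror films, then add up the flags within each month
by_sum <- toy %>%
  mutate(month = month(release_date),
         horror = str_detect(genres, "Horror")) %>%
  group_by(month) %>%
  summarise(h_count = sum(horror))

# approach 2: keep only horror films, then count rows per month
by_count <- toy %>%
  filter(str_detect(genres, "Horror")) %>%
  count(month = month(release_date), name = "h_count")

by_sum$h_count    # February appears with a count of 0
## [1] 0 1 2
by_count$h_count  # February is missing entirely
## [1] 1 2
```

For a bar chart over all twelve months, the first approach is usually the safer choice because no month silently disappears.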

4. Word Clouds

Notice the keywords column tells you a little bit about what is featured in each movie:

movies$keywords[1]
## [1] "culture clash, future, space war, space colony, society, space travel, futuristic, romance, space, alien, tribe, alien planet, cgi, marine, soldier, battle, love affair, anti war, power relations, mind and soul, 3d"

What if we were interested in exploring patterns of keywords between different types of movies? I wrote the function count_keywords() below for us to use. Given a dataset df and a number top_n, the function will return the top keywords used in that dataset.

The function is a little more complicated than you might be able to read right now (it uses a class called ‘lists’), but you may be able to figure out what it is doing. Feel free to ask, but don’t worry too much about it!

count_keywords <- function(df, top_n) {

  df %>%
    # create list from keywords, separated by comma
    mutate(keywords_split = strsplit(keywords, ", ")) %>% 
    select(keywords_split) %>% 
    # unnest list into separate rows
    unnest(cols = c(keywords_split)) %>% 
    rename(keyword = keywords_split) %>% 
    group_by(keyword) %>%
    summarise(count = n()) %>% 
    # remove irrelevant keywords
    filter(!(keyword %in% c("", 
                          "duringcreditsstinger", 
                          "aftercreditsstinger"))) %>% 
    ungroup() %>%
    # sort by count
    arrange(desc(count)) %>%
    # take top_n rows by count
    slice(1:top_n)

}

movies %>% 
  # flag Fantasy movies (column named to match the pattern)
  mutate(fantasy = str_detect(genres, "Fantasy")) %>% 
  filter(fantasy) %>% 
  count_keywords(top_n = 10) %>% 
  ggplot(aes(x = keyword, y = count)) + 
    geom_col() + 
    theme_minimal() + 
    coord_flip()
## `summarise()` ungrouping output (override with `.groups` argument)
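
To see what the list step in a function like count_keywords() is doing, it can help to run the pieces on a tiny example. The sketch below uses a made-up two-row tibble (toy, long, and counts are hypothetical names) and swaps in count() for the group_by() and summarise() pair:

```r
library(dplyr)
library(tidyr)

# hypothetical movies with comma-separated keywords
toy <- tibble(title = c("A", "B"),
              keywords = c("space, alien, future", "alien, romance"))

long <- toy %>%
  # strsplit() returns a list: one character vector of keywords per movie
  mutate(keyword = strsplit(keywords, ", ")) %>%
  select(title, keyword) %>%
  # unnest() expands the list so each movie-keyword pair gets its own row
  unnest(cols = c(keyword))

nrow(long)
## [1] 5

# count how often each keyword appears, most frequent first
counts <- long %>%
  count(keyword, sort = TRUE)

counts$keyword[1]
## [1] "alien"
counts$n[1]
## [1] 2
```

Once the data has one row per keyword, counting keywords is an ordinary grouped summary.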

We can also visualize the counts with the text itself. Word clouds do just that. Load the ggwordcloud library and we’ll try it out:

library(ggwordcloud)

movies %>%
   mutate(scifi = str_detect(genres, "Science Fiction")) %>%
   filter(scifi == TRUE) %>%
   count_keywords(top_n = 50) %>% 
   ggplot(aes(label = keyword, size = count)) +
    geom_text_wordcloud(color = "darkblue") + 
    scale_size_area(max_size = 10)
## `summarise()` ungrouping output (override with `.groups` argument)

Exercise

  1. Think of a research question involving keywords and movie descriptions. For example, did the types of Science Fiction movies being made significantly change between the Atomic Age and today? Create two wordclouds to compare the keywords used in two different types of movies. For example, two different production companies, or the same genre in two different time periods.
library(ggwordcloud)

# word cloud of the top 20 keywords in Disney productions
dm <- movies %>%
  mutate(disney = str_detect(production, "Disney")) %>%
  filter(disney) %>%
  count_keywords(top_n = 20) %>% 
  ggplot(aes(label = keyword, size = count)) +
    geom_text_wordcloud(color = "darkblue") +
    theme_minimal() + 
    scale_size_area(max_size = 7) + 
    labs(title = "Disney Movies")
## `summarise()` ungrouping output (override with `.groups` argument)
# word cloud of the top 20 keywords in Marvel productions
mm <- movies %>%
  mutate(marvel = str_detect(production, "Marvel")) %>%
  filter(marvel) %>%
  count_keywords(top_n = 20) %>% 
  ggplot(aes(label = keyword, size = count)) +
    geom_text_wordcloud(color = "darkred") + 
    theme_minimal() + 
    scale_size_area(max_size = 7) + 
    labs(title = "Marvel Movies")
## `summarise()` ungrouping output (override with `.groups` argument)
dm + mm


  1. The TMDb dataset, created using the TMDb API, comes from Kaggle.