So far, we have mostly worked with nice, clean datasets. However, sometimes we will need to work with our data to extract the information we want. This process is called data cleaning, and it will inevitably be a large part of your project and of any analysis you do in the future.
Text data is one of the most common types of data you will need to work with.
Today, we’ll learn about text data by using a dataset of about 4000 popular movies from The Movie Database (TMDb).1
| Name | Description |
|---|---|
| title | Movie name |
| tagline | Tagline |
| release_date | Release date (MM/DD/YY) |
| genres | Genres (separated by comma) |
| budget | Budget (in dollars) |
| revenue | Revenue (in dollars) |
| runtime | Runtime (in minutes) |
| status | Released, rumored, post-production, etc. |
| keywords | List of keywords |
| vote_average | Average quality vote on TMDb (from 0 to 10) |
| vote_count | Number of votes on TMDb |
| production | Production companies (separated by comma) |
| english | English is spoken language (1) or not (0) |
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## title = col_character(),
## tagline = col_character(),
## release_date = col_date(format = ""),
## genres = col_character(),
## budget = col_double(),
## revenue = col_double(),
## runtime = col_double(),
## status = col_character(),
## keywords = col_character(),
## vote_average = col_double(),
## vote_count = col_double(),
## production = col_character(),
## countries = col_character(),
## english = col_double()
## )
Before we dive in, let’s get accustomed to the data. Looking at the columns in the table above, develop a research question relating an explanatory variable (EV) to a dependent variable (DV). Develop a simple hypothesis and an explanation for why the hypothesis might be true. You might start by thinking about the simple numeric columns in the dataset and whether there could be a relationship between them (budget, revenue, vote_average, etc.).
Next, create a simple plot to test your hypothesis. What do you find?
So far, we have compared data in relatively simple ways - equality with ==, greater than or less than for numbers, etc. However, comparisons can be difficult with text data.
For example, you may simply want to filter the movies dataset to all Action movies. However, look at the genres column:
## [1] "Action, Adventure, Fantasy, Science Fiction"
## [2] "Adventure, Fantasy, Action"
## [3] "Action, Adventure, Crime"
## [4] "Action, Crime, Drama, Thriller"
## [5] "Action, Adventure, Science Fiction"
Here, you can see that filtering to genres == "Action" will not work. In plain words, what do we want to do here?
Well, we don’t care if the genres column is exactly equal to “Action”. Instead, we want to know whether the word “Action” is anywhere in the string. For example:
## [1] "Action, Adventure, Crime"
## [1] "Adventure, Fantasy, Drama"
str_detect() is a function that will check whether or not a string (the genres column) contains a smaller string of our choice (“Action”). Let’s try it:
## [1] TRUE
## [1] FALSE
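The calls that produce the two results above are not shown in this text. A minimal sketch, assuming the two example strings are passed straight to str_detect() (the names genres_a and genres_b are made up for illustration):

```r
library(stringr)

# Two example genre strings, copied from the output above
genres_a <- "Action, Adventure, Crime"
genres_b <- "Adventure, Fantasy, Drama"

# str_detect(string, pattern) is TRUE when the pattern occurs anywhere in the string
has_action_a <- str_detect(genres_a, "Action")
has_action_b <- str_detect(genres_b, "Action")

print(has_action_a)
print(has_action_b)

# Exact equality fails even for the first string, because it contains more than "Action"
print(genres_a == "Action")
```

Note that the match is case sensitive, so the pattern "action" would not be found in either string.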
Using str_detect(), create a column called hasQ that is TRUE if a movie title has a capital Q in the title. Then filter to those movies and save them to an object called q_movies.
The arrange() function will sort a dataset by the values of a column. By default, arrange() will sort in ascending order (smallest to largest). Check the documentation for arrange to find the desc() function, which you can use together with arrange() to sort in descending order. Scroll down to the examples to see how this works with the %>%. Then, sort q_movies in descending order by budget to find the movie with a capital Q in the title that has the largest budget.
## # A tibble: 21 x 15
## title tagline release_date genres budget revenue runtime status keywords
## <chr> <chr> <date> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Quan… For lo… 2008-10-30 Adven… 2.00e8 5.86e8 106 Relea… killing…
## 2 Gala… A come… 1999-12-23 Comed… 4.50e7 9.07e7 102 Relea… space b…
## 3 John… Give a… 2002-02-15 Drama… 3.60e7 1.02e8 116 Relea… father …
## 4 Quee… This t… 2002-02-10 Drama… 3.50e7 4.55e7 101 Relea… queen, …
## 5 The … Think … 1995-02-09 Actio… 3.20e7 1.86e7 107 Relea… gunslin…
## 6 The … <NA> 2002-11-22 Drama… 3.00e7 2.77e7 101 Relea… terror,…
## 7 Supe… Nuclea… 1987-07-23 Actio… 1.70e7 1.93e7 90 Relea… saving …
## 8 The … Our Le… 2006-09-15 Drama 1.50e7 1.23e8 103 Relea… upper c…
## 9 Conf… She's … 2004-02-17 Comedy 1.50e7 2.93e7 89 Relea… rock st…
## 10 Quil… There … 2000-11-22 Drama 1.35e7 7.06e6 124 Relea… asylum,…
## # … with 11 more rows, and 6 more variables: vote_average <dbl>,
## # vote_count <dbl>, production <chr>, countries <chr>, english <dbl>,
## # hasQ <lgl>
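One possible way to get a result like the table above, sketched on a small invented dataset rather than the real movies data (toy_movies, its titles, and its budgets are all hypothetical):

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data standing in for `movies` (titles and budgets are invented)
toy_movies <- tibble(
  title  = c("Quiet Harbor", "The Long Road", "Iraq Diaries", "Queen of Sand"),
  budget = c(5e6, 2e7, 1e6, 4e7)
)

# Flag titles containing a capital Q, then keep only those rows
q_movies <- toy_movies %>%
  mutate(hasQ = str_detect(title, "Q")) %>%
  filter(hasQ)

# desc() reverses the default ascending order of arrange()
q_sorted <- q_movies %>%
  arrange(desc(budget))

print(q_sorted)
```

"Iraq Diaries" is left out because its q is lowercase. On the real data, the same pipeline would start from movies instead of toy_movies.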
Awesome, now let’s try it on our dataset. We will use it to make a new column called action. action will be TRUE if it’s an Action movie and FALSE otherwise.
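The step that creates the action column is not shown in this text. A minimal sketch of what it could look like, using a small stand-in tibble (toy_movies is hypothetical; its genre strings are copied from the examples earlier):

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in for `movies`, with one missing genres value
toy_movies <- tibble(
  genres = c("Action, Adventure, Crime",
             "Adventure, Fantasy, Drama",
             NA)
)

# action is TRUE when "Action" appears anywhere in genres
toy_movies <- toy_movies %>%
  mutate(action = str_detect(genres, "Action"))

print(toy_movies)
```

A missing genres value gives NA rather than FALSE. On the real data, the result would presumably be saved back to movies so that later code can use the action column.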
movies %>%
  ggplot(aes(x = action, y = budget, fill = action)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw() +
  labs(title = "Do action movies have larger budgets?")
## Warning: Removed 653 rows containing non-finite values (stat_boxplot).
movies %>%
  group_by(action) %>%
  summarise(mean_b = mean(budget, na.rm = TRUE),
            median_b = median(budget, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## action mean_b median_b
## <lgl> <dbl> <dbl>
## 1 FALSE 30322312. 20000000
## 2 TRUE 60849401. 45000000
Pick a production company from the production column (you can call View(movies) to look through some examples). Create a column that is TRUE if the film was produced by that company and FALSE otherwise. Then, create a boxplot that compares the budget of films made by that company with all other companies. Describe what you find.
movies %>%
  mutate(disney = str_detect(production, "Walt Disney")) %>%
  ggplot(aes(x = disney, y = budget, fill = disney)) +
  geom_boxplot() +
  theme_bw()
## Warning: Removed 653 rows containing non-finite values (stat_boxplot).
## Alternate way to do this
movies %>%
  # label each film by company; the first matching condition wins
  mutate(company = case_when(str_detect(production, "Walt Disney") ~ "Disney",
                             str_detect(production, "Marvel") ~ "Marvel",
                             TRUE ~ "Other")) %>%
  ggplot(aes(x = company, y = budget, fill = company)) +
  geom_boxplot()
## Warning: Removed 653 rows containing non-finite values (stat_boxplot).
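To see how case_when() picks a label, here is a small sketch on a few made-up production strings (the production vector below is invented for illustration and is not taken from the dataset):

```r
library(dplyr)
library(stringr)

# Invented example strings, including one that matches both patterns and one NA
production <- c("Walt Disney Pictures, Marvel Studios",
                "Marvel Studios",
                "Paramount Pictures",
                NA)

# Conditions are checked in order; the first one that is TRUE decides the label
company <- case_when(
  str_detect(production, "Walt Disney") ~ "Disney",
  str_detect(production, "Marvel") ~ "Marvel",
  TRUE ~ "Other"
)

print(company)
```

Because the conditions are checked top to bottom, a film listing both companies is labelled "Disney". A missing production value matches neither pattern and falls through to "Other".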
R also has built-in classes for dates, which allow us to manipulate and compare dates.
Often no extra work is needed: release_date was already read in as a date (see the column specification above), so we can compare and plot dates directly. For example:
## [1] "2009-12-10" "2007-05-19" "2015-10-26" "2012-07-16" "2012-03-07"
## Warning: Removed 653 rows containing missing values (geom_point).
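As a small illustration of comparing dates directly, here is a sketch in base R using the five dates printed above (the variable names are made up for this example):

```r
# Two release dates taken from the output above
d1 <- as.Date("2009-12-10")
d2 <- as.Date("2007-05-19")

# Dates compare like numbers: a later date is "greater"
later <- d1 > d2

# Subtracting dates gives the gap between them in days
gap_days <- as.numeric(d1 - d2)

# Filtering a vector of dates works the same way
dates <- as.Date(c("2009-12-10", "2007-05-19", "2015-10-26",
                   "2012-07-16", "2012-03-07"))
after_2010 <- dates[dates >= as.Date("2010-01-01")]

print(later)
print(after_2010)
```

The same comparison can be used inside filter() on the release_date column.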
However, what if you want to do more complex things with dates? For example, what if you want to pull out the day, month, or year of a date? This would be useful if you wanted to, for example, choose every movie released in 2020 or in October.
Thankfully, the lubridate package (installed with the tidyverse and loaded above) has functions for this:
## [1] 2009
## [1] 12
## [1] 10
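The calls behind the three results above are not shown in this text. A minimal sketch, assuming they are lubridate's year(), month(), and day() applied to the first release date printed earlier:

```r
library(lubridate)

# First release date from the earlier output
d <- as.Date("2009-12-10")

y  <- year(d)   # the year as a number
m  <- month(d)  # the month as a number, 1 to 12
dd <- day(d)    # the day of the month

print(c(y, m, dd))
```

Each function also works on a whole column of dates, which is how they are used inside mutate().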
Once you’ve created a day, month, or year from release_date, you can use it just like you’ve used other values. For example, you can filter by numeric values by year:
# create a column for year
movies %>%
  mutate(year = year(release_date)) %>%
  filter(year %in% 2000:2020) %>%
  ggplot(aes(x = budget, y = revenue)) +
  geom_point()
## Warning: Removed 800 rows containing missing values (geom_point).
month_sum <- movies %>%
  mutate(month = month(release_date),
         horror = str_detect(string = genres, pattern = "Horror")) %>%
  group_by(month) %>%
  summarise(h_count = sum(horror))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## month h_count
## <dbl> <int>
## 1 1 49
## 2 2 35
## 3 3 32
## 4 4 38
## 5 5 29
## 6 6 30
## 7 7 36
## 8 8 53
## 9 9 50
## 10 10 76
## 11 11 19
## 12 12 17
Create a plot of the horror movie counts by month in month_sum (geom_col() might look nice, but feel free to be creative).
Notice the keywords column tells you a little bit about what is featured in each movie:
## [1] "culture clash, future, space war, space colony, society, space travel, futuristic, romance, space, alien, tribe, alien planet, cgi, marine, soldier, battle, love affair, anti war, power relations, mind and soul, 3d"
What if we were interested in exploring patterns of keywords between different types of movies? I wrote the function count_keywords() below for us to use. Given a dataset df and a number top_n, the function will return the top_n most frequent keywords in that dataset.
The function is a little more complicated than what you might be able to read right now (it uses a data type called a list), but you may be able to figure out what it is doing. Feel free to ask, but don’t worry too much about it!
count_keywords <- function(df, top_n) {
  df %>%
    # create list from keywords, separated by comma
    mutate(keywords_split = strsplit(keywords, ", ")) %>%
    select(keywords_split) %>%
    # unnest list into separate rows
    unnest(cols = c(keywords_split)) %>%
    rename(keyword = keywords_split) %>%
    group_by(keyword) %>%
    summarise(count = n()) %>%
    # remove irrelevant keywords
    filter(!(keyword %in% c("",
                            "duringcreditsstinger",
                            "aftercreditsstinger"))) %>%
    ungroup() %>%
    # sort by count
    arrange(desc(count)) %>%
    # take top_n rows by count
    slice(1:top_n)
}
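To see what the split and unnest steps inside count_keywords() do, here is a small sketch on two invented keyword strings (toy and its contents are hypothetical, and count() is used here as a shortcut for the group_by() and summarise() steps):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two invented rows of comma-separated keywords
toy <- tibble(keywords = c("space, alien, future", "alien, romance"))

# strsplit() turns each string into a vector of keywords, stored as a list-column
split_toy <- toy %>%
  mutate(keywords_split = strsplit(keywords, ", "))

# unnest() expands the list-column so each keyword gets its own row
long_toy <- split_toy %>%
  select(keywords_split) %>%
  unnest(cols = c(keywords_split))

# Count how often each keyword appears, most frequent first
counts <- long_toy %>%
  count(keywords_split, sort = TRUE)

print(counts)
```

Once every keyword sits on its own row, counting them is an ordinary grouped summary, which is what the rest of count_keywords() does.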
movies %>%
  # flag Fantasy movies, then count their keywords
  mutate(fantasy = str_detect(genres, "Fantasy")) %>%
  filter(fantasy == TRUE) %>%
  count_keywords(top_n = 10) %>%
  ggplot(aes(x = keyword, y = count)) +
  geom_col() +
  theme_minimal() +
  coord_flip()
## `summarise()` ungrouping output (override with `.groups` argument)
We can also visualize the counts with the text itself. Word clouds do just that. Load the ggwordcloud library and we’ll try it out:
library(ggwordcloud)
movies %>%
  mutate(scifi = str_detect(genres, "Science Fiction")) %>%
  filter(scifi == TRUE) %>%
  count_keywords(top_n = 50) %>%
  ggplot(aes(label = keyword, size = count)) +
  geom_text_wordcloud(color = "darkblue") +
  scale_size_area(max_size = 10)
## `summarise()` ungrouping output (override with `.groups` argument)
library(ggwordcloud)
dm <- movies %>%
  # keep films whose production column mentions Disney
  mutate(disney = str_detect(production, "Disney")) %>%
  filter(disney == TRUE) %>%
  count_keywords(top_n = 20) %>%
  ggplot(aes(label = keyword, size = count)) +
  geom_text_wordcloud(color = "darkblue") +
  theme_minimal() +
  scale_size_area(max_size = 7) +
  labs(title = "Disney Movies")
## `summarise()` ungrouping output (override with `.groups` argument)
mm <- movies %>%
  # keep films whose production column mentions Marvel
  mutate(marvel = str_detect(production, "Marvel")) %>%
  filter(marvel == TRUE) %>%
  count_keywords(top_n = 20) %>%
  ggplot(aes(label = keyword, size = count)) +
  geom_text_wordcloud(color = "darkred") +
  theme_minimal() +
  scale_size_area(max_size = 7) +
  labs(title = "Marvel Movies")
## `summarise()` ungrouping output (override with `.groups` argument)