(Exploring the Moives Dataset with rstudio/gt package)

Tables can be an effective way of communicating data. Though not as powerful in telling stories as charts, by cramming a lot of numbers into limited space, it can provide readers with accurate and potentially useful information which readers can interpret in their own ways.

I’ve come across this new R package gt(Easily generate information-rich, publication-quality tables from R) and decided to give it a try.

With the gt package, anyone can make wonderful-looking tables using the R programming language.

Admittedly, the tables in the documents might not be the optimal way of presentation. They serve as a demonstration of what gt can do, and maybe also helpful enough for analyst in constructing their stories about this dataset.

(Dataset Source: The Movies Dataset on Kaggle.)

# install.packages("packrat")
# install.packages("pacman")
# install.packages("devtools")
# install.packages("tidyverse")
# devtools::install_github("rstudio/gt")
library(gt)
pacman::p_load("tidyverse")
pacman::p_load("data.table")
pacman::p_load("jsonlite")
pacman::p_load("lubridate")

Read and clean the Metadata

Some of the fields read in here are not used in the later sections. However, they are kept as a reminder of the potentially useful information.

metadata <- fread("data/movies_metadata.csv", select=c('adult', 'genres', 'release_date', 'original_language', 'original_title', 'id', 'imdb_id'), fill=T)
metadata <- metadata[!is.na(as.integer(id)) & (original_language == 'en') & (adult == "False"), ]
## Warning in eval(.massagei(isub), x, parent.frame()): NAs introduced by
## coercion
metadata[, id := as.integer(id)]
# Replace singles quote in 'genre' so 'fromJSON' function can parse it
metadata[, genre := gsub("\'","\"", metadata$genre)]
# Remove redundant fields
metadata[, c('original_language', 'adult', "imdb_id") := NULL]

Parse Genres

Genre information is stored as JSON texts. We need to parse it and do some transformation.

genres <- metadata[, unlist(lapply(genre, fromJSON), recursive=F)['name'], by=id]
sorted.genres <- genres[, .N, by=name][order(-N)]
sorted.genres[1:20]
##                name     N
##  1:           Drama 13222
##  2:          Comedy  9311
##  3:        Thriller  5983
##  4:         Romance  4816
##  5:          Action  4728
##  6:          Horror  3790
##  7:     Documentary  3369
##  8:           Crime  3150
##  9:       Adventure  2675
## 10: Science Fiction  2447
## 11:          Family  2273
## 12:         Mystery  1918
## 13:         Fantasy  1589
## 14:           Music  1390
## 15:       Animation  1246
## 16:             War   886
## 17:         History   883
## 18:         Western   843
## 19:        TV Movie   713
## 20:         Foreign   688

Assuming genres are assigned in order of representativeness, taking at most three top genres for each movie. Furthermore, only the top 6 genres are considered.

# Dummy variable for the later dcast operation
genres[, dummy := 1]
# Only use the first 3 assigned genres 
encoded.genres <- dcast(na.omit(genres[,.SD[1:3], by=id])[name %in% sorted.genres[1:6, name]], 
                        id ~  name, value.var='dummy', fill=0)
data.w.genre <- merge(encoded.genres,
      metadata[, .(id, original_title, release_date)], 
      all.x=T, by="id")

Genre Statistics

Here I used a for loop to collect and calculate statstics genre by genre. I later figured out the better way to do this is probably using a melt call followed by several ratio column calculations. The later sections will use the melt approach instead.

i <- 0
tmp.raw <- list() 
tmp.ratio <- list()
for(name in sorted.genres[1:6, name]){
    i <- i + 1
    dt.tmp <- data.w.genre[get(name) == 1, lapply(.SD, sum), .SDcols=sorted.genres[1:6, name]]
    tmp.ratio[[i]] <- copy(dt.tmp)
    dt.tmp[, genre := name]
    tmp.raw[[i]] <- dt.tmp
    tmp.ratio[[i]] <- tmp.ratio[[i]] / dt.tmp[, get(name)]
    tmp.ratio[[i]][, genre := name]
}
genre.stats <- merge(rbindlist(tmp.raw), rbindlist(tmp.ratio), by="genre", suffixes=c("", ".ratio"), sort=F)
rm(tmp.raw, tmp.ratio, i)

The actual table declaration and rendering part. Unfortunately it requires a lot of manual labeling and assignments. But I think it is verbose and readable enough.

gt_tbl <- gt(data = genre.stats, rowname_col = "genre")
gt_tbl %>%
  tab_header(
    title = "Movie Metadata Stats"
  ) %>%
  tab_spanner(
    label = "Genre Overlappings",
    columns = vars(Drama, Comedy, Thriller, Romance, Action, Horror)
  ) %>%
  tab_spanner(
    label = "Genre Overlappings(%)",
    columns = vars(Drama.ratio, Comedy.ratio, Thriller.ratio, Romance.ratio, Action.ratio, Horror.ratio)
  ) %>%
  fmt_percent(
    columns = vars(Drama.ratio, Comedy.ratio, Thriller.ratio, Romance.ratio, Action.ratio, Horror.ratio),
    decimals = 1,
    drop_trailing_zeros = F
  ) %>%   
  tab_source_note(
    source_note = md("Source:  [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
  ) %>%
  cols_label(
    Drama.ratio = "Drama",
    Comedy.ratio = "Comedy",
    Thriller.ratio = "Thriller",
    Romance.ratio = "Romance", 
    Action.ratio = "Action", 
    Horror.ratio = "Horror"
  ) %>%   
  tab_stubhead_label(label = "Genre") %>%
  tab_style(
    style = cells_styles(
      text_size = px(12)),
    locations = list(
        cells_column_labels(columns = 1:12), 
        cells_stub(),
        cells_data()
    )) %>%   
  tab_style(
    style = cells_styles(
      text_decorate = "underline",
      text_weight = "bold"),
    locations = list(
        cells_data(columns=c(1), rows=c(1)),
        cells_data(columns=c(2), rows=c(2)),
        cells_data(columns=c(3), rows=c(3)),
        cells_data(columns=c(4), rows=c(4)),
        cells_data(columns=c(5), rows=c(5)),
        cells_data(columns=c(6), rows=c(6))
    )) %>%   
  tab_style(
    style = cells_styles(
      text_color="lightgrey"),
    locations = list(
        cells_data(columns=c(7), rows=c(1)),
        cells_data(columns=c(8), rows=c(2)),
        cells_data(columns=c(9), rows=c(3)),
        cells_data(columns=c(10), rows=c(4)),
        cells_data(columns=c(11), rows=c(5)),
        cells_data(columns=c(12), rows=c(6))
    ))
Movie Metadata Stats
Genre Genre Overlappings Genre Overlappings(%)
Drama Comedy Thriller Romance Action Horror Drama Comedy Thriller Romance Action Horror
Drama 13008 2639 1906 2656 1441 550 100.0% 20.3% 14.7% 20.4% 11.1% 4.2%
Comedy 2639 9175 284 2099 868 509 28.8% 100.0% 3.1% 22.9% 9.5% 5.5%
Thriller 1906 284 4986 163 1272 1340 38.2% 5.7% 100.0% 3.3% 25.5% 26.9%
Romance 2656 2099 163 4325 150 58 61.4% 48.5% 3.8% 100.0% 3.5% 1.3%
Action 1441 868 1272 150 4657 318 30.9% 18.6% 27.3% 3.2% 100.0% 6.8%
Horror 550 509 1340 58 318 3691 14.9% 13.8% 36.3% 1.6% 8.6% 100.0%
Source: "The Movie Dataset" on Kaggle

How to read the table: For example, Drama (the first row) has 13008 movies (the first column), and 2639 of them are also under Comedy, which is 20.3% of the 13008 movies.

The diagonal elements of ‘Genre Overlappings’ section are emphasized because they represent the movie count of the respective genre. Similarily, the diagonal elements of ‘Genre Overlappings(%)’ section are somewhat hiddened since they convey almost no useful information.

Reading Ratings

Now we turn our attention to movie ratings (from movielens). We only consider the average rating of a movie (which of course is not perfect).

ratings <- fread("data/ratings.csv", select=c('movieId', 'rating'))
ratings.stats <- ratings[, .(n_ratings=.N, avg_rating=mean(rating)), by=movieId]
rm(ratings)
head(ratings.stats)
##    movieId n_ratings avg_rating
## 1:     110     66512   4.016057
## 2:     147      4967   3.595933
## 3:     858     57070   4.339811
## 4:    1221     36679   4.263475
## 5:    1246     25752   3.912803
## 6:    1968     26611   3.827553

Here we remove move with less than 500 ratings.

data.w.genre.ratings <- merge(data.w.genre, ratings.stats, by.y="movieId", by.x="id")
# Only keep moives with >= 500 ratings
data.w.genre.ratings <- data.w.genre.ratings[n_ratings >= 500]
head(data.w.genre.ratings)
##    id Action Comedy Drama Horror Romance Thriller  original_title
## 1:  5      0      1     0      0       0        0      Four Rooms
## 2:  6      1      0     0      0       0        1  Judgment Night
## 3: 11      1      0     0      0       0        0       Star Wars
## 4: 13      0      1     1      0       1        0    Forrest Gump
## 5: 14      0      0     1      0       0        0 American Beauty
## 6: 15      0      0     1      0       0        0    Citizen Kane
##    release_date n_ratings avg_rating
## 1:   1995-12-09     15258   3.079565
## 2:   1993-10-15     27895   3.841764
## 3:   1977-05-25     19475   3.660591
## 4:   1994-07-06      1838   3.326442
## 5:   1999-09-15      6807   3.430807
## 6:   1941-04-30      3125   2.727040

Rating statistics by Genre

Using a melt call followed by a bunch of data.table operations to prepare the metrics:

rating.by.genre <- melt(
    data.w.genre.ratings, id=c("id", "avg_rating", "n_ratings"), measure.vars=c("Action", "Comedy", "Drama", "Horror", "Romance", "Thriller"), variable.name="genre"
    )[value==1][
    , .(n_movies=.N, min_rating=min(avg_rating), avg_rating=mean(avg_rating), med_rating=median(avg_rating), max_rating=max(avg_rating), sd_rating=sd(avg_rating), n_ratings=sum(n_ratings), avg_n_ratings=mean(n_ratings), max_n_ratings=max(n_ratings), min_n_ratings=min(n_ratings)), by=genre]
head(rating.by.genre)
##       genre n_movies min_rating avg_rating med_rating max_rating sd_rating
## 1:   Action      300   1.417214   3.338321   3.399981   4.255074 0.5158960
## 2:   Comedy      370   1.620400   3.361114   3.429385   4.339811 0.4851807
## 3:    Drama      658   1.271583   3.350580   3.403381   4.429015 0.5023833
## 4:   Horror      141   1.607782   3.381620   3.487408   4.124958 0.5152463
## 5:  Romance      206   1.271583   3.365214   3.401239   4.339811 0.4942946
## 6: Thriller      303   1.620400   3.324058   3.389824   4.429015 0.5150434
##    n_ratings avg_n_ratings max_n_ratings min_n_ratings
## 1:   1830601      6102.003         87901           502
## 2:   2107525      5696.014         60024           501
## 3:   4170892      6338.742         91082           501
## 4:    721031      5113.695         50375           519
## 5:   1187182      5763.019         57070           501
## 6:   2008452      6628.554         91082           508

Follows roughly the same recipe (unfortunately I did not find a reasonable way to make use of row group feature of gt for this dataset):

gt_tbl <- gt(data = rating.by.genre, rowname_col = "genre")
gt_tbl %>%
  tab_header(
    title = "Movie Average Ratings by Genre",
    subtitle = "with # of Ratings >= 500"
  ) %>%
  tab_spanner(
    label = "Average Rating (1-5)",
    columns = vars(min_rating, avg_rating, med_rating, max_rating, sd_rating)
  ) %>%
  fmt_number(
    columns = vars(min_rating, avg_rating, med_rating, max_rating, sd_rating),
    decimals = 2,
    drop_trailing_zeros = F
  ) %>%       
  tab_spanner(
    label = "# of Ratings",
    columns = vars(n_ratings, min_n_ratings, avg_n_ratings, max_n_ratings)
  ) %>%
  fmt_number(
    columns = vars(n_ratings, min_n_ratings, avg_n_ratings, max_n_ratings),
    decimals = 0,
    drop_trailing_zeros = F
  ) %>%    
  tab_source_note(
    source_note = md("Source:  [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
  ) %>%
  cols_label(
    min_rating = "Min", 
    avg_rating = "Avg", 
    med_rating = "Med",
    max_rating = "Max",
    n_ratings = "Total", 
    min_n_ratings = "Min", 
    avg_n_ratings = "Avg",
    max_n_ratings = "Max",
    n_movies="Movies",
    sd_rating="Stdev"
  ) %>%   
  tab_stubhead_label(label = "Genre") %>%
  tab_style(
    style = cells_styles(
      text_size = px(14)),
    locations = list(
        cells_column_labels(columns = 1:10), 
        cells_stub(),
        cells_data()
    ))
Movie Average Ratings by Genre
with # of Ratings >= 500
Genre Movies Average Rating (1-5) # of Ratings
Min Avg Med Max Stdev Total Avg Max Min
Action 300 1.42 3.34 3.40 4.26 0.52 1,830,601 6,102 87,901 502
Comedy 370 1.62 3.36 3.43 4.34 0.49 2,107,525 5,696 60,024 501
Drama 658 1.27 3.35 3.40 4.43 0.50 4,170,892 6,339 91,082 501
Horror 141 1.61 3.38 3.49 4.12 0.52 721,031 5,114 50,375 519
Romance 206 1.27 3.37 3.40 4.34 0.49 1,187,182 5,763 57,070 501
Thriller 303 1.62 3.32 3.39 4.43 0.52 2,008,452 6,629 91,082 508
Source: "The Movie Dataset" on Kaggle

Suprisingly, the distribution of ratings are quite similar across all genres. Maybe movielens has done some normalization on the ratings?

Rating statistics by Genre and Year

We’d also like to know if the distributions of ratings change over time. First we plot the histogram of the years in which the movies were released. This is a case where a histogram is far more readable than a table.

data.w.genre.ratings[,release_year:=year(ymd(release_date))]
ggplot(data.w.genre.ratings[release_year >= 1950][order(release_year),.N, by=release_year], aes(x=release_year, y=N)) + 
    geom_bar(stat="identity") + ggtitle("# of Movie by Year") + theme_bw() + scale_x_continuous(breaks=seq(1950, 2020, 10))

Now we only take movies released after 1979 and before 2010 into account, and put them in to buckets each representing a decade.

data.w.genre.ratings <- data.w.genre.ratings[(release_year >= 1980) & (release_year < 2010)]
data.w.genre.ratings[, release_decade := (release_year %/% 10) * 10 ]

The same old melt trick:

rating.by.genre.year <- melt(
    data.w.genre.ratings, id=c("id", "avg_rating", "n_ratings", "release_decade"), measure.vars=c("Action", "Comedy", "Drama", "Horror", "Romance", "Thriller"), variable.name="genre"
    )[value==1][
    , .(n_movies=.N, min_rating=min(avg_rating), avg_rating=mean(avg_rating), med_rating=median(avg_rating), max_rating=max(avg_rating), sd_rating=sd(avg_rating), n_ratings=sum(n_ratings), avg_n_ratings=mean(n_ratings), max_n_ratings=max(n_ratings), min_n_ratings=min(n_ratings)), by=.(genre, release_decade)]

Even more manual labeling and a dcast call to handle the extra decade dimension:

gt_tbl <- gt(
    data = dcast(rating.by.genre.year, genre ~ release_decade, value.var = c("avg_rating", "sd_rating", "avg_n_ratings", "n_movies"))[, .(genre,
        n_movies_1980, avg_rating_1980, sd_rating_1980, avg_n_ratings_1980, 
        n_movies_1990, avg_rating_1990, sd_rating_1990, avg_n_ratings_1990,
        n_movies_2000, avg_rating_2000, sd_rating_2000, avg_n_ratings_2000
    )], 
    rowname_col = "genre")
gt_tbl %>%
  tab_header(
    title = "Movie Average Ratings by Genre & Year",
    subtitle = "with # of Ratings >= 500"
  ) %>%
  tab_spanner(
    label = "1980s",
    columns = vars(n_movies_1980, avg_rating_1980, sd_rating_1980, avg_n_ratings_1980)
  ) %>%
  tab_spanner(
    label = "1990s",
    columns = vars(n_movies_1990, avg_rating_1990, sd_rating_1990, avg_n_ratings_1990)
  ) %>%
  tab_spanner(
    label = "2000s",
    columns = vars(n_movies_2000, avg_rating_2000, sd_rating_2000, avg_n_ratings_2000)
  ) %>%
  fmt_number(
    columns = vars(avg_rating_1980, sd_rating_1980, avg_rating_1990, sd_rating_1990, avg_rating_2000, sd_rating_2000),
    decimals = 2,
    drop_trailing_zeros = F
  ) %>%       
  fmt_number(
    columns = vars(avg_n_ratings_1980, avg_n_ratings_1990, avg_n_ratings_2000, n_movies_1980, n_movies_1990, n_movies_2000),
    decimals = 0,
    drop_trailing_zeros = F
  ) %>%    
  tab_source_note(
    source_note = md("Source:  [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
  ) %>%
  tab_footnote(
    footnote = "#: Number of movies",
    cells_column_labels(columns = c(1, 5, 9))
  ) %>%
  tab_footnote(
    footnote = "Avg #: Average number of ratings.",
    cells_column_labels(columns = c(4, 8, 12))
  ) %>%    
  cols_label(
    n_movies_1980 = "#",
    n_movies_1990 = "#",
    n_movies_2000 = "#",
    avg_rating_1980 = "Avg", 
    avg_rating_1990 = "Avg", 
    avg_rating_2000 = "Avg", 
    sd_rating_1980 = "Stdev", 
    sd_rating_1990 = "Stdev",
    sd_rating_2000 = "Stdev",
    avg_n_ratings_1980 = "Avg #", 
    avg_n_ratings_1990 = "Avg #", 
    avg_n_ratings_2000 = "Avg #"
  ) %>%   
  tab_stubhead_label(label = "Genre") %>%
  tab_style(
    style = cells_styles(
      text_size = px(14)),
    locations = list(
        cells_column_labels(columns = 1:12), 
        cells_stub(),
        cells_data()
    ))
Movie Average Ratings by Genre & Year
with # of Ratings >= 500
Genre 1980s 1990s 2000s
#1 Avg Stdev Avg #2 #1 Avg Stdev Avg #2 #1 Avg Stdev Avg #2
Action 43 3.36 0.49 5,755 82 3.34 0.53 5,202 107 3.31 0.53 6,535
Comedy 58 3.43 0.42 6,927 95 3.37 0.51 6,408 121 3.36 0.49 6,114
Drama 66 3.36 0.42 6,973 163 3.30 0.48 6,930 249 3.36 0.51 6,246
Horror 20 3.42 0.38 4,595 26 3.45 0.48 4,489 41 3.30 0.57 6,711
Romance 15 3.46 0.34 6,689 56 3.26 0.53 7,952 64 3.37 0.48 5,058
Thriller 25 3.34 0.41 6,843 83 3.35 0.51 6,268 121 3.31 0.53 6,799
Source: "The Movie Dataset" on Kaggle
1 #: Number of movies
2 Avg #: Average number of ratings.

It appears the variances of ratings in 1980s are lower than in 1990s and 2000s. (A boxplot or scatterplot might be more appropriate here.)

Acknowledgements