Movie Genre Overlappings, Ratings by Genre and Year

(Exploring the Moives Dataset with rstudio/gt package)

Tables can be an effective way of communicating data. Though not as powerful in telling stories as charts, by cramming a lot of numbers into limited space, it can provide readers with accurate and potentially useful information which readers can interpret in their own ways.

I’ve come across this new R package gt(Easily generate information-rich, publication-quality tables from R) and decided to give it a try.

With the gt package, anyone can make wonderful-looking tables using the R programming language.

Admittedly, the tables in the documents might not be the optimal way of presentation. They serve as a demonstration of what gt can do, and maybe also helpful enough for analyst in constructing their stories about this dataset.

(Dataset Source: The Movies Dataset on Kaggle.)

# install.packages("packrat")
# install.packages("pacman")
# install.packages("devtools")
# install.packages("tidyverse")
# devtools::install_github("rstudio/gt")
library(gt)
pacman::p_load("tidyverse")
pacman::p_load("data.table")
pacman::p_load("jsonlite")
pacman::p_load("lubridate")

Read and clean the Metadata

Some of the fields read in here are not used in the later sections. However, they are kept as a reminder of the potentially useful information.

metadata <- fread("data/movies_metadata.csv", select=c('adult', 'genres', 'release_date', 'original_language', 'original_title', 'id', 'imdb_id'), fill=T)
metadata <- metadata[!is.na(as.integer(id)) & (original_language == 'en') & (adult == "False"), ]

## Warning in eval(.massagei(isub), x, parent.frame()): NAs introduced by
## coercion

metadata[, id := as.integer(id)]
# Replace singles quote in 'genre' so 'fromJSON' function can parse it
metadata[, genre := gsub("\'","\"", metadata$genre)]
# Remove redundant fields
metadata[, c('original_language', 'adult', "imdb_id") := NULL]

Parse Genres

Genre information is stored as JSON texts. We need to parse it and do some transformation.

genres <- metadata[, unlist(lapply(genre, fromJSON), recursive=F)['name'], by=id]
sorted.genres <- genres[, .N, by=name][order(-N)]
sorted.genres[1:20]

##                name     N
##  1:           Drama 13222
##  2:          Comedy  9311
##  3:        Thriller  5983
##  4:         Romance  4816
##  5:          Action  4728
##  6:          Horror  3790
##  7:     Documentary  3369
##  8:           Crime  3150
##  9:       Adventure  2675
## 10: Science Fiction  2447
## 11:          Family  2273
## 12:         Mystery  1918
## 13:         Fantasy  1589
## 14:           Music  1390
## 15:       Animation  1246
## 16:             War   886
## 17:         History   883
## 18:         Western   843
## 19:        TV Movie   713
## 20:         Foreign   688

Assuming genres are assigned in order of representativeness, taking at most three top genres for each movie. Furthermore, only the top 6 genres are considered.

# Dummy variable for the later dcast operation
genres[, dummy := 1]
# Only use the first 3 assigned genres 
encoded.genres <- dcast(na.omit(genres[,.SD[1:3], by=id])[name %in% sorted.genres[1:6, name]], 
                        id ~  name, value.var='dummy', fill=0)
data.w.genre <- merge(encoded.genres,
      metadata[, .(id, original_title, release_date)], 
      all.x=T, by="id")

Genre Statistics

Here I used a for loop to collect and calculate statstics genre by genre. I later figured out the better way to do this is probably using a melt call followed by several ratio column calculations. The later sections will use the melt approach instead.

i <- 0
tmp.raw <- list() 
tmp.ratio <- list()
for(name in sorted.genres[1:6, name]){
    i <- i + 1
    dt.tmp <- data.w.genre[get(name) == 1, lapply(.SD, sum), .SDcols=sorted.genres[1:6, name]]
    tmp.ratio[[i]] <- copy(dt.tmp)
    dt.tmp[, genre := name]
    tmp.raw[[i]] <- dt.tmp
    tmp.ratio[[i]] <- tmp.ratio[[i]] / dt.tmp[, get(name)]
    tmp.ratio[[i]][, genre := name]
}
genre.stats <- merge(rbindlist(tmp.raw), rbindlist(tmp.ratio), by="genre", suffixes=c("", ".ratio"), sort=F)
rm(tmp.raw, tmp.ratio, i)

The actual table declaration and rendering part. Unfortunately it requires a lot of manual labeling and assignments. But I think it is verbose and readable enough.

gt_tbl <- gt(data = genre.stats, rowname_col = "genre")
gt_tbl %>%
  tab_header(
    title = "Movie Metadata Stats"
  ) %>%
  tab_spanner(
    label = "Genre Overlappings",
    columns = vars(Drama, Comedy, Thriller, Romance, Action, Horror)
  ) %>%
  tab_spanner(
    label = "Genre Overlappings(%)",
    columns = vars(Drama.ratio, Comedy.ratio, Thriller.ratio, Romance.ratio, Action.ratio, Horror.ratio)
  ) %>%
  fmt_percent(
    columns = vars(Drama.ratio, Comedy.ratio, Thriller.ratio, Romance.ratio, Action.ratio, Horror.ratio),
    decimals = 1,
    drop_trailing_zeros = F
  ) %>%   
  tab_source_note(
    source_note = md("Source:  [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
  ) %>%
  cols_label(
    Drama.ratio = "Drama",
    Comedy.ratio = "Comedy",
    Thriller.ratio = "Thriller",
    Romance.ratio = "Romance", 
    Action.ratio = "Action", 
    Horror.ratio = "Horror"
  ) %>%   
  tab_stubhead_label(label = "Genre") %>%
  tab_style(
    style = cells_styles(
      text_size = px(12)),
    locations = list(
        cells_column_labels(columns = 1:12), 
        cells_stub(),
        cells_data()
    )) %>%   
  tab_style(
    style = cells_styles(
      text_decorate = "underline",
      text_weight = "bold"),
    locations = list(
        cells_data(columns=c(1), rows=c(1)),
        cells_data(columns=c(2), rows=c(2)),
        cells_data(columns=c(3), rows=c(3)),
        cells_data(columns=c(4), rows=c(4)),
        cells_data(columns=c(5), rows=c(5)),
        cells_data(columns=c(6), rows=c(6))
    )) %>%   
  tab_style(
    style = cells_styles(
      text_color="lightgrey"),
    locations = list(
        cells_data(columns=c(7), rows=c(1)),
        cells_data(columns=c(8), rows=c(2)),
        cells_data(columns=c(9), rows=c(3)),
        cells_data(columns=c(10), rows=c(4)),
        cells_data(columns=c(11), rows=c(5)),
        cells_data(columns=c(12), rows=c(6))
    ))

Movie Metadata Stats

Genre	Genre Overlappings						Genre Overlappings(%)
Genre	Drama	Comedy	Thriller	Romance	Action	Horror	Drama	Comedy	Thriller	Romance	Action	Horror
Drama	13008	2639	1906	2656	1441	550	100.0%	20.3%	14.7%	20.4%	11.1%	4.2%
Comedy	2639	9175	284	2099	868	509	28.8%	100.0%	3.1%	22.9%	9.5%	5.5%
Thriller	1906	284	4986	163	1272	1340	38.2%	5.7%	100.0%	3.3%	25.5%	26.9%
Romance	2656	2099	163	4325	150	58	61.4%	48.5%	3.8%	100.0%	3.5%	1.3%
Action	1441	868	1272	150	4657	318	30.9%	18.6%	27.3%	3.2%	100.0%	6.8%
Horror	550	509	1340	58	318	3691	14.9%	13.8%	36.3%	1.6%	8.6%	100.0%
Source: "The Movie Dataset" on Kaggle

How to read the table: For example, Drama (the first row) has 13008 movies (the first column), and 2639 of them are also under Comedy, which is 20.3% of the 13008 movies.

The diagonal elements of ‘Genre Overlappings’ section are emphasized because they represent the movie count of the respective genre. Similarily, the diagonal elements of ‘Genre Overlappings(%)’ section are somewhat hiddened since they convey almost no useful information.

Reading Ratings

Now we turn our attention to movie ratings (from movielens). We only consider the average rating of a movie (which of course is not perfect).

ratings <- fread("data/ratings.csv", select=c('movieId', 'rating'))
ratings.stats <- ratings[, .(n_ratings=.N, avg_rating=mean(rating)), by=movieId]
rm(ratings)
head(ratings.stats)

##    movieId n_ratings avg_rating
## 1:     110     66512   4.016057
## 2:     147      4967   3.595933
## 3:     858     57070   4.339811
## 4:    1221     36679   4.263475
## 5:    1246     25752   3.912803
## 6:    1968     26611   3.827553

Here we remove move with less than 500 ratings.

data.w.genre.ratings <- merge(data.w.genre, ratings.stats, by.y="movieId", by.x="id")
# Only keep moives with >= 500 ratings
data.w.genre.ratings <- data.w.genre.ratings[n_ratings >= 500]
head(data.w.genre.ratings)

##    id Action Comedy Drama Horror Romance Thriller  original_title
## 1:  5      0      1     0      0       0        0      Four Rooms
## 2:  6      1      0     0      0       0        1  Judgment Night
## 3: 11      1      0     0      0       0        0       Star Wars
## 4: 13      0      1     1      0       1        0    Forrest Gump
## 5: 14      0      0     1      0       0        0 American Beauty
## 6: 15      0      0     1      0       0        0    Citizen Kane
##    release_date n_ratings avg_rating
## 1:   1995-12-09     15258   3.079565
## 2:   1993-10-15     27895   3.841764
## 3:   1977-05-25     19475   3.660591
## 4:   1994-07-06      1838   3.326442
## 5:   1999-09-15      6807   3.430807
## 6:   1941-04-30      3125   2.727040

Rating statistics by Genre

Using a melt call followed by a bunch of data.table operations to prepare the metrics:

rating.by.genre <- melt(
    data.w.genre.ratings, id=c("id", "avg_rating", "n_ratings"), measure.vars=c("Action", "Comedy", "Drama", "Horror", "Romance", "Thriller"), variable.name="genre"
    )[value==1][
    , .(n_movies=.N, min_rating=min(avg_rating), avg_rating=mean(avg_rating), med_rating=median(avg_rating), max_rating=max(avg_rating), sd_rating=sd(avg_rating), n_ratings=sum(n_ratings), avg_n_ratings=mean(n_ratings), max_n_ratings=max(n_ratings), min_n_ratings=min(n_ratings)), by=genre]
head(rating.by.genre)

##       genre n_movies min_rating avg_rating med_rating max_rating sd_rating
## 1:   Action      300   1.417214   3.338321   3.399981   4.255074 0.5158960
## 2:   Comedy      370   1.620400   3.361114   3.429385   4.339811 0.4851807
## 3:    Drama      658   1.271583   3.350580   3.403381   4.429015 0.5023833
## 4:   Horror      141   1.607782   3.381620   3.487408   4.124958 0.5152463
## 5:  Romance      206   1.271583   3.365214   3.401239   4.339811 0.4942946
## 6: Thriller      303   1.620400   3.324058   3.389824   4.429015 0.5150434
##    n_ratings avg_n_ratings max_n_ratings min_n_ratings
## 1:   1830601      6102.003         87901           502
## 2:   2107525      5696.014         60024           501
## 3:   4170892      6338.742         91082           501
## 4:    721031      5113.695         50375           519
## 5:   1187182      5763.019         57070           501
## 6:   2008452      6628.554         91082           508

Follows roughly the same recipe (unfortunately I did not find a reasonable way to make use of row group feature of gt for this dataset):

gt_tbl <- gt(data = rating.by.genre, rowname_col = "genre")
gt_tbl %>%
  tab_header(
    title = "Movie Average Ratings by Genre",
    subtitle = "with # of Ratings >= 500"
  ) %>%
  tab_spanner(
    label = "Average Rating (1-5)",
    columns = vars(min_rating, avg_rating, med_rating, max_rating, sd_rating)
  ) %>%
  fmt_number(
    columns = vars(min_rating, avg_rating, med_rating, max_rating, sd_rating),
    decimals = 2,
    drop_trailing_zeros = F
  ) %>%       
  tab_spanner(
    label = "# of Ratings",
    columns = vars(n_ratings, min_n_ratings, avg_n_ratings, max_n_ratings)
  ) %>%
  fmt_number(
    columns = vars(n_ratings, min_n_ratings, avg_n_ratings, max_n_ratings),
    decimals = 0,
    drop_trailing_zeros = F
  ) %>%    
  tab_source_note(
    source_note = md("Source:  [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
  ) %>%
  cols_label(
    min_rating = "Min", 
    avg_rating = "Avg", 
    med_rating = "Med",
    max_rating = "Max",
    n_ratings = "Total", 
    min_n_ratings = "Min", 
    avg_n_ratings = "Avg",
    max_n_ratings = "Max",
    n_movies="Movies",
    sd_rating="Stdev"
  ) %>%   
  tab_stubhead_label(label = "Genre") %>%
  tab_style(
    style = cells_styles(
      text_size = px(14)),
    locations = list(
        cells_column_labels(columns = 1:10), 
        cells_stub(),
        cells_data()
    ))

Movie Average Ratings by Genre
with # of Ratings >= 500
Genre	Movies	Average Rating (1-5)					# of Ratings
Genre	Movies	Min	Avg	Med	Max	Stdev	Total	Avg	Max	Min
Action	300	1.42	3.34	3.40	4.26	0.52	1,830,601	6,102	87,901	502
Comedy	370	1.62	3.36	3.43	4.34	0.49	2,107,525	5,696	60,024	501
Drama	658	1.27	3.35	3.40	4.43	0.50	4,170,892	6,339	91,082	501
Horror	141	1.61	3.38	3.49	4.12	0.52	721,031	5,114	50,375	519
Romance	206	1.27	3.37	3.40	4.34	0.49	1,187,182	5,763	57,070	501
Thriller	303	1.62	3.32	3.39	4.43	0.52	2,008,452	6,629	91,082	508
Source: "The Movie Dataset" on Kaggle

Suprisingly, the distribution of ratings are quite similar across all genres. Maybe movielens has done some normalization on the ratings?

Rating statistics by Genre and Year

We’d also like to know if the distributions of ratings change over time. First we plot the histogram of the years in which the movies were released. This is a case where a histogram is far more readable than a table.

data.w.genre.ratings[,release_year:=year(ymd(release_date))]
ggplot(data.w.genre.ratings[release_year >= 1950][order(release_year),.N, by=release_year], aes(x=release_year, y=N)) + 
    geom_bar(stat="identity") + ggtitle("# of Movie by Year") + theme_bw() + scale_x_continuous(breaks=seq(1950, 2020, 10))

Now we only take movies released after 1979 and before 2010 into account, and put them in to buckets each representing a decade.

data.w.genre.ratings <- data.w.genre.ratings[(release_year >= 1980) & (release_year < 2010)]
data.w.genre.ratings[, release_decade := (release_year %/% 10) * 10 ]

The same old melt trick:

rating.by.genre.year <- melt(
    data.w.genre.ratings, id=c("id", "avg_rating", "n_ratings", "release_decade"), measure.vars=c("Action", "Comedy", "Drama", "Horror", "Romance", "Thriller"), variable.name="genre"
    )[value==1][
    , .(n_movies=.N, min_rating=min(avg_rating), avg_rating=mean(avg_rating), med_rating=median(avg_rating), max_rating=max(avg_rating), sd_rating=sd(avg_rating), n_ratings=sum(n_ratings), avg_n_ratings=mean(n_ratings), max_n_ratings=max(n_ratings), min_n_ratings=min(n_ratings)), by=.(genre, release_decade)]

Even more manual labeling and a dcast call to handle the extra decade dimension:

gt_tbl <- gt(
    data = dcast(rating.by.genre.year, genre ~ release_decade, value.var = c("avg_rating", "sd_rating", "avg_n_ratings", "n_movies"))[, .(genre,
        n_movies_1980, avg_rating_1980, sd_rating_1980, avg_n_ratings_1980, 
        n_movies_1990, avg_rating_1990, sd_rating_1990, avg_n_ratings_1990,
        n_movies_2000, avg_rating_2000, sd_rating_2000, avg_n_ratings_2000
    )], 
    rowname_col = "genre")
gt_tbl %>%
  tab_header(
    title = "Movie Average Ratings by Genre & Year",
    subtitle = "with # of Ratings >= 500"
  ) %>%
  tab_spanner(
    label = "1980s",
    columns = vars(n_movies_1980, avg_rating_1980, sd_rating_1980, avg_n_ratings_1980)
  ) %>%
  tab_spanner(
    label = "1990s",
    columns = vars(n_movies_1990, avg_rating_1990, sd_rating_1990, avg_n_ratings_1990)
  ) %>%
  tab_spanner(
    label = "2000s",
    columns = vars(n_movies_2000, avg_rating_2000, sd_rating_2000, avg_n_ratings_2000)
  ) %>%
  fmt_number(
    columns = vars(avg_rating_1980, sd_rating_1980, avg_rating_1990, sd_rating_1990, avg_rating_2000, sd_rating_2000),
    decimals = 2,
    drop_trailing_zeros = F
  ) %>%       
  fmt_number(
    columns = vars(avg_n_ratings_1980, avg_n_ratings_1990, avg_n_ratings_2000, n_movies_1980, n_movies_1990, n_movies_2000),
    decimals = 0,
    drop_trailing_zeros = F
  ) %>%    
  tab_source_note(
    source_note = md("Source:  [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
  ) %>%
  tab_footnote(
    footnote = "#: Number of movies",
    cells_column_labels(columns = c(1, 5, 9))
  ) %>%
  tab_footnote(
    footnote = "Avg #: Average number of ratings.",
    cells_column_labels(columns = c(4, 8, 12))
  ) %>%    
  cols_label(
    n_movies_1980 = "#",
    n_movies_1990 = "#",
    n_movies_2000 = "#",
    avg_rating_1980 = "Avg", 
    avg_rating_1990 = "Avg", 
    avg_rating_2000 = "Avg", 
    sd_rating_1980 = "Stdev", 
    sd_rating_1990 = "Stdev",
    sd_rating_2000 = "Stdev",
    avg_n_ratings_1980 = "Avg #", 
    avg_n_ratings_1990 = "Avg #", 
    avg_n_ratings_2000 = "Avg #"
  ) %>%   
  tab_stubhead_label(label = "Genre") %>%
  tab_style(
    style = cells_styles(
      text_size = px(14)),
    locations = list(
        cells_column_labels(columns = 1:12), 
        cells_stub(),
        cells_data()
    ))

Movie Average Ratings by Genre & Year
with # of Ratings >= 500
Genre	1980s				1990s				2000s
Genre	#¹	Avg	Stdev	Avg #²	#¹	Avg	Stdev	Avg #²	#¹	Avg	Stdev	Avg #²
Action	43	3.36	0.49	5,755	82	3.34	0.53	5,202	107	3.31	0.53	6,535
Comedy	58	3.43	0.42	6,927	95	3.37	0.51	6,408	121	3.36	0.49	6,114
Drama	66	3.36	0.42	6,973	163	3.30	0.48	6,930	249	3.36	0.51	6,246
Horror	20	3.42	0.38	4,595	26	3.45	0.48	4,489	41	3.30	0.57	6,711
Romance	15	3.46	0.34	6,689	56	3.26	0.53	7,952	64	3.37	0.48	5,058
Thriller	25	3.34	0.41	6,843	83	3.35	0.51	6,268	121	3.31	0.53	6,799
Source: "The Movie Dataset" on Kaggle
¹ #: Number of movies ² Avg #: Average number of ratings.

It appears the variances of ratings in 1980s are lower than in 1990s and 2000s. (A boxplot or scatterplot might be more appropriate here.)

Acknowledgements

Erik Bruin’s Kaggle notebook “Movie recommendation systems based on TMDB” has been a great source of ideas r.t. manipulating this dataset using R.