Tables can be an effective way of communicating data. Though not as powerful in telling stories as charts, by cramming a lot of numbers into limited space, it can provide readers with accurate and potentially useful information which readers can interpret in their own ways.
I’ve come across this new R package gt(Easily generate information-rich, publication-quality tables from R) and decided to give it a try.
With the gt package, anyone can make wonderful-looking tables using the R programming language.
Admittedly, the tables in the documents might not be the optimal way of presentation. They serve as a demonstration of what gt can do, and maybe also helpful enough for analyst in constructing their stories about this dataset.
(Dataset Source: The Movies Dataset on Kaggle.)
# install.packages("packrat")
# install.packages("pacman")
# install.packages("devtools")
# install.packages("tidyverse")
# devtools::install_github("rstudio/gt")
library(gt)
pacman::p_load("tidyverse")
pacman::p_load("data.table")
pacman::p_load("jsonlite")
pacman::p_load("lubridate")
Some of the fields read in here are not used in the later sections. However, they are kept as a reminder of the potentially useful information.
metadata <- fread("data/movies_metadata.csv", select=c('adult', 'genres', 'release_date', 'original_language', 'original_title', 'id', 'imdb_id'), fill=T)
metadata <- metadata[!is.na(as.integer(id)) & (original_language == 'en') & (adult == "False"), ]
## Warning in eval(.massagei(isub), x, parent.frame()): NAs introduced by
## coercion
metadata[, id := as.integer(id)]
# Replace singles quote in 'genre' so 'fromJSON' function can parse it
metadata[, genre := gsub("\'","\"", metadata$genre)]
# Remove redundant fields
metadata[, c('original_language', 'adult', "imdb_id") := NULL]
Genre information is stored as JSON texts. We need to parse it and do some transformation.
genres <- metadata[, unlist(lapply(genre, fromJSON), recursive=F)['name'], by=id]
sorted.genres <- genres[, .N, by=name][order(-N)]
sorted.genres[1:20]
## name N
## 1: Drama 13222
## 2: Comedy 9311
## 3: Thriller 5983
## 4: Romance 4816
## 5: Action 4728
## 6: Horror 3790
## 7: Documentary 3369
## 8: Crime 3150
## 9: Adventure 2675
## 10: Science Fiction 2447
## 11: Family 2273
## 12: Mystery 1918
## 13: Fantasy 1589
## 14: Music 1390
## 15: Animation 1246
## 16: War 886
## 17: History 883
## 18: Western 843
## 19: TV Movie 713
## 20: Foreign 688
Assuming genres are assigned in order of representativeness, taking at most three top genres for each movie. Furthermore, only the top 6 genres are considered.
# Dummy variable for the later dcast operation
genres[, dummy := 1]
# Only use the first 3 assigned genres
encoded.genres <- dcast(na.omit(genres[,.SD[1:3], by=id])[name %in% sorted.genres[1:6, name]],
id ~ name, value.var='dummy', fill=0)
data.w.genre <- merge(encoded.genres,
metadata[, .(id, original_title, release_date)],
all.x=T, by="id")
Here I used a for loop to collect and calculate statstics genre by genre. I later figured out the better way to do this is probably using a melt call followed by several ratio column calculations. The later sections will use the melt approach instead.
i <- 0
tmp.raw <- list()
tmp.ratio <- list()
for(name in sorted.genres[1:6, name]){
i <- i + 1
dt.tmp <- data.w.genre[get(name) == 1, lapply(.SD, sum), .SDcols=sorted.genres[1:6, name]]
tmp.ratio[[i]] <- copy(dt.tmp)
dt.tmp[, genre := name]
tmp.raw[[i]] <- dt.tmp
tmp.ratio[[i]] <- tmp.ratio[[i]] / dt.tmp[, get(name)]
tmp.ratio[[i]][, genre := name]
}
genre.stats <- merge(rbindlist(tmp.raw), rbindlist(tmp.ratio), by="genre", suffixes=c("", ".ratio"), sort=F)
rm(tmp.raw, tmp.ratio, i)
The actual table declaration and rendering part. Unfortunately it requires a lot of manual labeling and assignments. But I think it is verbose and readable enough.
gt_tbl <- gt(data = genre.stats, rowname_col = "genre")
gt_tbl %>%
tab_header(
title = "Movie Metadata Stats"
) %>%
tab_spanner(
label = "Genre Overlappings",
columns = vars(Drama, Comedy, Thriller, Romance, Action, Horror)
) %>%
tab_spanner(
label = "Genre Overlappings(%)",
columns = vars(Drama.ratio, Comedy.ratio, Thriller.ratio, Romance.ratio, Action.ratio, Horror.ratio)
) %>%
fmt_percent(
columns = vars(Drama.ratio, Comedy.ratio, Thriller.ratio, Romance.ratio, Action.ratio, Horror.ratio),
decimals = 1,
drop_trailing_zeros = F
) %>%
tab_source_note(
source_note = md("Source: [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
) %>%
cols_label(
Drama.ratio = "Drama",
Comedy.ratio = "Comedy",
Thriller.ratio = "Thriller",
Romance.ratio = "Romance",
Action.ratio = "Action",
Horror.ratio = "Horror"
) %>%
tab_stubhead_label(label = "Genre") %>%
tab_style(
style = cells_styles(
text_size = px(12)),
locations = list(
cells_column_labels(columns = 1:12),
cells_stub(),
cells_data()
)) %>%
tab_style(
style = cells_styles(
text_decorate = "underline",
text_weight = "bold"),
locations = list(
cells_data(columns=c(1), rows=c(1)),
cells_data(columns=c(2), rows=c(2)),
cells_data(columns=c(3), rows=c(3)),
cells_data(columns=c(4), rows=c(4)),
cells_data(columns=c(5), rows=c(5)),
cells_data(columns=c(6), rows=c(6))
)) %>%
tab_style(
style = cells_styles(
text_color="lightgrey"),
locations = list(
cells_data(columns=c(7), rows=c(1)),
cells_data(columns=c(8), rows=c(2)),
cells_data(columns=c(9), rows=c(3)),
cells_data(columns=c(10), rows=c(4)),
cells_data(columns=c(11), rows=c(5)),
cells_data(columns=c(12), rows=c(6))
))
| Movie Metadata Stats | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genre | Genre Overlappings | Genre Overlappings(%) | |||||||||||
| Drama | Comedy | Thriller | Romance | Action | Horror | Drama | Comedy | Thriller | Romance | Action | Horror | ||
| Drama | 13008 | 2639 | 1906 | 2656 | 1441 | 550 | 100.0% | 20.3% | 14.7% | 20.4% | 11.1% | 4.2% | |
| Comedy | 2639 | 9175 | 284 | 2099 | 868 | 509 | 28.8% | 100.0% | 3.1% | 22.9% | 9.5% | 5.5% | |
| Thriller | 1906 | 284 | 4986 | 163 | 1272 | 1340 | 38.2% | 5.7% | 100.0% | 3.3% | 25.5% | 26.9% | |
| Romance | 2656 | 2099 | 163 | 4325 | 150 | 58 | 61.4% | 48.5% | 3.8% | 100.0% | 3.5% | 1.3% | |
| Action | 1441 | 868 | 1272 | 150 | 4657 | 318 | 30.9% | 18.6% | 27.3% | 3.2% | 100.0% | 6.8% | |
| Horror | 550 | 509 | 1340 | 58 | 318 | 3691 | 14.9% | 13.8% | 36.3% | 1.6% | 8.6% | 100.0% | |
| Source: "The Movie Dataset" on Kaggle | |||||||||||||
How to read the table: For example, Drama (the first row) has 13008 movies (the first column), and 2639 of them are also under Comedy, which is 20.3% of the 13008 movies.
The diagonal elements of ‘Genre Overlappings’ section are emphasized because they represent the movie count of the respective genre. Similarily, the diagonal elements of ‘Genre Overlappings(%)’ section are somewhat hiddened since they convey almost no useful information.
Now we turn our attention to movie ratings (from movielens). We only consider the average rating of a movie (which of course is not perfect).
ratings <- fread("data/ratings.csv", select=c('movieId', 'rating'))
ratings.stats <- ratings[, .(n_ratings=.N, avg_rating=mean(rating)), by=movieId]
rm(ratings)
head(ratings.stats)
## movieId n_ratings avg_rating
## 1: 110 66512 4.016057
## 2: 147 4967 3.595933
## 3: 858 57070 4.339811
## 4: 1221 36679 4.263475
## 5: 1246 25752 3.912803
## 6: 1968 26611 3.827553
Here we remove move with less than 500 ratings.
data.w.genre.ratings <- merge(data.w.genre, ratings.stats, by.y="movieId", by.x="id")
# Only keep moives with >= 500 ratings
data.w.genre.ratings <- data.w.genre.ratings[n_ratings >= 500]
head(data.w.genre.ratings)
## id Action Comedy Drama Horror Romance Thriller original_title
## 1: 5 0 1 0 0 0 0 Four Rooms
## 2: 6 1 0 0 0 0 1 Judgment Night
## 3: 11 1 0 0 0 0 0 Star Wars
## 4: 13 0 1 1 0 1 0 Forrest Gump
## 5: 14 0 0 1 0 0 0 American Beauty
## 6: 15 0 0 1 0 0 0 Citizen Kane
## release_date n_ratings avg_rating
## 1: 1995-12-09 15258 3.079565
## 2: 1993-10-15 27895 3.841764
## 3: 1977-05-25 19475 3.660591
## 4: 1994-07-06 1838 3.326442
## 5: 1999-09-15 6807 3.430807
## 6: 1941-04-30 3125 2.727040
Using a melt call followed by a bunch of data.table operations to prepare the metrics:
rating.by.genre <- melt(
data.w.genre.ratings, id=c("id", "avg_rating", "n_ratings"), measure.vars=c("Action", "Comedy", "Drama", "Horror", "Romance", "Thriller"), variable.name="genre"
)[value==1][
, .(n_movies=.N, min_rating=min(avg_rating), avg_rating=mean(avg_rating), med_rating=median(avg_rating), max_rating=max(avg_rating), sd_rating=sd(avg_rating), n_ratings=sum(n_ratings), avg_n_ratings=mean(n_ratings), max_n_ratings=max(n_ratings), min_n_ratings=min(n_ratings)), by=genre]
head(rating.by.genre)
## genre n_movies min_rating avg_rating med_rating max_rating sd_rating
## 1: Action 300 1.417214 3.338321 3.399981 4.255074 0.5158960
## 2: Comedy 370 1.620400 3.361114 3.429385 4.339811 0.4851807
## 3: Drama 658 1.271583 3.350580 3.403381 4.429015 0.5023833
## 4: Horror 141 1.607782 3.381620 3.487408 4.124958 0.5152463
## 5: Romance 206 1.271583 3.365214 3.401239 4.339811 0.4942946
## 6: Thriller 303 1.620400 3.324058 3.389824 4.429015 0.5150434
## n_ratings avg_n_ratings max_n_ratings min_n_ratings
## 1: 1830601 6102.003 87901 502
## 2: 2107525 5696.014 60024 501
## 3: 4170892 6338.742 91082 501
## 4: 721031 5113.695 50375 519
## 5: 1187182 5763.019 57070 501
## 6: 2008452 6628.554 91082 508
Follows roughly the same recipe (unfortunately I did not find a reasonable way to make use of row group feature of gt for this dataset):
gt_tbl <- gt(data = rating.by.genre, rowname_col = "genre")
gt_tbl %>%
tab_header(
title = "Movie Average Ratings by Genre",
subtitle = "with # of Ratings >= 500"
) %>%
tab_spanner(
label = "Average Rating (1-5)",
columns = vars(min_rating, avg_rating, med_rating, max_rating, sd_rating)
) %>%
fmt_number(
columns = vars(min_rating, avg_rating, med_rating, max_rating, sd_rating),
decimals = 2,
drop_trailing_zeros = F
) %>%
tab_spanner(
label = "# of Ratings",
columns = vars(n_ratings, min_n_ratings, avg_n_ratings, max_n_ratings)
) %>%
fmt_number(
columns = vars(n_ratings, min_n_ratings, avg_n_ratings, max_n_ratings),
decimals = 0,
drop_trailing_zeros = F
) %>%
tab_source_note(
source_note = md("Source: [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
) %>%
cols_label(
min_rating = "Min",
avg_rating = "Avg",
med_rating = "Med",
max_rating = "Max",
n_ratings = "Total",
min_n_ratings = "Min",
avg_n_ratings = "Avg",
max_n_ratings = "Max",
n_movies="Movies",
sd_rating="Stdev"
) %>%
tab_stubhead_label(label = "Genre") %>%
tab_style(
style = cells_styles(
text_size = px(14)),
locations = list(
cells_column_labels(columns = 1:10),
cells_stub(),
cells_data()
))
| Movie Average Ratings by Genre | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| with # of Ratings >= 500 | |||||||||||
| Genre | Movies | Average Rating (1-5) | # of Ratings | ||||||||
| Min | Avg | Med | Max | Stdev | Total | Avg | Max | Min | |||
| Action | 300 | 1.42 | 3.34 | 3.40 | 4.26 | 0.52 | 1,830,601 | 6,102 | 87,901 | 502 | |
| Comedy | 370 | 1.62 | 3.36 | 3.43 | 4.34 | 0.49 | 2,107,525 | 5,696 | 60,024 | 501 | |
| Drama | 658 | 1.27 | 3.35 | 3.40 | 4.43 | 0.50 | 4,170,892 | 6,339 | 91,082 | 501 | |
| Horror | 141 | 1.61 | 3.38 | 3.49 | 4.12 | 0.52 | 721,031 | 5,114 | 50,375 | 519 | |
| Romance | 206 | 1.27 | 3.37 | 3.40 | 4.34 | 0.49 | 1,187,182 | 5,763 | 57,070 | 501 | |
| Thriller | 303 | 1.62 | 3.32 | 3.39 | 4.43 | 0.52 | 2,008,452 | 6,629 | 91,082 | 508 | |
| Source: "The Movie Dataset" on Kaggle | |||||||||||
Suprisingly, the distribution of ratings are quite similar across all genres. Maybe movielens has done some normalization on the ratings?
We’d also like to know if the distributions of ratings change over time. First we plot the histogram of the years in which the movies were released. This is a case where a histogram is far more readable than a table.
data.w.genre.ratings[,release_year:=year(ymd(release_date))]
ggplot(data.w.genre.ratings[release_year >= 1950][order(release_year),.N, by=release_year], aes(x=release_year, y=N)) +
geom_bar(stat="identity") + ggtitle("# of Movie by Year") + theme_bw() + scale_x_continuous(breaks=seq(1950, 2020, 10))
Now we only take movies released after 1979 and before 2010 into account, and put them in to buckets each representing a decade.
data.w.genre.ratings <- data.w.genre.ratings[(release_year >= 1980) & (release_year < 2010)]
data.w.genre.ratings[, release_decade := (release_year %/% 10) * 10 ]
The same old melt trick:
rating.by.genre.year <- melt(
data.w.genre.ratings, id=c("id", "avg_rating", "n_ratings", "release_decade"), measure.vars=c("Action", "Comedy", "Drama", "Horror", "Romance", "Thriller"), variable.name="genre"
)[value==1][
, .(n_movies=.N, min_rating=min(avg_rating), avg_rating=mean(avg_rating), med_rating=median(avg_rating), max_rating=max(avg_rating), sd_rating=sd(avg_rating), n_ratings=sum(n_ratings), avg_n_ratings=mean(n_ratings), max_n_ratings=max(n_ratings), min_n_ratings=min(n_ratings)), by=.(genre, release_decade)]
Even more manual labeling and a dcast call to handle the extra decade dimension:
gt_tbl <- gt(
data = dcast(rating.by.genre.year, genre ~ release_decade, value.var = c("avg_rating", "sd_rating", "avg_n_ratings", "n_movies"))[, .(genre,
n_movies_1980, avg_rating_1980, sd_rating_1980, avg_n_ratings_1980,
n_movies_1990, avg_rating_1990, sd_rating_1990, avg_n_ratings_1990,
n_movies_2000, avg_rating_2000, sd_rating_2000, avg_n_ratings_2000
)],
rowname_col = "genre")
gt_tbl %>%
tab_header(
title = "Movie Average Ratings by Genre & Year",
subtitle = "with # of Ratings >= 500"
) %>%
tab_spanner(
label = "1980s",
columns = vars(n_movies_1980, avg_rating_1980, sd_rating_1980, avg_n_ratings_1980)
) %>%
tab_spanner(
label = "1990s",
columns = vars(n_movies_1990, avg_rating_1990, sd_rating_1990, avg_n_ratings_1990)
) %>%
tab_spanner(
label = "2000s",
columns = vars(n_movies_2000, avg_rating_2000, sd_rating_2000, avg_n_ratings_2000)
) %>%
fmt_number(
columns = vars(avg_rating_1980, sd_rating_1980, avg_rating_1990, sd_rating_1990, avg_rating_2000, sd_rating_2000),
decimals = 2,
drop_trailing_zeros = F
) %>%
fmt_number(
columns = vars(avg_n_ratings_1980, avg_n_ratings_1990, avg_n_ratings_2000, n_movies_1980, n_movies_1990, n_movies_2000),
decimals = 0,
drop_trailing_zeros = F
) %>%
tab_source_note(
source_note = md("Source: [\"The Movie Dataset\"](rounakbanik/the-movies-dataset) on Kaggle")
) %>%
tab_footnote(
footnote = "#: Number of movies",
cells_column_labels(columns = c(1, 5, 9))
) %>%
tab_footnote(
footnote = "Avg #: Average number of ratings.",
cells_column_labels(columns = c(4, 8, 12))
) %>%
cols_label(
n_movies_1980 = "#",
n_movies_1990 = "#",
n_movies_2000 = "#",
avg_rating_1980 = "Avg",
avg_rating_1990 = "Avg",
avg_rating_2000 = "Avg",
sd_rating_1980 = "Stdev",
sd_rating_1990 = "Stdev",
sd_rating_2000 = "Stdev",
avg_n_ratings_1980 = "Avg #",
avg_n_ratings_1990 = "Avg #",
avg_n_ratings_2000 = "Avg #"
) %>%
tab_stubhead_label(label = "Genre") %>%
tab_style(
style = cells_styles(
text_size = px(14)),
locations = list(
cells_column_labels(columns = 1:12),
cells_stub(),
cells_data()
))
| Movie Average Ratings by Genre & Year | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| with # of Ratings >= 500 | |||||||||||||
| Genre | 1980s | 1990s | 2000s | ||||||||||
| #1 | Avg | Stdev | Avg #2 | #1 | Avg | Stdev | Avg #2 | #1 | Avg | Stdev | Avg #2 | ||
| Action | 43 | 3.36 | 0.49 | 5,755 | 82 | 3.34 | 0.53 | 5,202 | 107 | 3.31 | 0.53 | 6,535 | |
| Comedy | 58 | 3.43 | 0.42 | 6,927 | 95 | 3.37 | 0.51 | 6,408 | 121 | 3.36 | 0.49 | 6,114 | |
| Drama | 66 | 3.36 | 0.42 | 6,973 | 163 | 3.30 | 0.48 | 6,930 | 249 | 3.36 | 0.51 | 6,246 | |
| Horror | 20 | 3.42 | 0.38 | 4,595 | 26 | 3.45 | 0.48 | 4,489 | 41 | 3.30 | 0.57 | 6,711 | |
| Romance | 15 | 3.46 | 0.34 | 6,689 | 56 | 3.26 | 0.53 | 7,952 | 64 | 3.37 | 0.48 | 5,058 | |
| Thriller | 25 | 3.34 | 0.41 | 6,843 | 83 | 3.35 | 0.51 | 6,268 | 121 | 3.31 | 0.53 | 6,799 | |
| Source: "The Movie Dataset" on Kaggle | |||||||||||||
| 1 #: Number of movies 2 Avg #: Average number of ratings. |
|||||||||||||
It appears the variances of ratings in 1980s are lower than in 1990s and 2000s. (A boxplot or scatterplot might be more appropriate here.)