In this document I do some basic analyses on the list of Japanese films available for streaming on the Criterion Channel. (I’m a Criterion Channel charter subscriber and have an interest in both Japanese cinema and data visualization, so I thought this would be a fun thing to do.)
For readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information, see the tutorial “Getting started with the Tidyverse”.
I use the tidyverse package of functions for general data manipulation, the tools package to get the md5sum function, and the caTools package to get the runmean function.
library("tidyverse")
library("tools")
library("caTools")
I use data from two sources:
- cc-japanese-films.csv: a hand-compiled Google spreadsheet of Japanese films on the Criterion Channel as of April 13, 2019, downloaded in CSV format.
- tcc-library-japan.csv: a copy in CSV format of an auto-generated list of Criterion Channel films tagged as being associated with Japan, from the site tcclibrary.com.
Each row of each file contains data on a single film. The first file (the hand-curated list) has the following variables:
- Year. The (four-digit) year in which the film was released.
- DirectorGiven. The given name of the film’s director.
- DirectorFamily. The family name of the film’s director.
- Film. The title of the film.
- CCLink. The URL of the film on the Criterion Channel site.
- Series. The name of the series or franchise to which the film belongs (if applicable). (This is also used to mark sequels or films released in multiple parts.)
- SeriesNum. The number of the film within its series or franchise (if applicable).
The second file (the tcclibrary.com list) has the following variables:
- Film. The title of the film.
- Director. The name of the film’s director, in Western name order.
- Country. The country tag(s) for the film. For most films this will be a single tag indicating the country in which the film was produced. If a second tag is present it can indicate a co-production (e.g., Dersu Uzala) or a film produced in one country but about another (e.g., Hiroshima mon amour).
- Duration. The length of the film in hh:mm:ss format.
- Year. The (four-digit) year in which the film was released.
I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.
stopifnot(md5sum("cc-japanese-films.csv")=="54970ee713ec8976337b4ce7c788fba4")
stopifnot(md5sum("tcc-library-japan.csv")=="a152728dccabe7747c1f307a946b1afd")
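As a rough illustration of how this kind of checksum guard behaves, here is a minimal sketch using a throwaway temporary file rather than the real CSV files. The names demo_file, expected, and changed are hypothetical and not part of the original analysis.

```r
library(tools)

# Hypothetical demo file; the real analysis checks the two CSV files instead.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Year,Film", "1954,Seven Samurai"), demo_file)

# md5sum() returns a character vector named by file path,
# so unname() keeps just the 32-character hex digest.
expected <- unname(md5sum(demo_file))

# The check passes while the file is unchanged.
stopifnot(unname(md5sum(demo_file)) == expected)

# After any modification the digest differs, so the same check would stop.
cat("1953,Tokyo Story\n", file = demo_file, append = TRUE)
changed <- unname(md5sum(demo_file)) != expected
print(changed)
```

In practice the expected digest would be computed once, when the file is first downloaded, and then pasted into the script as a string literal.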
I begin by reading in the two CSV files; the col_types parameter is a string identifying each column as a character string (“c”) or an integer (“i”).
films <- read_csv("cc-japanese-films.csv", col_types="iccccci")
tcclibrary <- read_csv("tcc-library-japan.csv", col_types="cccci")
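To show how the compact col_types string lines up with the columns, here is a small sketch that reads a made-up three-column file. The file contents and the names demo_file and demo are assumptions for illustration only.

```r
library(readr)

# Hypothetical three-column file standing in for the real CSV files.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Year,Film,SeriesNum",
             "1962,The Tale of Zatoichi,1",
             "1953,Tokyo Story,"), demo_file)

# One letter per column, in file order: integer, character, integer.
demo <- read_csv(demo_file, col_types = "ici")

# The empty SeriesNum field in the second row is read as NA.
print(demo)
```

Spelling out the column types this way means a malformed value is reported as a parsing problem instead of silently changing the type of the whole column.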
I add a new field Director to the hand-curated list to match the corresponding field in the tcclibrary.com list, replacing the previous fields for given name and family name.
films <- unite(films, Director, DirectorGiven, DirectorFamily, sep=" ")
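The effect of unite can be seen on a tiny made-up table. This is only a sketch; the tibble demo is hypothetical.

```r
library(tibble)
library(tidyr)

# Hypothetical two-row table with separate given and family names.
demo <- tibble(Year = c(1954L, 1953L),
               DirectorGiven = c("Akira", "Yasujiro"),
               DirectorFamily = c("Kurosawa", "Ozu"))

# Paste the two name columns together with a space; by default the
# input columns are dropped and replaced by the new Director column.
demo <- unite(demo, Director, DirectorGiven, DirectorFamily, sep = " ")
print(demo)
```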
I then check for discrepancies between the two lists, as follows:
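1. I use the anti_join function to check for films that are in my hand-curated list but not in the tcclibrary.com list.
2. I use the select function to retain (and display) only the film’s title.
3. I then repeat the check in the opposite direction, looking for films that are in the tcclibrary.com list but not in my hand-curated list.
anti_join(films, tcclibrary, by="Film") %>%
  select(Film)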
## # A tibble: 5 x 1
## Film
## <chr>
## 1 The 47 Ronin Part 1
## 2 The 47 Ronin Part 2
## 3 The Tale of Zatoichi
## 4 Lone Wolf and Cub: Sword of Vengeance
## 5 Hanzo the Razor: Who’s Got the Gold?
anti_join(tcclibrary, films, by="Film") %>%
  select(Film)
## # A tibble: 10 x 1
## Film
## <chr>
## 1 THE 47 RONIN: Part 1
## 2 THE 47 RONIN: Part 2
## 3 Hiroshima mon amour
## 4 Zatoichi #1: The Tale of Zatoichi
## 5 Lone Wolf and Cub #1: Sword of Vengeance
## 6 Hanzo the Razor: Who’s Got the Gold
## 7 Mishima: A Life in Four Chapters
## 8 A Brief History of Time
## 9 Night on Earth
## 10 Yi Yi
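As a side illustration of how anti_join reacts to titles that differ only in formatting, here is a minimal sketch on made-up data. The tibbles mine and theirs are hypothetical stand-ins for the two film lists.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-ins for the hand-curated and auto-generated lists.
mine <- tibble(Film = c("Ikiru", "The Tale of Zatoichi"))
theirs <- tibble(Film = c("Ikiru",
                          "Zatoichi #1: The Tale of Zatoichi",
                          "Yi Yi"))

# Rows of the first table with no exact title match in the second.
only_mine <- anti_join(mine, theirs, by = "Film")
only_theirs <- anti_join(theirs, mine, by = "Film")

print(only_mine)
print(only_theirs)
```

Because the match is on the exact string, the same film listed under two differently formatted titles shows up once in each direction.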
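The discrepancies can be accounted for as follows:
- Five films appear in both lists but under differently formatted titles: the two parts of The 47 Ronin, The Tale of Zatoichi, Lone Wolf and Cub: Sword of Vengeance, and Hanzo the Razor: Who’s Got the Gold? (listed without the question mark at tcclibrary.com).
- The other five films in the tcclibrary.com list (Hiroshima mon amour, Mishima: A Life in Four Chapters, A Brief History of Time, Night on Earth, and Yi Yi) carry the Japan tag there but are not in my hand-curated list.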
I verify that the listed year and director for each film match between the two lists, as follows:
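1. I join the two lists, matching films on the common field Film.
2. Both lists also contain the fields Director and Year. The joined list renames these as Director.x and Year.x for fields coming from the hand-curated list, and Director.y and Year.y for fields coming from the tcclibrary.com list.
3. I keep only the films for which the two lists disagree on either field, and display the conflicting values.
films %>%
  inner_join(tcclibrary, by="Film") %>%
  filter(Director.x != Director.y | Year.x != Year.y) %>%
  select(Film, Year.x, Year.y, Director.x, Director.y)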
## # A tibble: 1 x 5
## Film Year.x Year.y Director.x Director.y
## <chr> <int> <int> <chr> <chr>
## 1 Sensation of the … 1966 1966 Taguchi/Nobumasa Suketar… Taguchi Suke…
The one anomaly is for the film Sensation of the Century, which has two directors and for which the directors’ names are formatted differently in the two lists.
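The way the join adds the .x and .y suffixes, and how the filter picks out disagreements, can be sketched on made-up data. The tibbles mine and theirs and the deliberately wrong year are assumptions for illustration.

```r
library(tibble)
library(dplyr)

# Hypothetical lists that agree on titles but not on every year.
mine <- tibble(Film = c("Ikiru", "Tokyo Story"),
               Director = c("Akira Kurosawa", "Yasujiro Ozu"),
               Year = c(1952L, 1953L))
theirs <- tibble(Film = c("Ikiru", "Tokyo Story"),
                 Director = c("Akira Kurosawa", "Yasujiro Ozu"),
                 Year = c(1952L, 1935L))  # deliberately wrong year

# Columns present in both tables (other than the join key) get
# the suffixes .x (first table) and .y (second table).
mismatches <- mine %>%
  inner_join(theirs, by = "Film") %>%
  filter(Director.x != Director.y | Year.x != Year.y) %>%
  select(Film, Year.x, Year.y, Director.x, Director.y)

print(mismatches)
```

Note that an inner join only compares films whose titles match exactly, so films listed under differently formatted titles are not covered by this check.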
I do analyses to answer the following questions:
I create a line plot showing the number of Japanese films on the Criterion Channel released in each year, along with a moving average to smooth out drastic year-to-year fluctuations in the number of releases.
I do this as follows:
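1. I group the films by year of release and summarize each group, creating a new variable NF with the number of films in each group.
2. I add rows for the years from 1920 to 2019 in which no films were released (NF=0).
3. I compute a five-year moving average of the number of films (using the runmean function) and then add that to the table of yearly counts as a new variable NFMA.
4. I use ggplot to create the plot, using geom_line once to plot a line for the actual number of releases (NF) and then again to plot a line for the five-year moving average (NFMA). The former line I plot in gray in a slightly smaller width than normal, and the latter line I plot in blue in a slightly larger width.
5. I add tick marks every five years on the x-axis and a label for the y-axis (the x-axis is labeled with the name of its variable, Year), along with a plot title and subtitle.
6. I use the theme_minimal theme for a clean look, and then tweak it slightly for readability, displaying the x-axis tick mark labels at an angle, moving the x- and y-axis labels slightly away from the tick mark labels, and moving the caption slightly lower.
fpy <- films %>%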
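  group_by(Year) %>%
  summarize(NF = n())
# Add zero-count rows for years with no films, then sort by year so that
# the moving average below is computed over consecutive years.
fpy <- bind_rows(fpy, tibble(Year=setdiff(1920:2019, fpy$Year), NF=0)) %>%
  arrange(Year)
mov_avg <- runmean(fpy$NF, 5)
fpy <- bind_cols(fpy, NFMA=mov_avg)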
fpy %>%
  ggplot() +
  # Gray line: actual yearly counts; blue line: five-year moving average.
  geom_line(mapping=aes(x=Year, y=NF), color="gray", size=0.75) +
  geom_line(mapping=aes(x=Year, y=NFMA), color="blue", size=1.5) +
  scale_x_continuous(breaks=seq(1925, 2015, 5)) +
  ylab("Films") +
  labs(title="Japanese Films on the Criterion Channel",
       subtitle="Number of Available Films for Each Year (Actual and 5-Year Moving Average)",
       caption="Data sources: The Criterion Channel and tcclibrary.com") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15)))
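The gap-filling and smoothing steps can be sketched in base R on a handful of made-up counts. This is not the code used for the plot: it swaps in stats::filter for runmean (so the first and last two years come out as NA rather than being averaged over a shorter window), and the data frame counts is hypothetical. The detail that matters is that the rows are sorted by year before the moving average is taken.

```r
# Hypothetical yearly counts with a gap (no films in 1932 or 1933).
counts <- data.frame(Year = c(1930, 1931, 1934, 1935, 1936),
                     NF = c(2, 4, 6, 3, 5))

# Fill in the missing years with zero counts, then sort by year.
missing <- setdiff(1930:1936, counts$Year)
counts <- rbind(counts, data.frame(Year = missing, NF = 0))
counts <- counts[order(counts$Year), ]

# Centered five-year moving average: each value is the mean of the
# year itself and the two years on either side.
counts$NFMA <- as.numeric(stats::filter(counts$NF, rep(1 / 5, 5), sides = 2))
print(counts)
```

Without the sort, the zero-count rows would sit at the end of the table and each average would mix years that are not adjacent.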
The release dates of the available films range from 1929 to 2013; the maximum number of films from any given year is 17.
The conventional wisdom (as found on Wikipedia, for example) is that the 1950s were the golden age of Japanese cinema. However, if we judge inclusion on the Criterion Channel as an endorsement, that golden age extended at least until the mid-1960s. It’s also interesting that the Criterion Channel has relatively few films from the mid-1970s on, with many of those being from Juzo Itami (see below); the 1930s are actually better represented in terms of the number of films per year.
I create a bar chart showing the number of Japanese films on the Criterion Channel for each director, restricting the plot to directors with at least five releases available.
I do this as follows:
1. I group the films by director and summarize each group, creating a new variable NF with the number of films in each group.
2. I retain only the directors with five or more films.
3. I use ggplot to create the plot, using geom_bar to plot the number of films per director, reordering the directors on the x-axis to list them in decreasing order of number of films.
4. I use the theme_minimal theme for a clean look, and again tweak it slightly for readability as I did for the plot above.
fpd <- films %>%
  group_by(Director) %>%
  summarize(NF=n()) %>%
  filter(NF >= 5)
fpd %>%
  # reorder() sorts the directors by descending film count.
  ggplot(mapping=aes(x=reorder(Director, -NF), y=NF)) +
  geom_bar(stat="identity") +
  xlab("Director") +
  ylab("Films") +
  labs(title="Japanese Directors with Five or More Films on the Criterion Channel",
       subtitle="Number of Films Available for Each Director",
       caption="Data sources: The Criterion Channel and tcclibrary.com") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=5))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15)))
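The reordering trick used for the x-axis can be checked on its own. In this sketch the data frame demo is hypothetical and the counts for Ozu and Kurosawa are made up purely for illustration.

```r
# Hypothetical counts; only the ordering matters here.
demo <- data.frame(Director = c("Ozu", "Kinoshita", "Kurosawa"),
                   NF = c(30, 42, 25))

# reorder() returns a factor whose levels are sorted by the second
# argument in increasing order; negating NF therefore puts the
# director with the most films first.
by_count <- reorder(demo$Director, -demo$NF)
print(levels(by_count))
```

ggplot lays out a discrete axis in factor-level order, which is why the bars appear from most to fewest films.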
Famous Japanese directors like Kurosawa and Ozu are well represented on the Criterion Channel, but Keisuke Kinoshita, the director with the most releases (42), is not a household name in America. However, he was a very popular filmmaker in Japan: his 1954 film Twenty-Four Eyes won the 1955 Kinema Junpo best film award (apparently beating out, among others, Kurosawa’s Seven Samurai and Mizoguchi’s Sansho the Bailiff), and is rated #6 on the Kinema Junpo list of top Japanese films. Kinoshita was also very prolific and had a very long career (see below). It’s therefore not surprising that so many of his films are available on the Criterion Channel.
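I create a box plot (also known as a box-and-whiskers plot) for each Japanese director with ten or more releases on the Criterion Channel, summarizing the years in which they directed their available films. The plot is interpreted as follows:
- The box spans the middle half of the director’s available films by year of release, from the first quartile to the third quartile (the “middle period”).
- The line inside the box marks the median year of release (the “career midpoint”).
- The whiskers extend from the box to the earliest and latest films that lie within 1.5 times the width of the box (the “early” and “late” periods).
- Any points beyond the whiskers are outliers: films released unusually early or late relative to the rest of the director’s available films.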
The plot itself is created as follows:
1. I group the films by director and summarize each group, creating two new variables FY and NF, representing the first year in which the director released a film available on the Criterion Channel and the number of that director’s films available.
2. I retain only the directors with ten or more films. The resulting list has only the variables Director, FY, and NF. I therefore join this list with the full list of films to produce a new list containing only films for the most prolific directors. This list will include the other variables of interest, most notably the year of release (Year).
3. I select the variables needed for the plot, primarily Director and Year but also FY (see the next step).
4. I use ggplot to create the plot, using geom_boxplot to create a boxplot for each director. I reorder the list of directors to display them in order of the first year for which they have a film available on the Criterion Channel.
5. I flip the coordinates so that the former x-axis (Director) now becomes the y-axis, with the former y-axis (Year) now the x-axis.
6. I use the theme_minimal theme for a clean look, and again tweak it slightly for readability.
prolific <- films %>%
  group_by(Director) %>%
  summarize(FY=min(Year), NF=n()) %>%
  filter(NF >= 10)
films %>%
  inner_join(prolific, by="Director") %>%
  select(Director, Year, FY) %>%
  # Order the directors by the year of their earliest available film.
  ggplot(mapping=aes(x=reorder(Director, FY), y=Year)) +
  geom_boxplot() +
  xlab("Director") +
  scale_y_continuous(breaks=seq(1930, 2000, 5)) +
  coord_flip() +
  labs(title="Japanese Directors with Ten or More Films on the Criterion Channel",
       subtitle="Early, Middle, and Late Periods, Career Midpoint, and Outliers",
       caption="Data sources: The Criterion Channel and tcclibrary.com") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15)))
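To make the pieces of a box plot concrete, the following sketch computes them by hand for one made-up director. The vector years is hypothetical, and the fences use the usual rule of 1.5 times the interquartile range, which is also the default for geom_boxplot.

```r
# Hypothetical release years for a single director's available films.
years <- c(1943, 1950, 1952, 1954, 1957, 1961, 1965, 1970, 1993)

# Edges of the box (first and third quartiles) and the median line.
q <- quantile(years, c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]

# Whiskers stop at the most extreme films inside these fences;
# anything outside is drawn as an individual outlier point.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr
outliers <- years[years < lower_fence | years > upper_fence]

print(q)
print(outliers)
```

Here the box runs from 1952 to 1965 with the median at 1957, and the 1993 film falls outside the upper fence, so it would be plotted as a separate point.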
Several things are worth noting on this plot:
Japanese cinema has featured several successful film franchises, some of which have films available on the Criterion Channel. Here I look at the franchises among the available films, identified as film series with at least four films in the series. This eliminates two-part films, films with only a single sequel, and trilogies, leaving only what I’d consider to be a true franchise. For each such franchise I plot the release dates for the franchise’s films.
The plot itself is created as follows:
1. I retain only the films that belong to a series, group them by series, and summarize each group, creating two new variables FY and NF, representing the first year for which a film in the series is available on the Criterion Channel and the number of available films in that series respectively.
2. I retain only the series with four or more films. The resulting list has only the variables Series, FY, and NF. I therefore join this list with the full list of films to produce a new list containing only films for the series of interest. This list will include the other variables of interest, most notably the year of release (Year).
3. The variables needed for the plot are primarily Series and Year but also FY (see the next step).
4. I use ggplot to create the plot, using geom_jitter to plot a point for each film’s release date while trying to avoid having points for films released the same year overlap. I reorder the list of franchises to display them in order of the first year in which they have a release available on the Criterion Channel. I tweak the amount of jitter a bit to specify that jittering be done around the year value but not the franchise value. (I use the height parameter for this because the y-axis will become the x-axis after flipping coordinates below.)
5. I flip the coordinates so that the former x-axis (Series) now becomes the y-axis, with the former y-axis (Year) now the x-axis.
6. I use the theme_minimal theme for a clean look, and then tweak it slightly for better readability.
franchises <- films %>%
  filter(!is.na(Series)) %>%
  group_by(Series) %>%
  summarize(FY=min(Year), NF=n()) %>%
  filter(NF >= 4) %>%
  select(Series, FY)
f_films <- films %>%
  inner_join(franchises, by="Series")
f_films %>%
  ggplot(mapping=aes(x=reorder(Series, FY), y=Year)) +
  # Jitter only along the Year axis (height), not across franchises (width).
  geom_jitter(width=0, height=0.4) +
  xlab("Franchise") +
  scale_y_continuous(breaks=seq(1950, 1980, 5)) +
  coord_flip() +
  labs(title="Japanese Film Franchises on the Criterion Channel",
       subtitle="Film Release Dates",
       caption="Data sources: The Criterion Channel and tcclibrary.com") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15)))
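The summarize-filter-join pattern used to pick out the franchises can be tried on a small made-up table. The tibble demo and the series names are hypothetical.

```r
library(tibble)
library(dplyr)

# Hypothetical films: a four-film series, a two-film series,
# and one standalone film with no series.
demo <- tibble(Film = paste("Film", 1:7),
               Year = c(1962L, 1962L, 1963L, 1963L, 1954L, 1955L, 1960L),
               Series = c(rep("Series A", 4), rep("Series B", 2), NA))

# Keep only series with at least four films, recording the first year.
big_series <- demo %>%
  filter(!is.na(Series)) %>%
  group_by(Series) %>%
  summarize(FY = min(Year), NF = n()) %>%
  filter(NF >= 4) %>%
  select(Series, FY)

# Joining back to the film list keeps only the films in those series.
in_big_series <- inner_join(demo, big_series, by = "Series")
print(in_big_series)
```

The inner join does two jobs at once here: it drops every film outside the qualifying series and attaches FY to each remaining film, so the plot can order the series by first year.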
The Godzilla franchise is the longest-lived franchise on the Criterion Channel, and continues to this day. However, the Zatoichi: The Blind Swordsman series has more entries and (unlike the Godzilla series) is essentially complete on the Criterion Channel: only the last film in the series, Zatoichi: Darkness Is His Ally, is not available.
The major caveat is that the Criterion Channel does not have all Japanese films, or all Japanese films by the directors whose films it does have, or even all the top Japanese films ranked by critical reputation. As but one example, it does not have Yasujiro Ozu’s Days of Youth, the earliest Ozu film to have survived. As another example, the Criterion Channel has only five of the top ten films on the Kinema Junpo list of top Japanese films.
I originally created a Google spreadsheet by hand to record Japanese films I wanted to watch on the Criterion Channel. I then read the article “Here’s a List of Every Film Available on The Criterion Channel” at The Film Stage and found out about the film list at tcclibrary.com created by the Reddit user sciencehuh. I accessed the site on April 12, 2019, copied the list of films tagged “Japan”, put them into a Google spreadsheet, downloaded the spreadsheet in CSV format, and used that to cross-check my hand-curated list.
Since I began work on these plots the Criterion Channel has released its own official list of films. However, if I filter the list to show only films with country equal to Japan then only 314 results are returned, about twenty fewer than are in my hand-curated spreadsheet. I have not tried to reconcile the sources, but it’s possible that the Criterion Channel list consolidates multiple films into one entry in some cases.
I am done with this project, at least for now. However, in case someone else wants to continue it in some fashion, here are some suggestions for further work:
I used the following R environment in doing the analysis above:
sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] tools stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] bindrcpp_0.2.2 caTools_1.17.1.1 forcats_0.3.0 stringr_1.3.1
## [5] dplyr_0.7.6 purrr_0.2.5 readr_1.1.1 tidyr_0.8.1
## [9] tibble_1.4.2 ggplot2_3.0.0 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.4 haven_1.1.2 lattice_0.20-38 colorspace_1.3-2
## [5] htmltools_0.3.6 yaml_2.2.0 utf8_1.1.4 rlang_0.2.2
## [9] pillar_1.3.0 glue_1.3.0 withr_2.1.2 modelr_0.1.2
## [13] readxl_1.1.0 bindr_0.1.1 plyr_1.8.4 munsell_0.5.0
## [17] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2 evaluate_0.11
## [21] labeling_0.3 knitr_1.20 fansi_0.3.0 broom_0.5.0
## [25] Rcpp_0.12.18 scales_1.0.0 backports_1.1.2 jsonlite_1.5
## [29] hms_0.4.2 digest_0.6.16 stringi_1.2.4 grid_3.5.3
## [33] rprojroot_1.3-2 cli_1.0.0 bitops_1.0-6 magrittr_1.5
## [37] lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.2 xml2_1.2.0
## [41] lubridate_1.7.4 assertthat_0.2.0 rmarkdown_1.10 httr_1.3.1
## [45] rstudioapi_0.7 R6_2.2.2 nlme_3.1-139 compiler_3.5.3
You can find the source code for this analysis and others at my misc-analysis public GitLab repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.