Introduction

In this document I do some basic analyses on the list of Japanese films available for streaming on the Criterion Channel. (I’m a Criterion Channel charter subscriber and have an interest in both Japanese cinema and data visualization, so I thought this would be a fun thing to do.)

For readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information, check out the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the tidyverse package of functions for general data manipulation, the tools package to get the md5sum function, and the caTools package to get the runmean function.

library("tidyverse")
library("tools")
library("caTools")

Data sources

I use data from two sources:

  • cc-japanese-films.csv: a hand-compiled Google spreadsheet of Japanese films on the Criterion Channel as of April 13, 2019, downloaded in CSV format.
  • tcc-library-japan.csv: a copy in CSV format of an auto-generated list of Criterion Channel films tagged as being associated with Japan, from the site tcclibrary.com.

Each row of each file contains data on a single film. The first file (the hand-curated list) has the following variables:

  • Year. The (four-digit) year in which the film was released.
  • DirectorGiven. The given name of the film’s director.
  • DirectorFamily. The family name of the film’s director.
  • Film. The title of the film.
  • CCLink. The URL of the film on the Criterion Channel site.
  • Series. The name of the series or franchise to which the film belongs (if applicable). (This is also used to mark sequels or films released in multiple parts.)
  • SeriesNum. The number of the film within its series or franchise (if applicable).

The second file (the tcclibrary.com list) has the following variables:

  • Film. The title of the film.
  • Director. The name of the film’s director, in Western name order.
  • Country. The country tag(s) for the film. For most films this will be a single tag indicating the country in which the film was produced. If a second tag is present it can indicate a co-production (e.g., Dersu Uzala) or a film produced in one country but about another (e.g., Hiroshima mon amour).
  • Duration. The length of the film in hh:mm:ss format.
  • Year. The (four-digit) year in which the film was released.

I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.

stopifnot(md5sum("cc-japanese-films.csv")=="54970ee713ec8976337b4ce7c788fba4")
stopifnot(md5sum("tcc-library-japan.csv")=="a152728dccabe7747c1f307a946b1afd")
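
The checksum comparison can also be wrapped in a small helper that reports which file failed. The following is only a sketch: the function name verify_md5 and the throwaway file are hypothetical and are not part of the original analysis.

```r
library(tools)

# Hypothetical helper (not part of the original analysis): stop with a
# descriptive message if a file's MD5 checksum differs from the expected value.
verify_md5 <- function(path, expected) {
  actual <- unname(md5sum(path))  # md5sum() returns NA for unreadable files
  if (is.na(actual) || actual != expected) {
    stop("Checksum mismatch for ", path, ": got ", actual)
  }
  invisible(TRUE)
}

# Demonstration on a throwaway file rather than the real data files.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Year,Film", "1954,Example Film"), tmp)
expected <- unname(md5sum(tmp))  # in practice, a value recorded earlier
verify_md5(tmp, expected)
```

Compared with a bare stopifnot call, the error message names the file and the checksum that was actually found.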

Reading in and preparing the data

I begin by reading in the two CSV files; the col_types parameter is a string identifying each column as a character string (“c”) or an integer (“i”).

films <- read_csv("cc-japanese-films.csv", col_types="iccccci")
tcclibrary <- read_csv("tcc-library-japan.csv", col_types="cccci")
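
To show what the col_types string does, here is a minimal sketch that reads a toy file with the same column layout as the tcclibrary.com list. The rows are made up for illustration and do not come from the real data.

```r
library(readr)

# Toy file with the same five columns as the tcclibrary.com list
# (film titles, directors, and durations here are invented).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Film,Director,Country,Duration,Year",
  "Example Film A,Director One,Japan,01:30:00,1954",
  "Example Film B,Director Two,Japan,02:05:00,1962"
), tmp)

# "cccci": four character columns followed by one integer column.
toy <- read_csv(tmp, col_types = "cccci")

class(toy$Duration)  # read as a plain character string
class(toy$Year)      # read as an integer
```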

I add a new field Director to the hand-curated list to match the corresponding field in the tcclibrary.com list, replacing the previous fields for given name and family name.

films <- unite(films, Director, DirectorGiven, DirectorFamily, sep=" ")
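
As a small illustration of what unite does, the following sketch applies it to a two-row toy table (the film titles are placeholders).

```r
library(tidyr)
library(tibble)

toy <- tibble(
  DirectorGiven  = c("Akira", "Yasujiro"),
  DirectorFamily = c("Kurosawa", "Ozu"),
  Film           = c("Example Film A", "Example Film B")
)

# unite() pastes the two name columns together with the separator and,
# by default, drops the original columns.
toy_united <- unite(toy, Director, DirectorGiven, DirectorFamily, sep = " ")

names(toy_united)
toy_united$Director
```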

I then check for discrepancies between the two lists, as follows:

  1. I use the anti_join function to check for films that are in my hand-curated list but not in the TCC library list.
  2. I use the select function to retain (and display) only the film’s title.
  3. I follow the same process in reverse to check for films that are in the TCC library list, but not in my hand-curated list.
anti_join(films, tcclibrary, by="Film") %>%
  select(Film)
## # A tibble: 5 x 1
##   Film                                 
##   <chr>                                
## 1 The 47 Ronin Part 1                  
## 2 The 47 Ronin Part 2                  
## 3 The Tale of Zatoichi                 
## 4 Lone Wolf and Cub: Sword of Vengeance
## 5 Hanzo the Razor: Who’s Got the Gold?
anti_join(tcclibrary, films, by="Film") %>%
  select(Film)
## # A tibble: 10 x 1
##    Film                                    
##    <chr>                                   
##  1 THE 47 RONIN: Part 1                    
##  2 THE 47 RONIN: Part 2                    
##  3 Hiroshima mon amour                     
##  4 Zatoichi #1: The Tale of Zatoichi       
##  5 Lone Wolf and Cub #1: Sword of Vengeance
##  6 Hanzo the Razor: Who’s Got the Gold     
##  7 Mishima: A Life in Four Chapters        
##  8 A Brief History of Time                 
##  9 Night on Earth                          
## 10 Yi Yi
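
Because anti_join matches titles exactly, a film whose title is spelled differently in the two lists shows up in both results. A minimal sketch with placeholder titles:

```r
library(dplyr)
library(tibble)

mine   <- tibble(Film = c("Film A", "Film B", "Film C"))
theirs <- tibble(Film = c("Film A", "FILM B", "Film D"))

# Rows of `mine` with no exact match in `theirs`.
only_mine <- anti_join(mine, theirs, by = "Film")

# Rows of `theirs` with no exact match in `mine`.
only_theirs <- anti_join(theirs, mine, by = "Film")

only_mine$Film
only_theirs$Film
```

Here "Film B" and "FILM B" are reported on both sides, just as the 47 Ronin, Zatoichi, Lone Wolf and Cub, and Hanzo the Razor titles are above.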

The discrepancies can be accounted for as follows:

  • The hand-curated list contains what I consider to be more correct spellings for five of the Japanese film titles; these five appear in both results, once under each spelling.
  • The tcclibrary.com list contains five films that are tagged “Japan” because they are co-productions and/or touch on Japanese subjects in some way, but are not directed by Japanese directors.

I verify that the listed year and director for each film match between the two lists, as follows:

  1. I join the two lists on the common field Film.
  2. The two lists have two other common fields, Director and Year. The joined list renames these as Director.x and Year.x for fields coming from the hand-curated list, and Director.y and Year.y for fields coming from the tcclibrary.com list.
  3. I filter for any films where the director and/or year do not match between the two lists.
  4. I retain and display only the films’ titles and the fields of interest.
films %>%
  inner_join(tcclibrary, by="Film") %>%
  filter(Director.x != Director.y | Year.x != Year.y) %>%
  select(Film, Year.x, Year.y, Director.x, Director.y)
## # A tibble: 1 x 5
##   Film               Year.x Year.y Director.x                Director.y   
##   <chr>               <int>  <int> <chr>                     <chr>        
## 1 Sensation of the …   1966   1966 Taguchi/Nobumasa Suketar… Taguchi Suke…

The one anomaly is for the film Sensation of the Century, which has two directors and for which the directors’ names are formatted differently in the two lists.
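
The .x and .y suffixes can be seen on a toy pair of tables. This is a sketch with placeholder titles and names, not the real data.

```r
library(dplyr)
library(tibble)

a <- tibble(Film     = c("Film A", "Film B"),
            Director = c("Director One", "Director Two"),
            Year     = c(1954L, 1962L))
b <- tibble(Film     = c("Film A", "Film B"),
            Director = c("Director One", "D. Two"),
            Year     = c(1954L, 1962L))

# Columns present in both tables but not used as the join key get the
# suffixes .x (left table) and .y (right table).
mismatches <- a %>%
  inner_join(b, by = "Film") %>%
  filter(Director.x != Director.y | Year.x != Year.y) %>%
  select(Film, Year.x, Year.y, Director.x, Director.y)

mismatches
```

Only "Film B" is returned, because its director is written differently in the two tables.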

Analysis

I do analyses to answer the questions taken up in the subsections below.

Criterion Channel Japanese releases per year

I create a line plot showing the number of Japanese films on the Criterion Channel released in each year, along with a moving average to smooth out drastic year-to-year fluctuations in the number of releases.

I do this as follows:

  1. I group the films in the hand-curated list by year and create a new variable NF with the number of films in each group.
  2. I add to the resulting list a set of rows representing years in which there were no releases at all (NF=0).
  3. I calculate the five-year moving average of the number of films released (using the runmean function) and then add that to the list of films as a new variable NFMA.
  4. I use ggplot to create the plot, using geom_line once to plot a line for the actual number of releases (NF) and then again to plot a line for the five-year moving average (NFMA). The former line I plot in gray in a slightly smaller width than normal, and the latter line I plot in blue in a slightly larger width.
  5. I specify tick marks on the x-axis every five years from 1925 to 2015.
  6. I add a label for the y-axis (the x-axis label is taken from the variable Year), along with a plot title, subtitle, and caption.
  7. I use the theme_minimal theme for a clean look, and then tweak it slightly for readability, displaying the x-axis tick mark labels at an angle, moving the x- and y-axis labels slightly away from the tick mark labels, and moving the caption slightly lower.
fpy <- films %>%
  group_by(Year) %>%
  summarize(NF = n())

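# Add zero-count rows for years with no films, then sort by year so that
# runmean() below averages over consecutive years rather than over row order.
fpy <- bind_rows(fpy, tibble(Year=setdiff(1920:2019, fpy$Year), NF=0L)) %>%
  arrange(Year)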

mov_avg <- runmean(fpy$NF, 5)

fpy <- bind_cols(fpy, NFMA=mov_avg)

fpy %>%
  ggplot() +
  geom_line(mapping=aes(x=Year, y=NF), color="gray", size=0.75) +
  geom_line(mapping=aes(x=Year, y=NFMA), color="blue", size=1.5) +
  scale_x_continuous(breaks=seq(1925, 2015, 5)) +
  ylab("Films") +
  labs(title="Japanese Films on the Criterion Channel",
       subtitle="Number of Available Films for Each Year (Actual and 5-Year Moving Average)",
       caption="Data sources: The Criterion Channel and tcclibrary.com") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15)))
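
To make the moving-average step concrete, here is a base-R sketch on made-up counts. The helper centered_mean is hypothetical. It truncates the window near the ends of the series, which is assumed (not verified here) to mirror the default behavior of runmean.

```r
# Film counts for a few years (made-up numbers); 1952 has no films.
counts <- data.frame(Year = c(1950, 1951, 1953, 1954), NF = c(2, 4, 6, 8))

# Fill in years with no films, then sort so that neighbouring rows are
# neighbouring years; a moving average depends on row order.
all_years <- 1950:1954
missing   <- setdiff(all_years, counts$Year)
counts    <- rbind(counts,
                   data.frame(Year = missing, NF = rep(0, length(missing))))
counts    <- counts[order(counts$Year), ]

# Hypothetical helper: centered moving average of odd width k. Near the
# ends the window is truncated to the values that exist.
centered_mean <- function(x, k) {
  half <- (k - 1) %/% 2
  sapply(seq_along(x), function(i) {
    mean(x[max(1, i - half):min(length(x), i + half)])
  })
}

counts$NFMA <- centered_mean(counts$NF, 5)
counts
```

The middle year (1952) gets the average of all five counts, while the first and last years average over only three values.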

The release dates of the available films range from 1929 to 2013; the maximum number of films from any given year is 17.

The conventional wisdom (as found at Wikipedia for example) is that the 1950s were the golden age of Japanese cinema. However if we judge inclusion on the Criterion Channel as an endorsement, that golden age extended at least until the mid-1960s. It’s also interesting that the Criterion Channel has relatively few films from the mid-1970s on, with many of those being from Juzo Itami (see below); the 1930s are actually better represented in terms of the number of films per year.

Which Japanese directors are best represented?

I create a bar chart showing the number of Japanese films on the Criterion Channel for each director, restricting the plot to directors with at least five releases available.

I do this as follows:

  1. I group the films in the hand-curated list by director and create a new variable NF with the number of films in each group.
  2. I filter the results to include only directors with five or more films on the list.
  3. I use ggplot to create the plot, using geom_bar to plot the number of films per director, reordering the directors on the x-axis to list them in decreasing order of number of films.
  4. I add labels for the x- and y-axis, along with a plot title, subtitle, and caption.
  5. I use the theme_minimal theme for a clean look, and again tweak it slightly for readability as I did for the plot above.
fpd <- films %>%
  group_by(Director) %>%
  summarize(NF=n()) %>%
  filter(NF >= 5)

fpd %>%
  ggplot(mapping=aes(x=reorder(Director, -NF), y=NF)) +
  geom_bar(stat="identity") +
  xlab("Director") +
  ylab("Films") +
  labs(title="Japanese Directors with Five or More Films on the Criterion Channel",
       subtitle="Number of Films Available for Each Director",
       caption="Data sources: The Criterion Channel and tcclibrary.com") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=5))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15)))
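
The reorder call is what puts the bars in decreasing order. A minimal base-R sketch with placeholder names and counts:

```r
directors <- c("Director One", "Director Two", "Director Three")
n_films   <- c(5, 12, 8)

# reorder() returns a factor whose levels are sorted by the second
# argument; negating the counts sorts from most films to fewest.
by_count <- reorder(directors, -n_films)

levels(by_count)
```

ggplot draws discrete axes in factor-level order, so the director with the most films comes first.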

Famous Japanese directors like Kurosawa and Ozu are well-represented on the Criterion Channel, but Keisuke Kinoshita, the director with the most releases (42), is not a household name in America. However he was a very popular filmmaker in Japan: his 1954 film Twenty-Four Eyes won the 1955 Kinema Junpo best film award (apparently beating out, among others, Kurosawa’s Seven Samurai and Mizoguchi’s Sansho the Bailiff), and is rated #6 on the Kinema Junpo list of top Japanese films. Kinoshita was also very prolific and had a very long career (see below). It’s therefore not surprising that so many of his films are available on the Criterion Channel.

Directors’ careers as represented on the Criterion Channel

I create a box plot (also known as a box-and-whiskers plot) for each Japanese director with ten or more releases on the Criterion Channel, summarizing the years in which they directed their available films. The plot is interpreted as follows:

  • The vertical line in the middle of each box represents the median or mid-point year of the director’s career as represented on the Criterion Channel: half of their available films were released before that time, and half afterwards.
  • The left edge of each box represents the 25th percentile of the director’s available films: 25% of the films were released prior to this date.
  • The right edge of each box represents the 75th percentile of the director’s available films: 75% of the films were released prior to this date.
  • The box itself therefore represents the period during which 50% of the director’s available films were released. In statistical jargon the length of this period is known as the interquartile range or IQR. This can be thought of as the mid-career period for a director as represented on the Criterion Channel.
  • The “whiskers” to the left and right of the box contain years outside the IQR, representing the main early and late phases of a director’s career as represented on the Criterion Channel.
  • The left whisker contains all available early films released within a period no more than 1.5 times the length of the IQR. Thus, for example, if a director’s available mid-career films were released in an eight-year period (the IQR) from 1951 through 1958, the left whisker would extend to include any available films released in the 12-year period (1.5 times 8) from 1939 through 1950. (If the director’s first available film were released in, say, 1947, the left whisker would extend only to that year.)
  • The right whisker contains all available late films released within a period no more than 1.5 times the length of the IQR. Using our prior example of an eight-year IQR running from 1951 through 1958, the right whisker would extend to include any available films released in the 12-year period from 1959 through 1970. (Again, if the director’s last available film were released in, say, 1967, the right whisker would extend only to that year.)
  • For most directors the left and right whiskers will cover all their films available on the Criterion Channel. However in some cases a director may release a film much earlier or much later than the main portion of their career. If there are any such available films outside the early-career and late-career periods covered by the left and right whiskers respectively, those outliers are plotted as individual dots in the year of release.
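
The box, whiskers, and outliers described above can be computed by hand. The following base-R sketch uses made-up release years for one hypothetical director and the common convention of measuring the IQR as the difference between the 75th and 25th percentiles; it is assumed, not verified here, that this matches the defaults of the plotting function used below.

```r
# Release years for one hypothetical director (made-up values).
years <- c(1947, 1950, 1951, 1953, 1955, 1957, 1958, 1960, 1990)

# 25th percentile, median, and 75th percentile: the box edges and midline.
q   <- unname(quantile(years, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Whiskers reach the most extreme film within 1.5 * IQR of the box edges.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
whisker_lo  <- min(years[years >= lower_fence])
whisker_hi  <- max(years[years <= upper_fence])

# Anything beyond the whiskers is drawn as an individual outlier point.
outliers <- years[years < lower_fence | years > upper_fence]
```

For these values the box runs from 1951 to 1958 with the median at 1955, the whiskers stop at the first and last films inside the fences (1947 and 1960), and the 1990 film is plotted as an outlier.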

The plot itself is created as follows:

  1. I group the films in the hand-curated list by director and create two new variables FY and NF, representing the first year in which the director released a film available on the Criterion Channel and the number of that director’s films available.
  2. I then filter the resulting list to include prolific directors only, defined as having at least ten films available on the Criterion Channel.
  3. The summarization step above retains only the fields Director, FY, and NF. I therefore join this list with the full list of films to produce a new list containing only films for the most prolific directors. This list will include the other variables of interest, most notably the year of release (Year).
  4. I retain only the fields needed for the plot, including Director and Year but also FY (see the next step).
  5. I use ggplot to create the plot, using geom_boxplot to create a boxplot for each director. I reorder the list of directors to display them in order of the first year for which they have a film available on the Criterion Channel.
  6. I specify a y-axis running from 1930 to 2000, with tick marks every five years.
  7. I flip the coordinate system so that the former x-axis (Director) now becomes the y-axis, with the former y-axis (Year) now the x-axis.
  8. I use the theme_minimal theme for a clean look, and again tweak it slightly for readability.
prolific <- films %>%
  group_by(Director) %>%
  summarize(FY=min(Year), NF=n()) %>%
  filter(NF >= 10)

films %>% inner_join(prolific, by="Director") %>%
  select(Director, Year, FY) %>%
  ggplot(mapping=aes(x=reorder(Director, FY), y=Year)) +
  geom_boxplot() +
  xlab("Director") +
  scale_y_continuous(breaks=seq(1930, 2000, 5)) +
  coord_flip() +
  labs(title="Japanese Directors with Ten or More Films on the Criterion Channel",
       subtitle="Early, Middle, and Late Periods, Career Midpoint, and Outliers",
       caption="Data sources: The Criterion Channel and tcclibrary.com") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15)))
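
The summarize, filter, and join-back pattern used for this plot can be sketched on a toy table. The director names are placeholders, and the threshold is lowered from ten to two so that the toy data has something to show.

```r
library(dplyr)
library(tibble)

toy_films <- tibble(
  Director = c("A", "A", "A", "B", "B", "C"),
  Year     = c(1950L, 1952L, 1960L, 1948L, 1955L, 1970L)
)

# One row per director: first available year and number of films.
toy_prolific <- toy_films %>%
  group_by(Director) %>%
  summarize(FY = min(Year), NF = n()) %>%
  filter(NF >= 2)   # threshold lowered from 10 for the toy data

# Joining back restores one row per film, now carrying FY along, and
# drops films by directors who did not pass the filter.
toy_joined <- toy_films %>%
  inner_join(toy_prolific, by = "Director") %>%
  select(Director, Year, FY)

# Directors ordered by the year of their first available film.
levels(reorder(toy_joined$Director, toy_joined$FY))
```

Director C is dropped by the filter, and B is listed before A because B’s first film (1948) is earlier than A’s (1950).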

Several things are worth noting on this plot:

  • Although the careers of Yasujiro Ozu and Mikio Naruse (as represented on the Criterion Channel) span roughly the same time period, Ozu’s career is much more front-loaded compared to Naruse’s: Over half of Ozu’s available films are from the 1930s, while most of Naruse’s available films were released in the 1950s and 1960s.
  • Both Akira Kurosawa and Keisuke Kinoshita had very long and mostly-overlapping careers as represented on the Criterion Channel. They also released films very late in their careers well after the period when they were at peak productivity.
  • Juzo Itami has one film available from the early 1960s, and no other films available until the 1980s and 1990s.

Japanese film franchises represented on the Criterion Channel

Japanese cinema has featured several successful film franchises, some of which have films available on the Criterion Channel. Here I look at the franchises among the available films, identified as film series with at least four films in the series. This eliminates two-part films, films with only a single sequel, and trilogies, leaving only what I’d consider to be a true franchise. For each such franchise I plot the release dates for the franchise’s films.

The plot itself is created as follows:

  1. I group the films in the hand-curated list by their series tag (if present) and create two new variables FY and NF, representing the first year for which a film in the series is available on the Criterion Channel and the number of available films in that series respectively.
  2. I then filter the resulting list to include only series for which at least four films are available on the Criterion Channel.
  3. The summarization step above retains only the fields Series, FY, and NF. Of these I keep only Series and FY, since NF is not needed once the filtering is done.
  4. I join this list with the full list of films to produce a new list containing only films for the series of interest. This list includes the other variables of interest, most notably the year of release (Year), along with FY (see the next step).
  5. I use ggplot to create the plot, using geom_jitter to plot a point for each film’s release date while trying to avoid having points for films released the same year overlap. I reorder the list of franchises to display them in order of the first year in which they have a release available on the Criterion Channel. I tweak the amount of jitter a bit to specify that jittering be done around the year value but not the franchise value. (I use the height parameter for this because the y-axis will become the x-axis after flipping coordinates below.)
  6. I specify a y-axis running from 1950 to 1980, with tick marks every five years. (Again, this becomes the x-axis after flipping coordinates.)
  7. I flip the coordinate system so that the former x-axis (Series) now becomes the y-axis, with the former y-axis (Year) now the x-axis.
  8. I use the theme_minimal theme for a clean look, and then tweak it slightly for better readability.
franchises <- films %>%
  filter(!is.na(Series)) %>%
  group_by(Series) %>%
  summarize(FY=min(Year), NF=n()) %>%
  filter(NF >= 4) %>%
  select(Series, FY)

f_films <- films %>%
  inner_join(franchises, by="Series")

f_films %>%
  ggplot(mapping=aes(x=reorder(Series, FY), y=Year)) +
  geom_jitter(width=0, height=0.4) +
  xlab("Franchise") +
  scale_y_continuous(breaks=seq(1950, 1980, 5)) +
  coord_flip() +
  labs(title="Japanese Film Franchises on the Criterion Channel",
       subtitle="Film Release Dates",
       caption="Data sources: The Criterion Channel and tcclibrary.com") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15)))
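
One detail of the code above is the is.na filter: films that belong to no series have a missing Series value, and without the filter they would be counted together as if they were one large series. A sketch on a toy table (titles, series names, and years are placeholders):

```r
library(dplyr)
library(tibble)

toy_films <- tibble(
  Film   = paste("Film", 1:7),
  Series = c("S1", "S1", "S1", "S1", "S2", NA, NA),
  Year   = c(1962L, 1963L, 1964L, 1965L, 1970L, 1954L, 1960L)
)

# Without the is.na() filter, the NA rows form a group of their own.
with_na <- toy_films %>%
  group_by(Series) %>%
  summarize(NF = n())

# With the filter, only real series are counted, and only those with
# at least four films are kept.
toy_franchises <- toy_films %>%
  filter(!is.na(Series)) %>%
  group_by(Series) %>%
  summarize(FY = min(Year), NF = n()) %>%
  filter(NF >= 4) %>%
  select(Series, FY)
```

In the toy data only S1 qualifies as a franchise, with 1962 as its first year.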

The Godzilla franchise is the longest-lived franchise on the Criterion Channel, and continues to this day. The Zatoichi: The Blind Swordsman series has more entries, however, and (unlike the Godzilla series) is essentially complete on the Criterion Channel: only the last film in the series, Zatoichi: Darkness Is His Ally, is not available.

Appendix

Caveats

The major caveat is that the Criterion Channel does not have all Japanese films, or all Japanese films by the directors whose films it does have, or even all the top Japanese films ranked by critical reputation. As but one example, it does not have Yasujiro Ozu’s Days of Youth, the earliest Ozu film to have survived. As another example, the Criterion Channel has only five of the top ten films on the Kinema Junpo list of top Japanese films.

References

I originally created a Google spreadsheet by hand to record Japanese films I wanted to watch on the Criterion Channel. I then read the article “Here’s a List of Every Film Available on The Criterion Channel” at The Film Stage and found out about the film list at tcclibrary.com created by the Reddit user sciencehuh. I accessed the site on April 12, 2019, copied the list of films tagged “Japan”, put them into a Google spreadsheet, downloaded the spreadsheet in CSV format, and used that to cross-check my hand-curated list.

Since I began work on these plots the Criterion Channel has released its own official list of films. However if I filter the list to show only films with country equal to Japan then only 314 results are returned, about twenty fewer than are in my hand-curated spreadsheet. I have not tried to reconcile the sources, but it’s possible that the Criterion Channel list consolidates multiple films into one entry in some cases.

Suggestions for others

I am done with this project, at least for now. However in case someone else wants to continue it in some fashion, here are some suggestions for further work:

  • Reconcile the official Criterion Channel list with the data sources used for this analysis.
  • Extend the list used in this analysis to include other Japanese films not on the Criterion Channel that were directed by the same directors who have films available on the service.
  • Find as complete a list of Japanese films as possible and do similar analyses against such an exhaustive list.
  • Apply the methods used in this analysis to Criterion Channel films from other countries. (The United States and France would be good candidates, since they have a comparable number of films available on the Criterion Channel.)

Environment

I used the following R environment in doing the analysis above:

sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] tools     stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2   caTools_1.17.1.1 forcats_0.3.0    stringr_1.3.1   
##  [5] dplyr_0.7.6      purrr_0.2.5      readr_1.1.1      tidyr_0.8.1     
##  [9] tibble_1.4.2     ggplot2_3.0.0    tidyverse_1.2.1 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4 haven_1.1.2      lattice_0.20-38  colorspace_1.3-2
##  [5] htmltools_0.3.6  yaml_2.2.0       utf8_1.1.4       rlang_0.2.2     
##  [9] pillar_1.3.0     glue_1.3.0       withr_2.1.2      modelr_0.1.2    
## [13] readxl_1.1.0     bindr_0.1.1      plyr_1.8.4       munsell_0.5.0   
## [17] gtable_0.2.0     cellranger_1.1.0 rvest_0.3.2      evaluate_0.11   
## [21] labeling_0.3     knitr_1.20       fansi_0.3.0      broom_0.5.0     
## [25] Rcpp_0.12.18     scales_1.0.0     backports_1.1.2  jsonlite_1.5    
## [29] hms_0.4.2        digest_0.6.16    stringi_1.2.4    grid_3.5.3      
## [33] rprojroot_1.3-2  cli_1.0.0        bitops_1.0-6     magrittr_1.5    
## [37] lazyeval_0.2.1   crayon_1.3.4     pkgconfig_2.0.2  xml2_1.2.0      
## [41] lubridate_1.7.4  assertthat_0.2.0 rmarkdown_1.10   httr_1.3.1      
## [45] rstudioapi_0.7   R6_2.2.2         nlme_3.1-139     compiler_3.5.3

Source code

You can find the source code for this analysis and others at my misc-analysis public GitLab repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.