Executive summary

This is an experiment on downloading, compiling and preparing data about Argentinean movies, with the primary purpose of practicing my R programming and data analysis skills. Since June 2018 I have been taking some online courses on these subjects, and since late August I have worked on this project to apply, exercise and enhance my skills. By doing this, I learnt a lot more.

This report presents the steps I took to do so, and also provides the code I used, as well as results and some comments about difficulties and workarounds I found on the way.

Readers unfamiliar with the R programming language, or uninterested in the technical process, may prefer to skip the ‘Downloading and preparing data’ and ‘Assembling data’ sections, as they focus on programming steps. A table of contents with hyperlinks is available on the left.

This experiment can be summarized as an attempt to identify films from Argentina in data sets from a comprehensive online film database; to obtain and put together different data; and then to ask and answer some questions. Given that the intent is to practice skills, some concepts have been used with additional freedom.

Preliminary comments

This is intended as a learning-by-doing experiment, after taking some online courses and doing some other experiments on my own.

While working on this, I realized at a later stage that I had made some mistakes, which required me to go over what I had already done; I did my best to update this report whenever needed, trying to keep it consistent without re-starting from scratch. Some parts of the report were generated by evaluating code chunks (something I learnt while working on the third part). Many code chunks have evaluation turned off, especially those that process the very large raw datasets, and the displayed output is copied, pasted and formatted from my R console.

This exercise relies on some web scraping (getting data from HTML files available online), filtering, arranging and joining together datasets, and hopefully getting some meaningful insight or discovery - or, most probably, replicating by myself a finding already published by someone else.

I would like to clarify that I am just getting started with data analysis and I am no expert in the seventh art; I do this for the sake of learning and challenging myself.

As this was my first RMarkdown experiment, and because of some difficulties I found on the way, I later realized that I should have displayed outputs more often (which I tried to fix retroactively).

Why Argentinean movies and what for? Because I once discovered I could get heaps of data from the IMDB. My country of origin is Argentina and I am curious about films produced there.

How representative and valid are these data/findings? As far as I can tell, the IMDB is a very comprehensive database. I began this experiment by late August 2018, so the data should be valid until that point in time. Due to some difficulties in data download and processing, I did not want to update the data source again.

Data source

IMDB datasets are available from their interface website, and they do not seem to contain information about the nationality of movies. This might be somewhat difficult to define anyway, as many movies are produced internationally or cross-nationally.

Many of their datasets are focused on particular aspects of movies or artists, and so they need to be joined together.

Additionally, some information in this project has been obtained from Wikipedia.

Acknowledgements and citation

Data processing in this project has been performed with the R programming language, and I also used the RStudio application for a better UI. R has many packages that improve the base features, which I cite below, hoping I do not miss any:

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.

Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2018). dplyr: A Grammar of Data Manipulation. R package version 0.7.5. https://CRAN.R-project.org/package=dplyr

Matt Dowle and Arun Srinivasan (2018). data.table: Extension of data.frame. R package version 1.11.4. https://CRAN.R-project.org/package=data.table

Hadley Wickham (2016). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.2. https://CRAN.R-project.org/package=rvest

Hadley Wickham (2017). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.2.0. https://CRAN.R-project.org/package=stringr

Ramnath Vaidyanathan, Yihui Xie, JJ Allaire, Joe Cheng and Kenton Russell (2018). htmlwidgets: HTML Widgets for R. R package version 1.2. https://CRAN.R-project.org/package=htmlwidgets

Richard Cotton (2017). rebus: Build Regular Expressions in a Human Readable Way. R package version 0.1-3. https://CRAN.R-project.org/package=rebus

Hadley Wickham (2017). httr: Tools for Working with URLs and HTTP. R package version 1.3.1. https://CRAN.R-project.org/package=httr

Filtering Argentine movies data

To filter movies, based on the available IMDB data, I will match those movies whose director we know was Argentine.

Discussion begins I want to make it clear: this is an arbitrary decision I made and I criticize it as well, because an Argentine director may work on a completely foreign movie, with nothing from or related to Argentina but the appointed director; inversely, Argentine producers can appoint directors from abroad; or, even more confusing, directors born in another country may have made their career in Argentina. I do not know to what extent the film industry is trans-national (especially if we look into where the money comes from or goes to), but there must be many cases in which it is difficult or impossible to discern what country a film is actually or exclusively from.

Once again, this is just an experiment to test and hone my R skills; therefore, I just chose the director's nationality as a (faulty) proxy variable for film nationality. End of discussion

I found a quite comprehensive list of Argentinean director names in this Wikipedia page, although this proved to be a problem, as I will tell later.

Downloading and preparing data

Obtaining director names

The Wikipedia page has 2 sections listing Argentinean directors. Web scraping is performed with the rvest package.

library(rvest)
directors_output1 <- character() # Define vector outside of the loop
for(i in 1:13){  # Loop over the 13 letter sections on the page (see url variable)
  url <- "https://es.wikipedia.org/wiki/Categor%C3%ADa:Directores_de_cine_de_Argentina"
  directors_input <- url %>%
  read_html() %>%
  html_nodes(xpath = paste('/html/body/div[3]/div[3]/div[4]/div[2]/div[2]/div/div/div[', i, ']/ul')) %>% # xpath is the HTML location.
  html_text() # Obtain HTML text from the node
  # The result is a single string with a "\n" between every name
  directors_split <- strsplit(directors_input, "\n")
  directors_output1 <- c(directors_output1, directors_split)
}

This code chunk is duplicated for the second Wikipedia page and it returns a list with the remaining Argentinean film directors. Then, I put them together:

directors_full <- c(directors_output1, directors_output2) # Output variable is a list full of names, so...
directors_vfull <- unlist(directors_full) # Transform it to a plain regular vector
# What did I get? (Showing some interesting cases)
directors_vfull[25:35]
"Ciro Ayala", "Fernando Ayala",  "Pablo Bardauil", "Daniel Barone", "Luis Barone",  "Juan Batlle Planas (hijo)", "Guillermo Battaglia", "Tristán Bauer", "Luis Bayón Herrera", "Delfor María Beccaglia","Derlis María Beccaglia", ...   

Trimming director names

Looking at the directors_vfull vector, we notice some names have parentheses to the right (see “Juan Batlle Planas (hijo)” in the output above).

library(rebus)
library(stringr)
library(htmlwidgets)
str_view(directors_vfull, pattern = SPC %R% capture(OPEN_PAREN %R% one_or_more(WRD) %R% CLOSE_PAREN) %R% END, match = TRUE)
# Showing first six in output
Juan Batlle Planas (hijo)
Armando Bó (guionista)
Carlos Borcosque (hijo)
Juan Cabral (director)
Miguel Ángel Cárcano (cineasta)
Carlos Echeverría (director)

We replace that part:

directors_trim <- str_replace_all(directors_vfull, pattern = SPC %R% capture(OPEN_PAREN %R% one_or_more(WRD) %R% CLOSE_PAREN) %R% END, "")
> directors_trim[29:34] # Notice the second one
[1] "Luis Barone"            "Juan Batlle Planas"     "Guillermo Battaglia"    "Tristán Bauer"         
[5] "Luis Bayón Herrera"     "Delfor María Beccaglia"

One last bit is about the accents on Spanish vowels. Upon testing my code in further steps, I realized that many directors with accents in their names were not showing up. So, I created the accent_buster function, which replaces all accented characters in strings:

accent_buster <- function(el){  
  output <- chartr("Á", "A", el)
  output <- chartr("É", "E", output)
  # ... and so forth for the remaining accented characters ...
  output <- chartr("á", "a", output)
  output <- chartr("é", "e", output)
  return(output)
}
directors_ready <- accent_buster(directors_trim)
head(directors_ready)
"Ariel Abadi", "Angel Acciaresi", "Jorge Luis Acha", "Ezequiel Acuña", "Alejandro Agresti", "Alejandro Arroz" 

Voila! We now have our Argentinean directors vector ready to filter IMDB data.
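
As an aside, I later noticed the same replacement can probably be done in a single chartr() call, since it maps characters one-to-one; a compact sketch covering the common Spanish accented characters:

# One-call sketch: map each accented character to its plain counterpart
accent_buster_compact <- function(el){
  chartr("ÁÉÍÓÚÜÑáéíóúüñ", "AEIOUUNaeiouun", el)
}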

Downloading and extracting data

Download

Much to my dismay, I was not able to pull the IMDB datasets from the R console, because the files are available online as .gz, a file compression standard. So I had to get a .gz processor, download the files from the IMDB user interface and unzip them on my local drive before loading them into R.
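
For reference, here is a minimal sketch of how the download and decompression could also have been scripted from R, assuming the standard https://datasets.imdbws.com file names and the R.utils package (I did not go this way at the time):

library(R.utils) # provides gunzip()
url <- "https://datasets.imdbws.com/title.basics.tsv.gz"  # assumed dataset URL
destfile <- "Data/datasets/imdb/title.basics.tsv.gz"      # illustrative local path
download.file(url, destfile, mode = "wb")
gunzip(destfile, remove = FALSE) # writes title.basics.tsv next to the .gz file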

Filtering Argentine movies from data sets

Note that in this part I went through a lot of trial and error to get working code chunks. For the purpose of clarity, I only include below the last working version of the code.

There was a further setback regarding the size of the files to be processed in R. I tried, many times, different variations of the following code, basically to pull the file in parts, filtering it by matches against the directors_ready vector. However, it seems that the 500+ MB file was too much for my humble traveller laptop: I kept getting “file too large” errors.

So, I finally came up with the idea of running the code in base R instead of RStudio. It worked! First I manually ran each line of code and succeeded in compiling an output data frame filtered by the Argentinean directors' names.

library(dplyr)
library(data.table)
# Pulling a title code vector to filter with
arg_tdd <- fread("Data/datasets/imdb/arg_titles_directors_details.csv", header = TRUE)
title_filter <- pull(arg_tdd, 2)
# Preparing the output data table. "cast_filter.csv" was the name I gave the file after decompressing
output <- fread("Data/datasets/imdb/cast.tsv", header = TRUE, nrows = 0)
output <- output[,c(2:7)] # removing the automatic index column on the left in CSV files
h_names <- colnames(output)
# Extracting and filtering
names_raw <- fread("Data/datasets/imdb/cast.tsv", nrows = 1000000, skip = (X * 1000000)) # "X" is the iteration number, changed manually every time
colnames(names_raw) <- h_names
filter <- names_raw[which(names_raw$tconst %in% title_filter)]
rm(names_raw)
output <- rbind(output, filter, fill = TRUE)
rm(filter)

The code above was intended to work as a function, but I was not able to run it all at once in base R. Instead I ran it line by line, debugging and double-checking on the way, manually replacing the iterator variables. Something went wrong when I tried to copy and paste the function into the plain R engine to test it (no warning, no error; it just could not allocate the function element to run it). I saved the resulting data frame to import it into RStudio and continue from there.
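
As an aside, the readr package offers chunked reading that might have spared some of this manual iteration; a sketch I did not test on these files:

library(readr)
library(dplyr)
# Read the large TSV one million rows at a time, keeping only Argentine title codes
keep_arg_titles <- DataFrameCallback$new(function(chunk, pos){
  filter(chunk, tconst %in% title_filter)
})
output <- read_tsv_chunked("Data/datasets/imdb/cast.tsv", keep_arg_titles,
                           chunk_size = 1000000)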

In any case, I later wrote a new and improved version to work with other IMDB files. The following helped me process the crew data (matching director codes from the names file with title codes):

filter_file <- function(source, iteration, reference, join_key){
  # source # file location "imdb/titles_info.tsv"
  # iteration # how many millions of rows are in the source
  # reference # a vector containing matches values to filter with
  # join_key # a string referring to the key variable between source and reference
  output <- fread(source, header = TRUE, nrows = 0)
  header_data <- colnames(output)
  for(i in 1:iteration){
    j <- i - 1
    data_raw <- fread(source, nrows = 1000000, skip = (j * 1000000))
    colnames(data_raw) <- header_data
    filter <- data_raw[which(data_raw[[join_key]] %in% reference)]
    rm(data_raw)
    output <- rbind(output, filter, fill = TRUE)
    rm(filter)}
  return(output)
}
arg_titles <- filter_file("imdb/crew.tsv", 6, argdirectors_imdb$nconst, "directors")
> head(arg_titles)
   V1    tconst directors   writers
1:  1 tt0000692 nm0303066       \\N
2:  2 tt0007646 nm0188105 nm0188105
3:  3 tt0007958 nm0214496 nm0554955
4:  4 tt0008807 nm0337395 nm0337395
5:  5 tt0009619 nm0188105       \\N
6:  6 tt0015576 nm0585931 nm0020082

With the filter_file() function I was able to process and filter title.basics.tsv (including title names, type, genre and year) and title.crew.tsv (directors and title codes).

There was another IMDB table that I worked with later, and it was far more challenging: the one named title.principals.tsv, renamed cast in this experiment. This table contains principal cast data, and it was a 1.25 GB file. Based on my previous experience, I decided to filter and extract data as I described before: line by line in base R.

I had to run the lines one million rows at a time, 31 times. This time I did regret not being able to run a loop in base R, but I did not want to waste time with trial and error - probably this should be a next learning goal.
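
In retrospect, the filter_file() function defined above should be able to handle this table too; had memory allowed it, the call would presumably have been something like:

# Hypothetical call: 31 chunks of one million rows, filtered by Argentine title codes
arg_cast_raw <- filter_file("Data/datasets/imdb/cast.tsv", 31, title_filter, "tconst")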

Additional fixes

While I had almost all the data ready to work with, there was some further processing I needed to do before being able to join them into my Argentine movie dataset.

Fixing characters

During the download and filtering steps, I saved the data as .csv files so I would not have to deal with big files later. Unfortunately, it seems that the way these files were written and read back did not preserve Latin accents and ñ characters.

# Showing a glimpse of titles with unsupported characters. This data frame matches the downloaded and processed IMDB title.basics.tsv.gz
movie_details[c(1, 2, 4, 7, 10, 13), c(2, 5, 6)]
      tconst              originalTitle isAdult
1: tt0000692 El fusilamiento de Dorrego       0
2: tt0007646                El apóstol       0
3: tt0008807          El último malón       0
4: tt0020660                 Añoranzas       0
5: tt0020856      Enfundá la mandolina       0
6: tt0021316            Rosas de Otoño       0

I came up with the following function to fix all these characters:

fix_accents <- function(el){
  # check if input is a data frame; if not, warn and return it unchanged
  if(is.data.frame(el) == FALSE){
    warning("input is not a data frame")
    return(el)
  }
  # coerce input to data frame class, removing potential data.table class
  el <- as.data.frame(el)
  # get column and row numbers and header names from input
  l <- ncol(el)
  r <- nrow(el)
  header_el <- colnames(el)
  # create output data frame
  result <- data.frame(matrix(NA, nrow = r, ncol = l))
  colnames(result) <- header_el
  for(i in 1:l){
    # if column is not character, copy it to the result data frame as is
    if(is.character(el[,i]) == FALSE){
      result[,i] <- el[,i]
    } else{
      # replace the following garbled characters
      output <- gsub("ñ", "ñ", el[,i], fixed = TRUE)
      output <- gsub("á", "a", output, fixed = TRUE)
      # ... more and more replace characters expressions here
      output <- gsub("¡", "¡", output, fixed = TRUE)
      result[,i] <- output
    }
  }
  return(result)
}
movie_details_fixed <- fix_accents(movie_details)

This is what the function does:

  1. Checks if the input (el) is a data frame. If so…
  2. Coerces the input to a data frame, removing the data.table class (because it interferes with the function's methods)
  3. Gets the input data frame's dimensions and creates a “blank slate” data frame (result) to overwrite with processed information
  4. Enters the loop for every column in el
  5. If a column is not character, copies it to result as is; otherwise…
  6. Replaces all the funny character combinations with plain vowels (or the intended character)
  7. Assigns the processed text column to result
  8. Returns result
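
In hindsight, declaring the encoding when reading the intermediate files back in might have avoided the problem altogether; a minimal sketch (the file path is illustrative):

library(data.table)
# Tell fread the file is UTF-8 so accented characters are read back intact
movie_details <- fread("Data/datasets/imdb/movie_details.csv",
                       header = TRUE, encoding = "UTF-8")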

Fixing the cast table

The cast data table (from the title.principals.tsv IMDB file) looked something like this:

head(cast, n = 10)

    V1    tconst ordering    nconst category      job characters
 1:  1 tt0000692        1 nm0742958    actor      \\N \\N
 2:  2 tt0000692        2 nm0349492    actor      \\N \\N
 3:  3 tt0000692        3 nm0143083    actor      \\N \\N
 4:  4 tt0000692        4 nm0303066 director      \\N \\N
 5:  5 tt0007646        1 nm0188105 director      \\N \\N
 6:  6 tt0007646        2 nm0884904 producer producer \\N
 7:  7 tt0007958       10 nm1124672    actor      \\N [""""""""Germán Castillo""""""""]
 8:  8 tt0007958        1 nm0306624    actor      \\N [""""""""Fabián""""""""]
 9:  9 tt0007958        2 nm0685286  actress      \\N [""""""""Rina""""""""]
10: 10 tt0007958        3 nm0351017    actor      \\N [""""""""Miguel Benav�dez""""""""]
# and so forth...

Notice that for every title (tconst column) there are many cast members, including actors, actresses and directors, and there are also irrelevant and annoying columns such as characters (the characters' names in the film).

The first step to trim this is to select relevant columns and filter out everyone but actors and actresses with some dplyr tools:

cast_filter <- cast %>%
    filter(category %in% c("actor", "actress")) %>%
    select(tconst, ordering, nconst) 
head(cast_filter, n = 10)
      tconst ordering    nconst
1  tt0000692        1 nm0742958
2  tt0000692        2 nm0349492
3  tt0000692        3 nm0143083
4  tt0007958       10 nm1124672
5  tt0007958        1 nm0306624
6  tt0007958        2 nm0685286
7  tt0007958        3 nm0351017
8  tt0007958        4 nm0677803
9  tt0008807        1 nm8500968
10 tt0008807        2 nm8500969

Now we have only the title code, ordering (which is not really useful) and name codes. However, there are many rows with the same title code, one for every name listed, and the number of rows per unique title code also varies.
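
For instance, a quick way to see how much the number of cast rows varies per title:

library(dplyr)
# Count cast rows per title and summarize the spread
cast_filter %>%
  count(tconst) %>%
  summarize(min_cast = min(n), max_cast = max(n), mean_cast = mean(n))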

What I want is to have one row for every title code, with all the information in that same row. To achieve this, I created the following code chunk, which took me about 5 days to complete. I believe there must be a better way to do this, and probably an easy package-function pair that does it very easily. I looked almost everywhere on the Internet and experimented with reshape functions, the reshape2 package and some more that I cannot remember right now, but nothing really worked. So here goes this coding mongrel, with one large loop plus some smaller ones (one of them nested, by the way).

spread_data <- function(data, key, value, numchar){
  # data is a dataset
  # key is the variable INDEX (eg. 1, 2, ...) to fetch   
  # value is the variable INDEX to gather
  # numchar is the total character length for every value
  #####
  library(tidyr)
  library(dplyr)
### Get reference names and useful details
  names <- colnames(data)
  key_name <- names[key]
  value_name <- names[value]
  u <- length(unique(data[,key])) # how many uniques (index)
  w <- unique(data[,key]) # vector with uniques key
### Create output element
  output <- data.frame(key_name = character(), value_name = character())
### Get all values for each key  and put them in a cell
  for(i in 1:u){
    # get how many elements are in the given key
    nk <- w[i]  # get an unique key
    n_index <- which(data[,key] == nk)  # index of rows with key == nk
    # Filter index and get relevant columns only
    selection <- data[which(data[,key] == nk),]
    selection <- selection[,c(key, value)]
    # Spread values as column names
    s <- spread(selection, value_name, key_name) 
    vals <- colnames(s) # get values vector from column names
    new_row <- cbind(nk, paste(vals, collapse = ", "))  # Parse names as new_row to append to output
    output <- rbind(output, new_row)
    rm(new_row)
  }
# Exit first loop. Next: create columns to allocate values
  colnames(output) <- c(key_name, value_name)
  #reconvert output element as a character only data frame
  output <- data.frame(lapply(output, as.character), stringsAsFactors = FALSE)
  # figure out number of columns to add in output data frame
  new_vals <- as.character(pull(output, 2))
  val_cols <- gsub(", ", "", new_vals) # remove separating commas
  num_new_cols <- max(nchar(val_cols))/numchar # count max number of values and divide with code length
  for(i in 1:num_new_cols){
    output[,(i+2)] <- NA # add as many new column as max number of values
  }
# Split new values into a source list; one sublist for every new row
  l_nv <- strsplit(new_vals, ", ", fixed = TRUE)
# Assign values from list into different columns  
  o_length <- nrow(output) # How many rows does output have?
  for(i in 1:o_length){    # For every row...
    for(j in 1:num_new_cols){   # ... and for every column...
      output[i, (j + 2)] <- l_nv[[i]][j]   # .. allocate its new value from the source list (or leave an NA).
    }
  }
  output[,2] <- NULL # Dumping long string source
  # setting new column names (keeping first column) for output element
  new_cols_name <- c(colnames(output)[1])
  for(i in 1:num_new_cols){
    new_cols_name <- c(new_cols_name, paste(value_name, "_", i, sep = ""))
  }
  colnames(output) <- new_cols_name
  return(output)
}
# ...END OF FUNCTION...
cast_fix <- spread_data(cast_filter, 1, 3, 9)

As I write this, I think I should improve the function with a couple of warnings in case the reference inputs are invalid, but it did its job. Now we have a better looking data frame, with name codes for every title code.

### MANUAL COLUMN NAME FIX
num_new_cols <- 10 # maximum number of cast columns in cast_fix (cast_1 ... cast_10)
new_cols_name <- c("tconst")
for(i in 1:num_new_cols){
     new_cols_name <- c(new_cols_name, paste("cast", "_", i, sep = ""))
   }
colnames(cast_fix) <- new_cols_name
# Showing result
head(cast_fix)
        tconst    cast_1    cast_2    cast_3    cast_4    cast_5    cast_6       cast_7 cast_8 cast_9 cast_10
1 tt0000692 nm0143083 nm0349492 nm0742958      <NA>      <NA>      <NA>      <NA>   <NA>   <NA>    <NA>
2 tt0007958 nm0306624 nm0351017 nm0677803 nm0685286 nm1124672      <NA>      <NA>   <NA>   <NA>    <NA>
3 tt0008807 nm1281622 nm8500968 nm8500969      <NA>      <NA>      <NA>      <NA>   <NA>   <NA>    <NA>
4 tt0015576 nm0140050 nm0431730 nm0543742 nm0548056 nm0585931 nm0731699 nm0858977   <NA>   <NA>    <NA>
5 tt0020660 nm0306624 nm1201900 nm3427084 nm3427753      <NA>      <NA>      <NA>   <NA>   <NA>    <NA>
6 tt0020732 nm0306624 nm1201900 nm3427084 nm3427753      <NA>      <NA>      <NA>   <NA>   <NA>    <NA>
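
In hindsight, the whole reshaping can probably be done far more compactly by numbering the cast members within each title and then spreading them into columns; a dplyr/tidyr sketch I have not tested on this data:

library(dplyr)
library(tidyr)
cast_fix_alt <- cast_filter %>%
  group_by(tconst) %>%
  mutate(slot = paste0("cast_", row_number())) %>% # cast_1, cast_2, ... within each title
  select(tconst, slot, nconst) %>%
  spread(slot, nconst)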

A critical unexpected mistake

Having fixed and joined all datasets together, I was just about to move on to the second and probably most interesting part of this experiment. I started testing my dataset, matching the data against other information on the Internet, until I stumbled upon an award-winning Argentinean movie that was not present in my data set. Why was that?

The very first data I used to filter all the IMDB data was, much to my dismay, only for MALE Argentinean directors - no females. There is another Wikipedia category page listing female Argentinean directors. Moreover, there is an XXI century directors category page.

This meant I had to scrape these pages and do everything once again.

Further complications

Going over all the initial data tables again was quick enough with the code I already had, but I could not process the cast dataset. I had downloaded the cast data probably around a month before realizing this mistake and, because of its size, I had deleted it. Since the first version I had, the file grew from 1.2 to 1.3 GB, and my base R engine was unable to process it at all.

I struggled a lot to come up with a solution to this, but eventually I managed to create a virtual machine instance of Windows Server 2016 with the Google Cloud service. It was probably not the best or most effective way to work with a VM, but I just downloaded base R, transferred my working files through a free uploading service, and everything went very well.

The upside of this is that I did not need to break the large datasets into chunks of a million rows, as the VM could process the large files at once.

Wrapping up

I fixed and ran my code once again and, as a result, I got a total of 529 Argentine directors against 388 before, and the number of rows in my dataset is now 6,759, against 5,300 rows without the female and XXI century directors. More importantly, I no longer missed very successful movies in my dataset, which would have been fatal to this whole endeavour.

Assembling data

Data summary

Before moving forward to the joining, let us recapitulate the data gathered so far. This section is a summary after much back and forth in data collection and processing. The following subsections briefly present each dataset that will be joined together for the analysis.

Argentinean directors

Obtained from the name.basics.tsv IMDB file. This data set is a master database of names, from which I filtered the Argentine directors' details through the Wikipedia web scraping. With the nconst variable we can then filter movie titles whose appointed director is from Argentina.

str(arg_directors)
## Classes 'data.table' and 'data.frame':   529 obs. of  6 variables:
##  $ nconst           : chr  "nm0002159" "nm0002728" "nm0002869" "nm0009599" ...
##  $ primaryName      : chr  "Alejandro Agresti" "Juan Jose Campanella" "Mario Gallo" "Angel Acciaresi" ...
##  $ birthYear        : chr  "1961" "1959" "1923" "1908" ...
##  $ deathYear        : chr  "\\N" "\\N" "1984" "\\N" ...
##  $ primaryProfession: chr  "director,writer,cinematographer" "writer,actor,director" "actor" "assistant_director,director,writer" ...
##  $ knownForTitles   : chr  "tt0296915,tt0169364,tt0115773,tt0385889" "tt1305806,tt1994702,tt0210843,tt0347449" "tt0073979,tt0074751,tt0086192,tt0081398" "tt0177289,tt0177423,tt0176580,tt0319437" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Title codes

The source IMDB table (title.crew.tsv) matches title codes tconst with name codes nconst both for writers and directors.

As I filter Argentine movies through the appointed director as a proxy for country of origin (see discussion in part 1 about this questionable decision), I use this table to gather all Argentine title codes. The tconst variable is a unique identifier for every title in IMDB.

str(arg_titles)
## Classes 'data.table' and 'data.frame':   6759 obs. of  3 variables:
##  $ tconst   : chr  "tt0000692" "tt0007646" "tt0007958" "tt0008807" ...
##  $ directors: chr  "nm0303066" "nm0188105" "nm0214496" "nm0337395" ...
##  $ writers  : chr  "\\N" "nm0188105" "nm0554955" "nm0337395" ...
##  - attr(*, ".internal.selfref")=<externalptr>

From the console output above, we notice that 6,759 Argentine titles have been filtered through the Argentine directors data.

Movie details

The movie_details dataset was generated from the IMDB title.basics.tsv file, which matches each unique tconst code with its details, including the actual name, release year, and more.

str(arg_details)
## Classes 'data.table' and 'data.frame':   6759 obs. of  9 variables:
##  $ tconst        : chr  "tt0000692" "tt0007646" "tt0007958" "tt0008807" ...
##  $ titleType     : chr  "short" "movie" "movie" "movie" ...
##  $ primaryTitle  : chr  "El fusilamiento de Dorrego" "El apóstol" "Peach Blossom" "The Last Indian Attack" ...
##  $ originalTitle : chr  "El fusilamiento de Dorrego" "El apóstol" "Flor de durazno" "El último malón" ...
##  $ isAdult       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ startYear     : chr  "1908" "1917" "1917" "1917" ...
##  $ endYear       : chr  "\\N" "\\N" "\\N" "\\N" ...
##  $ runtimeMinutes: chr  "\\N" "70" "\\N" "97" ...
##  $ genres        : chr  "History,Short" "Animation,Comedy,Drama" "Drama" "History,War" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Cast

This dataset stemmed, with a lot of pain and head-scratching, from the IMDB title.principals.tsv file. For every single tconst, there are up to 10 columns with name codes (nconst).

Notice that this data frame has only 6,257 rows, meaning that data for 502 titles have been lost to the data wrangling efforts.

str(arg_cast)
## Classes 'data.table' and 'data.frame':   6257 obs. of  11 variables:
##  $ tconst   : chr  "tt0000692" "tt0007958" "tt0008807" "tt0015576" ...
##  $ nconst_1 : chr  "Roberto Casaux" "Carlos Gardel" "Rosa Volpe" "James Carrasco" ...
##  $ nconst_2 : chr  "Eliseo Gutierrez" "Argentino Gomez" "Mariano Lopez" "Roger San Juana" ...
##  $ nconst_3 : chr  "Salvador Rosich" "Celestino Petray" "Salvador Lopez" "Doris Mansell" ...
##  $ nconst_4 : chr  NA "Ilde Pirovano" NA "Mona Maris" ...
##  $ nconst_5 : chr  NA "Diego Figueroa" NA "Adelqui Migliar" ...
##  $ nconst_6 : chr  NA NA NA "Jerrold Robertshaw" ...
##  $ nconst_7 : chr  NA NA NA "Jameson Thomas" ...
##  $ nconst_8 : chr  NA NA NA NA ...
##  $ nconst_9 : chr  NA NA NA NA ...
##  $ nconst_10: chr  NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

The nconst_n variables present the name of every listed actor in each movie. Some titles show only a few actors as “principal cast”, some up to 10. This means there are missing values for movies listing fewer than 10 actors.

It is important to mention that, with the power of the virtual machine, I could easily replace the nconst codes with the actual actor names using this function:

cast_names <- function(data, source, s_ind, s_name){
  # data: cast table with nconst codes in columns 2 to 11
  # source: names table; s_ind and s_name are its index and name column numbers
  data <- as.data.frame(data)
  source <- as.data.frame(source)
  for(i in 1:nrow(data)){   # for every title (nrow(), not length(), to loop over rows)
    for(j in 1:10){         # ... and for every cast slot
      col <- j + 1
      get_nconst <- data[i, col]
      if(is.na(get_nconst) == TRUE){next} else{
        get_index <- which(source[,s_ind] == get_nconst)
        get_name <- source[get_index, s_name]
        data[i, col] <- get_name
      }
    }
  }
  return(data)
}
arg_cast <- cast_names(arg_cast, names, 1, 2)
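
As an aside, the same replacement could probably be vectorized with match(), doing one lookup per column instead of one per cell; a sketch assuming arg_cast still holds the codes and that the names table keeps the nconst and primaryName columns from name.basics.tsv:

# Replace codes with names, column by column, leaving NAs untouched
for(col in 2:11){
  idx <- match(arg_cast[[col]], names$nconst)
  arg_cast[[col]] <- ifelse(is.na(idx), arg_cast[[col]], names$primaryName[idx])
}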

Joining datasets

We have so far four data tables (titles, directors, details and cast) to be assembled together and generate a “master data table”. Tables will be joined together using dplyr package and then they will be trimmed from non-essential data.

However, the arg_titles dataset has a column named “directors” containing nconst codes, so it has to be renamed to nconst to perform the joins.

colnames(arg_titles)
## [1] "tconst"    "directors" "writers"
colnames(arg_titles)[2] <- "nconst"

Proceeding to join these data tables together:

  1. Join titles to directors, matching on the nconst variable.
  2. Join details by the tconst title code.
  3. Rename the variable nconst to directorCode.
  4. Join cast to the data set, again by tconst.

library(dplyr)
titles_directors <- left_join(arg_titles, arg_directors, by = "nconst")
titles_directors_details <- left_join(titles_directors, arg_details, by = "tconst")
colnames(titles_directors_details)[2] <- "directorCode"
arg_imdb_b <- left_join(titles_directors_details, arg_cast, by = "tconst")

The result is a behemoth dataset with way too many variables. Notice that the “awards” and “nominations” columns were added in a step described further below.

str(arg_imdb_b)
## Classes 'data.table' and 'data.frame':   6759 obs. of  28 variables:
##  $ tconst           : chr  "tt0000692" "tt0007646" "tt0007958" "tt0008807" ...
##  $ directorCode     : chr  "nm0303066" "nm0188105" "nm0214496" "nm0337395" ...
##  $ writers          : chr  "\\N" "nm0188105" "nm0554955" "nm0337395" ...
##  $ nominations      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ awards           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ primaryName      : chr  "Mario Gallo" "Quirino Cristiani" "Francisco Defilippis Novoa" "Alcides Greca" ...
##  $ birthYear        : chr  "1878" "1896" "\\N" "1889" ...
##  $ deathYear        : chr  "1945" "1984" "\\N" "1956" ...
##  $ primaryProfession: chr  "director,producer" "director,animation_department,writer" "director,writer" "director,writer,producer" ...
##  $ knownForTitles   : chr  "tt0191000,tt0191220,tt0191561,tt0191402" "tt0196636,tt0196839,tt0007646,tt0197079" "tt0351250,tt0305630,tt0007958,tt0264146" "tt0008807" ...
##  $ titleType        : chr  "short" "movie" "movie" "movie" ...
##  $ primaryTitle     : chr  "El fusilamiento de Dorrego" "El apóstol" "Peach Blossom" "The Last Indian Attack" ...
##  $ originalTitle    : chr  "El fusilamiento de Dorrego" "El apóstol" "Flor de durazno" "El último malón" ...
##  $ isAdult          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ startYear        : chr  "1908" "1917" "1917" "1917" ...
##  $ endYear          : chr  "\\N" "\\N" "\\N" "\\N" ...
##  $ runtimeMinutes   : chr  "\\N" "70" "\\N" "97" ...
##  $ genres           : chr  "History,Short" "Animation,Comedy,Drama" "Drama" "History,War" ...
##  $ nconst_1         : chr  "Roberto Casaux" NA "Carlos Gardel" "Rosa Volpe" ...
##  $ nconst_2         : chr  "Eliseo Gutierrez" NA "Argentino Gomez" "Mariano Lopez" ...
##  $ nconst_3         : chr  "Salvador Rosich" NA "Celestino Petray" "Salvador Lopez" ...
##  $ nconst_4         : chr  NA NA "Ilde Pirovano" NA ...
##  $ nconst_5         : chr  NA NA "Diego Figueroa" NA ...
##  $ nconst_6         : chr  NA NA NA NA ...
##  $ nconst_7         : chr  NA NA NA NA ...
##  $ nconst_8         : chr  NA NA NA NA ...
##  $ nconst_9         : chr  NA NA NA NA ...
##  $ nconst_10        : chr  NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

Many variables are unnecessary for the analysis, so they get trimmed out:

arg_imdb <- arg_imdb_b[,c(1, 4, 5, 6, 11, 12, 13, 15, 17:28)]
colnames(arg_imdb)[4] <- "director"
str(arg_imdb)
## Classes 'data.table' and 'data.frame':   6759 obs. of  20 variables:
##  $ tconst        : chr  "tt0000692" "tt0007646" "tt0007958" "tt0008807" ...
##  $ nominations   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ awards        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ director      : chr  "Mario Gallo" "Quirino Cristiani" "Francisco Defilippis Novoa" "Alcides Greca" ...
##  $ titleType     : chr  "short" "movie" "movie" "movie" ...
##  $ primaryTitle  : chr  "El fusilamiento de Dorrego" "El apóstol" "Peach Blossom" "The Last Indian Attack" ...
##  $ originalTitle : chr  "El fusilamiento de Dorrego" "El apóstol" "Flor de durazno" "El último malón" ...
##  $ startYear     : chr  "1908" "1917" "1917" "1917" ...
##  $ runtimeMinutes: chr  "\\N" "70" "\\N" "97" ...
##  $ genres        : chr  "History,Short" "Animation,Comedy,Drama" "Drama" "History,War" ...
##  $ nconst_1      : chr  "Roberto Casaux" NA "Carlos Gardel" "Rosa Volpe" ...
##  $ nconst_2      : chr  "Eliseo Gutierrez" NA "Argentino Gomez" "Mariano Lopez" ...
##  $ nconst_3      : chr  "Salvador Rosich" NA "Celestino Petray" "Salvador Lopez" ...
##  $ nconst_4      : chr  NA NA "Ilde Pirovano" NA ...
##  $ nconst_5      : chr  NA NA "Diego Figueroa" NA ...
##  $ nconst_6      : chr  NA NA NA NA ...
##  $ nconst_7      : chr  NA NA NA NA ...
##  $ nconst_8      : chr  NA NA NA NA ...
##  $ nconst_9      : chr  NA NA NA NA ...
##  $ nconst_10     : chr  NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

A few fixes and tweaks

A few things have to be fixed in this dataset. For instance, the startYear column is a character column while it should be integer. Also, its name is misleading, since for movies it corresponds to the release year.

arg_imdb <- as.data.frame(arg_imdb)
arg_imdb$startYear <- as.integer(arg_imdb$startYear)  # convert by name to avoid miscounting the column index
colnames(arg_imdb)[colnames(arg_imdb) == "startYear"] <- "releaseYear"

Then, the dataset contains information about a wide range of material types, including…

unique(arg_imdb$titleType)
## [1] "short"        "movie"        "tvMovie"      "tvEpisode"   
## [5] "tvSeries"     "video"        "tvMiniSeries" "tvShort"     
## [9] "videoGame"

So, to look solely into Argentine movies, I apply the following filter:

library(dplyr)
arg_movies <- arg_imdb %>% 
  filter(titleType == "movie")
str(arg_movies)
## 'data.frame':    2518 obs. of  20 variables:
##  $ tconst        : chr  "tt0007646" "tt0007958" "tt0008807" "tt0009619" ...
##  $ nominations   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ awards        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ director      : chr  "Quirino Cristiani" "Francisco Defilippis Novoa" "Alcides Greca" "Quirino Cristiani" ...
##  $ titleType     : chr  "movie" "movie" "movie" "movie" ...
##  $ primaryTitle  : chr  "El apóstol" "Peach Blossom" "The Last Indian Attack" "Sin dejar rastros" ...
##  $ originalTitle : chr  "El apóstol" "Flor de durazno" "El último malón" "Sin dejar rastros" ...
##  $ startYear     : chr  "1917" "1917" "1917" "1918" ...
##  $ runtimeMinutes: chr  "70" "\\N" "97" "\\N" ...
##  $ genres        : chr  "Animation,Comedy,Drama" "Drama" "History,War" "Animation" ...
##  $ nconst_1      : chr  NA "Carlos Gardel" "Rosa Volpe" NA ...
##  $ nconst_2      : chr  NA "Argentino Gomez" "Mariano Lopez" NA ...
##  $ nconst_3      : chr  NA "Celestino Petray" "Salvador Lopez" NA ...
##  $ nconst_4      : chr  NA "Ilde Pirovano" NA NA ...
##  $ nconst_5      : chr  NA "Diego Figueroa" NA NA ...
##  $ nconst_6      : chr  NA NA NA NA ...
##  $ nconst_7      : chr  NA NA NA NA ...
##  $ nconst_8      : chr  NA NA NA NA ...
##  $ nconst_9      : chr  NA NA NA NA ...
##  $ nconst_10     : chr  NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

How to rate the best Argentine films?

Discussion This, again, is a matter of decision and highly questionable; the main aspect to consider is data availability for whatever criteria are applied to rank films.

As this is a highly subjective matter, there is a wide range of possible indicators to assess this question: film gross earnings, total movie viewers (it would be better to sum up box office, television broadcasting and digital or on-demand screenings), audience surveys, critic ratings (Rotten Tomatoes, IMDB stats, and so on), publication analysis and awards. If we could get data for three or more of these indicators, we might be able to get a very solid answer.

However, I decided to take awards because of their online availability. This, again, might be a flawed choice, because the probability of a movie getting an award depends on the availability of awards: in the early XX century there were not as many film awards as there were by the late XX century.

Award lists

Initially, I considered looking up and scraping award and nomination lists from the web, but this proved quite complicated, as there is a wide array of awarding institutions and events worldwide; some are short-lived, some are international AND regional, and so on.

I finally came up with a much better idea: scraping, again from the IMDB, the number of awards and nominations listed for every title. That is: many titles have an awards sub-page that keeps the same URL structure, for example https://www.imdb.com/title/tt0089276/awards. Each of these awards pages displays a short summary (“Showing all 25 wins and 8 nominations”) before going into detail for every event, category and nominated crew or cast member.

This, of course, has some pros and cons. The advantages are online availability, simplicity and comparability: the IMDB can be considered a reliable source for this information, and it is a single place to scrape the data from - the scraped pages all share the same content structure. The disadvantages are on the side of the level of detail available for analyzing or comparing the data; every award (or nomination) weighs just one unit in a film's awards count, and (at least with the scraping code I designed in this case) it cannot be discerned whether the award/nomination was international, or whether the category was “best film” or “best supporting actor”.

However, I am inclined to believe that an internationally recognised movie is likely to be nationally acclaimed as well - with the potential exception of a politically controversial movie that would be shunned either nationally or internationally.

Scraping the web for awards and nominations

With the power of a virtual machine, I was able to scrape the web using the code below, without burning my laptop down.

imdb_scrapper <- function(data, key, new_var_name, pat){
  # data = ELEMENT dataset name
  # key = NUMBER identifier column in data
  # new_var_name = STRING new name to add scrapped data
  # pat = One of the following regular expressions for either item: awards = "(?<=all\\s)\\d+" // nominations "(?<=and\\s)\\d+"
  library(rebus)
  library(stringr)
  library(htmlwidgets)
  library(data.table)
  library(httr)
  library(rvest)
  data <- as.data.frame(data) # Force data set element as data frame
  key_length <- length(unique(data[,key]))  # How many rows in data set
  hector <- numeric()     # Create placeholder vector
  for(i in 1:key_length){  
    input <- character(0)  # reset so a failed request is counted as 0 below
    url <- paste0("https://www.imdb.com/title/", data[i, key], "/awards?ref_=tt_awd")  # Build the URL
    try(input <- url %>%   # "try" keeps the function from crashing should a request fail
      read_html() %>%
      html_nodes(xpath = '/html/body/div[2]/div/div[2]/div[3]/div[1]/div[1]/div/div[2]/div/div') %>%
      html_text())
    # Read input string and detect how many wins OR nominations
    if(length(input) == 0){  # If the scrapper returns empty, add a 0
      hector <- c(hector, 0)
    } else {    # Extract digits, according to the given expression
     get_data <- as.numeric(str_extract(input, pat))
     hector <- c(hector, get_data)} # Add data to vector and continue loop
  }
  # Add new column to data with name new_var_name
  nc <- ncol(data)
  data[,(nc + 1)] <- hector
  colnames(data)[nc + 1] <- new_var_name
  return(data)
}
arg_titles_awards <- imdb_scrapper(arg_titles_awards, 1, "nominations", "(?<=and\\s)\\d+")
arg_titles_awards <- imdb_scrapper(arg_titles_awards, 1, "wins", "(?<=all\\s)\\d+")

The function works by using regular expressions to look up the digits after “all” and “and” in the phrase “Showing all ## wins and ## nominations” on every awards subpage in the IMDB, passing them into a vector and finally appending the latter to the dataset as a new column.
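
To illustrate, a minimal example of how the two look-behind patterns pull the counts out of that summary sentence:

library(stringr)
summary_text <- "Showing all 25 wins and 8 nominations"
as.numeric(str_extract(summary_text, "(?<=all\\s)\\d+")) # returns 25 (wins)
as.numeric(str_extract(summary_text, "(?<=and\\s)\\d+")) # returns 8 (nominations)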

The results were previously shown upon inspecting the arg_movies data set above.

A glimpse of the Argentine IMDB awards data

Before going onto an in-detail analysis, I will briefly present a snapshot of the data gathered so far.

library(dplyr)
# Total number of Argentine materials extracted from IMDB
nrow(arg_imdb)
## [1] 6759
# Breakdown by material type
table(arg_imdb$titleType)
## 
##        movie        short    tvEpisode tvMiniSeries      tvMovie 
##         2518          429         3527           39           69 
##     tvSeries      tvShort        video    videoGame 
##           99            7           70            1
# Awarded or nominated materials
arg_imdb %>%
  group_by(titleType) %>%
  summarize(awarded = sum(awards > 0 | nominations > 0), 
            not_awarded = sum(awards == 0 & nominations == 0),
            award_ratio = awarded/n(),
            total = awarded + not_awarded) %>%
  select(titleType, total, awarded, award_ratio) %>%
  arrange(desc(total))
## # A tibble: 9 x 4
##   titleType    total awarded award_ratio
##   <chr>        <int>   <int>       <dbl>
## 1 tvEpisode     3527       0      0     
## 2 movie         2518     588      0.234 
## 3 short          429      37      0.0862
## 4 tvSeries        99      16      0.162 
## 5 video           70       1      0.0143
## 6 tvMovie         69       2      0.0290
## 7 tvMiniSeries    39      14      0.359 
## 8 tvShort          7       0      0     
## 9 videoGame        1       0      0

Below is a graphical summary of the awards and nominations data. Given that a great number of movies have no wins or nominations, they have been filtered out to prevent everything from cramming against the left of the plot.

library(ggplot2)
arg_movies_awards <- arg_imdb %>%
  filter(titleType == "movie", awards > 0) %>%
  select(awards)
ggplot(arg_movies_awards, aes(awards)) + geom_histogram(bins = 50)  + labs(title = "Awards histogram", subtitle = paste("Showing ", nrow(arg_movies_awards), "awarded movies, over ", sum(arg_imdb$titleType == "movie")))

arg_movies_nominations <- arg_imdb %>%
  filter(titleType == "movie", nominations > 0) %>%
  select(nominations)
ggplot(arg_movies_nominations, aes(nominations)) + geom_histogram(bins = 40) + labs(title = "Nominations histogram", subtitle = paste("Showing ", nrow(arg_movies_nominations), "nominated movies, over ", sum(arg_imdb$titleType == "movie")))

More pieces to the puzzle

Before moving on to the data analysis, I will present another piece of code I used to compare awards and nominations trends worldwide, which will be discussed in the next part.

The aim is to obtain a worldwide sample of all movies in the IMDB and calculate how many awards and nominations there have been throughout the years, regardless of where the movies come from. This will be used as a comparison for the Argentine movies dataset.

The first step is to obtain samples from the IMDB dataset:

decade_sampling <- function(data, key, yVar, yStart = min(data[,yVar]), yEnd = max(data[,yVar]), sample_prop = 0.33){
  # data is a dataset
  # key is the INDEX of the identifier column to gather
  # yVar is yearIndex
  # yStart AND yEnd are years
  # sample_prop is the proportion of rows to take from each decade "0.10"
  ## Check if year variables is numeric or integer
  if((is.numeric(data[,yVar]) | is.integer(data[,yVar])) == FALSE){
    stop("yVar element is not numeric or integer. Fix it.")  # stop instead of just printing
  }
  ## Define the decade span
  hm_decades <- (((yEnd -yStart) - ((yEnd -yStart) %% 10)) / 10) + 1
  decades_floor <- yStart - (yStart %% 10)
  decades_ceiling <- yEnd + (10 - yEnd %% 10)
  ## Group the years in decades
  decades_vector <- character()
  for(i in 1:hm_decades){
    j <- i -1
    d_floor <- decades_floor + (j * 10)
    d_ceiling <- decades_floor + (i * 10)
    decades_vector <- c(decades_vector, paste(d_floor, "-", d_ceiling))
  }
  ## Create the (empty) output element, keeping the same columns as data
  output <- as.data.frame(data[0,])
  ## Subset dataset by decade
  for(i in 1:hm_decades){
    j <- i -1
    d_floor <- decades_floor + (j * 10)
    d_ceiling <- decades_floor + (i * 10)
    current_decade <- d_floor:d_ceiling
    decade_subset <- data[data[,yVar] %in% current_decade,]
    ## Get the sample
    decade_sample <- sample(decade_subset[,key], round(nrow(decade_subset)*sample_prop))
    sample_data <- data[data[,key] %in% decade_sample,]
    rm(decade_sample)
    ## Put all decades together
    output <- rbind(output, sample_data)
    rm(sample_data)
  }
  return(output)
}

The second part is to scrape the web for the number of awards for the titles in the previously obtained sample, grouping these titles in 5-year periods. These data will be stored in lists instead of data frames, because lists can handle more irregular elements.

imdb_data_assembler <- function(sample, key, yKey, pat){
  library(dplyr)
  library(rebus)
  library(stringr)
  library(htmlwidgets)
  library(data.table)
  library(httr)
  library(rvest)
  # Define the time span
  yFloor <- as.numeric(min(sample[,yKey]))
  yCeiling <- as.numeric(max(sample[,yKey]))
  ySpan <- (yCeiling - yFloor)/5
  # Create output element
  output <- list()
  # Filter movies within 5 year span
  for(i in 1:ySpan){
    j <- i - 1
    current5y <- yFloor + (j * 5)
    y5 <- sample[(sample[,yKey] %in%  current5y:(current5y + 5)),]
    # Scrap the data
    #get_data <- imdb_sample_scrapper(y5, key, pat)
    ####
    key_length <- length(unique(y5[,key]))
    hector <- numeric()
    for(a in 1:key_length){  # Loop over every title in this 5-year subset
      input <- character(0)  # reset so a failed request is counted as 0 below
      url <- paste0("https://www.imdb.com/title/", y5[a, key], "/awards?ref_=tt_awd")
      try(input <- url %>%
            read_html() %>%
            html_nodes(xpath = '/html/body/div[2]/div/div[2]/div[3]/div[1]/div[1]/div/div[2]/div/div') %>%
            html_text())
      # Read input string and detect how many wins OR nominations
      try(if(length(input) == 0){
        hector <- c(hector, 0)
      } else {
        # awards = (?<=all\\s)\\d+ // nominations "(?<=and\\s)\\d+"
        get_data <- as.numeric(str_extract(input, pat))
        hector <- c(hector, get_data)})
      # Add data to vector and loop over
    }
    # Store the data
    ll <- length(output)
    output[[ll+1]] <- hector
    names(output)[ll+1] <- paste0(current5y,"-",(current5y + 5))
    }
  #######
    return(output)
  }

Finally, in order to manage time better, the samples are divided into three periods, which are then put together:

earlyXX <- decade_sampling (movies_worldwide, 1, 3, 1900, 1940, sample_prop = 0.25)
earlyXX_awards <- imdb_data_assembler(sample = earlyXX, key = 1, yKey = 3, pat =  "(?<=all\\s)\\d+")
#
midXX <- decade_sampling (movies_worldwide, 1, 3, 1940, 1980, sample_prop = 0.20)
midXX_awards <- imdb_data_assembler(sample = midXX, key = 1, yKey = 3, pat =  "(?<=all\\s)\\d+")
#
lateXX <- decade_sampling (movies_worldwide, 1, 3, 1980, 2020, sample_prop = 0.20)
lateXX_awards <- imdb_data_assembler(sample = lateXX, key = 1, yKey = 3, pat =  "(?<=all\\s)\\d+")

Analysis

Having prepared the Argentine films data sets, I can start answering some questions with the data prepared so far.

Who is the most successful film director?

As stated in the first part, some concept definitions are completely arbitrary and questionable. There might be many ways to measure success, but in this particular case study it will be measured by the total number of awards and nominations that his or her movies have received to date.

It is important to acknowledge that, because of the way the data was gathered, this total number of awards and nominations is not only for the movie or the director him/herself, but also for the cast and any other category associated with the movie. Simply put, if an actor received an award for a movie he worked in, that award counts towards the movie's director in this calculation. Therefore, we may restate the question as “who is the director responsible for the most award-winning and nominated movies”, as a potential proxy for success.

The first step to figure this out is simply to filter for movies and sum up the total number of awards and nominations that each film director got for all his/her movies:

library(dplyr)
library(tidyr) # for gather()
# Sum up awards and nominations for every film director
succesful_director <- arg_movies %>%
  select(director, nominations, awards) %>%
  group_by(director) %>%
  summarize(totalAwards = sum(awards), totalNominations = sum(nominations)) %>%
  arrange(desc(totalAwards), desc(totalNominations))
# Get the top ten and tidy up the data
top_ten <- succesful_director[1:10,] %>%
  gather(recognition, num_recog, totalAwards:totalNominations)
# Shorten the names by trimming out first names
library(stringr)
for(i in 1:nrow(top_ten)){
  getLastName <- tail(strsplit(top_ten$director[i],split=" ")[[1]],1)
  top_ten$director[i] <- getLastName}
# Convert directors into factor variable to arrange the plot
fact_order <- unique(top_ten$director)
top_ten$director <- factor(top_ten$director, ordered = T, levels = fact_order)

With these arrangements ready, now we plot it onto a column chart:
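
The plotting code was roughly along the following lines (a sketch; the exact theme and labels of the rendered chart may have differed):

library(ggplot2)
ggplot(top_ten, aes(x = director, y = num_recog, fill = recognition)) +
  geom_col(position = "dodge") +
  labs(title = "Top ten Argentine directors",
       x = "Director", y = "Total awards and nominations")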

We clearly see that J.J. Campanella is the most successful director. There seems to be a significant gap to the runner-up, Aristarain, while Pablo Trapero ranks as the third most successful, although he is associated with a remarkable number of nominations, nearly as high as Campanella's wins. In this top-ten selection, the mean seems to be around 50 awards won.

Which year was the most successful for Argentine movies?

We can now consider how well Argentine movies have performed over time in award-winning, and maybe inquire about performance trends. The first step is to adapt the dataset to sum up all awards and nominations by release period:

library(dplyr)
library(tidyr)
arg_movies_decade <- arg_movies %>%
  mutate(decade = (releaseYear - releaseYear %% 5)) %>% # grouping years into 5-year periods (the variable kept the name "decade")
  group_by(decade) %>%
  summarize(totalAwards = sum(awards), totalNominations = sum(nominations)) %>%
  gather(recognition, periodTotal, totalAwards:totalNominations)

And plot this data:
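
The plot below was produced with code roughly along these lines (a sketch; the vertical period separators and theme are omitted):

library(ggplot2)
ggplot(arg_movies_decade, aes(x = decade, y = periodTotal, colour = recognition)) +
  geom_line() +
  labs(title = "Awards and nominations for Argentine movies over time",
       x = "Release year (5-year periods)", y = "Total per period")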

The plot shows, in different colours, the total number of awards and nominations received by the movies in the dataset. We notice no awards or nominations until around 1930 to 1940, a period in which some awards and nominations were given, but altogether no more than 50 in each 5-year period.

Starting with the 1980s there is a noticeable rise in the total number of awards and nominations, continuing to an award “boom” in the early XXI century.

Let’s take a closer look into each period separated by the vertical lines in the plot above.

In these plots, the thin light lines represent the actual number of total awards and nominations, while the thick dashed lines represent the overall trend. In addition, notice that the scales differ between the two.

Looking at the 1940-1980 period, there is a steady award performance throughout, with a major peak in 1959 and a smaller one in 1947.

Regarding the 1980-present period, the first decade in the chart is actually higher than in the previous plot, but it is dwarfed next to the 2000s, reaching its summit in the year 2002, with another very good result in 2009.

##                    originalTitle releaseYear nominations awards
## 1         El secreto de sus ojos        2009          40     52
## 2              Historias minimas        2002           8     25
## 3                 Tan de repente        2002           7     20
## 4                        Gigante        2009           6     16
## 5                       Valentin        2002           9     14
## 6                Lugares comunes        2002          13     13
## 7  El ultimo verano de la Boyita        2009          10      9
## 8                  El bonaerense        2002          11      8
## 9                     Caja negra        2002           6      8
## 10          El juego de la silla        2002           2      8
## 11             Salto a la gloria        1959           0      6
## 12                     Kamchatka        2002           6      6
## 13                      La caida        1959           1      5
## 14          Tres hombres del rio        1947           0      5
## 15      ¿Donde vas, Alfonso XII?        1959           0      4

Comparing the worldwide trend

However, the question that arises upon looking at these plots is: why is late XX/early XXI century Argentine cinema doing so much better than it was in the early XX century? Or, what explains this award surge in the 2000s?

An actual answer to these questions would be on the turf of film experts or communication specialists, through an in-depth qualitative analysis. However, there is an almost obvious hypothesis to explain this, and it concerns the larger number of film-awarding events taking place in these decades, compared with the early and mid XX century.

To test this, I took samples of 20-25% of the movies for each decade from the whole IMDB movie database (the same updated version I used to obtain the Argentine movies dataset). This sample represents movies worldwide, without any distinction of nationality. Countries with more films listed in the IMDB will probably be overrepresented.

Comparing the trend in the sample plot, there is clearly a similarity in shape. Notice the plot is a sum of awards in the sample, adding up to over 4,000 awards worldwide in 2005. It looks, then, like it is not that Argentine movies have been performing particularly better, or that there was an isolated award mania in the country; instead, the country has been following an international trend.

Taking a different approach, I compared the average number of awards for each 5-year period, hoping to find a comparable scale between the Argentine movies dataset and the worldwide sample. This is the worldwide performance.

The pattern is still noticeable: a sharp rise in award activity shortly before 1950, steady until the 1980s, followed by a new boom in the early XXI century. Compare now with the Argentine movies:

While there seems to be a rather good Argentine performance by the mid XX century, starting in the 1980s there is an award boom for Argentine movies. The 2000s peak noticed in the previous plot now looks like a plateau just higher than before, dwarfed by the Argentine award performance.

I believe there are two potential interpretations: (1) there were many new award events in which Argentine movies could participate, probably many of them from the country itself, in contrast to other countries' award activity; and (2) there is a comparability mismatch between the Argentine award performance and the worldwide sample.

Final comments

There are some other questions that I would like to answer with this data; however, this project has already taken me a long time and I would rather prioritize learning some other technologies.

Hopefully I will be able to continue working on this, but so far I think it is enough to publish a first version.