Required packages

# This is the R chunk for the required packages
library(readr)
library(magrittr)
library(dplyr)
library(tidyr)
library(outliers)
library(forecast)

Executive Summary

Several Datasets have been brought together with the intent of performing data exploration and analysis to reveal insights about films and their performance on the bechdel test. Preparing these datasets for use involved merging data from IMDb and bechdeltest.com. Variable data types and column names were adjusted to be more meaningful, a new variable was created to more easily inspect the data and the dataset was reshaped into a tidy format. The dataset was also scanned for missing or special values and obvious errors which were removed. Retention of outliers in some variables was justified, however, outliers were removed from the runtime variable using the z-score method after a BoxCox transformation was applied to remove skew.

Data

The following datasets were used in this assignment:

Bechdel All Films Dataset

Description: The Bechdel Test can be used as a measure of female representation in film. The test considers three separate criteria:

  • The film contains at least two named women (1 point),
  • These women have a conversation with each other (2 points), and
  • The conversation is not about a man (3 points).

The Bechdel All Films Dataset was sourced from the bechdeltest.com, on this website users submit their own assessments of films against the criteria of the Bechdel Test. The All Films dataset contains all films stored on the database.

It is important to note that assessments of films can be subjective, each film assessment uploaded by a user has a comments section where users can debate the accuracy of the assessment. These debates can sometimes lead to moderators changing the score assigned to a given film. These comments were not available in the All Films dataset that was retrieved, as such it is assumed that the score provided for each film is in the dataset is current, and consistent with the general consensus for that film’s score.

Source: https://bechdeltest.com/?list=all

Variables: The dataset includes 5 variables and 8,574 observations. The definitions of these variables are given in the table below:

Variable Description
ï..id The bechdeltest.com unique id
imdbid The unique IMDb id of the film
year The year the film was released (according to IMDb)
title The title of the film. Special characters are HTML encoded
rating The bechdel score, ranging from 0 to 3, based on the criteria that were met.

The data was extracted from the bechdeltest.com using an API and saved to a .csv file. Note that columns are assigned the ‘character’ class on import. This decision was made in order to ensure leading zeroes on the imdbid variable were not lost.

bechdel <- read.csv("Data/bechdel.csv", colClasses = "character")
head(bechdel)
IMDb All Films - Titles Dataset

Description: The Internet Movie Database (IMDb) is an online database containing numerous details of over 6 million unique films and TV shows. Two datasets from IMDb were utilised, the first of which is the ‘Titles’ dataset (title.basics.tsv.gz). This dataset contains a selected information for all films stored on IMDb.The dataset is formatted with tab separated variables (.tsv).

Source: https://datasets.imdbws.com/

Variables: The dataset includes 9 variables and 7,206,390 observations. “\\N” is used to denote missing or NULL data. The definitions of these variables are given in the table below:

Variable Description
tconst The unique IMDb id of the film
titleType The format of the title (movie, short, tvseries, tv episode, etc.)
primaryTitle The primary title, or title by which it is predominantly known
originalTitle The original title, in the original language
isAdult Boolean value - 0 for non-Adult film, 1 for Adult film
startYear Release Year, or TV series start year
endYear TV Series end year
runtimeMinutes Runtime of the title in minutes
genres A list of up to three genres associated with the title
title <- read.csv("Data/title.tsv", sep = "\t")
head(title)
IMDb All Films - Ratings Dataset

Description: The ‘Ratings’ (title.ratings.tsv.gz) is the second of two datasets from IMDb that were utilised. The IMDb website allows users to submit a rating for any title. These aggregated ratings are stored in the ‘Ratings’ dataset. The ratings data is weighted, rather than a raw data average. The method by which ratings data is aggregated by IMDb is proprietary information and is not publicly available. The dataset is formatted with tab separated variables (.tsv).

Source: https://datasets.imdbws.com/

Variables: The dataset includes 3 variables and 1,076,708 observations. “\\N” is used to denote missing or NULL data. The definitions of these variables are given in the table below:

Variable Description
tconst The unique IMDb id of the film
averageRating The weighted average of user ratings for that title
numVotes The number of votes that a title has received
ratings <- read.csv("Data/ratings.tsv", sep = "\t")
head(ratings)

Merging Datasets

The three individual datasets will be modified prior to merging. Currently, while unique IMDb ID is present in each dataset, it is not in a consistent format, and therefore cannot be used for merging the datasets. Furthermore, the datasets originating from IMDb contain more information than is required, variables in these datasets will only be retained if they are relevant. The changes that will be made are:

#This is code to remove variables that are not of interest
#Year and Title are removed from the titles dataset because this information is also contained in the bechdel dataset. The title variable in the titles dataset is formatted with special characters being HTML encoded, making the data harder to work with than the corresponding title variable from the 'bechdel' dataset.
title %<>%
  select(-primaryTitle, -startYear, -titleType, -originalTitle, -isAdult, -endYear)

#This is code to rename variables in each datasets
bechdel %<>%
  rename(bid = "ï..id", score = rating)

title %<>%  
  rename(imdbid = tconst)

ratings %<>%
  rename(imdbid = tconst, rating = averageRating, votes = numVotes)

#This code is used to append 'tt' to 'imdbid' values in the bechdel datasets

bechdel %<>%
  mutate(imdbid = sub("^", "tt", imdbid))

The primary purpose for this merged dataset is to perform data exploration and analysis for films in the Bechdel Data sets. For this reason, films for which there is no Bechdel data available are not of interest. In essence, for any given film in the bechdel dataset, data from the title and ratings datasets will supplement that data. Films from the title and ratings datasets for which there is no bechdel data will not be included in the merged dataset.

# This is code to merge the three datasets.
bechdel_complete <- bechdel%>%
  left_join(title,by="imdbid")%>%
  left_join(ratings,by = "imdbid")

This is a sample of the merged dataset, bechdel_complete.

head(bechdel_complete)

Understand

The merged dataset, bechdel_complete contains many different data types, many of which are currently not assigned correctly and will need to be converted. The current structure of bechdel_complete is shown below:

str(bechdel_complete)
## 'data.frame':    8574 obs. of  9 variables:
##  $ bid           : chr  "8040" "5433" "6200" "5444" ...
##  $ imdbid        : chr  "tt0392728" "tt0000003" "tt0132134" "tt0000014" ...
##  $ year          : chr  "1888" "1892" "1895" "1895" ...
##  $ title         : chr  "Roundhay Garden Scene" "Pauvre Pierrot" "Execution of Mary, Queen of Scots, The" "Tables Turned on the Gardener" ...
##  $ score         : chr  "0" "0" "0" "0" ...
##  $ runtimeMinutes: chr  "1" "4" "1" "1" ...
##  $ genres        : chr  "Documentary,Short" "Animation,Comedy,Romance" "History,Short" "Comedy,Short" ...
##  $ rating        : num  7.3 6.5 6.7 7.1 5.8 5.3 7.4 6.7 7.5 6 ...
##  $ votes         : int  5413 1349 1912 4615 1083 962 10328 2796 2889 722 ...

The table below shows the required data types for each variable. Variables for which the current data type is inconsistent with the required data type are converted.

Variable Required Data Type
bid Character
imdbid Character
score Ordered Factor (0<1<2<3)
title Character
year Date (see Note)
runtimeMinutes Integer
genres Character (see Note)
rating Numeric
votes Integer

Note:

# This is code to convert data types
bechdel_complete$year <- as.Date(paste(bechdel_complete$year,1,1), '%Y %m %d')
bechdel_complete$score <- factor(bechdel_complete$score, ordered = TRUE, levels = c("0", "1", "2", "3"))
bechdel_complete$runtimeMinutes <- as.integer(bechdel_complete$runtimeMinutes)
## Warning: NAs introduced by coercion
str(bechdel_complete)
## 'data.frame':    8574 obs. of  9 variables:
##  $ bid           : chr  "8040" "5433" "6200" "5444" ...
##  $ imdbid        : chr  "tt0392728" "tt0000003" "tt0132134" "tt0000014" ...
##  $ year          : Date, format: "1888-01-01" "1892-01-01" ...
##  $ title         : chr  "Roundhay Garden Scene" "Pauvre Pierrot" "Execution of Mary, Queen of Scots, The" "Tables Turned on the Gardener" ...
##  $ score         : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ runtimeMinutes: int  1 4 1 1 1 1 1 3 1 1 ...
##  $ genres        : chr  "Documentary,Short" "Animation,Comedy,Romance" "History,Short" "Comedy,Short" ...
##  $ rating        : num  7.3 6.5 6.7 7.1 5.8 5.3 7.4 6.7 7.5 6 ...
##  $ votes         : int  5413 1349 1912 4615 1083 962 10328 2796 2889 722 ...

Scan I

All variables in the bechdel_complete dataset were scanned for missing values, special values, and obvious errors (i.e. inconsistencies).

First the bechdel_complete data set will be scanned for instances of “\\N”. It is noted in the schema for the IMDb database that this value is used to indicate missing or NULL data.

# This is the code to find values of \\N in the dataset.
sapply(bechdel_complete, function(x) sum(as.character(x) == "\\N", na.rm = TRUE))
##            bid         imdbid           year          title          score 
##              0              0              0              0              0 
## runtimeMinutes         genres         rating          votes 
##              0              6              0              0

Six instances of “\\N” were identified in bechdel_complete they will be replaced with “”.

# This is the code to replace \\N with NA
bechdel_complete$genres[which(bechdel_complete$genres == "\\N")] <- NA

Next bechdel_complete will be scanned for “NA” values.

# This is code for identifying the number of NA values in each column
colSums(is.na(bechdel_complete))
##            bid         imdbid           year          title          score 
##              0              0              0              0              0 
## runtimeMinutes         genres         rating          votes 
##             38             18             35             35
#This is code for calculating the sum of in complete rows in the dataset, along with the total number of observations in the dataset.
sum(!complete.cases(bechdel_complete))
## [1] 59
nrow(bechdel_complete)
## [1] 8574

As the number of incomplete observations is relatively small compared to the size of the overall dataset (<1%), incomplete rows will be removed from bechdel_complete as a result.

bechdel_complete <- bechdel_complete[complete.cases(bechdel_complete),]

Finally, numeric columns will be checked for special values (NaN, Inf, or -Inf)

#This is code to find special numeric values. The output will be the sum of special characters in each column in the dataset. non-numeric columns will return NULL

is.special <- function(x){
if (is.numeric(x)) sum(is.infinite(x) | is.nan(x))
}
sapply(bechdel_complete, is.special)
## $bid
## NULL
## 
## $imdbid
## NULL
## 
## $year
## NULL
## 
## $title
## NULL
## 
## $score
## NULL
## 
## $runtimeMinutes
## [1] 0
## 
## $genres
## NULL
## 
## $rating
## [1] 0
## 
## $votes
## [1] 0

No special characters were identified in numeric columns.

Scan II

In this section, all numeric variables will be scanned for outliers. By constructing a boxplot for each numeric variable we are able to visualise outliers.

#This code is to create boxplots for the rating and votes variables.
bechdel_complete$rating %>% boxplot(main = "Box Plot of Ratings", ylab="Ratings")

bechdel_complete$votes %>% boxplot(main = "Box Plot of Votes", ylab="Votes")

While a large number of outliers appear to be present in rating and votes variables through inspection of the boxplots above, it is important to take into account the context in which these variables will be used. For the following reasons, outliers for these variables will not be removed.

bechdel_complete[which(bechdel_complete$votes > 1500000),]

Conversely, the runtimeMinutes variable may need to be considered with more scrutiny. A boxplot for this variable is presented below.

#This code is to create boxplots for the runtimeMinutes variables.
bechdel_complete$runtimeMinutes %>% boxplot(main = "Box Plot of Runtime", ylab="Runtime (mins)")

#This is code to view the summary statistics, including the upper and lower outlier fence.
bechdel_complete %>% summarise(Min = min(runtimeMinutes),
                                         Q1 = quantile(runtimeMinutes,probs = .25),
                                         Median = median(runtimeMinutes),
                                         Q3 = quantile(runtimeMinutes,probs = .75),
                                         Max = max(runtimeMinutes),
                                         Mean = mean(runtimeMinutes),
                                         SD = sd(runtimeMinutes),
                                         upperFence = quantile(runtimeMinutes,probs = .75)+(1.5*(quantile(runtimeMinutes,probs = .75)-quantile(runtimeMinutes,probs = .25))),
                                         lowerFence = quantile(runtimeMinutes,probs = .25)-(1.5*(quantile(runtimeMinutes,probs = .75)-quantile(runtimeMinutes,probs = .25))),
                                         n = n())
#This is code to count the number of outliers.
length (which( bechdel_complete$runtimeMinutes > 151| bechdel_complete$runtimeMinutes < 55  ))
## [1] 450

The range of values for runtimeMinutes is very large and some of the variables are questionable in the context of this dataset. The aim of this dataset is to perform data exploration and analysis to reveal insights about films and the bechdel test. In this context films that exist within the dataset should abide by the constraints of what a normal person would reasonably expect a film to look like. For example, it is reasonable to assume that an average person would not expect film would not be less than say 30 minutes long, or longer than 4 or 5 hours. However, the specific numbers are subjective.

This makes dealing with outliers difficult for the runtimeMinutes variable. The variable is highly skewed and contains entries as long as 24 hours, and as short as several minutes. Further inspection of these outliers also reveals that they are real films, often shorts from the dawn of cinema, or multi-part episodic films for TV.

Handling outliers for this variable will not be as simple as deleting all variables that fall outside the outlier fence. We can see that this would equate to 450 films, and would omit all films over ~2.5 hours.

The decision was made to handle outliers by using a z-score approach, however this requires that the underlying data resembles a normal distribution. In the following section runtimeMinutes will be transformed such that it follows a normal distribution. Once this is achieved, outliers will be revisited in the ‘Outliers’ section.

Transform

Several transformation methods were attempted, however, the BoxCox method was most successful at removing the high level of skew in the dataset. The alternative methods that were not as successful have not been presented in interests of brevity.

#This code shows a histogram of the runtimeMinutes variable
hist((bechdel_complete$runtimeMinutes),main = "Histogram of Runtime", xlab="Runtime (mins)")

#This code shows a histogram of the runtimeMinutes variable transformed via the BoxCox method
hist(BoxCox((bechdel_complete$runtimeMinutes), lambda = "auto"),main = "Histogram of BoxCox(Runtime)", xlab="BoxCox(Runtime (mins))")

#This code is to store the transformed runtime in as a new variable
bechdel_complete %<>%
  mutate(runtimeBoxCox = BoxCox((runtimeMinutes), lambda = "auto"))

Outliers

Now that the runtime has been transformed and is resembles a normal distribution, outliers can be detected using the z score method and handled accordingly.Using a same method as before, first the transformed runtime variable, runtimeBoxCox will be visualised in a boxplot.

# This is code to visualise the transformed runtime variable in a boxplot
bechdel_complete$runtimeBoxCox %>% boxplot(main = "Box Plot of BoxCox(Runtime)", ylab="BoxCox(Runtime (mins))")

As demonstrated above, the boxplot for runtime now appears roughly symmetric and roughly normal (as shown in the previous section). Once a z-score is assigned we can review how many fall outside of the threshold of ‘3’ or ‘-3’. These will be outliers.

# This is the code to assign a z-score. 
z.scores <- BoxCox((bechdel_complete$runtimeMinutes), lambda = "auto") %>% scores(type = "z")
z.scores %>% summary()
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -8.97178 -0.30587  0.01141  0.00000  0.41658 11.06439
length (which( abs(z.scores) >3 ))
## [1] 147
bechdel_complete[which( abs(z.scores) >3 ),]

If these films identified as outliers are removed from the dataset, some films may remain in the dataset that are slightly longer or slightly shorter than what most people would consider of a normal film. This option appears better than other alternatives considered:

The outliers were removed (<5% of the overall data), and a box plot and histogram of the runtimeMinutes and runtimeBoxCox are presented.

# This is the code to remove z-score outliers. 
bechdel_complete <- bechdel_complete[which(!abs(z.scores) >3 ),]

hist((bechdel_complete$runtimeMinutes),main = "Histogram of Runtime", xlab="Runtime (mins)")

hist((bechdel_complete$runtimeBoxCox),main = "Histogram of BoxCox(Runtime)", xlab="BoxCox(Runtime (mins))")

bechdel_complete$runtimeMinutes %>% boxplot(main = "Box Plot of Runtime", ylab="Runtime (mins)")

bechdel_complete$runtimeBoxCox %>% boxplot(main = "Box Plot of BoxCox(Runtime)", ylab="BoxCox(Runtime (mins))")

Tidy & Manipulate Data I

As noted in the earlier sections, the bechdel_complete data frame is untidy due to the genres variable, which contains a list of associated genres for each movie. This violates the rule that each value must have its own cell. The data will be reshaped into a tidy format by ensuring that each genre associated with a film is stored in its own cell.
Once the data is reshaped, the genres variable is converted to a factor.

# This is the code to separate genres within a single row and separate them across many rows. The resulting genres variable is converted into a factor data type 
bechdel_complete %<>%
  mutate(genres = strsplit(genres, ","))%>%
  unnest(genres)%>%
  mutate(genres = factor(genres))

Now the bechdel_complete dataset is in a tidy format. This is because:

head(bechdel_complete)

Tidy & Manipulate Data II

In the bechdel_complete dataset, the score variable is used to record which criteria of the Bechdel Test are passed (0,1,2,or 3). Typically, a film is only considered to have passed the Bechdel Test if it scores a ‘3’. For any other score the film fails the test.
While the specific score is still of use, there may be times where a binary Pass/Fail result is more meaningful for the purposes of data exploration and analysis. For this reason a new variable will be added, named binary. This variable will be a factor with two levels, ‘PASS’ and ‘FAIL’. An observation will have a binary value of ‘PASS’ if it scored ‘3’ on the Bechdel test. Any film scoring less than 3 points will have a binary value of ‘FAIL’.

# this is the code for creating a new variable, based on the 'score' variable.
bechdel_complete %<>%
  mutate(binary = factor(ifelse(score == 3, "PASS", "FAIL" )))
head(bechdel_complete)
str(bechdel_complete)
## tibble [20,377 x 11] (S3: tbl_df/tbl/data.frame)
##  $ bid           : chr [1:20377] "1349" "1349" "1349" "2003" ...
##  $ imdbid        : chr [1:20377] "tt0000574" "tt0000574" "tt0000574" "tt0002101" ...
##  $ year          : Date[1:20377], format: "1906-01-01" "1906-01-01" ...
##  $ title         : chr [1:20377] "Story of the Kelly Gang, The" "Story of the Kelly Gang, The" "Story of the Kelly Gang, The" "Cleopatra" ...
##  $ score         : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 2 2 2 3 3 1 1 3 3 3 ...
##  $ runtimeMinutes: int [1:20377] 70 70 70 100 100 41 41 63 195 195 ...
##  $ genres        : Factor w/ 25 levels "Action","Adult",..: 5 7 9 9 13 21 25 6 9 13 ...
##  $ rating        : num [1:20377] 6.1 6.1 6.1 5.2 5.2 6.1 6.1 5.8 6.3 6.3 ...
##  $ votes         : int [1:20377] 591 591 591 450 450 254 254 202 22299 22299 ...
##  $ runtimeBoxCox : num [1:20377] 6.97 6.97 6.97 7.9 7.9 ...
##  $ binary        : Factor w/ 2 levels "FAIL","PASS": 1 1 1 1 1 1 1 1 1 1 ...

Finally we have our complete dataset.