MATH2349 Data Wrangling

Required packages

# This is the R chunk for the required packages
library(readr)
library(magrittr)
library(dplyr)
library(tidyr)
library(outliers)
library(forecast)

Executive Summary

Several Datasets have been brought together with the intent of performing data exploration and analysis to reveal insights about films and their performance on the bechdel test. Preparing these datasets for use involved merging data from IMDb and bechdeltest.com. Variable data types and column names were adjusted to be more meaningful, a new variable was created to more easily inspect the data and the dataset was reshaped into a tidy format. The dataset was also scanned for missing or special values and obvious errors which were removed. Retention of outliers in some variables was justified, however, outliers were removed from the runtime variable using the z-score method after a BoxCox transformation was applied to remove skew.

Data

The following datasets were used in this assignment:

Bechdel All Films Dataset

Description: The Bechdel Test can be used as a measure of female representation in film. The test considers three separate criteria:

The film contains at least two named women (1 point),
These women have a conversation with each other (2 points), and
The conversation is not about a man (3 points).

The Bechdel All Films Dataset was sourced from the bechdeltest.com, on this website users submit their own assessments of films against the criteria of the Bechdel Test. The All Films dataset contains all films stored on the database.

It is important to note that assessments of films can be subjective, each film assessment uploaded by a user has a comments section where users can debate the accuracy of the assessment. These debates can sometimes lead to moderators changing the score assigned to a given film. These comments were not available in the All Films dataset that was retrieved, as such it is assumed that the score provided for each film is in the dataset is current, and consistent with the general consensus for that film’s score.

Source: https://bechdeltest.com/?list=all

Variables: The dataset includes 5 variables and 8,574 observations. The definitions of these variables are given in the table below:

Variable	Description
ï..id	The bechdeltest.com unique id
imdbid	The unique IMDb id of the film
year	The year the film was released (according to IMDb)
title	The title of the film. Special characters are HTML encoded
rating	The bechdel score, ranging from 0 to 3, based on the criteria that were met.

The data was extracted from the bechdeltest.com using an API and saved to a .csv file. Note that columns are assigned the ‘character’ class on import. This decision was made in order to ensure leading zeroes on the imdbid variable were not lost.

bechdel <- read.csv("Data/bechdel.csv", colClasses = "character")

head(bechdel)

IMDb All Films - Titles Dataset

Description: The Internet Movie Database (IMDb) is an online database containing numerous details of over 6 million unique films and TV shows. Two datasets from IMDb were utilised, the first of which is the ‘Titles’ dataset (title.basics.tsv.gz). This dataset contains a selected information for all films stored on IMDb.The dataset is formatted with tab separated variables (.tsv).

Source: https://datasets.imdbws.com/

Variables: The dataset includes 9 variables and 7,206,390 observations. “\\N” is used to denote missing or NULL data. The definitions of these variables are given in the table below:

Variable	Description
tconst	The unique IMDb id of the film
titleType	The format of the title (movie, short, tvseries, tv episode, etc.)
primaryTitle	The primary title, or title by which it is predominantly known
originalTitle	The original title, in the original language
isAdult	Boolean value - 0 for non-Adult film, 1 for Adult film
startYear	Release Year, or TV series start year
endYear	TV Series end year
runtimeMinutes	Runtime of the title in minutes
genres	A list of up to three genres associated with the title

title <- read.csv("Data/title.tsv", sep = "\t")

head(title)

IMDb All Films - Ratings Dataset

Description: The ‘Ratings’ (title.ratings.tsv.gz) is the second of two datasets from IMDb that were utilised. The IMDb website allows users to submit a rating for any title. These aggregated ratings are stored in the ‘Ratings’ dataset. The ratings data is weighted, rather than a raw data average. The method by which ratings data is aggregated by IMDb is proprietary information and is not publicly available. The dataset is formatted with tab separated variables (.tsv).

Source: https://datasets.imdbws.com/

Variables: The dataset includes 3 variables and 1,076,708 observations. “\\N” is used to denote missing or NULL data. The definitions of these variables are given in the table below:

Variable	Description
tconst	The unique IMDb id of the film
averageRating	The weighted average of user ratings for that title
numVotes	The number of votes that a title has received

ratings <- read.csv("Data/ratings.tsv", sep = "\t")

head(ratings)

Merging Datasets

The three individual datasets will be modified prior to merging. Currently, while unique IMDb ID is present in each dataset, it is not in a consistent format, and therefore cannot be used for merging the datasets. Furthermore, the datasets originating from IMDb contain more information than is required, variables in these datasets will only be retained if they are relevant. The changes that will be made are:

Aligning the name of variables in each dataset containing the unique IMDb ID to enable use of the variable as a key for merging datasets,
Appending the prefix ‘tt’ to IMDb IDs in the bechdel dataset to enable use of the variable as a key for merging datasets,
Updating variable names to ensure clarity once datasets are merged, and
Removal of variables from the title and ratings datasets that are not of interest. This will make the data easier to work with.

#This is code to remove variables that are not of interest
#Year and Title are removed from the titles dataset because this information is also contained in the bechdel dataset. The title variable in the titles dataset is formatted with special characters being HTML encoded, making the data harder to work with than the corresponding title variable from the 'bechdel' dataset.
title %<>%
  select(-primaryTitle, -startYear, -titleType, -originalTitle, -isAdult, -endYear)

#This is code to rename variables in each datasets
bechdel %<>%
  rename(bid = "ï..id", score = rating)

title %<>%  
  rename(imdbid = tconst)

ratings %<>%
  rename(imdbid = tconst, rating = averageRating, votes = numVotes)

#This code is used to append 'tt' to 'imdbid' values in the bechdel datasets

bechdel %<>%
  mutate(imdbid = sub("^", "tt", imdbid))

The primary purpose for this merged dataset is to perform data exploration and analysis for films in the Bechdel Data sets. For this reason, films for which there is no Bechdel data available are not of interest. In essence, for any given film in the bechdel dataset, data from the title and ratings datasets will supplement that data. Films from the title and ratings datasets for which there is no bechdel data will not be included in the merged dataset.

# This is code to merge the three datasets.
bechdel_complete <- bechdel%>%
  left_join(title,by="imdbid")%>%
  left_join(ratings,by = "imdbid")

This is a sample of the merged dataset, bechdel_complete.

head(bechdel_complete)

Understand

The merged dataset, bechdel_complete contains many different data types, many of which are currently not assigned correctly and will need to be converted. The current structure of bechdel_complete is shown below:

str(bechdel_complete)

## 'data.frame':    8574 obs. of  9 variables:
##  $ bid           : chr  "8040" "5433" "6200" "5444" ...
##  $ imdbid        : chr  "tt0392728" "tt0000003" "tt0132134" "tt0000014" ...
##  $ year          : chr  "1888" "1892" "1895" "1895" ...
##  $ title         : chr  "Roundhay Garden Scene" "Pauvre Pierrot" "Execution of Mary, Queen of Scots, The" "Tables Turned on the Gardener" ...
##  $ score         : chr  "0" "0" "0" "0" ...
##  $ runtimeMinutes: chr  "1" "4" "1" "1" ...
##  $ genres        : chr  "Documentary,Short" "Animation,Comedy,Romance" "History,Short" "Comedy,Short" ...
##  $ rating        : num  7.3 6.5 6.7 7.1 5.8 5.3 7.4 6.7 7.5 6 ...
##  $ votes         : int  5413 1349 1912 4615 1083 962 10328 2796 2889 722 ...

The table below shows the required data types for each variable. Variables for which the current data type is inconsistent with the required data type are converted.

Variable	Required Data Type
bid	Character
imdbid	Character
score	Ordered Factor (0<1<2<3)
title	Character
year	Date (see Note)
runtimeMinutes	Integer
genres	Character (see Note)
rating	Numeric
votes	Integer

Note:

The genre variable is in an untidy format. Reshaping this data into a tidy format will be discussed in the ‘Tidy & Manipulate Data I’ Section. Once reshaped, the genre will be converted to a Factor. Reshaping the data now would introduce unnecessary complexities when identifying outliers and transforming skewed variables.
The year is a character strings containing the Year in ‘YYYY’ format. It is not possible to convert these variable to the ‘Date’ format without including month and day. For this reason, the day and month for each observation will be set to 1st January.

# This is code to convert data types
bechdel_complete$year <- as.Date(paste(bechdel_complete$year,1,1), '%Y %m %d')
bechdel_complete$score <- factor(bechdel_complete$score, ordered = TRUE, levels = c("0", "1", "2", "3"))
bechdel_complete$runtimeMinutes <- as.integer(bechdel_complete$runtimeMinutes)

## Warning: NAs introduced by coercion

str(bechdel_complete)

## 'data.frame':    8574 obs. of  9 variables:
##  $ bid           : chr  "8040" "5433" "6200" "5444" ...
##  $ imdbid        : chr  "tt0392728" "tt0000003" "tt0132134" "tt0000014" ...
##  $ year          : Date, format: "1888-01-01" "1892-01-01" ...
##  $ title         : chr  "Roundhay Garden Scene" "Pauvre Pierrot" "Execution of Mary, Queen of Scots, The" "Tables Turned on the Gardener" ...
##  $ score         : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ runtimeMinutes: int  1 4 1 1 1 1 1 3 1 1 ...
##  $ genres        : chr  "Documentary,Short" "Animation,Comedy,Romance" "History,Short" "Comedy,Short" ...
##  $ rating        : num  7.3 6.5 6.7 7.1 5.8 5.3 7.4 6.7 7.5 6 ...
##  $ votes         : int  5413 1349 1912 4615 1083 962 10328 2796 2889 722 ...

Scan I

All variables in the bechdel_complete dataset were scanned for missing values, special values, and obvious errors (i.e. inconsistencies).

First the bechdel_complete data set will be scanned for instances of “\\N”. It is noted in the schema for the IMDb database that this value is used to indicate missing or NULL data.

# This is the code to find values of \\N in the dataset.
sapply(bechdel_complete, function(x) sum(as.character(x) == "\\N", na.rm = TRUE))

##            bid         imdbid           year          title          score 
##              0              0              0              0              0 
## runtimeMinutes         genres         rating          votes 
##              0              6              0              0

Six instances of “\\N” were identified in bechdel_complete they will be replaced with “”.

# This is the code to replace \\N with NA
bechdel_complete$genres[which(bechdel_complete$genres == "\\N")] <- NA

Next bechdel_complete will be scanned for “NA” values.

# This is code for identifying the number of NA values in each column
colSums(is.na(bechdel_complete))

##            bid         imdbid           year          title          score 
##              0              0              0              0              0 
## runtimeMinutes         genres         rating          votes 
##             38             18             35             35

#This is code for calculating the sum of in complete rows in the dataset, along with the total number of observations in the dataset.
sum(!complete.cases(bechdel_complete))

## [1] 59

nrow(bechdel_complete)

## [1] 8574

As the number of incomplete observations is relatively small compared to the size of the overall dataset (<1%), incomplete rows will be removed from bechdel_complete as a result.

bechdel_complete <- bechdel_complete[complete.cases(bechdel_complete),]

Finally, numeric columns will be checked for special values (NaN, Inf, or -Inf)

#This is code to find special numeric values. The output will be the sum of special characters in each column in the dataset. non-numeric columns will return NULL

is.special <- function(x){
if (is.numeric(x)) sum(is.infinite(x) | is.nan(x))
}
sapply(bechdel_complete, is.special)

## $bid
## NULL
## 
## $imdbid
## NULL
## 
## $year
## NULL
## 
## $title
## NULL
## 
## $score
## NULL
## 
## $runtimeMinutes
## [1] 0
## 
## $genres
## NULL
## 
## $rating
## [1] 0
## 
## $votes
## [1] 0

No special characters were identified in numeric columns.

Scan II

In this section, all numeric variables will be scanned for outliers. By constructing a boxplot for each numeric variable we are able to visualise outliers.

#This code is to create boxplots for the rating and votes variables.
bechdel_complete$rating %>% boxplot(main = "Box Plot of Ratings", ylab="Ratings")

bechdel_complete$votes %>% boxplot(main = "Box Plot of Votes", ylab="Votes")

While a large number of outliers appear to be present in rating and votes variables through inspection of the boxplots above, it is important to take into account the context in which these variables will be used. For the following reasons, outliers for these variables will not be removed.

rating is a subjective value that is already aggregated by IMDb. It is conceivable that a film could have an aggregated score anywhere in the range of 1 to 10. Outlier scores could be explained by a low number of users who have submitted a rating for that film, or simply that the film was exceptionally good or bad.
The votes variable is only likely to be used in analysis of this data to gain more insight into the rating that has been assigned to a film. For a popular film it is plausible that over a million votes could be accrued, and in that case such a high number of votes may give greater confidence in the aggregated rating for that film. Furthermore, in any assessment related to ratings of films in the bechdel_complete dataset, it is likely that the analyst would consider omitting films with a very low number of votes due to a reduced confidence in the rating. Note that movies with a very low number of votes are quite common, and are not outliers in the data set. In the table below we can see a list of films that have accrued a very high number of votes. These are all known to be very popular films.

bechdel_complete[which(bechdel_complete$votes > 1500000),]

Conversely, the runtimeMinutes variable may need to be considered with more scrutiny. A boxplot for this variable is presented below.

#This code is to create boxplots for the runtimeMinutes variables.
bechdel_complete$runtimeMinutes %>% boxplot(main = "Box Plot of Runtime", ylab="Runtime (mins)")

#This is code to view the summary statistics, including the upper and lower outlier fence.
bechdel_complete %>% summarise(Min = min(runtimeMinutes),
                                         Q1 = quantile(runtimeMinutes,probs = .25),
                                         Median = median(runtimeMinutes),
                                         Q3 = quantile(runtimeMinutes,probs = .75),
                                         Max = max(runtimeMinutes),
                                         Mean = mean(runtimeMinutes),
                                         SD = sd(runtimeMinutes),
                                         upperFence = quantile(runtimeMinutes,probs = .75)+(1.5*(quantile(runtimeMinutes,probs = .75)-quantile(runtimeMinutes,probs = .25))),
                                         lowerFence = quantile(runtimeMinutes,probs = .25)-(1.5*(quantile(runtimeMinutes,probs = .75)-quantile(runtimeMinutes,probs = .25))),
                                         n = n())

#This is code to count the number of outliers.
length (which( bechdel_complete$runtimeMinutes > 151| bechdel_complete$runtimeMinutes < 55  ))

## [1] 450

The range of values for runtimeMinutes is very large and some of the variables are questionable in the context of this dataset. The aim of this dataset is to perform data exploration and analysis to reveal insights about films and the bechdel test. In this context films that exist within the dataset should abide by the constraints of what a normal person would reasonably expect a film to look like. For example, it is reasonable to assume that an average person would not expect film would not be less than say 30 minutes long, or longer than 4 or 5 hours. However, the specific numbers are subjective.

This makes dealing with outliers difficult for the runtimeMinutes variable. The variable is highly skewed and contains entries as long as 24 hours, and as short as several minutes. Further inspection of these outliers also reveals that they are real films, often shorts from the dawn of cinema, or multi-part episodic films for TV.

Handling outliers for this variable will not be as simple as deleting all variables that fall outside the outlier fence. We can see that this would equate to 450 films, and would omit all films over ~2.5 hours.

The decision was made to handle outliers by using a z-score approach, however this requires that the underlying data resembles a normal distribution. In the following section runtimeMinutes will be transformed such that it follows a normal distribution. Once this is achieved, outliers will be revisited in the ‘Outliers’ section.

Transform

Several transformation methods were attempted, however, the BoxCox method was most successful at removing the high level of skew in the dataset. The alternative methods that were not as successful have not been presented in interests of brevity.

#This code shows a histogram of the runtimeMinutes variable
hist((bechdel_complete$runtimeMinutes),main = "Histogram of Runtime", xlab="Runtime (mins)")

#This code shows a histogram of the runtimeMinutes variable transformed via the BoxCox method
hist(BoxCox((bechdel_complete$runtimeMinutes), lambda = "auto"),main = "Histogram of BoxCox(Runtime)", xlab="BoxCox(Runtime (mins))")

#This code is to store the transformed runtime in as a new variable
bechdel_complete %<>%
  mutate(runtimeBoxCox = BoxCox((runtimeMinutes), lambda = "auto"))

Outliers

Now that the runtime has been transformed and is resembles a normal distribution, outliers can be detected using the z score method and handled accordingly.Using a same method as before, first the transformed runtime variable, runtimeBoxCox will be visualised in a boxplot.

# This is code to visualise the transformed runtime variable in a boxplot
bechdel_complete$runtimeBoxCox %>% boxplot(main = "Box Plot of BoxCox(Runtime)", ylab="BoxCox(Runtime (mins))")

As demonstrated above, the boxplot for runtime now appears roughly symmetric and roughly normal (as shown in the previous section). Once a z-score is assigned we can review how many fall outside of the threshold of ‘3’ or ‘-3’. These will be outliers.

# This is the code to assign a z-score. 
z.scores <- BoxCox((bechdel_complete$runtimeMinutes), lambda = "auto") %>% scores(type = "z")
z.scores %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -8.97178 -0.30587  0.01141  0.00000  0.41658 11.06439

length (which( abs(z.scores) >3 ))

## [1] 147

bechdel_complete[which( abs(z.scores) >3 ),]

If these films identified as outliers are removed from the dataset, some films may remain in the dataset that are slightly longer or slightly shorter than what most people would consider of a normal film. This option appears better than other alternatives considered:

Remove a much larger number of films, based on outlier fences identified in the non-transformed runtime variable.
Retain the outliers, leaving many films within the dataset that do not abide by the constraints of what a regular person would reasonably expect of a film.
Use another method such as capping to handle the outliers. This was not used because the recorded runtimes of these films, while unusual, are correct.

The outliers were removed (<5% of the overall data), and a box plot and histogram of the runtimeMinutes and runtimeBoxCox are presented.

# This is the code to remove z-score outliers. 
bechdel_complete <- bechdel_complete[which(!abs(z.scores) >3 ),]

hist((bechdel_complete$runtimeMinutes),main = "Histogram of Runtime", xlab="Runtime (mins)")

hist((bechdel_complete$runtimeBoxCox),main = "Histogram of BoxCox(Runtime)", xlab="BoxCox(Runtime (mins))")

bechdel_complete$runtimeMinutes %>% boxplot(main = "Box Plot of Runtime", ylab="Runtime (mins)")

bechdel_complete$runtimeBoxCox %>% boxplot(main = "Box Plot of BoxCox(Runtime)", ylab="BoxCox(Runtime (mins))")

Tidy & Manipulate Data I

As noted in the earlier sections, the bechdel_complete data frame is untidy due to the genres variable, which contains a list of associated genres for each movie. This violates the rule that each value must have its own cell. The data will be reshaped into a tidy format by ensuring that each genre associated with a film is stored in its own cell.
Once the data is reshaped, the genres variable is converted to a factor.

# This is the code to separate genres within a single row and separate them across many rows. The resulting genres variable is converted into a factor data type 
bechdel_complete %<>%
  mutate(genres = strsplit(genres, ","))%>%
  unnest(genres)%>%
  mutate(genres = factor(genres))

Now the bechdel_complete dataset is in a tidy format. This is because:

Each variable has its own column
Each observation has its own row
Each value has its own cell

head(bechdel_complete)

Tidy & Manipulate Data II

In the bechdel_complete dataset, the score variable is used to record which criteria of the Bechdel Test are passed (0,1,2,or 3). Typically, a film is only considered to have passed the Bechdel Test if it scores a ‘3’. For any other score the film fails the test.
While the specific score is still of use, there may be times where a binary Pass/Fail result is more meaningful for the purposes of data exploration and analysis. For this reason a new variable will be added, named binary. This variable will be a factor with two levels, ‘PASS’ and ‘FAIL’. An observation will have a binary value of ‘PASS’ if it scored ‘3’ on the Bechdel test. Any film scoring less than 3 points will have a binary value of ‘FAIL’.

# this is the code for creating a new variable, based on the 'score' variable.
bechdel_complete %<>%
  mutate(binary = factor(ifelse(score == 3, "PASS", "FAIL" )))

head(bechdel_complete)

str(bechdel_complete)

## tibble [20,377 x 11] (S3: tbl_df/tbl/data.frame)
##  $ bid           : chr [1:20377] "1349" "1349" "1349" "2003" ...
##  $ imdbid        : chr [1:20377] "tt0000574" "tt0000574" "tt0000574" "tt0002101" ...
##  $ year          : Date[1:20377], format: "1906-01-01" "1906-01-01" ...
##  $ title         : chr [1:20377] "Story of the Kelly Gang, The" "Story of the Kelly Gang, The" "Story of the Kelly Gang, The" "Cleopatra" ...
##  $ score         : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 2 2 2 3 3 1 1 3 3 3 ...
##  $ runtimeMinutes: int [1:20377] 70 70 70 100 100 41 41 63 195 195 ...
##  $ genres        : Factor w/ 25 levels "Action","Adult",..: 5 7 9 9 13 21 25 6 9 13 ...
##  $ rating        : num [1:20377] 6.1 6.1 6.1 5.2 5.2 6.1 6.1 5.8 6.3 6.3 ...
##  $ votes         : int [1:20377] 591 591 591 450 450 254 254 202 22299 22299 ...
##  $ runtimeBoxCox : num [1:20377] 6.97 6.97 6.97 7.9 7.9 ...
##  $ binary        : Factor w/ 2 levels "FAIL","PASS": 1 1 1 1 1 1 1 1 1 1 ...

Finally we have our complete dataset.