Abstract

In a landscape shaped by the explosion of digital content and shifting audience preferences, this report embarks on a journey through the Movies dataset from Kaggle to unravel the secrets of cinematic success. Armed with meticulous data cleaning and advanced statistical techniques, we uncover the critical ingredients that define the modern blockbuster. These insights guide strategic decisions for our production company as we navigate the ever-changing currents of audience taste, ensuring our films resonate deeply and soar at the box office.

Introduction & Objectives

The film industry stands as a dynamic and ever-evolving landscape, characterized by its ability to captivate global audiences and shape cultural narratives. Within this realm of creativity and commerce, understanding the intricacies of what makes a movie successful is paramount for filmmakers, producers, and industry stakeholders alike. This report digs into the world of film analysis, aiming to uncover the underlying factors driving box office success.

As the context for our investigation, we recognize the increasing importance of data-driven decision-making in an industry traditionally driven by intuition and creativity. In today’s competitive marketplace, filmmakers and production companies face mounting pressures to deliver commercially successful films while balancing artistic integrity and audience preferences. Against this context, our research seeks to illuminate the key determinants of movie revenue, providing insights for the industry.

With a focus on revenue as the primary metric of success, our analysis spans various dimensions, including genre preferences, production budgets, release timing, and geographical considerations. By dissecting these factors, we aim to uncover patterns and trends that offer valuable guidance for film production company executives seeking to optimize their strategies and maximize returns on investment.

Through a systematic examination of revenue data and industry trends, this report strives to empower stakeholders with actionable intelligence, fostering informed decision-making and strategic innovation in the realm of film production and distribution. By explaining the underlying drivers of box office success, we aim to contribute to the ongoing dialogue surrounding the art and business of filmmaking, ultimately shaping a more prosperous and vibrant future for our production company.

Data Description & Cleaning

The analysis in this report draws upon the Movies dataset obtained from Kaggle, encompassing various attributes of movies such as budgets, revenues, genres, release dates, production countries, and production companies. This dataset offers a comprehensive view of the global film industry, spanning diverse genres, languages, and production contexts. Prior to analysis, rigorous preprocessing and cleaning were conducted to ensure data integrity and reliability. This involved addressing missing values through imputation or exclusion, removing duplicates to prevent redundancy, and standardizing data formats for consistency. The goal of these cleaning procedures was to enhance the dataset’s quality and usability, providing a solid foundation for robust analysis of film revenue trends and patterns.

Data types

Searching for errors and areas of improvement

Observations:

  • adult column data type could be logical

  • belongs_to_collection column must be cleaned for better understanding of data

  • budget column data type should be integer or numeric

  • genres column must be cleaned for better understanding of data

  • original_language column could be factor

  • popularity column must be numeric

  • production_companies column must be cleaned to show the relevant information

  • production_countries column must be cleaned to avoid redundant and irrelevant information

  • release_date column data type should be Date

  • spoken_languages column must be cleaned to avoid redundant and irrelevant information

  • status column data type could be factor

  • video column is not necessary for analysis purposes

Converting Data Types

Variables to be converted:
* adult (logical)
* budget (numeric)
* original_language (factor)
* popularity (numeric)
* release_date (Date)
* status (factor)
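
The conversions listed above could be sketched as follows. This is a minimal illustration, not the code used for the report: the toy data frame `raw` and all of its values are invented, and it assumes every column was read as character, that `adult` holds strings such as "True" and "False", and that dates use the `YYYY-MM-DD` format.

```r
# Toy stand-in for the raw movies data; all values are invented for illustration
raw <- data.frame(
  adult             = c("False", "True", "False"),
  budget            = c("30000000", "0", "not available"),
  original_language = c("en", "fr", "en"),
  popularity        = c("21.9", "3.8", "0.5"),
  release_date      = c("1995-10-30", "1995-12-15", ""),
  status            = c("Released", "Released", "Rumored"),
  stringsAsFactors  = FALSE
)

converted <- transform(
  raw,
  adult             = as.logical(adult),                    # "True"/"False" -> TRUE/FALSE
  budget            = suppressWarnings(as.numeric(budget)), # unparseable text becomes NA
  original_language = factor(original_language),
  popularity        = suppressWarnings(as.numeric(popularity)),
  release_date      = as.Date(release_date, format = "%Y-%m-%d"), # empty strings become NA
  status            = factor(status)
)

str(converted)
```

Values that cannot be parsed end up as NA rather than stopping the conversion, which is consistent with the few NA rows that appear in the summaries later in the report.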

Tidying the dataset

Variables that must be cleaned. Each part of the variable should be separated into its own column and the tags erased.
* belongs_to_collection: Contains “id”,“name”,“poster_path”,“backdrop_path”
* genres: Contains “id”,“name”
* production_companies: Contains “name”,“id”
* production_countries: Contains “abbreviated_name”,“name”
* spoken_languages: Contains “abbreviated_name”,“name”

belongs_to_collection

# Divide string in columns delimiting by ":"
collection <- str_split_fixed(movies$belongs_to_collection, ":", n = Inf)
# Choose index 1-5
collection <- collection[,1:5]
# Choose index 2-5
collection <- collection[, c(2,3,4,5)]
summary(collection)
##       V1                 V2                 V3                 V4           
##  Length:45466       Length:45466       Length:45466       Length:45466      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
# Convert to a data frame
collection <- as.data.frame(collection)
#Eliminate punctuation signs except "." and "/"
collection <- collection %>% 
  mutate(id_collection = str_replace_all(collection$V1, "[[:punct:]&&[^./]]", " "))
collection <- collection %>% 
  mutate(name_collection = str_replace_all(collection$V2, "[[:punct:]&&[^./]]", " "))
collection <- collection %>% 
  mutate(poster_path_collection = str_replace_all(collection$V3, "[[:punct:]&&[^./]]", " "))
collection <- collection %>% 
  mutate(backdrop_path_collection = str_replace_all(collection$V4, "[[:punct:]&&[^./]]", " "))
# Remove specific words from data frame
collection$id_collection <- str_remove(collection$id_collection,"name")
collection$name_collection <- str_remove(collection$name_collection,"poster path")
collection$poster_path_collection <- str_remove(collection$poster_path_collection,"backdrop path")
# Choose specific columns
collection <- collection[,5:8]
summary(collection)
##  id_collection      name_collection    poster_path_collection
##  Length:45466       Length:45466       Length:45466          
##  Class :character   Class :character   Class :character      
##  Mode  :character   Mode  :character   Mode  :character      
##  backdrop_path_collection
##  Length:45466            
##  Class :character        
##  Mode  :character
# Remove trailing whitespace
collection$id_collection <- str_trim(collection$id_collection, "right")
collection$name_collection <- str_trim(collection$name_collection, "right")
collection$poster_path_collection <- str_trim(collection$poster_path_collection, "right")
collection$backdrop_path_collection <- str_trim(collection$backdrop_path_collection, "right")
summary(collection)
##  id_collection      name_collection    poster_path_collection
##  Length:45466       Length:45466       Length:45466          
##  Class :character   Class :character   Class :character      
##  Mode  :character   Mode  :character   Mode  :character      
##  backdrop_path_collection
##  Length:45466            
##  Class :character        
##  Mode  :character
# Add separated collection to movies data frame
movies <- cbind(movies, collection)
# Remove the 'belongs_to_collection' column from the 'movies' data frame
movies <- movies %>% 
  select(-belongs_to_collection)
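
Splitting on ":" works when no value contains a colon, but a collection name or path that does contain one would shift the columns. As a hedged alternative, each field can be captured by its key instead. The sketch below uses base R only; the helper `capture_first()` is hypothetical, the example strings are invented, and the dictionary-like format of the raw field is an assumption. Names that contain an apostrophe would need extra handling.

```r
# Hypothetical examples of the raw field; the exact format in the file is an assumption
x <- c(
  "{'id': 101, 'name': 'Example Saga: Part One Collection', 'poster_path': '/abc.jpg', 'backdrop_path': '/def.jpg'}",
  ""
)

# Return the first capture group of `pattern` for each element, or NA when nothing matches
capture_first <- function(text, pattern) {
  m <- regmatches(text, regexec(pattern, text))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_, character(1))
}

collection_sketch <- data.frame(
  id_collection            = as.integer(capture_first(x, "'id': ([0-9]+)")),
  name_collection          = capture_first(x, "'name': '([^']*)'"),
  poster_path_collection   = capture_first(x, "'poster_path': '([^']*)'"),
  backdrop_path_collection = capture_first(x, "'backdrop_path': '([^']*)'"),
  stringsAsFactors = FALSE
)

print(collection_sketch)
```

Movies without a collection get NA in every column here, rather than an empty string.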

genres

new_genres <- str_split_fixed(movies$genres, ":", n = Inf)
new_genres <- new_genres[,1:7]
new_genres <- new_genres[, c(3,5,7)]
new_genres <- as.data.frame(new_genres)
new_genres <- new_genres %>% 
  mutate(genre1 = str_replace_all(new_genres$V1, "[[:punct:]]", " "))
new_genres <- new_genres %>% 
  mutate(genre2 = str_replace_all(new_genres$V2, "[[:punct:]]", " "))
new_genres <- new_genres %>% 
  mutate(genre3 = str_replace_all(new_genres$V3, "[[:punct:]]", " "))
new_genres$genre1 <- str_remove(new_genres$genre1,"id")
new_genres$genre2 <- str_remove(new_genres$genre2,"id")
new_genres$genre3 <- str_remove(new_genres$genre3,"id")
new_genres <- new_genres[,4:6]
# Trim leading and trailing spaces in genre columns
new_genres <- new_genres %>% 
  mutate(genre1 = str_trim(genre1),
         genre2 = str_trim(genre2),
         genre3 = str_trim(genre3))
new_genres$genre1 <- as.factor(new_genres$genre1)
new_genres$genre2 <- as.factor(new_genres$genre2)
new_genres$genre3 <- as.factor(new_genres$genre3)
summary(new_genres)
##          genre1           genre2                  genre3     
##  Drama      :11966           :17001                  :31481  
##  Comedy     : 8820   Drama   : 6308   Thriller       : 2235  
##  Action     : 4489   Comedy  : 3265   Romance        : 2045  
##  Documentary: 3415   Romance : 2859   Drama          : 1677  
##  Horror     : 2619   Thriller: 2523   Comedy         :  911  
##             : 2442   Action  : 1546   Science Fiction:  873  
##  (Other)    :11715   (Other) :11964   (Other)        : 6244
movies <- cbind(movies, new_genres)
# Remove the original 'genres' column from the 'movies' data frame
movies <- movies %>% 
  select(-genres)

production_countries

new_production_countries <- str_split_fixed(movies$production_countries, ":", n = Inf)
new_production_countries <- new_production_countries[,2:7]
new_production_countries <- new_production_countries[, c(2,4,6)]
new_production_countries <- as.data.frame(new_production_countries)
new_production_countries <- new_production_countries %>% 
  mutate(country1 = str_replace_all(new_production_countries$V1, "[[:punct:]]", " "))
new_production_countries <- new_production_countries %>% 
  mutate(country2 = str_replace_all(new_production_countries$V2, "[[:punct:]]", " "))
new_production_countries <- new_production_countries %>% 
  mutate(country3 = str_replace_all(new_production_countries$V3, "[[:punct:]]", " "))
new_production_countries$country1 <- str_remove(new_production_countries$country1,"iso 3166 1")
new_production_countries$country2 <- str_remove(new_production_countries$country2,"iso 3166 1")
new_production_countries$country3 <- str_remove(new_production_countries$country3,"iso 3166 1")
new_production_countries <- new_production_countries[,4:6]
# Trim leading and trailing spaces in country columns
new_production_countries <- new_production_countries %>% 
  mutate(country1 = str_trim(country1),
         country2 = str_trim(country2),
         country3 = str_trim(country3))
new_production_countries$country1 <- as.factor(new_production_countries$country1)
new_production_countries$country2 <- as.factor(new_production_countries$country2)
new_production_countries$country3 <- as.factor(new_production_countries$country3)
summary(new_production_countries)
##                      country1                         country2    
##  United States of America:18425                           :38439  
##                          : 6288   United States of America: 2131  
##  United Kingdom          : 3070   France                  :  917  
##  France                  : 2705   United Kingdom          :  659  
##  Canada                  : 1498   Germany                 :  528  
##  Japan                   : 1493   Italy                   :  482  
##  (Other)                 :11987   (Other)                 : 2310  
##                      country3    
##                          :43314  
##  United States of America:  410  
##  France                  :  247  
##  Germany                 :  232  
##  United Kingdom          :  231  
##  Italy                   :  153  
##  (Other)                 :  879
movies <- cbind(movies, new_production_countries)
# Remove the original 'production_countries' column from the 'movies' data frame
movies <- movies %>% 
  select(-production_countries)

spoken_languages

new_spoken_languages <- str_split_fixed(movies$spoken_languages, ":", n = Inf)
new_spoken_languages <- new_spoken_languages[,2:7]
new_spoken_languages <- new_spoken_languages[, c(2,4,6)]
new_spoken_languages <- as.data.frame(new_spoken_languages)
new_spoken_languages <- new_spoken_languages %>% 
  mutate(country1_language = str_replace_all(new_spoken_languages$V1, "[[:punct:]]", " "))
new_spoken_languages <- new_spoken_languages %>% 
  mutate(country2_language = str_replace_all(new_spoken_languages$V2, "[[:punct:]]", " "))
new_spoken_languages <- new_spoken_languages %>% 
  mutate(country3_language = str_replace_all(new_spoken_languages$V3, "[[:punct:]]", " "))
new_spoken_languages$country1_language <- str_remove(new_spoken_languages$country1_language,"iso 639 1")
new_spoken_languages$country2_language <- str_remove(new_spoken_languages$country2_language,"iso 639 1")
new_spoken_languages$country3_language <- str_remove(new_spoken_languages$country3_language,"iso 639 1")
new_spoken_languages <- new_spoken_languages[,4:6]
# Trim leading and trailing spaces in language columns
new_spoken_languages <- new_spoken_languages %>% 
  mutate(country1_language = str_trim(country1_language),
         country2_language = str_trim(country2_language),
         country3_language = str_trim(country3_language))
new_spoken_languages$country1_language <- as.factor(new_spoken_languages$country1_language)
new_spoken_languages$country2_language <- as.factor(new_spoken_languages$country2_language)
new_spoken_languages$country3_language <- as.factor(new_spoken_languages$country3_language)
summary(new_spoken_languages)
##  country1_language country2_language country3_language
##  English :26840            :37707            :43018   
##          : 4062    English : 1593    Deutsch :  328   
##  Français: 2428    Français: 1477    Español :  308   
##  Italiano: 1411    Deutsch :  919    Français:  234   
##  日本語  : 1388    Español :  782    English :  232   
##  Deutsch : 1301    Italiano:  616    Italiano:  225   
##  (Other) : 8036    (Other) : 2372    (Other) : 1121
movies <- cbind(movies, new_spoken_languages)
# Remove the original 'spoken_languages' column from the 'movies' data frame
movies <- movies %>% 
  select(-spoken_languages)

production_companies

new_production_companies <- str_split_fixed(movies$production_companies, ":", n = Inf)
new_production_companies <- new_production_companies[,1:6]
new_production_companies <- new_production_companies[, c(2,4,6)]
new_production_companies <- as.data.frame(new_production_companies)
new_production_companies <- new_production_companies %>% 
  mutate(company1 = str_replace_all(new_production_companies$V1, "[[:punct:]]", " "))
new_production_companies <- new_production_companies %>% 
  mutate(company2 = str_replace_all(new_production_companies$V2, "[[:punct:]]", " "))
new_production_companies <- new_production_companies %>% 
  mutate(company3 = str_replace_all(new_production_companies$V3, "[[:punct:]]", " "))
new_production_companies$company1 <- str_remove(new_production_companies$company1,"id")
new_production_companies$company2 <- str_remove(new_production_companies$company2,"id")
new_production_companies$company3 <- str_remove(new_production_companies$company3,"id")
new_production_companies <- new_production_companies[,4:6]
# Trim leading and trailing spaces in company columns
new_production_companies <- new_production_companies %>% 
  mutate(company1 = str_trim(company1),
         company2 = str_trim(company2),
         company3 = str_trim(company3))
new_production_companies$company1 <- as.factor(new_production_companies$company1)
new_production_companies$company2 <- as.factor(new_production_companies$company2)
new_production_companies$company3 <- as.factor(new_production_companies$company3)
summary(new_production_companies)
##                                    company1    
##                                        :11881  
##  Paramount Pictures                    :  998  
##  Metro Goldwyn Mayer  MGM              :  852  
##  Twentieth Century Fox Film Corporation:  780  
##  Warner Bros                           :  757  
##  Universal Pictures                    :  754  
##  (Other)                               :29444  
##                      company2                         company3    
##                          :28458                           :36419  
##  Warner Bros             :  270   Warner Bros             :  130  
##  Metro Goldwyn Mayer  MGM:  150   Canal+                  :  109  
##  Canal+                  :  124   Metro Goldwyn Mayer  MGM:   44  
##  Touchstone Pictures     :   75   Relativity Media        :   42  
##  Universal Pictures      :   71   TF1 Films Production    :   29  
##  (Other)                 :16318   (Other)                 : 8693
movies <- cbind(movies, new_production_companies)
# Remove the original 'production_companies' column from the 'movies' data frame
movies <- movies %>% 
  select(-production_companies)
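
The genres, production_countries, spoken_languages, and production_companies blocks repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do that in base R, not the method used above: `first_names()` is a hypothetical function, the input strings are invented, and it assumes each raw value is a list of dictionary-like entries with a 'name' key. It also avoids removing the first "id" found anywhere in the text, which may clip names that happen to contain those letters.

```r
# Hypothetical raw values; the real column format is an assumption
x <- c(
  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
  "[{'name': 'Guidance Films', 'id': 1}]",
  "[]"
)

# Extract every 'name' value and keep the first `n`, padding with NA
first_names <- function(text, n = 3, prefix = "name") {
  hits <- regmatches(text, gregexpr("(?<='name': ')[^']*", text, perl = TRUE))
  padded <- do.call(rbind, lapply(hits, function(h) h[seq_len(n)]))
  out <- as.data.frame(padded, stringsAsFactors = FALSE)
  names(out) <- paste0(prefix, seq_len(n))
  out
}

genres_sketch <- first_names(x, n = 3, prefix = "genre")
print(genres_sketch)
```

The same call with `prefix = "country"` or `prefix = "company"` would cover the other columns, and the result could be converted to factors and bound to `movies` with `cbind()` as in the blocks above.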

Out of Range Values

Variables that need to be checked:

  • budget: Does not have a range, but could be important to detect outliers for further analysis
  • popularity: Has a range from 0 to 100
  • runtime: Does not have a range, but could be important to detect outliers for further analysis
  • vote_average: Has a range from 0 to 10
  • vote_count: Does not have a range, but could be important to detect outliers for further analysis
  • revenue: Too many movies have 0 revenue; it could be better to impute or remove those values
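
The checks in this list can be summarised with one small helper. The sketch below is illustrative only: `range_report()` is a hypothetical function, the popularity values are invented, and the 0 to 100 limits are the ones stated in the list above.

```r
# Count NAs, zeroes, and values outside an optional [lower, upper] range
range_report <- function(x, lower = -Inf, upper = Inf) {
  c(
    n_na    = sum(is.na(x)),
    n_zero  = sum(x == 0, na.rm = TRUE),
    n_below = sum(x < lower, na.rm = TRUE),
    n_above = sum(x > upper, na.rm = TRUE)
  )
}

# Made-up popularity values checked against the 0 to 100 range
popularity_toy <- c(0, 1.1, 3.7, 99.9, 154.8, 547.5, NA)
report <- range_report(popularity_toy, lower = 0, upper = 100)
print(report)
```

For variables without a defined range, such as budget or runtime, the default limits leave `n_below` and `n_above` at zero and the helper only reports missing values and zeroes.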

budget

# Generate a histogram
ggplot(movies,aes(budget))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
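
Because most budgets are zero and the largest ones reach hundreds of millions, a histogram on the raw scale is dominated by a single bar. One option, sketched below under assumptions, is to plot only the positive values on a log10 axis with an explicit number of bins. The data frame `toy` and its values are invented, and the choice of 20 bins is arbitrary.

```r
library(ggplot2)

# Toy data standing in for movies$budget; values are made up
toy <- data.frame(budget = c(0, 0, 0, NA, 5e4, 2e6, 1.5e7, 4e7, 1.2e8, 3.8e8))

# Keep strictly positive, non-missing budgets so that log10 is defined
positive <- subset(toy, !is.na(budget) & budget > 0)

p <- ggplot(positive, aes(budget)) +
  geom_histogram(bins = 20) +
  scale_x_log10() +
  labs(x = "budget (log10 scale)", y = "number of movies")
print(p)
```

Setting `bins` explicitly also avoids the `stat_bin()` message about the default of 30 bins.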

# Sort and obtain highest 10 and lowest 10 rows by budget
sorted_budget <- sort(movies$budget)
tail(sorted_budget,10) %>% format(scientific = FALSE)
##  [1] "250000000" "255000000" "258000000" "260000000" "260000000" "260000000"
##  [7] "270000000" "280000000" "300000000" "380000000"
head(sorted_budget,10)
##  [1] 0 0 0 0 0 0 0 0 0 0
# Count zeroes
movies %>% count(budget == 0)
##   budget == 0     n
## 1       FALSE  8890
## 2        TRUE 36573
## 3          NA     3
summary(movies$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         0         0         0   4224579         0 380000000         3

It is highly unlikely that a movie costs 0 dollars to produce. Given that 36,573 movies report this budget, the zeroes most likely mean that the budget information was not available. An imputation method must be applied in this case.

revenue

# Generate a histogram
ggplot(movies,aes(revenue))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Sort and obtain highest 10 and lowest 10 rows by revenue
sorted_revenue <- sort(movies$revenue)
tail(sorted_revenue,10) %>% format(scientific = FALSE)
##  [1] "1262886337" "1274219009" "1342000000" "1405403694" "1506249360"
##  [6] "1513528810" "1519557910" "1845034188" "2068223624" "2787965087"
head(sorted_revenue,10)
##  [1] 0 0 0 0 0 0 0 0 0 0
# Count zeroes
movies %>% count(revenue == 0)
##   revenue == 0     n
## 1        FALSE  7408
## 2         TRUE 38052
## 3           NA     6
summary(movies$revenue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 0.000e+00 0.000e+00 0.000e+00 1.121e+07 0.000e+00 2.788e+09         6

Because budget will be imputed, revenue has to be imputed in the same way to balance out the data and reduce its volatility.

popularity

# Generate a histogram
ggplot(movies,aes(popularity))+geom_histogram(bins=10) + xlim(0,100)
## Warning: Removed 22 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

# Sort and obtain highest 10 and lowest 10 rows by popularity
sorted_popularity <- sort(movies$popularity)
tail(sorted_popularity,10)
##  [1] 154.8010 183.8704 185.0709 185.3310 187.8605 213.8499 228.0327 287.2537
##  [9] 294.3370 547.4883
head(sorted_popularity,10)
##  [1] 0 0 0 0 0 0 0 0 0 0
# Count zeroes
movies %>% count(popularity == 0)
##   popularity == 0     n
## 1           FALSE 45394
## 2            TRUE    66
## 3              NA     6
summary(movies$popularity)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   0.0000   0.3859   1.1277   2.9215   3.6789 547.4883        6

Some movies exceed the 100-point limit, so these values must be adjusted for a better analysis of the data.

runtime

# Generate a histogram
ggplot(movies,aes(runtime))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 263 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Sort and obtain highest 10 and lowest 10 rows by runtime
sorted_runtime <- sort(movies$runtime)
tail(sorted_runtime,10)
##  [1]  840  840  874  877  900  925  931 1140 1140 1256
head(sorted_runtime,10)
##  [1] 0 0 0 0 0 0 0 0 0 0
# Count zeroes
movies %>% count(runtime == 0)
##   runtime == 0     n
## 1        FALSE 43645
## 2         TRUE  1558
## 3           NA   263
summary(movies$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   85.00   95.00   94.13  107.00 1256.00     263

Some movies are listed with a runtime of 0 minutes. A movie cannot have that duration, so these values must be imputed.

vote_average

# Generate a histogram
ggplot(movies,aes(vote_average))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Sort and obtain highest 10 and lowest 10 rows by vote_average
sorted_vote_average <- sort(movies$vote_average)
tail(sorted_vote_average,10)
##  [1] 10 10 10 10 10 10 10 10 10 10
head(sorted_vote_average,10)
##  [1] 0 0 0 0 0 0 0 0 0 0
summary(movies$vote_average)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.000   6.000   5.618   6.800  10.000       6

Vote averages fall within the expected 0 to 10 range, so there is no need to apply an imputation method.

vote_count

# Generate a histogram
ggplot(movies,aes(vote_count))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Sort and obtain highest 10 and lowest 10 rows by vote_count
sorted_vote_count <- sort(movies$vote_count)
tail(sorted_vote_count,10)
##  [1]  9634  9678 10014 10297 11187 11444 12000 12114 12269 14075
head(sorted_vote_count,10)
##  [1] 0 0 0 0 0 0 0 0 0 0
# Count zeroes
movies %>% count(vote_count == 0)
##   vote_count == 0     n
## 1           FALSE 42561
## 2            TRUE  2899
## 3              NA     6
summary(movies$vote_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     3.0    10.0   109.9    34.0 14075.0       6

Vote counts are in order, so there is no need to apply imputation methods to this variable.

Imputation

Based on the out-of-range checks above, only four variables must be imputed:

  • budget: Zeroes are going to be replaced with the mean so that they do not distort the descriptive statistics.
  • revenue: Same treatment as budget.
  • popularity: Out-of-range values are going to be replaced with the range limit of 100 to avoid eliminating them from the dataset.
  • runtime: Zeroes are going to be replaced with the mean so that they do not distort the descriptive statistics.

Very few values were left as NA after the cleaning process carried out for the progress setup deliverable. However, some data are still missing in revenue, runtime, and the vote variables, which could be addressed with MICE.

Advanced Imputation using MICE

For imputation we will be using the MICE package along with the variables detected before which are revenue, runtime, vote_count and vote_average.

Note

I encountered technical difficulties while attempting to use the MICE library for multiple imputation. Despite efforts to resolve these issues using alternative platforms such as Posit and Google Colab, I was unable to overcome the challenges. As a result, I acknowledge that not using the MICE library limited my ability to perform multiple imputation and address missing data comprehensively. Instead, I employed alternative approaches to handle missing data. However, it’s important to acknowledge that these methods may introduce additional uncertainty and potential biases into my analysis, impacting the validity of the results.

# Specifies the characteristics of the imputation
#movies_mice <- mice(movies2,m=1,maxit=50,meth='pmm',seed=500)

# Summarizes the imputation characteristics defined before
#summary(movies_mice)

# Allows to see the result MICE assigned to missing values
#movies_mice$imp$runtime

# Fill the dataset with results from first option
# movies_clean <- complete(movies_mice,1)

# Serves as a way to check if imputation was done successfully
#sum(is.na(movies_clean))

# Generates a density plot
#densityplot(movies_clean)

A density plot can help to determine the effectiveness of the imputation in the dataset. Imputation may become less precise once it gets past certain values, so it is important to check the result and use other methods if necessary.
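
For reference, the commented-out workflow above can be exercised on a small example. The sketch below is not a fix for the difficulties described in the note: it uses `nhanes`, a small all-numeric dataset that ships with the mice package, in place of the movies data, and it lowers `maxit` to 5 to keep the run short. Restricting the input to a few numeric columns and calling `mice::complete()` by its full name (another loaded package may export a function called `complete`) are precautions that might help, but whether they apply to the problem encountered here is an assumption.

```r
library(mice)

# nhanes is a small example dataset shipped with mice;
# it stands in here for a numeric subset of the movies data
imp <- mice(nhanes, m = 1, maxit = 5, method = "pmm", seed = 500, printFlag = FALSE)

# Fill the missing cells using the first (and only) imputed dataset
nhanes_complete <- mice::complete(imp, 1)

# Number of cells still missing after imputation
sum(is.na(nhanes_complete))
```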

budget

summary(movies$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         0         0         0   4224579         0 380000000         3

The average budget, with the zeroes included, is 4,224,579.

# Replace zeroes with the budget mean
movies$budget_original <- movies$budget  # Create a copy of the original column
movies$budget <- ifelse(movies$budget == 0, mean(movies$budget, na.rm = TRUE), movies$budget)
summary(movies$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         1   4224579   4224579   7623068   4224579 380000000         3
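
The fill value used above is the mean of the whole column, so the zeroes being replaced also pull that mean down. The sketch below compares it with two alternatives on invented numbers: the mean and the median of the known (non-zero) budgets. It is only an illustration of the trade-off, and the variable names are hypothetical.

```r
# Made-up budgets; zeroes stand for "unknown"
budget_toy <- c(0, 0, 0, 0, 1e6, 5e6, 2e7, 1e8, NA)

known <- budget_toy[!is.na(budget_toy) & budget_toy > 0]

fill_all_mean     <- mean(budget_toy, na.rm = TRUE)  # zeroes included, as in the report
fill_known_mean   <- mean(known)                     # zeroes excluded
fill_known_median <- median(known)                   # less sensitive to very large budgets

# Replace zeroes with the median of known budgets and flag the imputed rows
imputed_flag   <- !is.na(budget_toy) & budget_toy == 0
budget_imputed <- ifelse(imputed_flag, fill_known_median, budget_toy)

c(all_mean = fill_all_mean, known_mean = fill_known_mean, known_median = fill_known_median)
```

Keeping a flag such as `imputed_flag` next to the `budget_original` copy makes it easy to rerun later analyses on the observed values only.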

revenue

summary(movies$revenue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 0.000e+00 0.000e+00 0.000e+00 1.121e+07 0.000e+00 2.788e+09         6

The average revenue, with the zeroes included, is approximately 11,210,000.

# Replace zeroes with the revenue mean
movies$revenue_original <- movies$revenue  # Create a copy of the original column
movies$revenue <- ifelse(movies$revenue == 0, mean(movies$revenue, na.rm = TRUE), movies$revenue)
summary(movies$revenue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 1.000e+00 1.121e+07 1.121e+07 2.059e+07 1.121e+07 2.788e+09         6

popularity

summary(movies$popularity)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   0.0000   0.3859   1.1277   2.9215   3.6789 547.4883        6
movies <- movies %>%
  mutate(popularity_max = pmin(pmax(popularity, 0), 100))
summary(movies$popularity_max)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.386   1.128   2.884   3.679 100.000       6

runtime

summary(movies$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   85.00   95.00   94.13  107.00 1256.00     263

The average movie runtime is 94.13 minutes.

# Replace zeroes and NA values in the 'runtime' column with the average runtime (94)
movies <- movies %>%
  mutate(runtime = ifelse(runtime == 0 | is.na(runtime), 94, runtime))
summary(movies$runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   87.00   95.00   97.35  107.00 1256.00

Since multiple values were NA, they were also replaced with the average.

Duplicates

Most variables in this dataset do not need a duplicate check; for example, it is completely normal for movies to share the same genres, spoken languages, or production companies. However, three variables must be checked for partial duplicates:

  • id
  • imdb_id
  • original_title
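
A partial-duplicate check on one of these keys could look like the sketch below. The data frame `toy` and its values are invented, and keeping the first row per `id` is just one possible rule; the report does not state which row should be kept.

```r
# Toy data with repeated ids; values are invented
toy <- data.frame(
  id             = c(1, 2, 2, 3, 3),
  imdb_id        = c("tt01", "tt02", "tt02", "tt03", "tt03"),
  original_title = c("A", "B", "B", "C", "C (alt)"),
  popularity     = c(1.0, 2.0, 2.5, 3.0, 3.0),
  stringsAsFactors = FALSE
)

# Rows whose id appears more than once, including the first occurrence
dup_ids      <- unique(toy$id[duplicated(toy$id)])
partial_dups <- toy[toy$id %in% dup_ids, ]

# Keep the first row for each id
deduped <- toy[!duplicated(toy$id), ]

print(partial_dups)
print(deduped)
```

Unlike `duplicated(movies)`, which only flags rows that match on every column, this catches rows that share a key but differ elsewhere, such as the two rows with `id` 2 above.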

Full Duplicates

# Counting the total amount of full duplicates
sum(duplicated(movies))
## [1] 17
# Creating a data frame for full duplicates visualization
duplicated_rows <- movies[duplicated(movies), ]
duplicated_rows
##       adult  budget                               homepage     id   imdb_id
## 1466  FALSE 4224579                                        105045 tt0111613
## 9166  FALSE 4224579                                          5511 tt0062229
## 9328  FALSE 4224579                                         23305 tt0295682
## 13376 FALSE 4224579                                        141971 tt1180333
## 16765 FALSE 4224579                                        141971 tt1180333
## 21166 FALSE 4224579                                        119916 tt0080000
## 21855 FALSE 4224579                                        152795 tt1821641
## 22152 FALSE 4224579 http://www.daysofdarknessthemovie.com/  18440 tt0499456
## 23045 FALSE 4224579                                         25541 tt1327820
## 24845 FALSE 4224579           http://www.dealthemovie.com/  11115 tt0446676
## 28861 FALSE 4224579                                        168538 tt0084387
## 29375 FALSE 4224579                                         42495 tt0067306
## 35799 FALSE 4224579                                        159849 tt0173769
## 38872 FALSE 4224579                                         99080 tt0022537
## 40041 FALSE  980000                                        298721 tt2818654
## 40277 FALSE 4224579                                         97995 tt0127834
## 45266 FALSE 4224579                                        265189 tt2121382
##       original_language                   original_title
## 1466                 de                  Das Versprechen
## 9166                 fr                      Le Samouraï
## 9328                 en                      The Warrior
## 13376                fi                         Blackout
## 16765                fi                         Blackout
## 21166                en                      The Tempest
## 21855                en                     The Congress
## 22152                en                 Days of Darkness
## 23045                da                       Broderskab
## 24845                en                             Deal
## 28861                en                             Nana
## 29375                en                        King Lear
## 35799                en Why We Fight: Divide and Conquer
## 38872                en                       The Viking
## 40041                th                        รักที่ขอนแก่น
## 40277                en             Seven Years Bad Luck
## 45266                sv                           Turist
##       overview
## 1466                                                   East-Berlin, 1961, shortly after the erection of the Wall. Konrad, Sophie and three of their friends plan a daring escape to Western Germany. The attempt is successful, except for Konrad, who remains behind. From then on, and for the next 28 years, Konrad and Sophie will attempt to meet again, in spite of the Iron Curtain. Konrad, who has become a reputed Astrophysicist, tries to take advantage of scientific congresses outside Eastern Germany to arrange encounters with Sophie. But in a country where the political police, the Stasi, monitors the moves of all suspicious people (such as Konrad's sister Barbara and her husband Harald), preserving one's privacy, ideals and self-respect becomes an exhausting fight, even as the Eastern block begins its long process of disintegration.
## 9166  Hitman Jef Costello is a perfectionist who always carefully plans his murders and who never gets caught.
## 9328  In feudal India, a warrior (Khan) who renounces his role as the longtime enforcer to a local lord becomes the prey in a murderous hunt through the Himalayan mountains.
## 13376 Recovering from a nail gun shot to the head and 13 months of coma, doctor Pekka Valinta starts to unravel the mystery of his past, still suffering from total amnesia.
## 16765 Recovering from a nail gun shot to the head and 13 months of coma, doctor Pekka Valinta starts to unravel the mystery of his past, still suffering from total amnesia.
## 21166 Prospero, the true Duke of Milan is now living on an enchanted island with his daughter Miranda, the savage Caliban and Ariel, a spirit of the air. Raising a sorm to bring his brother - the usurper of his dukedom - along with his royal entourage. to the island. Prospero contrives his revenge.
## 21855 More than two decades after catapulting to stardom with The Princess Bride, an aging actress (Robin Wright, playing a version of herself) decides to take her final job: preserving her digital likeness for a future Hollywood. Through a deal brokered by her loyal, longtime agent and the head of Miramount Studios, her alias will be controlled by the studio, and will star in any film they want with no restrictions. In return, she receives healthy compensation so she can care for her ailing son and her digitized character will stay forever young. Twenty years later, under the creative vision of the studio’s head animator, Wright’s digital double rises to immortal stardom. With her contract expiring, she is invited to take part in “The Congress” convention as she makes her comeback straight into the world of future fantasy cinema.
## 22152                                                                                                                                                                                                                                                                                                                                                                                                              When a comet strikes Earth and kicks up a cloud of toxic dust, hundreds of humans join the ranks of the living dead. But there's bad news for the survivors: The newly minted zombies are hell-bent on eradicating every last person from the planet. For the few human beings who remain, going head to head with the flesh-eating fiends is their only chance for long-term survival. Yet their battle will be dark and cold, with overwhelming odds.
## 23045 Former Danish servicemen Lars and Jimmy are thrown together while training in a neo-Nazi group. Moving from hostility through grudging admiration to friendship and finally passion, events take a darker turn when their illicit relationship is uncovered.
## 24845 As an ex-gambler teaches a hot-shot college kid some things about playing cards, he finds himself pulled into the world series of poker, where his protégé is his toughest competition.
## 28861                                                                                                                                                                                                                                                                                                                             In Zola's Paris, an ingenue arrives at a tony bordello: she's Nana, guileless, but quickly learning to use her erotic innocence to get what she wants. She's an actress for a soft-core filmmaker and soon is the most popular courtesan in Paris, parlaying this into a house, bought for her by a wealthy banker. She tosses him and takes up with her neighbor, a count of impeccable rectitude, and with the count's impressionable son. The count is soon fetching sticks like a dog and mortgaging his lands to satisfy her whims.
## 29375                                                                                                                                                                                                                                                                       King Lear, old and tired, divides his kingdom among his daughters, giving great importance to their protestations of love for him. When Cordelia, youngest and most honest, refuses to idly flatter the old man in return for favor, he banishes her and turns for support to his remaining daughters. But Goneril and Regan have no love for him and instead plot to take all his power from him. In a parallel, Lear's loyal courtier Gloucester favors his illegitimate son Edmund after being told lies about his faithful son Edgar. Madness and tragedy befall both ill-starred fathers.
## 35799 The third film of Frank Capra's 'Why We Fight" propaganda film series, dealing with the Nazi conquest of Western Europe in 1940.
## 38872                                                                                                                                                                                                                           Originally called White Thunder, American producer Varick Frissell's 1931 film was inspired by his love for the Canadian Arctic Circle. Set in a beautifully black-and-white filmed Newfoundland, it is the story of a rivalry between two seal hunters that plays out on the ice floes during a hunt. Unsatisfied with the first cut, Frissell arranged for the crew to accompany an actual Newfoundland seal hunt on The SS Viking, on which an explosion of dynamite (carried regularly at the time on Arctic ships to combat ice jams) killed many members of the crew, including Frissell. The film was renamed in honor of the dead.
## 40041 In a hospital, ten soldiers are being treated for a mysterious sleeping sickness. In a story in which dreams can be experienced by others, and in which goddesses can sit casually with mortals, a nurse learns the reason why the patients will never be cured, and forms a telepathic bond with one of them.
## 40277 After breaking a mirror in his home, superstitious Max tries to avoid situations which could bring bad luck but in doing so, causes himself the worst luck imaginable.
## 45266 While holidaying in the French Alps, a Swedish family deals with acts of cowardliness as an avalanche breaks out.
##       popularity                      poster_path release_date  revenue runtime
## 1466    0.122178 /5WFIrBhOOgc0jGmoLxMZwWqCctO.jpg   1995-02-16 11209349     115
## 9166    9.091288 /cvNW8IXigbaMNo4gKEIps0NGnhA.jpg   1967-10-25    39481     105
## 9328    1.967992 /9GlrmbZO7VGyqhaSR1utinRJz3L.jpg   2001-09-23 11209349      86
## 13376   0.411949 /8VSZ9coCzxOCW2wE2Qene1H1fKO.jpg   2008-12-26 11209349     108
## 16765   0.411949 /8VSZ9coCzxOCW2wE2Qene1H1fKO.jpg   2008-12-26 11209349     108
## 21166   0.000018 /gLVRTxaLtUDkfscFKPyYrCtRnTk.jpg   1980-02-27 11209349     123
## 21855   8.534039 /nnKX3ahYoT7P3au92dNgLf4pKwA.jpg   2013-05-16   455815     122
## 22152   1.436085 /tWCyKXHuSrQdLAvNeeVJBnhf1Yv.jpg   2007-01-01 11209349      89
## 23045   2.587911 /q19Q5BRZpMXoNCA4OYodVozfjUh.jpg   2009-10-21 11209349      90
## 24845   6.880365 /kHaBqrrozaG7rj6GJg3sUCiM29B.jpg   2008-01-29 11209349      85
## 28861   1.276602 /pg4PUHRFrgNfACHSh5MITQ2gYch.jpg   1983-06-13 11209349      92
## 29375   0.187901 /xuE1IlUCohbxMY0fiqKTT6d013n.jpg   1971-02-04 11209349     137
## 35799   0.473322 /g21ruZZ3BZeUDuKMb82kejjtufk.jpg   1943-01-01 11209349      57
## 38872   0.002362 /qenjwRvW9itR5pVp4CBkYfhVAOp.jpg   1931-06-21 11209349      70
## 40041   2.535419 /5GasjPRAy5rlEyDOH7MeOyxyQGX.jpg   2015-09-02 11209349     122
## 40277   0.141558 /4J6Ai4C5YRgfRUTlirrJ7QsmJKU.jpg   1921-02-06 11209349      62
## 45266  12.165685 /rGMtc9AtZsnWSSL5VnLaGvx1PI6.jpg   2014-08-15  1359497     118
##         status
## 1466  Released
## 9166  Released
## 9328  Released
## 13376 Released
## 16765 Released
## 21166 Released
## 21855 Released
## 22152 Released
## 23045 Released
## 24845 Released
## 28861 Released
## 29375  Rumored
## 35799 Released
## 38872 Released
## 40041 Released
## 40277 Released
## 45266 Released
##                                                                                    tagline
## 1466                                                               A love, a hope, a wall.
## 9166                                 There is no solitude greater than that of the Samurai
## 9328                                                                                      
## 13376                           Which one is the first to return - memory or the murderer?
## 16765                           Which one is the first to return - memory or the murderer?
## 21166                                                                                     
## 21855                                                                                     
## 22152                                                                                     
## 23045                                                                                     
## 24845                                                                                     
## 28861                                                                                     
## 29375                                                                                     
## 35799                                                                                     
## 38872 Actually produced during the Great Newfoundland Seal Hunt and You see the REAL thing
## 40041                                                                                     
## 40277                                                                                     
## 45266                                                                                     
##                                  title video vote_average vote_count
## 1466                       The Promise False          5.0          1
## 9166                       Le Samouraï False          7.9        187
## 9328                       The Warrior False          6.3         15
## 13376                         Blackout False          6.7          3
## 16765                         Blackout False          6.7          3
## 21166                      The Tempest False          0.0          0
## 21855                     The Congress False          6.4        165
## 22152                 Days of Darkness False          5.0          5
## 23045                      Brotherhood False          7.1         21
## 24845                             Deal False          5.2         22
## 28861   Nana, the True Key of Pleasure False          4.7          3
## 29375                        King Lear False          8.0          3
## 35799 Why We Fight: Divide and Conquer False          5.0          1
## 38872                       The Viking False          0.0          0
## 40041            Cemetery of Splendour False          4.4         50
## 40277             Seven Years Bad Luck False          5.6          4
## 45266                    Force Majeure False          6.8        255
##       id_collection name_collection             poster_path_collection
## 1466                                                                  
## 9166                                                                  
## 9328                                                                  
## 13376                                                                 
## 16765                                                                 
## 21166                                                                 
## 21855                                                                 
## 22152                                                                 
## 23045                                                                 
## 24845                                                                 
## 28861                                                                 
## 29375                                                                 
## 35799        158365    Why We Fight   /fFYBLu2Hnx27CWLOMd425ExDkgK.jpg
## 38872                                                                 
## 40041                                                                 
## 40277                                                                 
## 45266                                                                 
##       backdrop_path_collection      genre1          genre2          genre3
## 1466                                 Drama         Romance                
## 9166                                 Crime           Drama        Thriller
## 9328                             Adventure       Animation           Drama
## 13376                             Thriller         Mystery                
## 16765                             Thriller         Mystery                
## 21166                              Fantasy           Drama Science Fiction
## 21855                                Drama Science Fiction       Animation
## 22152                               Action          Horror Science Fiction
## 23045                                Drama                                
## 24845                               Comedy           Drama                
## 28861                                Drama          Comedy                
## 29375                                Drama         Foreign                
## 35799                     None Documentary                                
## 38872                               Action           Drama         Romance
## 40041                                Drama         Fantasy                
## 40277                               Comedy                                
## 45266                               Comedy           Drama                
##                       country1                 country2 country3
## 1466                   Germany                                  
## 9166                    France                    Italy         
## 9328                    France                  Germany    India
## 13376                  Finland                                  
## 16765                  Finland                                  
## 21166                                                           
## 21855                  Belgium                   France  Germany
## 22152 United States of America                                  
## 23045                   Sweden                  Denmark         
## 24845 United States of America                                  
## 28861                                                           
## 29375                  Denmark           United Kingdom         
## 35799 United States of America                                  
## 38872                                                           
## 40041           United Kingdom United States of America   France
## 40277 United States of America                                  
## 45266                   Norway                   Sweden   France
##       country1_language country2_language country3_language
## 1466            Deutsch                                    
## 9166           Français                                    
## 9328              हिन्दी                                    
## 13376             suomi                                    
## 16765             suomi                                    
## 21166                                                      
## 21855           English                                    
## 22152           English                                    
## 23045             Dansk                                    
## 24845           English                                    
## 28861                                                      
## 29375           English                                    
## 35799           English                                    
## 38872           English                                    
## 40041           English           ภาษาไทย                  
## 40277           English                                    
## 45266          Français             Norsk           svenska
##                        company1
## 1466          Studio Babelsberg
## 9166   Fa cinematografica    id
## 9328                   Filmfour
## 13376      Filmiteollisuus Fine
## 16765      Filmiteollisuus Fine
## 21166                          
## 21855    Pandora Filmproduktion
## 22152                          
## 23045                          
## 24845       Andertainment Group
## 28861              Cannon Group
## 29375 Royal Shakespeare Company
## 35799                          
## 38872                          
## 40041        Match Factory  The
## 40277    Max Linder Productions
## 45266                    Motlys
##                                                            company2
## 1466                          Centre National de la Cinématographie
## 9166  Compagnie Industrielle et Commerciale Cinématographique  CICC
## 9328                                                               
## 13376                                                              
## 16765                                                              
## 21166                                                              
## 21855                                           Entre Chien et Loup
## 22152                                                              
## 23045                                                              
## 24845                                        Crescent City Pictures
## 28861                                      Metro Goldwyn Mayer  MGM
## 29375                                                  Laterna Film
## 35799                                                              
## 38872                                                              
## 40041                                              Louverture Films
## 40277                                                              
## 45266                                           Coproduction Office
##                company3 budget_original revenue_original popularity_max
## 1466       Odessa Films               0                0       0.122178
## 9166     TC Productions               0            39481       9.091288
## 9328                                  0                0       1.967992
## 13376                                 0                0       0.411949
## 16765                                 0                0       0.411949
## 21166                                 0                0       0.000018
## 21855         Opus Film               0           455815       8.534039
## 22152                                 0                0       1.436085
## 23045                                 0                0       2.587911
## 24845 Tag Entertainment               0                0       6.880365
## 28861                                 0                0       1.276602
## 29375   Athena Film A S               0                0       0.187901
## 35799                                 0                0       0.473322
## 38872                                 0                0       0.002362
## 40041     Tordenfilm AS          980000                0       2.535419
## 40277                                 0                0       0.141558
## 45266       Film i Väst               0          1359497      12.165685
# Remove full duplicates
movies <- distinct(movies)
# Verify that no full duplicates remain in the data set
sum(duplicated(movies))
## [1] 0

Full duplicates were removed from the dataset: the check above returns 0, so no fully identical rows remain.

Partial Duplicates

# Check for partial duplicates
movies %>% 
  count(id) %>% 
  filter(n > 1)
##        id n
## 1   10991 2
## 2  109962 2
## 3  110428 2
## 4   12600 2
## 5   13209 2
## 6  132641 2
## 7   14788 2
## 8   15028 2
## 9   22649 2
## 10   4912 2
## 11  69234 2
## 12  77221 2
## 13  84198 2
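The 13 ids above survive the full-duplicate removal, so their copies must differ in at least one column. One way to find out which is to pull every row that shares an id and count the distinct values per column. The sketch below is an illustration only: it runs on a small made-up data frame (`toy`, a hypothetical stand-in for `movies` with invented values) and assumes dplyr 1.0 or later.

```r
library(dplyr)

# Toy stand-in for `movies` (hypothetical values, for illustration only).
# id is stored as character, matching the lexicographic ordering seen above.
toy <- tibble(
  id         = c("10991", "10991", "4912", "4912", "862"),
  title      = c("A", "A", "B", "B", "C"),
  popularity = c(6.48, 10.26, 7.64, 11.33, 21.9)
)

# Keep all rows whose id occurs more than once, with copies placed together
partial_dups <- toy %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(id)

# For each duplicated id, count the distinct values in every other column;
# a count above 1 marks a column in which the copies disagree
differing <- partial_dups %>%
  group_by(id) %>%
  summarise(across(everything(), n_distinct), .groups = "drop")

print(partial_dups)
print(differing)
```

In this toy example the copies agree on `title` and differ only in `popularity`. Whether the same holds for the real data has to be checked on `movies` itself.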
movies %>% 
  count(imdb_id) %>% 
  filter(n > 1)
##      imdb_id  n
## 1            17
## 2          0  3
## 3  tt0022879  2
## 4  tt0046468  2
## 5  tt0082992  2
## 6  tt0100361  2
## 7  tt0157472  2
## 8  tt0235679  2
## 9  tt0270288  2
## 10 tt0287635  2
## 11 tt0454792  2
## 12 tt0499537  2
## 13 tt1701210  2
## 14 tt1736049  2
## 15 tt2018086  2
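The first two rows of this output (a blank imdb_id shared by 17 rows and the value 0 shared by 3) look like missing-value placeholders rather than repeated films, although that reading is an assumption and not something the output confirms. Under that assumption, a minimal sketch of the same check with placeholders excluded could look as follows. It uses a made-up data frame (`toy`, hypothetical) and dplyr's `na_if()`.

```r
library(dplyr)

# Toy stand-in for `movies` (hypothetical values, for illustration only)
toy <- tibble(
  imdb_id = c("", "", "0", "0", "tt0022879", "tt0022879", "tt0114709"),
  title   = c("A", "B", "C", "D", "E", "E", "F")
)

# Treat "" and "0" as missing (an assumption), drop them, then count repeats
dup_imdb <- toy %>%
  mutate(imdb_id = na_if(imdb_id, "")) %>%
  mutate(imdb_id = na_if(imdb_id, "0")) %>%
  filter(!is.na(imdb_id)) %>%
  count(imdb_id) %>%
  filter(n > 1)

print(dup_imdb)
```

Only identifiers that are genuinely repeated remain in the result, which keeps placeholder values from being mistaken for partial duplicates.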
movies %>% 
  count(original_title) %>% 
  filter(n > 1)
##                                                  original_title n
## 1                                                  12 Angry Men 2
## 2                                  20,000 Leagues Under the Sea 4
## 3                                                          2:22 2
## 4                                                  3:10 to Yuma 2
## 5                                                             8 3
## 6                                                             9 2
## 7                                             A Bucket of Blood 2
## 8                                             A Christmas Carol 7
## 9                                             A Dangerous Place 2
## 10                                           A Farewell to Arms 2
## 11                                             A Foreign Affair 2
## 12                                         A Girl in Every Port 2
## 13                                           A Hole in the Head 2
## 14                                          A Kiss Before Dying 2
## 15                                      A Letter to Three Wives 2
## 16                                            A Little Princess 2
## 17                                            A Madea Christmas 2
## 18                                    A Midsummer Night's Dream 4
## 19                                          A Night to Remember 2
## 20                                    A Nightmare on Elm Street 2
## 21                                         A Place at the Table 2
## 22                                          A Raisin in the Sun 2
## 23                                               A Star Is Born 3
## 24                                     A Streetcar Named Desire 3
## 25                                         A Tale of Two Cities 3
## 26                                                      Aakrosh 2
## 27                                                      Aankhen 2
## 28                                                    Abduction 2
## 29                                                         Abel 2
## 30                                                    Abendland 2
## 31                                              Above Suspicion 2
## 32                                                   Absolution 2
## 33                                                         Adam 3
## 34                                    Adventures in Babysitting 2
## 35                                                        After 2
## 36                                               After Midnight 2
## 37                                                    Aftermath 4
## 38                                                     Airborne 2
## 39                                                       Aladin 2
## 40                                                        Alfie 2
## 41                                                        Alice 3
## 42                              Alice Through the Looking Glass 2
## 43                                          Alice in Wonderland 8
## 44                                               All Night Long 2
## 45                               All Quiet on the Western Front 2
## 46                                                    All of Me 2
## 47                                           All the King's Men 2
## 48                                             All the Way Home 2
## 49                                            Alone in the Dark 2
## 50                                                     Altitude 2
## 51                                                       Always 2
## 52                                                  Amber Alert 2
## 53                                                      America 2
## 54                                                 American Gun 2
## 55                                              American Virgin 2
## 56                                                    Americano 2
## 57                                                          Amy 2
## 58                                       An Enemy of the People 2
## 59                                             An Ideal Husband 2
## 60                                           An Inspector Calls 2
## 61                                                    Anastasia 2
## 62                                        And Soon the Darkness 2
## 63                                     And Then There Were None 2
## 64                                                        Angel 4
## 65                                                   Angel Baby 2
## 66                                       Angels in the Outfield 2
## 67                                                        Angst 2
## 68                                                       Animal 2
## 69                                                  Animal Farm 2
## 70                                                      Animals 3
## 71                                                        Anita 3
## 72                                                Anna Karenina 4
## 73                                         Anne of Green Gables 3
## 74                                                        Annie 3
## 75                                                 Annie Oakley 2
## 76                                                Another World 2
## 77                                             April Fool's Day 2
## 78                                               Arabian Nights 2
## 79                                                    Archangel 2
## 80                                                        Arena 3
## 81                                  Around the World in 80 Days 2
## 82                                                    Arrowhead 2
## 83                                                 Arsène Lupin 2
## 84                                                       Arthur 2
## 85                                               As You Like It 2
## 86                                                 Aschenputtel 4
## 87                                       Assault on Precinct 13 2
## 88                                                       Asylum 4
## 89                                                       Attila 3
## 90                                                       August 3
## 91                                                       Aurora 3
## 92                                                       Avalon 3
## 93                                                       Awaken 2
## 94                                             Babes in Toyland 3
## 95                                                  Back Street 2
## 96                                              Back in the Day 2
## 97                                                     Backfire 2
## 98                                                    Backstage 2
## 99                                                     Bad Boys 3
## 100                                                 Bad Company 3
## 101                                                    Bad Girl 2
## 102                                                   Bad Karma 3
## 103                                                        Bait 2
## 104                                                   Ballerina 2
## 105                                                    Bandidos 2
## 106                                                     Bandits 2
## 107                                                    Barabbas 3
## 108                                                     Barbara 2
## 109                                               Bare Knuckles 2
## 110                                                Barely Legal 2
## 111                                               Barnacle Bill 2
## 112                                                   Barricade 2
## 113                                                    Bartleby 2
## 114                                                      Batman 2
## 115                                                Battleground 2
## 116                                                  Beau Geste 2
## 117                                                   Beautiful 2
## 118                                         Beautiful Creatures 2
## 119                                        Beauty and the Beast 5
## 120                                                Bed of Roses 2
## 121                                                   Bedazzled 2
## 122                                          Behind Enemy Lines 2
## 123                                                 Belle Starr 2
## 124                                                     Ben-Hur 2
## 125                                                     Beneath 2
## 126                                                       Benji 2
## 127                                                     Beowulf 2
## 128                                                      Bernie 2
## 129                                                Best Friends 2
## 130                                                    Betrayal 2
## 131                                                    Betrayed 2
## 132                                                  Between Us 2
## 133                                                   Bewitched 2
## 134                                                      Beyond 3
## 135                                   Beyond a Reasonable Doubt 2
## 136                                                    Big Game 2
## 137                                                 Big Trouble 2
## 138                                                     Bigfoot 3
## 139                                               Billy the Kid 2
## 140                                                       Bingo 2
## 141                                              Bird on a Wire 2
## 142                                                       Black 2
## 143                                                 Black Angel 2
## 144                                                Black Beauty 2
## 145                                             Black Christmas 2
## 146                                                Black Friday 2
## 147                                                  Black Gold 2
## 148                                                 Black Magic 2
## 149                                                  Black Moon 2
## 150                                                 Black Sheep 2
## 151                                                 Black Widow 3
## 152                                                   Blackbird 3
## 153                                                    Blackout 4
## 154                                                       Blast 2
## 155                                                       Blind 4
## 156                                                  Blind Date 4
## 157                                                       Bliss 2
## 158                                                  Blood Moon 2
## 159                                                  Blood Ties 2
## 160                                              Blood and Sand 2
## 161                                     Blood: The Last Vampire 2
## 162                                                   Bloodline 2
## 163                                                  Blown Away 2
## 164                                                        Blue 3
## 165                                                  Blue Steel 2
## 166                                                   Bluebeard 2
## 167                                                    Bluebird 2
## 168                                               Body and Soul 3
## 169                                                      Bomber 2
## 170                                                Book of Love 2
## 171                                                  Borderline 4
## 172                                                  Bordertown 2
## 173                                               Born Reckless 3
## 174                                              Born Yesterday 2
## 175                                              Born to Be Bad 2
## 176                                             Born to Be Wild 2
## 177                                               Borrowed Time 2
## 178                                                   Boulevard 3
## 179                                                       Bound 2
## 180                                                         Boy 3
## 181                                              Boy Meets Girl 4
## 182                                                  Brainstorm 2
## 183                                                     Branded 2
## 184                                                       Brave 2
## 185                                              Breaking Point 4
## 186                                       Breaking and Entering 2
## 187                                                    Breakout 3
## 188                                              Breathing Room 2
## 189                                                  Breathless 2
## 190                                         Brewster's Millions 2
## 191                                               Bright Lights 2
## 192                                               Brighton Rock 2
## 193                                                      Broken 2
## 194                                                Broken Arrow 2
## 195                                             Broken Blossoms 2
## 196                                              Broken English 2
## 197                                            Brother's Keeper 2
## 198                                                      Brutal 2
## 199                                                      Bubble 2
## 200                                                       Buddy 2
## 201                                                         Bug 2
## 202                                                      Bullet 2
## 203                                                 Bulletproof 2
## 204                                                       Bully 2
## 205                                                Buried Alive 2
## 206                                       By Dawn's Early Light 2
## 207                                                  By the Sea 3
## 208                                                        Ca$h 2
## 209                                                 Cabin Fever 2
## 210                                                        Cake 2
## 211                                                  California 2
## 212                                                     Camille 3
## 213                                        Camille Claudel 1915 2
## 214                                                      Camino 2
## 215                                                        Camp 2
## 216                                                       Candy 2
## 217                                                   Cape Fear 2
## 218                                                     Caprice 2
## 219                                             Captain America 3
## 220                                             Captain January 2
## 221                                                     Captive 3
## 222                                                  Caravaggio 2
## 223                                                       Cargo 4
## 224                                                      Carmen 4
## 225                                           Carnival of Souls 2
## 226                                                       Carny 2
## 227                                                      Carrie 4
## 228                                                    Casanova 2
## 229                                               Casino Royale 2
## 230                                                  Cat People 2
## 231                                                   Catacombs 2
## 232                                         Catch Me If You Can 2
## 233                                                      Caught 4
## 234                                              Chain Reaction 2
## 235                                            Chain of Command 3
## 236                                                     Chained 2
## 237                                                    Champion 2
## 238                                                       Chaos 3
## 239                                             Charlotte's Web 2
## 240                                                      Charly 2
## 241                                        Cheaper by the Dozen 2
## 242                                                     Chicago 2
## 243                                                Child's Play 2
## 244                                        Children of the Corn 2
## 245                                                  China Gate 2
## 246                                                    Chocolat 3
## 247                                                   Christine 3
## 248                                               Christmas Eve 2
## 249                                    Christmas in Connecticut 2
## 250                                                   Chrysalis 2
## 251                                                       Ciało 2
## 252                                                    Cimarron 2
## 253                                                  Cinderella 7
## 254                                              City of Ghosts 2
## 255                                         Clash of the Titans 2
## 256                                                   Cleopatra 5
## 257                                               Clockstoppers 2
## 258                                                     Cloud 9 2
## 259                                                       Cobra 2
## 260                                                    Cocktail 2
## 261                                                  Cold Sweat 2
## 262                                                     Colegas 2
## 263                                                     College 2
## 264                                                 Coming Soon 2
## 265                                                   Committed 3
## 266                                                     Company 2
## 267                                                  Compulsion 2
## 268                                         Conan the Barbarian 2
## 269                                                  Concussion 2
## 270                                                Coney Island 2
## 271                             Confessions of a Dangerous Mind 2
## 272                                                  Conspiracy 2
## 273                                                   Contagion 2
## 274                                                  Contraband 2
## 275                                                     Control 2
## 276                                                      Cosmos 2
## 277                                                   Countdown 2
## 278                                                 Crackerjack 2
## 279                                                       Crash 2
## 280                                                  Crash Dive 2
## 281                                                  Crawlspace 3
## 282                                                       Crazy 2
## 283                                                 Crazy Horse 2
## 284                                                  Crazy Love 2
## 285                                                    Creature 4
## 286                                                       Creep 2
## 287                                                  Crime Wave 2
## 288                                        Crime and Punishment 2
## 289                                                    Criminal 2
## 290                                                  Crossroads 3
## 291                                                       Crush 4
## 292                                    Cry, the Beloved Country 2
## 293                                                  Cyberbully 2
## 294                                          Cyrano de Bergerac 2
## 295                                                      D.O.A. 3
## 296                                                  Dad's Army 2
## 297                                             Dante's Inferno 2
## 298                                                   Dark City 2
## 299                                                  Dark Horse 3
## 300                                                  Dark House 2
## 301                                              Darkness Falls 2
## 302                                                     Darling 4
## 303                                             Das Versprechen 2
## 304                                           David Copperfield 2
## 305                                              David and Lisa 2
## 306                                            Dawn of the Dead 2
## 307                                                     Day One 2
## 308                                             Day of the Dead 2
## 309                                                    Daylight 2
## 310                                                  Dead Awake 3
## 311                                                  Dead Birds 2
## 312                                                    Dead End 3
## 313                                                   Dead Heat 2
## 314                                                Dead Silence 2
## 315                                               Dead of Night 2
## 316                                                    Deadfall 3
## 317                                                    Deadline 4
## 318                                                        Deal 2
## 319                                              Death Sentence 2
## 320                                                Death Valley 2
## 321                                          Death at a Funeral 2
## 322                                         Death of a Salesman 4
## 323                                                   Deception 2
## 324                                              Deck the Halls 2
## 325                                                    Defiance 2
## 326                                                   Delirious 2
## 327                                                    Delirium 2
## 328                                        Deliver Us from Evil 3
## 329                                                    Dementia 2
## 330                                                Demon Hunter 2
## 331                                                  Der Tunnel 2
## 332                                              Der var engang 2
## 333                                                    Derailed 2
## 334                                                    Deranged 2
## 335                                                   Destroyer 2
## 336                                                   Detention 2
## 337                                                      Detour 4
## 338                                          Devil's Playground 2
## 339                                                  Die Brücke 2
## 340                                            Die goldene Gans 2
## 341                                                   Dillinger 2
## 342                                             Dinosaur Island 2
## 343                                               Dirty Dancing 2
## 344                                                 Dirty Deeds 2
## 345                                                      Django 2
## 346                                              Do Not Disturb 3
## 347                                             Doctor Dolittle 2
## 348                                              Doctor Strange 2
## 349                                                    Dog Tags 2
## 350                                                    Don Juan 2
## 351                                                 Don Quixote 2
## 352                                 Don't Be Afraid of the Dark 2
## 353                                       Don't Drink the Water 2
## 354                                               Don't Hang Up 2
## 355                                            Double Indemnity 2
## 356                                                 Double Take 2
## 357                                              Double Trouble 2
## 358                                              Double Wedding 2
## 359                                                    Downhill 2
## 360                                     Dr. Jekyll and Mr. Hyde 4
## 361                                                     Dracula 3
## 362                                                   Dragonfly 2
## 363                                                Dragonslayer 2
## 364                                                Dreamcatcher 2
## 365                                                   Dreamland 3
## 366                                             Dressed to Kill 3
## 367                                                       Drive 2
## 368                                            Driving Me Crazy 2
## 369                                                       Drone 2
## 370                                                     Dunkirk 2
## 371                                                     Déjà Vu 2
## 372                                        Earth vs. the Spider 2
## 373                                                 Easy Living 2
## 374                                                 Easy Virtue 2
## 375                                                         Eat 3
## 376                                                        Eden 5
## 377                                            Edge of Darkness 2
## 378                                                   El Dorado 2
## 379                                               El Estudiante 2
## 380                                                    El Greco 2
## 381                                                       Elegy 2
## 382                                                    Elephant 2
## 383                                                    Elevator 2
## 384                                                      Elokuu 2
## 385                                      Embrace of the Vampire 2
## 386                                                        Emma 5
## 387                                                      Empire 2
## 388                                       Employee of the Month 2
## 389                                             Enchanted April 2
## 390                                                 Enchantment 2
## 391                                                      Encore 2
## 392                                             End of the Line 2
## 393                                            End of the World 2
## 394                                          Endangered Species 2
## 395                                                     Endgame 3
## 396                                                Endless Love 2
## 397                                                      Enigma 3
## 398                                                     Equinox 2
## 399                                                    Erotikon 2
## 400                                                      Escape 2
## 401                                    Escape to Witch Mountain 2
## 402                                                         Eva 2
## 403                                                     Everest 2
## 404                                                   Evergreen 2
## 405                                                    Evidence 3
## 406                                                        Exit 3
## 407                                                     Exposed 2
## 408                                                  Extraction 2
## 409                                                    FC Venus 2
## 410                                                        Face 2
## 411                                               Fade to Black 3
## 412                                                   Fair Game 3
## 413                                                      Fallen 4
## 414                                                        Fame 2
## 415                                                       Fanny 3
## 416                                                  Fanny Hill 2
## 417                                              Fantastic Four 2
## 418                                  Far from the Madding Crowd 2
## 419                                         Father of the Bride 2
## 420                                                  Fatherland 2
## 421                                                       Fatso 2
## 422                                                       Faust 3
## 423                                                        Fear 2
## 424                                           Fear in the Night 2
## 425                                                       Feast 2
## 426                                                        Feed 2
## 427                                                       Fever 2
## 428                                                 Fever Pitch 2
## 429                                               Final Justice 2
## 430                                             Finders Keepers 2
## 431                                             Fire Down Below 2
## 432                                              Fire with Fire 2
## 433                                              First Daughter 2
## 434                                                        Five 4
## 435                                                Flash Gordon 2
## 436                                                   Flashback 2
## 437                                                    Flawless 2
## 438                                                     Flipper 2
## 439                                        Flowers in the Attic 2
## 440                                                       Focus 2
## 441                                                   Footloose 2
## 442                                           For Love or Money 2
## 443                                                   Forbidden 3
## 444                                                     Forever 2
## 445                                               Forget Me Not 2
## 446                                                    Forsaken 2
## 447                                                    Fortress 3
## 448                                                    Fotograf 2
## 449                                                   Four Sons 2
## 450                                                     Foxfire 2
## 451                                                    Fracture 2
## 452                                                      Framed 2
## 453                                                Frankenstein 6
## 454                                               Frankenweenie 2
## 455                                          Frankie and Johnny 2
## 456                                               Freaky Friday 3
## 457                                                     Freedom 2
## 458                                                    Freeheld 2
## 459                                                     Freeway 2
## 460                                                       Fresh 2
## 461                                             Friday the 13th 2
## 462                                                Fright Night 2
## 463                                  From the Earth to the Moon 2
## 464                                                      Frozen 3
## 465                                      Fun with Dick and Jane 2
## 466                                                  Funny Farm 2
## 467                                                 Funny Games 2
## 468                                                        Fury 2
## 469                                                     Gabriel 2
## 470                                                   Gabrielle 2
## 471                                                      Gambit 2
## 472                                                   Game Over 2
## 473                                                       Gamer 2
## 474                                                    Gaslight 2
## 475                                                      Genius 2
## 476                                                    Geronimo 2
## 477                                                  Get Carter 2
## 478                                                     Get Out 2
## 479                                                     Ghajini 2
## 480                                                       Ghost 2
## 481                                                Ghostbusters 2
## 482                                                       Ghoul 2
## 483                                                        Gigi 2
## 484                                                  Girls Town 2
## 485                                                      Gloria 4
## 486                                                     Go West 2
## 487                                                    Godzilla 2
## 488                                              Going in Style 2
## 489                                          Going the Distance 2
## 490                                                        Gold 4
## 491                                                        Gone 2
## 492                                          Goodbye, Mr. Chips 2
## 493                                                      Gossip 2
## 494                                                       Grace 3
## 495                                                 Grand Hotel 2
## 496                                               Grandma's Boy 2
## 497                                             Graveyard Shift 2
## 498                                          Great Expectations 5
## 499                                                Grey Gardens 2
## 500                                          Gulliver's Travels 4
## 501                                                         Gus 2
## 502                                              Guys and Dolls 2
## 503                                                       Gypsy 2
## 504                                                   Hairspray 2
## 505                                                   Halloween 3
## 506                                                Halloween II 2
## 507                                                      Hamlet 8
## 508                                             Hansel & Gretel 2
## 509                                           Hansel and Gretel 2
## 510                                                   Happiness 2
## 511                                                       Happy 2
## 512                                                   Happy End 2
## 513                                              Happy New Year 2
## 514                                                   Hard Luck 2
## 515                                                    Hardcore 2
## 516                                                     Harvest 2
## 517                                                      Harvey 2
## 518                                                      Hawaii 2
## 519                                                     Hawking 2
## 520                                             Head Over Heels 2
## 521                                            Heartbreak Hotel 2
## 522                                               Heartbreakers 2
## 523                                            Hearts and Minds 2
## 524                                                        Heat 3
## 525                                                      Heaven 2
## 526                                             Heaven Can Wait 2
## 527                                               Heavy Petting 2
## 528                                                      Hector 2
## 529                                                       Heidi 6
## 530                                                       Heist 2
## 531                                                     Held Up 2
## 532                                                       Helen 2
## 533                                                    Hellgate 2
## 534                                              Helter Skelter 2
## 535                                                    Hercules 4
## 536                                                        Hero 2
## 537                                               Hidden Agenda 2
## 538                                               Hide and Seek 2
## 539                                                   High Noon 2
## 540                                                 High School 2
## 541                                                High Society 2
## 542                                                 High Strung 2
## 543                                                   Hiroshima 2
## 544                                                     Holiday 2
## 545                                              Holy Matrimony 2
## 546                                                        Home 5
## 547                                                  Home Movie 2
## 548                                             Home Sweet Home 2
## 549                                       Home for the Holidays 2
## 550                                           Home of the Brave 4
## 551                                                   Honeymoon 2
## 552                                                Hope Springs 2
## 553                                         Horton Hears a Who! 2
## 554                                                 Hot Pursuit 2
## 555                                                       Hotel 3
## 556                                                     Houdini 2
## 557                                                       House 2
## 558                                              House of Cards 2
## 559                                              House of Usher 2
## 560                                                House of Wax 2
## 561                                       House on Haunted Hill 2
## 562                                                Housekeeping 2
## 563                                       How to Make a Monster 2
## 564                                                        Howl 2
## 565                                                      Hunger 3
## 566                                                   Hurricane 2
## 567                                                        Hush 3
## 568                                              I Love Trouble 2
## 569                                    I'll Sleep When I'm Dead 2
## 570                                                 Ice Castles 2
## 571                                           Imitation of Life 2
## 572                                                      Impact 2
## 573                                                     Impulse 3
## 574                                               In Cold Blood 2
## 575                                                In the Blood 2
## 576                                                     Incubus 3
## 577                                               Indian Summer 2
## 578                                                     Inferno 4
## 579                                            Inherit the Wind 2
## 580                                                   Innocence 2
## 581                                                      Inside 2
## 582                                                  Inside Out 3
## 583                                                    Insomnia 2
## 584                                              Into the Storm 2
## 585                                                Into the Sun 2
## 586                                               Into the West 2
## 587                                              Into the Woods 3
## 588                                                    Intruder 2
## 589                                                   Intruders 3
## 590                                          Invaders from Mars 2
## 591                              Invasion of the Body Snatchers 2
## 592                                                  Invincible 2
## 593                                                        Iris 3
## 594                                                    Iron Man 3
## 595                                                   Isolation 2
## 596                                                It Takes Two 2
## 597                                                  It's Alive 2
## 598                                   It's Such a Beautiful Day 2
## 599                                                     Ivanhoe 2
## 600                                                        Jack 3
## 601                                                  Jack Frost 3
## 602                                      Jack and the Beanstalk 4
## 603                                       Jack the Giant Killer 2
## 604                                             Jack the Ripper 2
## 605                                                    Jailbait 2
## 606                                                   Jane Eyre 6
## 607                                     Jason and the Argonauts 2
## 608                                                 Jersey Girl 2
## 609                                                   Jerusalem 2
## 610                                                       Jesus 2
## 611                                      Jesus Christ Superstar 2
## 612                                                      Jigsaw 3
## 613                                                 Joan of Arc 3
## 614                                                      Joanna 2
## 615                                                         Joe 2
## 616                                                        Joey 2
## 617                                                      Joshua 2
## 618                                           Journey Into Fear 2
## 619                          Journey to the Center of the Earth 4
## 620                                                         Joy 2
## 621                                                  Judas Kiss 2
## 622                                                       Judex 2
## 623                                                        Juha 2
## 624                                                       Julia 3
## 625                                                       Julie 2
## 626                                               Julius Caesar 3
## 627                                                      Junior 2
## 628                                                Just My Luck 2
## 629                                              Just for Kicks 2
## 630                                                    Kamikaze 2
## 631                                                 Kid Galahad 2
## 632                                                   Kidnapped 2
## 633                                                        Kiki 3
## 634                                                Kill 'em All 2
## 635                                                 Kill Switch 2
## 636                                          Kill Your Darlings 2
## 637                                                     Killjoy 2
## 638                                                   Kind Lady 2
## 639                                                  King Cobra 2
## 640                                                   King Kong 3
## 641                                                   King Lear 5
## 642                                        King Solomon's Mines 4
## 643                                                Kingdom Come 2
## 644                                                      Kismet 2
## 645                                             Kiss Me Goodbye 2
## 646                                               Kiss of Death 2
## 647                                                 Knock Knock 2
## 648                                                    Knockout 2
## 649                                                    Kon-Tiki 2
## 650                                                  L'Attentat 2
## 651                                             L'Auberge rouge 2
## 652                                                     L'Enfer 2
## 653                                                 L'amour fou 2
## 654                                                         LOL 2
## 655                                     La maschera del demonio 2
## 656                                               La religieuse 2
## 657                                                   Labyrinth 2
## 658                                                    Ladrones 2
## 659                                                    Lamerica 2
## 660                                                      Lassie 2
## 661                                                Last Holiday 2
## 662                                           Last Man Standing 2
## 663                                                  Last Night 2
## 664                                                 Last Resort 2
## 665                                                 Last Summer 2
## 666                                               Late Bloomers 2
## 667                                               Law and Order 2
## 668                                    Le Comte de Monte-Cristo 2
## 669                                                     Le fils 2
## 670                                                 Left Behind 2
## 671                                                      Legacy 3
## 672                                                      Legend 2
## 673                                                      Legion 2
## 674                                              Les Misérables 7
## 675                                    Les liaisons dangereuses 2
## 676                                                   Leviathan 2
## 677                                                        Life 4
## 678                                                      Lifted 2
## 679                                                  Lights Out 2
## 680                                                      Liliom 2
## 681                                                   Limelight 2
## 682                                                   Lionheart 2
## 683                                               Little Dorrit 2
## 684                                      Little Lord Fauntleroy 2
## 685                                                  Little Men 2
## 686                                          Little Miss Marker 2
## 687                                             Little Monsters 2
## 688                                               Little Sister 2
## 689                                                Little Women 3
## 690                                                      Lizzie 2
## 691                                                      Loaded 3
## 692                                                 Local Color 2
## 693                                                        Loft 2
## 694                                                       Logan 2
## 695                                                        Lola 3
## 696                                                      Lolita 2
## 697                                                      London 2
## 698                                       London After Midnight 2
## 699                               Long Day's Journey Into Night 2
## 700                                                Long Weekend 2
## 701                                           Lord of the Flies 2
## 702                                                Lost & Found 2
## 703                                                Lost Horizon 2
## 704                                                        Love 4
## 705                                                 Love Affair 2
## 706                                                    Loverboy 3
## 707                                                    Lovesick 2
## 708                                                      Loving 2
## 709                                                       Lucky 3
## 710                                                  Lucky Luke 2
## 711                                                        Lucy 2
## 712                                                     Lullaby 2
## 713                                                      Luther 2
## 714                                                           M 2
## 715                                                     Macbeth 7
## 716                                                    Mad Love 2
## 717                                                  Madagascar 2
## 718                                               Madame Bovary 4
## 719                                                Mademoiselle 2
## 720                                                    Madhouse 3
## 721                                       Magnificent Obsession 2
## 722                                                      Magnus 2
## 723                                            Mail Order Bride 2
## 724                                                   Malcolm X 2
## 725                                                        Mama 2
## 726                                                     Mammoth 2
## 727                                            Man of the House 2
## 728                                           Man of the Moment 2
## 729                                             Man of the Year 2
## 730                                                 Man on Fire 2
## 731                                                      Maniac 4
## 732                                                   Mannequin 2
## 733                                              Mansfield Park 2
## 734                                                    Margaret 2
## 735                                            Marie Antoinette 2
## 736                                                      Marius 2
## 737                                                        Mars 2
## 738                                                     Martyrs 2
## 739                                                 Masterminds 2
## 740                                                   Mata Hari 2
## 741                                                     Matilda 2
## 742                                                         Max 3
## 743                                                   Mayerling 2
## 744                                                       Medea 2
## 745                                                      Melody 2
## 746                                                Memorial Day 2
## 747                                                 Memory Lane 2
## 748                                               Men with Guns 2
## 749                                                 Mercenaries 2
## 750                                                       Mercy 4
## 751                                                      Meteor 2
## 752                                                  Metropolis 2
## 753                                                     Michael 3
## 754                                                      Mickey 2
## 755                                           Middle of Nowhere 2
## 756                                                    Midnight 3
## 757                                                Midnight Man 2
## 758                                            Mighty Joe Young 2
## 759                                              Mildred Pierce 2
## 760                                                        Milk 2
## 761                                                        Mine 2
## 762                                      Miracle on 34th Street 2
## 763                                                      Mirage 2
## 764                                                     Miranda 3
## 765                                               Mirror Mirror 2
## 766                                              Mischief Night 2
## 767                                                  Miss Julie 2
## 768                                                       Moana 2
## 769                                                     Molière 2
## 770                                                    Momentum 2
## 771                                                       Mommy 2
## 772                                             Monkey Business 2
## 773                                                     Monster 2
## 774                                                     Montana 2
## 775                                                 Monte Carlo 2
## 776                                                        More 2
## 777                                                      Morgan 2
## 778                                               Morning Glory 2
## 779                                                    Mortuary 2
## 780                                                      Mosaic 2
## 781                                                Mother's Day 3
## 782                                               Moving Target 2
## 783                                            Mr. & Mrs. Smith 2
## 784                                                   Mr. Jones 2
## 785                                                   Mr. Right 3
## 786                                      Much Ado About Nothing 2
## 787                                Murder on the Orient Express 2
## 788                                                     Mutants 2
## 789                                        Mutiny on the Bounty 2
## 790                                         My Bloody Valentine 2
## 791                                              My Blue Heaven 2
## 792                                            My Cousin Rachel 2
## 793                                              My Man Godfrey 2
## 794                                            My Sister Eileen 2
## 795                                           Mysterious Island 2
## 796                                          Mädchen in Uniform 2
## 797                                                       Naked 2
## 798                                                        Nana 3
## 799                                                  Nancy Drew 2
## 800                                           Natural Selection 2
## 801                                         Nature of the Beast 2
## 802                                                   Ned Kelly 2
## 803                                                   Neighbors 3
## 804                                         Never a Dull Moment 2
## 805                                                 Next of Kin 3
## 806                                                 Night Moves 2
## 807                                          Night and the City 2
## 808                                          Night of the Demon 2
## 809                                         Night of the Demons 2
## 810                                    Night of the Living Dead 2
## 811                                                   Nightmare 3
## 812                                                  Nightmares 2
## 813                                                        Nina 2
## 814                                                  Nine Lives 3
## 815                                                   No Escape 2
## 816                                                No Good Deed 2
## 817                                           No Man of Her Own 2
## 818                                               No Man's Land 4
## 819                                                  No Smoking 2
## 820                                                  No Way Out 2
## 821                                                        Noah 2
## 822                                                  Noah's Ark 2
## 823                                               Nobody's Fool 2
## 824                                            Nobody's Perfect 2
## 825                                                    Nocturna 2
## 826                                                       Noise 2
## 827                                                    Non-Stop 2
## 828                                                      Normal 2
## 829                                            Nothing Personal 2
## 830                                             Nothing to Lose 2
## 831                                                   Notorious 2
## 832                                                    Oblivion 2
## 833                                                    Obsessed 3
## 834                                                   Obsession 2
## 835                                              Ocean's Eleven 2
## 836                                             Of Mice and Men 2
## 837                                                     Offside 2
## 838                                                   Oklahoma! 2
## 839                                                Oliver Twist 4
## 840                                                On the Beach 2
## 841                                                Once a Thief 2
## 842                                               One More Time 2
## 843                                                    One Week 2
## 844                                                 Open Season 2
## 845                                               Opening Night 2
## 846                                                    Operator 2
## 847                                                       Oscar 2
## 848                                                     Othello 4
## 849                                                    Our Town 2
## 850                                                    Out Cold 2
## 851                                                Out of Reach 2
## 852                                                 Out of Time 2
## 853                                             Out of the Blue 3
## 854                                               Out on a Limb 2
## 855                                                     Outrage 3
## 856                                                Paid in Full 2
## 857                                                         Pan 2
## 858                                                Panic Button 2
## 859                                                   Paparazzi 2
## 860                                                    Paradise 3
## 861                                                     Paradox 2
## 862                                                    Paranoia 2
## 863                                                   Parineeta 2
## 864                                                  Party Girl 3
## 865                                               Party Monster 2
## 866                                                  Passengers 2
## 867                                                     Passion 2
## 868                                                     Patrick 2
## 869                                                    Penelope 2
## 870                                         Pennies from Heaven 3
## 871                                                    Penumbra 2
## 872                                                  Persuasion 3
## 873                                               Pete's Dragon 2
## 874                                                   Peter Pan 5
## 875                                                     Phantom 3
## 876                                                     Phoenix 3
## 877                                              Pie in the Sky 2
## 878                                                  Pilgrimage 2
## 879                                                   Pinocchio 4
## 880                                                     Piranha 2
## 881                                                      Pixels 2
## 882                                                       Pizza 2
## 883                                          Planet of the Apes 2
## 884                                           Playing for Keeps 2
## 885                                             Poil de carotte 2
## 886                                                 Point Break 2
## 887                                                      Poison 2
## 888                                                  Poison Ivy 2
## 889                                        Pokémon 3: The Movie 2
## 890                                                      Police 2
## 891                                                   Pollyanna 2
## 892                                                 Poltergeist 2
## 893                                                     Popcorn 2
## 894                                                       Posse 2
## 895                                                   Possessed 3
## 896                                                  Possession 3
## 897                                                 Pretty Baby 2
## 898                                                       Pride 2
## 899                                         Pride and Prejudice 2
## 900                                                      Priest 2
## 901                                     Priklyucheniya Buratino 2
## 902                                              Prince Valiant 2
## 903                                                    Princess 2
## 904                                                   Prinsessa 2
## 905                                               Private Parts 2
## 906                                                   Project X 3
## 907                                                  Prom Night 2
## 908                                               Promised Land 2
## 909                                                       Proof 2
## 910                                                     Proteus 2
## 911                                                  Providence 2
## 912                                                      Psycho 2
## 913                                              Public Enemies 3
## 914                                                       Pulse 2
## 915                                                      Pusher 2
## 916                                         Pünktchen und Anton 2
## 917                                                           Q 2
## 918                                                     Quartet 3
## 919                                                   Quo Vadis 2
## 920                                                        Race 2
## 921                                                     Raffles 2
## 922                                                        Rage 3
## 923                                                        Rain 2
## 924                                                     Rampage 2
## 925                                                      Ransom 2
## 926                                                  Rapid Fire 2
## 927                                                         Rat 2
## 928                                                    Raw Deal 2
## 929                                                 Rear Window 2
## 930                                                     Rebirth 2
## 931                                                    Reckless 2
## 932                                                    Red Dawn 2
## 933                                                    Red Dust 2
## 934                                                    Red Heat 2
## 935                                             Red Riding Hood 2
## 936                                                      Refuge 2
## 937                                                Regeneration 2
## 938                                                   Rembrandt 3
## 939                                              Remote Control 2
## 940                                                     Requiem 2
## 941                                                  Resistance 2
## 942                                                     Respire 2
## 943                                                    Restless 2
## 944                                                 Restoration 2
## 945                                            Return to Sender 2
## 946                                                  Revolution 2
## 947                                             Rich and Famous 2
## 948                                                 Richard III 3
## 949                                                    Ricochet 2
## 950                                                        Ride 2
## 951                                                   Riff-Raff 2
## 952                                                       Rings 2
## 953                                                         Rio 2
## 954                                                        Riot 2
## 955                                                      Ritual 2
## 956                                                       River 2
## 957                                                  Riverworld 2
## 958                                                        Road 2
## 959                                                  Road House 2
## 960                                                      Roadie 2
## 961                                                  Robin Hood 4
## 962                                             Robinson Crusoe 3
## 963                                                     RoboCop 2
## 964                                                  Rollerball 2
## 965                                                        Roma 2
## 966                                                     Romance 3
## 967                                            Romeo and Juliet 2
## 968                                                        Room 2
## 969                                             Rosemary's Baby 2
## 970                                                        Ruby 2
## 971                                                         Run 3
## 972                                                     Runaway 3
## 973                                              Running Scared 3
## 974                                                        Rush 2
## 975                                                    Sabotage 2
## 976                                                     Sabrina 2
## 977                                                   Sacrifice 3
## 978                                                        Safe 2
## 979                                                  Safe House 2
## 980                                                      Sahara 4
## 981                                                 Salem's Lot 2
## 982                                                      Salomé 3
## 983                                                     Salvage 2
## 984                                                     Samsara 2
## 985                                          Samson and Delilah 3
## 986                                                 San Quentin 2
## 987                                                 Santa Claus 2
## 988                                                 Santa Claws 2
## 989                                                     Savages 2
## 990                                                     Save Me 2
## 991                                                 Saving Face 2
## 992                                                         Saw 2
## 993                                                 Scaramouche 2
## 994                                                   Scarecrow 2
## 995                                                    Scarface 2
## 996                                       School for Scoundrels 2
## 997                                                     Scorned 2
## 998                                                   Screamers 2
## 999                                                     Screwed 2
## 1000                                                    Scrooge 3
## 1001                                                Second Skin 2
## 1002                                             Secret Défense 2
## 1003                                                See No Evil 2
## 1004                                                    Seizure 2
## 1005                                      Sense and Sensibility 4
## 1006                                                  Senseless 2
## 1007                                                  September 2
## 1008                                                    Sequoia 2
## 1009                                                     Serena 2
## 1010                                              Shadow People 2
## 1011                                                      Shaft 2
## 1012                                                  Shakedown 3
## 1013                                                      Shank 2
## 1014                                                        She 3
## 1015                                                    Shelter 3
## 1016                                             Shelter Island 2
## 1017                                                 Shenandoah 2
## 1018                                            Sherlock Holmes 4
## 1019                                                      Shiva 2
## 1020                                            Shock Treatment 2
## 1021                                              Shoot to Kill 2
## 1022                                                  Show Boat 2
## 1023                                                    Sicario 2
## 1024                                               Side Effects 2
## 1025                                      Sidewalks of New York 2
## 1026                                                      Signs 2
## 1027                                             Silent Retreat 2
## 1028                                                       Silk 2
## 1029                                                      Simon 2
## 1030                                               Sink or Swim 2
## 1031                                                      Siren 2
## 1032                                                    Sisters 3
## 1033                                                 Ski Patrol 2
## 1034                                                       Skin 2
## 1035                                                Skinwalkers 2
## 1036                                                    Skylark 2
## 1037                                            Sleeping Beauty 4
## 1038                                                     Sleuth 2
## 1039                                                 Slipstream 2
## 1040                                                    Slither 2
## 1041                                                  Slow Burn 3
## 1042                                                      Smile 2
## 1043                                                   Snatched 2
## 1044                                                 Snow White 4
## 1045                                                    Soldier 2
## 1046                                                       Solo 3
## 1047                                 Somebody Up There Likes Me 2
## 1048                                             Something Wild 2
## 1049                                    Something to Sing About 2
## 1050                                             Son of Dracula 2
## 1051                                                  Sonny Boy 2
## 1052                                                  Sorceress 2
## 1053                                                    Sounder 2
## 1054                                                Sour Grapes 2
## 1055                                           Southern Comfort 2
## 1056                                                    Sparkle 2
## 1057                                                  Spartacus 2
## 1058                                                   Speedway 2
## 1059                                                 Spellbound 2
## 1060                                                     Spider 2
## 1061                                                    Spiders 2
## 1062                                                       Spin 3
## 1063                                                   Splendor 2
## 1064                                               Split Second 2
## 1065                                               Stage Fright 2
## 1066                                               Stage Struck 3
## 1067                                                 Stagecoach 2
## 1068                                                   Standoff 2
## 1069                                                   Stardust 2
## 1070                                                 State Fair 3
## 1071                                                       Stay 2
## 1072                                                      Steel 2
## 1073                                                     Stella 3
## 1074                                                     Stereo 2
## 1075                                                     Stevie 2
## 1076                                                 Still Life 2
## 1077                                                   Stitches 2
## 1078                                                 Stone Cold 2
## 1079                                                  Stonewall 2
## 1080                                                      Storm 2
## 1081                                              Storm Warning 2
## 1082                                             Stormy Weather 2
## 1083                                                   Stranded 3
## 1084                                           Strange Invaders 2
## 1085                                                   Strapped 2
## 1086                                                 Straw Dogs 2
## 1087                                                      Stuck 2
## 1088                                                  Submarine 2
## 1089                                                  Submerged 2
## 1090                                                   Suddenly 2
## 1091                                                      Sugar 3
## 1092                                                 Sugar Hill 2
## 1093                                                     Sultan 2
## 1094                                                Summer Camp 2
## 1095                                             Summer Holiday 2
## 1096                                              Summer School 3
## 1097                                                    Sundown 2
## 1098                                               Sunset Strip 2
## 1099                                                   Sunshine 2
## 1100                                                   Superman 2
## 1101                                                  Supernova 2
## 1102                                                  Superstar 2
## 1103                                                        Sur 2
## 1104                                                   Survivor 3
## 1105                                                    Suspect 3
## 1106                                       Swallows and Amazons 2
## 1107             Sweeney Todd: The Demon Barber of Fleet Street 3
## 1108                                             Sweet November 2
## 1109                                              Sweet Revenge 2
## 1110                                              Sweet Sixteen 2
## 1111                                                   Swingers 2
## 1112                                                     Switch 3
## 1113                                                      Sybil 2
## 1114                                                     Sylvia 2
## 1115                                                       Tabu 2
## 1116                                       Take Me to the River 2
## 1117                                                      Taken 2
## 1118                                                  Tangerine 2
## 1119                                                    Tangled 2
## 1120                                                      Tango 2
## 1121                                                Taras Bulba 2
## 1122                                                     Target 2
## 1123                                                     Tarzan 2
## 1124                                                       Taxi 2
## 1125                                              Teacher's Pet 3
## 1126                               Teenage Mutant Ninja Turtles 2
## 1127                                                    Tempest 2
## 1128                                                   Terminus 2
## 1129                                  Tess of the D'Urbervilles 2
## 1130                                               The 39 Steps 3
## 1131                                              The Abandoned 2
## 1132                                                The Accused 2
## 1133                         The Adventures of Huckleberry Finn 2
## 1134                               The Adventures of Mark Twain 2
## 1135                                       The Age of Innocence 2
## 1136                                                  The Alamo 2
## 1137                                      The Amityville Horror 2
## 1138                                       The Andromeda Strain 2
## 1139                                              The Architect 2
## 1140                                       The Art of the Steal 2
## 1141                                             The Assignment 2
## 1142                                               The Avengers 2
## 1143                                                The Aviator 2
## 1144                                              The Awakening 2
## 1145                                            The Awful Truth 2
## 1146                                               The Bachelor 2
## 1147                                               The Bad Seed 2
## 1148                                                   The Bank 2
## 1149                                                 The Barber 2
## 1150                                                    The Bat 2
## 1151                                               The Beguiled 2
## 1152                                               The Best Man 2
## 1153                                                The Big Fix 2
## 1154                                              The Big Sleep 2
## 1155                                              The Big Steal 2
## 1156                                      The Birth of a Nation 2
## 1157                                          The Biscuit Eater 2
## 1158                                              The Black Cat 3
## 1159                                             The Black Hole 4
## 1160                                             The Black Room 2
## 1161                                             The Bling Ring 2
## 1162                                                   The Blob 2
## 1163                                              The Blue Bird 2
## 1164                                            The Blue Lagoon 2
## 1165                                           The Book of Life 2
## 1166                                              The Borrowers 3
## 1167                                                   The Boss 2
## 1168                                        The Bourne Identity 2
## 1169                                                    The Box 2
## 1170                                                  The Boxer 2
## 1171                                                    The Boy 2
## 1172                                          The Boy Next Door 2
## 1173                                                   The Boys 2
## 1174                                              The Brave One 2
## 1175                                                  The Breed 2
## 1176                                                 The Bridge 2
## 1177                                       The Browning Version 2
## 1178                                              The Buccaneer 2
## 1179                                       The Call of the Wild 3
## 1180                                                 The Caller 2
## 1181                                      The Canterville Ghost 2
## 1182                                                The Captive 2
## 1183                                              The Caretaker 2
## 1184                                        The Case for Christ 2
## 1185                                     The Cat and the Canary 3
## 1186                                         The Cat in the Hat 2
## 1187                                              The Challenge 3
## 1188                                                  The Champ 2
## 1189                            The Charge of the Light Brigade 2
## 1190                                                  The Chase 3
## 1191                                                  The Child 2
## 1192                                                 The Chosen 2
## 1193                                                 The Circle 3
## 1194                                                   The Club 2
## 1195                                      The Cold Light of Day 2
## 1196                                             The Collection 2
## 1197                                              The Collector 2
## 1198                                               The Comedian 3
## 1199                                              The Condemned 2
## 1200                                             The Confession 3
## 1201                                             The Connection 2
## 1202                                                The Cottage 2
## 1203                                  The Count of Monte Cristo 2
## 1204                                               The Covenant 2
## 1205                                                The Crazies 2
## 1206                                                   The Crew 3
## 1207                                                   The Cure 2
## 1208                                                 The Damned 2
## 1209                                                   The Dark 2
## 1210                                             The Dark Horse 2
## 1211                                            The Dark Knight 2
## 1212                                             The Dark Tower 2
## 1213                                            The Dawn Patrol 2
## 1214                                    The Day of the Triffids 2
## 1215                              The Day the Earth Stood Still 2
## 1216                                                   The Dead 2
## 1217                                              The Dead Zone 2
## 1218                                                   The Deal 2
## 1219                                          The Deep Blue Sea 2
## 1220                                           The Defiant Ones 2
## 1221                                                The Dentist 2
## 1222                                            The Desert Song 3
## 1223                                    The Diary of Anne Frank 3
## 1224                                            The Disappeared 2
## 1225                                                 The Double 2
## 1226                                             The Dream Team 2
## 1227                                                The Dresser 2
## 1228                                         The Dunwich Horror 2
## 1229                                                   The Edge 2
## 1230                                           The Elephant Man 2
## 1231                                  The Emperor's New Clothes 2
## 1232                                              The Encounter 2
## 1233                                                    The End 2
## 1234                                      The End of the Affair 2
## 1235                                               The Enforcer 2
## 1236                                               The Escapist 2
## 1237                                                  The Falls 2
## 1238                                                    The Fan 2
## 1239                                   The Fast and the Furious 2
## 1240                                              The Final Cut 2
## 1241                                                   The Firm 2
## 1242                                             The First Time 2
## 1243                                                    The Fly 2
## 1244                                                    The Fog 2
## 1245                                              The Foreigner 2
## 1246                                                 The Forest 2
## 1247                                                 The Forger 2
## 1248                                              The Forgotten 2
## 1249                                                The Formula 2
## 1250                                          The Four Feathers 4
## 1251                        The Four Horsemen of the Apocalypse 2
## 1252                                               The Freshman 2
## 1253                                                  The Front 2
## 1254                                             The Front Page 2
## 1255                                           The Frozen North 2
## 1256                                               The Fugitive 2
## 1257                                                The Gambler 3
## 1258                                                 The Garden 2
## 1259                                        The Gathering Storm 2
## 1260                                               The Gauntlet 2
## 1261                                                The General 2
## 1262                                                The Getaway 2
## 1263                                            The Ghost Train 2
## 1264                                                  The Ghoul 2
## 1265                                                   The Gift 3
## 1266                                                   The Girl 2
## 1267                                         The Girl Next Door 3
## 1268                                           The Girl Said No 2
## 1269                                      The Girl on the Train 2
## 1270                                            The Glass House 2
## 1271                                        The Glass Menagerie 3
## 1272                                         The Good Humor Man 2
## 1273                                               The Good Lie 2
## 1274                                          The Good Shepherd 2
## 1275                                           The Goodbye Girl 2
## 1276                                           The Great Gatsby 4
## 1277                                            The Great Waltz 2
## 1278                                               The Greatest 2
## 1279                                           The Green Hornet 3
## 1280                                               The Guardian 2
## 1281                                             The Gunfighter 2
## 1282                                              The Happening 2
## 1283                                               The Hard Way 2
## 1284                                          The Haunted House 2
## 1285                                               The Haunting 2
## 1286                                         The Heartbreak Kid 2
## 1287                                        The Hills Have Eyes 2
## 1288                                                The Hitcher 2
## 1289                                                   The Hive 2
## 1290                                                   The Hole 2
## 1291                                                 The Hollow 3
## 1292                                                The Hoodlum 2
## 1293                              The Hound of the Baskervilles 6
## 1294                                The Hunchback of Notre Dame 3
## 1295                                                 The Hunted 2
## 1296                                                 The Hunter 2
## 1297                                                The Hunters 4
## 1298                                          The Hunting Party 2
## 1299                                              The Hurricane 2
## 1300                                              The Immigrant 2
## 1301                            The Importance of Being Earnest 3
## 1302                                               The In Crowd 2
## 1303                                                The In-Laws 2
## 1304                                               The Incident 3
## 1305                                    The Initiation of Sarah 2
## 1306                                              The Institute 2
## 1307                                                 The Intern 2
## 1308                                              The Interview 2
## 1309                                               The Intruder 2
## 1310                                        The Invisible Woman 2
## 1311                                             The Invitation 2
## 1312                                                 The Island 2
## 1313                                   The Island of Dr. Moreau 2
## 1314                                            The Italian Job 2
## 1315                                            The Jazz Singer 2
## 1316                                                The Journey 4
## 1317                                            The Jungle Book 3
## 1318                                             The Karate Kid 2
## 1319                                                 The Keeper 3
## 1320                                                    The Key 2
## 1321                                                    The Kid 2
## 1322                                                The Killers 2
## 1323                                             The King and I 2
## 1324                                                   The Kiss 4
## 1325                                             The Ladies Man 2
## 1326                                          The Lady Vanishes 3
## 1327                                            The Ladykillers 2
## 1328                                                   The Land 2
## 1329                                 The Last House on the Left 2
## 1330                                      The Last Man on Earth 2
## 1331                                            The Last Patrol 2
## 1332                                               The Last Run 2
## 1333                                              The Last Word 2
## 1334                                   The Last of the Mohicans 3
## 1335                                The Legend of Sleepy Hollow 3
## 1336                                                 The Letter 3
## 1337                       The Life & Adventures of Santa Claus 2
## 1338                                          The Little Prince 2
## 1339                                                 The Lodger 2
## 1340                                            The Lone Ranger 3
## 1341                                           The Longest Yard 2
## 1342                                                  The Lorax 2
## 1343                                                   The Lost 2
## 1344                                             The Lost World 3
## 1345                                                The Lottery 2
## 1346                                               The Love Bug 2
## 1347                                            The Love Letter 2
## 1348                                                 The Lovers 2
## 1349                                      The Luck of the Irish 2
## 1350                                               The Magician 2
## 1351                                      The Magnificent Seven 2
## 1352                                                  The Maker 2
## 1353                                         The Maltese Falcon 2
## 1354                                                    The Man 2
## 1355                                  The Man Who Knew Too Much 2
## 1356                                   The Man Who Wasn't There 2
## 1357                                   The Man in the Iron Mask 3
## 1358                                   The Manchurian Candidate 2
## 1359                                                   The Mark 2
## 1360                                           The Mark of Cain 2
## 1361                                          The Mark of Zorro 2
## 1362                                                   The Mask 2
## 1363                                                 The Master 2
## 1364                                                The Matador 2
## 1365                                             The Matchmaker 2
## 1366                                                   The Maze 2
## 1367                                               The Mechanic 2
## 1368                                     The Merchant of Venice 2
## 1369                                            The Merry Widow 3
## 1370                                         The Miracle Worker 3
## 1371                                                The Monster 2
## 1372                                          The Morning After 2
## 1373                                                  The Mummy 4
## 1374                                              The Music Man 2
## 1375                                               The Neighbor 3
## 1376                                           The Night Before 2
## 1377                                          The Night Stalker 2
## 1378                                        The Nutty Professor 2
## 1379                                         The Old Dark House 2
## 1380                                    The Old Man and the Sea 2
## 1381                                                   The Omen 2
## 1382                                                    The One 2
## 1383                                              The Open Road 2
## 1384                                                  The Order 2
## 1385                                            The Other Woman 2
## 1386                                               The Outsider 2
## 1387                                                   The Pack 2
## 1388                                                The Package 2
## 1389                                           The Painted Veil 2
## 1390                                               The Paleface 2
## 1391                                            The Parent Trap 2
## 1392                                                The Patriot 2
## 1393                                                  The Patsy 2
## 1394                                                The Penalty 2
## 1395                                      The Perils of Pauline 3
## 1396                                                The Phantom 3
## 1397                                   The Phantom of the Opera 4
## 1398                                The Philadelphia Experiment 2
## 1399                                             The Pied Piper 2
## 1400                                           The Pink Panther 2
## 1401                                    The Pirates of Penzance 2
## 1402                                                    The Pit 2
## 1403                                   The Pit and the Pendulum 3
## 1404                                              The Plainsman 2
## 1405                                     The Poseidon Adventure 2
## 1406                             The Postman Always Rings Twice 2
## 1407                                    The Power and the Glory 2
## 1408                                  The Prince and the Pauper 4
## 1409                                      The Prisoner of Zenda 3
## 1410                                              The Producers 2
## 1411                                                The Program 2
## 1412                                                The Promise 2
## 1413                                            The Proposition 2
## 1414                                                The Prowler 2
## 1415                                               The Punisher 2
## 1416                                                  The Queen 2
## 1417                                     The Quick and the Dead 2
## 1418                                         The Quiet American 2
## 1419                                                 The Racket 2
## 1420                                              The Rainmaker 2
## 1421                                                  The Raven 4
## 1422                                           The Razor's Edge 2
## 1423                                             The Real McCoy 2
## 1424                                              The Reckoning 3
## 1425                                                 The Return 2
## 1426                                               The Revenant 2
## 1427                                                   The Rift 2
## 1428                                                   The Ring 3
## 1429                                                  The River 2
## 1430                                                   The Road 2
## 1431                             The Roman Spring of Mrs. Stone 2
## 1432                                                 The Rookie 2
## 1433                                                  The Saint 2
## 1434                                              The Scapegoat 2
## 1435                                              The Scarecrow 2
## 1436                                         The Scarlet Letter 3
## 1437                                      The Scarlet Pimpernel 2
## 1438                                               The Sea Hawk 2
## 1439                                                 The Search 2
## 1440                                          The Secret Garden 3
## 1441                            The Secret Life of Walter Mitty 2
## 1442                                               The Sentinel 2
## 1443                                             The Shaggy Dog 2
## 1444                                                  The Sheik 2
## 1445                                                   The Show 2
## 1446                                                 The Signal 2
## 1447                                                 The Sitter 2
## 1448                                                The Snowman 2
## 1449                                     The Sound and the Fury 2
## 1450                                       The Spiral Staircase 2
## 1451                                    The Spirit of Christmas 3
## 1452                                                 The Square 2
## 1453                                                The Squeeze 3
## 1454                                             The Stepfather 2
## 1455                                         The Stepford Wives 2
## 1456                                               The Stranger 4
## 1457                                             The Substitute 2
## 1458                                          The Sunshine Boys 2
## 1459                                                The Suspect 2
## 1460                                                   The Take 3
## 1461                                    The Taming of the Shrew 3
## 1462                                                The Tempest 2
## 1463                                       The Ten Commandments 2
## 1464                                   The Theory of Everything 2
## 1465                                        The Thief of Bagdad 2
## 1466                                                  The Thing 2
## 1467                                    The Thomas Crown Affair 2
## 1468                                       The Three Musketeers 7
## 1469                                          The Three Stooges 2
## 1470                                           The Time Machine 2
## 1471                                      The Time of Your Life 2
## 1472                                        The Toolbox Murders 2
## 1473                                                The Tracker 2
## 1474                                                   The Trap 2
## 1475                                                   The Trip 3
## 1476                                      The Trip to Bountiful 2
## 1477                                                 The Tunnel 2
## 1478                                          The Turning Point 2
## 1479                                             The Undefeated 2
## 1480                                             The Underneath 2
## 1481                                           The Unholy Three 2
## 1482                                              The Uninvited 3
## 1483                                                 The Unseen 2
## 1484                                                    The Van 2
## 1485                                                The Verdict 2
## 1486                                           The Violent Kind 2
## 1487                                              The Virginian 2
## 1488                                                  The Visit 4
## 1489                                                   The Void 3
## 1490                                           The Waiting Room 2
## 1491                                           The Walking Dead 2
## 1492                                                    The War 2
## 1493                                            The War at Home 2
## 1494                                          The Wedding March 2
## 1495                                                   The Well 2
## 1496                                             The Wicker Man 2
## 1497                                             The Wild Party 3
## 1498                                    The Wind in the Willows 3
## 1499                                            The Winslow Boy 2
## 1500                                                The Witches 2
## 1501                                         The Wizard of Gore 2
## 1502                                           The Wizard of Oz 2
## 1503                                         The Woman in Black 2
## 1504                                                  The Women 2
## 1505                                          The Wrecking Crew 2
## 1506                                             The Wrong Girl 2
## 1507                                              The Wrong Man 2
## 1508                                               The Yearling 2
## 1509                                           Thick as Thieves 2
## 1510                                                      Thief 3
## 1511                                                   Thin Ice 2
## 1512                                                     Thirst 3
## 1513                                          This Land Is Mine 2
## 1514                                        Three Men in a Boat 2
## 1515                                              Thunderstruck 2
## 1516                                                   Timecode 2
## 1517                                  Tinker Tailor Soldier Spy 2
## 1518                                                    Titanic 3
## 1519                                         To Be or Not to Be 2
## 1520                                   To the Ends of the Earth 2
## 1521                                                     Tobruk 2
## 1522                                                 Tom Sawyer 3
## 1523                                                     Tomboy 2
## 1524                                          Too Hot to Handle 2
## 1525                                                     Topaze 2
## 1526                                                  Tormented 2
## 1527                                               Total Recall 2
## 1528                                               Toy Soldiers 2
## 1529                                                     Tracks 3
## 1530                                                     Trance 2
## 1531                                                    Trapped 4
## 1532                                                      Trash 2
## 1533                                                     Trauma 2
## 1534                                             Treading Water 2
## 1535                                            Treasure Island 6
## 1536                                                   Trespass 2
## 1537                                                      Trick 2
## 1538                                             Trick or Treat 2
## 1539                                             Triple Trouble 2
## 1540                                                  True Blue 2
## 1541                                                 True Crime 2
## 1542                                                  True Grit 2
## 1543                                                     Trumbo 2
## 1544                                                      Trust 2
## 1545                                                      Truth 2
## 1546                                                      Tsuma 2
## 1547                                                 Tumbledown 3
## 1548                                         Tuntematon sotilas 2
## 1549                                               Turkey Shoot 2
## 1550                                                       Tusk 2
## 1551                                                   Twilight 2
## 1552                                                      Twist 2
## 1553                                             Twist of Faith 2
## 1554                                                    Twisted 3
## 1555                                                    Twister 2
## 1556                                              Two of a Kind 2
## 1557                                                      Tyson 2
## 1558                                            Under Suspicion 2
## 1559                                              Under the Gun 3
## 1560                                             Under the Skin 2
## 1561                                                Underground 2
## 1562                                                   Undertow 2
## 1563                                                 Underworld 3
## 1564                                         Unfaithfully Yours 2
## 1565                                              Unforgettable 2
## 1566                                                     United 2
## 1567                                          Universal Soldier 2
## 1568                                                    Unknown 2
## 1569                                                Unmade Beds 2
## 1570                                                Unstoppable 2
## 1571                                                  Valentino 2
## 1572                                                    Valerie 2
## 1573                                                   Vampires 2
## 1574                                                   Vendetta 2
## 1575                                                      Venom 2
## 1576                                                       Vice 2
## 1577                                                 Vice Squad 2
## 1578                                                     Victim 3
## 1579                                                   Victoria 2
## 1580                                      Village of the Damned 2
## 1581                                                      Virus 2
## 1582                                                       Viva 2
## 1583                                                     Walker 2
## 1584                                               Walking Tall 2
## 1585                                                     Walter 2
## 1586                                                 Wanderlust 2
## 1587                                                     Wanted 2
## 1588                                              War and Peace 3
## 1589                                                    Warlock 2
## 1590                                                      Water 2
## 1591                                            Waterloo Bridge 2
## 1592                                            We're No Angels 2
## 1593                                      Weekend of a Champion 2
## 1594                                                    Welcome 2
## 1595                                      Welcome to the Jungle 2
## 1596                                                    Western 2
## 1597                                           What Price Glory 2
## 1598                                           When Ladies Meet 2
## 1599                                      When a Stranger Calls 2
## 1600                                               When in Rome 2
## 1601                                      When the Bough Breaks 2
## 1602                                         Where the Heart Is 2
## 1603                                      While the City Sleeps 2
## 1604                                                   Whiplash 3
## 1605                               Whistle and I'll Come to You 2
## 1606                                                       Wild 2
## 1607                                                  Wild Bill 2
## 1608                                                    Willard 2
## 1609                                                     Wilson 2
## 1610                                                       Wind 2
## 1611                                         Wish You Were Here 2
## 1612                                                 Witch Hunt 2
## 1613                                                 Witchcraft 2
## 1614                                            Without Warning 2
## 1615                                                       Wolf 2
## 1616                                                     Wolves 3
## 1617                                              Women in Love 2
## 1618                                               Wonder Woman 3
## 1619                                                 Wonderland 3
## 1620                                          World Without End 2
## 1621                                                    Woyzeck 2
## 1622                                          Wuthering Heights 6
## 1623                                                  Xue di zi 2
## 1624                                                 Youngblood 2
## 1625                                                       Zero 2
## 1626                                                    Zig Zag 2
## 1627                                                     Zodiac 2
## 1628                                                   Zolushka 2
## 1629                                                       Zoom 2
## 1630                                                       Zulu 2
## 1631                   [{'iso_639_1': 'en', 'name': 'English'}] 2
## 1632                                    Долгая счастливая жизнь 2
## 1633                                         Мастер и Маргарита 2
## 1634                                          Обыкновенное чудо 2
## 1635                                                    Окраина 2
## 1636                                                    Русалка 2
## 1637                                           Снежная королева 2
## 1638                                                    Солярис 2
## 1639                                                 Сталинград 2
## 1640                                                   修羅雪姫 2
## 1641                                                   倩女幽魂 2
## 1642 劇場版ポケットモンスター セレビィ 時を越えた遭遇(であい) 2
## 1643                                               十三人の刺客 2
## 1644                                                     座頭市 2
## 1645                                                       怪談 2
## 1646                                       日本のいちばん長い日 2
## 1647                                             時をかける少女 3
## 1648                                                   楢山節考 2
## 1649                                                       野火 2
## 1650                                               魔女の宅急便 2
## 1651                                                       하녀 2

Partial duplicates exist where movies share the same id, imdb_id or original_title but differ in one or more other columns. For original_title, some movies have the same name while their producer, runtime, year and overview are completely different, which means they are different movies.

For id and imdb_id, filtering these cases in the movies data frame showed that all columns matched except popularity. This could happen if the same movie was registered twice by accident at different times, causing the popularity to shift. To solve this problem, the maximum popularity will be used to merge the duplicates. Further action will depend on the result.
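
As a toy illustration of the merge rule described above, the sketch below keeps the highest-popularity row per imdb_id. The data and the `with_ties = FALSE` variant are assumptions made for this example, not the call used in the report. By default `slice_max()` keeps every tied row, so exact ties can leave more than one row per group.

```r
library(dplyr)

# Toy data (hypothetical values): tt01 was registered twice with different
# popularity, tt03 was registered twice with identical popularity.
toy <- tibble(
  imdb_id    = c("tt01", "tt01", "tt02", "tt03", "tt03"),
  popularity = c(1.2, 3.4, 0.5, 2.0, 2.0)
)

# Default behaviour: tied rows are all kept.
with_ties_kept <- toy %>%
  group_by(imdb_id) %>%
  slice_max(popularity, n = 1) %>%
  ungroup()

# Variant that guarantees exactly one row per imdb_id.
one_per_id <- toy %>%
  group_by(imdb_id) %>%
  slice_max(popularity, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties_kept)
nrow(one_per_id)
```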

imdb_id and id

# For each imdb_id, keep the row(s) with the highest popularity_max and drop the remaining duplicates
movies <- movies %>% group_by(imdb_id) %>% slice_max(popularity_max) %>% ungroup()
# Check how many partial duplicates remain on the dataset.
movies %>% 
  count(id) %>% 
  filter(n > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: id <chr>, n <int>
movies %>% 
  count(imdb_id) %>% 
  filter(n > 1)
## # A tibble: 1 × 2
##   imdb_id     n
##   <chr>   <int>
## 1 0           3
movies %>% 
  count(original_title) %>% 
  filter(n > 1)
## # A tibble: 1,638 × 2
##    original_title                   n
##    <chr>                        <int>
##  1 12 Angry Men                     2
##  2 20,000 Leagues Under the Sea     4
##  3 2:22                             2
##  4 3:10 to Yuma                     2
##  5 8                                2
##  6 9                                2
##  7 A Bucket of Blood                2
##  8 A Christmas Carol                7
##  9 A Dangerous Place                2
## 10 A Foreign Affair                 2
## # ℹ 1,628 more rows

Most of the partial duplicates in the id columns have been eliminated: id has none left, and a single imdb_id value (0) still appears in three rows. This value is likely invalid.

# See remaining partial imdb_id duplicates
movies %>% filter(imdb_id %in% c("0"))
## # A tibble: 3 × 38
##   adult budget homepage  id    imdb_id original_language original_title overview
##   <lgl>  <dbl> <chr>     <chr> <chr>   <fct>             <chr>          <chr>   
## 1 NA        NA [{'iso_3… 1997… 0       104.0             [{'iso_639_1'… Released
## 2 NA        NA [{'iso_3… 2012… 0       68.0              [{'iso_639_1'… Released
## 3 NA        NA [{'iso_3… 2014… 0       82.0              [{'iso_639_1'… Released
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## #   release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## #   tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## #   vote_count <int>, id_collection <chr>, name_collection <chr>,
## #   poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## #   genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## #   country1_language <fct>, country2_language <fct>, …

Most of the information in the remaining duplicates is missing and the rest is out of place, so adjusting these rows properly would be difficult. Without a valid imdb_id, entering the information manually would also be hard. Given these problems and the small share of the dataset these rows represent, the three rows will be deleted in their entirety.

# Remove rows with an imdb_id of "0" (imdb_id is a character column)
movies <- filter(movies, imdb_id != "0")
# Ensure the rows with an imdb_id of 0 were eliminated
movies %>% 
  count(imdb_id) %>% 
  filter(n > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: imdb_id <chr>, n <int>

Although partial duplicates remain for original_title, these will be kept in the dataset. Inspecting them showed significant differences in all columns, and each row has its own id and imdb_id. Even though the titles suggest the same movie, the rows differ in every other aspect, so they must be treated as different movies and not as duplicates.
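
The reasoning above can be checked mechanically. The following is a minimal sketch on toy data (all values are hypothetical): for each repeated title it compares the number of rows with the number of distinct imdb_id values. When the two counts match, the rows are separate films that happen to share a name.

```r
library(dplyr)

# Toy data (hypothetical values): one title shared by two different films.
toy <- tibble(
  imdb_id        = c("tt0000001", "tt0000002", "tt0000003"),
  original_title = c("Twilight", "Twilight", "Zodiac"),
  runtime        = c(94, 122, 157)
)

# For every repeated title, count rows and distinct imdb_id values.
title_check <- toy %>%
  group_by(original_title) %>%
  summarise(n_rows = n(), n_ids = n_distinct(imdb_id), .groups = "drop") %>%
  filter(n_rows > 1)

# Titles where n_rows equals n_ids are distinct films with the same name.
title_check
```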

Managing Factor Levels

Eligible variables were converted to factors in the previous steps, once the columns had been properly cleaned.

# Check data type for each variable
glimpse(movies)
## Rows: 45,417
## Columns: 38
## $ adult                    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ budget                   <dbl> 4224579, 4224579, 4224579, 4224579, 4224579, …
## $ homepage                 <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ id                       <chr> "15257", "16612", "88013", "16624", "105158",…
## $ imdb_id                  <chr> "", "tt0000001", "tt0000003", "tt0000005", "t…
## $ original_language        <fct> en, en, fr, xx, en, fr, es, fr, fr, fr, fr, f…
## $ original_title           <chr> "Hulk vs. Wolverine", "Carmencita", "Pauvre P…
## $ overview                 <chr> "Department H sends in Wolverine to track dow…
## $ popularity               <dbl> 5.539197, 1.273072, 0.673164, 1.061591, 0.312…
## $ poster_path              <chr> "/dXjbsjVkpykJECOO0kgThsipSYP.jpg", "/6QJowxF…
## $ release_date             <date> 2009-01-27, 1894-03-14, 1892-10-28, 1893-05-…
## $ revenue                  <dbl> 11209349, 11209349, 11209349, 11209349, 11209…
## $ runtime                  <dbl> 38, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1,…
## $ status                   <fct> Released, Released, Released, Released, Relea…
## $ tagline                  <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ title                    <chr> "Hulk vs. Wolverine", "Carmencita", "Poor Pie…
## $ video                    <chr> "False", "False", "False", "False", "False", …
## $ vote_average             <dbl> 6.8, 4.9, 6.1, 5.8, 4.7, 6.2, 6.9, 7.0, 5.3, …
## $ vote_count               <int> 48, 18, 19, 19, 12, 52, 87, 44, 12, 22, 17, 2…
## $ id_collection            <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ name_collection          <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ poster_path_collection   <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ backdrop_path_collection <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ genre1                   <fct> Animation, Documentary, Comedy, Drama, Docume…
## $ genre2                   <fct> Action, , Animation, , , , , , , , , , Horror…
## $ genre3                   <fct> Science Fiction, , , , , , , , , , , , , , , …
## $ country1                 <fct> United States of America, United States of Am…
## $ country2                 <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ country3                 <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ country1_language        <fct> English, No Language, No Language, No Languag…
## $ country2_language        <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ country3_language        <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ company1                 <fct> Marvel Studios, Edison Manufacturing Company,…
## $ company2                 <fct> , , , , , , , , , , , , , , Star Film Company…
## $ company3                 <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ budget_original          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ revenue_original         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ popularity_max           <dbl> 5.539197, 1.273072, 0.673164, 1.061591, 0.312…

The following variables will be considered as factors:
* original_language
* status
* genre1-3
* company1-3
* country1-3
* country1-3_language

In this stage, the factor levels of each variable are explored to find inconsistencies or errors in the available categories.

original_language

# List the levels of the factor; droplevels() ensures that only levels still present in the dataset are considered.
levels(droplevels(movies$original_language))
##  [1] ""   "ab" "af" "am" "ar" "ay" "bg" "bm" "bn" "bo" "bs" "ca" "cn" "cs" "cy"
## [16] "da" "de" "el" "en" "eo" "es" "et" "eu" "fa" "fi" "fr" "fy" "gl" "he" "hi"
## [31] "hr" "hu" "hy" "id" "is" "it" "iu" "ja" "jv" "ka" "kk" "kn" "ko" "ku" "ky"
## [46] "la" "lb" "lo" "lt" "lv" "mk" "ml" "mn" "mr" "ms" "mt" "nb" "ne" "nl" "no"
## [61] "pa" "pl" "ps" "pt" "qu" "ro" "ru" "rw" "sh" "si" "sk" "sl" "sm" "sq" "sr"
## [76] "sv" "ta" "te" "tg" "th" "tl" "tr" "uk" "ur" "uz" "vi" "wo" "xx" "zh" "zu"

Examining the levels of original_language, only the blank level needs further investigation; all remaining factor levels are valid and share the same formatting.

# Count the number of elements within a level
movies %>% count(original_language)
## # A tibble: 90 × 2
##    original_language     n
##    <fct>             <int>
##  1 ""                   11
##  2 "ab"                 10
##  3 "af"                  2
##  4 "am"                  2
##  5 "ar"                 39
##  6 "ay"                  1
##  7 "bg"                 10
##  8 "bm"                  3
##  9 "bn"                 29
## 10 "bo"                  2
## # ℹ 80 more rows

There are 11 movies in which the column original_language does not contain information.

# See which movies do not contain an original_language
movies %>% filter(original_language == "")
## # A tibble: 11 × 38
##    adult budget homepage id    imdb_id original_language original_title overview
##    <lgl>  <dbl> <chr>    <chr> <chr>   <fct>             <chr>          <chr>   
##  1 FALSE 4.22e6 ""       3591… tt0053… ""                13 Fighting M… "A grou…
##  2 FALSE 4.22e6 ""       1470… tt0122… ""                Lambchops      "George…
##  3 FALSE 4.22e6 ""       1444… tt0154… ""                Annabelle Ser… "Two da…
##  4 FALSE 4.22e6 ""       1044… tt0223… ""                La prise de T… "Three …
##  5 FALSE 4.22e6 ""       2570… tt0225… ""                Bajaja         "The fi…
##  6 FALSE 4.22e6 ""       3804… tt0298… ""                Lettre d'une … ""      
##  7 FALSE 4.22e6 ""       2831… tt0429… ""                Shadowing the… "Docume…
##  8 FALSE 4.22e6 ""       1039… tt0838… ""                Unfinished Sky "An Out…
##  9 FALSE 4.22e6 ""       3327… tt4432… ""                Song of Lahore "Until …
## 10 FALSE 4.22e6 ""       3810… tt5333… ""                Garn           "The tr…
## 11 FALSE 4.22e6 ""       3815… tt5376… ""                WiNWiN         "Americ…
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## #   release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## #   tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## #   vote_count <int>, id_collection <chr>, name_collection <chr>,
## #   poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## #   genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## #   country1_language <fct>, country2_language <fct>, …
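
One possible way to handle a blank level, assuming the blanks should be treated as missing values (the report has not settled on this), is to turn the empty level into NA. The sketch below uses base R on a toy factor with illustrative values.

```r
# Toy factor with a blank level, mimicking original_language (illustrative values).
lang <- factor(c("en", "", "fr", "en", ""))

# Assigning NA to a level removes that level and marks its values as missing.
levels(lang)[levels(lang) == ""] <- NA

levels(lang)
sum(is.na(lang))
```

The same pattern would apply to any other factor that carries an empty level.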

status

# List the levels of the factor; droplevels() ensures that only levels still present in the dataset are considered.
levels(droplevels(movies$status))
## [1] ""                "Canceled"        "In Production"   "Planned"        
## [5] "Post Production" "Released"        "Rumored"

The blank-level issue found in original_language is also present in this variable; no other inconsistencies were encountered.

# Count the number of elements within a level
movies %>% count(status)
## # A tibble: 7 × 2
##   status                n
##   <fct>             <int>
## 1 ""                   84
## 2 "Canceled"            2
## 3 "In Production"      20
## 4 "Planned"            15
## 5 "Post Production"    98
## 6 "Released"        44971
## 7 "Rumored"           227

In total, 84 rows contain a blank status.

# See which movies do not contain a status
movies %>% filter(status == "")
## # A tibble: 84 × 38
##    adult budget homepage id    imdb_id original_language original_title overview
##    <lgl>  <dbl> <chr>    <chr> <chr>   <fct>             <chr>          <chr>   
##  1 FALSE 4.22e6 ""       42496 tt0067… en                Millhouse      "Emile …
##  2 FALSE 4.22e6 ""       57868 tt0071… en                The Autobiogr… "In Feb…
##  3 FALSE 4.22e6 ""       46770 tt0094… en                Sur            " "     
##  4 FALSE 4.22e6 ""       41934 tt0095… en                Heavy Petting  "HEAVY …
##  5 FALSE 4.22e6 ""       41932 tt0097… en                Easy Wheels    "A grou…
##  6 FALSE 4.22e6 ""       41811 tt0099… en                Eating         "At a s…
##  7 FALSE 4.22e6 ""       77314 tt0101… fr                The Cabinet o… ""      
##  8 FALSE 4.22e6 ""       1236… tt0104… en                Dream Deceive… "A chil…
##  9 FALSE 4.22e6 ""       1242… tt0106… en                Anna: Ot shes… "Direct…
## 10 FALSE 4.22e6 ""       71687 tt0107… en                My Life's in … "No gir…
## # ℹ 74 more rows
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## #   release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## #   tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## #   vote_count <int>, id_collection <chr>, name_collection <chr>,
## #   poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## #   genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, …

genre variables

# List the levels of the factor; droplevels() ensures that only levels still present in the dataset are considered.
levels(droplevels(movies$genre1))
##  [1] ""                "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"
levels(droplevels(movies$genre2))
##  [1] ""                "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"
levels(droplevels(movies$genre3))
##  [1] ""                "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"

Genres are properly described in the factors. The only observations are the blank values and a 2-character white space in each level, probably caused by the column separation process applied to the columns in JSON format.
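
A minimal sketch of how the stray white space could be removed, using base R on a toy factor (the values are illustrative). Trimming is applied to the levels and not to the values, and levels that become identical after trimming are merged automatically.

```r
# Toy factor: two levels differ only by leading white space (illustrative values).
genre <- factor(c("  Drama", "  Comedy", "Drama"))

# Trim leading and trailing white space on the level labels.
# Duplicate labels produced by trimming are merged into one level.
levels(genre) <- trimws(levels(genre))

levels(genre)
as.character(genre)
```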

# Count the number of elements within a level
movies %>% count(genre1)
## # A tibble: 21 × 2
##    genre1            n
##    <fct>         <int>
##  1 ""             2437
##  2 "Action"       4487
##  3 "Adventure"    1508
##  4 "Animation"    1123
##  5 "Comedy"       8815
##  6 "Crime"        1683
##  7 "Documentary"  3412
##  8 "Drama"       11952
##  9 "Family"        524
## 10 "Fantasy"       702
## # ℹ 11 more rows
movies %>% count(genre2)
## # A tibble: 21 × 2
##    genre2            n
##    <fct>         <int>
##  1 ""            16988
##  2 "Action"       1544
##  3 "Adventure"    1412
##  4 "Animation"     617
##  5 "Comedy"       3262
##  6 "Crime"        1428
##  7 "Documentary"   469
##  8 "Drama"        6301
##  9 "Family"       1109
## 10 "Fantasy"       764
## # ℹ 11 more rows
movies %>% count(genre3)
## # A tibble: 21 × 2
##    genre3            n
##    <fct>         <int>
##  1 ""            31454
##  2 "Action"        451
##  3 "Adventure"     422
##  4 "Animation"     171
##  5 "Comedy"        911
##  6 "Crime"         852
##  7 "Documentary"    38
##  8 "Drama"        1673
##  9 "Family"        756
## 10 "Fantasy"       538
## # ℹ 11 more rows

There are 2437 rows in the dataset where movies do not have a defined genre. For the other two variables the number increases; however, we will only focus on the undefined values in genre1, as it is not necessary for a movie to have more than one genre.

company variables

# List the levels of the factor; droplevels() ensures that only levels still present in the dataset are considered. Only the first 50 levels are shown for demonstration purposes.
levels(droplevels(movies$company1)) %>% head(50)
##  [1] ""                                    "01 Distribution"                    
##  [3] "1 85 Films"                          "100  Halal"                         
##  [5] "100 Bares"                           "101st Street Films"                 
##  [7] "10dB Inc"                            "10th Hole Productions"              
##  [9] "11"                                  "1201"                               
## [11] "120dB Films"                         "13 All Stars LLC"                   
## [13] "14 Luglio Cinematografica"           "14 Reels Entertainment"             
## [15] "1492 Pictures"                       "1818"                               
## [17] "1821 Pictures"                       "185 Trax"                           
## [19] "185º Equator"                        "19 Entertainment"                   
## [21] "1984 Private Defense Contractors"    "2 4 7  Films"                       
## [23] "2 Man Production"                    "2 Player Productions"               
## [25] "2 Smooth Film Productions"           "2 Team Productions"                 
## [27] "20 Steps Productions"                "20ten Media"                        
## [29] "20th Century Fox"                    "20th Century Fox Film Corporation"  
## [31] "20th Century Fox Home Entertainment" "20th Century Fox Russia"            
## [33] "20th Century Fox Television"         "20th Century Pictures"              
## [35] "21 Laps Entertainment"               "21 One Productions"                 
## [37] "21st Century Film Corporation"       "22 Dicembre"                        
## [39] "23 5 Filmproduktion"                 "23 Giugno"                          
## [41] "24 7 Films"                          "2425 PRODUCTION"                    
## [43] "26 Films"                            "27 Films Production"                
## [45] "27 Productions"                      "29 fevralya"                        
## [47] "2929 Productions"                    "2afilm"                             
## [49] "2B Films"                            "2DS Productions"
#The other variables won't be shown for presentation purposes, but this is the code that would show their levels
#levels(droplevels(movies$company2))
#levels(droplevels(movies$company3))

Looking at the factor levels for these categories, it is clear that many companies are involved and therefore the factors contain a large number of levels.

movies %>% count(company1) %>% arrange(desc(n))
## # A tibble: 10,590 × 2
##    company1                                     n
##    <fct>                                    <int>
##  1 ""                                       11861
##  2 "Paramount Pictures"                       996
##  3 "Metro Goldwyn Mayer  MGM"                 851
##  4 "Twentieth Century Fox Film Corporation"   780
##  5 "Warner Bros"                              757
##  6 "Universal Pictures"                       754
##  7 "Columbia Pictures"                        429
##  8 "Columbia Pictures Corporation"            401
##  9 "RKO Radio Pictures"                       290
## 10 "United Artists"                           272
## # ℹ 10,580 more rows
movies %>% count(company2) %>% arrange(desc(n))
## # A tibble: 9,040 × 2
##    company2                                     n
##    <fct>                                    <int>
##  1 ""                                       28428
##  2 "Warner Bros"                              270
##  3 "Metro Goldwyn Mayer  MGM"                 149
##  4 "Canal+"                                   124
##  5 "Touchstone Pictures"                       75
##  6 "Universal Pictures"                        71
##  7 "TF1 Films Production"                      52
##  8 "StudioCanal"                               47
##  9 "Twentieth Century Fox Film Corporation"    45
## 10 "Amblin Entertainment"                      43
## # ℹ 9,030 more rows
movies %>% count(company3) %>% arrange(desc(n))
## # A tibble: 5,980 × 2
##    company3                                         n
##    <fct>                                        <int>
##  1 ""                                           36385
##  2 "Warner Bros"                                  130
##  3 "Canal+"                                       109
##  4 "Metro Goldwyn Mayer  MGM"                      44
##  5 "Relativity Media"                              42
##  6 "TF1 Films Production"                          29
##  7 "Touchstone Pictures"                           27
##  8 "Working Title Films"                           24
##  9 "Centre National de la Cinématographie  CNC"    20
## 10 "Film4"                                         20
## # ℹ 5,970 more rows

There are two ways to proceed with this issue. The first is to keep only the companies with the most movies produced as levels and group the smaller companies as others. The second is to drop this variable as a factor and change its data type to string. We consider the first solution the way to go, as it would still allow us to analyze the companies.
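
A minimal sketch of the first option, assuming the forcats package is available. The company counts are made up, and the cutoff `n = 2` is only for illustration; a real cutoff would need to be chosen from the data.

```r
library(forcats)

# Toy company factor (made-up counts): two large studios and two small ones.
company <- factor(c(
  rep("Paramount Pictures", 4),
  rep("Warner Bros", 3),
  "Small Studio A",
  "Small Studio B"
))

# Keep the 2 most frequent companies as levels and group the rest as "Other".
company_lumped <- fct_lump_n(company, n = 2, other_level = "Other")

table(company_lumped)
```

Blank values would need to be handled before lumping, since the blank level is the most frequent one in all three company columns.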

Other issues are the rows that contain a blank and the 2-space white space before each company name, which should also be corrected.

country_languages variables

# List the levels of the factor; droplevels() ensures that only levels still present in the dataset are considered.
levels(droplevels(movies$country1_language))
##  [1] ""                 "Afrikaans"        "Azərbaycan"       "Bahasa indonesia"
##  [5] "Bahasa melayu"    "Bamanankan"       "Bokmål"           "Bosanski"        
##  [9] "Català"           "Český"            "Cymraeg"          "Dansk"           
## [13] "Deutsch"          "Eesti"            "English"          "Español"         
## [17] "Esperanto"        "euskera"          "Français"         "Fulfulde"        
## [21] "Gaeilge"          "Galego"           "Hausa"            "Hrvatski"        
## [25] "isiZulu"          "Íslenska"         "Italiano"         "Kinyarwanda"     
## [29] "Kiswahili"        "Latin"            "Latviešu"         "Lietuvi x9akai"  
## [33] "Magyar"           "Nederlands"       "No Language"      "Norsk"           
## [37] "Polski"           "Português"        "Pусский"          "Română"          
## [41] "shqip"            "Slovenčina"       "Slovenščina"      "Somali"          
## [45] "Srpski"           "suomi"            "svenska"          "Tiếng Việt"      
## [49] "Türkçe"           "Wolof"            "ελληνικά"         "беларуская мова" 
## [53] "български език"   "қазақ"            "Український"      "ქართული"         
## [57] "עִבְרִית"            "اردو"             "العربية"          "پښتو"            
## [61] "فارسی"            "हिन्दी"            "বাংলা"            "ਪੰਜਾਬੀ"           
## [65] "தமிழ்"             "తెలుగు"            "ภาษาไทย"          "한국어 조선말"   
## [69] "广州话   廣州話"  "日本語"           "普通话"
levels(droplevels(movies$country2_language))
##  [1] ""                 "Afrikaans"        "Bahasa indonesia" "Bahasa melayu"   
##  [5] "Bamanankan"       "Bosanski"         "Català"           "Český"           
##  [9] "Cymraeg"          "Dansk"            "Deutsch"          "Eesti"           
## [13] "English"          "Español"          "Esperanto"        "Français"        
## [17] "Fulfulde"         "Gaeilge"          "Galego"           "Hrvatski"        
## [21] "isiZulu"          "Íslenska"         "Italiano"         "Kinyarwanda"     
## [25] "Kiswahili"        "Latin"            "Latviešu"         "Lietuvi x9akai"  
## [29] "Magyar"           "Malti"            "Nederlands"       "No Language"     
## [33] "Norsk"            "ozbek"            "Polski"           "Português"       
## [37] "Pусский"          "Română"           "shqip"            "Slovenčina"      
## [41] "Slovenščina"      "Somali"           "Srpski"           "suomi"           
## [45] "svenska"          "Tiếng Việt"       "Türkçe"           "Wolof"           
## [49] "ελληνικά"         "български език"   "қазақ"            "Український"     
## [53] "ქართული"          "עִבְרִית"            "اردو"             "العربية"         
## [57] "پښتو"             "فارسی"            "हिन्दी"            "বাংলা"           
## [61] "ਪੰਜਾਬੀ"            "தமிழ்"             "తెలుగు"            "ภาษาไทย"         
## [65] "한국어 조선말"    "广州话   廣州話"  "日本語"           "普通话"
levels(droplevels(movies$country3_language))
##  [1] ""                 "Afrikaans"        "Bahasa indonesia" "Bahasa melayu"   
##  [5] "Bamanankan"       "Bosanski"         "Český"            "Cymraeg"         
##  [9] "Dansk"            "Deutsch"          "Eesti"            "English"         
## [13] "Español"          "Esperanto"        "euskera"          "Français"        
## [17] "Gaeilge"          "Hrvatski"         "isiZulu"          "Íslenska"        
## [21] "Italiano"         "Kiswahili"        "Latin"            "Latviešu"        
## [25] "Lietuvi x9akai"   "Magyar"           "Nederlands"       "Norsk"           
## [29] "Polski"           "Português"        "Pусский"          "Română"          
## [33] "shqip"            "Slovenčina"       "Slovenščina"      "Somali"          
## [37] "Srpski"           "suomi"            "svenska"          "Tiếng Việt"      
## [41] "Türkçe"           "Wolof"            "ελληνικά"         "български език"  
## [45] "қазақ"            "Український"      "ქართული"          "עִבְרִית"           
## [49] "اردو"             "العربية"          "پښتو"             "فارسی"           
## [53] "हिन्दी"            "ਪੰਜਾਬੀ"            "தமிழ்"             "తెలుగు"           
## [57] "ภาษาไทย"          "한국어 조선말"    "广州话   廣州話"  "日本語"          
## [61] "普通话"

As with other variables, the factor levels start with 2 blank characters, and there is also an empty level. The biggest issue, however, is that the language names are written in their own language, which may complicate the analysis of language-related variables. One solution is to translate the language names to English.
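
The translation idea can be sketched as follows. This is only an illustration on a toy factor, assuming the forcats package is available; the lookup covers three levels and would have to be extended by hand to every level listed above.

```r
library(forcats)

# Toy factor standing in for movies$country1_language (values taken from the levels above)
lang <- factor(c("Deutsch", "Español", "English", "Français", "Deutsch", ""))

# Hypothetical lookup written as new_name = "old_name";
# levels that are not mentioned (here "English" and "") are left unchanged
lang_en <- fct_recode(lang,
  German  = "Deutsch",
  Spanish = "Español",
  French  = "Français"
)

levels(lang_en)
```

fct_recode() only renames levels, so the number of rows per language stays the same.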

# Count the number of elements within a level
movies %>% count(country1_language)
## # A tibble: 71 × 2
##    country1_language      n
##    <fct>              <int>
##  1 ""                  4050
##  2 "Afrikaans"           22
##  3 "Azərbaycan"           4
##  4 "Bahasa indonesia"    26
##  5 "Bahasa melayu"        5
##  6 "Bamanankan"           4
##  7 "Bokmål"               3
##  8 "Bosanski"            25
##  9 "Català"              31
## 10 "Český"              263
## # ℹ 61 more rows
movies %>% count(country2_language)
## # A tibble: 68 × 2
##    country2_language      n
##    <fct>              <int>
##  1 ""                 37667
##  2 "Afrikaans"            4
##  3 "Bahasa indonesia"     9
##  4 "Bahasa melayu"        4
##  5 "Bamanankan"           1
##  6 "Bosanski"             3
##  7 "Català"               5
##  8 "Český"               14
##  9 "Cymraeg"              4
## 10 "Dansk"               18
## # ℹ 58 more rows
movies %>% count(country3_language)
## # A tibble: 61 × 2
##    country3_language      n
##    <fct>              <int>
##  1 ""                 42970
##  2 "Afrikaans"            2
##  3 "Bahasa indonesia"     2
##  4 "Bahasa melayu"        6
##  5 "Bamanankan"           1
##  6 "Bosanski"             2
##  7 "Český"                2
##  8 "Cymraeg"              1
##  9 "Dansk"                5
## 10 "Deutsch"            328
## # ℹ 51 more rows

Only the blank level of the first country_language variable must be fixed, since a movie is not required to have more than 1 language available.

Fixing categorical data problems

By using functions on factors we were able to detect inconsistencies, errors and opportunities to improve the legibility of the data by adjusting the factor levels. In this section we fix all the categorical data issues detected previously.

original_language

  • Detected Issues: A level does not contain any information
  • Affected Rows: 11 blank & 33 “xx”
  • Proposed Solution: After checking some of the movies that do not contain this information, I decided to change blank values to “xx” and classify them as silent movies.
levels(droplevels(movies$original_language))
##  [1] ""   "ab" "af" "am" "ar" "ay" "bg" "bm" "bn" "bo" "bs" "ca" "cn" "cs" "cy"
## [16] "da" "de" "el" "en" "eo" "es" "et" "eu" "fa" "fi" "fr" "fy" "gl" "he" "hi"
## [31] "hr" "hu" "hy" "id" "is" "it" "iu" "ja" "jv" "ka" "kk" "kn" "ko" "ku" "ky"
## [46] "la" "lb" "lo" "lt" "lv" "mk" "ml" "mn" "mr" "ms" "mt" "nb" "ne" "nl" "no"
## [61] "pa" "pl" "ps" "pt" "qu" "ro" "ru" "rw" "sh" "si" "sk" "sl" "sm" "sq" "sr"
## [76] "sv" "ta" "te" "tg" "th" "tl" "tr" "uk" "ur" "uz" "vi" "wo" "xx" "zh" "zu"
# See movies without original_language
movies %>% filter(original_language == "")
## # A tibble: 11 × 38
##    adult budget homepage id    imdb_id original_language original_title overview
##    <lgl>  <dbl> <chr>    <chr> <chr>   <fct>             <chr>          <chr>   
##  1 FALSE 4.22e6 ""       3591… tt0053… ""                13 Fighting M… "A grou…
##  2 FALSE 4.22e6 ""       1470… tt0122… ""                Lambchops      "George…
##  3 FALSE 4.22e6 ""       1444… tt0154… ""                Annabelle Ser… "Two da…
##  4 FALSE 4.22e6 ""       1044… tt0223… ""                La prise de T… "Three …
##  5 FALSE 4.22e6 ""       2570… tt0225… ""                Bajaja         "The fi…
##  6 FALSE 4.22e6 ""       3804… tt0298… ""                Lettre d'une … ""      
##  7 FALSE 4.22e6 ""       2831… tt0429… ""                Shadowing the… "Docume…
##  8 FALSE 4.22e6 ""       1039… tt0838… ""                Unfinished Sky "An Out…
##  9 FALSE 4.22e6 ""       3327… tt4432… ""                Song of Lahore "Until …
## 10 FALSE 4.22e6 ""       3810… tt5333… ""                Garn           "The tr…
## 11 FALSE 4.22e6 ""       3815… tt5376… ""                WiNWiN         "Americ…
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## #   release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## #   tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## #   vote_count <int>, id_collection <chr>, name_collection <chr>,
## #   poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## #   genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## #   country1_language <fct>, country2_language <fct>, …

There are only 11 rows in which the original language is not present.

# Add rows without original language to the category "xx"
movies <- movies %>% mutate(original_language = fct_collapse(original_language, xx = c("xx","")))
# Check that no movies with a blank original_language remain
movies %>% filter(original_language == "")
## # A tibble: 0 × 38
## # ℹ 38 variables: adult <lgl>, budget <dbl>, homepage <chr>, id <chr>,
## #   imdb_id <chr>, original_language <fct>, original_title <chr>,
## #   overview <chr>, popularity <dbl>, poster_path <chr>, release_date <date>,
## #   revenue <dbl>, runtime <dbl>, status <fct>, tagline <chr>, title <chr>,
## #   video <chr>, vote_average <dbl>, vote_count <int>, id_collection <chr>,
## #   name_collection <chr>, poster_path_collection <chr>,
## #   backdrop_path_collection <chr>, genre1 <fct>, genre2 <fct>, genre3 <fct>, …
# See movies now classified as "xx"
movies %>% filter(original_language == "xx")
## # A tibble: 44 × 38
##    adult budget homepage id    imdb_id original_language original_title overview
##    <lgl>  <dbl> <chr>    <chr> <chr>   <fct>             <chr>          <chr>   
##  1 FALSE 4.22e6 ""       16624 tt0000… xx                Blacksmith Sc… "Three …
##  2 FALSE 4.22e6 ""       1330… tt0000… xx                Le manoir du … "A bat …
##  3 FALSE 4.22e6 ""       1323… tt0000… xx                The '?' Motor… "A magi…
##  4 FALSE 4.22e6 ""       36208 tt0009… xx                A Dog's Life   "Poor C…
##  5 FALSE 4.22e6 ""       70804 tt0010… xx                J'accuse!      "The st…
##  6 FALSE 4.22e6 ""       47703 tt0013… xx                Дневник Глумо… "Filmic…
##  7 FALSE 4.22e6 ""       42565 tt0018… xx                Underworld     "Boiste…
##  8 FALSE 4.22e6 ""       3591… tt0053… xx                13 Fighting M… "A grou…
##  9 FALSE 1.20e7 ""       62204 tt0082… xx                La Guerre du … "A colo…
## 10 FALSE 4.22e6 ""       1237… tt0082… xx                Junkopia       "A shor…
## # ℹ 34 more rows
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## #   release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## #   tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## #   vote_count <int>, id_collection <chr>, name_collection <chr>,
## #   poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## #   genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, …
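
As a quick sanity check on the merge above, the following sketch rebuilds the same situation on a toy factor (11 blank values and 33 "xx" values, plus a few "en" rows added purely for illustration) and confirms that fct_collapse() moves rows between levels without losing any. It assumes only that forcats is installed.

```r
library(forcats)

# Toy factor: 11 blank and 33 "xx" values as reported above, plus 5 illustrative "en" rows
orig <- factor(c(rep("", 11), rep("xx", 33), rep("en", 5)))

# Merge the blank level into "xx", as done for original_language
merged <- fct_collapse(orig, xx = c("xx", ""))

# The blank level disappears and "xx" now holds 11 + 33 = 44 rows
table(merged)
```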

status

  • Detected Issues: A level does not contain any information
  • Affected Rows: 84
  • Proposed Solution: Use the release_date variable to decide which status to give to movies without one
levels(movies$status)
## [1] ""                "Canceled"        "In Production"   "Planned"        
## [5] "Post Production" "Released"        "Rumored"
# See movies without status
movies %>% filter(status == "")
## # A tibble: 84 × 38
##    adult budget homepage id    imdb_id original_language original_title overview
##    <lgl>  <dbl> <chr>    <chr> <chr>   <fct>             <chr>          <chr>   
##  1 FALSE 4.22e6 ""       42496 tt0067… en                Millhouse      "Emile …
##  2 FALSE 4.22e6 ""       57868 tt0071… en                The Autobiogr… "In Feb…
##  3 FALSE 4.22e6 ""       46770 tt0094… en                Sur            " "     
##  4 FALSE 4.22e6 ""       41934 tt0095… en                Heavy Petting  "HEAVY …
##  5 FALSE 4.22e6 ""       41932 tt0097… en                Easy Wheels    "A grou…
##  6 FALSE 4.22e6 ""       41811 tt0099… en                Eating         "At a s…
##  7 FALSE 4.22e6 ""       77314 tt0101… fr                The Cabinet o… ""      
##  8 FALSE 4.22e6 ""       1236… tt0104… en                Dream Deceive… "A chil…
##  9 FALSE 4.22e6 ""       1242… tt0106… en                Anna: Ot shes… "Direct…
## 10 FALSE 4.22e6 ""       71687 tt0107… en                My Life's in … "No gir…
## # ℹ 74 more rows
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## #   release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## #   tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## #   vote_count <int>, id_collection <chr>, name_collection <chr>,
## #   poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## #   genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, …

Most of these movies have a release date that has already passed (before 2017), so movies without a status that do have a release date will be considered “Released”.

# Assign movies with a release date to the level "Released"
movies <- movies %>%
  mutate(status = if_else(!is.na(release_date),fct_collapse(status, Released = c("Released","")),status))
# Check if the movies without status and a release date now form part of "Released"
movies %>% count(status)
## # A tibble: 7 × 2
##   status                n
##   <fct>             <int>
## 1 "Released"        45051
## 2 "Canceled"            2
## 3 "In Production"      20
## 4 "Planned"            15
## 5 "Post Production"    98
## 6 "Rumored"           227
## 7 ""                    4
# See the remaining movies without a status
movies %>% filter(status == "")
## # A tibble: 4 × 38
##   adult  budget homepage id    imdb_id original_language original_title overview
##   <lgl>   <dbl> <chr>    <chr> <chr>   <fct>             <chr>          <chr>   
## 1 FALSE  4.22e6 ""       82663 tt0113… en                Midnight Man   British…
## 2 FALSE  4.22e6 ""       94214 tt0210… en                Jails, Hospit… Jails, …
## 3 FALSE  4.22e6 "http:/… 1226… tt2423… ja                マルドゥック…  Third f…
## 4 FALSE  4.22e6 ""       2492… tt2622… en                Avalanche Sha… A group…
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## #   release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## #   tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## #   vote_count <int>, id_collection <chr>, name_collection <chr>,
## #   poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## #   genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## #   country1_language <fct>, country2_language <fct>, …
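
The release-date rule can also be written with an explicit cutoff. The sketch below is an alternative formulation on toy data, not the code used in this report: the cutoff date is an assumption taken from the “before 2017” wording, and the column values are made up for illustration.

```r
library(dplyr)

# Toy rows; titles, statuses and dates are illustrative only
toy <- tibble(
  title = c("A", "B", "C", "D"),
  status = factor(c("", "Released", "", "Rumored"),
                  levels = c("", "Released", "Rumored")),
  release_date = as.Date(c("1995-06-01", "2001-01-15", NA, "2016-03-10"))
)

# Assumed cutoff: anything dated before 2017 has already been released
cutoff <- as.Date("2017-01-01")

toy_fixed <- toy %>%
  mutate(
    status_chr = as.character(status),
    # Only fill in rows that have no status and a known release date before the cutoff
    status_chr = if_else(
      status_chr == "" & !is.na(release_date) & release_date < cutoff,
      "Released", status_chr
    ),
    status = factor(status_chr)
  ) %>%
  select(-status_chr)

toy_fixed
```

Working on a character copy avoids mixing factors with different level sets inside if_else(); rows with a missing release date keep their blank status and can be reviewed by hand.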

Because only a few rows without a status remain, their imdb_id values were looked up directly to find their status. All the remaining movies appear to have been released as well, so they are added to the “Released” category.

# Add remaining movies to category "Released"
movies <- movies %>% mutate(status = fct_collapse(status, Released = c("Released","")))
# Check if any movie remains without status.
movies %>% count(status)
## # A tibble: 6 × 2
##   status              n
##   <fct>           <int>
## 1 Released        45055
## 2 Canceled            2
## 3 In Production      20
## 4 Planned            15
## 5 Post Production    98
## 6 Rumored           227
levels(droplevels(movies$status))
## [1] "Released"        "Canceled"        "In Production"   "Planned"        
## [5] "Post Production" "Rumored"

The “status” column now contains the proper categories and does not need any additional fixes.

genre variables

  • Detected Issues: There are movies that do not specify at least one genre; all levels have a 2-character blank space at the start.
  • Affected Rows: 2437 (no genre), ALL (blank spaces at start)
  • Proposed Solution: Rename the level without a genre to “Unspecified” and remove the blank spaces at the start of all levels.
# See levels for a variable
levels(droplevels(movies$genre1))
##  [1] ""                "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"
levels(droplevels(movies$genre2))
##  [1] ""                "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"
levels(droplevels(movies$genre3))
##  [1] ""                "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"
# See how many rows do not have a genre
movies %>% count(genre1)
## # A tibble: 21 × 2
##    genre1            n
##    <fct>         <int>
##  1 ""             2437
##  2 "Action"       4487
##  3 "Adventure"    1508
##  4 "Animation"    1123
##  5 "Comedy"       8815
##  6 "Crime"        1683
##  7 "Documentary"  3412
##  8 "Drama"       11952
##  9 "Family"        524
## 10 "Fantasy"       702
## # ℹ 11 more rows

There are too many rows without a genre to fill in manually in a reasonable amount of time, and we do not have a way to extract large amounts of data from IMDb. We will therefore put the rows without a genre1 into a category called “Unspecified”. For genre2 and genre3, where a movie does not need to have more than 1 genre, we will use the term “NA” as the category name.

# Create the category "Unspecified"
movies <- movies %>% mutate(genre1 = fct_collapse(genre1, Unspecified = ""))
# See if the new category was created
levels(droplevels(movies$genre1))
##  [1] "Unspecified"     "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"
movies %>% count(genre1)
## # A tibble: 21 × 2
##    genre1          n
##    <fct>       <int>
##  1 Unspecified  2437
##  2 Action       4487
##  3 Adventure    1508
##  4 Animation    1123
##  5 Comedy       8815
##  6 Crime        1683
##  7 Documentary  3412
##  8 Drama       11952
##  9 Family        524
## 10 Fantasy       702
## # ℹ 11 more rows
# Eliminate white space inconsistency
movies <- movies %>% mutate(genre1 = str_trim(genre1)) 
movies <- movies %>% mutate(genre2 = str_trim(genre2))
movies <- movies %>% mutate(genre3 = str_trim(genre3))
# Reconvert variables to factor data type
movies <- movies %>% mutate(genre1 = as.factor(movies$genre1))
movies <- movies %>% mutate(genre2 = as.factor(movies$genre2))
movies <- movies %>% mutate(genre3 = as.factor(movies$genre3))
# Levels should now have their white space removed
levels(droplevels(movies$genre1))
##  [1] "Action"          "Adventure"       "Animation"       "Comedy"         
##  [5] "Crime"           "Documentary"     "Drama"           "Family"         
##  [9] "Fantasy"         "Foreign"         "History"         "Horror"         
## [13] "Music"           "Mystery"         "Romance"         "Science Fiction"
## [17] "Thriller"        "TV Movie"        "Unspecified"     "War"            
## [21] "Western"
levels(droplevels(movies$genre2))
##  [1] ""                "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"
levels(droplevels(movies$genre3))
##  [1] ""                "Action"          "Adventure"       "Animation"      
##  [5] "Comedy"          "Crime"           "Documentary"     "Drama"          
##  [9] "Family"          "Fantasy"         "Foreign"         "History"        
## [13] "Horror"          "Music"           "Mystery"         "Romance"        
## [17] "Science Fiction" "Thriller"        "TV Movie"        "War"            
## [21] "Western"
# Create a new column 'genre_count' to count the number of genres for each row
movies$genre_count <- rowSums(movies[, c("genre1", "genre2", "genre3")] != "")

I left the blank level in columns genre2 and genre3 so that the number of genres per movie can be counted for later use. The genre variables are now clean and ready for use in analysis.
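
The same clean-up can be sketched in a more compact way that keeps the columns as factors throughout. This is an illustration on toy data, assuming dplyr, forcats and stringr are loaded. Note that, unlike the count above, this sketch treats “Unspecified” as not being a genre; that choice is an assumption and may or may not suit the later analysis.

```r
library(dplyr)
library(forcats)
library(stringr)

# Toy data: leading blanks in the levels, blank level where there is no further genre
toy <- tibble(
  genre1 = factor(c("  Drama", "Unspecified", "  Comedy")),
  genre2 = factor(c("  Romance", "", "")),
  genre3 = factor(c("", "", ""))
)

# Trim the level labels in place; the columns remain factors
toy_clean <- toy %>%
  mutate(across(genre1:genre3, ~ fct_relabel(.x, str_trim)))

# Count real genres per row, ignoring blanks and (by assumption) "Unspecified"
is_genre <- sapply(
  toy_clean[c("genre1", "genre2", "genre3")],
  function(col) !(as.character(col) %in% c("", "Unspecified"))
)
toy_clean$genre_count <- rowSums(is_genre)

toy_clean
```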

company variables

  • Detected Issues: A 2-character blank space at the start of each level, no company in certain cases, and a very large number of levels within the factor, which may make the analysis of companies inconvenient.
  • Affected Rows: 11862 (no company), ALL (blank spaces at start and excessive number of levels).
  • Proposed Solution: Empty rows will be put into a category called “No Company” and blank spaces will be removed. The number of categories will be reduced: only the first 50 companies will keep a category of their own, while the rest will be put into a category named “Other”.
# See levels for a variable
levels(droplevels(movies$company1)) %>% head(25)
##  [1] ""                                 "01 Distribution"                 
##  [3] "1 85 Films"                       "100  Halal"                      
##  [5] "100 Bares"                        "101st Street Films"              
##  [7] "10dB Inc"                         "10th Hole Productions"           
##  [9] "11"                               "1201"                            
## [11] "120dB Films"                      "13 All Stars LLC"                
## [13] "14 Luglio Cinematografica"        "14 Reels Entertainment"          
## [15] "1492 Pictures"                    "1818"                            
## [17] "1821 Pictures"                    "185 Trax"                        
## [19] "185º Equator"                     "19 Entertainment"                
## [21] "1984 Private Defense Contractors" "2 4 7  Films"                    
## [23] "2 Man Production"                 "2 Player Productions"            
## [25] "2 Smooth Film Productions"
# Sort companies by number of movies produced
company1_sort <- movies %>% count(company1) %>% arrange(desc(n))
company2_sort <- movies %>% count(company2) %>% arrange(desc(n))
company3_sort <- movies %>% count(company3) %>% arrange(desc(n))
# Keep the 51 most frequent levels: the 50 main companies plus the blank (no company) level
top_50_company1 <- company1_sort$company1[1:51]
top_50_company2 <- company2_sort$company2[1:51]
top_50_company3 <- company3_sort$company3[1:51]
# Move all the companies that are not among the 50 biggest into the category "Other" (blanks are kept as they are)
movies <- movies %>% mutate(company1 = fct_collapse(company1, "Other" = company1[!company1 %in% top_50_company1]))
movies <- movies %>% mutate(company2 = fct_collapse(company2, "Other" = company2[!company2 %in% top_50_company2]))
# Note: company3 is filtered against top_50_company2 here, which is why it ends up with fewer levels below;
# use top_50_company3 instead to rank company3 by its own counts
movies <- movies %>% mutate(company3 = fct_collapse(company3, "Other" = company3[!company3 %in% top_50_company2]))
# Check if the change was made successfully
levels(droplevels(movies$company1)) 
##  [1] ""                                      
##  [2] "Other"                                 
##  [3] "American International Pictures  AIP"  
##  [4] "BBC Films"                             
##  [5] "British Broadcasting Corporation  BBC" 
##  [6] "Canal+"                                
##  [7] "Channel Four Films"                    
##  [8] "CJ Entertainment"                      
##  [9] "Columbia Pictures"                     
## [10] "Columbia Pictures Corporation"         
## [11] "DC Comics"                             
## [12] "DreamWorks SKG"                        
## [13] "First National Pictures"               
## [14] "Fox Film Corporation"                  
## [15] "Fox Searchlight Pictures"              
## [16] "France 2 Cinéma"                       
## [17] "Gaumont"                               
## [18] "Hammer Film Productions"               
## [19] "Hollywood Pictures"                    
## [20] "Imagine Entertainment"                 
## [21] "Lions Gate Films"                      
## [22] "Lionsgate"                             
## [23] "Metro Goldwyn Mayer  MGM"              
## [24] "Miramax Films"                         
## [25] "Monogram Pictures"                     
## [26] "Mosfilm"                               
## [27] "New Line Cinema"                       
## [28] "New World Pictures"                    
## [29] "Nikkatsu"                              
## [30] "Nordisk Film"                          
## [31] "Orion Pictures"                        
## [32] "Paramount Pictures"                    
## [33] "Rai Cinema"                            
## [34] "Regency Enterprises"                   
## [35] "RKO Radio Pictures"                    
## [36] "Shaw Brothers"                         
## [37] "Shôchiku Eiga"                         
## [38] "StudioCanal"                           
## [39] "Summit Entertainment"                  
## [40] "The Rank Organisation"                 
## [41] "TLA Releasing"                         
## [42] "Toho Company"                          
## [43] "Touchstone Pictures"                   
## [44] "TriStar Pictures"                      
## [45] "Twentieth Century Fox Film Corporation"
## [46] "United Artists"                        
## [47] "Universal International Pictures  UI"  
## [48] "Universal Pictures"                    
## [49] "Village Roadshow Pictures"             
## [50] "Walt Disney Pictures"                  
## [51] "Walt Disney Productions"               
## [52] "Warner Bros"
movies %>% count(company1) %>% arrange(desc(n))
## # A tibble: 52 × 2
##    company1                                     n
##    <fct>                                    <int>
##  1 "Other"                                  24274
##  2 ""                                       11861
##  3 "Paramount Pictures"                       996
##  4 "Metro Goldwyn Mayer  MGM"                 851
##  5 "Twentieth Century Fox Film Corporation"   780
##  6 "Warner Bros"                              757
##  7 "Universal Pictures"                       754
##  8 "Columbia Pictures"                        429
##  9 "Columbia Pictures Corporation"            401
## 10 "RKO Radio Pictures"                       290
## # ℹ 42 more rows
levels(droplevels(movies$company2)) 
##  [1] ""                                      
##  [2] "Other"                                 
##  [3] "Amblin Entertainment"                  
##  [4] "American International Pictures  AIP"  
##  [5] "BBC Films"                             
##  [6] "Blumhouse Productions"                 
##  [7] "British Broadcasting Corporation  BBC" 
##  [8] "Canal+"                                
##  [9] "Carolco Pictures"                      
## [10] "Castle Rock Entertainment"             
## [11] "Columbia Pictures Corporation"         
## [12] "Dimension Films"                       
## [13] "DreamWorks SKG"                        
## [14] "Dune Entertainment"                    
## [15] "Film i Väst"                           
## [16] "Film4"                                 
## [17] "Focus Features"                        
## [18] "Globo Filmes"                          
## [19] "Happy Madison Productions"             
## [20] "HBO Films"                             
## [21] "Hollywood Pictures"                    
## [22] "Lionsgate"                             
## [23] "M6 Films"                              
## [24] "Metro Goldwyn Mayer  MGM"              
## [25] "Millennium Films"                      
## [26] "Morgan Creek Productions"              
## [27] "Nickelodeon Movies"                    
## [28] "Original Film"                         
## [29] "Pixar Animation Studios"               
## [30] "PolyGram Filmed Entertainment"         
## [31] "Rai Cinema"                            
## [32] "Regency Enterprises"                   
## [33] "Relativity Media"                      
## [34] "Revolution Studios"                    
## [35] "Scott Rudin Productions"               
## [36] "Spyglass Entertainment"                
## [37] "StudioCanal"                           
## [38] "Svensk Filmindustri  SF"               
## [39] "TF1 Films Production"                  
## [40] "The Vitaphone Corporation"             
## [41] "Touchstone Pictures"                   
## [42] "TriStar Pictures"                      
## [43] "Twentieth Century Fox Film Corporation"
## [44] "UK Film Council"                       
## [45] "United Artists Pictures"               
## [46] "Universal Pictures"                    
## [47] "Walt Disney Animation Studios"         
## [48] "Walt Disney Productions"               
## [49] "Warner Bros"                           
## [50] "Warner Bros  Animation"                
## [51] "Wild Bunch"                            
## [52] "Zweites Deutsches Fernsehen  ZDF"
movies %>% count(company2) %>% arrange(desc(n))
## # A tibble: 52 × 2
##    company2                                     n
##    <fct>                                    <int>
##  1 ""                                       28428
##  2 "Other"                                  15072
##  3 "Warner Bros"                              270
##  4 "Metro Goldwyn Mayer  MGM"                 149
##  5 "Canal+"                                   124
##  6 "Touchstone Pictures"                       75
##  7 "Universal Pictures"                        71
##  8 "TF1 Films Production"                      52
##  9 "StudioCanal"                               47
## 10 "Twentieth Century Fox Film Corporation"    45
## # ℹ 42 more rows
levels(droplevels(movies$company3)) 
##  [1] ""                                      
##  [2] "Other"                                 
##  [3] "Amblin Entertainment"                  
##  [4] "BBC Films"                             
##  [5] "Blumhouse Productions"                 
##  [6] "British Broadcasting Corporation  BBC" 
##  [7] "Canal+"                                
##  [8] "Carolco Pictures"                      
##  [9] "Castle Rock Entertainment"             
## [10] "Columbia Pictures Corporation"         
## [11] "Dimension Films"                       
## [12] "Dune Entertainment"                    
## [13] "Film i Väst"                           
## [14] "Film4"                                 
## [15] "Focus Features"                        
## [16] "Globo Filmes"                          
## [17] "Happy Madison Productions"             
## [18] "HBO Films"                             
## [19] "Hollywood Pictures"                    
## [20] "Lionsgate"                             
## [21] "M6 Films"                              
## [22] "Metro Goldwyn Mayer  MGM"              
## [23] "Millennium Films"                      
## [24] "Morgan Creek Productions"              
## [25] "Nickelodeon Movies"                    
## [26] "Original Film"                         
## [27] "PolyGram Filmed Entertainment"         
## [28] "Rai Cinema"                            
## [29] "Regency Enterprises"                   
## [30] "Relativity Media"                      
## [31] "Revolution Studios"                    
## [32] "Scott Rudin Productions"               
## [33] "Spyglass Entertainment"                
## [34] "StudioCanal"                           
## [35] "Svensk Filmindustri  SF"               
## [36] "TF1 Films Production"                  
## [37] "The Vitaphone Corporation"             
## [38] "Touchstone Pictures"                   
## [39] "TriStar Pictures"                      
## [40] "Twentieth Century Fox Film Corporation"
## [41] "UK Film Council"                       
## [42] "Universal Pictures"                    
## [43] "Warner Bros"                           
## [44] "Warner Bros  Animation"                
## [45] "Wild Bunch"                            
## [46] "Zweites Deutsches Fernsehen  ZDF"
movies %>% count(company3) %>% arrange(desc(n))
## # A tibble: 46 × 2
##    company3                       n
##    <fct>                      <int>
##  1 ""                         36385
##  2 "Other"                     8332
##  3 "Warner Bros"                130
##  4 "Canal+"                     109
##  5 "Metro Goldwyn Mayer  MGM"    44
##  6 "Relativity Media"            42
##  7 "TF1 Films Production"        29
##  8 "Touchstone Pictures"         27
##  9 "Film4"                       20
## 10 "Millennium Films"            19
## # ℹ 36 more rows
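
Keeping only the most frequent companies and lumping the rest can also be sketched with fct_lump_n(). This is an alternative to the fct_collapse() approach above rather than the method used in this report, shown on a toy factor and assuming forcats 0.5.0 or later (where fct_lump_n() is available).

```r
library(forcats)

# Toy factor: the blank level stands for "no company", as in company1..company3
company <- factor(c("", "", "", "Warner Bros", "Warner Bros",
                    "Canal+", "Canal+", "Film4", "Wild Bunch"))

# Keep the 3 most frequent levels (the blank level included) and lump the rest
lumped <- fct_lump_n(company, n = 3, other_level = "Other")

table(lumped)
```

On the real data the equivalent call would use n = 51, so that the blank level and the 50 largest companies are kept; each variable is then ranked by its own counts without building a separate lookup vector.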

The next step is to rename the blank category of company1 to “No Company”. This is not needed for company2 and company3, where a blank only means that there is no additional company. I will also add a variable called company_count for later analysis.

movies <- movies %>% mutate(company1 = fct_collapse(company1, "No Company" = ""))
# Create a new column 'company_count' to count the number of companies for each row
# (company1 is never blank at this point, so "No Company" rows still count as 1)
movies$company_count <- rowSums(movies[, c("company1", "company2", "company3")] != "")
# Check if the change was done correctly
levels(droplevels(movies$company1)) 
##  [1] "No Company"                            
##  [2] "Other"                                 
##  [3] "American International Pictures  AIP"  
##  [4] "BBC Films"                             
##  [5] "British Broadcasting Corporation  BBC" 
##  [6] "Canal+"                                
##  [7] "Channel Four Films"                    
##  [8] "CJ Entertainment"                      
##  [9] "Columbia Pictures"                     
## [10] "Columbia Pictures Corporation"         
## [11] "DC Comics"                             
## [12] "DreamWorks SKG"                        
## [13] "First National Pictures"               
## [14] "Fox Film Corporation"                  
## [15] "Fox Searchlight Pictures"              
## [16] "France 2 Cinéma"                       
## [17] "Gaumont"                               
## [18] "Hammer Film Productions"               
## [19] "Hollywood Pictures"                    
## [20] "Imagine Entertainment"                 
## [21] "Lions Gate Films"                      
## [22] "Lionsgate"                             
## [23] "Metro Goldwyn Mayer  MGM"              
## [24] "Miramax Films"                         
## [25] "Monogram Pictures"                     
## [26] "Mosfilm"                               
## [27] "New Line Cinema"                       
## [28] "New World Pictures"                    
## [29] "Nikkatsu"                              
## [30] "Nordisk Film"                          
## [31] "Orion Pictures"                        
## [32] "Paramount Pictures"                    
## [33] "Rai Cinema"                            
## [34] "Regency Enterprises"                   
## [35] "RKO Radio Pictures"                    
## [36] "Shaw Brothers"                         
## [37] "Shôchiku Eiga"                         
## [38] "StudioCanal"                           
## [39] "Summit Entertainment"                  
## [40] "The Rank Organisation"                 
## [41] "TLA Releasing"                         
## [42] "Toho Company"                          
## [43] "Touchstone Pictures"                   
## [44] "TriStar Pictures"                      
## [45] "Twentieth Century Fox Film Corporation"
## [46] "United Artists"                        
## [47] "Universal International Pictures  UI"  
## [48] "Universal Pictures"                    
## [49] "Village Roadshow Pictures"             
## [50] "Walt Disney Pictures"                  
## [51] "Walt Disney Productions"               
## [52] "Warner Bros"
movies %>% count(company1) %>% arrange(desc(n))
## # A tibble: 52 × 2
##    company1                                   n
##    <fct>                                  <int>
##  1 Other                                  24274
##  2 No Company                             11861
##  3 Paramount Pictures                       996
##  4 Metro Goldwyn Mayer  MGM                 851
##  5 Twentieth Century Fox Film Corporation   780
##  6 Warner Bros                              757
##  7 Universal Pictures                       754
##  8 Columbia Pictures                        429
##  9 Columbia Pictures Corporation            401
## 10 RKO Radio Pictures                       290
## # ℹ 42 more rows
levels(droplevels(movies$company2)) 
##  [1] ""                                      
##  [2] "Other"                                 
##  [3] "Amblin Entertainment"                  
##  [4] "American International Pictures  AIP"  
##  [5] "BBC Films"                             
##  [6] "Blumhouse Productions"                 
##  [7] "British Broadcasting Corporation  BBC" 
##  [8] "Canal+"                                
##  [9] "Carolco Pictures"                      
## [10] "Castle Rock Entertainment"             
## [11] "Columbia Pictures Corporation"         
## [12] "Dimension Films"                       
## [13] "DreamWorks SKG"                        
## [14] "Dune Entertainment"                    
## [15] "Film i Väst"                           
## [16] "Film4"                                 
## [17] "Focus Features"                        
## [18] "Globo Filmes"                          
## [19] "Happy Madison Productions"             
## [20] "HBO Films"                             
## [21] "Hollywood Pictures"                    
## [22] "Lionsgate"                             
## [23] "M6 Films"                              
## [24] "Metro Goldwyn Mayer  MGM"              
## [25] "Millennium Films"                      
## [26] "Morgan Creek Productions"              
## [27] "Nickelodeon Movies"                    
## [28] "Original Film"                         
## [29] "Pixar Animation Studios"               
## [30] "PolyGram Filmed Entertainment"         
## [31] "Rai Cinema"                            
## [32] "Regency Enterprises"                   
## [33] "Relativity Media"                      
## [34] "Revolution Studios"                    
## [35] "Scott Rudin Productions"               
## [36] "Spyglass Entertainment"                
## [37] "StudioCanal"                           
## [38] "Svensk Filmindustri  SF"               
## [39] "TF1 Films Production"                  
## [40] "The Vitaphone Corporation"             
## [41] "Touchstone Pictures"                   
## [42] "TriStar Pictures"                      
## [43] "Twentieth Century Fox Film Corporation"
## [44] "UK Film Council"                       
## [45] "United Artists Pictures"               
## [46] "Universal Pictures"                    
## [47] "Walt Disney Animation Studios"         
## [48] "Walt Disney Productions"               
## [49] "Warner Bros"                           
## [50] "Warner Bros  Animation"                
## [51] "Wild Bunch"                            
## [52] "Zweites Deutsches Fernsehen  ZDF"
movies %>% count(company2) %>% arrange(desc(n))
## # A tibble: 52 × 2
##    company2                                     n
##    <fct>                                    <int>
##  1 ""                                       28428
##  2 "Other"                                  15072
##  3 "Warner Bros"                              270
##  4 "Metro Goldwyn Mayer  MGM"                 149
##  5 "Canal+"                                   124
##  6 "Touchstone Pictures"                       75
##  7 "Universal Pictures"                        71
##  8 "TF1 Films Production"                      52
##  9 "StudioCanal"                               47
## 10 "Twentieth Century Fox Film Corporation"    45
## # ℹ 42 more rows
levels(droplevels(movies$company3)) 
##  [1] ""                                      
##  [2] "Other"                                 
##  [3] "Amblin Entertainment"                  
##  [4] "BBC Films"                             
##  [5] "Blumhouse Productions"                 
##  [6] "British Broadcasting Corporation  BBC" 
##  [7] "Canal+"                                
##  [8] "Carolco Pictures"                      
##  [9] "Castle Rock Entertainment"             
## [10] "Columbia Pictures Corporation"         
## [11] "Dimension Films"                       
## [12] "Dune Entertainment"                    
## [13] "Film i Väst"                           
## [14] "Film4"                                 
## [15] "Focus Features"                        
## [16] "Globo Filmes"                          
## [17] "Happy Madison Productions"             
## [18] "HBO Films"                             
## [19] "Hollywood Pictures"                    
## [20] "Lionsgate"                             
## [21] "M6 Films"                              
## [22] "Metro Goldwyn Mayer  MGM"              
## [23] "Millennium Films"                      
## [24] "Morgan Creek Productions"              
## [25] "Nickelodeon Movies"                    
## [26] "Original Film"                         
## [27] "PolyGram Filmed Entertainment"         
## [28] "Rai Cinema"                            
## [29] "Regency Enterprises"                   
## [30] "Relativity Media"                      
## [31] "Revolution Studios"                    
## [32] "Scott Rudin Productions"               
## [33] "Spyglass Entertainment"                
## [34] "StudioCanal"                           
## [35] "Svensk Filmindustri  SF"               
## [36] "TF1 Films Production"                  
## [37] "The Vitaphone Corporation"             
## [38] "Touchstone Pictures"                   
## [39] "TriStar Pictures"                      
## [40] "Twentieth Century Fox Film Corporation"
## [41] "UK Film Council"                       
## [42] "Universal Pictures"                    
## [43] "Warner Bros"                           
## [44] "Warner Bros  Animation"                
## [45] "Wild Bunch"                            
## [46] "Zweites Deutsches Fernsehen  ZDF"
movies %>% count(company3) %>% arrange(desc(n))
## # A tibble: 46 × 2
##    company3                       n
##    <fct>                      <int>
##  1 ""                         36385
##  2 "Other"                     8332
##  3 "Warner Bros"                130
##  4 "Canal+"                     109
##  5 "Metro Goldwyn Mayer  MGM"    44
##  6 "Relativity Media"            42
##  7 "TF1 Films Production"        29
##  8 "Touchstone Pictures"         27
##  9 "Film4"                       20
## 10 "Millennium Films"            19
## # ℹ 36 more rows
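
The company_count calculation at the start of this step can be illustrated on a small toy data frame. This is only a sketch: the values below are made up for illustration, and company_count_strict is a hypothetical extra column, not part of the original analysis. It shows that rowSums() counts every value different from "", so a placeholder such as "No Company" is still counted as one company unless it is excluded explicitly.

```r
# Toy data frame standing in for the three company columns (illustrative values only)
toy <- data.frame(
  company1 = c("Warner Bros", "No Company", "Other"),
  company2 = c("Canal+", "", "Other"),
  company3 = c("", "", "Film4"),
  stringsAsFactors = TRUE
)

cols <- toy[, c("company1", "company2", "company3")]

# Same logic as in the report: count the non-empty columns per row
toy$company_count <- rowSums(cols != "")

# Hypothetical variant that also treats the "No Company" placeholder as missing
toy$company_count_strict <- rowSums(cols != "" & cols != "No Company")

toy[, c("company_count", "company_count_strict")]
```

In the second row the two counts differ (1 versus 0), which is the case to keep in mind if company_count is later used to separate films with and without a production company.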

Finally, leading and trailing whitespace is removed from the values of each company column.

# Remove leading and trailing whitespace (str_trim() returns character vectors)
movies <- movies %>% mutate(company1 = str_trim(company1))
movies <- movies %>% mutate(company2 = str_trim(company2))
movies <- movies %>% mutate(company3 = str_trim(company3))
# Reconvert variables to factor data type
movies <- movies %>% mutate(company1 = as.factor(company1))
movies <- movies %>% mutate(company2 = as.factor(company2))
movies <- movies %>% mutate(company3 = as.factor(company3))
# Levels should now have their white space removed
levels(droplevels(movies$company1))
##  [1] "American International Pictures  AIP"  
##  [2] "BBC Films"                             
##  [3] "British Broadcasting Corporation  BBC" 
##  [4] "Canal+"                                
##  [5] "Channel Four Films"                    
##  [6] "CJ Entertainment"                      
##  [7] "Columbia Pictures"                     
##  [8] "Columbia Pictures Corporation"         
##  [9] "DC Comics"                             
## [10] "DreamWorks SKG"                        
## [11] "First National Pictures"               
## [12] "Fox Film Corporation"                  
## [13] "Fox Searchlight Pictures"              
## [14] "France 2 Cinéma"                       
## [15] "Gaumont"                               
## [16] "Hammer Film Productions"               
## [17] "Hollywood Pictures"                    
## [18] "Imagine Entertainment"                 
## [19] "Lions Gate Films"                      
## [20] "Lionsgate"                             
## [21] "Metro Goldwyn Mayer  MGM"              
## [22] "Miramax Films"                         
## [23] "Monogram Pictures"                     
## [24] "Mosfilm"                               
## [25] "New Line Cinema"                       
## [26] "New World Pictures"                    
## [27] "Nikkatsu"                              
## [28] "No Company"                            
## [29] "Nordisk Film"                          
## [30] "Orion Pictures"                        
## [31] "Other"                                 
## [32] "Paramount Pictures"                    
## [33] "Rai Cinema"                            
## [34] "Regency Enterprises"                   
## [35] "RKO Radio Pictures"                    
## [36] "Shaw Brothers"                         
## [37] "Shôchiku Eiga"                         
## [38] "StudioCanal"                           
## [39] "Summit Entertainment"                  
## [40] "The Rank Organisation"                 
## [41] "TLA Releasing"                         
## [42] "Toho Company"                          
## [43] "Touchstone Pictures"                   
## [44] "TriStar Pictures"                      
## [45] "Twentieth Century Fox Film Corporation"
## [46] "United Artists"                        
## [47] "Universal International Pictures  UI"  
## [48] "Universal Pictures"                    
## [49] "Village Roadshow Pictures"             
## [50] "Walt Disney Pictures"                  
## [51] "Walt Disney Productions"               
## [52] "Warner Bros"
levels(droplevels(movies$company2))
##  [1] ""                                      
##  [2] "Amblin Entertainment"                  
##  [3] "American International Pictures  AIP"  
##  [4] "BBC Films"                             
##  [5] "Blumhouse Productions"                 
##  [6] "British Broadcasting Corporation  BBC" 
##  [7] "Canal+"                                
##  [8] "Carolco Pictures"                      
##  [9] "Castle Rock Entertainment"             
## [10] "Columbia Pictures Corporation"         
## [11] "Dimension Films"                       
## [12] "DreamWorks SKG"                        
## [13] "Dune Entertainment"                    
## [14] "Film i Väst"                           
## [15] "Film4"                                 
## [16] "Focus Features"                        
## [17] "Globo Filmes"                          
## [18] "Happy Madison Productions"             
## [19] "HBO Films"                             
## [20] "Hollywood Pictures"                    
## [21] "Lionsgate"                             
## [22] "M6 Films"                              
## [23] "Metro Goldwyn Mayer  MGM"              
## [24] "Millennium Films"                      
## [25] "Morgan Creek Productions"              
## [26] "Nickelodeon Movies"                    
## [27] "Original Film"                         
## [28] "Other"                                 
## [29] "Pixar Animation Studios"               
## [30] "PolyGram Filmed Entertainment"         
## [31] "Rai Cinema"                            
## [32] "Regency Enterprises"                   
## [33] "Relativity Media"                      
## [34] "Revolution Studios"                    
## [35] "Scott Rudin Productions"               
## [36] "Spyglass Entertainment"                
## [37] "StudioCanal"                           
## [38] "Svensk Filmindustri  SF"               
## [39] "TF1 Films Production"                  
## [40] "The Vitaphone Corporation"             
## [41] "Touchstone Pictures"                   
## [42] "TriStar Pictures"                      
## [43] "Twentieth Century Fox Film Corporation"
## [44] "UK Film Council"                       
## [45] "United Artists Pictures"               
## [46] "Universal Pictures"                    
## [47] "Walt Disney Animation Studios"         
## [48] "Walt Disney Productions"               
## [49] "Warner Bros"                           
## [50] "Warner Bros  Animation"                
## [51] "Wild Bunch"                            
## [52] "Zweites Deutsches Fernsehen  ZDF"
levels(droplevels(movies$company3))
##  [1] ""                                      
##  [2] "Amblin Entertainment"                  
##  [3] "BBC Films"                             
##  [4] "Blumhouse Productions"                 
##  [5] "British Broadcasting Corporation  BBC" 
##  [6] "Canal+"                                
##  [7] "Carolco Pictures"                      
##  [8] "Castle Rock Entertainment"             
##  [9] "Columbia Pictures Corporation"         
## [10] "Dimension Films"                       
## [11] "Dune Entertainment"                    
## [12] "Film i Väst"                           
## [13] "Film4"                                 
## [14] "Focus Features"                        
## [15] "Globo Filmes"                          
## [16] "Happy Madison Productions"             
## [17] "HBO Films"                             
## [18] "Hollywood Pictures"                    
## [19] "Lionsgate"                             
## [20] "M6 Films"                              
## [21] "Metro Goldwyn Mayer  MGM"              
## [22] "Millennium Films"                      
## [23] "Morgan Creek Productions"              
## [24] "Nickelodeon Movies"                    
## [25] "Original Film"                         
## [26] "Other"                                 
## [27] "PolyGram Filmed Entertainment"         
## [28] "Rai Cinema"                            
## [29] "Regency Enterprises"                   
## [30] "Relativity Media"                      
## [31] "Revolution Studios"                    
## [32] "Scott Rudin Productions"               
## [33] "Spyglass Entertainment"                
## [34] "StudioCanal"                           
## [35] "Svensk Filmindustri  SF"               
## [36] "TF1 Films Production"                  
## [37] "The Vitaphone Corporation"             
## [38] "Touchstone Pictures"                   
## [39] "TriStar Pictures"                      
## [40] "Twentieth Century Fox Film Corporation"
## [41] "UK Film Council"                       
## [42] "Universal Pictures"                    
## [43] "Warner Bros"                           
## [44] "Warner Bros  Animation"                
## [45] "Wild Bunch"                            
## [46] "Zweites Deutsches Fernsehen  ZDF"
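
Some levels above, such as "Metro Goldwyn Mayer  MGM", still contain a double space in the middle, because str_trim() only removes whitespace at the start and end of a string. If those internal double spaces are unwanted, one possible option is stringr's str_squish(). The sketch below uses an illustrative string rather than the movies data and is not a step taken in this report.

```r
library(stringr)

# Illustrative value with leading, trailing, and internal extra whitespace
x <- "  Metro Goldwyn Mayer  MGM "

# str_trim() removes only leading and trailing whitespace
trimmed <- str_trim(x)

# str_squish() also collapses internal runs of whitespace into a single space
squished <- str_squish(x)

trimmed
squished
```

Whether to squish depends on how the company names will be matched later: collapsing the double space changes the level names, so any code that refers to the old names would need to be updated as well.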

The company columns are now clean and properly categorized, and they are ready for use in the analysis.

country_language variables

  • Detected Issues: A two-character blank space in the values, movies with no language, and language names written in their original language. These issues could complicate our efforts to understand the dataset.
  • Affected Rows: 4050 (no language), all (blank space and levels in original language)
  • Proposed Solution: Remove the blank spaces, create a category called “Unspecified Language” for movies without a language, and translate all language levels to English.
# See levels for a variable
levels(droplevels(movies$country1_language))
##  [1] ""                 "Afrikaans"        "Azərbaycan"       "Bahasa indonesia"
##  [5] "Bahasa melayu"    "Bamanankan"       "Bokmål"           "Bosanski"        
##  [9] "Català"           "Český"            "Cymraeg"          "Dansk"           
## [13] "Deutsch"          "Eesti"            "English"          "Español"         
## [17] "Esperanto"        "euskera"          "Français"         "Fulfulde"        
## [21] "Gaeilge"          "Galego"           "Hausa"            "Hrvatski"        
## [25] "isiZulu"          "Íslenska"         "Italiano"         "Kinyarwanda"     
## [29] "Kiswahili"        "Latin"            "Latviešu"         "Lietuvi x9akai"  
## [33] "Magyar"           "Nederlands"       "No Language"      "Norsk"           
## [37] "Polski"           "Português"        "Pусский"          "Română"          
## [41] "shqip"            "Slovenčina"       "Slovenščina"      "Somali"          
## [45] "Srpski"           "suomi"            "svenska"          "Tiếng Việt"      
## [49] "Türkçe"           "Wolof"            "ελληνικά"         "беларуская мова" 
## [53] "български език"   "қазақ"            "Український"      "ქართული"         
## [57] "עִבְרִית"            "اردو"             "العربية"          "پښتو"            
## [61] "فارسی"            "हिन्दी"            "বাংলা"            "ਪੰਜਾਬੀ"           
## [65] "தமிழ்"             "తెలుగు"            "ภาษาไทย"          "한국어 조선말"   
## [69] "广州话   廣州話"  "日本語"           "普通话"
levels(droplevels(movies$country2_language))
##  [1] ""                 "Afrikaans"        "Bahasa indonesia" "Bahasa melayu"   
##  [5] "Bamanankan"       "Bosanski"         "Català"           "Český"           
##  [9] "Cymraeg"          "Dansk"            "Deutsch"          "Eesti"           
## [13] "English"          "Español"          "Esperanto"        "Français"        
## [17] "Fulfulde"         "Gaeilge"          "Galego"           "Hrvatski"        
## [21] "isiZulu"          "Íslenska"         "Italiano"         "Kinyarwanda"     
## [25] "Kiswahili"        "Latin"            "Latviešu"         "Lietuvi x9akai"  
## [29] "Magyar"           "Malti"            "Nederlands"       "No Language"     
## [33] "Norsk"            "ozbek"            "Polski"           "Português"       
## [37] "Pусский"          "Română"           "shqip"            "Slovenčina"      
## [41] "Slovenščina"      "Somali"           "Srpski"           "suomi"           
## [45] "svenska"          "Tiếng Việt"       "Türkçe"           "Wolof"           
## [49] "ελληνικά"         "български език"   "қазақ"            "Український"     
## [53] "ქართული"          "עִבְרִית"            "اردو"             "العربية"         
## [57] "پښتو"             "فارسی"            "हिन्दी"            "বাংলা"           
## [61] "ਪੰਜਾਬੀ"            "தமிழ்"             "తెలుగు"            "ภาษาไทย"         
## [65] "한국어 조선말"    "广州话   廣州話"  "日本語"           "普通话"
levels(droplevels(movies$country3_language))
##  [1] ""                 "Afrikaans"        "Bahasa indonesia" "Bahasa melayu"   
##  [5] "Bamanankan"       "Bosanski"         "Český"            "Cymraeg"         
##  [9] "Dansk"            "Deutsch"          "Eesti"            "English"         
## [13] "Español"          "Esperanto"        "euskera"          "Français"        
## [17] "Gaeilge"          "Hrvatski"         "isiZulu"          "Íslenska"        
## [21] "Italiano"         "Kiswahili"        "Latin"            "Latviešu"        
## [25] "Lietuvi x9akai"   "Magyar"           "Nederlands"       "Norsk"           
## [29] "Polski"           "Português"        "Pусский"          "Română"          
## [33] "shqip"            "Slovenčina"       "Slovenščina"      "Somali"          
## [37] "Srpski"           "suomi"            "svenska"          "Tiếng Việt"      
## [41] "Türkçe"           "Wolof"            "ελληνικά"         "български език"  
## [45] "қазақ"            "Український"      "ქართული"          "עִבְרִית"           
## [49] "اردو"             "العربية"          "پښتو"             "فارسی"           
## [53] "हिन्दी"            "ਪੰਜਾਬੀ"            "தமிழ்"             "తెలుగు"           
## [57] "ภาษาไทย"          "한국어 조선말"    "广州话   廣州話"  "日本語"          
## [61] "普通话"

The first step in cleaning these columns is to remove the whitespace from each of the variables.

# Remove leading and trailing whitespace (str_trim() returns character vectors)
movies <- movies %>% mutate(country1_language = str_trim(country1_language))
movies <- movies %>% mutate(country2_language = str_trim(country2_language))
movies <- movies %>% mutate(country3_language = str_trim(country3_language))
# Reconvert variables to factor data type
movies <- movies %>% mutate(country1_language = as.factor(country1_language))
movies <- movies %>% mutate(country2_language = as.factor(country2_language))
movies <- movies %>% mutate(country3_language = as.factor(country3_language))
# See levels for a variable
levels(droplevels(movies$country1_language))
##  [1] ""                 "Afrikaans"        "Azərbaycan"       "Bahasa indonesia"
##  [5] "Bahasa melayu"    "Bamanankan"       "Bokmål"           "Bosanski"        
##  [9] "Català"           "Český"            "Cymraeg"          "Dansk"           
## [13] "Deutsch"          "Eesti"            "English"          "Español"         
## [17] "Esperanto"        "euskera"          "Français"         "Fulfulde"        
## [21] "Gaeilge"          "Galego"           "Hausa"            "Hrvatski"        
## [25] "isiZulu"          "Íslenska"         "Italiano"         "Kinyarwanda"     
## [29] "Kiswahili"        "Latin"            "Latviešu"         "Lietuvi x9akai"  
## [33] "Magyar"           "Nederlands"       "No Language"      "Norsk"           
## [37] "Polski"           "Português"        "Pусский"          "Română"          
## [41] "shqip"            "Slovenčina"       "Slovenščina"      "Somali"          
## [45] "Srpski"           "suomi"            "svenska"          "Tiếng Việt"      
## [49] "Türkçe"           "Wolof"            "ελληνικά"         "беларуская мова" 
## [53] "български език"   "қазақ"            "Український"      "ქართული"         
## [57] "עִבְרִית"            "اردو"             "العربية"          "پښتو"            
## [61] "فارسی"            "हिन्दी"            "বাংলা"            "ਪੰਜਾਬੀ"           
## [65] "தமிழ்"             "తెలుగు"            "ภาษาไทย"          "한국어 조선말"   
## [69] "广州话   廣州話"  "日本語"           "普通话"
levels(droplevels(movies$country2_language))
##  [1] ""                 "Afrikaans"        "Bahasa indonesia" "Bahasa melayu"   
##  [5] "Bamanankan"       "Bosanski"         "Català"           "Český"           
##  [9] "Cymraeg"          "Dansk"            "Deutsch"          "Eesti"           
## [13] "English"          "Español"          "Esperanto"        "Français"        
## [17] "Fulfulde"         "Gaeilge"          "Galego"           "Hrvatski"        
## [21] "isiZulu"          "Íslenska"         "Italiano"         "Kinyarwanda"     
## [25] "Kiswahili"        "Latin"            "Latviešu"         "Lietuvi x9akai"  
## [29] "Magyar"           "Malti"            "Nederlands"       "No Language"     
## [33] "Norsk"            "ozbek"            "Polski"           "Português"       
## [37] "Pусский"          "Română"           "shqip"            "Slovenčina"      
## [41] "Slovenščina"      "Somali"           "Srpski"           "suomi"           
## [45] "svenska"          "Tiếng Việt"       "Türkçe"           "Wolof"           
## [49] "ελληνικά"         "български език"   "қазақ"            "Український"     
## [53] "ქართული"          "עִבְרִית"            "اردو"             "العربية"         
## [57] "پښتو"             "فارسی"            "हिन्दी"            "বাংলা"           
## [61] "ਪੰਜਾਬੀ"            "தமிழ்"             "తెలుగు"            "ภาษาไทย"         
## [65] "한국어 조선말"    "广州话   廣州話"  "日本語"           "普通话"
levels(droplevels(movies$country3_language))
##  [1] ""                 "Afrikaans"        "Bahasa indonesia" "Bahasa melayu"   
##  [5] "Bamanankan"       "Bosanski"         "Český"            "Cymraeg"         
##  [9] "Dansk"            "Deutsch"          "Eesti"            "English"         
## [13] "Español"          "Esperanto"        "euskera"          "Français"        
## [17] "Gaeilge"          "Hrvatski"         "isiZulu"          "Íslenska"        
## [21] "Italiano"         "Kiswahili"        "Latin"            "Latviešu"        
## [25] "Lietuvi x9akai"   "Magyar"           "Nederlands"       "Norsk"           
## [29] "Polski"           "Português"        "Pусский"          "Română"          
## [33] "shqip"            "Slovenčina"       "Slovenščina"      "Somali"          
## [37] "Srpski"           "suomi"            "svenska"          "Tiếng Việt"      
## [41] "Türkçe"           "Wolof"            "ελληνικά"         "български език"  
## [45] "қазақ"            "Український"      "ქართული"          "עִבְרִית"           
## [49] "اردو"             "العربية"          "پښتو"             "فارسی"           
## [53] "हिन्दी"            "ਪੰਜਾਬੀ"            "தமிழ்"             "తెలుగు"           
## [57] "ภาษาไทย"          "한국어 조선말"    "广州话   廣州話"  "日本語"          
## [61] "普通话"

The second step is to assign a name to the empty level, which carries no information, so that it can be identified more easily.

# Create the category "Unspecified Language"
movies <- movies %>% mutate(country1_language = fct_collapse(country1_language, "Unspecified Language" = ""))
movies <- movies %>% mutate(country2_language = fct_collapse(country2_language, "Unspecified Language" = ""))
movies <- movies %>% mutate(country3_language = fct_collapse(country3_language, "Unspecified Language" = ""))
# See levels for a variable
levels(droplevels(movies$country1_language))
##  [1] "Unspecified Language" "Afrikaans"            "Azərbaycan"          
##  [4] "Bahasa indonesia"     "Bahasa melayu"        "Bamanankan"          
##  [7] "Bokmål"               "Bosanski"             "Català"              
## [10] "Český"                "Cymraeg"              "Dansk"               
## [13] "Deutsch"              "Eesti"                "English"             
## [16] "Español"              "Esperanto"            "euskera"             
## [19] "Français"             "Fulfulde"             "Gaeilge"             
## [22] "Galego"               "Hausa"                "Hrvatski"            
## [25] "isiZulu"              "Íslenska"             "Italiano"            
## [28] "Kinyarwanda"          "Kiswahili"            "Latin"               
## [31] "Latviešu"             "Lietuvi x9akai"       "Magyar"              
## [34] "Nederlands"           "No Language"          "Norsk"               
## [37] "Polski"               "Português"            "Pусский"             
## [40] "Română"               "shqip"                "Slovenčina"          
## [43] "Slovenščina"          "Somali"               "Srpski"              
## [46] "suomi"                "svenska"              "Tiếng Việt"          
## [49] "Türkçe"               "Wolof"                "ελληνικά"            
## [52] "беларуская мова"      "български език"       "қазақ"               
## [55] "Український"          "ქართული"              "עִבְרִית"               
## [58] "اردو"                 "العربية"              "پښتو"                
## [61] "فارسی"                "हिन्दी"                "বাংলা"               
## [64] "ਪੰਜਾਬੀ"                "தமிழ்"                 "తెలుగు"               
## [67] "ภาษาไทย"              "한국어 조선말"        "广州话   廣州話"     
## [70] "日本語"               "普通话"
levels(droplevels(movies$country2_language))
##  [1] "Unspecified Language" "Afrikaans"            "Bahasa indonesia"    
##  [4] "Bahasa melayu"        "Bamanankan"           "Bosanski"            
##  [7] "Català"               "Český"                "Cymraeg"             
## [10] "Dansk"                "Deutsch"              "Eesti"               
## [13] "English"              "Español"              "Esperanto"           
## [16] "Français"             "Fulfulde"             "Gaeilge"             
## [19] "Galego"               "Hrvatski"             "isiZulu"             
## [22] "Íslenska"             "Italiano"             "Kinyarwanda"         
## [25] "Kiswahili"            "Latin"                "Latviešu"            
## [28] "Lietuvi x9akai"       "Magyar"               "Malti"               
## [31] "Nederlands"           "No Language"          "Norsk"               
## [34] "ozbek"                "Polski"               "Português"           
## [37] "Pусский"              "Română"               "shqip"               
## [40] "Slovenčina"           "Slovenščina"          "Somali"              
## [43] "Srpski"               "suomi"                "svenska"             
## [46] "Tiếng Việt"           "Türkçe"               "Wolof"               
## [49] "ελληνικά"             "български език"       "қазақ"               
## [52] "Український"          "ქართული"              "עִבְרִית"               
## [55] "اردو"                 "العربية"              "پښتو"                
## [58] "فارسی"                "हिन्दी"                "বাংলা"               
## [61] "ਪੰਜਾਬੀ"                "தமிழ்"                 "తెలుగు"               
## [64] "ภาษาไทย"              "한국어 조선말"        "广州话   廣州話"     
## [67] "日本語"               "普通话"
levels(droplevels(movies$country3_language))
##  [1] "Unspecified Language" "Afrikaans"            "Bahasa indonesia"    
##  [4] "Bahasa melayu"        "Bamanankan"           "Bosanski"            
##  [7] "Český"                "Cymraeg"              "Dansk"               
## [10] "Deutsch"              "Eesti"                "English"             
## [13] "Español"              "Esperanto"            "euskera"             
## [16] "Français"             "Gaeilge"              "Hrvatski"            
## [19] "isiZulu"              "Íslenska"             "Italiano"            
## [22] "Kiswahili"            "Latin"                "Latviešu"            
## [25] "Lietuvi x9akai"       "Magyar"               "Nederlands"          
## [28] "Norsk"                "Polski"               "Português"           
## [31] "Pусский"              "Română"               "shqip"               
## [34] "Slovenčina"           "Slovenščina"          "Somali"              
## [37] "Srpski"               "suomi"                "svenska"             
## [40] "Tiếng Việt"           "Türkçe"               "Wolof"               
## [43] "ελληνικά"             "български език"       "қазақ"               
## [46] "Український"          "ქართული"              "עִבְרִית"               
## [49] "اردو"                 "العربية"              "پښتو"                
## [52] "فارسی"                "हिन्दी"                "ਪੰਜਾਬੀ"               
## [55] "தமிழ்"                 "తెలుగు"                "ภาษาไทย"             
## [58] "한국어 조선말"        "广州话   廣州話"      "日本語"              
## [61] "普通话"

For the time being, the language names will not be translated. Translation could further improve the overall cleanliness of the data set, but the three columns can already be analyzed in their current state.
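
If the translation is carried out later, one possible approach is forcats' fct_recode() with a lookup vector. The sketch below is only an illustration: it uses a toy factor instead of the movies columns, and the translations vector is a hypothetical lookup covering four levels, not a complete mapping.

```r
library(forcats)

# Toy factor standing in for a country language column (illustrative values only)
lang <- factor(c("Deutsch", "Italiano", "Nederlands", "Polski", "English", "Deutsch"))

# Hypothetical lookup: names are the new English labels, values are the existing levels
translations <- c(
  German  = "Deutsch",
  Italian = "Italiano",
  Dutch   = "Nederlands",
  Polish  = "Polski"
)

# Splice the named vector into fct_recode(); levels not listed are left unchanged
lang_en <- fct_recode(lang, !!!translations)

levels(lang_en)
```

Keeping the mapping in a single named vector means the same lookup could be reused for all three language columns, and levels that are missing from the lookup simply keep their original name.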

Finding invalid imdb_id

An IMDb ID in this data set should always have the same number of characters; for example, “tt7158814”. It contains 9 characters in total, so any imdb_id with a different length should be corrected.

# Find out the length of each imdb_id
str_length(movies$imdb_id) %>% head(10)
##  [1] 0 9 9 9 9 9 9 9 9 9

An invalid ID appears within the first 10 rows, but we still need to check whether there are other errors besides that one.

# Search for invalid IDs
movies %>%
  filter(str_length(imdb_id) != 9)
## # A tibble: 1 × 40
##   adult  budget homepage id    imdb_id original_language original_title overview
##   <lgl>   <dbl> <chr>    <chr> <chr>   <fct>             <chr>          <chr>   
## 1 FALSE  4.22e6 ""       15257 ""      en                Hulk vs. Wolv… Departm…
## # ℹ 32 more variables: popularity <dbl>, poster_path <chr>,
## #   release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## #   tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## #   vote_count <int>, id_collection <chr>, name_collection <chr>,
## #   poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## #   genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## #   country1_language <fct>, country2_language <fct>, …

Running the code shows that the only invalid id is the one detected previously. Searching IMDb for the title of the movie gives its id, which is “tt1308622”.

# Replace the invalid (empty) id with the correct one
movies <- movies %>%
  mutate(imdb_id = case_when(imdb_id == "" ~ "tt1308622", TRUE ~ imdb_id))
# Check whether any invalid ids remain
movies %>% filter(str_length(imdb_id) != 9)
## # A tibble: 0 × 40
## # ℹ 40 variables: adult <lgl>, budget <dbl>, homepage <chr>, id <chr>,
## #   imdb_id <chr>, original_language <fct>, original_title <chr>,
## #   overview <chr>, popularity <dbl>, poster_path <chr>, release_date <date>,
## #   revenue <dbl>, runtime <dbl>, status <fct>, tagline <chr>, title <chr>,
## #   video <chr>, vote_average <dbl>, vote_count <int>, id_collection <chr>,
## #   name_collection <chr>, poster_path_collection <chr>,
## #   backdrop_path_collection <chr>, genre1 <fct>, genre2 <fct>, genre3 <fct>, …
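
A length check alone would accept any 9-character string. As a minimal sketch, the format can also be checked with a regular expression. The pattern below (the prefix "tt" followed by exactly seven digits) is an assumption based on the example id above, and the sample ids are hypothetical.

```r
# Hypothetical sample of ids: one empty, two valid, two malformed
ids <- c("", "tt7158814", "tt1308622", "7158814tt", "tt71588")

# Assumed format: "tt" followed by exactly seven digits
valid_format <- grepl("^tt[0-9]{7}$", ids)

# Show the ids that fail the format check
ids[!valid_format]
```

Applied to the real data, the same idea would be `movies %>% filter(!grepl("^tt[0-9]{7}$", imdb_id))`, which should also return zero rows after the correction.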

Merging Datasets

Up to this point, I have focused only on the movies_metadata CSV file, but it is not the only file related to this data set. Another file could be relevant to add in order to expand our analysis possibilities. To include its information in this data set, we are going to use merge functions.

The file we are going to merge is the keywords file, which groups the keywords that identify a movie by id.

Importing and merging

Text Data and Distance

To fix any spelling errors in the country language columns, we are going to use the stringdist and fuzzyjoin packages, which match values by string distance and so help us correct any typos in those columns.

# Get the unique languages
unique_languages <- table(movies$country1_language)
write.csv(unique_languages, "languages.csv")
# Read list with correct names
languages_corrected <- read.csv("D:\\Business Analytics\\languages_corrected.csv")
# Join both datasets using string distance ("dl" = Damerau-Levenshtein) as the criterion
movies <- movies %>%
  stringdist_left_join(languages_corrected, by = c("country1_language" = "Language"), method = "dl") %>%
  stringdist_left_join(languages_corrected, by = c("country2_language" = "Language"), method = "dl") %>%
  stringdist_left_join(languages_corrected, by = c("country3_language" = "Language"), method = "dl")
# Count the values
summary(movies$country1_language)
## Unspecified Language            Afrikaans           Azərbaycan 
##                 4051                   22                    4 
##     Bahasa indonesia        Bahasa melayu           Bamanankan 
##                   26                    5                    4 
##               Bokmål             Bosanski               Català 
##                    3                   26                   31 
##                Český              Cymraeg                Dansk 
##                  270                    2                  300 
##              Deutsch                Eesti              English 
##                 1321                   41                26890 
##              Español            Esperanto              euskera 
##                 1144                    3                   14 
##             Français             Fulfulde              Gaeilge 
##                 2430                    1                    6 
##               Galego                Hausa             Hrvatski 
##                    3                    1                   34 
##              isiZulu             Íslenska             Italiano 
##                    4                   32                 1416 
##          Kinyarwanda            Kiswahili                Latin 
##                    1                    2                   24 
##             Latviešu       Lietuvi x9akai               Magyar 
##                   17                   15                  144 
##           Nederlands          No Language                Norsk 
##                  297                  306                  112 
##               Polski            Português              Pусский 
##                  246                  330                  909 
##               Română                shqip           Slovenčina 
##                   75                   24                   18 
##          Slovenščina               Somali               Srpski 
##                   24                    1                   47 
##                suomi              svenska           Tiếng Việt 
##                  345                  676                   15 
##               Türkçe                Wolof             ελληνικά 
##                  149                    3                  133 
##      беларуская мова       български език                қазақ 
##                    2                   25                    8 
##          Український              ქართული                עִבְרִית 
##                   16                   21                   76 
##                 اردو              العربية                 پښتو 
##                   15                  269                    2 
##                فارسی                हिन्दी                বাংলা 
##                  102                  546                   43 
##                ਪੰਜਾਬੀ                 தமிழ்                తెలుగు 
##                    4                   81                   43 
##              ภาษาไทย        한국어 조선말      广州话   廣州話 
##                   72                  446                  405 
##               日本語               普通话 
##                 1385                  414
summary(movies$country2_language)
## Unspecified Language            Afrikaans     Bahasa indonesia 
##                37996                    4                    9 
##        Bahasa melayu           Bamanankan             Bosanski 
##                    4                    1                    3 
##               Català                Český              Cymraeg 
##                    5                   14                    4 
##                Dansk              Deutsch                Eesti 
##                   19                  924                   10 
##              English              Español            Esperanto 
##                 1617                  786                    3 
##             Français             Fulfulde              Gaeilge 
##                 1488                    1                   11 
##               Galego             Hrvatski              isiZulu 
##                    1                   14                    5 
##             Íslenska             Italiano          Kinyarwanda 
##                   22                  623                    2 
##            Kiswahili                Latin             Latviešu 
##                    7                   48                    2 
##       Lietuvi x9akai               Magyar                Malti 
##                    7                  138                    2 
##           Nederlands          No Language                Norsk 
##                   29                   13                   49 
##                ozbek               Polski            Português 
##                    2                  162                  160 
##              Pусский               Română                shqip 
##                  332                   27                    1 
##           Slovenčina          Slovenščina               Somali 
##                   18                   10                    4 
##               Srpski                suomi              svenska 
##                   27                   40                  244 
##           Tiếng Việt               Türkçe                Wolof 
##                   29                   51                    7 
##             ελληνικά       български език                қазақ 
##                   44                    4                    2 
##          Український              ქართული                עִבְרִית 
##                   17                    8                   74 
##                 اردو              العربية                 پښتو 
##                   19                   28                    2 
##                فارسی                हिन्दी                বাংলা 
##                   26                  126                    2 
##                ਪੰਜਾਬੀ                 தமிழ்                తెలుగు 
##                    6                   21                   16 
##              ภาษาไทย        한국어 조선말      广州话   廣州話 
##                   42                   54                   44 
##               日本語               普通话 
##                  240                  222
summary(movies$country3_language)
## Unspecified Language            Afrikaans     Bahasa indonesia 
##                43438                    2                    2 
##        Bahasa melayu           Bamanankan             Bosanski 
##                    6                    1                    2 
##                Český              Cymraeg                Dansk 
##                    2                    1                    5 
##              Deutsch                Eesti              English 
##                  330                    1                  243 
##              Español            Esperanto              euskera 
##                  309                    1                    1 
##             Français              Gaeilge             Hrvatski 
##                  237                    4                    3 
##              isiZulu             Íslenska             Italiano 
##                    6                    8                  225 
##            Kiswahili                Latin             Latviešu 
##                    9                   41                    2 
##       Lietuvi x9akai               Magyar           Nederlands 
##                    2                   60                    7 
##                Norsk               Polski            Português 
##                   24                   82                   65 
##              Pусский               Română                shqip 
##                  212                   18                    3 
##           Slovenčina          Slovenščina               Somali 
##                    4                    4                    4 
##               Srpski                suomi              svenska 
##                   21                    8                  112 
##           Tiếng Việt               Türkçe                Wolof 
##                    8                   29                    2 
##             ελληνικά       български език                қазақ 
##                   28                    2                    1 
##          Український              ქართული                עִבְרִית 
##                    8                    4                   37 
##                 اردو              العربية                 پښتو 
##                   15                   31                    1 
##                فارسی                हिन्दी                ਪੰਜਾਬੀ 
##                    8                   28                    7 
##                 தமிழ்                తెలుగు              ภาษาไทย 
##                    4                    8                   31 
##        한국어 조선말      广州话   廣州話               日本語 
##                   28                   12                   87 
##               普通话 
##                   88

The corrected language names are imported and matched against the dataset to catch any typos. This ensures that each language stays in a single category.
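
To illustrate how matching by string distance corrects typos, here is a minimal sketch using base R's `adist()`, which computes plain Levenshtein distance rather than the Damerau-Levenshtein ("dl") method used in the join. The language names, the typos and the threshold of 2 are all hypothetical values chosen for the example.

```r
# Hypothetical reference list of correct spellings
canonical <- c("English", "German", "Spanish", "Italian")
# Hypothetical observed values, some containing typos
observed <- c("Englsh", "Germna", "Spanish", "Itallian", "Klingon")

# Levenshtein distance matrix: rows = observed, columns = canonical
distances <- adist(observed, canonical)

# Closest canonical spelling for each observed value, and its distance
best <- apply(distances, 1, which.min)
best_dist <- distances[cbind(seq_along(observed), best)]

# Accept the match only when it is within the assumed threshold
max_dist <- 2
corrected <- ifelse(best_dist <= max_dist, canonical[best], NA_character_)

data.frame(observed, corrected, distance = best_dist)
```

Values with no close match, such as the last one, stay `NA` instead of being forced into the nearest category, which is the behaviour one would usually want before collapsing factor levels.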

Now we import keywords.

# Import the keywords file
keywords <- read.csv("D:\\Business Analytics\\keywords.csv")
# Remove duplicate rows from keywords
keywords <- distinct(keywords)
# Left-merge keywords into movies, keeping every row of movies
movies <- merge(movies, keywords, all.x = TRUE)

The movies now contain their corresponding keywords where available. However, the keywords are stored in JSON format, which is not suitable for analysis, so we are going to create three new columns that record the first three keywords of each movie.

Cleaning merged dataset

# Separate into columns by :
new_keywords <- str_split_fixed(movies$keywords, ":", n = Inf)
# Choose until the third keyword
new_keywords <- new_keywords[, 2:7]
# Only select columns with the keywords
new_keywords <- new_keywords[, c(2,4,6)]
summary(new_keywords)
##       V1                 V2                 V3           
##  Length:45972       Length:45972       Length:45972      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
# Convert to data frame
new_keywords <- as.data.frame(new_keywords)
# Cleaning and trimming spaces
new_keywords <- new_keywords %>% 
  mutate(
    keyword1 = str_replace_all(V1, "[[:punct:]]", " "),
    keyword2 = str_replace_all(V2, "[[:punct:]]", " "),
    keyword3 = str_replace_all(V3, "[[:punct:]]", " ")
  ) %>% 
  mutate(
    keyword1 = str_remove_all(keyword1, "\\bid\\b"),
    keyword2 = str_remove_all(keyword2, "\\bid\\b"),
    keyword3 = str_remove_all(keyword3, "\\bid\\b")
  )
# Remove the original columns V1, V2, V3
new_keywords <- select(new_keywords, keyword1, keyword2, keyword3)
# Trim all leading and trailing white spaces
new_keywords <- new_keywords %>% 
  mutate(
    keyword1 = str_trim(keyword1),
    keyword2 = str_trim(keyword2),
    keyword3 = str_trim(keyword3)
  )
# Convert to factor
new_keywords$keyword1 <- as.factor(new_keywords$keyword1)
new_keywords$keyword2 <- as.factor(new_keywords$keyword2)
new_keywords$keyword3 <- as.factor(new_keywords$keyword3)
# Display the summary to check the clean data
summary(new_keywords)
##              keyword1                 keyword2                 keyword3    
##                  :14520                   :21056                   :25845  
##  woman director  : 1344   woman director  :  387   woman director  :  248  
##  independent film:  700   independent film:  234   independent film:  210  
##  based on novel  :  446   sex             :  212   murder          :  175  
##  musical         :  386   based on novel  :  210   nudity          :  133  
##  female nudity   :  376   murder          :  196   sex             :  105  
##  (Other)         :28200   (Other)         :23677   (Other)         :19256
movies <- cbind(movies, new_keywords)
movies <- movies %>% 
  select(-keywords) 

Thanks to the merge, we can now see which keywords are the most popular across the whole dataset and look up the keywords of any specific movie, which could be important to consider during data analysis.
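
As an alternative to splitting on ":" and selecting columns by position, the keyword names can be pulled out with a regular expression. This is only a sketch: it assumes the raw strings look like `[{'id': 931, 'name': 'jealousy'}, ...]`, the sample strings are hypothetical, and `first_keywords` is a helper written for this example. Names that contain an apostrophe would not be captured by this simple pattern.

```r
# Hypothetical sample strings, assumed to mimic the raw keywords column
raw_keywords <- c(
  "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}]",
  "[{'id': 10090, 'name': 'board game'}]",
  "[]"
)

# Return the first n keyword names found in one string, padded with NA
first_keywords <- function(x, n = 3) {
  hits <- regmatches(x, gregexpr("'name': '[^']*'", x))[[1]]
  names_only <- sub("^'name': '", "", sub("'$", "", hits))
  length(names_only) <- n  # truncates to n or pads with NA
  names_only
}

# One row per movie, one column per keyword
keyword_matrix <- t(vapply(raw_keywords, first_keywords, character(3), USE.NAMES = FALSE))
colnames(keyword_matrix) <- c("keyword1", "keyword2", "keyword3")
keyword_df <- as.data.frame(keyword_matrix, stringsAsFactors = FALSE)
keyword_df
```

Movies with fewer than three keywords get `NA` rather than an empty string, so the empty level that dominates the summary above would show up as missing values instead.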

Removing unnecessary variables

# Check for missing values in the 'title' variable
missing_title <- sum(is.na(movies$title))

# Display the number of missing values
missing_title
## [1] 0
tail(movies$title)
## [1] "The Great Mouse Detective" "Exit Smiling"             
## [3] "Turn It Up"                "Gabriel"                  
## [5] "Hot Stuff"                 "The Free Will"

Even though the code reports no missing values, a manual inspection shows that title does have missing values, while original_title remains intact. For that reason title is dropped and original_title is kept.
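
One possible explanation for the mismatch between `is.na()` and the manual check is that the missing titles are stored as empty or blank strings rather than as `NA`. That is an assumption, not something verified on the dataset. The sketch below uses a hypothetical vector to show how both kinds of missing value could be counted.

```r
# Hypothetical title vector with an empty string, an NA and a blank string
titles <- c("Toy Story", "", NA, "  ", "Heat")

# is.na() only counts true NA values
n_na <- sum(is.na(titles))

# Blank or empty strings need a separate check
is_blank <- !is.na(titles) & trimws(titles) == ""
n_blank <- sum(is_blank)

# Recode blanks as NA so later checks treat them as missing
titles_clean <- ifelse(is_blank, NA_character_, titles)

c(na = n_na, blank = n_blank, missing_after_recode = sum(is.na(titles_clean)))
```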

# Remove unnecessary columns
movies <- select(movies, -video, -title)

Visualizing missing values

# Split the dataset to obtain a better visualization when using visdat
moviesp1 <- select(movies,1:10)
moviesp2 <- select(movies,11:20)
moviesp3 <- select(movies,21:30)
moviesp4 <- select(movies,31:41)
# Visualize missing values
vis_miss(moviesp1,warn_large_data = FALSE)

vis_miss(moviesp2,warn_large_data = FALSE)

vis_miss(moviesp3,warn_large_data = FALSE)

vis_miss(moviesp4,warn_large_data = FALSE)
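
As a numeric complement to the plots, the share of missing values per column can be computed with base R. The data frame below is a hypothetical toy example, not the movies data.

```r
# Hypothetical toy data frame with some missing values
toy <- data.frame(
  budget  = c(100, NA, 300, 400),
  runtime = c(90, 120, NA, NA),
  status  = c("Released", "Released", "Released", "Rumored"),
  stringsAsFactors = FALSE
)

# Proportion of NA values in each column, largest first
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
missing_share
```

On the real data the same call would be `sort(colMeans(is.na(movies)), decreasing = TRUE)`, which avoids having to split the columns into groups.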

Exploring and visualizing categorical and numerical data

Data Explorer

Now that we have merged and cleaned the new dataset, we can create a report with the DataExplorer library to obtain insights about it. First, though, it is necessary to remove unnecessary columns in order to obtain a clearer report. Where a variable is spread over more than one column, the report will include only the first of them.

# Creating a shorter data set
movies_short <- select(
  movies,
  adult, original_title, revenue, budget, runtime, release_date, status,
  vote_average, vote_count, popularity_max, genre1, genre_count,
  company1, company_count, country1, country1_language, keyword1
)

Now that we have the simplified dataset movies_short, we will create the report using revenue as the main focus of the analysis.

# For the report we are going to use revenue as our dependent variable
#create_report(movies_short,y= "revenue")

The report was created with DataExplorer to inspect missing values and other summaries. The call is kept commented out so that the report is not regenerated and opened every time the code runs.

Tables

## 
##          Action       Adventure       Animation          Comedy           Crime 
##            4518            1525            1133            8929            1705 
##     Documentary           Drama          Family         Fantasy         Foreign 
##            3447           12156             539             708             118 
##         History          Horror           Music         Mystery         Romance 
##             283            2634             493             560            1201 
## Science Fiction        Thriller        TV Movie     Unspecified             War 
##             647            1692             391            2458             384 
##         Western 
##             451
## 
##          Action       Adventure       Animation          Comedy           Crime 
##     0.098277212     0.033172366     0.024645436     0.194226921     0.037087793 
##     Documentary           Drama          Family         Fantasy         Foreign 
##     0.074980423     0.264421822     0.011724528     0.015400679     0.002566780 
##         History          Horror           Music         Mystery         Romance 
##     0.006155921     0.057295745     0.010723919     0.012181328     0.026124598 
## Science Fiction        Thriller        TV Movie     Unspecified             War 
##     0.014073784     0.036805012     0.008505177     0.053467328     0.008352910 
##         Western 
##     0.009810319

Drama is the most common genre in the dataset by a clear margin, at about 26% of the movies, followed by comedy at about 19%.
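
Count and proportion tables like the two above can be produced with `table()` and `prop.table()`. The code that generated them is not shown in the report, so the following is only a sketch on a hypothetical genre vector.

```r
# Hypothetical genre vector standing in for movies$genre1
genres <- factor(c("Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"))

# Absolute counts per genre
genre_counts <- table(genres)

# Relative frequencies (they sum to 1)
genre_share <- prop.table(genre_counts)

# Largest share first, rounded for readability
round(sort(genre_share, decreasing = TRUE), 3)
```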

## 
##    xx 104.0  68.0  82.0    ab    af    am    ar    ay    bg    bm    bn    bo 
##    45     0     0     0    10     2     2    39     1    10     3    29     2 
##    bs    ca    cn    cs    cy    da    de    el    en    eo    es    et    eu 
##    14    12   313   135     1   241  1081   113 32365     1   995    24     3 
##    fa    fi    fr    fy    gl    he    hi    hr    hu    hy    id    is    it 
##   100   308  2438     1     1    67   508    30   102     1    20    24  1532 
##    iu    ja    jv    ka    kk    kn    ko    ku    ky    la    lb    lo    lt 
##     2  1347     1    18     3     3   444     3     3     1     1     2     9 
##    lv    mk    ml    mn    mr    ms    mt    nb    ne    nl    no    pa    pl 
##    18     5    36     2    25     5     1     6     2   248   119     2   218 
##    ps    pt    qu    ro    ru    rw    sh    si    sk    sl    sm    sq    sr 
##     2   316     1    57   826     1     5     1    18    33     1     5    63 
##    sv    ta    te    tg    th    tl    tr    uk    ur    uz    vi    wo    zh 
##   724    78    45     1    75    23   150    16     8     1    10     5   409 
##    zu 
##     1

As expected, most movies are in English: 32,365 titles have "en" as their original language.

##                  
##                        Afghanistan Albania Algeria Angola Argentina Armenia
##   Action           296           0       0       0      0        11       0
##   Adventure         72           0       0       1      0         5       0
##   Animation        131           0       0       0      0         5       1
##   Comedy           871           0       1       0      0        26       2
##   Crime             78           0       0       0      0         8       0
##   Documentary     1157           0       0       0      0         7       1
##   Drama           1129           3       1       5      2        99       3
##   Family            74           0       0       0      0         1       0
##   Fantasy           47           0       0       0      0         3       0
##   Foreign           17           0       0       0      0         1       0
##   History           28           0       0       1      0         0       0
##   Horror           204           0       0       0      0         5       0
##   Music            101           0       0       0      0         1       0
##   Mystery           41           0       0       0      0         3       0
##   Romance          110           0       0       0      0         5       0
##   Science Fiction   62           0       0       0      0         5       0
##   Thriller         149           0       1       0      0         6       0
##   TV Movie          85           0       0       0      0         1       0
##   Unspecified     1600           0       0       0      0        16       0
##   War               29           1       0       1      0         1       0
##   Western           33           0       0       0      0         3       0
##                  
##                   Aruba Australia Austria Azerbaijan Bahamas Bangladesh Belarus
##   Action              3        63       5          0       0          0       1
##   Adventure           1        28       2          0       2          0       0
##   Animation           0        12       0          0       0          0       0
##   Comedy              0        78      26          0       1          0       2
##   Crime               0        24       5          0       0          0       0
##   Documentary         0        16      26          0       0          0       0
##   Drama               0       131      54          1       0          1       1
##   Family              0        13       1          0       0          1       0
##   Fantasy             0        11       1          0       0          0       0
##   Foreign             0         3       0          0       0          0       0
##   History             0         3       0          0       0          0       0
##   Horror              0        38       5          0       0          0       0
##   Music               0         4       2          0       0          0       0
##   Mystery             0        13       1          0       0          0       0
##   Romance             0        13       2          0       0          0       0
##   Science Fiction     0        11       3          0       0          0       0
##   Thriller            0        32       5          0       1          0       0
##   TV Movie            0         0       0          0       0          0       0
##   Unspecified         0         5      11          0       0          0       0
##   War                 0         7       0          0       0          0       1
##   Western             0         1       2          0       0          0       0
##                  
##                   Belgium Bermuda Bhutan Bolivia Bosnia and Herzegovina
##   Action               10       0      1       0                      1
##   Adventure             9       0      1       1                      0
##   Animation            10       0      0       0                      0
##   Comedy               53       0      1       0                      3
##   Crime                 9       0      0       0                      0
##   Documentary           7       1      0       1                      0
##   Drama               117       0      0       4                     15
##   Family                6       0      0       0                      0
##   Fantasy               5       0      0       0                      0
##   Foreign               0       0      0       0                      0
##   History               5       0      0       0                      0
##   Horror               14       0      0       0                      0
##   Music                 2       0      0       0                      0
##   Mystery               5       0      0       0                      0
##   Romance              19       0      0       0                      0
##   Science Fiction       4       0      0       0                      0
##   Thriller              9       0      0       1                      0
##   TV Movie              1       0      0       0                      0
##   Unspecified          12       0      0       1                      0
##   War                   2       0      0       0                      3
##   Western               0       0      0       0                      0
##                  
##                   Botswana Brazil Brunei Darussalam Bulgaria Burkina Faso
##   Action                 1      7                 1        7            0
##   Adventure              0     11                 0        3            0
##   Animation              0      3                 0        0            0
##   Comedy                 0     54                 0        5            0
##   Crime                  0      7                 0        0            0
##   Documentary            1     30                 0        1            0
##   Drama                  0    113                 0       14            5
##   Family                 0      3                 0        0            0
##   Fantasy                0      0                 0        0            0
##   Foreign                0      1                 0        0            0
##   History                0      1                 0        0            1
##   Horror                 0      6                 0        2            0
##   Music                  0      2                 0        0            0
##   Mystery                0      2                 0        0            0
##   Romance                0      8                 0        0            1
##   Science Fiction        0      1                 0        2            0
##   Thriller               0      4                 0        2            0
##   TV Movie               0      0                 0        0            0
##   Unspecified            0      9                 0        0            0
##   War                    0      0                 0        0            1
##   Western                0      0                 0        0            0
##                  
##                   Cambodia Cameroon Canada Chad Chile China Colombia Congo
##   Action                 0        0    166    1     1    67        1     1
##   Adventure              0        0     56    0     0    15        1     1
##   Animation              0        0     43    0     1     5        1     0
##   Comedy                 1        0    202    0    11    25        1     0
##   Crime                  0        0     49    0     1     4        2     0
##   Documentary            4        1    118    0     6     8        3     1
##   Drama                  1        2    381    0    20   122        8     0
##   Family                 0        0     20    0     0     0        0     0
##   Fantasy                0        0     22    0     1     9        0     0
##   Foreign                0        0      0    0     1     0        0     0
##   History                0        0      4    0     0     3        0     0
##   Horror                 0        0    167    0     1     0        0     0
##   Music                  0        0      9    0     0     0        0     0
##   Mystery                0        0     23    0     0     4        0     0
##   Romance                0        0     35    0     1    17        0     0
##   Science Fiction        0        1     33    0     0     0        0     0
##   Thriller               0        0     93    0     1    10        1     0
##   TV Movie               0        0     40    0     0     0        0     0
##   Unspecified            0        0     29    0     5    10        1     0
##   War                    0        0      5    1     0     2        0     0
##   Western                0        0      6    0     0     0        0     0
##                  
##                   Costa Rica Cote D Ivoire Croatia Cuba Cyprus Czech Republic
##   Action                   0             0       1    0      0              5
##   Adventure                0             0       1    0      0             11
##   Animation                0             0       0    0      0             17
##   Comedy                   1             0       7    3      1             37
##   Crime                    0             0       1    0      0              4
##   Documentary              0             0       0    4      0              2
##   Drama                    3             2      16    6      1             42
##   Family                   0             0       0    0      0              9
##   Fantasy                  0             0       1    1      0              6
##   Foreign                  0             0       0    0      0              0
##   History                  0             0       1    0      0              5
##   Horror                   0             0       1    1      0              4
##   Music                    0             0       0    0      0              4
##   Mystery                  0             0       0    0      0              3
##   Romance                  0             0       1    1      0              4
##   Science Fiction          0             0       1    0      0              2
##   Thriller                 0             0       1    0      0              6
##   TV Movie                 0             0       0    0      0              0
##   Unspecified              0             0       0    0      0              4
##   War                      0             0       0    0      0              4
##   Western                  0             0       0    0      0              0
##                  
##                   Czechoslovakia Denmark Dominican Republic East Germany
##   Action                       0      13                  1            0
##   Adventure                    0      12                  0            0
##   Animation                    0       5                  0            0
##   Comedy                       3      50                  1            0
##   Crime                        0      16                  2            0
##   Documentary                  0      24                  0            0
##   Drama                        1     132                  2            1
##   Family                       0      13                  0            0
##   Fantasy                      0       1                  0            1
##   Foreign                      0       1                  0            0
##   History                      0       2                  0            1
##   Horror                       0      16                  0            0
##   Music                        0       1                  0            1
##   Mystery                      0       3                  0            0
##   Romance                      0       6                  0            0
##   Science Fiction              0       1                  0            1
##   Thriller                     0      22                  0            0
##   TV Movie                     0       0                  0            0
##   Unspecified                  0      10                  0            0
##   War                          0       2                  0            0
##   Western                      0       0                  0            0
##                  
##                   Ecuador Egypt El Salvador Estonia Ethiopia Finland France
##   Action                0     2           0       0        0      13    147
##   Adventure             0     0           0       4        0       3     85
##   Animation             0     0           0       3        0       2     42
##   Comedy                0     1           0       9        0      83    643
##   Crime                 1     0           0       3        0      12    117
##   Documentary           2     0           0       4        0      28    129
##   Drama                 1    10           1      14        3     110   1001
##   Family                0     0           0       1        0       5     22
##   Fantasy               0     0           0       0        0       4     57
##   Foreign               0     0           0       0        0       3      4
##   History               0     0           0       1        0       4     27
##   Horror                0     1           0       2        0       2     60
##   Music                 0     0           0       0        0       6     18
##   Mystery               0     1           0       1        0       1     28
##   Romance               1     1           0       0        0      12    105
##   Science Fiction       0     0           0       0        0       4     18
##   Thriller              0     2           0       1        0      12     96
##   TV Movie              0     0           0       1        0       2      4
##   Unspecified           0     1           0       3        0      30     60
##   War                   0     0           0       1        0       6     33
##   Western               0     0           0       0        0       0     10
##                  
##                   Georgia Germany Ghana Greece Guatemala Hong Kong Hungary
##   Action                1      93     0      3         0       246       2
##   Adventure             0      63     0      1         0        13       8
##   Animation             0      34     0      0         0         3       6
##   Comedy                4     283     0     33         0        35      15
##   Crime                 0      41     0      3         0        21       7
##   Documentary           2     113     0      3         0         3       2
##   Drama                 9     515     0     57         1        69      54
##   Family                0      24     0      0         0         0       1
##   Fantasy               0      32     0      3         0        14       0
##   Foreign               1       1     0      1         0         3       1
##   History               0      11     0      0         0         0       2
##   Horror                1      47     0      2         0        16       0
##   Music                 0      11     0      1         0         2       1
##   Mystery               0      20     0      1         0         0       1
##   Romance               3      40     1      6         0        11       4
##   Science Fiction       0      13     0      2         0         4       0
##   Thriller              0      42     1      6         0        18       5
##   TV Movie              0       8     0      0         0         0       0
##   Unspecified           0      28     0      8         0         7       9
##   War                   0      11     0      1         0         3       3
##   Western               0       5     0      0         0         0       0
##                  
##                   Iceland India Indonesia Iran Iraq Ireland Israel Italy
##   Action                1   141         6    2    0       6      7   119
##   Adventure             2    13         0    1    0       5      1    47
##   Animation             0     8         0    0    0       2      0     6
##   Comedy                7   120         3    4    0      29     18   360
##   Crime                 0    31         1    1    0       1      1    61
##   Documentary           4    10         0    7    1       4      5    28
##   Drama                21   257        11   68    1      60     48   361
##   Family                2     7         0    3    0       3      0     4
##   Fantasy               0     7         0    0    0       3      0    17
##   Foreign               0    15         2    0    0       0      0     9
##   History               1     4         0    0    0       1      2    18
##   Horror                2    19         3    0    0       9      4   123
##   Music                 1    12         0    0    0       1      0     1
##   Mystery               0     6         0    0    0       0      1    25
##   Romance               0    50         0    2    0       5      2    49
##   Science Fiction       1     2         0    0    0       2      1    21
##   Thriller              0    50         1    1    0       2      0    55
##   TV Movie              0     0         0    0    0       0      0     0
##   Unspecified           0    29         1    2    0       1      3    89
##   War                   0     2         0    0    0       2      3    17
##   Western               0     0         0    0    0       0      0    68
##                  
##                   Jamaica Japan Jordan Kazakhstan Kyrgyz Republic
##   Action                2   267      0          4               0
##   Adventure             0    72      1          0               0
##   Animation             0   180      0          0               0
##   Comedy                0   118      2          1               1
##   Crime                 0    42      0          1               0
##   Documentary           0    26      0          0               0
##   Drama                 1   378      1          4               4
##   Family                0     5      0          0               0
##   Fantasy               0    53      0          0               0
##   Foreign               0    23      0          0               0
##   History               0    14      0          0               0
##   Horror                0    92      0          0               0
##   Music                 1     7      0          0               0
##   Mystery               0    16      0          0               0
##   Romance               0    39      0          0               0
##   Science Fiction       0    50      0          0               0
##   Thriller              0    37      0          1               0
##   TV Movie              0     0      0          0               0
##   Unspecified           0    60      0          1               0
##   War                   0    11      0          0               0
##   Western               0     0      0          0               0
##                  
##                   Lao People s Democratic Republic Latvia Lebanon Liberia
##   Action                                         0      1       0       0
##   Adventure                                      0      0       1       0
##   Animation                                      0      0       0       0
##   Comedy                                         0      2       2       0
##   Crime                                          0      0       0       0
##   Documentary                                    0      5       1       1
##   Drama                                          1      9       1       1
##   Family                                         0      0       0       0
##   Fantasy                                        0      0       0       0
##   Foreign                                        0      0       0       0
##   History                                        0      0       0       0
##   Horror                                         0      0       0       0
##   Music                                          0      0       0       0
##   Mystery                                        0      0       0       0
##   Romance                                        0      1       1       0
##   Science Fiction                                0      0       0       0
##   Thriller                                       0      0       0       0
##   TV Movie                                       0      0       0       0
##   Unspecified                                    0      2       0       0
##   War                                            0      0       0       0
##   Western                                        0      0       0       0
##                  
##                   Libyan Arab Jamahiriya Liechtenstein Lithuania Luxembourg
##   Action                               1             0         3          4
##   Adventure                            2             0         0          1
##   Animation                            0             0         0          1
##   Comedy                               0             0         1          2
##   Crime                                0             0         1          1
##   Documentary                          0             0         0          0
##   Drama                                0             0         7          5
##   Family                               0             0         0          0
##   Fantasy                              0             0         0          1
##   Foreign                              0             0         0          0
##   History                              0             0         1          0
##   Horror                               0             1         1          3
##   Music                                0             0         0          0
##   Mystery                              0             0         0          2
##   Romance                              0             0         0          2
##   Science Fiction                      0             0         0          2
##   Thriller                             0             0         0          1
##   TV Movie                             0             0         0          0
##   Unspecified                          0             0         2          1
##   War                                  0             0         1          1
##   Western                              0             0         0          0
##                  
##                   Macedonia Malaysia Mali Malta Martinique Mauritania Mexico
##   Action                  1        1    0     1          0          0     18
##   Adventure               1        0    0     0          0          0     10
##   Animation               0        0    0     0          0          0      1
##   Comedy                  0        1    0     0          0          0     43
##   Crime                   0        0    0     0          0          0      9
##   Documentary             0        1    0     0          0          0     11
##   Drama                   4        1    1     1          1          3     91
##   Family                  0        0    0     0          0          0      1
##   Fantasy                 0        0    0     0          0          0      3
##   Foreign                 0        0    0     0          0          0      2
##   History                 0        0    0     0          0          0      2
##   Horror                  0        0    0     0          0          0     10
##   Music                   0        0    0     0          0          0      2
##   Mystery                 1        0    0     0          0          0      1
##   Romance                 0        0    0     0          0          0      8
##   Science Fiction         0        0    0     0          0          0      4
##   Thriller                0        2    0     0          0          0     10
##   TV Movie                0        0    0     0          0          0      1
##   Unspecified             0        0    0     0          0          0      4
##   War                     0        0    0     0          0          0      1
##   Western                 0        0    0     0          0          0      4
##                  
##                   Monaco Mongolia Montenegro Morocco Myanmar Namibia Nepal
##   Action               0        0          0       3       0       1     0
##   Adventure            0        0          0       0       0       0     1
##   Animation            0        0          0       0       0       0     0
##   Comedy               0        0          0       1       0       0     0
##   Crime                0        0          0       0       0       0     0
##   Documentary          0        2          0       1       0       0     0
##   Drama                0        0          1       6       0       0     1
##   Family               0        1          0       0       0       0     0
##   Fantasy              0        0          0       0       0       0     0
##   Foreign              0        0          0       0       0       0     0
##   History              0        0          0       1       0       0     0
##   Horror               1        0          0       0       0       0     0
##   Music                0        0          0       0       0       0     0
##   Mystery              0        0          0       0       0       0     0
##   Romance              0        0          0       0       0       0     0
##   Science Fiction      0        0          0       0       0       0     0
##   Thriller             0        0          0       0       0       0     0
##   TV Movie             0        0          0       0       0       0     0
##   Unspecified          0        0          0       2       1       0     0
##   War                  0        0          0       0       0       0     0
##   Western              0        0          0       0       0       0     0
##                  
##                   Netherlands New Zealand Nicaragua Nigeria North Korea Norway
##   Action                    7          17         0       0           0     13
##   Adventure                 9          10         0       0           0      7
##   Animation                 5           0         0       0           0      1
##   Comedy                   35          13         0       0           0     30
##   Crime                     4           0         0       0           0      5
##   Documentary              23           8         1       0           1      6
##   Drama                    87          22         0       2           0     46
##   Family                    5           0         0       0           0      3
##   Fantasy                   4           5         0       0           0      3
##   Foreign                   7           1         0       0           0      0
##   History                   4           0         0       0           0      0
##   Horror                    5           9         0       0           0      6
##   Music                     2           1         0       0           0      0
##   Mystery                   1           0         0       0           0      3
##   Romance                   7           4         0       1           0      3
##   Science Fiction           2           1         0       0           0      0
##   Thriller                  8           2         0       1           0      8
##   TV Movie                  0           0         0       0           0      0
##   Unspecified               8           0         0       0           0      4
##   War                       4           1         0       0           0      1
##   Western                   0           0         0       0           0      0
##                  
##                   Pakistan Palestinian Territory Panama Papua New Guinea
##   Action                 2                     0      0                0
##   Adventure              0                     0      0                0
##   Animation              0                     0      0                0
##   Comedy                 0                     0      0                1
##   Crime                  1                     0      0                0
##   Documentary            2                     1      1                0
##   Drama                  4                     4      1                0
##   Family                 2                     0      0                0
##   Fantasy                0                     0      0                0
##   Foreign                0                     0      0                0
##   History                1                     0      0                0
##   Horror                 1                     0      0                0
##   Music                  0                     0      0                0
##   Mystery                0                     0      0                0
##   Romance                0                     0      0                0
##   Science Fiction        0                     0      0                0
##   Thriller               0                     2      0                0
##   TV Movie               0                     0      0                0
##   Unspecified            1                     0      1                0
##   War                    0                     0      0                0
##   Western                0                     0      0                0
##                  
##                   Paraguay Peru Philippines Poland Portugal Puerto Rico Qatar
##   Action                 1    2          17     13        0           0     0
##   Adventure              0    0           1      3        1           0     0
##   Animation              0    0           0      5        1           0     0
##   Comedy                 0    2          12     53       12           0     0
##   Crime                  0    0           2      6        3           0     1
##   Documentary            0    1           0      9        5           1     1
##   Drama                  0    7          20    100       36           0     5
##   Family                 0    0           0      0        0           0     0
##   Fantasy                0    0           0      3        0           0     0
##   Foreign                0    0           2      0        0           0     0
##   History                0    0           0      2        2           0     0
##   Horror                 0    2          10      6        2           0     0
##   Music                  0    0           0      1        2           0     0
##   Mystery                0    1           0      4        2           0     0
##   Romance                0    0           2      4        2           2     0
##   Science Fiction        0    0           0      8        1           0     0
##   Thriller               0    0           1      9        0           0     1
##   TV Movie               0    0           0      0        0           0     0
##   Unspecified            0    0           2     13        4           1     0
##   War                    0    0           1      7        1           0     0
##   Western                0    0           0      0        0           0     0
##                  
##                   Romania Russia Rwanda Samoa Saudi Arabia Senegal Serbia
##   Action               10     59      0     0            0       0      4
##   Adventure             0     49      0     0            0       0      4
##   Animation             1     49      0     0            0       0      1
##   Comedy               16    179      0     0            0       2     20
##   Crime                 2     19      0     0            0       0      1
##   Documentary           2     26      0     0            0       0      2
##   Drama                29    227      2     1            1       8     20
##   Family                0     27      0     0            0       0      0
##   Fantasy               0     14      0     0            0       0      0
##   Foreign               0      0      0     0            0       0      0
##   History               3     11      0     0            0       0      0
##   Horror               11      7      0     0            0       0      2
##   Music                 0      2      0     0            0       0      1
##   Mystery               0     13      0     0            0       0      0
##   Romance               0     42      0     0            0       0      3
##   Science Fiction       1     10      0     0            0       0      0
##   Thriller              4     17      0     0            0       0      1
##   TV Movie              0      4      0     0            0       0      0
##   Unspecified           6     23      0     0            0       0      0
##   War                   2     22      0     0            0       0      6
##   Western               0      0      0     0            0       0      0
##                  
##                   Singapore Slovakia Slovenia South Africa South Korea
##   Action                  4        0        0           15          88
##   Adventure               1        0        0            4           7
##   Animation               1        1        0            1          10
##   Comedy                  2        0        2           13          38
##   Crime                   0        0        2            2          25
##   Documentary             1        2        2            3           5
##   Drama                   5        4       18           10         135
##   Family                  1        0        0            1           7
##   Fantasy                 0        0        0            0           7
##   Foreign                 0        0        0            0           6
##   History                 0        0        0            0           4
##   Horror                  0        0        0            1          32
##   Music                   0        0        2            1           1
##   Mystery                 1        0        2            0           9
##   Romance                 0        2        2            0          24
##   Science Fiction         0        0        0            4           4
##   Thriller                1        0        0            7          41
##   TV Movie                0        0        0            0           0
##   Unspecified             0        0        0            0          11
##   War                     0        2        0            2           3
##   Western                 0        0        0            1           0
##                  
##                   Soviet Union Spain Sri Lanka Sweden Switzerland
##   Action                     0    30         1     46           6
##   Adventure                  2    16         0     17           1
##   Animation                  0    11         0      5           3
##   Comedy                     1   123         0    163          17
##   Crime                      0    19         0     30           0
##   Documentary                3    25         1     45          18
##   Drama                      7   177         1    288          33
##   Family                     0     2         0     15           2
##   Fantasy                    0     8         0      7           0
##   Foreign                    0     4         0      0           0
##   History                    0     6         0      4           1
##   Horror                     0    64         0     22           3
##   Music                      0     1         0      6           3
##   Mystery                    0    15         0      8           0
##   Romance                    1    14         0     12           2
##   Science Fiction            1     5         0      2           0
##   Thriller                   0    39         0     36           4
##   TV Movie                   0     0         0      1           0
##   Unspecified                2    25         0     16           4
##   War                        0     5         0      5           2
##   Western                    0    12         0      0           0
##                  
##                   Syrian Arab Republic Taiwan Tajikistan Tanzania Thailand
##   Action                             0      9          0        0       21
##   Adventure                          0      0          0        0        3
##   Animation                          0      0          0        0        0
##   Comedy                             0     15          0        0       14
##   Crime                              0      4          0        0        1
##   Documentary                        0      1          0        1        2
##   Drama                              1     43          2        0       22
##   Family                             0      0          0        0        1
##   Fantasy                            0      0          0        0        2
##   Foreign                            0      1          0        0        1
##   History                            0      0          0        0        1
##   Horror                             0      1          0        0        8
##   Music                              0      1          0        0        1
##   Mystery                            0      1          0        0        1
##   Romance                            0      4          0        0        4
##   Science Fiction                    0      0          0        0        1
##   Thriller                           0      2          0        0        6
##   TV Movie                           0      0          0        0        0
##   Unspecified                        0      6          0        0        1
##   War                                0      0          0        0        0
##   Western                            0      0          0        0        0
##                  
##                   Trinidad and Tobago Tunisia Turkey Uganda Ukraine
##   Action                            0       0      7      1       2
##   Adventure                         1       0      5      0       1
##   Animation                         0       0      0      0       0
##   Comedy                            0       0     33      0       6
##   Crime                             0       0      4      0       0
##   Documentary                       0       0      1      1       1
##   Drama                             0       3     40      0      17
##   Family                            0       0      1      0       1
##   Fantasy                           0       0      2      0       0
##   Foreign                           0       0      0      0       0
##   History                           0       0      1      0       2
##   Horror                            0       0      5      0       0
##   Music                             0       0      2      0       0
##   Mystery                           0       0      3      0       0
##   Romance                           1       0      9      0       0
##   Science Fiction                   0       0      1      0       1
##   Thriller                          0       0      0      0       1
##   TV Movie                          0       0      0      0       0
##   Unspecified                       0       0     20      0       1
##   War                               0       0      0      0       0
##   Western                           0       0      0      0       0
##                  
##                   United Arab Emirates United Kingdom
##   Action                             1            227
##   Adventure                          0            129
##   Animation                          0             41
##   Comedy                             2            523
##   Crime                              1            154
##   Documentary                        2            228
##   Drama                              5            899
##   Family                             0             27
##   Fantasy                            0             56
##   Foreign                            1              3
##   History                            0             24
##   Horror                             1            250
##   Music                              0             41
##   Mystery                            0             48
##   Romance                            0             80
##   Science Fiction                    0             48
##   Thriller                           0            170
##   TV Movie                           0             31
##   Unspecified                        0             38
##   War                                0             50
##   Western                            0              6
##                  
##                   United States Minor Outlying Islands United States of America
##   Action                                             0                     2152
##   Adventure                                          0                      685
##   Animation                                          0                      474
##   Comedy                                             0                     4313
##   Crime                                              0                      852
##   Documentary                                        1                     1201
##   Drama                                              0                     4170
##   Family                                             0                      224
##   Fantasy                                            0                      294
##   Foreign                                            0                        2
##   History                                            0                       73
##   Horror                                             0                     1322
##   Music                                              0                      234
##   Mystery                                            0                      245
##   Romance                                            0                      410
##   Science Fiction                                    0                      306
##   Thriller                                           0                      644
##   TV Movie                                           0                      212
##   Unspecified                                        0                      213
##   War                                                0                      113
##   Western                                            0                      300
##                  
##                   Uruguay Uzbekistan Venezuela Vietnam Yugoslavia
##   Action                1          0         0       1          1
##   Adventure             0          1         0       0          0
##   Animation             0          1         0       0          0
##   Comedy                1          0         1       0          0
##   Crime                 0          0         1       1          0
##   Documentary           0          0         1       0          0
##   Drama                 6          2         6       6          2
##   Family                0          0         0       0          0
##   Fantasy               0          0         0       0          0
##   Foreign               0          0         1       0          0
##   History               0          0         0       0          0
##   Horror                0          0         1       0          0
##   Music                 0          0         0       0          0
##   Mystery               0          0         0       0          0
##   Romance               0          0         0       0          0
##   Science Fiction       0          0         0       0          0
##   Thriller              0          0         0       0          0
##   TV Movie              0          0         0       0          0
##   Unspecified           1          0         0       0          0
##   War                   0          0         0       0          1
##   Western               0          0         0       0          0

This cross-tabulation shows how many movies each country produces in each genre. The United States, the largest producer, makes more comedies than any other genre. Are those also the movies that earn the most revenue?
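As a rough illustration of how a table like the one above can be built and read, here is a minimal sketch on toy data. The column `country1` appears later in this report; `genre1` is a hypothetical column name, and the rows are made up.

```r
# Toy data (made up); `genre1` is a hypothetical column name
toy <- data.frame(
  country1 = c("United States of America", "United States of America",
               "United States of America", "Uruguay", "Uruguay", "Vietnam"),
  genre1   = c("Comedy", "Comedy", "Drama", "Drama", "Drama", "Drama"),
  stringsAsFactors = FALSE
)

# Genre-by-country contingency table (rows = genre, columns = country)
genre_by_country <- table(toy$genre1, toy$country1)

# Most frequent genre for each country (first one wins in case of ties)
top_genre <- apply(genre_by_country, 2, function(counts) {
  rownames(genre_by_country)[which.max(counts)]
})
print(top_genre)
```

Reading the most frequent genre per column this way avoids scanning the wide printed table by eye.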

Interactive Visualization with Esquisse

For the following charts and graphs I used the esquisse package to build more complex visualizations. The call to it is left commented out so that it does not launch every time I run the code.

graphs and charts

budget vs. revenue

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

The plot suggests that a higher budget tends to go with higher revenue. There may be a limit, though: past a certain budget, costs could become too high for the movie to turn a profit.
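One way to check the idea of a budget limit is to compare the revenue-to-budget ratio across budget bands. The sketch below uses made-up numbers, and the band cut-offs (10 million and 100 million) are assumptions chosen only for illustration.

```r
# Toy data (made up) using the report's column names
toy <- data.frame(
  budget_original  = c(1e6, 5e6, 2e7, 5e7, 1.5e8, 3e8),
  revenue_original = c(4e6, 1.5e7, 7e7, 1.6e8, 4e8, 6e8)
)

# Revenue earned per unit of budget
toy$roi <- toy$revenue_original / toy$budget_original

# Hypothetical budget bands: (0, 10M], (10M, 100M], (100M, Inf)
toy$budget_band <- cut(toy$budget_original,
                       breaks = c(0, 1e7, 1e8, Inf),
                       labels = c("low", "medium", "high"))

# Median ratio per band; a falling ratio would hint at diminishing returns
roi_by_band <- tapply(toy$roi, toy$budget_band, median)
print(roi_by_band)
```

If the ratio drops in the highest band on the real data, that would support the reading that very large budgets are harder to recover.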

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

genre

By far, adventure movies make the most revenue.

With this analysis we now know three things:

  • The United States is the biggest movie producer and mostly makes comedy movies.
  • Across all movies, drama is the most common genre.
  • Adventure is the most popular genre and also the one that makes the most revenue on average.

With that, we can conclude that an adventure movie is the most likely to be a success.

runtime

The histogram shows the distribution of movie runtimes across revenue ranges. A movie in the “very high” range made a lot of revenue. Most movies fall in the medium range, which is around 10,000,000.

## [1] "Mean runtime (filtered): 96.48"
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Calculate the mean runtime of filtered movies
mean_runtime_filtered <- mean(filtered_movies$runtime, na.rm = TRUE)
print(paste("Mean runtime (filtered):", round(mean_runtime_filtered, 2)))
## [1] "Mean runtime (filtered): 96.48"

From this runtime analysis, we can conclude that for a movie to be a success, in other words to have higher revenue, it should last between 96 and 150 minutes. A good sweet spot would be two hours.
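A simple way to sanity-check the 96 to 150 minute window is to compare mean revenue inside and outside it. This is a minimal sketch on made-up rows; `in_window` is a hypothetical helper column.

```r
# Toy data (made up) using the report's column names
toy <- data.frame(
  runtime          = c(80, 95, 100, 120, 130, 150, 180),
  revenue_original = c(2e6, 5e6, 3e7, 9e7, 6e7, 4e7, 1e7)
)

# Flag movies whose runtime falls in the 96-150 minute window
toy$in_window <- toy$runtime >= 96 & toy$runtime <= 150

# Mean revenue for movies outside ("FALSE") and inside ("TRUE") the window
mean_by_window <- tapply(toy$revenue_original, toy$in_window, mean)
print(mean_by_window)
```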

release date

This was to be expected, yet it supports the theory that movies released during people’s vacations are the most profitable. Movies have more success when released in summer (June and July) and also in winter (November and December). I would not say holidays themselves are the best time, but vacation periods certainly are.
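The monthly comparison can be sketched as follows. The column name `release_date` is an assumption (the report does not show it), and the rows are made up.

```r
# Toy data (made up); `release_date` is a hypothetical column name
toy <- data.frame(
  release_date = as.Date(c("2010-06-18", "2012-07-20", "2014-11-07",
                           "2011-02-11", "2013-09-13", "2015-12-18")),
  revenue_original = c(4e8, 6e8, 3e8, 5e7, 8e7, 5e8)
)

# Extract the month as an integer (1 = January, ..., 12 = December)
toy$release_month <- as.integer(format(toy$release_date, "%m"))

# Mean revenue per release month
mean_by_month <- tapply(toy$revenue_original, toy$release_month, mean)

# Month with the highest mean revenue
best_month <- as.integer(names(mean_by_month)[which.max(mean_by_month)])
print(mean_by_month)
print(best_month)
```

Using the mean per month rather than the total keeps months with many releases from dominating the comparison.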

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 84 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Failed to fit group 1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 84 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Failed to fit group 1.

We can see that in past years audiences were less engaged with movies, even though the quality of the films was still high. Nowadays people are far more interested.
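The smoothing warnings above say that x has too few unique values for the default 10 knots, so no trend line was drawn. A possible fix, sketched here on toy data, is to pass a smaller `k` in the smoother formula. The value `k = 5` is an assumption; it must not exceed the number of unique x values in the real data.

```r
library(ggplot2)

# Toy data (made up): x has only 6 unique values, fewer than the default 10 knots
set.seed(1)
toy <- data.frame(x = rep(1:6, each = 10))
toy$y <- sin(toy$x) + rnorm(nrow(toy), sd = 0.2)

# Same GAM smoother that ggplot2 reported, but with the basis size reduced to 5
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 5)) +
  theme_minimal()
print(p)
```

If the variable has very few distinct values, `method = "lm"` or plotting group means may be a better choice than a spline.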

companies

Here we can see that “Other” and “No company” are very high compared with the rest of the categories. That is because they group together companies that are either missing from the dataset or too small to measure individually. Together they add up to a large total, but they cannot be treated as a single company. Individually, Paramount Pictures is the most successful company.

NOTE:

This does not mean other production companies are unimportant; a collaboration between big and small production companies could result in a better match.

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).

In the analysis of production companies, it was found that movies made in collaboration between three companies tend to make the most revenue. From this we can determine that:

  • Collaboration is key in production
  • American companies make the highest-revenue movies

titles

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

## [1] "Correlation coefficient: NA"
## [1] "Mean title length: 16.33"

For movie titles, the analysis aimed to determine how many characters a title should have to increase the probability of the movie being a success. Titles of around 16 characters appear to be the best length. Note that 16.33 is the mean title length, and the correlation coefficient was returned as NA, so this figure describes the typical title rather than a measured effect on revenue.
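An NA correlation like the one printed above is what `cor()` returns by default when either variable contains missing values. The sketch below, on made-up titles and revenues, shows the default behaviour and one way around it; `title_length` is a hypothetical column name, and missing values are only an assumed cause for the NA in the report.

```r
# Toy data (made up titles and revenues), with missing values in both columns
toy <- data.frame(
  title = c("Up Hill", "The Long Road Home", "Night",
            "A Very Long Movie Title", NA),
  revenue_original = c(1e7, 5e7, NA, 2e7, 3e7),
  stringsAsFactors = FALSE
)

# Number of characters in each title (NA stays NA)
toy$title_length <- nchar(toy$title)

# Default: any NA in the inputs makes the result NA
r_default <- cor(toy$title_length, toy$revenue_original)

# Using only rows where both values are present
r_complete <- cor(toy$title_length, toy$revenue_original, use = "complete.obs")

print(paste("Correlation coefficient (default):", r_default))
print(paste("Correlation coefficient (complete rows):", round(r_complete, 2)))
print(paste("Mean title length:", round(mean(toy$title_length, na.rm = TRUE), 2)))
```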

Word clouds

## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

The first word cloud shows the words repeated most often across all movies. The second uses revenue-based weighting, so it shows the words repeated most in movies with higher revenue. Higher-revenue movies often mention “based on novel”, “woman director” and “saving the world”, as well as big cities such as Paris, New York and London, although a city could refer to the filming location, the setting or another factor.
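The difference between plain and revenue-weighted frequencies can be sketched in base R. The `keywords` column name and the `|` separator are assumptions about how the keywords are stored, and the rows are made up.

```r
# Toy data (made up); `keywords` and the "|" separator are assumptions
toy <- data.frame(
  keywords = c("based on novel|london",
               "saving the world|new york",
               "based on novel|paris"),
  revenue_original = c(3e8, 8e8, 1e8),
  stringsAsFactors = FALSE
)

# One row per (movie, keyword) pair
keyword_list <- strsplit(toy$keywords, "|", fixed = TRUE)
long <- data.frame(
  keyword = unlist(keyword_list),
  revenue = rep(toy$revenue_original, lengths(keyword_list)),
  stringsAsFactors = FALSE
)

# Plain frequency: how many movies mention each keyword
plain_freq <- sort(table(long$keyword), decreasing = TRUE)

# Revenue-weighted frequency: total revenue of movies mentioning each keyword
weighted_freq <- sort(tapply(long$revenue, long$keyword, sum), decreasing = TRUE)

print(plain_freq)
print(weighted_freq)
```

In this toy example “based on novel” is the most frequent keyword, but it is not the top keyword once revenue is used as the weight, which is why the two clouds can look different.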

Statistical measures

# Convert budget and revenue to numeric
movies$budget_original <- as.numeric(as.character(movies$budget_original))
movies$revenue_original <- as.numeric(as.character(movies$revenue_original))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         0         0         0   4194725         0 380000000
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 0.000e+00 0.000e+00 0.000e+00 1.114e+07 0.000e+00 2.788e+09         3
## Budget - Mode: 0 , Mean: 4194725 , Median: 0 , Standard Deviation: 17363925 , Minimum: 0 , Maximum: 3.8e+08
## Revenue - Mode: 0 , Mean: 11139954 , Median: 0 , Standard Deviation: 64127446 , Minimum: 0 , Maximum: 2787965087

We could have worked with the data as it was, but we decided to run a separate analysis comparing budget and revenue after excluding zero values. The data contain many zeros, and we want a clearer view of the cases that have complete information.
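The zero-exclusion step might look like the sketch below. The name `movies_filtered_both` is used later in this report, but how it was actually built is not shown, so this construction is an assumption; the rows are made up.

```r
# Toy data (made up) using the report's column names
toy <- data.frame(
  budget_original  = c(0, 1e6, 5e6, 0, 2e7, 3e7),
  revenue_original = c(0, 3e6, 0, 1e6, NA, 6e7)
)

# Keep only rows where both budget and revenue are positive;
# subset() also drops rows where the condition is NA
movies_filtered_both <- subset(toy, budget_original > 0 & revenue_original > 0)

print(nrow(movies_filtered_both))
print(summary(movies_filtered_both$budget_original))
```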

## Budget (Filtered) - Mean:  21565635 , Median:  8e+06 , Standard Deviation:  34286508 , Minimum:  1 , Maximum:  3.8e+08
## Revenue (Filtered) - Mean:  68829646 , Median:  16801877 , Standard Deviation:  146424469 , Minimum:  1 , Maximum:  2787965087

Statistical analysis

## # A tibble: 2 × 3
##   USA   mean_revenue median_revenue
##   <lgl>        <dbl>          <dbl>
## 1 FALSE    73431073.       18800000
## 2 TRUE    100046323.       37170057
##             Statistic             Value
## 1                Mean  90440189.5394444
## 2              Median          29911946
## 3                Mode           1.2e+07
## 4  Standard Deviation  166189478.816842
## 5            Variance 27618942869413484
## 6                 IQR          92990947
## 7                 Min                 1
## 8                 Max        2787965087
## 9                Diff        2787965086
## 10              Range   1 to 2787965087

Removing outliers
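The `is_outlier` flag used in the code that follows is not defined in this section. One common way to build such a flag is the 1.5 × IQR rule, sketched here on made-up revenues. This is an assumption about the method, not necessarily the rule used to produce the results in this report.

```r
# Toy data (made up): eight ordinary revenues and one extreme value
toy <- data.frame(revenue_original = c(1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6, 1e9))

# Quartiles and interquartile range
q1  <- quantile(toy$revenue_original, 0.25, na.rm = TRUE)[[1]]
q3  <- quantile(toy$revenue_original, 0.75, na.rm = TRUE)[[1]]
iqr <- q3 - q1

# Flag values more than 1.5 * IQR below Q1 or above Q3
toy$is_outlier <- toy$revenue_original < q1 - 1.5 * iqr |
  toy$revenue_original > q3 + 1.5 * iqr

print(toy[toy$is_outlier, ])
```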

# Filter out outliers
movies_filtered_no_outliers <- movies_filtered_both %>%
  filter(!is_outlier)

# Plot density distributions of log-transformed revenue without outliers
ggplot(movies_filtered_no_outliers, aes(x = log(revenue_original), fill = factor(USA))) +
  geom_density(alpha = 0.3) +
  labs(title = "Density Distribution of Log-transformed Revenue (Excluding Outliers)",
       x = "Log Revenue",
       y = "Density",
       fill = "United States of America") +
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "blue")) +
  theme_minimal()

# Calculating correlations between numerical variables
correlation_matrix <- cor(movies_filtered_no_outliers[, c("budget_original", "revenue_original", "popularity_max", "vote_average", "vote_count")])

# Scatterplot of budget vs revenue
ggplot(data = movies_filtered_no_outliers, aes(x = budget_original, y = revenue_original)) +
  geom_point(alpha = 0.5) +  
  labs(title = "Budget vs Revenue",
       x = "Budget",
       y = "Revenue") +
  theme_minimal()

# Analysis of the relationship between budget and revenue
budget_revenue_lm <- lm(revenue_original ~ budget_original, data = movies_filtered_no_outliers)
summary(budget_revenue_lm)
## 
## Call:
## lm(formula = revenue_original ~ budget_original, data = movies_filtered_no_outliers)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -677974787  -40715382   -5447567   15288916 2075081200 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3.130e+06  2.044e+06  -1.531    0.126    
## budget_original  3.021e+00  3.951e-02  76.469   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115500000 on 5197 degrees of freedom
## Multiple R-squared:  0.5294, Adjusted R-squared:  0.5294 
## F-statistic:  5847 on 1 and 5197 DF,  p-value: < 2.2e-16
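To make the regression output above concrete, the fitted coefficients can be plugged in by hand. This is only a sketch using the rounded estimates printed in the summary; `predict_revenue` is a hypothetical helper name. With an R-squared of about 0.53 and a residual standard error of roughly 115 million, individual movies can land far from the fitted line.

```r
# Rounded coefficient estimates copied from the summary output
intercept <- -3.130e6
slope     <- 3.021

# Hypothetical helper: fitted revenue for a given budget
predict_revenue <- function(budget) {
  intercept + slope * budget
}

# Fitted revenue for a 50 million budget (an arbitrary example value)
fitted_50m <- predict_revenue(50e6)
print(fitted_50m)
```

On the fitted model object itself, `predict(budget_revenue_lm, newdata = data.frame(budget_original = 50e6))` would give the same kind of estimate without rounding.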
# Get the top 15 countries by revenue
top_countries <- movies_filtered_no_outliers %>%
  group_by(country1) %>%
  summarize(total_revenue = sum(revenue_original, na.rm = TRUE)) %>%
  top_n(15, total_revenue) %>%
  arrange(desc(total_revenue)) %>%
  pull(country1)

# Filter data for the top 15 countries
movies_filtered_top_countries <- movies_filtered_no_outliers %>%
  filter(country1 %in% top_countries)

# Boxplot of revenue by country for top 15 countries
ggplot(data = movies_filtered_top_countries, aes(x = country1, y = revenue_original, fill = country1)) +
  geom_boxplot() +
  labs(title = "Revenue by Country of Origin (Top 15 Countries)",
       x = "Country",
       y = "Revenue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

shape

# Summary Statistics
summary_stats <- function(data, col) {
  cat("\n--- Summary Statistics for", col, "---\n")
  mean_val <- mean(data[[col]], na.rm = TRUE)
  median_val <- median(data[[col]], na.rm = TRUE)
  mode_val <- get_mode(data[[col]])
  sd_val <- sd(data[[col]], na.rm = TRUE)
  range_val <- range(data[[col]], na.rm = TRUE)
  iqr_val <- IQR(data[[col]], na.rm = TRUE)
  skewness_val <- skewness(data[[col]], na.rm = TRUE)
  kurtosis_val <- kurtosis(data[[col]], na.rm = TRUE)

  cat("Mean:", mean_val, "\nMedian:", median_val, "\nMode:", mode_val, 
      "\nStandard Deviation:", sd_val, "\nVariance:", sd_val^2,
      "\nRange: [", range_val[1], ",", range_val[2], "]",
      "\nInterquartile Range:", iqr_val,
      "\nSkewness:", skewness_val, "\nKurtosis:", kurtosis_val, "\n")
}

# Run the summary statistics function for 'budget_original' and 'revenue_original'
summary_stats(movies, "budget_original")
## 
## --- Summary Statistics for budget_original ---
## Mean: 4194725 
## Median: 0 
## Mode: 0 
## Standard Deviation: 17363925 
## Variance: 3.015059e+14 
## Range: [ 0 , 3.8e+08 ] 
## Interquartile Range: 0 
## Skewness: 7.146398 
## Kurtosis: 67.14456
summary_stats(movies, "revenue_original")
## 
## --- Summary Statistics for revenue_original ---
## Mean: 11139954 
## Median: 0 
## Mode: 0 
## Standard Deviation: 64127446 
## Variance: 4.112329e+15 
## Range: [ 0 , 2787965087 ] 
## Interquartile Range: 0 
## Skewness: 12.28317 
## Kurtosis: 238.2217
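The function above calls `get_mode()`, which is not defined in this section, and relies on a `skewness()` function from a package that is not shown. The sketch below gives one possible `get_mode()` and a moment-based skewness written in base R, both as assumptions, and uses made-up values to show how a log transform reduces the strong right skew seen in the output.

```r
# Hypothetical helper: most frequent value (first one wins in case of ties)
get_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Moment-based skewness: third central moment over variance^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy data (made up): many zeros and one very large value
toy <- c(0, 0, 0, 0, 1e6, 5e6, 2e8)

mode_val <- get_mode(toy)
skew_raw <- sample_skewness(toy)
skew_log <- sample_skewness(log1p(toy))  # log1p handles the zeros

cat("Mode:", mode_val,
    "\nSkewness (raw):", skew_raw,
    "\nSkewness (log1p):", skew_log, "\n")
```

A positive skewness far from zero, as in the budget and revenue output, means a long right tail: a few very large values pull the mean well above the median.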

Results/findings

Almost all findings are based on revenue, as I chose it as my dependent variable. Based on that, we can find many interesting and useful insights.

Recipe for a successful movie:

  • Primary language: English
  • Main theme/genre: Adventure
  • Sub-genre: Action/Fantasy/Science Fiction/Family
  • Duration of the movie: Around 2 hours long
  • Release time: Summer (June & July)
  • Companies: Collaborate with other companies (2 other companies). The best companies to work with are Paramount, Universal and Disney. Collaboration with smaller production companies would also be of great benefit, as long as they are American.
  • Title: The movie title should be around 16 characters long.
  • Plot: The movie should be “based on a novel” or about “saving the world”.
  • Setting: The best places for the movie setting would be England, London, Paris or New York.
  • Director: When looking at keywords in the dataset, one of the most repeated among high-revenue movies is “woman director”. However, based on an internet search of the highest-grossing movies, most directors of those movies are men.

Here are some visualizations used to confirm this movie recipe.

## 
##    xx 104.0  68.0  82.0    ab    af    am    ar    ay    bg    bm    bn    bo 
##    45     0     0     0    10     2     2    39     1    10     3    29     2 
##    bs    ca    cn    cs    cy    da    de    el    en    eo    es    et    eu 
##    14    12   313   135     1   241  1081   113 32365     1   995    24     3 
##    fa    fi    fr    fy    gl    he    hi    hr    hu    hy    id    is    it 
##   100   308  2438     1     1    67   508    30   102     1    20    24  1532 
##    iu    ja    jv    ka    kk    kn    ko    ku    ky    la    lb    lo    lt 
##     2  1347     1    18     3     3   444     3     3     1     1     2     9 
##    lv    mk    ml    mn    mr    ms    mt    nb    ne    nl    no    pa    pl 
##    18     5    36     2    25     5     1     6     2   248   119     2   218 
##    ps    pt    qu    ro    ru    rw    sh    si    sk    sl    sm    sq    sr 
##     2   316     1    57   826     1     5     1    18    33     1     5    63 
##    sv    ta    te    tg    th    tl    tr    uk    ur    uz    vi    wo    zh 
##   724    78    45     1    75    23   150    16     8     1    10     5   409 
##    zu 
##     1

## [1] "Mean runtime (filtered): 96.48"
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

## [1] "Correlation coefficient: NA"

Conclusions

In conclusion, the analysis of revenue data in the film industry reveals a refined understanding of the factors influencing movie success. English-language films, particularly within the adventure genre, dominate the box office, suggesting a clear preference among audiences. Strategic elements such as runtime, release timing, and collaborative partnerships with renowned production companies significantly impact revenue outcomes. Moreover, the presence of female directors emerges as a potentially lucrative path for enhancing film profitability. These findings reinforce the importance of strategic decision-making and audience-centric content creation in driving revenue growth in the film industry. By leveraging these insights, stakeholders can navigate market dynamics more effectively, ultimately fostering sustained success and innovation in movie production.

Business Recommendations:

  1. Strategic Partnerships: Establish partnerships with leading production companies like Paramount, Universal, and Disney to leverage their expertise and resources. Additionally, explore collaborations with emerging American production houses to diversify content offerings.

  2. Content Development: Focus on producing English-language adventure films with compelling storylines centered on themes like action, fantasy, science fiction, or family-oriented narratives. Consider adapting popular novels to capitalize on existing fan bases.

  3. Release Strategy: Plan movie releases during the summer season, particularly in June and July, to maximize box office performance. Utilize data-driven insights to identify optimal release dates and avoid clashes with major blockbuster releases.

  4. Directorial Diversity: Encourage diversity in directorial roles by actively seeking opportunities to collaborate with talented female directors. Embrace inclusivity and promote gender diversity in creative decision-making processes.

  5. Market Expansion: Explore opportunities to expand into international markets while maintaining a focus on English-speaking audiences. Tailor marketing strategies and localization efforts to resonate with diverse cultural preferences and sensibilities.

Implementing these recommendations can enhance the overall success and profitability of movie productions, driving sustained growth in the dynamic entertainment industry landscape.

References

In the data cleaning process, I looked up various movies that were possibly silent and found that they in fact were, for example Blacksmith Scene and Le manoir du diable. I also looked up movies with a runtime of zero to verify whether that was correct and to decide whether I had to impute those values. Examples of these movies are Torno a vivere da solo, The Black Waters of Echo’s Pond and Star Force: Fugitive Alien II. Lastly, I searched for the most profitable movies to see whether the director was a man or a woman. Box Office Mojo is a website that lists the highest-grossing movies and displays their information.

