Abstract
In a landscape shaped by the explosion of digital content and
shifting audience preferences, this report analyzes
the Movies dataset from Kaggle
to identify the factors behind cinematic success. Using careful data
cleaning and statistical techniques, we identify the
ingredients that characterize the modern blockbuster. These insights guide
strategic decisions for our production company as audience tastes
change, helping our films resonate with viewers
and perform well at the box office.
The film industry stands as a dynamic and ever-evolving landscape, characterized by its ability to captivate global audiences and shape cultural narratives. Within this realm of creativity and commerce, understanding the intricacies of what makes a movie successful is paramount for filmmakers, producers, and industry stakeholders alike. This report digs into the world of film analysis, aiming to uncover the underlying factors driving box office success.
Our investigation starts from the increasing importance of data-driven decision-making in an industry traditionally driven by intuition and creativity. In today’s competitive marketplace, filmmakers and production companies face mounting pressure to deliver commercially successful films while balancing artistic integrity and audience preferences. Against this backdrop, our research seeks to identify the key determinants of movie revenue, providing insights for the industry.
With a focus on revenue as the primary metric of success, our analysis spans various dimensions, including genre preferences, production budgets, release timing, and geographical considerations. By dissecting these factors, we aim to uncover patterns and trends that offer valuable guidance for production company executives seeking to optimize their strategies and maximize returns on investment.
Through a systematic examination of revenue data and industry trends, this report strives to empower stakeholders with actionable intelligence, fostering informed decision-making and strategic innovation in the realm of film production and distribution. By explaining the underlying drivers of box office success, we aim to contribute to the ongoing dialogue surrounding the art and business of filmmaking, ultimately shaping a more prosperous and vibrant future for our production company.
The analysis in this report draws upon the Movies
dataset obtained from Kaggle,
encompassing various attributes of movies such as budgets, revenues,
genres, release dates, production countries, and production companies.
This dataset offers a comprehensive view of the global film industry,
spanning diverse genres, languages, and production contexts. Prior to
analysis, rigorous preprocessing and cleaning were conducted to ensure
data integrity and reliability. This involved addressing missing values
through imputation or exclusion, removing duplicates to prevent
redundancy, and standardizing data formats for consistency. The goal of
these cleaning procedures was to enhance the dataset’s quality and
usability, providing a solid foundation for robust analysis of film
revenue trends and patterns.
Observations:
* adult column data type could be logical
* belongs_to_collection column must be cleaned for better understanding of data
* budget column data type should be integer or numeric
* genres column must be cleaned for better understanding of data
* original_language column could be factor
* popularity column must be numeric
* production_companies column must be cleaned to show the relevant information
* production_countries column must be cleaned to avoid redundant and irrelevant information
* release_date column data type should be Date
* spoken_languages column must be cleaned to avoid redundant and irrelevant information
* status column data type could be factor
* video column is not necessary for analysis purposes
Variables to be converted:
* adult (logical)
* budget (numeric)
* original_language (factor)
* popularity (numeric)
* release_date (Date)
* status (factor)
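The conversions listed above could be sketched as follows. This is only an illustration in base R: `movies_raw` and `movies_typed` are hypothetical names, the values are made up, and it assumes the columns arrive as character strings with the adult flag stored as "True"/"False" text.

```r
# Toy stand-in for the raw columns, all read in as character (values are made up).
movies_raw <- data.frame(
  adult = c("False", "True", "False"),
  budget = c("30000000", "0", "65000000"),
  original_language = c("en", "fr", "en"),
  popularity = c("21.946943", "3.859495", "17.015539"),
  release_date = c("1995-10-30", "1995-12-15", "1995-12-22"),
  status = c("Released", "Released", "Rumored"),
  stringsAsFactors = FALSE
)

# Convert each column to the type named in the list above.
movies_typed <- within(movies_raw, {
  adult <- as.logical(adult)                 # "True"/"False" -> TRUE/FALSE, anything else -> NA
  budget <- as.numeric(budget)
  original_language <- factor(original_language)
  popularity <- as.numeric(popularity)
  release_date <- as.Date(release_date, format = "%Y-%m-%d")
  status <- factor(status)
})

str(movies_typed)
```

Values that cannot be parsed become NA, so it is worth counting NA values per column before and after converting.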
Variables that must be cleaned; each part of the variable should be
separated and the tags erased.
* belongs_to_collection: Contains
“id”,“name”,“poster_path”,“backdrop_path”
* genres: Contains “id”,“name”
* production_companies: Contains “name”,“id”
* production_countries: Contains “abbreviated_name”,“name”
* spoken_languages: Contains “abbreviated_name”,“name”
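As a point of comparison with the split-and-strip approach used in this report, the sketch below pulls the "name" values out of such a tagged string with one regular expression. It is not the method used here. It assumes the raw values look like `[{'id': 16, 'name': 'Animation'}]`, and `extract_names` and `genres_raw` are hypothetical names introduced only for illustration.

```r
# Return the values stored under the 'name' key of each tagged string.
extract_names <- function(x) {
  # Find every fragment of the form 'name': '...'
  hits <- regmatches(x, gregexpr("'name': '[^']*'", x))
  # Keep only the text between the last pair of single quotes
  lapply(hits, function(h) sub("^'name': '([^']*)'$", "\\1", h))
}

# Made-up examples in the assumed raw format
genres_raw <- c(
  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
  "[{'id': 18, 'name': 'Drama'}]",
  "[]"
)

extract_names(genres_raw)
```

The pattern does not handle names that themselves contain a single quote, so it would need extending before being applied to the full dataset.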
# Divide string in columns delimiting by ":"
collection <- str_split_fixed(movies$belongs_to_collection, ":", n = Inf)
## V1 V2 V3 V4
## Length:45466 Length:45466 Length:45466 Length:45466
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
# Eliminate punctuation signs except "." and "/"
collection <- collection %>%
  mutate(id_collection = str_replace_all(V1, "[[:punct:]&&[^./]]", " "),
         name_collection = str_replace_all(V2, "[[:punct:]&&[^./]]", " "),
         poster_path_collection = str_replace_all(V3, "[[:punct:]&&[^./]]", " "),
         backdrop_path_collection = str_replace_all(V4, "[[:punct:]&&[^./]]", " "))
# Remove specific words from data frame
collection$id_collection <- str_remove(collection$id_collection,"name")
collection$name_collection <- str_remove(collection$name_collection,"poster path")
collection$poster_path_collection <- str_remove(collection$poster_path_collection,"backdrop path")
## id_collection name_collection poster_path_collection
## Length:45466 Length:45466 Length:45466
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## backdrop_path_collection
## Length:45466
## Class :character
## Mode :character
# Remove whitespace
collection$id_collection <- str_trim(collection$id_collection, "right")
collection$name_collection <- str_trim(collection$name_collection, "right")
collection$poster_path_collection <- str_trim(collection$poster_path_collection, "right")
collection$backdrop_path_collection <- str_trim(collection$backdrop_path_collection, "right")
## id_collection name_collection poster_path_collection
## Length:45466 Length:45466 Length:45466
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## backdrop_path_collection
## Length:45466
## Class :character
## Mode :character
# Replace punctuation with spaces in the genre columns
new_genres <- new_genres %>%
  mutate(genre1 = str_replace_all(V1, "[[:punct:]]", " "),
         genre2 = str_replace_all(V2, "[[:punct:]]", " "),
         genre3 = str_replace_all(V3, "[[:punct:]]", " "))
# Remove the leftover "id" key
new_genres$genre1 <- str_remove(new_genres$genre1,"id")
new_genres$genre2 <- str_remove(new_genres$genre2,"id")
new_genres$genre3 <- str_remove(new_genres$genre3,"id")
# Trim leading and trailing spaces in genre columns
new_genres <- new_genres %>%
  mutate(genre1 = str_trim(genre1),
         genre2 = str_trim(genre2),
         genre3 = str_trim(genre3))
# Convert genre columns to factors
new_genres$genre1 <- as.factor(new_genres$genre1)
new_genres$genre2 <- as.factor(new_genres$genre2)
new_genres$genre3 <- as.factor(new_genres$genre3)
## genre1 genre2 genre3
## Drama :11966 :17001 :31481
## Comedy : 8820 Drama : 6308 Thriller : 2235
## Action : 4489 Comedy : 3265 Romance : 2045
## Documentary: 3415 Romance : 2859 Drama : 1677
## Horror : 2619 Thriller: 2523 Comedy : 911
## : 2442 Action : 1546 Science Fiction: 873
## (Other) :11715 (Other) :11964 (Other) : 6244
# Replace punctuation with spaces in the country columns
new_production_countries <- new_production_countries %>%
  mutate(country1 = str_replace_all(V1, "[[:punct:]]", " "),
         country2 = str_replace_all(V2, "[[:punct:]]", " "),
         country3 = str_replace_all(V3, "[[:punct:]]", " "))
# Remove the leftover "iso 3166 1" key
new_production_countries$country1 <- str_remove(new_production_countries$country1,"iso 3166 1")
new_production_countries$country2 <- str_remove(new_production_countries$country2,"iso 3166 1")
new_production_countries$country3 <- str_remove(new_production_countries$country3,"iso 3166 1")
# Trim leading and trailing spaces in country columns
new_production_countries <- new_production_countries %>%
  mutate(country1 = str_trim(country1),
         country2 = str_trim(country2),
         country3 = str_trim(country3))
# Convert country columns to factors
new_production_countries$country1 <- as.factor(new_production_countries$country1)
new_production_countries$country2 <- as.factor(new_production_countries$country2)
new_production_countries$country3 <- as.factor(new_production_countries$country3)
## country1 country2
## United States of America:18425 :38439
## : 6288 United States of America: 2131
## United Kingdom : 3070 France : 917
## France : 2705 United Kingdom : 659
## Canada : 1498 Germany : 528
## Japan : 1493 Italy : 482
## (Other) :11987 (Other) : 2310
## country3
## :43314
## United States of America: 410
## France : 247
## Germany : 232
## United Kingdom : 231
## Italy : 153
## (Other) : 879
# Replace punctuation with spaces in the language columns
new_spoken_languages <- new_spoken_languages %>%
  mutate(country1_language = str_replace_all(V1, "[[:punct:]]", " "),
         country2_language = str_replace_all(V2, "[[:punct:]]", " "),
         country3_language = str_replace_all(V3, "[[:punct:]]", " "))
# Remove the leftover "iso 639 1" key
new_spoken_languages$country1_language <- str_remove(new_spoken_languages$country1_language,"iso 639 1")
new_spoken_languages$country2_language <- str_remove(new_spoken_languages$country2_language,"iso 639 1")
new_spoken_languages$country3_language <- str_remove(new_spoken_languages$country3_language,"iso 639 1")
# Trim leading and trailing spaces in language columns
new_spoken_languages <- new_spoken_languages %>%
  mutate(country1_language = str_trim(country1_language),
         country2_language = str_trim(country2_language),
         country3_language = str_trim(country3_language))
# Convert language columns to factors
new_spoken_languages$country1_language <- as.factor(new_spoken_languages$country1_language)
new_spoken_languages$country2_language <- as.factor(new_spoken_languages$country2_language)
new_spoken_languages$country3_language <- as.factor(new_spoken_languages$country3_language)
## country1_language country2_language country3_language
## English :26840 :37707 :43018
## : 4062 English : 1593 Deutsch : 328
## Français: 2428 Français: 1477 Español : 308
## Italiano: 1411 Deutsch : 919 Français: 234
## 日本語 : 1388 Español : 782 English : 232
## Deutsch : 1301 Italiano: 616 Italiano: 225
## (Other) : 8036 (Other) : 2372 (Other) : 1121
# Replace punctuation with spaces in the company columns
new_production_companies <- new_production_companies %>%
  mutate(company1 = str_replace_all(V1, "[[:punct:]]", " "),
         company2 = str_replace_all(V2, "[[:punct:]]", " "),
         company3 = str_replace_all(V3, "[[:punct:]]", " "))
# Remove the leftover "id" key (str_remove drops only the first literal "id" it finds,
# so a company name that itself contains "id" is altered instead of the key)
new_production_companies$company1 <- str_remove(new_production_companies$company1,"id")
new_production_companies$company2 <- str_remove(new_production_companies$company2,"id")
new_production_companies$company3 <- str_remove(new_production_companies$company3,"id")
# Trim leading and trailing spaces in company columns
new_production_companies <- new_production_companies %>%
  mutate(company1 = str_trim(company1),
         company2 = str_trim(company2),
         company3 = str_trim(company3))
# Convert company columns to factors
new_production_companies$company1 <- as.factor(new_production_companies$company1)
new_production_companies$company2 <- as.factor(new_production_companies$company2)
new_production_companies$company3 <- as.factor(new_production_companies$company3)
## company1
## :11881
## Paramount Pictures : 998
## Metro Goldwyn Mayer MGM : 852
## Twentieth Century Fox Film Corporation: 780
## Warner Bros : 757
## Universal Pictures : 754
## (Other) :29444
## company2 company3
## :28458 :36419
## Warner Bros : 270 Warner Bros : 130
## Metro Goldwyn Mayer MGM: 150 Canal+ : 109
## Canal+ : 124 Metro Goldwyn Mayer MGM: 44
## Touchstone Pictures : 75 Relativity Media : 42
## Universal Pictures : 71 TF1 Films Production : 29
## (Other) :16318 (Other) : 8693
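Removing the literal text "id" can also hit names that contain those letters. A value such as "Fa cinematografica id" appears further down in the duplicated-rows listing, which is consistent with this effect. One possible safeguard, sketched below in base R, is to match the key only as a whole word. `strip_key` is a hypothetical helper, and the input strings are guesses at what the intermediate values look like.

```r
# Remove the first standalone occurrence of a key word and tidy the spacing.
strip_key <- function(x, key) {
  # \\b anchors the match to word boundaries, so "id" inside "Fida" is left alone
  x <- sub(paste0("\\b", key, "\\b"), "", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", x))
}

strip_key("Fida cinematografica   id", "id")
strip_key("Paramount Pictures id", "id")
```

The same helper could be applied with "iso 3166 1" or "iso 639 1" as the key for the country and language columns.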
Variables that need to be checked:
* budget: Does not have a range, but could be important to detect outliers for further analysis
* popularity: Has a range from 0 to 100
* runtime: Does not have a range, but could be important to detect outliers for further analysis
* vote_average: Has a range from 0 to 10
* vote_count: Does not have a range, but could be important to detect outliers for further analysis
* revenue: too many movies with 0 revenue, could be better to impute or remove those values
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Sort and obtain highest 10 and lowest 10 rows by budget
sorted_budget <- sort(movies$budget)
tail(sorted_budget,10) %>% format(scientific = FALSE)
## [1] "250000000" "255000000" "258000000" "260000000" "260000000" "260000000"
## [7] "270000000" "280000000" "300000000" "380000000"
## [1] 0 0 0 0 0 0 0 0 0 0
## budget == 0 n
## 1 FALSE 8890
## 2 TRUE 36573
## 3 NA 3
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 0 4224579 0 380000000 3
It is highly unlikely that a movie costs 0 dollars to
produce. Given the large number of movies with this budget (36,573 of
45,466), the zeros most likely mean that the budget information was not available. An
imputation method must be applied in this case.
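A tally like the one above, and the step of treating zeros as "not reported" before imputing, could look roughly like this. The vector below is made up, and `zero_tally` and `budget_na` are hypothetical names; the chunk that produced the report's own table is not shown here.

```r
# Made-up budget values: zeros stand for "not reported", NA for already-missing
budget <- c(0, 0, 30000000, NA, 0, 65000000)

# Count zero, non-zero and missing entries
zero_tally <- table(is_zero = budget == 0, useNA = "ifany")
print(zero_tally)

# Mark zeros as missing so that summaries and imputation ignore them
budget_na <- replace(budget, which(budget == 0), NA)
summary(budget_na)
```

With the zeros marked as NA, the summary statistics describe only the movies whose budget was actually reported.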
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Sort and obtain highest 10 and lowest 10 rows by revenue
sorted_revenue <- sort(movies$revenue)
tail(sorted_revenue,10) %>% format(scientific = FALSE)
## [1] "1262886337" "1274219009" "1342000000" "1405403694" "1506249360"
## [6] "1513528810" "1519557910" "1845034188" "2068223624" "2787965087"
## [1] 0 0 0 0 0 0 0 0 0 0
## revenue == 0 n
## 1 FALSE 7408
## 2 TRUE 38052
## 3 NA 6
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000e+00 0.000e+00 0.000e+00 1.121e+07 0.000e+00 2.788e+09 6
Because an imputation will be done for budget, the same
has to be done to revenue to balance out the data and
reduce its volatility.
## Warning: Removed 22 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
# Sort and obtain highest 10 and lowest 10 rows by popularity
sorted_popularity <- sort(movies$popularity)
tail(sorted_popularity,10)
## [1] 154.8010 183.8704 185.0709 185.3310 187.8605 213.8499 228.0327 287.2537
## [9] 294.3370 547.4883
## [1] 0 0 0 0 0 0 0 0 0 0
## popularity == 0 n
## 1 FALSE 45394
## 2 TRUE 66
## 3 NA 6
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.3859 1.1277 2.9215 3.6789 547.4883 6
Some movies exceed the 100-point limit, so the variable must be imputed for a better analysis of the data.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 263 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Sort and obtain highest 10 and lowest 10 rows by runtime
sorted_runtime <- sort(movies$runtime)
tail(sorted_runtime,10)
## [1] 840 840 874 877 900 925 931 1140 1140 1256
## [1] 0 0 0 0 0 0 0 0 0 0
## runtime == 0 n
## 1 FALSE 43645
## 2 TRUE 1558
## 3 NA 263
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 85.00 95.00 94.13 107.00 1256.00 263
Some movies have a runtime of 0 minutes, which is not possible, so an imputation must be done to fix it.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Sort and obtain highest 10 and lowest 10 rows by vote_average
sorted_vote_average <- sort(movies$vote_average)
tail(sorted_vote_average,10)
## [1] 10 10 10 10 10 10 10 10 10 10
## [1] 0 0 0 0 0 0 0 0 0 0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 5.000 6.000 5.618 6.800 10.000 6
Vote averages are in order; there is no need to apply an imputation method.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Sort and obtain highest 10 and lowest 10 rows by vote_count
sorted_vote_count <- sort(movies$vote_count)
tail(sorted_vote_count,10)
## [1] 9634 9678 10014 10297 11187 11444 12000 12114 12269 14075
## [1] 0 0 0 0 0 0 0 0 0 0
## vote_count == 0 n
## 1 FALSE 42561
## 2 TRUE 2899
## 3 NA 6
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 3.0 10.0 109.9 34.0 14075.0 6
Vote counts are in order; there is no need to apply imputation methods to this variable.
Only 4 variables must be imputed, based on the earlier search for out-of-range values:
* budget: Zero values are going to be replaced with the mean to avoid the multiple zeroes from distorting statistical descriptors.
* revenue: The same case as in budget.
* popularity: Out-of-range values are going to be replaced with the range limit of 100 to avoid eliminating them from the dataset.
* runtime: Zero values are going to be replaced with the mean to avoid the multiple zeroes from distorting statistical descriptors.
It seems that very little data was left as NA in the database after the cleaning process done during the deliverable of the progress setup; however, there is some missing data in revenue, runtime and votes which could be addressed with mice.
For imputation, the intended approach was the MICE package, applied to
the variables detected before, which are revenue,
runtime, vote_count and
vote_average.
I encountered technical difficulties while attempting to use the MICE library for multiple imputation, and attempts to resolve them on alternative platforms such as Posit and Google Colab were unsuccessful. Multiple imputation could therefore not be performed, and missing data were handled with alternative approaches instead. These methods may introduce additional uncertainty and potential bias into the analysis, which affects the validity of the results.
# Specifies the characteristics of the imputation
#movies_mice <- mice(movies2,m=1,maxit=50,meth='pmm',seed=500)
# Summarizes the imputation characteristics defined before
#summary(movies_mice)
# Allows to see the result MICE assigned to missing values
#movies_mice$imp$runtime
# Fill the dataset with results from first option
# movies_clean <- complete(movies_mice,1)
# Serves as a way to check if imputation was done successfully
#sum(is.na(movies_clean))
# Generates a density plot of observed vs. imputed values (takes the mice object)
#densityplot(movies_mice)
A density plot can help to determine the effectiveness of the imputation in the dataset. Imputation may be less precise once it gets past certain values, so it is important to check and use other methods if necessary.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 0 4224579 0 380000000 3
The average budget is 4,224,579
# Replace zeroes with the budget mean
movies$budget_original <- movies$budget # Create a copy of the original column
movies$budget <- ifelse(movies$budget == 0, mean(movies$budget, na.rm = TRUE), movies$budget)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 4224579 4224579 7623068 4224579 380000000 3
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000e+00 0.000e+00 0.000e+00 1.121e+07 0.000e+00 2.788e+09 6
The average revenue is approximately 11,210,000 (11,209,349 before rounding)
# Replace zeroes with the revenue mean
movies$revenue_original <- movies$revenue # Create a copy of the original column
movies$revenue <- ifelse(movies$revenue == 0, mean(movies$revenue, na.rm = TRUE), movies$revenue)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000e+00 1.121e+07 1.121e+07 2.059e+07 1.121e+07 2.788e+09 6
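The zero-replacement used for budget and revenue can be wrapped in a small helper, sketched below in base R. By default it mirrors the code above, where the mean is taken over the column with the zeros still included. The `exclude_zeros` option is an added assumption, not something this report does, and shows how the fill value changes when the zeros are left out of the mean. `replace_zero_with_mean` and the sample vector are hypothetical.

```r
# Replace zeros in x with the column mean; NA entries are left untouched.
replace_zero_with_mean <- function(x, exclude_zeros = FALSE) {
  # Choose which values the mean is computed from
  reference <- if (exclude_zeros) x[x != 0] else x
  fill <- mean(reference, na.rm = TRUE)
  ifelse(x == 0, fill, x)
}

revenue <- c(0, 0, 100, 300, NA)   # made-up values

replace_zero_with_mean(revenue)                        # fill = mean(0, 0, 100, 300) = 100
replace_zero_with_mean(revenue, exclude_zeros = TRUE)  # fill = mean(100, 300) = 200
```

Because most rows in this dataset are zero, the two fill values would differ considerably on the real columns.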
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.3859 1.1277 2.9215 3.6789 547.4883 6
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.386 1.128 2.884 3.679 100.000 6
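The two summaries above show popularity before and after capping at 100. The chunk that performed the cap is not shown, so the following is only one way it could be done, using base R's `pmin()` on made-up values. `popularity_capped` is a hypothetical name.

```r
# Illustrative popularity scores, two of them above the 100-point limit
popularity <- c(0.39, 1.13, 3.68, 154.80, 547.49, NA)

# pmin() takes the element-wise minimum, so anything above 100 becomes 100;
# NA entries stay NA
popularity_capped <- pmin(popularity, 100)

summary(popularity_capped)
```

Capping keeps the affected rows in the dataset, which matches the stated goal of not eliminating them.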
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 85.00 95.00 94.13 107.00 1256.00 263
The average movie runtime is 94.13 minutes
# Replace zeroes and NA values in the 'runtime' column with the average runtime (94)
movies <- movies %>%
  mutate(runtime = ifelse(runtime == 0 | is.na(runtime), 94, runtime))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 87.00 95.00 97.35 107.00 1256.00
Since multiple values were NA, these were also replaced with the average.
Most of the variables in this dataset do not need to be checked for duplicates; for example, it is completely normal for movies to share the same genres, spoken language, company, etc. However, there are three variables that must be checked for partial duplicates, which are the following.
## [1] 17
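A count like the one above can be obtained with `sum(duplicated(...))`, which counts rows that repeat an earlier row in every column. The sketch below contrasts that with a partial-duplicate check on a single key column. The toy data frame and the choice of `id` as the key are assumptions made for illustration only.

```r
# Toy data: row 3 repeats row 2 exactly; rows 4 and 5 share an id but differ in revenue
films <- data.frame(
  id = c(1, 2, 2, 3, 3),
  title = c("A", "B", "B", "C", "C"),
  revenue = c(10, 20, 20, 30, 35)
)

# Full duplicates: rows identical to an earlier row in every column
n_full <- sum(duplicated(films))

# Partial duplicates: every row whose id occurs more than once
partial <- films[duplicated(films$id) | duplicated(films$id, fromLast = TRUE), ]

# Drop the full duplicates, keeping the first occurrence
films_unique <- films[!duplicated(films), ]
```

Partial duplicates need a manual look before removal, since rows that share a key may still carry different information.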
# Creating a data frame for full duplicates visualization
duplicated_rows <- movies[duplicated(movies), ]
duplicated_rows
## adult budget homepage id imdb_id
## 1466 FALSE 4224579 105045 tt0111613
## 9166 FALSE 4224579 5511 tt0062229
## 9328 FALSE 4224579 23305 tt0295682
## 13376 FALSE 4224579 141971 tt1180333
## 16765 FALSE 4224579 141971 tt1180333
## 21166 FALSE 4224579 119916 tt0080000
## 21855 FALSE 4224579 152795 tt1821641
## 22152 FALSE 4224579 http://www.daysofdarknessthemovie.com/ 18440 tt0499456
## 23045 FALSE 4224579 25541 tt1327820
## 24845 FALSE 4224579 http://www.dealthemovie.com/ 11115 tt0446676
## 28861 FALSE 4224579 168538 tt0084387
## 29375 FALSE 4224579 42495 tt0067306
## 35799 FALSE 4224579 159849 tt0173769
## 38872 FALSE 4224579 99080 tt0022537
## 40041 FALSE 980000 298721 tt2818654
## 40277 FALSE 4224579 97995 tt0127834
## 45266 FALSE 4224579 265189 tt2121382
## original_language original_title
## 1466 de Das Versprechen
## 9166 fr Le Samouraï
## 9328 en The Warrior
## 13376 fi Blackout
## 16765 fi Blackout
## 21166 en The Tempest
## 21855 en The Congress
## 22152 en Days of Darkness
## 23045 da Broderskab
## 24845 en Deal
## 28861 en Nana
## 29375 en King Lear
## 35799 en Why We Fight: Divide and Conquer
## 38872 en The Viking
## 40041 th รักที่ขอนแก่น
## 40277 en Seven Years Bad Luck
## 45266 sv Turist
## overview
## 1466 East-Berlin, 1961, shortly after the erection of the Wall. Konrad, Sophie and three of their friends plan a daring escape to Western Germany. The attempt is successful, except for Konrad, who remains behind. From then on, and for the next 28 years, Konrad and Sophie will attempt to meet again, in spite of the Iron Curtain. Konrad, who has become a reputed Astrophysicist, tries to take advantage of scientific congresses outside Eastern Germany to arrange encounters with Sophie. But in a country where the political police, the Stasi, monitors the moves of all suspicious people (such as Konrad's sister Barbara and her husband Harald), preserving one's privacy, ideals and self-respect becomes an exhausting fight, even as the Eastern block begins its long process of disintegration.
## 9166 Hitman Jef Costello is a perfectionist who always carefully plans his murders and who never gets caught.
## 9328 In feudal India, a warrior (Khan) who renounces his role as the longtime enforcer to a local lord becomes the prey in a murderous hunt through the Himalayan mountains.
## 13376 Recovering from a nail gun shot to the head and 13 months of coma, doctor Pekka Valinta starts to unravel the mystery of his past, still suffering from total amnesia.
## 16765 Recovering from a nail gun shot to the head and 13 months of coma, doctor Pekka Valinta starts to unravel the mystery of his past, still suffering from total amnesia.
## 21166 Prospero, the true Duke of Milan is now living on an enchanted island with his daughter Miranda, the savage Caliban and Ariel, a spirit of the air. Raising a sorm to bring his brother - the usurper of his dukedom - along with his royal entourage. to the island. Prospero contrives his revenge.
## 21855 More than two decades after catapulting to stardom with The Princess Bride, an aging actress (Robin Wright, playing a version of herself) decides to take her final job: preserving her digital likeness for a future Hollywood. Through a deal brokered by her loyal, longtime agent and the head of Miramount Studios, her alias will be controlled by the studio, and will star in any film they want with no restrictions. In return, she receives healthy compensation so she can care for her ailing son and her digitized character will stay forever young. Twenty years later, under the creative vision of the studio’s head animator, Wright’s digital double rises to immortal stardom. With her contract expiring, she is invited to take part in “The Congress” convention as she makes her comeback straight into the world of future fantasy cinema.
## 22152 When a comet strikes Earth and kicks up a cloud of toxic dust, hundreds of humans join the ranks of the living dead. But there's bad news for the survivors: The newly minted zombies are hell-bent on eradicating every last person from the planet. For the few human beings who remain, going head to head with the flesh-eating fiends is their only chance for long-term survival. Yet their battle will be dark and cold, with overwhelming odds.
## 23045 Former Danish servicemen Lars and Jimmy are thrown together while training in a neo-Nazi group. Moving from hostility through grudging admiration to friendship and finally passion, events take a darker turn when their illicit relationship is uncovered.
## 24845 As an ex-gambler teaches a hot-shot college kid some things about playing cards, he finds himself pulled into the world series of poker, where his protégé is his toughest competition.
## 28861 In Zola's Paris, an ingenue arrives at a tony bordello: she's Nana, guileless, but quickly learning to use her erotic innocence to get what she wants. She's an actress for a soft-core filmmaker and soon is the most popular courtesan in Paris, parlaying this into a house, bought for her by a wealthy banker. She tosses him and takes up with her neighbor, a count of impeccable rectitude, and with the count's impressionable son. The count is soon fetching sticks like a dog and mortgaging his lands to satisfy her whims.
## 29375 King Lear, old and tired, divides his kingdom among his daughters, giving great importance to their protestations of love for him. When Cordelia, youngest and most honest, refuses to idly flatter the old man in return for favor, he banishes her and turns for support to his remaining daughters. But Goneril and Regan have no love for him and instead plot to take all his power from him. In a parallel, Lear's loyal courtier Gloucester favors his illegitimate son Edmund after being told lies about his faithful son Edgar. Madness and tragedy befall both ill-starred fathers.
## 35799 The third film of Frank Capra's 'Why We Fight" propaganda film series, dealing with the Nazi conquest of Western Europe in 1940.
## 38872 Originally called White Thunder, American producer Varick Frissell's 1931 film was inspired by his love for the Canadian Arctic Circle. Set in a beautifully black-and-white filmed Newfoundland, it is the story of a rivalry between two seal hunters that plays out on the ice floes during a hunt. Unsatisfied with the first cut, Frissell arranged for the crew to accompany an actual Newfoundland seal hunt on The SS Viking, on which an explosion of dynamite (carried regularly at the time on Arctic ships to combat ice jams) killed many members of the crew, including Frissell. The film was renamed in honor of the dead.
## 40041 In a hospital, ten soldiers are being treated for a mysterious sleeping sickness. In a story in which dreams can be experienced by others, and in which goddesses can sit casually with mortals, a nurse learns the reason why the patients will never be cured, and forms a telepathic bond with one of them.
## 40277 After breaking a mirror in his home, superstitious Max tries to avoid situations which could bring bad luck but in doing so, causes himself the worst luck imaginable.
## 45266 While holidaying in the French Alps, a Swedish family deals with acts of cowardliness as an avalanche breaks out.
## popularity poster_path release_date revenue runtime
## 1466 0.122178 /5WFIrBhOOgc0jGmoLxMZwWqCctO.jpg 1995-02-16 11209349 115
## 9166 9.091288 /cvNW8IXigbaMNo4gKEIps0NGnhA.jpg 1967-10-25 39481 105
## 9328 1.967992 /9GlrmbZO7VGyqhaSR1utinRJz3L.jpg 2001-09-23 11209349 86
## 13376 0.411949 /8VSZ9coCzxOCW2wE2Qene1H1fKO.jpg 2008-12-26 11209349 108
## 16765 0.411949 /8VSZ9coCzxOCW2wE2Qene1H1fKO.jpg 2008-12-26 11209349 108
## 21166 0.000018 /gLVRTxaLtUDkfscFKPyYrCtRnTk.jpg 1980-02-27 11209349 123
## 21855 8.534039 /nnKX3ahYoT7P3au92dNgLf4pKwA.jpg 2013-05-16 455815 122
## 22152 1.436085 /tWCyKXHuSrQdLAvNeeVJBnhf1Yv.jpg 2007-01-01 11209349 89
## 23045 2.587911 /q19Q5BRZpMXoNCA4OYodVozfjUh.jpg 2009-10-21 11209349 90
## 24845 6.880365 /kHaBqrrozaG7rj6GJg3sUCiM29B.jpg 2008-01-29 11209349 85
## 28861 1.276602 /pg4PUHRFrgNfACHSh5MITQ2gYch.jpg 1983-06-13 11209349 92
## 29375 0.187901 /xuE1IlUCohbxMY0fiqKTT6d013n.jpg 1971-02-04 11209349 137
## 35799 0.473322 /g21ruZZ3BZeUDuKMb82kejjtufk.jpg 1943-01-01 11209349 57
## 38872 0.002362 /qenjwRvW9itR5pVp4CBkYfhVAOp.jpg 1931-06-21 11209349 70
## 40041 2.535419 /5GasjPRAy5rlEyDOH7MeOyxyQGX.jpg 2015-09-02 11209349 122
## 40277 0.141558 /4J6Ai4C5YRgfRUTlirrJ7QsmJKU.jpg 1921-02-06 11209349 62
## 45266 12.165685 /rGMtc9AtZsnWSSL5VnLaGvx1PI6.jpg 2014-08-15 1359497 118
## status
## 1466 Released
## 9166 Released
## 9328 Released
## 13376 Released
## 16765 Released
## 21166 Released
## 21855 Released
## 22152 Released
## 23045 Released
## 24845 Released
## 28861 Released
## 29375 Rumored
## 35799 Released
## 38872 Released
## 40041 Released
## 40277 Released
## 45266 Released
## tagline
## 1466 A love, a hope, a wall.
## 9166 There is no solitude greater than that of the Samurai
## 9328
## 13376 Which one is the first to return - memory or the murderer?
## 16765 Which one is the first to return - memory or the murderer?
## 21166
## 21855
## 22152
## 23045
## 24845
## 28861
## 29375
## 35799
## 38872 Actually produced during the Great Newfoundland Seal Hunt and You see the REAL thing
## 40041
## 40277
## 45266
## title video vote_average vote_count
## 1466 The Promise False 5.0 1
## 9166 Le Samouraï False 7.9 187
## 9328 The Warrior False 6.3 15
## 13376 Blackout False 6.7 3
## 16765 Blackout False 6.7 3
## 21166 The Tempest False 0.0 0
## 21855 The Congress False 6.4 165
## 22152 Days of Darkness False 5.0 5
## 23045 Brotherhood False 7.1 21
## 24845 Deal False 5.2 22
## 28861 Nana, the True Key of Pleasure False 4.7 3
## 29375 King Lear False 8.0 3
## 35799 Why We Fight: Divide and Conquer False 5.0 1
## 38872 The Viking False 0.0 0
## 40041 Cemetery of Splendour False 4.4 50
## 40277 Seven Years Bad Luck False 5.6 4
## 45266 Force Majeure False 6.8 255
## id_collection name_collection poster_path_collection
## 1466
## 9166
## 9328
## 13376
## 16765
## 21166
## 21855
## 22152
## 23045
## 24845
## 28861
## 29375
## 35799 158365 Why We Fight /fFYBLu2Hnx27CWLOMd425ExDkgK.jpg
## 38872
## 40041
## 40277
## 45266
## backdrop_path_collection genre1 genre2 genre3
## 1466 Drama Romance
## 9166 Crime Drama Thriller
## 9328 Adventure Animation Drama
## 13376 Thriller Mystery
## 16765 Thriller Mystery
## 21166 Fantasy Drama Science Fiction
## 21855 Drama Science Fiction Animation
## 22152 Action Horror Science Fiction
## 23045 Drama
## 24845 Comedy Drama
## 28861 Drama Comedy
## 29375 Drama Foreign
## 35799 None Documentary
## 38872 Action Drama Romance
## 40041 Drama Fantasy
## 40277 Comedy
## 45266 Comedy Drama
## country1 country2 country3
## 1466 Germany
## 9166 France Italy
## 9328 France Germany India
## 13376 Finland
## 16765 Finland
## 21166
## 21855 Belgium France Germany
## 22152 United States of America
## 23045 Sweden Denmark
## 24845 United States of America
## 28861
## 29375 Denmark United Kingdom
## 35799 United States of America
## 38872
## 40041 United Kingdom United States of America France
## 40277 United States of America
## 45266 Norway Sweden France
## country1_language country2_language country3_language
## 1466 Deutsch
## 9166 Français
## 9328 हिन्दी
## 13376 suomi
## 16765 suomi
## 21166
## 21855 English
## 22152 English
## 23045 Dansk
## 24845 English
## 28861
## 29375 English
## 35799 English
## 38872 English
## 40041 English ภาษาไทย
## 40277 English
## 45266 Français Norsk svenska
## company1
## 1466 Studio Babelsberg
## 9166 Fa cinematografica id
## 9328 Filmfour
## 13376 Filmiteollisuus Fine
## 16765 Filmiteollisuus Fine
## 21166
## 21855 Pandora Filmproduktion
## 22152
## 23045
## 24845 Andertainment Group
## 28861 Cannon Group
## 29375 Royal Shakespeare Company
## 35799
## 38872
## 40041 Match Factory The
## 40277 Max Linder Productions
## 45266 Motlys
## company2
## 1466 Centre National de la Cinématographie
## 9166 Compagnie Industrielle et Commerciale Cinématographique CICC
## 9328
## 13376
## 16765
## 21166
## 21855 Entre Chien et Loup
## 22152
## 23045
## 24845 Crescent City Pictures
## 28861 Metro Goldwyn Mayer MGM
## 29375 Laterna Film
## 35799
## 38872
## 40041 Louverture Films
## 40277
## 45266 Coproduction Office
## company3 budget_original revenue_original popularity_max
## 1466 Odessa Films 0 0 0.122178
## 9166 TC Productions 0 39481 9.091288
## 9328 0 0 1.967992
## 13376 0 0 0.411949
## 16765 0 0 0.411949
## 21166 0 0 0.000018
## 21855 Opus Film 0 455815 8.534039
## 22152 0 0 1.436085
## 23045 0 0 2.587911
## 24845 Tag Entertainment 0 0 6.880365
## 28861 0 0 1.276602
## 29375 Athena Film A S 0 0 0.187901
## 35799 0 0 0.473322
## 38872 0 0 0.002362
## 40041 Tordenfilm AS 980000 0 2.535419
## 40277 0 0 0.141558
## 45266 Film i Väst 0 1359497 12.165685
## [1] 0
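Full duplicates (rows identical in every column) were eliminated from the dataset. The counts below list the values of id, imdb_id, and original_title that occur more than once.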
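As a minimal sketch of the step described here, the base R snippet below removes fully duplicated rows and checks that none remain. The data frame `movies` and its values are invented for illustration; only the column names are borrowed from the printed output, and this is not necessarily the code used for the report.

```r
# Toy stand-in for the movies table (values are invented for illustration).
movies <- data.frame(
  id             = c("1", "2", "2", "3"),
  original_title = c("Blackout", "Deal", "Deal", "Blackout"),
  budget         = c(0, 100, 100, 0),
  stringsAsFactors = FALSE
)

# duplicated() flags each row that repeats an earlier row in every column.
n_full_dupes_before <- sum(duplicated(movies))

# Keep only the first occurrence of each fully identical row.
movies_dedup <- movies[!duplicated(movies), , drop = FALSE]

# After removal, no fully duplicated rows should be left.
n_full_dupes_after <- sum(duplicated(movies_dedup))

print(n_full_dupes_before)
print(n_full_dupes_after)
```

Rows 1 and 4 share a title and budget but differ in id, so they are not full duplicates and both are kept. Only the repeated "Deal" row is dropped.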
## id n
## 1 10991 2
## 2 109962 2
## 3 110428 2
## 4 12600 2
## 5 13209 2
## 6 132641 2
## 7 14788 2
## 8 15028 2
## 9 22649 2
## 10 4912 2
## 11 69234 2
## 12 77221 2
## 13 84198 2
## imdb_id n
## 1 17
## 2 0 3
## 3 tt0022879 2
## 4 tt0046468 2
## 5 tt0082992 2
## 6 tt0100361 2
## 7 tt0157472 2
## 8 tt0235679 2
## 9 tt0270288 2
## 10 tt0287635 2
## 11 tt0454792 2
## 12 tt0499537 2
## 13 tt1701210 2
## 14 tt1736049 2
## 15 tt2018086 2
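The id and imdb_id counts above have the shape of a count-then-filter step: tally each key and keep those seen more than once. The sketch below shows one way to do this in base R. The helper `count_repeats` and the toy data are hypothetical. The imdb_id output also contains an empty string (17 rows) and "0" (3 rows); since IMDb identifiers normally start with "tt", the sketch assumes these are placeholders for missing values and excludes them from the count.

```r
# Hypothetical helper: tally each value and keep those occurring more than once.
count_repeats <- function(x) {
  counts <- as.data.frame(table(x, useNA = "no"), stringsAsFactors = FALSE)
  names(counts) <- c("value", "n")
  counts <- counts[counts$n > 1, , drop = FALSE]
  rownames(counts) <- NULL
  counts
}

# Toy data (imdb_id values are invented for illustration).
movies <- data.frame(
  id      = c("10991", "10991", "4912", "4912", "862"),
  imdb_id = c("ttA", "ttA", "", "0", "ttB"),
  stringsAsFactors = FALSE
)

# Repeated ids: candidates for partial duplicates that need manual review.
id_repeats <- count_repeats(movies$id)

# Assumption: "" and "0" mark a missing imdb_id, so recode them to NA
# before counting; table(useNA = "no") then ignores them.
imdb_clean <- movies$imdb_id
imdb_clean[imdb_clean %in% c("", "0")] <- NA
imdb_repeats <- count_repeats(imdb_clean)

print(id_repeats)
print(imdb_repeats)
```

Repeated original_title values, listed next, are a different case. Distinct films often share a title (remakes and adaptations, for example), so a repeated title alone is probably not enough to treat two rows as duplicates.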
## original_title n
## 1 12 Angry Men 2
## 2 20,000 Leagues Under the Sea 4
## 3 2:22 2
## 4 3:10 to Yuma 2
## 5 8 3
## 6 9 2
## 7 A Bucket of Blood 2
## 8 A Christmas Carol 7
## 9 A Dangerous Place 2
## 10 A Farewell to Arms 2
## 11 A Foreign Affair 2
## 12 A Girl in Every Port 2
## 13 A Hole in the Head 2
## 14 A Kiss Before Dying 2
## 15 A Letter to Three Wives 2
## 16 A Little Princess 2
## 17 A Madea Christmas 2
## 18 A Midsummer Night's Dream 4
## 19 A Night to Remember 2
## 20 A Nightmare on Elm Street 2
## 21 A Place at the Table 2
## 22 A Raisin in the Sun 2
## 23 A Star Is Born 3
## 24 A Streetcar Named Desire 3
## 25 A Tale of Two Cities 3
## 26 Aakrosh 2
## 27 Aankhen 2
## 28 Abduction 2
## 29 Abel 2
## 30 Abendland 2
## 31 Above Suspicion 2
## 32 Absolution 2
## 33 Adam 3
## 34 Adventures in Babysitting 2
## 35 After 2
## 36 After Midnight 2
## 37 Aftermath 4
## 38 Airborne 2
## 39 Aladin 2
## 40 Alfie 2
## 41 Alice 3
## 42 Alice Through the Looking Glass 2
## 43 Alice in Wonderland 8
## 44 All Night Long 2
## 45 All Quiet on the Western Front 2
## 46 All of Me 2
## 47 All the King's Men 2
## 48 All the Way Home 2
## 49 Alone in the Dark 2
## 50 Altitude 2
## 51 Always 2
## 52 Amber Alert 2
## 53 America 2
## 54 American Gun 2
## 55 American Virgin 2
## 56 Americano 2
## 57 Amy 2
## 58 An Enemy of the People 2
## 59 An Ideal Husband 2
## 60 An Inspector Calls 2
## 61 Anastasia 2
## 62 And Soon the Darkness 2
## 63 And Then There Were None 2
## 64 Angel 4
## 65 Angel Baby 2
## 66 Angels in the Outfield 2
## 67 Angst 2
## 68 Animal 2
## 69 Animal Farm 2
## 70 Animals 3
## 71 Anita 3
## 72 Anna Karenina 4
## 73 Anne of Green Gables 3
## 74 Annie 3
## 75 Annie Oakley 2
## 76 Another World 2
## 77 April Fool's Day 2
## 78 Arabian Nights 2
## 79 Archangel 2
## 80 Arena 3
## 81 Around the World in 80 Days 2
## 82 Arrowhead 2
## 83 Arsène Lupin 2
## 84 Arthur 2
## 85 As You Like It 2
## 86 Aschenputtel 4
## 87 Assault on Precinct 13 2
## 88 Asylum 4
## 89 Attila 3
## 90 August 3
## 91 Aurora 3
## 92 Avalon 3
## 93 Awaken 2
## 94 Babes in Toyland 3
## 95 Back Street 2
## 96 Back in the Day 2
## 97 Backfire 2
## 98 Backstage 2
## 99 Bad Boys 3
## 100 Bad Company 3
## 101 Bad Girl 2
## 102 Bad Karma 3
## 103 Bait 2
## 104 Ballerina 2
## 105 Bandidos 2
## 106 Bandits 2
## 107 Barabbas 3
## 108 Barbara 2
## 109 Bare Knuckles 2
## 110 Barely Legal 2
## 111 Barnacle Bill 2
## 112 Barricade 2
## 113 Bartleby 2
## 114 Batman 2
## 115 Battleground 2
## 116 Beau Geste 2
## 117 Beautiful 2
## 118 Beautiful Creatures 2
## 119 Beauty and the Beast 5
## 120 Bed of Roses 2
## 121 Bedazzled 2
## 122 Behind Enemy Lines 2
## 123 Belle Starr 2
## 124 Ben-Hur 2
## 125 Beneath 2
## 126 Benji 2
## 127 Beowulf 2
## 128 Bernie 2
## 129 Best Friends 2
## 130 Betrayal 2
## 131 Betrayed 2
## 132 Between Us 2
## 133 Bewitched 2
## 134 Beyond 3
## 135 Beyond a Reasonable Doubt 2
## 136 Big Game 2
## 137 Big Trouble 2
## 138 Bigfoot 3
## 139 Billy the Kid 2
## 140 Bingo 2
## 141 Bird on a Wire 2
## 142 Black 2
## 143 Black Angel 2
## 144 Black Beauty 2
## 145 Black Christmas 2
## 146 Black Friday 2
## 147 Black Gold 2
## 148 Black Magic 2
## 149 Black Moon 2
## 150 Black Sheep 2
## 151 Black Widow 3
## 152 Blackbird 3
## 153 Blackout 4
## 154 Blast 2
## 155 Blind 4
## 156 Blind Date 4
## 157 Bliss 2
## 158 Blood Moon 2
## 159 Blood Ties 2
## 160 Blood and Sand 2
## 161 Blood: The Last Vampire 2
## 162 Bloodline 2
## 163 Blown Away 2
## 164 Blue 3
## 165 Blue Steel 2
## 166 Bluebeard 2
## 167 Bluebird 2
## 168 Body and Soul 3
## 169 Bomber 2
## 170 Book of Love 2
## 171 Borderline 4
## 172 Bordertown 2
## 173 Born Reckless 3
## 174 Born Yesterday 2
## 175 Born to Be Bad 2
## 176 Born to Be Wild 2
## 177 Borrowed Time 2
## 178 Boulevard 3
## 179 Bound 2
## 180 Boy 3
## 181 Boy Meets Girl 4
## 182 Brainstorm 2
## 183 Branded 2
## 184 Brave 2
## 185 Breaking Point 4
## 186 Breaking and Entering 2
## 187 Breakout 3
## 188 Breathing Room 2
## 189 Breathless 2
## 190 Brewster's Millions 2
## 191 Bright Lights 2
## 192 Brighton Rock 2
## 193 Broken 2
## 194 Broken Arrow 2
## 195 Broken Blossoms 2
## 196 Broken English 2
## 197 Brother's Keeper 2
## 198 Brutal 2
## 199 Bubble 2
## 200 Buddy 2
## 201 Bug 2
## 202 Bullet 2
## 203 Bulletproof 2
## 204 Bully 2
## 205 Buried Alive 2
## 206 By Dawn's Early Light 2
## 207 By the Sea 3
## 208 Ca$h 2
## 209 Cabin Fever 2
## 210 Cake 2
## 211 California 2
## 212 Camille 3
## 213 Camille Claudel 1915 2
## 214 Camino 2
## 215 Camp 2
## 216 Candy 2
## 217 Cape Fear 2
## 218 Caprice 2
## 219 Captain America 3
## 220 Captain January 2
## 221 Captive 3
## 222 Caravaggio 2
## 223 Cargo 4
## 224 Carmen 4
## 225 Carnival of Souls 2
## 226 Carny 2
## 227 Carrie 4
## 228 Casanova 2
## 229 Casino Royale 2
## 230 Cat People 2
## 231 Catacombs 2
## 232 Catch Me If You Can 2
## 233 Caught 4
## 234 Chain Reaction 2
## 235 Chain of Command 3
## 236 Chained 2
## 237 Champion 2
## 238 Chaos 3
## 239 Charlotte's Web 2
## 240 Charly 2
## 241 Cheaper by the Dozen 2
## 242 Chicago 2
## 243 Child's Play 2
## 244 Children of the Corn 2
## 245 China Gate 2
## 246 Chocolat 3
## 247 Christine 3
## 248 Christmas Eve 2
## 249 Christmas in Connecticut 2
## 250 Chrysalis 2
## 251 Ciało 2
## 252 Cimarron 2
## 253 Cinderella 7
## 254 City of Ghosts 2
## 255 Clash of the Titans 2
## 256 Cleopatra 5
## 257 Clockstoppers 2
## 258 Cloud 9 2
## 259 Cobra 2
## 260 Cocktail 2
## 261 Cold Sweat 2
## 262 Colegas 2
## 263 College 2
## 264 Coming Soon 2
## 265 Committed 3
## 266 Company 2
## 267 Compulsion 2
## 268 Conan the Barbarian 2
## 269 Concussion 2
## 270 Coney Island 2
## 271 Confessions of a Dangerous Mind 2
## 272 Conspiracy 2
## 273 Contagion 2
## 274 Contraband 2
## 275 Control 2
## 276 Cosmos 2
## 277 Countdown 2
## 278 Crackerjack 2
## 279 Crash 2
## 280 Crash Dive 2
## 281 Crawlspace 3
## 282 Crazy 2
## 283 Crazy Horse 2
## 284 Crazy Love 2
## 285 Creature 4
## 286 Creep 2
## 287 Crime Wave 2
## 288 Crime and Punishment 2
## 289 Criminal 2
## 290 Crossroads 3
## 291 Crush 4
## 292 Cry, the Beloved Country 2
## 293 Cyberbully 2
## 294 Cyrano de Bergerac 2
## 295 D.O.A. 3
## 296 Dad's Army 2
## 297 Dante's Inferno 2
## 298 Dark City 2
## 299 Dark Horse 3
## 300 Dark House 2
## 301 Darkness Falls 2
## 302 Darling 4
## 303 Das Versprechen 2
## 304 David Copperfield 2
## 305 David and Lisa 2
## 306 Dawn of the Dead 2
## 307 Day One 2
## 308 Day of the Dead 2
## 309 Daylight 2
## 310 Dead Awake 3
## 311 Dead Birds 2
## 312 Dead End 3
## 313 Dead Heat 2
## 314 Dead Silence 2
## 315 Dead of Night 2
## 316 Deadfall 3
## 317 Deadline 4
## 318 Deal 2
## 319 Death Sentence 2
## 320 Death Valley 2
## 321 Death at a Funeral 2
## 322 Death of a Salesman 4
## 323 Deception 2
## 324 Deck the Halls 2
## 325 Defiance 2
## 326 Delirious 2
## 327 Delirium 2
## 328 Deliver Us from Evil 3
## 329 Dementia 2
## 330 Demon Hunter 2
## 331 Der Tunnel 2
## 332 Der var engang 2
## 333 Derailed 2
## 334 Deranged 2
## 335 Destroyer 2
## 336 Detention 2
## 337 Detour 4
## 338 Devil's Playground 2
## 339 Die Brücke 2
## 340 Die goldene Gans 2
## 341 Dillinger 2
## 342 Dinosaur Island 2
## 343 Dirty Dancing 2
## 344 Dirty Deeds 2
## 345 Django 2
## 346 Do Not Disturb 3
## 347 Doctor Dolittle 2
## 348 Doctor Strange 2
## 349 Dog Tags 2
## 350 Don Juan 2
## 351 Don Quixote 2
## 352 Don't Be Afraid of the Dark 2
## 353 Don't Drink the Water 2
## 354 Don't Hang Up 2
## 355 Double Indemnity 2
## 356 Double Take 2
## 357 Double Trouble 2
## 358 Double Wedding 2
## 359 Downhill 2
## 360 Dr. Jekyll and Mr. Hyde 4
## 361 Dracula 3
## 362 Dragonfly 2
## 363 Dragonslayer 2
## 364 Dreamcatcher 2
## 365 Dreamland 3
## 366 Dressed to Kill 3
## 367 Drive 2
## 368 Driving Me Crazy 2
## 369 Drone 2
## 370 Dunkirk 2
## 371 Déjà Vu 2
## 372 Earth vs. the Spider 2
## 373 Easy Living 2
## 374 Easy Virtue 2
## 375 Eat 3
## 376 Eden 5
## 377 Edge of Darkness 2
## 378 El Dorado 2
## 379 El Estudiante 2
## 380 El Greco 2
## 381 Elegy 2
## 382 Elephant 2
## 383 Elevator 2
## 384 Elokuu 2
## 385 Embrace of the Vampire 2
## 386 Emma 5
## 387 Empire 2
## 388 Employee of the Month 2
## 389 Enchanted April 2
## 390 Enchantment 2
## 391 Encore 2
## 392 End of the Line 2
## 393 End of the World 2
## 394 Endangered Species 2
## 395 Endgame 3
## 396 Endless Love 2
## 397 Enigma 3
## 398 Equinox 2
## 399 Erotikon 2
## 400 Escape 2
## 401 Escape to Witch Mountain 2
## 402 Eva 2
## 403 Everest 2
## 404 Evergreen 2
## 405 Evidence 3
## 406 Exit 3
## 407 Exposed 2
## 408 Extraction 2
## 409 FC Venus 2
## 410 Face 2
## 411 Fade to Black 3
## 412 Fair Game 3
## 413 Fallen 4
## 414 Fame 2
## 415 Fanny 3
## 416 Fanny Hill 2
## 417 Fantastic Four 2
## 418 Far from the Madding Crowd 2
## 419 Father of the Bride 2
## 420 Fatherland 2
## 421 Fatso 2
## 422 Faust 3
## 423 Fear 2
## 424 Fear in the Night 2
## 425 Feast 2
## 426 Feed 2
## 427 Fever 2
## 428 Fever Pitch 2
## 429 Final Justice 2
## 430 Finders Keepers 2
## 431 Fire Down Below 2
## 432 Fire with Fire 2
## 433 First Daughter 2
## 434 Five 4
## 435 Flash Gordon 2
## 436 Flashback 2
## 437 Flawless 2
## 438 Flipper 2
## 439 Flowers in the Attic 2
## 440 Focus 2
## 441 Footloose 2
## 442 For Love or Money 2
## 443 Forbidden 3
## 444 Forever 2
## 445 Forget Me Not 2
## 446 Forsaken 2
## 447 Fortress 3
## 448 Fotograf 2
## 449 Four Sons 2
## 450 Foxfire 2
## 451 Fracture 2
## 452 Framed 2
## 453 Frankenstein 6
## 454 Frankenweenie 2
## 455 Frankie and Johnny 2
## 456 Freaky Friday 3
## 457 Freedom 2
## 458 Freeheld 2
## 459 Freeway 2
## 460 Fresh 2
## 461 Friday the 13th 2
## 462 Fright Night 2
## 463 From the Earth to the Moon 2
## 464 Frozen 3
## 465 Fun with Dick and Jane 2
## 466 Funny Farm 2
## 467 Funny Games 2
## 468 Fury 2
## 469 Gabriel 2
## 470 Gabrielle 2
## 471 Gambit 2
## 472 Game Over 2
## 473 Gamer 2
## 474 Gaslight 2
## 475 Genius 2
## 476 Geronimo 2
## 477 Get Carter 2
## 478 Get Out 2
## 479 Ghajini 2
## 480 Ghost 2
## 481 Ghostbusters 2
## 482 Ghoul 2
## 483 Gigi 2
## 484 Girls Town 2
## 485 Gloria 4
## 486 Go West 2
## 487 Godzilla 2
## 488 Going in Style 2
## 489 Going the Distance 2
## 490 Gold 4
## 491 Gone 2
## 492 Goodbye, Mr. Chips 2
## 493 Gossip 2
## 494 Grace 3
## 495 Grand Hotel 2
## 496 Grandma's Boy 2
## 497 Graveyard Shift 2
## 498 Great Expectations 5
## 499 Grey Gardens 2
## 500 Gulliver's Travels 4
## 501 Gus 2
## 502 Guys and Dolls 2
## 503 Gypsy 2
## 504 Hairspray 2
## 505 Halloween 3
## 506 Halloween II 2
## 507 Hamlet 8
## 508 Hansel & Gretel 2
## 509 Hansel and Gretel 2
## 510 Happiness 2
## 511 Happy 2
## 512 Happy End 2
## 513 Happy New Year 2
## 514 Hard Luck 2
## 515 Hardcore 2
## 516 Harvest 2
## 517 Harvey 2
## 518 Hawaii 2
## 519 Hawking 2
## 520 Head Over Heels 2
## 521 Heartbreak Hotel 2
## 522 Heartbreakers 2
## 523 Hearts and Minds 2
## 524 Heat 3
## 525 Heaven 2
## 526 Heaven Can Wait 2
## 527 Heavy Petting 2
## 528 Hector 2
## 529 Heidi 6
## 530 Heist 2
## 531 Held Up 2
## 532 Helen 2
## 533 Hellgate 2
## 534 Helter Skelter 2
## 535 Hercules 4
## 536 Hero 2
## 537 Hidden Agenda 2
## 538 Hide and Seek 2
## 539 High Noon 2
## 540 High School 2
## 541 High Society 2
## 542 High Strung 2
## 543 Hiroshima 2
## 544 Holiday 2
## 545 Holy Matrimony 2
## 546 Home 5
## 547 Home Movie 2
## 548 Home Sweet Home 2
## 549 Home for the Holidays 2
## 550 Home of the Brave 4
## 551 Honeymoon 2
## 552 Hope Springs 2
## 553 Horton Hears a Who! 2
## 554 Hot Pursuit 2
## 555 Hotel 3
## 556 Houdini 2
## 557 House 2
## 558 House of Cards 2
## 559 House of Usher 2
## 560 House of Wax 2
## 561 House on Haunted Hill 2
## 562 Housekeeping 2
## 563 How to Make a Monster 2
## 564 Howl 2
## 565 Hunger 3
## 566 Hurricane 2
## 567 Hush 3
## 568 I Love Trouble 2
## 569 I'll Sleep When I'm Dead 2
## 570 Ice Castles 2
## 571 Imitation of Life 2
## 572 Impact 2
## 573 Impulse 3
## 574 In Cold Blood 2
## 575 In the Blood 2
## 576 Incubus 3
## 577 Indian Summer 2
## 578 Inferno 4
## 579 Inherit the Wind 2
## 580 Innocence 2
## 581 Inside 2
## 582 Inside Out 3
## 583 Insomnia 2
## 584 Into the Storm 2
## 585 Into the Sun 2
## 586 Into the West 2
## 587 Into the Woods 3
## 588 Intruder 2
## 589 Intruders 3
## 590 Invaders from Mars 2
## 591 Invasion of the Body Snatchers 2
## 592 Invincible 2
## 593 Iris 3
## 594 Iron Man 3
## 595 Isolation 2
## 596 It Takes Two 2
## 597 It's Alive 2
## 598 It's Such a Beautiful Day 2
## 599 Ivanhoe 2
## 600 Jack 3
## 601 Jack Frost 3
## 602 Jack and the Beanstalk 4
## 603 Jack the Giant Killer 2
## 604 Jack the Ripper 2
## 605 Jailbait 2
## 606 Jane Eyre 6
## 607 Jason and the Argonauts 2
## 608 Jersey Girl 2
## 609 Jerusalem 2
## 610 Jesus 2
## 611 Jesus Christ Superstar 2
## 612 Jigsaw 3
## 613 Joan of Arc 3
## 614 Joanna 2
## 615 Joe 2
## 616 Joey 2
## 617 Joshua 2
## 618 Journey Into Fear 2
## 619 Journey to the Center of the Earth 4
## 620 Joy 2
## 621 Judas Kiss 2
## 622 Judex 2
## 623 Juha 2
## 624 Julia 3
## 625 Julie 2
## 626 Julius Caesar 3
## 627 Junior 2
## 628 Just My Luck 2
## 629 Just for Kicks 2
## 630 Kamikaze 2
## 631 Kid Galahad 2
## 632 Kidnapped 2
## 633 Kiki 3
## 634 Kill 'em All 2
## 635 Kill Switch 2
## 636 Kill Your Darlings 2
## 637 Killjoy 2
## 638 Kind Lady 2
## 639 King Cobra 2
## 640 King Kong 3
## 641 King Lear 5
## 642 King Solomon's Mines 4
## 643 Kingdom Come 2
## 644 Kismet 2
## 645 Kiss Me Goodbye 2
## 646 Kiss of Death 2
## 647 Knock Knock 2
## 648 Knockout 2
## 649 Kon-Tiki 2
## 650 L'Attentat 2
## 651 L'Auberge rouge 2
## 652 L'Enfer 2
## 653 L'amour fou 2
## 654 LOL 2
## 655 La maschera del demonio 2
## 656 La religieuse 2
## 657 Labyrinth 2
## 658 Ladrones 2
## 659 Lamerica 2
## 660 Lassie 2
## 661 Last Holiday 2
## 662 Last Man Standing 2
## 663 Last Night 2
## 664 Last Resort 2
## 665 Last Summer 2
## 666 Late Bloomers 2
## 667 Law and Order 2
## 668 Le Comte de Monte-Cristo 2
## 669 Le fils 2
## 670 Left Behind 2
## 671 Legacy 3
## 672 Legend 2
## 673 Legion 2
## 674 Les Misérables 7
## 675 Les liaisons dangereuses 2
## 676 Leviathan 2
## 677 Life 4
## 678 Lifted 2
## 679 Lights Out 2
## 680 Liliom 2
## 681 Limelight 2
## 682 Lionheart 2
## 683 Little Dorrit 2
## 684 Little Lord Fauntleroy 2
## 685 Little Men 2
## 686 Little Miss Marker 2
## 687 Little Monsters 2
## 688 Little Sister 2
## 689 Little Women 3
## 690 Lizzie 2
## 691 Loaded 3
## 692 Local Color 2
## 693 Loft 2
## 694 Logan 2
## 695 Lola 3
## 696 Lolita 2
## 697 London 2
## 698 London After Midnight 2
## 699 Long Day's Journey Into Night 2
## 700 Long Weekend 2
## 701 Lord of the Flies 2
## 702 Lost & Found 2
## 703 Lost Horizon 2
## 704 Love 4
## 705 Love Affair 2
## 706 Loverboy 3
## 707 Lovesick 2
## 708 Loving 2
## 709 Lucky 3
## 710 Lucky Luke 2
## 711 Lucy 2
## 712 Lullaby 2
## 713 Luther 2
## 714 M 2
## 715 Macbeth 7
## 716 Mad Love 2
## 717 Madagascar 2
## 718 Madame Bovary 4
## 719 Mademoiselle 2
## 720 Madhouse 3
## 721 Magnificent Obsession 2
## 722 Magnus 2
## 723 Mail Order Bride 2
## 724 Malcolm X 2
## 725 Mama 2
## 726 Mammoth 2
## 727 Man of the House 2
## 728 Man of the Moment 2
## 729 Man of the Year 2
## 730 Man on Fire 2
## 731 Maniac 4
## 732 Mannequin 2
## 733 Mansfield Park 2
## 734 Margaret 2
## 735 Marie Antoinette 2
## 736 Marius 2
## 737 Mars 2
## 738 Martyrs 2
## 739 Masterminds 2
## 740 Mata Hari 2
## 741 Matilda 2
## 742 Max 3
## 743 Mayerling 2
## 744 Medea 2
## 745 Melody 2
## 746 Memorial Day 2
## 747 Memory Lane 2
## 748 Men with Guns 2
## 749 Mercenaries 2
## 750 Mercy 4
## 751 Meteor 2
## 752 Metropolis 2
## 753 Michael 3
## 754 Mickey 2
## 755 Middle of Nowhere 2
## 756 Midnight 3
## 757 Midnight Man 2
## 758 Mighty Joe Young 2
## 759 Mildred Pierce 2
## 760 Milk 2
## 761 Mine 2
## 762 Miracle on 34th Street 2
## 763 Mirage 2
## 764 Miranda 3
## 765 Mirror Mirror 2
## 766 Mischief Night 2
## 767 Miss Julie 2
## 768 Moana 2
## 769 Molière 2
## 770 Momentum 2
## 771 Mommy 2
## 772 Monkey Business 2
## 773 Monster 2
## 774 Montana 2
## 775 Monte Carlo 2
## 776 More 2
## 777 Morgan 2
## 778 Morning Glory 2
## 779 Mortuary 2
## 780 Mosaic 2
## 781 Mother's Day 3
## 782 Moving Target 2
## 783 Mr. & Mrs. Smith 2
## 784 Mr. Jones 2
## 785 Mr. Right 3
## 786 Much Ado About Nothing 2
## 787 Murder on the Orient Express 2
## 788 Mutants 2
## 789 Mutiny on the Bounty 2
## 790 My Bloody Valentine 2
## 791 My Blue Heaven 2
## 792 My Cousin Rachel 2
## 793 My Man Godfrey 2
## 794 My Sister Eileen 2
## 795 Mysterious Island 2
## 796 Mädchen in Uniform 2
## 797 Naked 2
## 798 Nana 3
## 799 Nancy Drew 2
## 800 Natural Selection 2
## 801 Nature of the Beast 2
## 802 Ned Kelly 2
## 803 Neighbors 3
## 804 Never a Dull Moment 2
## 805 Next of Kin 3
## 806 Night Moves 2
## 807 Night and the City 2
## 808 Night of the Demon 2
## 809 Night of the Demons 2
## 810 Night of the Living Dead 2
## 811 Nightmare 3
## 812 Nightmares 2
## 813 Nina 2
## 814 Nine Lives 3
## 815 No Escape 2
## 816 No Good Deed 2
## 817 No Man of Her Own 2
## 818 No Man's Land 4
## 819 No Smoking 2
## 820 No Way Out 2
## 821 Noah 2
## 822 Noah's Ark 2
## 823 Nobody's Fool 2
## 824 Nobody's Perfect 2
## 825 Nocturna 2
## 826 Noise 2
## 827 Non-Stop 2
## 828 Normal 2
## 829 Nothing Personal 2
## 830 Nothing to Lose 2
## 831 Notorious 2
## 832 Oblivion 2
## 833 Obsessed 3
## 834 Obsession 2
## 835 Ocean's Eleven 2
## 836 Of Mice and Men 2
## 837 Offside 2
## 838 Oklahoma! 2
## 839 Oliver Twist 4
## 840 On the Beach 2
## 841 Once a Thief 2
## 842 One More Time 2
## 843 One Week 2
## 844 Open Season 2
## 845 Opening Night 2
## 846 Operator 2
## 847 Oscar 2
## 848 Othello 4
## 849 Our Town 2
## 850 Out Cold 2
## 851 Out of Reach 2
## 852 Out of Time 2
## 853 Out of the Blue 3
## 854 Out on a Limb 2
## 855 Outrage 3
## 856 Paid in Full 2
## 857 Pan 2
## 858 Panic Button 2
## 859 Paparazzi 2
## 860 Paradise 3
## 861 Paradox 2
## 862 Paranoia 2
## 863 Parineeta 2
## 864 Party Girl 3
## 865 Party Monster 2
## 866 Passengers 2
## 867 Passion 2
## 868 Patrick 2
## 869 Penelope 2
## 870 Pennies from Heaven 3
## 871 Penumbra 2
## 872 Persuasion 3
## 873 Pete's Dragon 2
## 874 Peter Pan 5
## 875 Phantom 3
## 876 Phoenix 3
## 877 Pie in the Sky 2
## 878 Pilgrimage 2
## 879 Pinocchio 4
## 880 Piranha 2
## 881 Pixels 2
## 882 Pizza 2
## 883 Planet of the Apes 2
## 884 Playing for Keeps 2
## 885 Poil de carotte 2
## 886 Point Break 2
## 887 Poison 2
## 888 Poison Ivy 2
## 889 Pokémon 3: The Movie 2
## 890 Police 2
## 891 Pollyanna 2
## 892 Poltergeist 2
## 893 Popcorn 2
## 894 Posse 2
## 895 Possessed 3
## 896 Possession 3
## 897 Pretty Baby 2
## 898 Pride 2
## 899 Pride and Prejudice 2
## 900 Priest 2
## 901 Priklyucheniya Buratino 2
## 902 Prince Valiant 2
## 903 Princess 2
## 904 Prinsessa 2
## 905 Private Parts 2
## 906 Project X 3
## 907 Prom Night 2
## 908 Promised Land 2
## 909 Proof 2
## 910 Proteus 2
## 911 Providence 2
## 912 Psycho 2
## 913 Public Enemies 3
## 914 Pulse 2
## 915 Pusher 2
## 916 Pünktchen und Anton 2
## 917 Q 2
## 918 Quartet 3
## 919 Quo Vadis 2
## 920 Race 2
## 921 Raffles 2
## 922 Rage 3
## 923 Rain 2
## 924 Rampage 2
## 925 Ransom 2
## 926 Rapid Fire 2
## 927 Rat 2
## 928 Raw Deal 2
## 929 Rear Window 2
## 930 Rebirth 2
## 931 Reckless 2
## 932 Red Dawn 2
## 933 Red Dust 2
## 934 Red Heat 2
## 935 Red Riding Hood 2
## 936 Refuge 2
## 937 Regeneration 2
## 938 Rembrandt 3
## 939 Remote Control 2
## 940 Requiem 2
## 941 Resistance 2
## 942 Respire 2
## 943 Restless 2
## 944 Restoration 2
## 945 Return to Sender 2
## 946 Revolution 2
## 947 Rich and Famous 2
## 948 Richard III 3
## 949 Ricochet 2
## 950 Ride 2
## 951 Riff-Raff 2
## 952 Rings 2
## 953 Rio 2
## 954 Riot 2
## 955 Ritual 2
## 956 River 2
## 957 Riverworld 2
## 958 Road 2
## 959 Road House 2
## 960 Roadie 2
## 961 Robin Hood 4
## 962 Robinson Crusoe 3
## 963 RoboCop 2
## 964 Rollerball 2
## 965 Roma 2
## 966 Romance 3
## 967 Romeo and Juliet 2
## 968 Room 2
## 969 Rosemary's Baby 2
## 970 Ruby 2
## 971 Run 3
## 972 Runaway 3
## 973 Running Scared 3
## 974 Rush 2
## 975 Sabotage 2
## 976 Sabrina 2
## 977 Sacrifice 3
## 978 Safe 2
## 979 Safe House 2
## 980 Sahara 4
## 981 Salem's Lot 2
## 982 Salomé 3
## 983 Salvage 2
## 984 Samsara 2
## 985 Samson and Delilah 3
## 986 San Quentin 2
## 987 Santa Claus 2
## 988 Santa Claws 2
## 989 Savages 2
## 990 Save Me 2
## 991 Saving Face 2
## 992 Saw 2
## 993 Scaramouche 2
## 994 Scarecrow 2
## 995 Scarface 2
## 996 School for Scoundrels 2
## 997 Scorned 2
## 998 Screamers 2
## 999 Screwed 2
## 1000 Scrooge 3
## 1001 Second Skin 2
## 1002 Secret Défense 2
## 1003 See No Evil 2
## 1004 Seizure 2
## 1005 Sense and Sensibility 4
## 1006 Senseless 2
## 1007 September 2
## 1008 Sequoia 2
## 1009 Serena 2
## 1010 Shadow People 2
## 1011 Shaft 2
## 1012 Shakedown 3
## 1013 Shank 2
## 1014 She 3
## 1015 Shelter 3
## 1016 Shelter Island 2
## 1017 Shenandoah 2
## 1018 Sherlock Holmes 4
## 1019 Shiva 2
## 1020 Shock Treatment 2
## 1021 Shoot to Kill 2
## 1022 Show Boat 2
## 1023 Sicario 2
## 1024 Side Effects 2
## 1025 Sidewalks of New York 2
## 1026 Signs 2
## 1027 Silent Retreat 2
## 1028 Silk 2
## 1029 Simon 2
## 1030 Sink or Swim 2
## 1031 Siren 2
## 1032 Sisters 3
## 1033 Ski Patrol 2
## 1034 Skin 2
## 1035 Skinwalkers 2
## 1036 Skylark 2
## 1037 Sleeping Beauty 4
## 1038 Sleuth 2
## 1039 Slipstream 2
## 1040 Slither 2
## 1041 Slow Burn 3
## 1042 Smile 2
## 1043 Snatched 2
## 1044 Snow White 4
## 1045 Soldier 2
## 1046 Solo 3
## 1047 Somebody Up There Likes Me 2
## 1048 Something Wild 2
## 1049 Something to Sing About 2
## 1050 Son of Dracula 2
## 1051 Sonny Boy 2
## 1052 Sorceress 2
## 1053 Sounder 2
## 1054 Sour Grapes 2
## 1055 Southern Comfort 2
## 1056 Sparkle 2
## 1057 Spartacus 2
## 1058 Speedway 2
## 1059 Spellbound 2
## 1060 Spider 2
## 1061 Spiders 2
## 1062 Spin 3
## 1063 Splendor 2
## 1064 Split Second 2
## 1065 Stage Fright 2
## 1066 Stage Struck 3
## 1067 Stagecoach 2
## 1068 Standoff 2
## 1069 Stardust 2
## 1070 State Fair 3
## 1071 Stay 2
## 1072 Steel 2
## 1073 Stella 3
## 1074 Stereo 2
## 1075 Stevie 2
## 1076 Still Life 2
## 1077 Stitches 2
## 1078 Stone Cold 2
## 1079 Stonewall 2
## 1080 Storm 2
## 1081 Storm Warning 2
## 1082 Stormy Weather 2
## 1083 Stranded 3
## 1084 Strange Invaders 2
## 1085 Strapped 2
## 1086 Straw Dogs 2
## 1087 Stuck 2
## 1088 Submarine 2
## 1089 Submerged 2
## 1090 Suddenly 2
## 1091 Sugar 3
## 1092 Sugar Hill 2
## 1093 Sultan 2
## 1094 Summer Camp 2
## 1095 Summer Holiday 2
## 1096 Summer School 3
## 1097 Sundown 2
## 1098 Sunset Strip 2
## 1099 Sunshine 2
## 1100 Superman 2
## 1101 Supernova 2
## 1102 Superstar 2
## 1103 Sur 2
## 1104 Survivor 3
## 1105 Suspect 3
## 1106 Swallows and Amazons 2
## 1107 Sweeney Todd: The Demon Barber of Fleet Street 3
## 1108 Sweet November 2
## 1109 Sweet Revenge 2
## 1110 Sweet Sixteen 2
## 1111 Swingers 2
## 1112 Switch 3
## 1113 Sybil 2
## 1114 Sylvia 2
## 1115 Tabu 2
## 1116 Take Me to the River 2
## 1117 Taken 2
## 1118 Tangerine 2
## 1119 Tangled 2
## 1120 Tango 2
## 1121 Taras Bulba 2
## 1122 Target 2
## 1123 Tarzan 2
## 1124 Taxi 2
## 1125 Teacher's Pet 3
## 1126 Teenage Mutant Ninja Turtles 2
## 1127 Tempest 2
## 1128 Terminus 2
## 1129 Tess of the D'Urbervilles 2
## 1130 The 39 Steps 3
## 1131 The Abandoned 2
## 1132 The Accused 2
## 1133 The Adventures of Huckleberry Finn 2
## 1134 The Adventures of Mark Twain 2
## 1135 The Age of Innocence 2
## 1136 The Alamo 2
## 1137 The Amityville Horror 2
## 1138 The Andromeda Strain 2
## 1139 The Architect 2
## 1140 The Art of the Steal 2
## 1141 The Assignment 2
## 1142 The Avengers 2
## 1143 The Aviator 2
## 1144 The Awakening 2
## 1145 The Awful Truth 2
## 1146 The Bachelor 2
## 1147 The Bad Seed 2
## 1148 The Bank 2
## 1149 The Barber 2
## 1150 The Bat 2
## 1151 The Beguiled 2
## 1152 The Best Man 2
## 1153 The Big Fix 2
## 1154 The Big Sleep 2
## 1155 The Big Steal 2
## 1156 The Birth of a Nation 2
## 1157 The Biscuit Eater 2
## 1158 The Black Cat 3
## 1159 The Black Hole 4
## 1160 The Black Room 2
## 1161 The Bling Ring 2
## 1162 The Blob 2
## 1163 The Blue Bird 2
## 1164 The Blue Lagoon 2
## 1165 The Book of Life 2
## 1166 The Borrowers 3
## 1167 The Boss 2
## 1168 The Bourne Identity 2
## 1169 The Box 2
## 1170 The Boxer 2
## 1171 The Boy 2
## 1172 The Boy Next Door 2
## 1173 The Boys 2
## 1174 The Brave One 2
## 1175 The Breed 2
## 1176 The Bridge 2
## 1177 The Browning Version 2
## 1178 The Buccaneer 2
## 1179 The Call of the Wild 3
## 1180 The Caller 2
## 1181 The Canterville Ghost 2
## 1182 The Captive 2
## 1183 The Caretaker 2
## 1184 The Case for Christ 2
## 1185 The Cat and the Canary 3
## 1186 The Cat in the Hat 2
## 1187 The Challenge 3
## 1188 The Champ 2
## 1189 The Charge of the Light Brigade 2
## 1190 The Chase 3
## 1191 The Child 2
## 1192 The Chosen 2
## 1193 The Circle 3
## 1194 The Club 2
## 1195 The Cold Light of Day 2
## 1196 The Collection 2
## 1197 The Collector 2
## 1198 The Comedian 3
## 1199 The Condemned 2
## 1200 The Confession 3
## 1201 The Connection 2
## 1202 The Cottage 2
## 1203 The Count of Monte Cristo 2
## 1204 The Covenant 2
## 1205 The Crazies 2
## 1206 The Crew 3
## 1207 The Cure 2
## 1208 The Damned 2
## 1209 The Dark 2
## 1210 The Dark Horse 2
## 1211 The Dark Knight 2
## 1212 The Dark Tower 2
## 1213 The Dawn Patrol 2
## 1214 The Day of the Triffids 2
## 1215 The Day the Earth Stood Still 2
## 1216 The Dead 2
## 1217 The Dead Zone 2
## 1218 The Deal 2
## 1219 The Deep Blue Sea 2
## 1220 The Defiant Ones 2
## 1221 The Dentist 2
## 1222 The Desert Song 3
## 1223 The Diary of Anne Frank 3
## 1224 The Disappeared 2
## 1225 The Double 2
## 1226 The Dream Team 2
## 1227 The Dresser 2
## 1228 The Dunwich Horror 2
## 1229 The Edge 2
## 1230 The Elephant Man 2
## 1231 The Emperor's New Clothes 2
## 1232 The Encounter 2
## 1233 The End 2
## 1234 The End of the Affair 2
## 1235 The Enforcer 2
## 1236 The Escapist 2
## 1237 The Falls 2
## 1238 The Fan 2
## 1239 The Fast and the Furious 2
## 1240 The Final Cut 2
## 1241 The Firm 2
## 1242 The First Time 2
## 1243 The Fly 2
## 1244 The Fog 2
## 1245 The Foreigner 2
## 1246 The Forest 2
## 1247 The Forger 2
## 1248 The Forgotten 2
## 1249 The Formula 2
## 1250 The Four Feathers 4
## 1251 The Four Horsemen of the Apocalypse 2
## 1252 The Freshman 2
## 1253 The Front 2
## 1254 The Front Page 2
## 1255 The Frozen North 2
## 1256 The Fugitive 2
## 1257 The Gambler 3
## 1258 The Garden 2
## 1259 The Gathering Storm 2
## 1260 The Gauntlet 2
## 1261 The General 2
## 1262 The Getaway 2
## 1263 The Ghost Train 2
## 1264 The Ghoul 2
## 1265 The Gift 3
## 1266 The Girl 2
## 1267 The Girl Next Door 3
## 1268 The Girl Said No 2
## 1269 The Girl on the Train 2
## 1270 The Glass House 2
## 1271 The Glass Menagerie 3
## 1272 The Good Humor Man 2
## 1273 The Good Lie 2
## 1274 The Good Shepherd 2
## 1275 The Goodbye Girl 2
## 1276 The Great Gatsby 4
## 1277 The Great Waltz 2
## 1278 The Greatest 2
## 1279 The Green Hornet 3
## 1280 The Guardian 2
## 1281 The Gunfighter 2
## 1282 The Happening 2
## 1283 The Hard Way 2
## 1284 The Haunted House 2
## 1285 The Haunting 2
## 1286 The Heartbreak Kid 2
## 1287 The Hills Have Eyes 2
## 1288 The Hitcher 2
## 1289 The Hive 2
## 1290 The Hole 2
## 1291 The Hollow 3
## 1292 The Hoodlum 2
## 1293 The Hound of the Baskervilles 6
## 1294 The Hunchback of Notre Dame 3
## 1295 The Hunted 2
## 1296 The Hunter 2
## 1297 The Hunters 4
## 1298 The Hunting Party 2
## 1299 The Hurricane 2
## 1300 The Immigrant 2
## 1301 The Importance of Being Earnest 3
## 1302 The In Crowd 2
## 1303 The In-Laws 2
## 1304 The Incident 3
## 1305 The Initiation of Sarah 2
## 1306 The Institute 2
## 1307 The Intern 2
## 1308 The Interview 2
## 1309 The Intruder 2
## 1310 The Invisible Woman 2
## 1311 The Invitation 2
## 1312 The Island 2
## 1313 The Island of Dr. Moreau 2
## 1314 The Italian Job 2
## 1315 The Jazz Singer 2
## 1316 The Journey 4
## 1317 The Jungle Book 3
## 1318 The Karate Kid 2
## 1319 The Keeper 3
## 1320 The Key 2
## 1321 The Kid 2
## 1322 The Killers 2
## 1323 The King and I 2
## 1324 The Kiss 4
## 1325 The Ladies Man 2
## 1326 The Lady Vanishes 3
## 1327 The Ladykillers 2
## 1328 The Land 2
## 1329 The Last House on the Left 2
## 1330 The Last Man on Earth 2
## 1331 The Last Patrol 2
## 1332 The Last Run 2
## 1333 The Last Word 2
## 1334 The Last of the Mohicans 3
## 1335 The Legend of Sleepy Hollow 3
## 1336 The Letter 3
## 1337 The Life & Adventures of Santa Claus 2
## 1338 The Little Prince 2
## 1339 The Lodger 2
## 1340 The Lone Ranger 3
## 1341 The Longest Yard 2
## 1342 The Lorax 2
## 1343 The Lost 2
## 1344 The Lost World 3
## 1345 The Lottery 2
## 1346 The Love Bug 2
## 1347 The Love Letter 2
## 1348 The Lovers 2
## 1349 The Luck of the Irish 2
## 1350 The Magician 2
## 1351 The Magnificent Seven 2
## 1352 The Maker 2
## 1353 The Maltese Falcon 2
## 1354 The Man 2
## 1355 The Man Who Knew Too Much 2
## 1356 The Man Who Wasn't There 2
## 1357 The Man in the Iron Mask 3
## 1358 The Manchurian Candidate 2
## 1359 The Mark 2
## 1360 The Mark of Cain 2
## 1361 The Mark of Zorro 2
## 1362 The Mask 2
## 1363 The Master 2
## 1364 The Matador 2
## 1365 The Matchmaker 2
## 1366 The Maze 2
## 1367 The Mechanic 2
## 1368 The Merchant of Venice 2
## 1369 The Merry Widow 3
## 1370 The Miracle Worker 3
## 1371 The Monster 2
## 1372 The Morning After 2
## 1373 The Mummy 4
## 1374 The Music Man 2
## 1375 The Neighbor 3
## 1376 The Night Before 2
## 1377 The Night Stalker 2
## 1378 The Nutty Professor 2
## 1379 The Old Dark House 2
## 1380 The Old Man and the Sea 2
## 1381 The Omen 2
## 1382 The One 2
## 1383 The Open Road 2
## 1384 The Order 2
## 1385 The Other Woman 2
## 1386 The Outsider 2
## 1387 The Pack 2
## 1388 The Package 2
## 1389 The Painted Veil 2
## 1390 The Paleface 2
## 1391 The Parent Trap 2
## 1392 The Patriot 2
## 1393 The Patsy 2
## 1394 The Penalty 2
## 1395 The Perils of Pauline 3
## 1396 The Phantom 3
## 1397 The Phantom of the Opera 4
## 1398 The Philadelphia Experiment 2
## 1399 The Pied Piper 2
## 1400 The Pink Panther 2
## 1401 The Pirates of Penzance 2
## 1402 The Pit 2
## 1403 The Pit and the Pendulum 3
## 1404 The Plainsman 2
## 1405 The Poseidon Adventure 2
## 1406 The Postman Always Rings Twice 2
## 1407 The Power and the Glory 2
## 1408 The Prince and the Pauper 4
## 1409 The Prisoner of Zenda 3
## 1410 The Producers 2
## 1411 The Program 2
## 1412 The Promise 2
## 1413 The Proposition 2
## 1414 The Prowler 2
## 1415 The Punisher 2
## 1416 The Queen 2
## 1417 The Quick and the Dead 2
## 1418 The Quiet American 2
## 1419 The Racket 2
## 1420 The Rainmaker 2
## 1421 The Raven 4
## 1422 The Razor's Edge 2
## 1423 The Real McCoy 2
## 1424 The Reckoning 3
## 1425 The Return 2
## 1426 The Revenant 2
## 1427 The Rift 2
## 1428 The Ring 3
## 1429 The River 2
## 1430 The Road 2
## 1431 The Roman Spring of Mrs. Stone 2
## 1432 The Rookie 2
## 1433 The Saint 2
## 1434 The Scapegoat 2
## 1435 The Scarecrow 2
## 1436 The Scarlet Letter 3
## 1437 The Scarlet Pimpernel 2
## 1438 The Sea Hawk 2
## 1439 The Search 2
## 1440 The Secret Garden 3
## 1441 The Secret Life of Walter Mitty 2
## 1442 The Sentinel 2
## 1443 The Shaggy Dog 2
## 1444 The Sheik 2
## 1445 The Show 2
## 1446 The Signal 2
## 1447 The Sitter 2
## 1448 The Snowman 2
## 1449 The Sound and the Fury 2
## 1450 The Spiral Staircase 2
## 1451 The Spirit of Christmas 3
## 1452 The Square 2
## 1453 The Squeeze 3
## 1454 The Stepfather 2
## 1455 The Stepford Wives 2
## 1456 The Stranger 4
## 1457 The Substitute 2
## 1458 The Sunshine Boys 2
## 1459 The Suspect 2
## 1460 The Take 3
## 1461 The Taming of the Shrew 3
## 1462 The Tempest 2
## 1463 The Ten Commandments 2
## 1464 The Theory of Everything 2
## 1465 The Thief of Bagdad 2
## 1466 The Thing 2
## 1467 The Thomas Crown Affair 2
## 1468 The Three Musketeers 7
## 1469 The Three Stooges 2
## 1470 The Time Machine 2
## 1471 The Time of Your Life 2
## 1472 The Toolbox Murders 2
## 1473 The Tracker 2
## 1474 The Trap 2
## 1475 The Trip 3
## 1476 The Trip to Bountiful 2
## 1477 The Tunnel 2
## 1478 The Turning Point 2
## 1479 The Undefeated 2
## 1480 The Underneath 2
## 1481 The Unholy Three 2
## 1482 The Uninvited 3
## 1483 The Unseen 2
## 1484 The Van 2
## 1485 The Verdict 2
## 1486 The Violent Kind 2
## 1487 The Virginian 2
## 1488 The Visit 4
## 1489 The Void 3
## 1490 The Waiting Room 2
## 1491 The Walking Dead 2
## 1492 The War 2
## 1493 The War at Home 2
## 1494 The Wedding March 2
## 1495 The Well 2
## 1496 The Wicker Man 2
## 1497 The Wild Party 3
## 1498 The Wind in the Willows 3
## 1499 The Winslow Boy 2
## 1500 The Witches 2
## 1501 The Wizard of Gore 2
## 1502 The Wizard of Oz 2
## 1503 The Woman in Black 2
## 1504 The Women 2
## 1505 The Wrecking Crew 2
## 1506 The Wrong Girl 2
## 1507 The Wrong Man 2
## 1508 The Yearling 2
## 1509 Thick as Thieves 2
## 1510 Thief 3
## 1511 Thin Ice 2
## 1512 Thirst 3
## 1513 This Land Is Mine 2
## 1514 Three Men in a Boat 2
## 1515 Thunderstruck 2
## 1516 Timecode 2
## 1517 Tinker Tailor Soldier Spy 2
## 1518 Titanic 3
## 1519 To Be or Not to Be 2
## 1520 To the Ends of the Earth 2
## 1521 Tobruk 2
## 1522 Tom Sawyer 3
## 1523 Tomboy 2
## 1524 Too Hot to Handle 2
## 1525 Topaze 2
## 1526 Tormented 2
## 1527 Total Recall 2
## 1528 Toy Soldiers 2
## 1529 Tracks 3
## 1530 Trance 2
## 1531 Trapped 4
## 1532 Trash 2
## 1533 Trauma 2
## 1534 Treading Water 2
## 1535 Treasure Island 6
## 1536 Trespass 2
## 1537 Trick 2
## 1538 Trick or Treat 2
## 1539 Triple Trouble 2
## 1540 True Blue 2
## 1541 True Crime 2
## 1542 True Grit 2
## 1543 Trumbo 2
## 1544 Trust 2
## 1545 Truth 2
## 1546 Tsuma 2
## 1547 Tumbledown 3
## 1548 Tuntematon sotilas 2
## 1549 Turkey Shoot 2
## 1550 Tusk 2
## 1551 Twilight 2
## 1552 Twist 2
## 1553 Twist of Faith 2
## 1554 Twisted 3
## 1555 Twister 2
## 1556 Two of a Kind 2
## 1557 Tyson 2
## 1558 Under Suspicion 2
## 1559 Under the Gun 3
## 1560 Under the Skin 2
## 1561 Underground 2
## 1562 Undertow 2
## 1563 Underworld 3
## 1564 Unfaithfully Yours 2
## 1565 Unforgettable 2
## 1566 United 2
## 1567 Universal Soldier 2
## 1568 Unknown 2
## 1569 Unmade Beds 2
## 1570 Unstoppable 2
## 1571 Valentino 2
## 1572 Valerie 2
## 1573 Vampires 2
## 1574 Vendetta 2
## 1575 Venom 2
## 1576 Vice 2
## 1577 Vice Squad 2
## 1578 Victim 3
## 1579 Victoria 2
## 1580 Village of the Damned 2
## 1581 Virus 2
## 1582 Viva 2
## 1583 Walker 2
## 1584 Walking Tall 2
## 1585 Walter 2
## 1586 Wanderlust 2
## 1587 Wanted 2
## 1588 War and Peace 3
## 1589 Warlock 2
## 1590 Water 2
## 1591 Waterloo Bridge 2
## 1592 We're No Angels 2
## 1593 Weekend of a Champion 2
## 1594 Welcome 2
## 1595 Welcome to the Jungle 2
## 1596 Western 2
## 1597 What Price Glory 2
## 1598 When Ladies Meet 2
## 1599 When a Stranger Calls 2
## 1600 When in Rome 2
## 1601 When the Bough Breaks 2
## 1602 Where the Heart Is 2
## 1603 While the City Sleeps 2
## 1604 Whiplash 3
## 1605 Whistle and I'll Come to You 2
## 1606 Wild 2
## 1607 Wild Bill 2
## 1608 Willard 2
## 1609 Wilson 2
## 1610 Wind 2
## 1611 Wish You Were Here 2
## 1612 Witch Hunt 2
## 1613 Witchcraft 2
## 1614 Without Warning 2
## 1615 Wolf 2
## 1616 Wolves 3
## 1617 Women in Love 2
## 1618 Wonder Woman 3
## 1619 Wonderland 3
## 1620 World Without End 2
## 1621 Woyzeck 2
## 1622 Wuthering Heights 6
## 1623 Xue di zi 2
## 1624 Youngblood 2
## 1625 Zero 2
## 1626 Zig Zag 2
## 1627 Zodiac 2
## 1628 Zolushka 2
## 1629 Zoom 2
## 1630 Zulu 2
## 1631 [{'iso_639_1': 'en', 'name': 'English'}] 2
## 1632 Долгая счастливая жизнь 2
## 1633 Мастер и Маргарита 2
## 1634 Обыкновенное чудо 2
## 1635 Окраина 2
## 1636 Русалка 2
## 1637 Снежная королева 2
## 1638 Солярис 2
## 1639 Сталинград 2
## 1640 修羅雪姫 2
## 1641 倩女幽魂 2
## 1642 劇場版ポケットモンスター セレビィ 時を越えた遭遇(であい) 2
## 1643 十三人の刺客 2
## 1644 座頭市 2
## 1645 怪談 2
## 1646 日本のいちばん長い日 2
## 1647 時をかける少女 3
## 1648 楢山節考 2
## 1649 野火 2
## 1650 魔女の宅急便 2
## 1651 하녀 2
Partial duplicates exist where movies share the same id, imdb_id or original_title but differ in one or more columns. For original_title, some movies share a name while their producer, runtime, year and overview are completely different, which means they are different movies.
For id and imdb_id, filtering the affected rows in the movies data frame showed that all columns matched except popularity. This could happen if the same movie was registered twice by accident at different times, causing the popularity to shift. To solve this problem, the maximum popularity will be used to merge the duplicates. Further action will be taken depending on the result.
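The comparison described above can be sketched as follows. This is only an illustration on a toy tibble: the rows, the values and the helper names `dup_keys`, `varying` and `varying_cols` are made up here and are not taken from the report. The idea is to count the distinct values of every column inside each duplicated imdb_id group and then list the columns that vary.

```r
library(dplyr)

# Toy data: two imdb_id values are duplicated, only popularity differs
toy <- tibble(
  imdb_id    = c("tt01", "tt01", "tt02", "tt03", "tt03"),
  title      = c("A", "A", "B", "C", "C"),
  runtime    = c(90, 90, 100, 80, 80),
  popularity = c(1.5, 2.0, 3.0, 0.4, 0.9)
)

# Keys that appear more than once
dup_keys <- toy %>% count(imdb_id) %>% filter(n > 1)

# For the duplicated keys, flag every column that takes more than one value
varying <- toy %>%
  semi_join(dup_keys, by = "imdb_id") %>%
  group_by(imdb_id) %>%
  summarise(across(everything(), n_distinct), .groups = "drop") %>%
  select(-imdb_id) %>%
  summarise(across(everything(), ~ any(.x > 1)))

# Names of the columns that differ inside at least one duplicated group
varying_cols <- names(varying)[unlist(varying)]
print(varying_cols)
```

If only popularity is listed, keeping one row per key loses no other information.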
# Keep the row with the maximum popularity for each imdb_id and drop its duplicates
movies <- movies %>% group_by(imdb_id) %>% slice_max(popularity_max) %>% ungroup()
## # A tibble: 0 × 2
## # ℹ 2 variables: id <chr>, n <int>
## # A tibble: 1 × 2
## imdb_id n
## <chr> <int>
## 1 0 3
## # A tibble: 1,638 × 2
## original_title n
## <chr> <int>
## 1 12 Angry Men 2
## 2 20,000 Leagues Under the Sea 4
## 3 2:22 2
## 4 3:10 to Yuma 2
## 5 8 2
## 6 9 2
## 7 A Bucket of Blood 2
## 8 A Christmas Carol 7
## 9 A Dangerous Place 2
## 10 A Foreign Affair 2
## # ℹ 1,628 more rows
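One detail of `slice_max()` is worth keeping in mind when it is used for deduplication: by default it keeps every row tied for the maximum, so a group can still return more than one row. The sketch below uses an invented toy tibble to show the default and the `with_ties = FALSE` variant. It does not claim that ties explain any result in this report.

```r
library(dplyr)

# Toy data: tt02 has two rows with the same popularity (a tie)
toy <- tibble(
  imdb_id    = c("tt01", "tt01", "tt02", "tt02"),
  popularity = c(1.5, 2.0, 0.7, 0.7)
)

# Default behaviour: tied rows are all kept
with_ties_kept <- toy %>%
  group_by(imdb_id) %>%
  slice_max(popularity) %>%
  ungroup()

# Variant: exactly one row per key, ties broken by row order
one_per_key <- toy %>%
  group_by(imdb_id) %>%
  slice_max(popularity, n = 1, with_ties = FALSE) %>%
  ungroup()

print(nrow(with_ties_kept))
print(nrow(one_per_key))
```

Whether to break ties this way is a judgment call, since it picks one of the tied rows arbitrarily.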
Most of the partial duplicates for the id columns have been eliminated. Only three rows still share an imdb_id, and that value ("0") is likely invalid.
## # A tibble: 3 × 38
## adult budget homepage id imdb_id original_language original_title overview
## <lgl> <dbl> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 NA NA [{'iso_3… 1997… 0 104.0 [{'iso_639_1'… Released
## 2 NA NA [{'iso_3… 2012… 0 68.0 [{'iso_639_1'… Released
## 3 NA NA [{'iso_3… 2014… 0 82.0 [{'iso_639_1'… Released
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## # release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## # tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## # vote_count <int>, id_collection <chr>, name_collection <chr>,
## # poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## # genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## # country1_language <fct>, country2_language <fct>, …
Looking at the remaining duplicates, most of their information is missing and the rest is out of place, so properly adjusting these rows would be difficult. With their imdb_id missing, entering the information manually would also be hard. Given these issues and the small share of the dataset these rows represent, the three rows are deleted in their entirety.
## # A tibble: 0 × 2
## # ℹ 2 variables: imdb_id <chr>, n <int>
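The report does not show the code used for this deletion. A minimal sketch of one way to do it, assuming the invalid rows are the ones whose imdb_id equals "0", is given below on an invented toy tibble; the names `cleaned` and `remaining` are hypothetical.

```r
library(dplyr)

# Toy data: three rows carry the invalid imdb_id "0"
toy <- tibble(
  imdb_id        = c("tt0000001", "0", "tt0000003", "0", "0"),
  original_title = c("Movie A", "broken 1", "Movie B", "broken 2", "broken 3")
)

# Drop the rows with the invalid key
cleaned <- toy %>% filter(imdb_id != "0")

# Re-run the duplicate check: no imdb_id should appear more than once
remaining <- cleaned %>% count(imdb_id) %>% filter(n > 1)

print(nrow(cleaned))
print(nrow(remaining))
```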
Although partial duplicates remain for original_title, they are kept in the dataset. When checking these rows, there were significant differences in all the columns, and each row has its own id and imdb_id. Even if the title suggests they are the same movie, they are completely different in all other aspects, so these rows must be treated as different movies and not as duplicates.
Eligible variables were converted to factors in the previous steps, once the columns had been properly cleaned.
## Rows: 45,417
## Columns: 38
## $ adult <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ budget <dbl> 4224579, 4224579, 4224579, 4224579, 4224579, …
## $ homepage <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ id <chr> "15257", "16612", "88013", "16624", "105158",…
## $ imdb_id <chr> "", "tt0000001", "tt0000003", "tt0000005", "t…
## $ original_language <fct> en, en, fr, xx, en, fr, es, fr, fr, fr, fr, f…
## $ original_title <chr> "Hulk vs. Wolverine", "Carmencita", "Pauvre P…
## $ overview <chr> "Department H sends in Wolverine to track dow…
## $ popularity <dbl> 5.539197, 1.273072, 0.673164, 1.061591, 0.312…
## $ poster_path <chr> "/dXjbsjVkpykJECOO0kgThsipSYP.jpg", "/6QJowxF…
## $ release_date <date> 2009-01-27, 1894-03-14, 1892-10-28, 1893-05-…
## $ revenue <dbl> 11209349, 11209349, 11209349, 11209349, 11209…
## $ runtime <dbl> 38, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1,…
## $ status <fct> Released, Released, Released, Released, Relea…
## $ tagline <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ title <chr> "Hulk vs. Wolverine", "Carmencita", "Poor Pie…
## $ video <chr> "False", "False", "False", "False", "False", …
## $ vote_average <dbl> 6.8, 4.9, 6.1, 5.8, 4.7, 6.2, 6.9, 7.0, 5.3, …
## $ vote_count <int> 48, 18, 19, 19, 12, 52, 87, 44, 12, 22, 17, 2…
## $ id_collection <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ name_collection <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ poster_path_collection <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ backdrop_path_collection <chr> "", "", "", "", "", "", "", "", "", "", "", "…
## $ genre1 <fct> Animation, Documentary, Comedy, Drama, Docume…
## $ genre2 <fct> Action, , Animation, , , , , , , , , , Horror…
## $ genre3 <fct> Science Fiction, , , , , , , , , , , , , , , …
## $ country1 <fct> United States of America, United States of Am…
## $ country2 <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ country3 <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ country1_language <fct> English, No Language, No Language, No Languag…
## $ country2_language <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ country3_language <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ company1 <fct> Marvel Studios, Edison Manufacturing Company,…
## $ company2 <fct> , , , , , , , , , , , , , , Star Film Company…
## $ company3 <fct> , , , , , , , , , , , , , , , , , , , , , , ,…
## $ budget_original <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ revenue_original <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ popularity_max <dbl> 5.539197, 1.273072, 0.673164, 1.061591, 0.312…
The following variables will be considered as factors:
* original_language
* status
* genre1-3
* company1-3
* country1-3
* country1-3_language
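A conversion like the one listed above could be written in a single `mutate()` call. The sketch below is an assumption about how this might look, not the report's actual code. It uses a small invented tibble and selects the numbered columns with a regular expression.

```r
library(dplyr)

# Toy data with a subset of the columns named in the list above
toy <- tibble(
  original_language = c("en", "fr"),
  status            = c("Released", "Released"),
  genre1            = c("Drama", "Comedy"),
  genre2            = c("", "Romance"),
  title             = c("A", "B")
)

# Columns named explicitly, plus the numbered genre columns matched by pattern
factor_cols <- c("original_language", "status")

toy <- toy %>%
  mutate(across(c(all_of(factor_cols), matches("^genre[1-3]$")), as.factor))

print(sapply(toy, class))
```

The same pattern approach would extend to the company, country and country language columns.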
In this stage, the factor levels of each variable are explored in order to find inconsistencies or errors in its available categories.
# Looking for the levels of a factor, droplevels will ensure that the factor only takes into account the levels that still exist in the dataset.
levels(droplevels(movies$original_language))
## [1] "" "ab" "af" "am" "ar" "ay" "bg" "bm" "bn" "bo" "bs" "ca" "cn" "cs" "cy"
## [16] "da" "de" "el" "en" "eo" "es" "et" "eu" "fa" "fi" "fr" "fy" "gl" "he" "hi"
## [31] "hr" "hu" "hy" "id" "is" "it" "iu" "ja" "jv" "ka" "kk" "kn" "ko" "ku" "ky"
## [46] "la" "lb" "lo" "lt" "lv" "mk" "ml" "mn" "mr" "ms" "mt" "nb" "ne" "nl" "no"
## [61] "pa" "pl" "ps" "pt" "qu" "ro" "ru" "rw" "sh" "si" "sk" "sl" "sm" "sq" "sr"
## [76] "sv" "ta" "te" "tg" "th" "tl" "tr" "uk" "ur" "uz" "vi" "wo" "xx" "zh" "zu"
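To illustrate why `droplevels()` is used in the call above, the following base R sketch builds a small factor that declares a level no observation uses. The values are invented for the example.

```r
# A factor that declares the level "xx" although no observation uses it
lang <- factor(c("en", "fr", "en"), levels = c("en", "fr", "xx"))

# levels() alone reports every declared level, used or not
all_levels <- levels(lang)

# droplevels() removes the unused level before the levels are listed
used_levels <- levels(droplevels(lang))

print(all_levels)
print(used_levels)
```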
Examining the levels of original_language, only the blank level needs further investigation; all remaining factor levels are valid and share the same formatting.
## # A tibble: 90 × 2
## original_language n
## <fct> <int>
## 1 "" 11
## 2 "ab" 10
## 3 "af" 2
## 4 "am" 2
## 5 "ar" 39
## 6 "ay" 1
## 7 "bg" 10
## 8 "bm" 3
## 9 "bn" 29
## 10 "bo" 2
## # ℹ 80 more rows
There are 11 movies in which the column original_language does not contain information.
## # A tibble: 11 × 38
## adult budget homepage id imdb_id original_language original_title overview
## <lgl> <dbl> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 FALSE 4.22e6 "" 3591… tt0053… "" 13 Fighting M… "A grou…
## 2 FALSE 4.22e6 "" 1470… tt0122… "" Lambchops "George…
## 3 FALSE 4.22e6 "" 1444… tt0154… "" Annabelle Ser… "Two da…
## 4 FALSE 4.22e6 "" 1044… tt0223… "" La prise de T… "Three …
## 5 FALSE 4.22e6 "" 2570… tt0225… "" Bajaja "The fi…
## 6 FALSE 4.22e6 "" 3804… tt0298… "" Lettre d'une … ""
## 7 FALSE 4.22e6 "" 2831… tt0429… "" Shadowing the… "Docume…
## 8 FALSE 4.22e6 "" 1039… tt0838… "" Unfinished Sky "An Out…
## 9 FALSE 4.22e6 "" 3327… tt4432… "" Song of Lahore "Until …
## 10 FALSE 4.22e6 "" 3810… tt5333… "" Garn "The tr…
## 11 FALSE 4.22e6 "" 3815… tt5376… "" WiNWiN "Americ…
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## # release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## # tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## # vote_count <int>, id_collection <chr>, name_collection <chr>,
## # poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## # genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## # country1_language <fct>, country2_language <fct>, …
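Since the same blank-level check is repeated for several factors, it could also be run for all factor columns at once. The sketch below is a hypothetical shortcut rather than the report's method, shown on an invented toy tibble.

```r
library(dplyr)

# Toy data: two factor columns with blank entries and one character column
toy <- tibble(
  original_language = factor(c("en", "", "fr", "")),
  status            = factor(c("Released", "Released", "", "Rumored")),
  title             = c("A", "B", "C", "D")
)

# Count blank values in every factor column; non-factor columns are skipped
blank_counts <- toy %>%
  summarise(across(where(is.factor), ~ sum(.x == "")))

print(blank_counts)
```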
# Looking for the levels of a factor, droplevels will ensure that the factor only takes into account the levels that still exist in the dataset.
levels(droplevels(movies$status))
## [1] "" "Canceled" "In Production" "Planned"
## [5] "Post Production" "Released" "Rumored"
The same issue found in original_language, a blank level, is present in
this variable, with no other inconsistencies encountered.
## # A tibble: 7 × 2
## status n
## <fct> <int>
## 1 "" 84
## 2 "Canceled" 2
## 3 "In Production" 20
## 4 "Planned" 15
## 5 "Post Production" 98
## 6 "Released" 44971
## 7 "Rumored" 227
In total, 84 rows contain a blank status.
## # A tibble: 84 × 38
## adult budget homepage id imdb_id original_language original_title overview
## <lgl> <dbl> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 FALSE 4.22e6 "" 42496 tt0067… en Millhouse "Emile …
## 2 FALSE 4.22e6 "" 57868 tt0071… en The Autobiogr… "In Feb…
## 3 FALSE 4.22e6 "" 46770 tt0094… en Sur " "
## 4 FALSE 4.22e6 "" 41934 tt0095… en Heavy Petting "HEAVY …
## 5 FALSE 4.22e6 "" 41932 tt0097… en Easy Wheels "A grou…
## 6 FALSE 4.22e6 "" 41811 tt0099… en Eating "At a s…
## 7 FALSE 4.22e6 "" 77314 tt0101… fr The Cabinet o… ""
## 8 FALSE 4.22e6 "" 1236… tt0104… en Dream Deceive… "A chil…
## 9 FALSE 4.22e6 "" 1242… tt0106… en Anna: Ot shes… "Direct…
## 10 FALSE 4.22e6 "" 71687 tt0107… en My Life's in … "No gir…
## # ℹ 74 more rows
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## # release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## # tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## # vote_count <int>, id_collection <chr>, name_collection <chr>,
## # poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## # genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, …
# Looking for the levels of a factor, droplevels will ensure that the factor only takes into account the levels that still exist in the dataset.
levels(droplevels(movies$genre1))
## [1] "" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
## [1] "" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
## [1] "" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
Genres are properly described in the factor levels. The only
observations are the blank values and a two-character white space at the
start of each level, probably left by the separation process applied
to the columns in JSON format.
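Leading white space in factor levels can be removed without touching the underlying observations by relabelling the levels. The sketch below assumes forcats is available and uses invented values with a two-space prefix. It is one possible fix, not necessarily the one applied in this report.

```r
library(forcats)

# Toy factor: non-blank levels carry a two-space prefix
genre <- factor(c("  Drama", "  Comedy", "", "  Drama"))

# fct_relabel() applies a function to the levels; trimws() strips white space
trimmed <- fct_relabel(genre, trimws)

print(levels(trimmed))
```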
## # A tibble: 21 × 2
## genre1 n
## <fct> <int>
## 1 "" 2437
## 2 "Action" 4487
## 3 "Adventure" 1508
## 4 "Animation" 1123
## 5 "Comedy" 8815
## 6 "Crime" 1683
## 7 "Documentary" 3412
## 8 "Drama" 11952
## 9 "Family" 524
## 10 "Fantasy" 702
## # ℹ 11 more rows
## # A tibble: 21 × 2
## genre2 n
## <fct> <int>
## 1 "" 16988
## 2 "Action" 1544
## 3 "Adventure" 1412
## 4 "Animation" 617
## 5 "Comedy" 3262
## 6 "Crime" 1428
## 7 "Documentary" 469
## 8 "Drama" 6301
## 9 "Family" 1109
## 10 "Fantasy" 764
## # ℹ 11 more rows
## # A tibble: 21 × 2
## genre3 n
## <fct> <int>
## 1 "" 31454
## 2 "Action" 451
## 3 "Adventure" 422
## 4 "Animation" 171
## 5 "Comedy" 911
## 6 "Crime" 852
## 7 "Documentary" 38
## 8 "Drama" 1673
## 9 "Family" 756
## 10 "Fantasy" 538
## # ℹ 11 more rows
There are 2437 rows in the dataset where a movie does not
have a defined genre. For the other two variables the number increases,
but we will only focus on the undefined values in genre1, as it is not
necessary for a movie to have more than one genre.
# Looking for the levels of a factor, droplevels will ensure that the factor only takes into account the levels that still exist in the dataset. We will only use the first 50 levels for demonstration purposes.
levels(droplevels(movies$company1)) %>% head(50)
## [1] "" "01 Distribution"
## [3] "1 85 Films" "100 Halal"
## [5] "100 Bares" "101st Street Films"
## [7] "10dB Inc" "10th Hole Productions"
## [9] "11" "1201"
## [11] "120dB Films" "13 All Stars LLC"
## [13] "14 Luglio Cinematografica" "14 Reels Entertainment"
## [15] "1492 Pictures" "1818"
## [17] "1821 Pictures" "185 Trax"
## [19] "185º Equator" "19 Entertainment"
## [21] "1984 Private Defense Contractors" "2 4 7 Films"
## [23] "2 Man Production" "2 Player Productions"
## [25] "2 Smooth Film Productions" "2 Team Productions"
## [27] "20 Steps Productions" "20ten Media"
## [29] "20th Century Fox" "20th Century Fox Film Corporation"
## [31] "20th Century Fox Home Entertainment" "20th Century Fox Russia"
## [33] "20th Century Fox Television" "20th Century Pictures"
## [35] "21 Laps Entertainment" "21 One Productions"
## [37] "21st Century Film Corporation" "22 Dicembre"
## [39] "23 5 Filmproduktion" "23 Giugno"
## [41] "24 7 Films" "2425 PRODUCTION"
## [43] "26 Films" "27 Films Production"
## [45] "27 Productions" "29 fevralya"
## [47] "2929 Productions" "2afilm"
## [49] "2B Films" "2DS Productions"
# The other variables are not shown for presentation purposes, but this is the code that would show their levels
#levels(droplevels(movies$company2))
#levels(droplevels(movies$company3))
Looking at the factor levels for these categories, it is clear that many companies are involved and therefore the factors have many levels.
## # A tibble: 10,590 × 2
## company1 n
## <fct> <int>
## 1 "" 11861
## 2 "Paramount Pictures" 996
## 3 "Metro Goldwyn Mayer MGM" 851
## 4 "Twentieth Century Fox Film Corporation" 780
## 5 "Warner Bros" 757
## 6 "Universal Pictures" 754
## 7 "Columbia Pictures" 429
## 8 "Columbia Pictures Corporation" 401
## 9 "RKO Radio Pictures" 290
## 10 "United Artists" 272
## # ℹ 10,580 more rows
## # A tibble: 9,040 × 2
## company2 n
## <fct> <int>
## 1 "" 28428
## 2 "Warner Bros" 270
## 3 "Metro Goldwyn Mayer MGM" 149
## 4 "Canal+" 124
## 5 "Touchstone Pictures" 75
## 6 "Universal Pictures" 71
## 7 "TF1 Films Production" 52
## 8 "StudioCanal" 47
## 9 "Twentieth Century Fox Film Corporation" 45
## 10 "Amblin Entertainment" 43
## # ℹ 9,030 more rows
## # A tibble: 5,980 × 2
## company3 n
## <fct> <int>
## 1 "" 36385
## 2 "Warner Bros" 130
## 3 "Canal+" 109
## 4 "Metro Goldwyn Mayer MGM" 44
## 5 "Relativity Media" 42
## 6 "TF1 Films Production" 29
## 7 "Touchstone Pictures" 27
## 8 "Working Title Films" 24
## 9 "Centre National de la Cinématographie CNC" 20
## 10 "Film4" 20
## # ℹ 5,970 more rows
There are two ways to proceed with this issue. The first is to keep as levels only the companies that have produced the most movies and group the smaller companies as “Others”. The second is to drop this variable as a factor and change its data type to string. We consider the first solution the way to go, as it would still allow us to analyze the companies.
Other issues are the rows that contain a blank and the two-character white space before each company name, which should also be corrected.
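The first option, keeping the most frequent companies and grouping the rest, maps onto `fct_lump_n()` from forcats. The sketch below uses invented studio names and an arbitrary `n = 2`. The cutoff that suits the real data is an open choice that the report does not specify.

```r
library(forcats)

# Toy factor: two frequent studios and two that appear only once
company <- factor(c("Studio A", "Studio A", "Studio A",
                    "Studio B", "Studio B",
                    "Studio C", "Studio D"))

# Keep the 2 most frequent levels and lump the others into "Other"
lumped <- fct_lump_n(company, n = 2, other_level = "Other")

print(table(lumped))
```

In the real columns the blank level is the most frequent one, so it would need to be handled before lumping or it would take one of the retained slots.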
# Looking for the levels of a factor, droplevels will ensure that the factor only takes into account the levels that still exist in the dataset.
levels(droplevels(movies$country1_language))
## [1] "" "Afrikaans" "Azərbaycan" "Bahasa indonesia"
## [5] "Bahasa melayu" "Bamanankan" "Bokmål" "Bosanski"
## [9] "Català" "Český" "Cymraeg" "Dansk"
## [13] "Deutsch" "Eesti" "English" "Español"
## [17] "Esperanto" "euskera" "Français" "Fulfulde"
## [21] "Gaeilge" "Galego" "Hausa" "Hrvatski"
## [25] "isiZulu" "Íslenska" "Italiano" "Kinyarwanda"
## [29] "Kiswahili" "Latin" "Latviešu" "Lietuvi x9akai"
## [33] "Magyar" "Nederlands" "No Language" "Norsk"
## [37] "Polski" "Português" "Pусский" "Română"
## [41] "shqip" "Slovenčina" "Slovenščina" "Somali"
## [45] "Srpski" "suomi" "svenska" "Tiếng Việt"
## [49] "Türkçe" "Wolof" "ελληνικά" "беларуская мова"
## [53] "български език" "қазақ" "Український" "ქართული"
## [57] "עִבְרִית" "اردو" "العربية" "پښتو"
## [61] "فارسی" "हिन्दी" "বাংলা" "ਪੰਜਾਬੀ"
## [65] "தமிழ்" "తెలుగు" "ภาษาไทย" "한국어 조선말"
## [69] "广州话 廣州話" "日本語" "普通话"
## [1] "" "Afrikaans" "Bahasa indonesia" "Bahasa melayu"
## [5] "Bamanankan" "Bosanski" "Català" "Český"
## [9] "Cymraeg" "Dansk" "Deutsch" "Eesti"
## [13] "English" "Español" "Esperanto" "Français"
## [17] "Fulfulde" "Gaeilge" "Galego" "Hrvatski"
## [21] "isiZulu" "Íslenska" "Italiano" "Kinyarwanda"
## [25] "Kiswahili" "Latin" "Latviešu" "Lietuvi x9akai"
## [29] "Magyar" "Malti" "Nederlands" "No Language"
## [33] "Norsk" "ozbek" "Polski" "Português"
## [37] "Pусский" "Română" "shqip" "Slovenčina"
## [41] "Slovenščina" "Somali" "Srpski" "suomi"
## [45] "svenska" "Tiếng Việt" "Türkçe" "Wolof"
## [49] "ελληνικά" "български език" "қазақ" "Український"
## [53] "ქართული" "עִבְרִית" "اردو" "العربية"
## [57] "پښتو" "فارسی" "हिन्दी" "বাংলা"
## [61] "ਪੰਜਾਬੀ" "தமிழ்" "తెలుగు" "ภาษาไทย"
## [65] "한국어 조선말" "广州话 廣州話" "日本語" "普通话"
## [1] "" "Afrikaans" "Bahasa indonesia" "Bahasa melayu"
## [5] "Bamanankan" "Bosanski" "Český" "Cymraeg"
## [9] "Dansk" "Deutsch" "Eesti" "English"
## [13] "Español" "Esperanto" "euskera" "Français"
## [17] "Gaeilge" "Hrvatski" "isiZulu" "Íslenska"
## [21] "Italiano" "Kiswahili" "Latin" "Latviešu"
## [25] "Lietuvi x9akai" "Magyar" "Nederlands" "Norsk"
## [29] "Polski" "Português" "Pусский" "Română"
## [33] "shqip" "Slovenčina" "Slovenščina" "Somali"
## [37] "Srpski" "suomi" "svenska" "Tiếng Việt"
## [41] "Türkçe" "Wolof" "ελληνικά" "български език"
## [45] "қазақ" "Український" "ქართული" "עִבְרִית"
## [49] "اردو" "العربية" "پښتو" "فارسی"
## [53] "हिन्दी" "ਪੰਜਾਬੀ" "தமிழ்" "తెలుగు"
## [57] "ภาษาไทย" "한국어 조선말" "广州话 廣州話" "日本語"
## [61] "普通话"
As with other variables, the factor levels start with two blank characters and include one level without characters. However, the biggest issue is that the language names are written in their own language, which may complicate our efforts to analyze language-related variables, so a solution could be to translate the language names to English.
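A translation of the level names could be driven by a lookup vector passed to `fct_recode()`. The sketch below covers only three of the levels listed above, and the lookup `to_english` is a hypothetical name introduced for the example. A full mapping would have to be written out for every level.

```r
library(forcats)

# Toy factor using three of the native-language level names
language <- factor(
  c("Deutsch", "Italiano", "Nederlands", "Deutsch"),
  levels = c("Deutsch", "Italiano", "Nederlands")
)

# Lookup in the form new_name = "old_name", as fct_recode() expects
to_english <- c(German = "Deutsch", Italian = "Italiano", Dutch = "Nederlands")

# Splice the lookup into fct_recode() with !!!
translated <- fct_recode(language, !!!to_english)

print(levels(translated))
```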
## # A tibble: 71 × 2
## country1_language n
## <fct> <int>
## 1 "" 4050
## 2 "Afrikaans" 22
## 3 "Azərbaycan" 4
## 4 "Bahasa indonesia" 26
## 5 "Bahasa melayu" 5
## 6 "Bamanankan" 4
## 7 "Bokmål" 3
## 8 "Bosanski" 25
## 9 "Català" 31
## 10 "Český" 263
## # ℹ 61 more rows
## # A tibble: 68 × 2
## country2_language n
## <fct> <int>
## 1 "" 37667
## 2 "Afrikaans" 4
## 3 "Bahasa indonesia" 9
## 4 "Bahasa melayu" 4
## 5 "Bamanankan" 1
## 6 "Bosanski" 3
## 7 "Català" 5
## 8 "Český" 14
## 9 "Cymraeg" 4
## 10 "Dansk" 18
## # ℹ 58 more rows
## # A tibble: 61 × 2
## country3_language n
## <fct> <int>
## 1 "" 42970
## 2 "Afrikaans" 2
## 3 "Bahasa indonesia" 2
## 4 "Bahasa melayu" 6
## 5 "Bamanankan" 1
## 6 "Bosanski" 2
## 7 "Český" 2
## 8 "Cymraeg" 1
## 9 "Dansk" 5
## 10 "Deutsch" 328
## # ℹ 51 more rows
Only the blank level of the first country_language variable
(country1_language) must be fixed, as it is not necessary for a movie to
have more than one language available.
By using functions on factors, we were able to detect inconsistencies, errors and opportunities to improve the legibility of the data by adjusting the factor levels. In this section we fix all the categorical data issues that were detected previously.
## [1] "" "ab" "af" "am" "ar" "ay" "bg" "bm" "bn" "bo" "bs" "ca" "cn" "cs" "cy"
## [16] "da" "de" "el" "en" "eo" "es" "et" "eu" "fa" "fi" "fr" "fy" "gl" "he" "hi"
## [31] "hr" "hu" "hy" "id" "is" "it" "iu" "ja" "jv" "ka" "kk" "kn" "ko" "ku" "ky"
## [46] "la" "lb" "lo" "lt" "lv" "mk" "ml" "mn" "mr" "ms" "mt" "nb" "ne" "nl" "no"
## [61] "pa" "pl" "ps" "pt" "qu" "ro" "ru" "rw" "sh" "si" "sk" "sl" "sm" "sq" "sr"
## [76] "sv" "ta" "te" "tg" "th" "tl" "tr" "uk" "ur" "uz" "vi" "wo" "xx" "zh" "zu"
## # A tibble: 11 × 38
## adult budget homepage id imdb_id original_language original_title overview
## <lgl> <dbl> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 FALSE 4.22e6 "" 3591… tt0053… "" 13 Fighting M… "A grou…
## 2 FALSE 4.22e6 "" 1470… tt0122… "" Lambchops "George…
## 3 FALSE 4.22e6 "" 1444… tt0154… "" Annabelle Ser… "Two da…
## 4 FALSE 4.22e6 "" 1044… tt0223… "" La prise de T… "Three …
## 5 FALSE 4.22e6 "" 2570… tt0225… "" Bajaja "The fi…
## 6 FALSE 4.22e6 "" 3804… tt0298… "" Lettre d'une … ""
## 7 FALSE 4.22e6 "" 2831… tt0429… "" Shadowing the… "Docume…
## 8 FALSE 4.22e6 "" 1039… tt0838… "" Unfinished Sky "An Out…
## 9 FALSE 4.22e6 "" 3327… tt4432… "" Song of Lahore "Until …
## 10 FALSE 4.22e6 "" 3810… tt5333… "" Garn "The tr…
## 11 FALSE 4.22e6 "" 3815… tt5376… "" WiNWiN "Americ…
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## # release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## # tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## # vote_count <int>, id_collection <chr>, name_collection <chr>,
## # poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## # genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## # country1_language <fct>, country2_language <fct>, …
There are only 11 rows in which the original language is not present.
# Add rows without original language to the category "xx"
movies <- movies %>% mutate(original_language = fct_collapse(original_language, xx = c("xx","")))
## # A tibble: 0 × 38
## # ℹ 38 variables: adult <lgl>, budget <dbl>, homepage <chr>, id <chr>,
## # imdb_id <chr>, original_language <fct>, original_title <chr>,
## # overview <chr>, popularity <dbl>, poster_path <chr>, release_date <date>,
## # revenue <dbl>, runtime <dbl>, status <fct>, tagline <chr>, title <chr>,
## # video <chr>, vote_average <dbl>, vote_count <int>, id_collection <chr>,
## # name_collection <chr>, poster_path_collection <chr>,
## # backdrop_path_collection <chr>, genre1 <fct>, genre2 <fct>, genre3 <fct>, …
## # A tibble: 44 × 38
## adult budget homepage id imdb_id original_language original_title overview
## <lgl> <dbl> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 FALSE 4.22e6 "" 16624 tt0000… xx Blacksmith Sc… "Three …
## 2 FALSE 4.22e6 "" 1330… tt0000… xx Le manoir du … "A bat …
## 3 FALSE 4.22e6 "" 1323… tt0000… xx The '?' Motor… "A magi…
## 4 FALSE 4.22e6 "" 36208 tt0009… xx A Dog's Life "Poor C…
## 5 FALSE 4.22e6 "" 70804 tt0010… xx J'accuse! "The st…
## 6 FALSE 4.22e6 "" 47703 tt0013… xx Дневник Глумо… "Filmic…
## 7 FALSE 4.22e6 "" 42565 tt0018… xx Underworld "Boiste…
## 8 FALSE 4.22e6 "" 3591… tt0053… xx 13 Fighting M… "A grou…
## 9 FALSE 1.20e7 "" 62204 tt0082… xx La Guerre du … "A colo…
## 10 FALSE 4.22e6 "" 1237… tt0082… xx Junkopia "A shor…
## # ℹ 34 more rows
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## # release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## # tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## # vote_count <int>, id_collection <chr>, name_collection <chr>,
## # poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## # genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, …
The release_date variable is used to decide which status to give to movies without one.
## [1] "" "Canceled" "In Production" "Planned"
## [5] "Post Production" "Released" "Rumored"
## # A tibble: 84 × 38
## adult budget homepage id imdb_id original_language original_title overview
## <lgl> <dbl> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 FALSE 4.22e6 "" 42496 tt0067… en Millhouse "Emile …
## 2 FALSE 4.22e6 "" 57868 tt0071… en The Autobiogr… "In Feb…
## 3 FALSE 4.22e6 "" 46770 tt0094… en Sur " "
## 4 FALSE 4.22e6 "" 41934 tt0095… en Heavy Petting "HEAVY …
## 5 FALSE 4.22e6 "" 41932 tt0097… en Easy Wheels "A grou…
## 6 FALSE 4.22e6 "" 41811 tt0099… en Eating "At a s…
## 7 FALSE 4.22e6 "" 77314 tt0101… fr The Cabinet o… ""
## 8 FALSE 4.22e6 "" 1236… tt0104… en Dream Deceive… "A chil…
## 9 FALSE 4.22e6 "" 1242… tt0106… en Anna: Ot shes… "Direct…
## 10 FALSE 4.22e6 "" 71687 tt0107… en My Life's in … "No gir…
## # ℹ 74 more rows
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## # release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## # tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## # vote_count <int>, id_collection <chr>, name_collection <chr>,
## # poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## # genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, …
Most movies have a release date that has already passed, so movies with a release date before 2017 will have their status considered as “Released”.
# Assign movies with a release date to the level "Released"
movies <- movies %>%
  mutate(status = if_else(!is.na(release_date), fct_collapse(status, Released = c("Released", "")), status))
# Check if the movies without status and a release date now form part of "Released"
movies %>% count(status)
## # A tibble: 7 × 2
## status n
## <fct> <int>
## 1 "Released" 45051
## 2 "Canceled" 2
## 3 "In Production" 20
## 4 "Planned" 15
## 5 "Post Production" 98
## 6 "Rumored" 227
## 7 "" 4
## # A tibble: 4 × 38
## adult budget homepage id imdb_id original_language original_title overview
## <lgl> <dbl> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 FALSE 4.22e6 "" 82663 tt0113… en Midnight Man British…
## 2 FALSE 4.22e6 "" 94214 tt0210… en Jails, Hospit… Jails, …
## 3 FALSE 4.22e6 "http:/… 1226… tt2423… ja マルドゥック… Third f…
## 4 FALSE 4.22e6 "" 2492… tt2622… en Avalanche Sha… A group…
## # ℹ 30 more variables: popularity <dbl>, poster_path <chr>,
## # release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## # tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## # vote_count <int>, id_collection <chr>, name_collection <chr>,
## # poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## # genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## # country1_language <fct>, country2_language <fct>, …
Because only a few rows without a status remain, their imdb_id values were looked up directly to find their status. All of the remaining movies turned out to have been released as well, so we add them to the “Released” category.
# Add remaining movies to category "Released"
movies <- movies %>% mutate(status = fct_collapse(status, Released = c("Released", "")))
## # A tibble: 6 × 2
## status n
## <fct> <int>
## 1 Released 45055
## 2 Canceled 2
## 3 In Production 20
## 4 Planned 15
## 5 Post Production 98
## 6 Rumored 227
## [1] "Released" "Canceled" "In Production" "Planned"
## [5] "Post Production" "Rumored"
The “status” column now contains the proper categories and does not need any additional fixes.
## [1] "" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
## [1] "" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
## [1] "" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
## # A tibble: 21 × 2
## genre1 n
## <fct> <int>
## 1 "" 2437
## 2 "Action" 4487
## 3 "Adventure" 1508
## 4 "Animation" 1123
## 5 "Comedy" 8815
## 6 "Crime" 1683
## 7 "Documentary" 3412
## 8 "Drama" 11952
## 9 "Family" 524
## 10 "Fantasy" 702
## # ℹ 11 more rows
Because so many rows lack a genre, the values cannot be added manually in a
reasonable amount of time, and we do not have a way to extract large amounts
of data from IMDb. We are therefore going to put the rows without a genre in
a category called “Unspecified” for genre1. For genre2 and genre3, where a
movie does not need to have more than one genre, we are going to use the
term “NA” as the category name.
# Create the category "Unspecified"
movies <- movies %>% mutate(genre1 = fct_collapse(genre1, Unspecified = ""))
## [1] "Unspecified" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
## # A tibble: 21 × 2
## genre1 n
## <fct> <int>
## 1 Unspecified 2437
## 2 Action 4487
## 3 Adventure 1508
## 4 Animation 1123
## 5 Comedy 8815
## 6 Crime 1683
## 7 Documentary 3412
## 8 Drama 11952
## 9 Family 524
## 10 Fantasy 702
## # ℹ 11 more rows
# Eliminate white space inconsistency
movies <- movies %>% mutate(genre1 = str_trim(genre1))
movies <- movies %>% mutate(genre2 = str_trim(genre2))
movies <- movies %>% mutate(genre3 = str_trim(genre3))
# Reconvert variables to factor data type
movies <- movies %>% mutate(genre1 = as.factor(genre1))
movies <- movies %>% mutate(genre2 = as.factor(genre2))
movies <- movies %>% mutate(genre3 = as.factor(genre3))
## [1] "Action" "Adventure" "Animation" "Comedy"
## [5] "Crime" "Documentary" "Drama" "Family"
## [9] "Fantasy" "Foreign" "History" "Horror"
## [13] "Music" "Mystery" "Romance" "Science Fiction"
## [17] "Thriller" "TV Movie" "Unspecified" "War"
## [21] "Western"
## [1] "" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
## [1] "" "Action" "Adventure" "Animation"
## [5] "Comedy" "Crime" "Documentary" "Drama"
## [9] "Family" "Fantasy" "Foreign" "History"
## [13] "Horror" "Music" "Mystery" "Romance"
## [17] "Science Fiction" "Thriller" "TV Movie" "War"
## [21] "Western"
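Trimming returns a character vector, so converting it back with `as.factor()` rebuilds the levels in sorted order. This is why “Unspecified” moves from the first position to the nineteenth in the genre1 listing. A minimal sketch of that effect, using a made-up three-value factor (`genre` and its values are assumptions for illustration only):

```r
library(stringr)

# Hypothetical factor with stray whitespace and a custom level order
genre <- factor(
  c("Unspecified", " Drama", "Action "),
  levels = c("Unspecified", " Drama", "Action ")
)

# str_trim() works on text, so the result is a character vector
trimmed <- str_trim(as.character(genre))

# as.factor() on a character vector sorts the levels
genre_clean <- as.factor(trimmed)
levels(genre_clean)
```

If a specific level order matters later (for example in plots), it would have to be set again after this step.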
# Create a new column 'genre_count' to count the number of genres for each row
movies$genre_count <- rowSums(movies[, c("genre1", "genre2", "genre3")] != "")
I left the blank values in genre2 and genre3 unchanged so that the number of
genres per movie (1 or more) can be counted for later use. The genre
variables are now clean and ready for use in analysis.
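How the count behaves can be seen on a small example. The data frame below is made up for illustration (`toy` and its values are assumptions); the counting expression is the same one used for genre_count.

```r
# Hypothetical toy data: blank strings mark a missing genre
toy <- data.frame(
  genre1 = c("Drama", "Unspecified", "Action"),
  genre2 = c("Romance", "", "Adventure"),
  genre3 = c("", "", "Fantasy"),
  stringsAsFactors = TRUE
)

# Compare every cell with "", then sum the TRUE values per row
toy$genre_count <- rowSums(toy[, c("genre1", "genre2", "genre3")] != "")
toy$genre_count
```

Under this rule a row whose genre1 is “Unspecified” still counts as having one genre, which is worth keeping in mind when the count is used in analysis.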
## [1] "" "01 Distribution"
## [3] "1 85 Films" "100 Halal"
## [5] "100 Bares" "101st Street Films"
## [7] "10dB Inc" "10th Hole Productions"
## [9] "11" "1201"
## [11] "120dB Films" "13 All Stars LLC"
## [13] "14 Luglio Cinematografica" "14 Reels Entertainment"
## [15] "1492 Pictures" "1818"
## [17] "1821 Pictures" "185 Trax"
## [19] "185º Equator" "19 Entertainment"
## [21] "1984 Private Defense Contractors" "2 4 7 Films"
## [23] "2 Man Production" "2 Player Productions"
## [25] "2 Smooth Film Productions"
# Sort companies by amount of movies produced
company1_sort <- movies %>% count(company1) %>% arrange(desc(n))
company2_sort <- movies %>% count(company2) %>% arrange(desc(n))
company3_sort <- movies %>% count(company3) %>% arrange(desc(n))
# Get the 50 main companies plus the blank level for cases where the company is not specified
top_50_company1 <- company1_sort$company1[1:51]
top_50_company2 <- company2_sort$company2[1:51]
top_50_company3 <- company3_sort$company3[1:51]
# Move every company that is neither among the 50 biggest companies nor blank into the category "Other"
movies <- movies %>% mutate(company1 = fct_collapse(company1, "Other" = company1[!company1 %in% top_50_company1]))
movies <- movies %>% mutate(company2 = fct_collapse(company2, "Other" = company2[!company2 %in% top_50_company2]))
# Note: company3 is matched against top_50_company2 here, not top_50_company3
movies <- movies %>% mutate(company3 = fct_collapse(company3, "Other" = company3[!company3 %in% top_50_company2]))
## [1] ""
## [2] "Other"
## [3] "American International Pictures AIP"
## [4] "BBC Films"
## [5] "British Broadcasting Corporation BBC"
## [6] "Canal+"
## [7] "Channel Four Films"
## [8] "CJ Entertainment"
## [9] "Columbia Pictures"
## [10] "Columbia Pictures Corporation"
## [11] "DC Comics"
## [12] "DreamWorks SKG"
## [13] "First National Pictures"
## [14] "Fox Film Corporation"
## [15] "Fox Searchlight Pictures"
## [16] "France 2 Cinéma"
## [17] "Gaumont"
## [18] "Hammer Film Productions"
## [19] "Hollywood Pictures"
## [20] "Imagine Entertainment"
## [21] "Lions Gate Films"
## [22] "Lionsgate"
## [23] "Metro Goldwyn Mayer MGM"
## [24] "Miramax Films"
## [25] "Monogram Pictures"
## [26] "Mosfilm"
## [27] "New Line Cinema"
## [28] "New World Pictures"
## [29] "Nikkatsu"
## [30] "Nordisk Film"
## [31] "Orion Pictures"
## [32] "Paramount Pictures"
## [33] "Rai Cinema"
## [34] "Regency Enterprises"
## [35] "RKO Radio Pictures"
## [36] "Shaw Brothers"
## [37] "Shôchiku Eiga"
## [38] "StudioCanal"
## [39] "Summit Entertainment"
## [40] "The Rank Organisation"
## [41] "TLA Releasing"
## [42] "Toho Company"
## [43] "Touchstone Pictures"
## [44] "TriStar Pictures"
## [45] "Twentieth Century Fox Film Corporation"
## [46] "United Artists"
## [47] "Universal International Pictures UI"
## [48] "Universal Pictures"
## [49] "Village Roadshow Pictures"
## [50] "Walt Disney Pictures"
## [51] "Walt Disney Productions"
## [52] "Warner Bros"
## # A tibble: 52 × 2
## company1 n
## <fct> <int>
## 1 "Other" 24274
## 2 "" 11861
## 3 "Paramount Pictures" 996
## 4 "Metro Goldwyn Mayer MGM" 851
## 5 "Twentieth Century Fox Film Corporation" 780
## 6 "Warner Bros" 757
## 7 "Universal Pictures" 754
## 8 "Columbia Pictures" 429
## 9 "Columbia Pictures Corporation" 401
## 10 "RKO Radio Pictures" 290
## # ℹ 42 more rows
## [1] ""
## [2] "Other"
## [3] "Amblin Entertainment"
## [4] "American International Pictures AIP"
## [5] "BBC Films"
## [6] "Blumhouse Productions"
## [7] "British Broadcasting Corporation BBC"
## [8] "Canal+"
## [9] "Carolco Pictures"
## [10] "Castle Rock Entertainment"
## [11] "Columbia Pictures Corporation"
## [12] "Dimension Films"
## [13] "DreamWorks SKG"
## [14] "Dune Entertainment"
## [15] "Film i Väst"
## [16] "Film4"
## [17] "Focus Features"
## [18] "Globo Filmes"
## [19] "Happy Madison Productions"
## [20] "HBO Films"
## [21] "Hollywood Pictures"
## [22] "Lionsgate"
## [23] "M6 Films"
## [24] "Metro Goldwyn Mayer MGM"
## [25] "Millennium Films"
## [26] "Morgan Creek Productions"
## [27] "Nickelodeon Movies"
## [28] "Original Film"
## [29] "Pixar Animation Studios"
## [30] "PolyGram Filmed Entertainment"
## [31] "Rai Cinema"
## [32] "Regency Enterprises"
## [33] "Relativity Media"
## [34] "Revolution Studios"
## [35] "Scott Rudin Productions"
## [36] "Spyglass Entertainment"
## [37] "StudioCanal"
## [38] "Svensk Filmindustri SF"
## [39] "TF1 Films Production"
## [40] "The Vitaphone Corporation"
## [41] "Touchstone Pictures"
## [42] "TriStar Pictures"
## [43] "Twentieth Century Fox Film Corporation"
## [44] "UK Film Council"
## [45] "United Artists Pictures"
## [46] "Universal Pictures"
## [47] "Walt Disney Animation Studios"
## [48] "Walt Disney Productions"
## [49] "Warner Bros"
## [50] "Warner Bros Animation"
## [51] "Wild Bunch"
## [52] "Zweites Deutsches Fernsehen ZDF"
## # A tibble: 52 × 2
## company2 n
## <fct> <int>
## 1 "" 28428
## 2 "Other" 15072
## 3 "Warner Bros" 270
## 4 "Metro Goldwyn Mayer MGM" 149
## 5 "Canal+" 124
## 6 "Touchstone Pictures" 75
## 7 "Universal Pictures" 71
## 8 "TF1 Films Production" 52
## 9 "StudioCanal" 47
## 10 "Twentieth Century Fox Film Corporation" 45
## # ℹ 42 more rows
## [1] ""
## [2] "Other"
## [3] "Amblin Entertainment"
## [4] "BBC Films"
## [5] "Blumhouse Productions"
## [6] "British Broadcasting Corporation BBC"
## [7] "Canal+"
## [8] "Carolco Pictures"
## [9] "Castle Rock Entertainment"
## [10] "Columbia Pictures Corporation"
## [11] "Dimension Films"
## [12] "Dune Entertainment"
## [13] "Film i Väst"
## [14] "Film4"
## [15] "Focus Features"
## [16] "Globo Filmes"
## [17] "Happy Madison Productions"
## [18] "HBO Films"
## [19] "Hollywood Pictures"
## [20] "Lionsgate"
## [21] "M6 Films"
## [22] "Metro Goldwyn Mayer MGM"
## [23] "Millennium Films"
## [24] "Morgan Creek Productions"
## [25] "Nickelodeon Movies"
## [26] "Original Film"
## [27] "PolyGram Filmed Entertainment"
## [28] "Rai Cinema"
## [29] "Regency Enterprises"
## [30] "Relativity Media"
## [31] "Revolution Studios"
## [32] "Scott Rudin Productions"
## [33] "Spyglass Entertainment"
## [34] "StudioCanal"
## [35] "Svensk Filmindustri SF"
## [36] "TF1 Films Production"
## [37] "The Vitaphone Corporation"
## [38] "Touchstone Pictures"
## [39] "TriStar Pictures"
## [40] "Twentieth Century Fox Film Corporation"
## [41] "UK Film Council"
## [42] "Universal Pictures"
## [43] "Warner Bros"
## [44] "Warner Bros Animation"
## [45] "Wild Bunch"
## [46] "Zweites Deutsches Fernsehen ZDF"
## # A tibble: 46 × 2
## company3 n
## <fct> <int>
## 1 "" 36385
## 2 "Other" 8332
## 3 "Warner Bros" 130
## 4 "Canal+" 109
## 5 "Metro Goldwyn Mayer MGM" 44
## 6 "Relativity Media" 42
## 7 "TF1 Films Production" 29
## 8 "Touchstone Pictures" 27
## 9 "Film4" 20
## 10 "Millennium Films" 19
## # ℹ 36 more rows
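The idea of keeping the most frequent companies and grouping the rest under “Other” can also be illustrated with `fct_lump_n()` from forcats. This is an alternative technique, not the one used in the report, and the `companies` vector is a made-up example:

```r
library(forcats)

# Hypothetical company column
companies <- factor(c(
  "Warner Bros", "Warner Bros", "Warner Bros",
  "Canal+", "Canal+", "Film4", "Gaumont"
))

# Keep the 2 most frequent levels, lump everything else into "Other"
lumped <- fct_lump_n(companies, n = 2, other_level = "Other")
table(lumped)
```

Unlike the manual approach, this does not treat the blank level specially, so blanks would need to be relabeled or excluded first if they should not compete for a top position.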
The next step is to replace the blank category of company1 with a new
category named “No Company”. This is only done for company1, because the
label is not needed for the other two variables. To make use of the blanks
that remain, I will add a variable called company_count for later analysis.
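The relabeling itself is not shown in the report. A minimal sketch of how it could be done, assuming the same `fct_collapse()` pattern used earlier for “Unspecified”; the `company1` vector here is a made-up example, not the real column:

```r
library(forcats)

# Hypothetical company1 column with a blank level
company1 <- factor(c("", "Other", "Warner Bros", ""))

# Rename the blank level; the other levels are left untouched
company1 <- fct_collapse(company1, "No Company" = "")
levels(company1)
```

Once the blank level is renamed, a comparison such as `company1 != ""` is TRUE for every row, so any count based on blanks has to take the new label into account.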
# Create a new column 'company_count' to count the number of companies for each row
movies$company_count <- rowSums(movies[, c("company1", "company2", "company3")] != "")
## [1] "No Company"
## [2] "Other"
## [3] "American International Pictures AIP"
## [4] "BBC Films"
## [5] "British Broadcasting Corporation BBC"
## [6] "Canal+"
## [7] "Channel Four Films"
## [8] "CJ Entertainment"
## [9] "Columbia Pictures"
## [10] "Columbia Pictures Corporation"
## [11] "DC Comics"
## [12] "DreamWorks SKG"
## [13] "First National Pictures"
## [14] "Fox Film Corporation"
## [15] "Fox Searchlight Pictures"
## [16] "France 2 Cinéma"
## [17] "Gaumont"
## [18] "Hammer Film Productions"
## [19] "Hollywood Pictures"
## [20] "Imagine Entertainment"
## [21] "Lions Gate Films"
## [22] "Lionsgate"
## [23] "Metro Goldwyn Mayer MGM"
## [24] "Miramax Films"
## [25] "Monogram Pictures"
## [26] "Mosfilm"
## [27] "New Line Cinema"
## [28] "New World Pictures"
## [29] "Nikkatsu"
## [30] "Nordisk Film"
## [31] "Orion Pictures"
## [32] "Paramount Pictures"
## [33] "Rai Cinema"
## [34] "Regency Enterprises"
## [35] "RKO Radio Pictures"
## [36] "Shaw Brothers"
## [37] "Shôchiku Eiga"
## [38] "StudioCanal"
## [39] "Summit Entertainment"
## [40] "The Rank Organisation"
## [41] "TLA Releasing"
## [42] "Toho Company"
## [43] "Touchstone Pictures"
## [44] "TriStar Pictures"
## [45] "Twentieth Century Fox Film Corporation"
## [46] "United Artists"
## [47] "Universal International Pictures UI"
## [48] "Universal Pictures"
## [49] "Village Roadshow Pictures"
## [50] "Walt Disney Pictures"
## [51] "Walt Disney Productions"
## [52] "Warner Bros"
## # A tibble: 52 × 2
## company1 n
## <fct> <int>
## 1 Other 24274
## 2 No Company 11861
## 3 Paramount Pictures 996
## 4 Metro Goldwyn Mayer MGM 851
## 5 Twentieth Century Fox Film Corporation 780
## 6 Warner Bros 757
## 7 Universal Pictures 754
## 8 Columbia Pictures 429
## 9 Columbia Pictures Corporation 401
## 10 RKO Radio Pictures 290
## # ℹ 42 more rows
## [1] ""
## [2] "Other"
## [3] "Amblin Entertainment"
## [4] "American International Pictures AIP"
## [5] "BBC Films"
## [6] "Blumhouse Productions"
## [7] "British Broadcasting Corporation BBC"
## [8] "Canal+"
## [9] "Carolco Pictures"
## [10] "Castle Rock Entertainment"
## [11] "Columbia Pictures Corporation"
## [12] "Dimension Films"
## [13] "DreamWorks SKG"
## [14] "Dune Entertainment"
## [15] "Film i Väst"
## [16] "Film4"
## [17] "Focus Features"
## [18] "Globo Filmes"
## [19] "Happy Madison Productions"
## [20] "HBO Films"
## [21] "Hollywood Pictures"
## [22] "Lionsgate"
## [23] "M6 Films"
## [24] "Metro Goldwyn Mayer MGM"
## [25] "Millennium Films"
## [26] "Morgan Creek Productions"
## [27] "Nickelodeon Movies"
## [28] "Original Film"
## [29] "Pixar Animation Studios"
## [30] "PolyGram Filmed Entertainment"
## [31] "Rai Cinema"
## [32] "Regency Enterprises"
## [33] "Relativity Media"
## [34] "Revolution Studios"
## [35] "Scott Rudin Productions"
## [36] "Spyglass Entertainment"
## [37] "StudioCanal"
## [38] "Svensk Filmindustri SF"
## [39] "TF1 Films Production"
## [40] "The Vitaphone Corporation"
## [41] "Touchstone Pictures"
## [42] "TriStar Pictures"
## [43] "Twentieth Century Fox Film Corporation"
## [44] "UK Film Council"
## [45] "United Artists Pictures"
## [46] "Universal Pictures"
## [47] "Walt Disney Animation Studios"
## [48] "Walt Disney Productions"
## [49] "Warner Bros"
## [50] "Warner Bros Animation"
## [51] "Wild Bunch"
## [52] "Zweites Deutsches Fernsehen ZDF"
## # A tibble: 52 × 2
## company2 n
## <fct> <int>
## 1 "" 28428
## 2 "Other" 15072
## 3 "Warner Bros" 270
## 4 "Metro Goldwyn Mayer MGM" 149
## 5 "Canal+" 124
## 6 "Touchstone Pictures" 75
## 7 "Universal Pictures" 71
## 8 "TF1 Films Production" 52
## 9 "StudioCanal" 47
## 10 "Twentieth Century Fox Film Corporation" 45
## # ℹ 42 more rows
## [1] ""
## [2] "Other"
## [3] "Amblin Entertainment"
## [4] "BBC Films"
## [5] "Blumhouse Productions"
## [6] "British Broadcasting Corporation BBC"
## [7] "Canal+"
## [8] "Carolco Pictures"
## [9] "Castle Rock Entertainment"
## [10] "Columbia Pictures Corporation"
## [11] "Dimension Films"
## [12] "Dune Entertainment"
## [13] "Film i Väst"
## [14] "Film4"
## [15] "Focus Features"
## [16] "Globo Filmes"
## [17] "Happy Madison Productions"
## [18] "HBO Films"
## [19] "Hollywood Pictures"
## [20] "Lionsgate"
## [21] "M6 Films"
## [22] "Metro Goldwyn Mayer MGM"
## [23] "Millennium Films"
## [24] "Morgan Creek Productions"
## [25] "Nickelodeon Movies"
## [26] "Original Film"
## [27] "PolyGram Filmed Entertainment"
## [28] "Rai Cinema"
## [29] "Regency Enterprises"
## [30] "Relativity Media"
## [31] "Revolution Studios"
## [32] "Scott Rudin Productions"
## [33] "Spyglass Entertainment"
## [34] "StudioCanal"
## [35] "Svensk Filmindustri SF"
## [36] "TF1 Films Production"
## [37] "The Vitaphone Corporation"
## [38] "Touchstone Pictures"
## [39] "TriStar Pictures"
## [40] "Twentieth Century Fox Film Corporation"
## [41] "UK Film Council"
## [42] "Universal Pictures"
## [43] "Warner Bros"
## [44] "Warner Bros Animation"
## [45] "Wild Bunch"
## [46] "Zweites Deutsches Fernsehen ZDF"
## # A tibble: 46 × 2
## company3 n
## <fct> <int>
## 1 "" 36385
## 2 "Other" 8332
## 3 "Warner Bros" 130
## 4 "Canal+" 109
## 5 "Metro Goldwyn Mayer MGM" 44
## 6 "Relativity Media" 42
## 7 "TF1 Films Production" 29
## 8 "Touchstone Pictures" 27
## 9 "Film4" 20
## 10 "Millennium Films" 19
## # ℹ 36 more rows
Finally, leading and trailing whitespace is removed from the values of each company column.
# Eliminate white space inconsistency
movies <- movies %>% mutate(company1 = str_trim(company1))
movies <- movies %>% mutate(company2 = str_trim(company2))
movies <- movies %>% mutate(company3 = str_trim(company3))
# Reconvert variables to factor data type
movies <- movies %>% mutate(company1 = as.factor(company1))
movies <- movies %>% mutate(company2 = as.factor(company2))
movies <- movies %>% mutate(company3 = as.factor(company3))
## [1] "American International Pictures AIP"
## [2] "BBC Films"
## [3] "British Broadcasting Corporation BBC"
## [4] "Canal+"
## [5] "Channel Four Films"
## [6] "CJ Entertainment"
## [7] "Columbia Pictures"
## [8] "Columbia Pictures Corporation"
## [9] "DC Comics"
## [10] "DreamWorks SKG"
## [11] "First National Pictures"
## [12] "Fox Film Corporation"
## [13] "Fox Searchlight Pictures"
## [14] "France 2 Cinéma"
## [15] "Gaumont"
## [16] "Hammer Film Productions"
## [17] "Hollywood Pictures"
## [18] "Imagine Entertainment"
## [19] "Lions Gate Films"
## [20] "Lionsgate"
## [21] "Metro Goldwyn Mayer MGM"
## [22] "Miramax Films"
## [23] "Monogram Pictures"
## [24] "Mosfilm"
## [25] "New Line Cinema"
## [26] "New World Pictures"
## [27] "Nikkatsu"
## [28] "No Company"
## [29] "Nordisk Film"
## [30] "Orion Pictures"
## [31] "Other"
## [32] "Paramount Pictures"
## [33] "Rai Cinema"
## [34] "Regency Enterprises"
## [35] "RKO Radio Pictures"
## [36] "Shaw Brothers"
## [37] "Shôchiku Eiga"
## [38] "StudioCanal"
## [39] "Summit Entertainment"
## [40] "The Rank Organisation"
## [41] "TLA Releasing"
## [42] "Toho Company"
## [43] "Touchstone Pictures"
## [44] "TriStar Pictures"
## [45] "Twentieth Century Fox Film Corporation"
## [46] "United Artists"
## [47] "Universal International Pictures UI"
## [48] "Universal Pictures"
## [49] "Village Roadshow Pictures"
## [50] "Walt Disney Pictures"
## [51] "Walt Disney Productions"
## [52] "Warner Bros"
## [1] ""
## [2] "Amblin Entertainment"
## [3] "American International Pictures AIP"
## [4] "BBC Films"
## [5] "Blumhouse Productions"
## [6] "British Broadcasting Corporation BBC"
## [7] "Canal+"
## [8] "Carolco Pictures"
## [9] "Castle Rock Entertainment"
## [10] "Columbia Pictures Corporation"
## [11] "Dimension Films"
## [12] "DreamWorks SKG"
## [13] "Dune Entertainment"
## [14] "Film i Väst"
## [15] "Film4"
## [16] "Focus Features"
## [17] "Globo Filmes"
## [18] "Happy Madison Productions"
## [19] "HBO Films"
## [20] "Hollywood Pictures"
## [21] "Lionsgate"
## [22] "M6 Films"
## [23] "Metro Goldwyn Mayer MGM"
## [24] "Millennium Films"
## [25] "Morgan Creek Productions"
## [26] "Nickelodeon Movies"
## [27] "Original Film"
## [28] "Other"
## [29] "Pixar Animation Studios"
## [30] "PolyGram Filmed Entertainment"
## [31] "Rai Cinema"
## [32] "Regency Enterprises"
## [33] "Relativity Media"
## [34] "Revolution Studios"
## [35] "Scott Rudin Productions"
## [36] "Spyglass Entertainment"
## [37] "StudioCanal"
## [38] "Svensk Filmindustri SF"
## [39] "TF1 Films Production"
## [40] "The Vitaphone Corporation"
## [41] "Touchstone Pictures"
## [42] "TriStar Pictures"
## [43] "Twentieth Century Fox Film Corporation"
## [44] "UK Film Council"
## [45] "United Artists Pictures"
## [46] "Universal Pictures"
## [47] "Walt Disney Animation Studios"
## [48] "Walt Disney Productions"
## [49] "Warner Bros"
## [50] "Warner Bros Animation"
## [51] "Wild Bunch"
## [52] "Zweites Deutsches Fernsehen ZDF"
## [1] ""
## [2] "Amblin Entertainment"
## [3] "BBC Films"
## [4] "Blumhouse Productions"
## [5] "British Broadcasting Corporation BBC"
## [6] "Canal+"
## [7] "Carolco Pictures"
## [8] "Castle Rock Entertainment"
## [9] "Columbia Pictures Corporation"
## [10] "Dimension Films"
## [11] "Dune Entertainment"
## [12] "Film i Väst"
## [13] "Film4"
## [14] "Focus Features"
## [15] "Globo Filmes"
## [16] "Happy Madison Productions"
## [17] "HBO Films"
## [18] "Hollywood Pictures"
## [19] "Lionsgate"
## [20] "M6 Films"
## [21] "Metro Goldwyn Mayer MGM"
## [22] "Millennium Films"
## [23] "Morgan Creek Productions"
## [24] "Nickelodeon Movies"
## [25] "Original Film"
## [26] "Other"
## [27] "PolyGram Filmed Entertainment"
## [28] "Rai Cinema"
## [29] "Regency Enterprises"
## [30] "Relativity Media"
## [31] "Revolution Studios"
## [32] "Scott Rudin Productions"
## [33] "Spyglass Entertainment"
## [34] "StudioCanal"
## [35] "Svensk Filmindustri SF"
## [36] "TF1 Films Production"
## [37] "The Vitaphone Corporation"
## [38] "Touchstone Pictures"
## [39] "TriStar Pictures"
## [40] "Twentieth Century Fox Film Corporation"
## [41] "UK Film Council"
## [42] "Universal Pictures"
## [43] "Warner Bros"
## [44] "Warner Bros Animation"
## [45] "Wild Bunch"
## [46] "Zweites Deutsches Fernsehen ZDF"
The company columns are now clean, properly categorized, and ready for use in analysis.
## [1] "" "Afrikaans" "Azərbaycan" "Bahasa indonesia"
## [5] "Bahasa melayu" "Bamanankan" "Bokmål" "Bosanski"
## [9] "Català" "Český" "Cymraeg" "Dansk"
## [13] "Deutsch" "Eesti" "English" "Español"
## [17] "Esperanto" "euskera" "Français" "Fulfulde"
## [21] "Gaeilge" "Galego" "Hausa" "Hrvatski"
## [25] "isiZulu" "Íslenska" "Italiano" "Kinyarwanda"
## [29] "Kiswahili" "Latin" "Latviešu" "Lietuvi x9akai"
## [33] "Magyar" "Nederlands" "No Language" "Norsk"
## [37] "Polski" "Português" "Pусский" "Română"
## [41] "shqip" "Slovenčina" "Slovenščina" "Somali"
## [45] "Srpski" "suomi" "svenska" "Tiếng Việt"
## [49] "Türkçe" "Wolof" "ελληνικά" "беларуская мова"
## [53] "български език" "қазақ" "Український" "ქართული"
## [57] "עִבְרִית" "اردو" "العربية" "پښتو"
## [61] "فارسی" "हिन्दी" "বাংলা" "ਪੰਜਾਬੀ"
## [65] "தமிழ்" "తెలుగు" "ภาษาไทย" "한국어 조선말"
## [69] "广州话 廣州話" "日本語" "普通话"
## [1] "" "Afrikaans" "Bahasa indonesia" "Bahasa melayu"
## [5] "Bamanankan" "Bosanski" "Català" "Český"
## [9] "Cymraeg" "Dansk" "Deutsch" "Eesti"
## [13] "English" "Español" "Esperanto" "Français"
## [17] "Fulfulde" "Gaeilge" "Galego" "Hrvatski"
## [21] "isiZulu" "Íslenska" "Italiano" "Kinyarwanda"
## [25] "Kiswahili" "Latin" "Latviešu" "Lietuvi x9akai"
## [29] "Magyar" "Malti" "Nederlands" "No Language"
## [33] "Norsk" "ozbek" "Polski" "Português"
## [37] "Pусский" "Română" "shqip" "Slovenčina"
## [41] "Slovenščina" "Somali" "Srpski" "suomi"
## [45] "svenska" "Tiếng Việt" "Türkçe" "Wolof"
## [49] "ελληνικά" "български език" "қазақ" "Український"
## [53] "ქართული" "עִבְרִית" "اردو" "العربية"
## [57] "پښتو" "فارسی" "हिन्दी" "বাংলা"
## [61] "ਪੰਜਾਬੀ" "தமிழ்" "తెలుగు" "ภาษาไทย"
## [65] "한국어 조선말" "广州话 廣州話" "日本語" "普通话"
## [1] "" "Afrikaans" "Bahasa indonesia" "Bahasa melayu"
## [5] "Bamanankan" "Bosanski" "Český" "Cymraeg"
## [9] "Dansk" "Deutsch" "Eesti" "English"
## [13] "Español" "Esperanto" "euskera" "Français"
## [17] "Gaeilge" "Hrvatski" "isiZulu" "Íslenska"
## [21] "Italiano" "Kiswahili" "Latin" "Latviešu"
## [25] "Lietuvi x9akai" "Magyar" "Nederlands" "Norsk"
## [29] "Polski" "Português" "Pусский" "Română"
## [33] "shqip" "Slovenčina" "Slovenščina" "Somali"
## [37] "Srpski" "suomi" "svenska" "Tiếng Việt"
## [41] "Türkçe" "Wolof" "ελληνικά" "български език"
## [45] "қазақ" "Український" "ქართული" "עִבְרִית"
## [49] "اردو" "العربية" "پښتو" "فارسی"
## [53] "हिन्दी" "ਪੰਜਾਬੀ" "தமிழ்" "తెలుగు"
## [57] "ภาษาไทย" "한국어 조선말" "广州话 廣州話" "日本語"
## [61] "普通话"
The first step to clean the columns will be to remove the whitespace from each of the variables.
# Eliminate white space inconsistency
movies <- movies %>% mutate(country1_language = str_trim(country1_language))
movies <- movies %>% mutate(country2_language = str_trim(country2_language))
movies <- movies %>% mutate(country3_language = str_trim(country3_language))
# Reconvert variables to factor data type
movies <- movies %>% mutate(country1_language = as.factor(country1_language))
movies <- movies %>% mutate(country2_language = as.factor(country2_language))
movies <- movies %>% mutate(country3_language = as.factor(country3_language))
## [1] "" "Afrikaans" "Azərbaycan" "Bahasa indonesia"
## [5] "Bahasa melayu" "Bamanankan" "Bokmål" "Bosanski"
## [9] "Català" "Český" "Cymraeg" "Dansk"
## [13] "Deutsch" "Eesti" "English" "Español"
## [17] "Esperanto" "euskera" "Français" "Fulfulde"
## [21] "Gaeilge" "Galego" "Hausa" "Hrvatski"
## [25] "isiZulu" "Íslenska" "Italiano" "Kinyarwanda"
## [29] "Kiswahili" "Latin" "Latviešu" "Lietuvi x9akai"
## [33] "Magyar" "Nederlands" "No Language" "Norsk"
## [37] "Polski" "Português" "Pусский" "Română"
## [41] "shqip" "Slovenčina" "Slovenščina" "Somali"
## [45] "Srpski" "suomi" "svenska" "Tiếng Việt"
## [49] "Türkçe" "Wolof" "ελληνικά" "беларуская мова"
## [53] "български език" "қазақ" "Український" "ქართული"
## [57] "עִבְרִית" "اردو" "العربية" "پښتو"
## [61] "فارسی" "हिन्दी" "বাংলা" "ਪੰਜਾਬੀ"
## [65] "தமிழ்" "తెలుగు" "ภาษาไทย" "한국어 조선말"
## [69] "广州话 廣州話" "日本語" "普通话"
## [1] "" "Afrikaans" "Bahasa indonesia" "Bahasa melayu"
## [5] "Bamanankan" "Bosanski" "Català" "Český"
## [9] "Cymraeg" "Dansk" "Deutsch" "Eesti"
## [13] "English" "Español" "Esperanto" "Français"
## [17] "Fulfulde" "Gaeilge" "Galego" "Hrvatski"
## [21] "isiZulu" "Íslenska" "Italiano" "Kinyarwanda"
## [25] "Kiswahili" "Latin" "Latviešu" "Lietuvi x9akai"
## [29] "Magyar" "Malti" "Nederlands" "No Language"
## [33] "Norsk" "ozbek" "Polski" "Português"
## [37] "Pусский" "Română" "shqip" "Slovenčina"
## [41] "Slovenščina" "Somali" "Srpski" "suomi"
## [45] "svenska" "Tiếng Việt" "Türkçe" "Wolof"
## [49] "ελληνικά" "български език" "қазақ" "Український"
## [53] "ქართული" "עִבְרִית" "اردو" "العربية"
## [57] "پښتو" "فارسی" "हिन्दी" "বাংলা"
## [61] "ਪੰਜਾਬੀ" "தமிழ்" "తెలుగు" "ภาษาไทย"
## [65] "한국어 조선말" "广州话 廣州話" "日本語" "普通话"
## [1] "" "Afrikaans" "Bahasa indonesia" "Bahasa melayu"
## [5] "Bamanankan" "Bosanski" "Český" "Cymraeg"
## [9] "Dansk" "Deutsch" "Eesti" "English"
## [13] "Español" "Esperanto" "euskera" "Français"
## [17] "Gaeilge" "Hrvatski" "isiZulu" "Íslenska"
## [21] "Italiano" "Kiswahili" "Latin" "Latviešu"
## [25] "Lietuvi x9akai" "Magyar" "Nederlands" "Norsk"
## [29] "Polski" "Português" "Pусский" "Română"
## [33] "shqip" "Slovenčina" "Slovenščina" "Somali"
## [37] "Srpski" "suomi" "svenska" "Tiếng Việt"
## [41] "Türkçe" "Wolof" "ελληνικά" "български език"
## [45] "қазақ" "Український" "ქართული" "עִבְרִית"
## [49] "اردو" "العربية" "پښتو" "فارسی"
## [53] "हिन्दी" "ਪੰਜਾਬੀ" "தமிழ்" "తెలుగు"
## [57] "ภาษาไทย" "한국어 조선말" "广州话 廣州話" "日本語"
## [61] "普通话"
The second step to clean these columns will be to assign a name to the level that does not contain information, so that it can be identified faster.
# Create the category "Unspecified Language"
movies <- movies %>% mutate(country1_language = fct_collapse(country1_language, "Unspecified Language" = ""))
movies <- movies %>% mutate(country2_language = fct_collapse(country2_language, "Unspecified Language" = ""))
movies <- movies %>% mutate(country3_language = fct_collapse(country3_language, "Unspecified Language" = ""))
## [1] "Unspecified Language" "Afrikaans" "Azərbaycan"
## [4] "Bahasa indonesia" "Bahasa melayu" "Bamanankan"
## [7] "Bokmål" "Bosanski" "Català"
## [10] "Český" "Cymraeg" "Dansk"
## [13] "Deutsch" "Eesti" "English"
## [16] "Español" "Esperanto" "euskera"
## [19] "Français" "Fulfulde" "Gaeilge"
## [22] "Galego" "Hausa" "Hrvatski"
## [25] "isiZulu" "Íslenska" "Italiano"
## [28] "Kinyarwanda" "Kiswahili" "Latin"
## [31] "Latviešu" "Lietuvi x9akai" "Magyar"
## [34] "Nederlands" "No Language" "Norsk"
## [37] "Polski" "Português" "Pусский"
## [40] "Română" "shqip" "Slovenčina"
## [43] "Slovenščina" "Somali" "Srpski"
## [46] "suomi" "svenska" "Tiếng Việt"
## [49] "Türkçe" "Wolof" "ελληνικά"
## [52] "беларуская мова" "български език" "қазақ"
## [55] "Український" "ქართული" "עִבְרִית"
## [58] "اردو" "العربية" "پښتو"
## [61] "فارسی" "हिन्दी" "বাংলা"
## [64] "ਪੰਜਾਬੀ" "தமிழ்" "తెలుగు"
## [67] "ภาษาไทย" "한국어 조선말" "广州话 廣州話"
## [70] "日本語" "普通话"
## [1] "Unspecified Language" "Afrikaans" "Bahasa indonesia"
## [4] "Bahasa melayu" "Bamanankan" "Bosanski"
## [7] "Català" "Český" "Cymraeg"
## [10] "Dansk" "Deutsch" "Eesti"
## [13] "English" "Español" "Esperanto"
## [16] "Français" "Fulfulde" "Gaeilge"
## [19] "Galego" "Hrvatski" "isiZulu"
## [22] "Íslenska" "Italiano" "Kinyarwanda"
## [25] "Kiswahili" "Latin" "Latviešu"
## [28] "Lietuvi x9akai" "Magyar" "Malti"
## [31] "Nederlands" "No Language" "Norsk"
## [34] "ozbek" "Polski" "Português"
## [37] "Pусский" "Română" "shqip"
## [40] "Slovenčina" "Slovenščina" "Somali"
## [43] "Srpski" "suomi" "svenska"
## [46] "Tiếng Việt" "Türkçe" "Wolof"
## [49] "ελληνικά" "български език" "қазақ"
## [52] "Український" "ქართული" "עִבְרִית"
## [55] "اردو" "العربية" "پښتو"
## [58] "فارسی" "हिन्दी" "বাংলা"
## [61] "ਪੰਜਾਬੀ" "தமிழ்" "తెలుగు"
## [64] "ภาษาไทย" "한국어 조선말" "广州话 廣州話"
## [67] "日本語" "普通话"
## [1] "Unspecified Language" "Afrikaans" "Bahasa indonesia"
## [4] "Bahasa melayu" "Bamanankan" "Bosanski"
## [7] "Český" "Cymraeg" "Dansk"
## [10] "Deutsch" "Eesti" "English"
## [13] "Español" "Esperanto" "euskera"
## [16] "Français" "Gaeilge" "Hrvatski"
## [19] "isiZulu" "Íslenska" "Italiano"
## [22] "Kiswahili" "Latin" "Latviešu"
## [25] "Lietuvi x9akai" "Magyar" "Nederlands"
## [28] "Norsk" "Polski" "Português"
## [31] "Pусский" "Română" "shqip"
## [34] "Slovenčina" "Slovenščina" "Somali"
## [37] "Srpski" "suomi" "svenska"
## [40] "Tiếng Việt" "Türkçe" "Wolof"
## [43] "ελληνικά" "български език" "қазақ"
## [46] "Український" "ქართული" "עִבְרִית"
## [49] "اردو" "العربية" "پښتو"
## [52] "فارسی" "हिन्दी" "ਪੰਜਾਬੀ"
## [55] "தமிழ்" "తెలుగు" "ภาษาไทย"
## [58] "한국어 조선말" "广州话 廣州話" "日本語"
## [61] "普通话"
For the time being, the language names will not be translated. Translation could further improve the overall cleanliness of the data set, but the three columns can already be analyzed in their current state.
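If translation is added later, one possible approach is a manual mapping with `fct_recode()`. The sketch below is only an illustration on a made-up vector (`lang` is an assumption) and covers just three of the language names:

```r
library(forcats)

# Hypothetical language column
lang <- factor(c("Deutsch", "English", "Italiano", "Dansk", "English"))

# Map native names to English names; unlisted levels stay unchanged
lang_en <- fct_recode(
  lang,
  German  = "Deutsch",
  Italian = "Italiano",
  Danish  = "Dansk"
)
levels(lang_en)
```

A full mapping would need one entry per level, so a lookup table kept in a separate file would probably be easier to maintain than an inline list.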
An IMDb id should have the same character length regardless of the
format. This is an example of how it should look: “tt7158814”. In total
it contains 9 characters; therefore, any imdb_id that
contains fewer than that should be changed.
## [1] 0 9 9 9 9 9 9 9 9 9
An error was encountered within the first 10 rows; however, we need to see whether there are more errors besides that one.
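A check along these lines would flag every id whose length differs from 9, not only those in the first rows. This is a minimal sketch on a toy data frame; `toy_movies` and its values are illustrative assumptions, not the report's data.

```r
# Toy data standing in for the movies table (hypothetical values)
toy_movies <- data.frame(
  title   = c("Movie A", "Movie B", "Movie C"),
  imdb_id = c("", "tt7158814", "tt0114709"),
  stringsAsFactors = FALSE
)

# Character length of each id; a valid id in this dataset has 9 characters
id_length <- nchar(toy_movies$imdb_id)

# Keep only the rows whose id does not have the expected length
invalid_ids <- toy_movies[id_length != 9, ]
print(invalid_ids)
```

Applied to the real table, the same filter would return the rows that need a manual lookup.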
## # A tibble: 1 × 40
## adult budget homepage id imdb_id original_language original_title overview
## <lgl> <dbl> <chr> <chr> <chr> <fct> <chr> <chr>
## 1 FALSE 4.22e6 "" 15257 "" en Hulk vs. Wolv… Departm…
## # ℹ 32 more variables: popularity <dbl>, poster_path <chr>,
## # release_date <date>, revenue <dbl>, runtime <dbl>, status <fct>,
## # tagline <chr>, title <chr>, video <chr>, vote_average <dbl>,
## # vote_count <int>, id_collection <chr>, name_collection <chr>,
## # poster_path_collection <chr>, backdrop_path_collection <chr>, genre1 <fct>,
## # genre2 <fct>, genre3 <fct>, country1 <fct>, country2 <fct>, country3 <fct>,
## # country1_language <fct>, country2_language <fct>, …
By running the code we find that the only invalid id is the one detected previously. Searching for the title of the movie on IMDb gives the correct id for this movie: “tt1308622”.
# Replace the invalid id with the correct one
movies <- movies %>% mutate(imdb_id = case_when(imdb_id == "" ~ "tt1308622", TRUE ~ imdb_id))
## # A tibble: 0 × 40
## # ℹ 40 variables: adult <lgl>, budget <dbl>, homepage <chr>, id <chr>,
## # imdb_id <chr>, original_language <fct>, original_title <chr>,
## # overview <chr>, popularity <dbl>, poster_path <chr>, release_date <date>,
## # revenue <dbl>, runtime <dbl>, status <fct>, tagline <chr>, title <chr>,
## # video <chr>, vote_average <dbl>, vote_count <int>, id_collection <chr>,
## # name_collection <chr>, poster_path_collection <chr>,
## # backdrop_path_collection <chr>, genre1 <fct>, genre2 <fct>, genre3 <fct>, …
Up to this point, I have only focused on the movies_metadata csv file; however, it is not the only file related to this data set. There is another file that could be relevant to add in order to expand our analysis possibilities. To do this we are going to use merge functions to include the other information in this data set.
The file we are going to merge is the keywords file, which groups by id the keywords that identify a movie.
To fix any orthographic errors in the country language columns we are going to use the stringdist and fuzzyjoin packages, which will help us correct any typos in those columns.
# Get the unique languages
unique_languages <- table(movies$country1_language)
write.csv(unique_languages,"languages.csv")
# Read list with correct names
languages_corrected <- read.csv("D:\\Business Analytics\\languages_corrected.csv")
# Join both datasets using string distance as the criteria
movies <- movies %>%
  stringdist_left_join(languages_corrected, by = c("country1_language" = "Language"), method = "dl") %>%
  stringdist_left_join(languages_corrected, by = c("country2_language" = "Language"), method = "dl") %>%
  stringdist_left_join(languages_corrected, by = c("country3_language" = "Language"), method = "dl")
## Unspecified Language Afrikaans Azərbaycan
## 4051 22 4
## Bahasa indonesia Bahasa melayu Bamanankan
## 26 5 4
## Bokmål Bosanski Català
## 3 26 31
## Český Cymraeg Dansk
## 270 2 300
## Deutsch Eesti English
## 1321 41 26890
## Español Esperanto euskera
## 1144 3 14
## Français Fulfulde Gaeilge
## 2430 1 6
## Galego Hausa Hrvatski
## 3 1 34
## isiZulu Íslenska Italiano
## 4 32 1416
## Kinyarwanda Kiswahili Latin
## 1 2 24
## Latviešu Lietuvi x9akai Magyar
## 17 15 144
## Nederlands No Language Norsk
## 297 306 112
## Polski Português Pусский
## 246 330 909
## Română shqip Slovenčina
## 75 24 18
## Slovenščina Somali Srpski
## 24 1 47
## suomi svenska Tiếng Việt
## 345 676 15
## Türkçe Wolof ελληνικά
## 149 3 133
## беларуская мова български език қазақ
## 2 25 8
## Український ქართული עִבְרִית
## 16 21 76
## اردو العربية پښتو
## 15 269 2
## فارسی हिन्दी বাংলা
## 102 546 43
## ਪੰਜਾਬੀ தமிழ் తెలుగు
## 4 81 43
## ภาษาไทย 한국어 조선말 广州话 廣州話
## 72 446 405
## 日本語 普通话
## 1385 414
## Unspecified Language Afrikaans Bahasa indonesia
## 37996 4 9
## Bahasa melayu Bamanankan Bosanski
## 4 1 3
## Català Český Cymraeg
## 5 14 4
## Dansk Deutsch Eesti
## 19 924 10
## English Español Esperanto
## 1617 786 3
## Français Fulfulde Gaeilge
## 1488 1 11
## Galego Hrvatski isiZulu
## 1 14 5
## Íslenska Italiano Kinyarwanda
## 22 623 2
## Kiswahili Latin Latviešu
## 7 48 2
## Lietuvi x9akai Magyar Malti
## 7 138 2
## Nederlands No Language Norsk
## 29 13 49
## ozbek Polski Português
## 2 162 160
## Pусский Română shqip
## 332 27 1
## Slovenčina Slovenščina Somali
## 18 10 4
## Srpski suomi svenska
## 27 40 244
## Tiếng Việt Türkçe Wolof
## 29 51 7
## ελληνικά български език қазақ
## 44 4 2
## Український ქართული עִבְרִית
## 17 8 74
## اردو العربية پښتو
## 19 28 2
## فارسی हिन्दी বাংলা
## 26 126 2
## ਪੰਜਾਬੀ தமிழ் తెలుగు
## 6 21 16
## ภาษาไทย 한국어 조선말 广州话 廣州話
## 42 54 44
## 日本語 普通话
## 240 222
## Unspecified Language Afrikaans Bahasa indonesia
## 43438 2 2
## Bahasa melayu Bamanankan Bosanski
## 6 1 2
## Český Cymraeg Dansk
## 2 1 5
## Deutsch Eesti English
## 330 1 243
## Español Esperanto euskera
## 309 1 1
## Français Gaeilge Hrvatski
## 237 4 3
## isiZulu Íslenska Italiano
## 6 8 225
## Kiswahili Latin Latviešu
## 9 41 2
## Lietuvi x9akai Magyar Nederlands
## 2 60 7
## Norsk Polski Português
## 24 82 65
## Pусский Română shqip
## 212 18 3
## Slovenčina Slovenščina Somali
## 4 4 4
## Srpski suomi svenska
## 21 8 112
## Tiếng Việt Türkçe Wolof
## 8 29 2
## ελληνικά български език қазақ
## 28 2 1
## Український ქართული עִבְרִית
## 8 4 37
## اردو العربية پښتو
## 15 31 1
## فارسی हिन्दी ਪੰਜਾਬੀ
## 8 28 7
## தமிழ் తెలుగు ภาษาไทย
## 4 8 31
## 한국어 조선말 广州话 廣州話 日本語
## 28 12 87
## 普通话
## 88
The language values are imported and corrected in case there was a typo within the dataset. With this we ensure that the same language stays in one category only.
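The idea behind the string-distance correction can be illustrated with a small sketch. It is not the report's join: it uses base R's `adist` (generalized Levenshtein distance) instead of the Damerau-Levenshtein method of `stringdist_left_join`, and the names, typos and the threshold `max_dist` are all made-up assumptions.

```r
# Reference list of correct names (hypothetical subset)
correct_names <- c("English", "Deutsch", "Italiano", "Nederlands")

# Values as they might appear in the data, including typos (made up)
raw_names <- c("Englsh", "Deutsh", "Italiano", "Nederland", "Klingon")

# Edit-distance matrix: rows are raw values, columns are reference names
distances <- adist(raw_names, correct_names)

# For each raw value take the closest reference name,
# but only accept it when the distance is at most max_dist
max_dist  <- 2
closest   <- apply(distances, 1, which.min)
min_dist  <- apply(distances, 1, min)
corrected <- ifelse(min_dist <= max_dist, correct_names[closest], NA)

print(data.frame(raw_names, corrected))
```

Values that are too far from every reference name stay unmatched (`NA`), which makes it easy to review them by hand instead of forcing them into a wrong category.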
Now we import keywords.
# Merge the movies df with the keywords df, keeping every row of movies.
movies <- merge(movies,keywords,all.x =TRUE)
Now the movies contain their corresponding keywords when applicable. However, the keywords are in JSON format, which is not adequate for analysis purposes; therefore, we are going to create three new columns to register the first three keywords a movie uses.
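One possible way to pull the first three keywords out of such strings is sketched below. The layout of `raw_keywords` (a list of records with `id` and `name` fields in single quotes) is an assumption about the file, and the regular-expression approach is illustrative; it is not necessarily how the V1, V2 and V3 columns were produced in this report.

```r
# Hypothetical raw keyword strings in a list-of-records layout
raw_keywords <- c(
  "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}]",
  "[{'id': 10090, 'name': 'board game'}]",
  "[]"
)

# Extract every "'name': '...'" fragment from each string
# (names that themselves contain an apostrophe are not handled by this pattern)
name_matches <- regmatches(raw_keywords, gregexpr("'name': '[^']*'", raw_keywords))

# Strip the field label and quotes so that only the keyword text remains
keyword_lists <- lapply(name_matches, function(m) sub("^'name': '(.*)'$", "\\1", m))

# Take the first three keywords of each movie, padding with NA when fewer exist
first_three  <- t(sapply(keyword_lists, function(k) k[1:3]))
toy_keywords <- data.frame(first_three, stringsAsFactors = FALSE)
names(toy_keywords) <- c("keyword1", "keyword2", "keyword3")
print(toy_keywords)
```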
## V1 V2 V3
## Length:45972 Length:45972 Length:45972
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
# Cleaning and trimming spaces
new_keywords <- new_keywords %>%
  mutate(
    keyword1 = str_replace_all(V1, "[[:punct:]]", " "),
    keyword2 = str_replace_all(V2, "[[:punct:]]", " "),
    keyword3 = str_replace_all(V3, "[[:punct:]]", " ")
  ) %>%
  mutate(
    keyword1 = str_remove_all(keyword1, "\\bid\\b"),
    keyword2 = str_remove_all(keyword2, "\\bid\\b"),
    keyword3 = str_remove_all(keyword3, "\\bid\\b")
  )
# Remove the original columns V1, V2, V3
new_keywords <- select(new_keywords, keyword1, keyword2, keyword3)
# Trim all leading and trailing white spaces
new_keywords <- new_keywords %>%
  mutate(
    keyword1 = str_trim(keyword1),
    keyword2 = str_trim(keyword2),
    keyword3 = str_trim(keyword3)
  )
# Convert to factor
new_keywords$keyword1 <- as.factor(new_keywords$keyword1)
new_keywords$keyword2 <- as.factor(new_keywords$keyword2)
new_keywords$keyword3 <- as.factor(new_keywords$keyword3)
## keyword1 keyword2 keyword3
## :14520 :21056 :25845
## woman director : 1344 woman director : 387 woman director : 248
## independent film: 700 independent film: 234 independent film: 210
## based on novel : 446 sex : 212 murder : 175
## musical : 386 based on novel : 210 nudity : 133
## female nudity : 376 murder : 196 sex : 105
## (Other) :28200 (Other) :23677 (Other) :19256
By using the merge function we can now see which keywords are the most popular in the whole dataset and also look up the keywords of any specific movie, which could be important to consider when doing data analysis.
# Check for missing values in the 'title' variable
missing_title <- sum(is.na(movies$title))
# Display the number of missing values
missing_title
## [1] 0
## [1] "The Great Mouse Detective" "Exit Smiling"
## [3] "Turn It Up" "Gabriel"
## [5] "Hot Stuff" "The Free Will"
Even though I checked for missing values with code, I also looked manually
and saw that the values in title and original_title remain
intact.
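Because an empty string is not the same as `NA` (the invalid imdb_id found earlier was an empty string), a slightly broader check can complement `is.na()`. The sketch below runs on a made-up vector; `toy_titles` is an illustrative assumption.

```r
# Toy titles with one NA, one empty string and one whitespace-only string
toy_titles <- c("Gabriel", NA, "", "Hot Stuff", "   ")

# is.na() only catches true NA values
n_na <- sum(is.na(toy_titles))

# Empty or whitespace-only strings are not NA, so count them separately
n_blank <- sum(!is.na(toy_titles) & trimws(toy_titles) == "")

cat("NA titles:", n_na, " blank titles:", n_blank, "\n")
```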
Now that we have merged and cleaned the new dataset, we can create a report with the DataExplorer library to obtain insights about our dataset. But first it is necessary to remove unnecessary columns in order to obtain a better report. For the report we will only include one of each variable in cases where there is more than one.
# Creating a shorter data set
movies_short <- select(movies,adult,original_title,revenue,budget,runtime,release_date,status,vote_average,vote_count,popularity_max,genre1,genre_count,company1,company_count,country1,country1_language,keyword1)
Now that we have the simplified dataset, movies_short,
we will create the report using revenue as our main
focus for analysis.
# For the report we are going to use revenue as our dependent variable
#create_report(movies_short,y= "revenue")
A report was created with DataExplorer to see missing values and other data. I keep the call as a comment because the report does not need to open every time I run the code.
##
## Action Adventure Animation Comedy Crime
## 4518 1525 1133 8929 1705
## Documentary Drama Family Fantasy Foreign
## 3447 12156 539 708 118
## History Horror Music Mystery Romance
## 283 2634 493 560 1201
## Science Fiction Thriller TV Movie Unspecified War
## 647 1692 391 2458 384
## Western
## 451
##
## Action Adventure Animation Comedy Crime
## 0.098277212 0.033172366 0.024645436 0.194226921 0.037087793
## Documentary Drama Family Fantasy Foreign
## 0.074980423 0.264421822 0.011724528 0.015400679 0.002566780
## History Horror Music Mystery Romance
## 0.006155921 0.057295745 0.010723919 0.012181328 0.026124598
## Science Fiction Thriller TV Movie Unspecified War
## 0.014073784 0.036805012 0.008505177 0.053467328 0.008352910
## Western
## 0.009810319
Drama is by far the most common genre in the dataset, accounting for about 26% of the movies, followed by comedy with about 19%.
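Counts and shares like the ones above can be obtained with `table()` and `prop.table()`. The following is a minimal sketch on a made-up genre column; `toy_genre` is an illustrative assumption.

```r
# Toy genre column (hypothetical values)
toy_genre <- factor(c("Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"))

# Absolute counts per genre
genre_counts <- table(toy_genre)

# Share of each genre, sorted from most to least common
genre_share <- sort(prop.table(genre_counts), decreasing = TRUE)
print(round(genre_share, 3))

# Name of the most common genre
top_genre <- names(genre_share)[1]
print(top_genre)
```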
##
## xx 104.0 68.0 82.0 ab af am ar ay bg bm bn bo
## 45 0 0 0 10 2 2 39 1 10 3 29 2
## bs ca cn cs cy da de el en eo es et eu
## 14 12 313 135 1 241 1081 113 32365 1 995 24 3
## fa fi fr fy gl he hi hr hu hy id is it
## 100 308 2438 1 1 67 508 30 102 1 20 24 1532
## iu ja jv ka kk kn ko ku ky la lb lo lt
## 2 1347 1 18 3 3 444 3 3 1 1 2 9
## lv mk ml mn mr ms mt nb ne nl no pa pl
## 18 5 36 2 25 5 1 6 2 248 119 2 218
## ps pt qu ro ru rw sh si sk sl sm sq sr
## 2 316 1 57 826 1 5 1 18 33 1 5 63
## sv ta te tg th tl tr uk ur uz vi wo zh
## 724 78 45 1 75 23 150 16 8 1 10 5 409
## zu
## 1
As expected, most movies are in English (en).
##
## Afghanistan Albania Algeria Angola Argentina Armenia
## Action 296 0 0 0 0 11 0
## Adventure 72 0 0 1 0 5 0
## Animation 131 0 0 0 0 5 1
## Comedy 871 0 1 0 0 26 2
## Crime 78 0 0 0 0 8 0
## Documentary 1157 0 0 0 0 7 1
## Drama 1129 3 1 5 2 99 3
## Family 74 0 0 0 0 1 0
## Fantasy 47 0 0 0 0 3 0
## Foreign 17 0 0 0 0 1 0
## History 28 0 0 1 0 0 0
## Horror 204 0 0 0 0 5 0
## Music 101 0 0 0 0 1 0
## Mystery 41 0 0 0 0 3 0
## Romance 110 0 0 0 0 5 0
## Science Fiction 62 0 0 0 0 5 0
## Thriller 149 0 1 0 0 6 0
## TV Movie 85 0 0 0 0 1 0
## Unspecified 1600 0 0 0 0 16 0
## War 29 1 0 1 0 1 0
## Western 33 0 0 0 0 3 0
##
## Aruba Australia Austria Azerbaijan Bahamas Bangladesh Belarus
## Action 3 63 5 0 0 0 1
## Adventure 1 28 2 0 2 0 0
## Animation 0 12 0 0 0 0 0
## Comedy 0 78 26 0 1 0 2
## Crime 0 24 5 0 0 0 0
## Documentary 0 16 26 0 0 0 0
## Drama 0 131 54 1 0 1 1
## Family 0 13 1 0 0 1 0
## Fantasy 0 11 1 0 0 0 0
## Foreign 0 3 0 0 0 0 0
## History 0 3 0 0 0 0 0
## Horror 0 38 5 0 0 0 0
## Music 0 4 2 0 0 0 0
## Mystery 0 13 1 0 0 0 0
## Romance 0 13 2 0 0 0 0
## Science Fiction 0 11 3 0 0 0 0
## Thriller 0 32 5 0 1 0 0
## TV Movie 0 0 0 0 0 0 0
## Unspecified 0 5 11 0 0 0 0
## War 0 7 0 0 0 0 1
## Western 0 1 2 0 0 0 0
##
## Belgium Bermuda Bhutan Bolivia Bosnia and Herzegovina
## Action 10 0 1 0 1
## Adventure 9 0 1 1 0
## Animation 10 0 0 0 0
## Comedy 53 0 1 0 3
## Crime 9 0 0 0 0
## Documentary 7 1 0 1 0
## Drama 117 0 0 4 15
## Family 6 0 0 0 0
## Fantasy 5 0 0 0 0
## Foreign 0 0 0 0 0
## History 5 0 0 0 0
## Horror 14 0 0 0 0
## Music 2 0 0 0 0
## Mystery 5 0 0 0 0
## Romance 19 0 0 0 0
## Science Fiction 4 0 0 0 0
## Thriller 9 0 0 1 0
## TV Movie 1 0 0 0 0
## Unspecified 12 0 0 1 0
## War 2 0 0 0 3
## Western 0 0 0 0 0
##
## Botswana Brazil Brunei Darussalam Bulgaria Burkina Faso
## Action 1 7 1 7 0
## Adventure 0 11 0 3 0
## Animation 0 3 0 0 0
## Comedy 0 54 0 5 0
## Crime 0 7 0 0 0
## Documentary 1 30 0 1 0
## Drama 0 113 0 14 5
## Family 0 3 0 0 0
## Fantasy 0 0 0 0 0
## Foreign 0 1 0 0 0
## History 0 1 0 0 1
## Horror 0 6 0 2 0
## Music 0 2 0 0 0
## Mystery 0 2 0 0 0
## Romance 0 8 0 0 1
## Science Fiction 0 1 0 2 0
## Thriller 0 4 0 2 0
## TV Movie 0 0 0 0 0
## Unspecified 0 9 0 0 0
## War 0 0 0 0 1
## Western 0 0 0 0 0
##
## Cambodia Cameroon Canada Chad Chile China Colombia Congo
## Action 0 0 166 1 1 67 1 1
## Adventure 0 0 56 0 0 15 1 1
## Animation 0 0 43 0 1 5 1 0
## Comedy 1 0 202 0 11 25 1 0
## Crime 0 0 49 0 1 4 2 0
## Documentary 4 1 118 0 6 8 3 1
## Drama 1 2 381 0 20 122 8 0
## Family 0 0 20 0 0 0 0 0
## Fantasy 0 0 22 0 1 9 0 0
## Foreign 0 0 0 0 1 0 0 0
## History 0 0 4 0 0 3 0 0
## Horror 0 0 167 0 1 0 0 0
## Music 0 0 9 0 0 0 0 0
## Mystery 0 0 23 0 0 4 0 0
## Romance 0 0 35 0 1 17 0 0
## Science Fiction 0 1 33 0 0 0 0 0
## Thriller 0 0 93 0 1 10 1 0
## TV Movie 0 0 40 0 0 0 0 0
## Unspecified 0 0 29 0 5 10 1 0
## War 0 0 5 1 0 2 0 0
## Western 0 0 6 0 0 0 0 0
##
## Costa Rica Cote D Ivoire Croatia Cuba Cyprus Czech Republic
## Action 0 0 1 0 0 5
## Adventure 0 0 1 0 0 11
## Animation 0 0 0 0 0 17
## Comedy 1 0 7 3 1 37
## Crime 0 0 1 0 0 4
## Documentary 0 0 0 4 0 2
## Drama 3 2 16 6 1 42
## Family 0 0 0 0 0 9
## Fantasy 0 0 1 1 0 6
## Foreign 0 0 0 0 0 0
## History 0 0 1 0 0 5
## Horror 0 0 1 1 0 4
## Music 0 0 0 0 0 4
## Mystery 0 0 0 0 0 3
## Romance 0 0 1 1 0 4
## Science Fiction 0 0 1 0 0 2
## Thriller 0 0 1 0 0 6
## TV Movie 0 0 0 0 0 0
## Unspecified 0 0 0 0 0 4
## War 0 0 0 0 0 4
## Western 0 0 0 0 0 0
##
## Czechoslovakia Denmark Dominican Republic East Germany
## Action 0 13 1 0
## Adventure 0 12 0 0
## Animation 0 5 0 0
## Comedy 3 50 1 0
## Crime 0 16 2 0
## Documentary 0 24 0 0
## Drama 1 132 2 1
## Family 0 13 0 0
## Fantasy 0 1 0 1
## Foreign 0 1 0 0
## History 0 2 0 1
## Horror 0 16 0 0
## Music 0 1 0 1
## Mystery 0 3 0 0
## Romance 0 6 0 0
## Science Fiction 0 1 0 1
## Thriller 0 22 0 0
## TV Movie 0 0 0 0
## Unspecified 0 10 0 0
## War 0 2 0 0
## Western 0 0 0 0
##
## Ecuador Egypt El Salvador Estonia Ethiopia Finland France
## Action 0 2 0 0 0 13 147
## Adventure 0 0 0 4 0 3 85
## Animation 0 0 0 3 0 2 42
## Comedy 0 1 0 9 0 83 643
## Crime 1 0 0 3 0 12 117
## Documentary 2 0 0 4 0 28 129
## Drama 1 10 1 14 3 110 1001
## Family 0 0 0 1 0 5 22
## Fantasy 0 0 0 0 0 4 57
## Foreign 0 0 0 0 0 3 4
## History 0 0 0 1 0 4 27
## Horror 0 1 0 2 0 2 60
## Music 0 0 0 0 0 6 18
## Mystery 0 1 0 1 0 1 28
## Romance 1 1 0 0 0 12 105
## Science Fiction 0 0 0 0 0 4 18
## Thriller 0 2 0 1 0 12 96
## TV Movie 0 0 0 1 0 2 4
## Unspecified 0 1 0 3 0 30 60
## War 0 0 0 1 0 6 33
## Western 0 0 0 0 0 0 10
##
## Georgia Germany Ghana Greece Guatemala Hong Kong Hungary
## Action 1 93 0 3 0 246 2
## Adventure 0 63 0 1 0 13 8
## Animation 0 34 0 0 0 3 6
## Comedy 4 283 0 33 0 35 15
## Crime 0 41 0 3 0 21 7
## Documentary 2 113 0 3 0 3 2
## Drama 9 515 0 57 1 69 54
## Family 0 24 0 0 0 0 1
## Fantasy 0 32 0 3 0 14 0
## Foreign 1 1 0 1 0 3 1
## History 0 11 0 0 0 0 2
## Horror 1 47 0 2 0 16 0
## Music 0 11 0 1 0 2 1
## Mystery 0 20 0 1 0 0 1
## Romance 3 40 1 6 0 11 4
## Science Fiction 0 13 0 2 0 4 0
## Thriller 0 42 1 6 0 18 5
## TV Movie 0 8 0 0 0 0 0
## Unspecified 0 28 0 8 0 7 9
## War 0 11 0 1 0 3 3
## Western 0 5 0 0 0 0 0
##
## Iceland India Indonesia Iran Iraq Ireland Israel Italy
## Action 1 141 6 2 0 6 7 119
## Adventure 2 13 0 1 0 5 1 47
## Animation 0 8 0 0 0 2 0 6
## Comedy 7 120 3 4 0 29 18 360
## Crime 0 31 1 1 0 1 1 61
## Documentary 4 10 0 7 1 4 5 28
## Drama 21 257 11 68 1 60 48 361
## Family 2 7 0 3 0 3 0 4
## Fantasy 0 7 0 0 0 3 0 17
## Foreign 0 15 2 0 0 0 0 9
## History 1 4 0 0 0 1 2 18
## Horror 2 19 3 0 0 9 4 123
## Music 1 12 0 0 0 1 0 1
## Mystery 0 6 0 0 0 0 1 25
## Romance 0 50 0 2 0 5 2 49
## Science Fiction 1 2 0 0 0 2 1 21
## Thriller 0 50 1 1 0 2 0 55
## TV Movie 0 0 0 0 0 0 0 0
## Unspecified 0 29 1 2 0 1 3 89
## War 0 2 0 0 0 2 3 17
## Western 0 0 0 0 0 0 0 68
##
## Jamaica Japan Jordan Kazakhstan Kyrgyz Republic
## Action 2 267 0 4 0
## Adventure 0 72 1 0 0
## Animation 0 180 0 0 0
## Comedy 0 118 2 1 1
## Crime 0 42 0 1 0
## Documentary 0 26 0 0 0
## Drama 1 378 1 4 4
## Family 0 5 0 0 0
## Fantasy 0 53 0 0 0
## Foreign 0 23 0 0 0
## History 0 14 0 0 0
## Horror 0 92 0 0 0
## Music 1 7 0 0 0
## Mystery 0 16 0 0 0
## Romance 0 39 0 0 0
## Science Fiction 0 50 0 0 0
## Thriller 0 37 0 1 0
## TV Movie 0 0 0 0 0
## Unspecified 0 60 0 1 0
## War 0 11 0 0 0
## Western 0 0 0 0 0
##
## Lao People s Democratic Republic Latvia Lebanon Liberia
## Action 0 1 0 0
## Adventure 0 0 1 0
## Animation 0 0 0 0
## Comedy 0 2 2 0
## Crime 0 0 0 0
## Documentary 0 5 1 1
## Drama 1 9 1 1
## Family 0 0 0 0
## Fantasy 0 0 0 0
## Foreign 0 0 0 0
## History 0 0 0 0
## Horror 0 0 0 0
## Music 0 0 0 0
## Mystery 0 0 0 0
## Romance 0 1 1 0
## Science Fiction 0 0 0 0
## Thriller 0 0 0 0
## TV Movie 0 0 0 0
## Unspecified 0 2 0 0
## War 0 0 0 0
## Western 0 0 0 0
##
## Libyan Arab Jamahiriya Liechtenstein Lithuania Luxembourg
## Action 1 0 3 4
## Adventure 2 0 0 1
## Animation 0 0 0 1
## Comedy 0 0 1 2
## Crime 0 0 1 1
## Documentary 0 0 0 0
## Drama 0 0 7 5
## Family 0 0 0 0
## Fantasy 0 0 0 1
## Foreign 0 0 0 0
## History 0 0 1 0
## Horror 0 1 1 3
## Music 0 0 0 0
## Mystery 0 0 0 2
## Romance 0 0 0 2
## Science Fiction 0 0 0 2
## Thriller 0 0 0 1
## TV Movie 0 0 0 0
## Unspecified 0 0 2 1
## War 0 0 1 1
## Western 0 0 0 0
##
## Macedonia Malaysia Mali Malta Martinique Mauritania Mexico
## Action 1 1 0 1 0 0 18
## Adventure 1 0 0 0 0 0 10
## Animation 0 0 0 0 0 0 1
## Comedy 0 1 0 0 0 0 43
## Crime 0 0 0 0 0 0 9
## Documentary 0 1 0 0 0 0 11
## Drama 4 1 1 1 1 3 91
## Family 0 0 0 0 0 0 1
## Fantasy 0 0 0 0 0 0 3
## Foreign 0 0 0 0 0 0 2
## History 0 0 0 0 0 0 2
## Horror 0 0 0 0 0 0 10
## Music 0 0 0 0 0 0 2
## Mystery 1 0 0 0 0 0 1
## Romance 0 0 0 0 0 0 8
## Science Fiction 0 0 0 0 0 0 4
## Thriller 0 2 0 0 0 0 10
## TV Movie 0 0 0 0 0 0 1
## Unspecified 0 0 0 0 0 0 4
## War 0 0 0 0 0 0 1
## Western 0 0 0 0 0 0 4
##
## Monaco Mongolia Montenegro Morocco Myanmar Namibia Nepal
## Action 0 0 0 3 0 1 0
## Adventure 0 0 0 0 0 0 1
## Animation 0 0 0 0 0 0 0
## Comedy 0 0 0 1 0 0 0
## Crime 0 0 0 0 0 0 0
## Documentary 0 2 0 1 0 0 0
## Drama 0 0 1 6 0 0 1
## Family 0 1 0 0 0 0 0
## Fantasy 0 0 0 0 0 0 0
## Foreign 0 0 0 0 0 0 0
## History 0 0 0 1 0 0 0
## Horror 1 0 0 0 0 0 0
## Music 0 0 0 0 0 0 0
## Mystery 0 0 0 0 0 0 0
## Romance 0 0 0 0 0 0 0
## Science Fiction 0 0 0 0 0 0 0
## Thriller 0 0 0 0 0 0 0
## TV Movie 0 0 0 0 0 0 0
## Unspecified 0 0 0 2 1 0 0
## War 0 0 0 0 0 0 0
## Western 0 0 0 0 0 0 0
##
## Netherlands New Zealand Nicaragua Nigeria North Korea Norway
## Action 7 17 0 0 0 13
## Adventure 9 10 0 0 0 7
## Animation 5 0 0 0 0 1
## Comedy 35 13 0 0 0 30
## Crime 4 0 0 0 0 5
## Documentary 23 8 1 0 1 6
## Drama 87 22 0 2 0 46
## Family 5 0 0 0 0 3
## Fantasy 4 5 0 0 0 3
## Foreign 7 1 0 0 0 0
## History 4 0 0 0 0 0
## Horror 5 9 0 0 0 6
## Music 2 1 0 0 0 0
## Mystery 1 0 0 0 0 3
## Romance 7 4 0 1 0 3
## Science Fiction 2 1 0 0 0 0
## Thriller 8 2 0 1 0 8
## TV Movie 0 0 0 0 0 0
## Unspecified 8 0 0 0 0 4
## War 4 1 0 0 0 1
## Western 0 0 0 0 0 0
##
## Pakistan Palestinian Territory Panama Papua New Guinea
## Action 2 0 0 0
## Adventure 0 0 0 0
## Animation 0 0 0 0
## Comedy 0 0 0 1
## Crime 1 0 0 0
## Documentary 2 1 1 0
## Drama 4 4 1 0
## Family 2 0 0 0
## Fantasy 0 0 0 0
## Foreign 0 0 0 0
## History 1 0 0 0
## Horror 1 0 0 0
## Music 0 0 0 0
## Mystery 0 0 0 0
## Romance 0 0 0 0
## Science Fiction 0 0 0 0
## Thriller 0 2 0 0
## TV Movie 0 0 0 0
## Unspecified 1 0 1 0
## War 0 0 0 0
## Western 0 0 0 0
##
## Paraguay Peru Philippines Poland Portugal Puerto Rico Qatar
## Action 1 2 17 13 0 0 0
## Adventure 0 0 1 3 1 0 0
## Animation 0 0 0 5 1 0 0
## Comedy 0 2 12 53 12 0 0
## Crime 0 0 2 6 3 0 1
## Documentary 0 1 0 9 5 1 1
## Drama 0 7 20 100 36 0 5
## Family 0 0 0 0 0 0 0
## Fantasy 0 0 0 3 0 0 0
## Foreign 0 0 2 0 0 0 0
## History 0 0 0 2 2 0 0
## Horror 0 2 10 6 2 0 0
## Music 0 0 0 1 2 0 0
## Mystery 0 1 0 4 2 0 0
## Romance 0 0 2 4 2 2 0
## Science Fiction 0 0 0 8 1 0 0
## Thriller 0 0 1 9 0 0 1
## TV Movie 0 0 0 0 0 0 0
## Unspecified 0 0 2 13 4 1 0
## War 0 0 1 7 1 0 0
## Western 0 0 0 0 0 0 0
##
## Romania Russia Rwanda Samoa Saudi Arabia Senegal Serbia
## Action 10 59 0 0 0 0 4
## Adventure 0 49 0 0 0 0 4
## Animation 1 49 0 0 0 0 1
## Comedy 16 179 0 0 0 2 20
## Crime 2 19 0 0 0 0 1
## Documentary 2 26 0 0 0 0 2
## Drama 29 227 2 1 1 8 20
## Family 0 27 0 0 0 0 0
## Fantasy 0 14 0 0 0 0 0
## Foreign 0 0 0 0 0 0 0
## History 3 11 0 0 0 0 0
## Horror 11 7 0 0 0 0 2
## Music 0 2 0 0 0 0 1
## Mystery 0 13 0 0 0 0 0
## Romance 0 42 0 0 0 0 3
## Science Fiction 1 10 0 0 0 0 0
## Thriller 4 17 0 0 0 0 1
## TV Movie 0 4 0 0 0 0 0
## Unspecified 6 23 0 0 0 0 0
## War 2 22 0 0 0 0 6
## Western 0 0 0 0 0 0 0
##
## Singapore Slovakia Slovenia South Africa South Korea
## Action 4 0 0 15 88
## Adventure 1 0 0 4 7
## Animation 1 1 0 1 10
## Comedy 2 0 2 13 38
## Crime 0 0 2 2 25
## Documentary 1 2 2 3 5
## Drama 5 4 18 10 135
## Family 1 0 0 1 7
## Fantasy 0 0 0 0 7
## Foreign 0 0 0 0 6
## History 0 0 0 0 4
## Horror 0 0 0 1 32
## Music 0 0 2 1 1
## Mystery 1 0 2 0 9
## Romance 0 2 2 0 24
## Science Fiction 0 0 0 4 4
## Thriller 1 0 0 7 41
## TV Movie 0 0 0 0 0
## Unspecified 0 0 0 0 11
## War 0 2 0 2 3
## Western 0 0 0 1 0
##
## Soviet Union Spain Sri Lanka Sweden Switzerland
## Action 0 30 1 46 6
## Adventure 2 16 0 17 1
## Animation 0 11 0 5 3
## Comedy 1 123 0 163 17
## Crime 0 19 0 30 0
## Documentary 3 25 1 45 18
## Drama 7 177 1 288 33
## Family 0 2 0 15 2
## Fantasy 0 8 0 7 0
## Foreign 0 4 0 0 0
## History 0 6 0 4 1
## Horror 0 64 0 22 3
## Music 0 1 0 6 3
## Mystery 0 15 0 8 0
## Romance 1 14 0 12 2
## Science Fiction 1 5 0 2 0
## Thriller 0 39 0 36 4
## TV Movie 0 0 0 1 0
## Unspecified 2 25 0 16 4
## War 0 5 0 5 2
## Western 0 12 0 0 0
##
## Syrian Arab Republic Taiwan Tajikistan Tanzania Thailand
## Action 0 9 0 0 21
## Adventure 0 0 0 0 3
## Animation 0 0 0 0 0
## Comedy 0 15 0 0 14
## Crime 0 4 0 0 1
## Documentary 0 1 0 1 2
## Drama 1 43 2 0 22
## Family 0 0 0 0 1
## Fantasy 0 0 0 0 2
## Foreign 0 1 0 0 1
## History 0 0 0 0 1
## Horror 0 1 0 0 8
## Music 0 1 0 0 1
## Mystery 0 1 0 0 1
## Romance 0 4 0 0 4
## Science Fiction 0 0 0 0 1
## Thriller 0 2 0 0 6
## TV Movie 0 0 0 0 0
## Unspecified 0 6 0 0 1
## War 0 0 0 0 0
## Western 0 0 0 0 0
##
## Trinidad and Tobago Tunisia Turkey Uganda Ukraine
## Action 0 0 7 1 2
## Adventure 1 0 5 0 1
## Animation 0 0 0 0 0
## Comedy 0 0 33 0 6
## Crime 0 0 4 0 0
## Documentary 0 0 1 1 1
## Drama 0 3 40 0 17
## Family 0 0 1 0 1
## Fantasy 0 0 2 0 0
## Foreign 0 0 0 0 0
## History 0 0 1 0 2
## Horror 0 0 5 0 0
## Music 0 0 2 0 0
## Mystery 0 0 3 0 0
## Romance 1 0 9 0 0
## Science Fiction 0 0 1 0 1
## Thriller 0 0 0 0 1
## TV Movie 0 0 0 0 0
## Unspecified 0 0 20 0 1
## War 0 0 0 0 0
## Western 0 0 0 0 0
##
## United Arab Emirates United Kingdom
## Action 1 227
## Adventure 0 129
## Animation 0 41
## Comedy 2 523
## Crime 1 154
## Documentary 2 228
## Drama 5 899
## Family 0 27
## Fantasy 0 56
## Foreign 1 3
## History 0 24
## Horror 1 250
## Music 0 41
## Mystery 0 48
## Romance 0 80
## Science Fiction 0 48
## Thriller 0 170
## TV Movie 0 31
## Unspecified 0 38
## War 0 50
## Western 0 6
##
## United States Minor Outlying Islands United States of America
## Action 0 2152
## Adventure 0 685
## Animation 0 474
## Comedy 0 4313
## Crime 0 852
## Documentary 1 1201
## Drama 0 4170
## Family 0 224
## Fantasy 0 294
## Foreign 0 2
## History 0 73
## Horror 0 1322
## Music 0 234
## Mystery 0 245
## Romance 0 410
## Science Fiction 0 306
## Thriller 0 644
## TV Movie 0 212
## Unspecified 0 213
## War 0 113
## Western 0 300
##
## Uruguay Uzbekistan Venezuela Vietnam Yugoslavia
## Action 1 0 0 1 1
## Adventure 0 1 0 0 0
## Animation 0 1 0 0 0
## Comedy 1 0 1 0 0
## Crime 0 0 1 1 0
## Documentary 0 0 1 0 0
## Drama 6 2 6 6 2
## Family 0 0 0 0 0
## Fantasy 0 0 0 0 0
## Foreign 0 0 1 0 0
## History 0 0 0 0 0
## Horror 0 0 1 0 0
## Music 0 0 0 0 0
## Mystery 0 0 0 0 0
## Romance 0 0 0 0 0
## Science Fiction 0 0 0 0 0
## Thriller 0 0 0 0 0
## TV Movie 0 0 0 0 0
## Unspecified 1 0 0 0 0
## War 0 0 0 0 1
## Western 0 0 0 0 0
With this cross tabulation we can see which genres each country produces the most. The United States, the major producer, makes comedy movies the most. Would those be the movies that make the most revenue?
For the following charts and graphs I used the esquisse package in order to build more complex visualizations; the function call remains as a comment so that it does not load every time I run the code.
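A cross tabulation of this kind, plus the most frequent genre per country, can be sketched as follows. The data frame `toy` and its values are made up for illustration; only the column names `genre1` and `country1` come from the report.

```r
# Toy data with one genre and one country per movie (hypothetical values)
toy <- data.frame(
  genre1   = c("Comedy", "Comedy", "Drama", "Drama", "Drama", "Action"),
  country1 = c("United States of America", "United States of America",
               "United States of America", "France", "France", "Japan"),
  stringsAsFactors = FALSE
)

# Cross tabulation: genres in rows, countries in columns
genre_by_country <- table(toy$genre1, toy$country1)
print(genre_by_country)

# Most frequent genre for each country (the first genre wins on ties)
top_genre_per_country <- apply(
  genre_by_country, 2,
  function(col) rownames(genre_by_country)[which.max(col)]
)
print(top_genre_per_country)
```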
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
We can see that a higher budget will probably bring a
higher revenue, yet there seems to be a limit: beyond
a certain budget, the cost could be too high to make a profit out
of it.
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
By far, adventure movies make the most revenue.
With this past analysis we now know 3 things:
With that, we can conclude that an adventure movie is the most likely to be a success.
The histogram visualizes the distribution of movie runtimes across
different revenue ranges. A movie in the very high range
made a lot of revenue. We can see that most movies made medium revenue,
which is around 10,000,000.
## [1] "Mean runtime (filtered): 96.48"
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Calculate the mean runtime of filtered movies
mean_runtime_filtered <- mean(filtered_movies$runtime, na.rm = TRUE)
print(paste("Mean runtime (filtered):", round(mean_runtime_filtered, 2)))
## [1] "Mean runtime (filtered): 96.48"
With this past runtime analysis, we can conclude that for
a movie to be a success, or in other words to have a higher revenue, it
should last between 96 and 150 minutes. A good sweet spot for the best
movie would be two hours.
This was to be expected, yet it supports the theory that movies
released during people's vacations are the most profitable. We can see that
movies have more success when released in summer (June & July) and
also in winter (November & December). I would not say holidays are
the best, but vacation time for sure.
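The comparison by release month can be reproduced in a few lines. This is a minimal sketch with made-up dates and revenues; `toy` and `release_month` are illustrative names, not objects from the report.

```r
# Toy release dates and revenues (hypothetical values)
toy <- data.frame(
  release_date = as.Date(c("2015-06-12", "2015-07-03", "2015-02-20",
                           "2016-12-16", "2016-03-04", "2016-06-17")),
  revenue      = c(500, 300, 50, 400, 80, 600)
)

# Month number (1-12) extracted from the release date
toy$release_month <- as.integer(format(toy$release_date, "%m"))

# Mean revenue per release month
revenue_by_month <- aggregate(revenue ~ release_month, data = toy, FUN = mean)

# Show the months from highest to lowest mean revenue
print(revenue_by_month[order(-revenue_by_month$revenue), ])
```

Using the mean per month rather than the total keeps months with many releases from dominating the comparison; the median would be a more robust alternative when a few blockbusters skew the data.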
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 84 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Failed to fit group 1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 84 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Failed to fit group 1.
We can see that in past years the audience was not too interested in movies,
yet their quality was still great. Nowadays people are far more
interested.
Here we can see that “Other” and “No company” are very high in
comparison with the rest of the categories. That is because these
categories group a set of companies that either are not available
in the dataset or are too small to be measured individually. Together
they add up to a large total, but they cannot be considered as one company.
Individually, Paramount Pictures is the most successful company.
NOTE:
This does not mean other production companies are not important; a collaboration between big and small production companies could result in a better match.
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).
In the analysis of production companies, it was found that movies made
in collaboration between three companies are more likely to make the
most revenue. This way we can determine that:
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
## [1] "Correlation coefficient: NA"
## [1] "Mean title length: 16.33"
For movie titles, the analysis was done to determine the number of characters a movie title has to have in order to increase the probability of it being a success. The mean title length is 16.33 characters, so titles with around 16 characters are taken as the best length for a movie title.
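The `NA` correlation coefficient printed above is what `cor()` returns by default when either variable contains a missing value, and the plot warnings report three rows with missing values. A hedged sketch of how the coefficient could be computed on complete pairs only is shown below; the data frame `toy` and its values are made up for illustration.

```r
# Toy titles and revenues, with one missing revenue (hypothetical values)
toy <- data.frame(
  title   = c("Up", "Gabriel", "Hot Stuff", "The Free Will", "Exit Smiling"),
  revenue = c(700, 20, NA, 5, 40),
  stringsAsFactors = FALSE
)
toy$title_length <- nchar(toy$title)

# With a missing revenue, the default call returns NA
cor_default <- cor(toy$title_length, toy$revenue)

# Restricting the calculation to complete pairs gives a usable coefficient
cor_complete <- cor(toy$title_length, toy$revenue, use = "complete.obs")

print(cor_default)
print(cor_complete)
```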
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
In these word clouds we can see the most repeated words across all
movies. The second word cloud shows the words that are
repeated the most with revenue-based weighting, meaning these words
appeared the most in movies with higher revenue. We can see that movies
with more revenue mention the keywords based on novel, woman director and
saving the world, as well as big cities such as Paris, New York and London;
the cities could reflect where the movie was made, its setting, or
another factor.
# Convert budget and revenue to numeric
movies$budget_original <- as.numeric(as.character(movies$budget_original))
movies$revenue_original <- as.numeric(as.character(movies$revenue_original))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 4194725 0 380000000
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000e+00 0.000e+00 0.000e+00 1.114e+07 0.000e+00 2.788e+09 3
## Budget - Mode: 0 , Mean: 4194725 , Median: 0 , Standard Deviation: 17363925 , Minimum: 0 , Maximum: 3.8e+08
## Revenue - Mode: 0 , Mean: 11139954 , Median: 0 , Standard Deviation: 64127446 , Minimum: 0 , Maximum: 2787965087
We could have worked with the data as it was, but we decided to conduct a different analysis by comparing the categories of budget and revenue after excluding zero values. This is because there are many zeros in the data, and we want to have a clearer vision of the cases that do contain complete information.
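The effect of excluding zero values can be sketched on a small example. The data frame `toy` and its numbers are made up; only the column names `budget_original` and `revenue_original` come from the report, and the exact filter used there is not shown, so this is an assumption about its general shape.

```r
# Toy budgets and revenues with many zeros and one NA (hypothetical values)
toy <- data.frame(
  budget_original  = c(0, 0, 8e6, 2e7, 0, 5e6),
  revenue_original = c(0, 1.5e7, 1.6e7, 9e7, 0, NA)
)

# Keep rows where both budget and revenue are present and greater than zero
# (subset() drops rows where the condition is NA)
toy_filtered_both <- subset(toy, budget_original > 0 & revenue_original > 0)

# Summary statistics before and after filtering
print(summary(toy$budget_original))
print(summary(toy_filtered_both$budget_original))
cat("Rows kept:", nrow(toy_filtered_both), "of", nrow(toy), "\n")
```

With the zeros included the median budget is pulled toward 0, which is the same pattern seen in the unfiltered summary above.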
## Budget (Filtered) - Mean: 21565635 , Median: 8e+06 , Standard Deviation: 34286508 , Minimum: 1 , Maximum: 3.8e+08
## Revenue (Filtered) - Mean: 68829646 , Median: 16801877 , Standard Deviation: 146424469 , Minimum: 1 , Maximum: 2787965087
## # A tibble: 2 × 3
## USA mean_revenue median_revenue
## <lgl> <dbl> <dbl>
## 1 FALSE 73431073. 18800000
## 2 TRUE 100046323. 37170057
## Statistic Value
## 1 Mean 90440189.5394444
## 2 Median 29911946
## 3 Mode 1.2e+07
## 4 Standard Deviation 166189478.816842
## 5 Variance 27618942869413484
## 6 IQR 92990947
## 7 Min 1
## 8 Max 2787965087
## 9 Diff 2787965086
## 10 Range 1 to 2787965087
# Filter out outliers
movies_filtered_no_outliers <- movies_filtered_both %>%
filter(!is_outlier)
# Plot density distributions of log-transformed revenue without outliers
ggplot(movies_filtered_no_outliers, aes(x = log(revenue_original), fill = factor(USA))) +
  geom_density(alpha = 0.3) +
  labs(title = "Density Distribution of Log-transformed Revenue (Excluding Outliers)",
       x = "Log Revenue",
       y = "Density",
       fill = "United States of America") +
  scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "blue")) +
  theme_minimal()
# Calculating correlations between numerical variables
correlation_matrix <- cor(movies_filtered_no_outliers[, c("budget_original", "revenue_original", "popularity_max", "vote_average", "vote_count")])
# Scatterplot of budget vs revenue
ggplot(data = movies_filtered_no_outliers, aes(x = budget_original, y = revenue_original)) +
  geom_point(alpha = 0.5) +
  labs(title = "Budget vs Revenue",
       x = "Budget",
       y = "Revenue") +
  theme_minimal()
# Analysis of the relationship between budget and revenue
budget_revenue_lm <- lm(revenue_original ~ budget_original, data = movies_filtered_no_outliers)
summary(budget_revenue_lm)
##
## Call:
## lm(formula = revenue_original ~ budget_original, data = movies_filtered_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -677974787 -40715382 -5447567 15288916 2075081200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.130e+06 2.044e+06 -1.531 0.126
## budget_original 3.021e+00 3.951e-02 76.469 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115500000 on 5197 degrees of freedom
## Multiple R-squared: 0.5294, Adjusted R-squared: 0.5294
## F-statistic: 5847 on 1 and 5197 DF, p-value: < 2.2e-16
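To make the regression output above concrete, the sketch below plugs its reported coefficients into the fitted line. The intercept and slope are copied from the output; the helper `predict_revenue` and the 50 million budget are hypothetical and used only for illustration.

```r
# Coefficients copied from the regression output above
intercept <- -3.130e6
slope     <- 3.021

# Hypothetical helper: fitted revenue for a given budget under this linear model
predict_revenue <- function(budget) intercept + slope * budget

# Fitted revenue for an illustrative budget of 50 million
fitted_50m <- predict_revenue(5e7)
```

The slope implies that each additional dollar of budget is associated with roughly 3.02 dollars of revenue. The intercept is not statistically distinguishable from zero (p = 0.126). With an R-squared of 0.53 and a residual standard error of about 115 million, any single fitted value is a rough estimate. With the fitted model object available, the same value would come from `predict(budget_revenue_lm, newdata = data.frame(budget_original = 5e7))`.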
# Get the top 15 countries by revenue
top_countries <- movies_filtered_no_outliers %>%
  group_by(country1) %>%
  summarize(total_revenue = sum(revenue_original, na.rm = TRUE)) %>%
  top_n(15, total_revenue) %>%
  arrange(desc(total_revenue)) %>%
  pull(country1)
# Filter data for the top 15 countries
movies_filtered_top_countries <- movies_filtered_no_outliers %>%
  filter(country1 %in% top_countries)
# Boxplot of revenue by country for top 15 countries
ggplot(data = movies_filtered_top_countries, aes(x = country1, y = revenue_original, fill = country1)) +
  geom_boxplot() +
  labs(title = "Revenue by Country of Origin (Top 15 Countries)",
       x = "Country",
       y = "Revenue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
# Summary Statistics
summary_stats <- function(data, col) {
  cat("\n--- Summary Statistics for", col, "---\n")
  mean_val <- mean(data[[col]], na.rm = TRUE)
  median_val <- median(data[[col]], na.rm = TRUE)
  mode_val <- get_mode(data[[col]])
  sd_val <- sd(data[[col]], na.rm = TRUE)
  range_val <- range(data[[col]], na.rm = TRUE)
  iqr_val <- IQR(data[[col]], na.rm = TRUE)
  skewness_val <- skewness(data[[col]], na.rm = TRUE)
  kurtosis_val <- kurtosis(data[[col]], na.rm = TRUE)
  cat("Mean:", mean_val, "\nMedian:", median_val, "\nMode:", mode_val,
      "\nStandard Deviation:", sd_val, "\nVariance:", sd_val^2,
      "\nRange: [", range_val[1], ",", range_val[2], "]",
      "\nInterquartile Range:", iqr_val,
      "\nSkewness:", skewness_val, "\nKurtosis:", kurtosis_val, "\n")
}
# Run the summary statistics function for 'budget_original' and 'revenue_original'
summary_stats(movies, "budget_original")
summary_stats(movies, "revenue_original")
##
## --- Summary Statistics for budget_original ---
## Mean: 4194725
## Median: 0
## Mode: 0
## Standard Deviation: 17363925
## Variance: 3.015059e+14
## Range: [ 0 , 3.8e+08 ]
## Interquartile Range: 0
## Skewness: 7.146398
## Kurtosis: 67.14456
##
## --- Summary Statistics for revenue_original ---
## Mean: 11139954
## Median: 0
## Mode: 0
## Standard Deviation: 64127446
## Variance: 4.112329e+15
## Range: [ 0 , 2787965087 ]
## Interquartile Range: 0
## Skewness: 12.28317
## Kurtosis: 238.2217
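The `summary_stats` function above calls `get_mode()`, whose definition does not appear in this section. A minimal sketch of how such a helper might look is given below under a different, hypothetical name (`get_mode_sketch`) so it is not mistaken for the report's own version. It assumes that missing values are ignored and that ties go to the value seen first.

```r
# Hypothetical mode helper: most frequent non-missing value; ties go to the first seen
get_mode_sketch <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  unique_vals <- unique(x)
  counts <- tabulate(match(x, unique_vals))  # occurrences of each unique value
  unique_vals[which.max(counts)]
}

# Toy budgets (invented): zero is the most frequent value
toy_mode <- get_mode_sketch(c(0, 0, 0, 5e6, 2e7, NA))
```

Note that `skewness()` and `kurtosis()` are not part of base R; packages such as `moments` and `e1071` provide functions with these names, and which one the report loaded is not shown here.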
Almost all findings are based on revenue, as I chose it as my dependent variable. Based on that, we can find many interesting and useful insights.
Recipe for a successful movie:
Here are some visualizations used to confirm this movie recipe.
##
## xx 104.0 68.0 82.0 ab af am ar ay bg bm bn bo
## 45 0 0 0 10 2 2 39 1 10 3 29 2
## bs ca cn cs cy da de el en eo es et eu
## 14 12 313 135 1 241 1081 113 32365 1 995 24 3
## fa fi fr fy gl he hi hr hu hy id is it
## 100 308 2438 1 1 67 508 30 102 1 20 24 1532
## iu ja jv ka kk kn ko ku ky la lb lo lt
## 2 1347 1 18 3 3 444 3 3 1 1 2 9
## lv mk ml mn mr ms mt nb ne nl no pa pl
## 18 5 36 2 25 5 1 6 2 248 119 2 218
## ps pt qu ro ru rw sh si sk sl sm sq sr
## 2 316 1 57 826 1 5 1 18 33 1 5 63
## sv ta te tg th tl tr uk ur uz vi wo zh
## 724 78 45 1 75 23 150 16 8 1 10 5 409
## zu
## 1
## [1] "Mean runtime (filtered): 96.48"
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
## [1] "Correlation coefficient: NA"
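A correlation coefficient of `NA` is what `cor()` returns by default when either input contains missing values, which would be consistent with the warnings above about 3 rows with missing values. The sketch below reproduces that behavior on invented toy vectors; which variables the report correlated is not shown here, so the names are hypothetical.

```r
# Toy vectors (invented), each with one missing value
toy_x <- c(90, 100, NA, 120, 150)
toy_y <- c(1e7, 2e7, 3e7, NA, 9e7)

# Default behavior: any NA in the inputs makes the result NA
cor_default <- cor(toy_x, toy_y)

# Restrict the calculation to pairs where both values are present
cor_complete <- cor(toy_x, toy_y, use = "complete.obs")
```

Passing `use = "complete.obs"` (or filtering out the incomplete rows beforehand) yields a numeric coefficient computed from the complete pairs only.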
In conclusion, the analysis of revenue data identifies several factors that influence movie success. English-language films, particularly in the adventure genre, dominate the box office, suggesting a clear preference among audiences. Strategic elements such as runtime, release timing, and partnerships with renowned production companies significantly affect revenue outcomes. Moreover, the presence of female directors emerges as a potentially lucrative path to higher film profitability. These findings reinforce the importance of strategic decision-making and audience-centric content creation in driving revenue growth. By applying these insights, stakeholders can navigate market dynamics more effectively and sustain success and innovation in movie production.
Strategic Partnerships: Establish partnerships with leading production companies like Paramount, Universal, and Disney to leverage their expertise and resources. Additionally, explore collaborations with emerging American production houses to diversify content offerings.
Content Development: Focus on producing English-language adventure films with compelling storylines centered on themes like action, fantasy, science fiction, or family-oriented narratives. Consider adapting popular novels to capitalize on existing fan bases.
Release Strategy: Plan movie releases during the summer season, particularly in June and July, to maximize box office performance. Utilize data-driven insights to identify optimal release dates and avoid clashes with major blockbuster releases.
Directorial Diversity: Encourage diversity in directorial roles by actively seeking opportunities to collaborate with talented female directors. Embrace inclusivity and promote gender diversity in creative decision-making processes.
Market Expansion: Explore opportunities to expand into international markets while maintaining a focus on English-speaking audiences. Tailor marketing strategies and localization efforts to resonate with diverse cultural preferences and sensibilities.
Implementing these recommendations can enhance the overall success and profitability of movie productions, driving sustained growth in the dynamic entertainment industry landscape.
During data cleaning, I looked up several movies that appeared to be silent and confirmed that they were, for example Blacksmith Scene and Le manoir du diable. I also looked up movies with a runtime of zero to check whether that value was genuine and to decide whether I needed to impute it; examples include Torno a vivere da solo, The Black Waters of Echo’s Pond, and Star Force: Fugitive Alien II. Lastly, I looked up the most profitable movies to see whether each director was a man or a woman, using Box Office Mojo, a website that lists the highest-grossing movies and displays their information.