Introduction
This report was produced in the R programming language, with the
objective of finding out which variables determine the success of a
movie. To answer this question, we used the file movies_metadata.csv,
which was obtained from Kaggle
(https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data).
The full dataset contains a registry of over 45,000 movies, 270,000
users and over 26 million ratings. This is important because we are
looking for concrete answers to our question, and a large amount of
data makes the results less sensitive to individual outliers.
Objectives and Variables
As previously mentioned, the main objective is to determine what
makes (or doesn’t make) a movie successful. To do this, we looked at the
variables within this dataset. These variables are the following: adult,
belongs_to_collection, budget, genres, homepage, id, imdb_id,
original_language, original_title, overview, popularity, poster_path,
production_companies, production_countries, release_date, revenue,
runtime, spoken_languages, status, tagline, title, video, vote_average
and vote_count. These variables range from free-text descriptions to
numerical values, so each type had to be handled and judged
differently. Determining the success of a movie does not require all of
these variables, so a cleaning process was needed to narrow them down.
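As a minimal sketch of what handling each type differently can look like, the snippet below converts text columns to numbers and a True/False text column to a logical flag. The data frame `movies_toy`, its values, and the helper `to_number` are made up for illustration and are not taken from the report's own code.

```r
# Toy stand-in for a few movies_metadata.csv columns (values are invented).
movies_toy <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  budget = c("30000000", "0", "not_a_number"),
  popularity = c("21.9", "3.8", "1.2"),
  adult = c("False", "True", "False"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: text that cannot be parsed becomes NA.
# suppressWarnings() hides the coercion warning raised by as.numeric().
to_number <- function(x) suppressWarnings(as.numeric(x))

movies_toy$budget <- to_number(movies_toy$budget)
movies_toy$popularity <- to_number(movies_toy$popularity)
movies_toy$adult <- movies_toy$adult == "True"  # logical flag

str(movies_toy)
```

Converting up front means later calls such as `mean()` receive numeric input rather than text.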
Cleaning Process
For this cleaning process, we wrote code to filter out data that is
incorrect, invalid or simply unnecessary for our analysis. To keep the
data as well arranged as possible, I began by ordering the columns by
importance, in the way I thought would be easiest for the reader to
follow. As a result, the first 2 columns are the name of the movie and
its identification number. Next, I checked for NA values, that is,
values that weren’t recorded properly and would complicate the
analysis, and deleted the rows that contained them. I then checked for
duplicate rows and deleted the copies, so that there is only one row
for each movie. Finally, I cleaned the columns individually:
Adult: All values were assigned to either True or False.
Runtime: Deleted rows that had impossible runtimes (e.g. 0 or 1984920
minutes).
Vote_Average: Replaced values of 0 with the average rating.
Collection: Deleted the parts of the column that I deemed unnecessary
(e.g. parentheses and collection IDs).
Genres, Companies, Countries, Languages: Separated the different values
assigned to each movie into different columns, and graphed the top 20
of each category.
Poster_Path, Tagline, Video: Deleted these columns, since I deemed them
unnecessary.
The remaining columns were left untouched, since I considered them both
clean and relevant to the analysis.
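The duplicate, runtime and rating steps described above can be sketched on a toy data frame as follows. This is only an illustration under stated assumptions: the runtime bounds (`min_runtime`, `max_runtime`) are hypothetical, and the sketch assumes the "average rating" means the mean of the non-zero ratings, which the report does not spell out.

```r
# Toy data (invented values): one duplicated movie, two impossible runtimes.
movies_toy <- data.frame(
  id = c(1, 2, 2, 3, 4, 5),
  title = c("A", "B", "B", "C", "D", "E"),
  runtime = c(95, 120, 120, 0, 1984920, 100),
  vote_average = c(7, 0, 0, 6, 5, 5),
  stringsAsFactors = FALSE
)

# 1. Keep only the first row for each movie id.
movies_toy <- movies_toy[!duplicated(movies_toy$id), ]

# 2. Drop impossible runtimes. The bounds are assumptions for this sketch.
min_runtime <- 1
max_runtime <- 1000
keep <- movies_toy$runtime >= min_runtime & movies_toy$runtime <= max_runtime
movies_toy <- movies_toy[keep, ]

# 3. Replace ratings of 0 with the mean of the non-zero ratings.
rated <- movies_toy$vote_average > 0
movies_toy$vote_average[!rated] <- mean(movies_toy$vote_average[rated])

print(movies_toy)
```

Running the steps in this order matters: the replacement mean in step 3 is computed only from rows that survived steps 1 and 2.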
//////////
# Checking for NAs
columns_to_check <- c("budget", "genres", "homepage", "id", "imdb_id", "original_language",
"original_title", "overview", "popularity", "poster_path",
"production_companies", "production_countries", "release_date",
"revenue", "runtime", "spoken_languages", "status",
"title", "video", "vote_average", "vote_count", "adult")
missing_rows <- which(!complete.cases(movies[, columns_to_check]))
num_missing_rows <- length(missing_rows)
print(paste("Number of missing rows:", num_missing_rows))
## [1] "Number of missing rows: 263"
print("Indices of missing rows:")
## [1] "Indices of missing rows:"
print(missing_rows)
## [1] 635 636 645 803 864 2141 13563 13577 13661 13758 14770 14891
## [13] 15246 15385 15734 16902 17318 17491 17952 18039 18573 18673 18917 18969
## [25] 19316 19372 19730 19731 19875 20167 20380 20643 20712 20827 21103 21251
## [37] 21985 22314 22347 22425 22664 22819 23039 23227 23419 23421 23830 24231
## [49] 25204 25505 25508 25512 25515 25562 25631 25678 25912 25938 26075 27881
## [61] 27968 28133 28246 28274 28277 28416 28658 28801 28952 28965 29228 29265
## [73] 29301 29308 29309 29310 29312 29449 29451 29503 29504 30357 30380 30386
## [85] 30783 30790 30828 30945 31011 31057 31230 31359 31416 31533 31544 31545
## [97] 31557 31581 31588 31589 32004 32065 32156 32177 32424 32498 32499 32509
## [109] 32637 33004 33529 33532 34398 34447 34527 34558 34606 34672 34703 34860
## [121] 35036 35164 35217 35314 35492 35518 35587 35588 35653 35661 35691 35982
## [133] 36218 36348 36352 36403 36422 36461 36592 36858 36864 36900 37077 37085
## [145] 37151 37157 37186 37210 37215 37242 37275 37405 37451 37472 37488 37491
## [157] 37641 37642 37754 37755 37814 38101 38182 38360 38402 38415 39144 39271
## [169] 39302 39365 39367 39497 39568 39678 39721 39723 39852 39971 39972 40152
## [181] 40204 40207 40379 40577 40726 40776 40887 41046 41068 41069 41079 41368
## [193] 41369 41371 41373 41377 41379 41400 41659 41696 41699 41719 41833 41888
## [205] 41922 41923 42069 42150 42151 42198 42199 42258 42290 42298 42321 42345
## [217] 42574 42776 42848 43107 43111 43129 43203 43393 43470 43471 43477 43481
## [229] 43487 43488 43533 43597 43674 43682 43942 43950 43964 43986 44017 44066
## [241] 44121 44152 44383 44395 44424 44560 44568 44574 44593 44784 44878 44912
## [253] 44960 44989 45016 45055 45063 45232 45247 45311 45314 45378 45411
# Cleaning NAs
movies_clean <- movies[complete.cases(movies[, columns_to_check]), ]
print("Dimensions of cleaned dataframe:")
## [1] "Dimensions of cleaned dataframe:"
print(dim(movies_clean))
## [1] 45203 24
//////////
/////////
# Adult: any value other than "True" or "False" is set to "False"
movies_clean$adult <- ifelse(movies_clean$adult != "True" & movies_clean$adult != "False", "False", movies_clean$adult)
//////////
Statistical and Analytical Techniques
After concluding the cleaning process, I began to plot the data. The
first 4 graphs are the ones mentioned previously: the 20 most common
Genres, Companies, Countries and Languages. These graphs are extremely
simple, but I considered them essential to the analysis, since the top
values account for the majority of the movies in the dataset. Less
frequent values tend to be much less significant, so the most common
values in these columns are of the utmost importance. These are the
graphs I obtained:
barplot(top_20_studios, main = "Top 20 Most Common Studios", xlab = "Studios", ylab = "Count", las = 2, col = "forestgreen")

# New countries (Top 20 countries graph)
library(stringr)  # str_split_fixed, str_extract, str_remove
library(dplyr)    # %>%, mutate
new_countries <- str_split_fixed(movies_clean$production_countries, "\\},", n = Inf)
new_countries <- new_countries[, 1:7]
new_countries <- as.data.frame(new_countries)
new_countries <- new_countries %>%
  mutate(country1 = str_extract(V1, "'name':\\s+'([[:alnum:]\\s]+)'"),
         country2 = str_extract(V2, "'name':\\s+'([[:alnum:]\\s]+)'"),
         country3 = str_extract(V3, "'name':\\s+'([[:alnum:]\\s]+)'"),
         country4 = str_extract(V4, "'name':\\s+'([[:alnum:]\\s]+)'"),
         country5 = str_extract(V5, "'name':\\s+'([[:alnum:]\\s]+)'"),
         country6 = str_extract(V6, "'name':\\s+'([[:alnum:]\\s]+)'"),
         country7 = str_extract(V7, "'name':\\s+'([[:alnum:]\\s]+)'"))
new_countries$country1 <- str_remove(new_countries$country1, "'name':\\s+")
new_countries$country2 <- str_remove(new_countries$country2, "'name':\\s+")
new_countries$country3 <- str_remove(new_countries$country3, "'name':\\s+")
new_countries$country4 <- str_remove(new_countries$country4, "'name':\\s+")
new_countries$country5 <- str_remove(new_countries$country5, "'name':\\s+")
new_countries$country6 <- str_remove(new_countries$country6, "'name':\\s+")
new_countries$country7 <- str_remove(new_countries$country7, "'name':\\s+")
all_countries <- c(new_countries$country1, new_countries$country2, new_countries$country3,
                   new_countries$country4, new_countries$country5, new_countries$country6,
                   new_countries$country7)
country_counts <- table(all_countries)
country_counts <- sort(country_counts, decreasing = TRUE)
top_20_countries <- head(country_counts, 20)
print("Counts of Each Country:")
## [1] "Counts of Each Country:"
print(top_20_countries)
## all_countries
## 'United States of America' 'United Kingdom'
## 21128 4084
## 'France' 'Germany'
## 3915 2211
## 'Italy' 'Canada'
## 2132 1762
## 'Japan' 'Spain'
## 1642 944
## 'Russia' 'India'
## 907 826
## 'Hong Kong' 'Sweden'
## 595 579
## 'Australia' 'South Korea'
## 570 493
## 'Belgium' 'Denmark'
## 443 384
## 'China' 'Netherlands'
## 372 372
## 'Finland' 'Mexico'
## 371 325
barplot(top_20_countries, main = "Top 20 Most Common Countries", xlab = "Countries", ylab = "Count", las = 2, col = "red")

# New languages (Top 20 languages graph)
new_languages <- str_split_fixed(movies_clean$spoken_languages, "\\},", n = Inf)
new_languages <- new_languages[, 1:7]
new_languages <- as.data.frame(new_languages)
new_languages <- new_languages %>%
  mutate(language1 = str_extract(V1, "'name':\\s+'([[:alnum:]\\s]+)'"),
         language2 = str_extract(V2, "'name':\\s+'([[:alnum:]\\s]+)'"),
         language3 = str_extract(V3, "'name':\\s+'([[:alnum:]\\s]+)'"),
         language4 = str_extract(V4, "'name':\\s+'([[:alnum:]\\s]+)'"),
         language5 = str_extract(V5, "'name':\\s+'([[:alnum:]\\s]+)'"),
         language6 = str_extract(V6, "'name':\\s+'([[:alnum:]\\s]+)'"),
         language7 = str_extract(V7, "'name':\\s+'([[:alnum:]\\s]+)'"))
new_languages$language1 <- str_remove(new_languages$language1, "'name':\\s+")
new_languages$language2 <- str_remove(new_languages$language2, "'name':\\s+")
new_languages$language3 <- str_remove(new_languages$language3, "'name':\\s+")
new_languages$language4 <- str_remove(new_languages$language4, "'name':\\s+")
new_languages$language5 <- str_remove(new_languages$language5, "'name':\\s+")
new_languages$language6 <- str_remove(new_languages$language6, "'name':\\s+")
new_languages$language7 <- str_remove(new_languages$language7, "'name':\\s+")
all_languages <- c(new_languages$language1, new_languages$language2, new_languages$language3,
                   new_languages$language4, new_languages$language5, new_languages$language6,
                   new_languages$language7)
language_counts <- table(all_languages)
language_counts <- sort(language_counts, decreasing = TRUE)
top_20_languages <- head(language_counts, 20)
print("Counts of Each Language:")
## [1] "Counts of Each Language:"
print(top_20_languages)
## all_languages
## 'English' 'Français' 'Deutsch' 'Español' 'Italiano'
## 28716 4173 2590 2387 2338
## '日本語' 'Pусский' '普通话' 'Português' 'svenska'
## 1755 1547 788 584 553
## 'Polski' 'suomi' 'Magyar' 'العربية' 'Nederlands'
## 520 360 359 340 333
## 'No Language' 'Dansk' 'Český' 'Türkçe' 'עִבְרִית'
## 319 301 283 238 214
barplot(top_20_languages, main = "Top 20 Most Common Spoken Languages", xlab = "Languages", ylab = "Count", las = 2, col = "skyblue")
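
The country and language blocks above split each entry into seven fixed columns. As an alternative sketch (not the method used in this report), every name can be pulled out in one pass with base R's `gregexpr()` and `regmatches()`. The helper `extract_names` and the toy strings are hypothetical. The pattern assumes exactly one space after `'name':` and names wrapped in single quotes, so names that contain an apostrophe would be missed.

```r
# Hypothetical helper: return every value that follows "'name': '" in each string.
extract_names <- function(x) {
  # The lookbehind keeps only the text between "'name': '" and the closing
  # quote; perl = TRUE is required for lookbehind support.
  regmatches(x, gregexpr("(?<='name': ')[^']+", x, perl = TRUE))
}

# Toy entries in the same shape as the production_countries column.
countries_toy <- c(
  "[{'iso_3166_1': 'US', 'name': 'United States of America'}, {'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",
  "[{'iso_3166_1': 'US', 'name': 'United States of America'}]",
  "[]"
)

names_per_movie <- extract_names(countries_toy)  # list, one element per movie
toy_counts <- sort(table(unlist(names_per_movie)), decreasing = TRUE)
print(toy_counts)
```

Because the result is a list with one element per movie, a movie with more than seven entries is not truncated, and the names come out without surrounding quotes.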

Next, I made a table that lists the different countries along with the
average revenue and runtime for each. I considered these values
important as well, since different countries tend to have different
preferences when it comes to the duration of films, and they also have
different amounts of resources, which tends to have an impact on how
much a movie earns in sales. This is what the table looks like:
# Runtime and revenue data
all_countries <- c(new_countries$country1, new_countries$country2, new_countries$country3,
                   new_countries$country4, new_countries$country5, new_countries$country6,
                   new_countries$country7)
# all_countries stacks country1 for every movie, then country2, and so on,
# so the ids must repeat as a whole block (times = 7) to stay aligned
movie_country <- data.frame(movie_id = rep(movies_clean$id, times = 7), country = all_countries)
movie_country_revenue_runtime <- left_join(movie_country, movies_clean, by = c("movie_id" = "id"))
country_stats <- movie_country_revenue_runtime %>%
  group_by(country) %>%
  summarize(average_revenue = mean(revenue, na.rm = TRUE),
            average_runtime = mean(runtime, na.rm = TRUE),
            average_budget = mean(budget, na.rm = TRUE))
## Warning: There were 159 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `average_budget = mean(budget, na.rm = TRUE)`.
## ℹ In group 1: `country = "'Afghanistan'"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 158 remaining warnings.
print(country_stats)
## # A tibble: 159 × 4
## country average_revenue average_runtime average_budget
## <chr> <dbl> <dbl> <dbl>
## 1 'Afghanistan' 18780959. 113. NA
## 2 'Albania' 0 95.3 NA
## 3 'Algeria' 65245073. 91.5 NA
## 4 'Angola' 41241269 90.3 NA
## 5 'Antarctica' 0 95 NA
## 6 'Argentina' 20087200. 104. NA
## 7 'Armenia' 37011790. 107. NA
## 8 'Aruba' 483333. 103. NA
## 9 'Australia' 30705248. 106. NA
## 10 'Austria' 17337497. 104. NA
## # ℹ 149 more rows
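The warning above indicates that budget is stored as text, which is why average_budget comes out as NA. A minimal base R sketch of the same per-country summary is shown below, assuming a toy data frame `movies_toy` and a hypothetical list `countries_per_movie` that holds the country names of each movie. It builds one row per (movie, country) pair and converts budget to numeric before averaging.

```r
# Toy data (invented values); budget is text, as in the warning above.
movies_toy <- data.frame(
  id = c(1, 2, 3),
  revenue = c(100, 50, 0),
  runtime = c(90, 120, 100),
  budget = c("40", "10", "0"),
  stringsAsFactors = FALSE
)
countries_per_movie <- list(c("France", "Italy"), "France", character(0))

# One row per (movie, country) pair. Each id is repeated once per country
# it has, so ids and countries stay aligned.
movie_country <- data.frame(
  movie_id = rep(movies_toy$id, times = lengths(countries_per_movie)),
  country = unlist(countries_per_movie),
  stringsAsFactors = FALSE
)

merged <- merge(movie_country, movies_toy, by.x = "movie_id", by.y = "id")
merged$budget <- as.numeric(merged$budget)  # text to number before averaging

# Mean of each measure per country. Note that the formula interface of
# aggregate() drops rows with missing values by default.
country_stats_toy <- aggregate(cbind(revenue, runtime, budget) ~ country,
                               data = merged, FUN = mean)
print(country_stats_toy)
```

Movies without any country simply contribute no rows, so no empty country group appears in the summary.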
After obtaining this table, I plotted it as scatter plots, where we can
see the top 20 countries by revenue and by runtime, as well as the
bottom 20 countries. These are the graphs:
# Highest runtime and revenues scatterplots
top_20_countries <- head(country_stats[order(country_stats$average_revenue, decreasing = TRUE),], 20)
ggplot(top_20_countries, aes(x = country, y = average_revenue)) +
  geom_point() +
  labs(title = "Top 20 Countries: Average Revenue",
       x = "Country", y = "Average Revenue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

top_20_countries <- head(country_stats[order(country_stats$average_runtime, decreasing = TRUE),], 20)
ggplot(top_20_countries, aes(x = country, y = average_runtime)) +
  geom_point() +
  labs(title = "Top 20 Countries: Average Runtime",
       x = "Country", y = "Average Runtime") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Finally, I obtained the average rating for each studio by combining the
ratings of all of its movies. To do this, I applied a filter to only
include studios with at least 10 movies. This helps us focus on
established studios, since some studios might only have 1 or 2 movies,
which would increase the margin of error in the analysis. After
obtaining this average, I graphed the 20 highest and the 20 lowest
rated studios. This is what the graphs look like:
# Top 20 best and worst rated studios (at least 10 movies)
studio_ratings_filtered <- studio_ratings_filtered %>%
  filter(!is.na(average_vote_average))
top_studios <- head(studio_ratings_filtered, 20)
bottom_studios <- tail(studio_ratings_filtered, 20)
ggplot(top_studios, aes(x = average_vote_average, y = reorder(studio, average_vote_average))) +
  geom_bar(stat = "identity", fill = "darkblue") +
  labs(title = "Top 20 Studios by Average Ratings (with at least 10 movies)",
       x = "Average Vote Average", y = "Studio") +
  theme(axis.text.y = element_text(angle = 0, hjust = 1))

ggplot(bottom_studios, aes(x = average_vote_average, y = reorder(studio, average_vote_average))) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(title = "Bottom 20 Studios by Average Ratings (with at least 10 movies)",
       x = "Average Vote Average", y = "Studio") +
  theme(axis.text.y = element_text(angle = 0, hjust = 1))
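
The code that builds studio_ratings_filtered is not shown in this report. One way such a table could be built is sketched below in base R, assuming a hypothetical data frame `studio_movies` with one row per (studio, movie) pair. The threshold is lowered to 2 so that the toy data passes the filter; the report uses 10.

```r
# Toy data (invented values): one row per studio and movie.
studio_movies <- data.frame(
  studio = c("Studio A", "Studio A", "Studio A", "Studio B", "Studio C", "Studio C"),
  vote_average = c(7, 8, 6, 9, 4, 5),
  stringsAsFactors = FALSE
)
min_movies <- 2  # the report uses 10; lowered here for the toy data

# Mean rating and number of movies per studio (both in the same studio order).
mean_by_studio <- tapply(studio_movies$vote_average, studio_movies$studio, mean)
count_by_studio <- tapply(studio_movies$vote_average, studio_movies$studio, length)

studio_ratings <- data.frame(
  studio = names(mean_by_studio),
  average_vote_average = as.numeric(mean_by_studio),
  n_movies = as.integer(count_by_studio),
  stringsAsFactors = FALSE
)

# Keep studios with enough movies, then sort from best to worst rated so
# that head() gives the top studios and tail() the bottom ones.
enough <- studio_ratings$n_movies >= min_movies
studio_ratings_toy <- studio_ratings[enough, ]
studio_ratings_toy <- studio_ratings_toy[
  order(studio_ratings_toy$average_vote_average, decreasing = TRUE), ]
print(studio_ratings_toy)
```

Sorting in descending order is what makes `head()` and `tail()` meaningful as "top" and "bottom" selections.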

Findings and Results
After these graphs and charts, we can finally view our findings. The
first finding is that the genre with the most movies is Drama. Based on
this, we can assume 2 things. First, that studios keep making
Drama-oriented movies because there is demand for this genre. Second,
that Drama is the most successful genre, at least in terms of volume,
considering that over 18 thousand movies have this genre as one of
their main identifiers.
For our next finding, we can see which studios are the most
successful. The top 3 studios in terms of movies made are Paramount, Fox
and Universal, which are extremely well known at a global level. The
interesting detail is that these 3 studios are all from the United
States. Moreover, out of the top 10 studios with the most movies made,
9 are from the US. This suggests that movies made in this country tend
to have higher success rates than others. We can see this graphically:
in the “Most Common Countries” graph, the United States has over 5
times as many movies as the second place, the United Kingdom (21,128
versus 4,084).
This same argument carries over to the next graph. We can see that
the English language is the most common by an incredibly wide margin.
This is not only due to the fact that the USA is the country with the
most movies made, but also that English tends to be considered the
“World’s Language”, and so people from all over the world seem to be
attracted towards movies that have English as one of their options.
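The size of these margins can be checked with a few lines of arithmetic. The sketch below uses only figures printed earlier in this report: the country and language counts and the 45,203 rows of the cleaned data frame. It treats the English count as one occurrence per movie, which is an assumption.

```r
# Counts copied from the printed output above.
us_movies <- 21128       # 'United States of America'
uk_movies <- 4084        # 'United Kingdom'
english_movies <- 28716  # 'English'
french_movies <- 4173    # 'Français'
n_movies <- 45203        # rows in movies_clean

us_to_uk <- us_movies / uk_movies
english_to_french <- english_movies / french_movies
english_share <- 100 * english_movies / n_movies

round(us_to_uk, 1)           # 5.2
round(english_to_french, 1)  # 6.9
round(english_share, 1)      # 63.5
```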
The final observation that can be made is related to the last graphs.
Most of the lower-ranked studios are either not proper studios, or are
based in countries that have fewer resources to make their movies. On
the other hand, most of the studios with the highest average scores
hail from the United States of America.
Conclusions and Recommendations
After this thorough analysis, I have arrived at the conclusion that
the most important factor in a movie’s success is its country of
production. Of course, this mainly points towards the United States,
which can be explained by the fact that it has the highest number of
studios of any country by far. Another contributing factor is that
American studios tend to be more popular. Take Universal Studios, for
example. This studio is so popular that it has pretty much transcended
movie-making, to the point that it even has a theme park based around
it.
The other conclusion that I’ve reached is that for a movie to be
successful, it has to at least have a translation to the English
language. As the language counts show, roughly 64% of the movies in the
cleaned dataset (28,716 of 45,203) list English as a spoken language.
To me, this is a key factor. English is the most widely spoken language
in the world, and so adapting a movie to this language could very well
take it from being a local hit to a worldwide phenomenon.
