This report examines the key factors that contribute to a movie's success, using the Movies Dataset from Kaggle. It covers data cleaning, categorization, exploration, correlation analysis, and clustering, and closes with actionable insights for major film production companies.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
movies <- read_csv("/Users/eduardosanchez/Desktop/Archive/movies_metadata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 45466 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
## dbl (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
## lgl (2): adult, video
## date (1): release_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
credits <- read_csv("/Users/eduardosanchez/Desktop/Archive/credits.csv")
## Rows: 45476 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): cast, crew
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
keywords <- read_csv("/Users/eduardosanchez/Desktop/Archive/keywords.csv")
## Rows: 46419 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): keywords
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
links_small <- read_csv("/Users/eduardosanchez/Desktop/Archive/links_small.csv")
## Rows: 9125 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): imdbId
## dbl (2): movieId, tmdbId
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ratings_small_csv <- read_csv("/Users/eduardosanchez/Desktop/Archive/ratings_small.csv")
## Rows: 100004 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): userId, movieId, rating, timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(movies)
## Rows: 45,466
## Columns: 24
## $ adult <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ belongs_to_collection <chr> "{'id': 10194, 'name': 'Toy Story Collection', '…
## $ budget <dbl> 30000000, 65000000, 0, 16000000, 0, 60000000, 58…
## $ genres <chr> "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'n…
## $ homepage <chr> "http://toystory.disney.com/toy-story", NA, NA, …
## $ id <dbl> 862, 8844, 15602, 31357, 11862, 949, 11860, 4532…
## $ imdb_id <chr> "tt0114709", "tt0113497", "tt0113228", "tt011488…
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en", "en", …
## $ original_title <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "Wai…
## $ overview <chr> "Led by Woody, Andy's toys live happily in his r…
## $ popularity <dbl> 21.946943, 17.015539, 11.712900, 3.859495, 8.387…
## $ poster_path <chr> "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", "/vzmL6fP7aP…
## $ production_companies <chr> "[{'name': 'Pixar Animation Studios', 'id': 3}]"…
## $ production_countries <chr> "[{'iso_3166_1': 'US', 'name': 'United States of…
## $ release_date <date> 1995-10-30, 1995-12-15, 1995-12-22, 1995-12-22,…
## $ revenue <dbl> 373554033, 262797249, 0, 81452156, 76578911, 187…
## $ runtime <dbl> 81, 104, 101, 127, 106, 170, 127, 97, 106, 130, …
## $ spoken_languages <chr> "[{'iso_639_1': 'en', 'name': 'English'}]", "[{'…
## $ status <chr> "Released", "Released", "Released", "Released", …
## $ tagline <chr> NA, "Roll the dice and unleash the excitement!",…
## $ title <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "Wai…
## $ video <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ vote_average <dbl> 7.7, 6.9, 6.5, 6.1, 5.7, 7.7, 6.2, 5.4, 5.5, 6.6…
## $ vote_count <dbl> 5415, 2413, 92, 34, 173, 1886, 141, 45, 174, 119…
## tibble [45,466 × 3] (S3: tbl_df/tbl/data.frame)
## $ budget : num [1:45466] 3.0e+07 6.5e+07 0.0 1.6e+07 0.0 6.0e+07 5.8e+07 0.0 3.5e+07 5.8e+07 ...
## $ revenue : num [1:45466] 3.74e+08 2.63e+08 0.00 8.15e+07 7.66e+07 ...
## $ release_date: Date[1:45466], format: "1995-10-30" "1995-12-15" ...
## budget revenue release_date
## 3 6 90
## # A tibble: 0 × 6
## # ℹ 6 variables: budget <dbl>, popularity <dbl>, runtime <dbl>, revenue <dbl>,
## # vote_average <dbl>, vote_count <dbl>
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
## [1] 0
##
## iter imp variable
## 1 1
## 1 2
## 1 3
## 1 4
## 1 5
## 2 1
## 2 2
## 2 3
## 2 4
## 2 5
## 3 1
## 3 2
## 3 3
## 3 4
## 3 5
## 4 1
## 4 2
## 4 3
## 4 4
## 4 5
## 5 1
## 5 2
## 5 3
## 5 4
## 5 5
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## corrplot 0.95 loaded
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 1 2 5572 4 12396383 2023
## # A tibble: 19 × 3
## main_genre N r
## <chr> <int> <dbl>
## 1 History 39 0.850
## 2 Fantasy 182 0.777
## 3 Action 1195 0.766
## 4 Science Fiction 116 0.756
## 5 Adventure 474 0.738
## 6 Thriller 259 0.723
## 7 Animation 188 0.720
## 8 Family 79 0.707
## 9 Crime 333 0.685
## 10 Western 44 0.654
## 11 Drama 1934 0.632
## 12 Comedy 1568 0.630
## 13 War 50 0.546
## 14 Mystery 91 0.527
## 15 Romance 176 0.516
## 16 Music 49 0.484
## 17 Documentary 188 0.481
## 18 Horror 403 0.470
## 19 <NA> 23 0.453
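The code that produced the per-genre table above is not shown in this report, and neither is the pair of variables behind `r`. As a minimal sketch, assuming `r` is the Pearson correlation between budget and revenue within each `main_genre`, a grouped summary along these lines would give a table of the same shape. The `toy` data frame and all of its values are invented for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned movies table; the values are made up.
toy <- data.frame(
  main_genre = rep(c("Action", "Drama"), each = 5),
  budget     = c(10, 20, 30, 40, 50, 5, 10, 15, 20, 25),
  revenue    = c(25, 38, 70, 85, 120, 9, 4, 30, 18, 40)
)

genre_cor <- toy %>%
  # Drop rows where either variable is missing before correlating.
  filter(!is.na(budget), !is.na(revenue)) %>%
  group_by(main_genre) %>%
  summarise(
    N = n(),                    # number of movies in the genre
    r = cor(budget, revenue),   # Pearson correlation (the default method)
    .groups = "drop"
  ) %>%
  arrange(desc(r))

genre_cor
```

Correlations for small groups (for example the genres with N below 50 in the table above) are noisy, so they are best read alongside the group size.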
## # A tibble: 21 × 9
## genres count mean_budget median_budget sd_budget iqr_budget min_budget
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adventure 957 63865792. 40000000 62204025. 82250400 5
## 2 Animation 292 63660027. 52000000 54238774. 74500000 30
## 3 Fantasy 510 61990664. 40000000 61886126. 70275000 8
## 4 Family 530 57989444. 40000000 52626867. 62000000 12
## 5 Science Fict… 634 52353195. 30000000 57038951. 64500000 7
## 6 Action 1414 49730128. 30000000 53845276. 59000000 1
## 7 Thriller 1502 32363496. 20000000 36729052. 38000000 1
## 8 War 203 31885662. 18000000 35213561. 38250000 4
## 9 History 235 30050228. 18339750 31938033. 30000000 8
## 10 Western 89 29583281. 10500000 44941203. 31231215 200000
## # ℹ 11 more rows
## # ℹ 2 more variables: max_budget <dbl>, skew_budget <dbl>
## [1] 0.120132
To extend the analysis of movie success factors, K-Means clustering was applied to the dataset using four numerical features: budget, revenue, vote average, and runtime. K-Means is an unsupervised machine learning technique that partitions data points into k clusters by assigning each point to its nearest cluster centroid.
The number of clusters was chosen with the elbow method, which plots the total within-cluster sum of squares against the number of clusters. The optimal k is taken at the "elbow" of that curve, the point beyond which adding clusters yields only small further reductions.
The clustering results were visualized in a scatterplot of budget and revenue that shows the resulting groups. This visualization lets decision-makers see clusters of movies that share similar characteristics, such as high-budget blockbusters or low-budget indie films. By analyzing these clusters, film studios can target specific market segments more effectively and refine their production and marketing strategies.
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 369900)
## Warning: did not converge in 10 iterations
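The clustering code itself is not shown in this report. The sketch below illustrates the elbow method with base R's `kmeans()`, under stated assumptions. The `toy_movies` data are simulated, the features are standardised with `scale()` (whether the original analysis did this is not shown), and k = 3 is an arbitrary illustrative choice. The "did not converge in 10 iterations" warnings above match the default `iter.max = 10`, and the Quick-TRANSfer warning comes from the default Hartigan-Wong algorithm, so the sketch raises `iter.max` and uses several random starts via `nstart`.

```r
# Simulated stand-in for the cleaned movies table (hypothetical values).
set.seed(42)
n <- 300
toy_movies <- data.frame(
  budget       = c(rlnorm(n / 2, meanlog = 15, sdlog = 0.5),
                   rlnorm(n / 2, meanlog = 18, sdlog = 0.4)),
  vote_average = runif(n, 3, 9),
  runtime      = rnorm(n, 105, 15)
)
toy_movies$revenue <- toy_movies$budget * rlnorm(n, meanlog = 0.5, sdlog = 0.6)

# Standardise the features so dollar-valued columns do not dominate the distance.
features <- scale(toy_movies[, c("budget", "revenue", "vote_average", "runtime")])

# Elbow method: total within-cluster sum of squares for k = 1..8.
# iter.max and nstart are raised above their defaults (10 and 1).
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 25, iter.max = 100)$tot.withinss
})

plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Fit the final model; k = 3 is chosen here only for illustration.
fit <- kmeans(features, centers = 3, nstart = 25, iter.max = 100)
toy_movies$cluster <- factor(fit$cluster)
table(toy_movies$cluster)
```

If the Quick-TRANSfer warning persists on a large dataset, passing `algorithm = "Lloyd"` or `algorithm = "MacQueen"` to `kmeans()` is one option to try.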
-> Movies with larger budgets typically yield higher revenues.
-> Action, Animation, and Sci-Fi genres consistently outperform others in terms of revenue generation.
Film studios should therefore choose genres and budget ranges that match these high-performing segments, and align their marketing strategies to target them.