This report aims to analyze critical factors behind a movie’s success using the Movies Dataset from Kaggle. It provides comprehensive data cleaning, categorization, exploration, correlation analyses, and clustering, culminating in actionable insights for major film production companies.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
movies <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/movies_metadata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 45466 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
## dbl (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
## lgl (2): adult, video
## date (1): release_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
credits <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/credits.csv")
## Rows: 45476 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): cast, crew
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
keywords <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/keywords.csv")
## Rows: 46419 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): keywords
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
links_small <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/links_small.csv")
## Rows: 9125 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): imdbId
## dbl (2): movieId, tmdbId
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ratings_small_csv <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/ratings_small.csv")
## Rows: 100004 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): userId, movieId, rating, timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(movies)
## Rows: 45,466
## Columns: 24
## $ adult <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ belongs_to_collection <chr> "{'id': 10194, 'name': 'Toy Story Collection', '…
## $ budget <dbl> 30000000, 65000000, 0, 16000000, 0, 60000000, 58…
## $ genres <chr> "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'n…
## $ homepage <chr> "http://toystory.disney.com/toy-story", NA, NA, …
## $ id <dbl> 862, 8844, 15602, 31357, 11862, 949, 11860, 4532…
## $ imdb_id <chr> "tt0114709", "tt0113497", "tt0113228", "tt011488…
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en", "en", …
## $ original_title <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "Wai…
## $ overview <chr> "Led by Woody, Andy's toys live happily in his r…
## $ popularity <dbl> 21.946943, 17.015539, 11.712900, 3.859495, 8.387…
## $ poster_path <chr> "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", "/vzmL6fP7aP…
## $ production_companies <chr> "[{'name': 'Pixar Animation Studios', 'id': 3}]"…
## $ production_countries <chr> "[{'iso_3166_1': 'US', 'name': 'United States of…
## $ release_date <date> 1995-10-30, 1995-12-15, 1995-12-22, 1995-12-22,…
## $ revenue <dbl> 373554033, 262797249, 0, 81452156, 76578911, 187…
## $ runtime <dbl> 81, 104, 101, 127, 106, 170, 127, 97, 106, 130, …
## $ spoken_languages <chr> "[{'iso_639_1': 'en', 'name': 'English'}]", "[{'…
## $ status <chr> "Released", "Released", "Released", "Released", …
## $ tagline <chr> NA, "Roll the dice and unleash the excitement!",…
## $ title <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "Wai…
## $ video <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ vote_average <dbl> 7.7, 6.9, 6.5, 6.1, 5.7, 7.7, 6.2, 5.4, 5.5, 6.6…
## $ vote_count <dbl> 5415, 2413, 92, 34, 173, 1886, 141, 45, 174, 119…
## tibble [45,466 × 3] (S3: tbl_df/tbl/data.frame)
## $ budget : num [1:45466] 3.0e+07 6.5e+07 0.0 1.6e+07 0.0 6.0e+07 5.8e+07 0.0 3.5e+07 5.8e+07 ...
## $ revenue : num [1:45466] 3.74e+08 2.63e+08 0.00 8.15e+07 7.66e+07 ...
## $ release_date: Date[1:45466], format: "1995-10-30" "1995-12-15" ...
## budget revenue release_date
## 3 6 90
## # A tibble: 0 × 6
## # ℹ 6 variables: budget <dbl>, popularity <dbl>, runtime <dbl>, revenue <dbl>,
## # vote_average <dbl>, vote_count <dbl>
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
## [1] 0
##
## iter imp variable
## 1 1
## 1 2
## 1 3
## 1 4
## 1 5
## 2 1
## 2 2
## 2 3
## 2 4
## 2 5
## 3 1
## 3 2
## 3 3
## 3 4
## 3 5
## 4 1
## 4 2
## 4 3
## 4 4
## 4 5
## 5 1
## 5 2
## 5 3
## 5 4
## 5 5
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## corrplot 0.95 loaded
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 1 2 5572 4 12396383 2023
## # A tibble: 19 × 3
## main_genre N r
## <chr> <int> <dbl>
## 1 History 39 0.850
## 2 Fantasy 182 0.777
## 3 Action 1195 0.766
## 4 Science Fiction 116 0.756
## 5 Adventure 474 0.738
## 6 Thriller 259 0.723
## 7 Animation 188 0.720
## 8 Family 79 0.707
## 9 Crime 333 0.685
## 10 Western 44 0.654
## 11 Drama 1934 0.632
## 12 Comedy 1568 0.630
## 13 War 50 0.546
## 14 Mystery 91 0.527
## 15 Romance 176 0.516
## 16 Music 49 0.484
## 17 Documentary 188 0.481
## 18 Horror 403 0.470
## 19 <NA> 23 0.453
##Extra Data and Charts
## # A tibble: 21 × 9
## genres count mean_budget median_budget sd_budget iqr_budget min_budget
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adventure 957 63865792. 40000000 62204025. 82250400 5
## 2 Animation 292 63660027. 52000000 54238774. 74500000 30
## 3 Fantasy 510 61990664. 40000000 61886126. 70275000 8
## 4 Family 530 57989444. 40000000 52626867. 62000000 12
## 5 Science Fict… 634 52353195. 30000000 57038951. 64500000 7
## 6 Action 1414 49730128. 30000000 53845276. 59000000 1
## 7 Thriller 1502 32363496. 20000000 36729052. 38000000 1
## 8 War 203 31885662. 18000000 35213561. 38250000 4
## 9 History 235 30050228. 18339750 31938033. 30000000 8
## 10 Western 89 29583281. 10500000 44941203. 31231215 200000
## # ℹ 11 more rows
## # ℹ 2 more variables: max_budget <dbl>, skew_budget <dbl>
## [1] 0.120132
##New Function not seen in class To complement the statistical analysis of movie success factors, a Principal Component Analysis (PCA) was performed on the numerical features: budget, revenue, vote average, vote count, and runtime. PCA is an advanced statistical technique used to reduce the dimensionality of data while retaining as much variability as possible.
In this case, PCA allowed us to identify underlying patterns in the dataset by projecting movies into a 2D space, based on combinations of the original variables (called principal components). The first two principal components explained a significant portion of the variance across all features.
The scatterplot generated using the ggfortify package visualizes movies colored by their popularity index. It reveals clusters and gradients of movie performance, suggesting that combinations of budget, revenue, and vote count are tightly linked to popularity classification. This helps decision-makers visually identify outliers and latent groups of successful or unsuccessful movies.
To maximize success, studios should strategically select genres and budgets within the identified optimal ranges and focus marketing efforts accordingly.