1 Executive Summary

This report seeks to examine the key factors contributing to a movie’s success using the Movies Dataset from Kaggle. It covers thorough data cleaning, categorization, exploration, correlation analysis, and clustering, ultimately providing actionable insights for major film production companies.

2 Data Import and Preliminary Analysis

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
movies <- read_csv("/Users/eduardosanchez/Desktop/Archive/movies_metadata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 45466 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
## dbl   (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
## lgl   (2): adult, video
## date  (1): release_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
credits <- read_csv("/Users/eduardosanchez/Desktop/Archive/credits.csv")
## Rows: 45476 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): cast, crew
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
keywords <- read_csv("/Users/eduardosanchez/Desktop/Archive/keywords.csv")
## Rows: 46419 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): keywords
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
links_small <- read_csv("/Users/eduardosanchez/Desktop/Archive/links_small.csv")
## Rows: 9125 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): imdbId
## dbl (2): movieId, tmdbId
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ratings_small_csv <- read_csv("/Users/eduardosanchez/Desktop/Archive/ratings_small.csv")
## Rows: 100004 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): userId, movieId, rating, timestamp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(movies)
## Rows: 45,466
## Columns: 24
## $ adult                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ belongs_to_collection <chr> "{'id': 10194, 'name': 'Toy Story Collection', '…
## $ budget                <dbl> 30000000, 65000000, 0, 16000000, 0, 60000000, 58…
## $ genres                <chr> "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'n…
## $ homepage              <chr> "http://toystory.disney.com/toy-story", NA, NA, …
## $ id                    <dbl> 862, 8844, 15602, 31357, 11862, 949, 11860, 4532…
## $ imdb_id               <chr> "tt0114709", "tt0113497", "tt0113228", "tt011488…
## $ original_language     <chr> "en", "en", "en", "en", "en", "en", "en", "en", …
## $ original_title        <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "Wai…
## $ overview              <chr> "Led by Woody, Andy's toys live happily in his r…
## $ popularity            <dbl> 21.946943, 17.015539, 11.712900, 3.859495, 8.387…
## $ poster_path           <chr> "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", "/vzmL6fP7aP…
## $ production_companies  <chr> "[{'name': 'Pixar Animation Studios', 'id': 3}]"…
## $ production_countries  <chr> "[{'iso_3166_1': 'US', 'name': 'United States of…
## $ release_date          <date> 1995-10-30, 1995-12-15, 1995-12-22, 1995-12-22,…
## $ revenue               <dbl> 373554033, 262797249, 0, 81452156, 76578911, 187…
## $ runtime               <dbl> 81, 104, 101, 127, 106, 170, 127, 97, 106, 130, …
## $ spoken_languages      <chr> "[{'iso_639_1': 'en', 'name': 'English'}]", "[{'…
## $ status                <chr> "Released", "Released", "Released", "Released", …
## $ tagline               <chr> NA, "Roll the dice and unleash the excitement!",…
## $ title                 <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "Wai…
## $ video                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ vote_average          <dbl> 7.7, 6.9, 6.5, 6.1, 5.7, 7.7, 6.2, 5.4, 5.5, 6.6…
## $ vote_count            <dbl> 5415, 2413, 92, 34, 173, 1886, 141, 45, 174, 119…

3 Data cleaning

3.1 Conversion to Numeric Format

## tibble [45,466 × 3] (S3: tbl_df/tbl/data.frame)
##  $ budget      : num [1:45466] 3.0e+07 6.5e+07 0.0 1.6e+07 0.0 6.0e+07 5.8e+07 0.0 3.5e+07 5.8e+07 ...
##  $ revenue     : num [1:45466] 3.74e+08 2.63e+08 0.00 8.15e+07 7.66e+07 ...
##  $ release_date: Date[1:45466], format: "1995-10-30" "1995-12-15" ...

3.2 Missing Data Detection

##       budget      revenue release_date 
##            3            6           90

4 Identifying and Categorizing Outliers

## # A tibble: 0 × 6
## # ℹ 6 variables: budget <dbl>, popularity <dbl>, runtime <dbl>, revenue <dbl>,
## #   vote_average <dbl>, vote_count <dbl>

4.1 Categorization of Budget Levels

4.2 Categorization of Popularity Levels

4.3 Categorization of Runtime

5 Filtering the Final Dataset

6 Cleaning and Parsing Genre Data

## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten

7 Merging Datasets

8 Managing Missing Data

## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## [1] 0
## 
##  iter imp variable
##   1   1
##   1   2
##   1   3
##   1   4
##   1   5
##   2   1
##   2   2
##   2   3
##   2   4
##   2   5
##   3   1
##   3   2
##   3   3
##   3   4
##   3   5
##   4   1
##   4   2
##   4   3
##   4   4
##   4   5
##   5   1
##   5   2
##   5   3
##   5   4
##   5   5

9 Exploratory Data Analysis

9.1 Distribution of Genres

9.2 Revenue Trend Over Time

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

9.3 Analysis of Correlations

## corrplot 0.95 loaded

9.4 Clustering

9.5 Return on Investment (ROI)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0        1        2     5572        4 12396383     2023

9.6 Genre-based Correlation Analysis

## # A tibble: 19 × 3
##    main_genre          N     r
##    <chr>           <int> <dbl>
##  1 History            39 0.850
##  2 Fantasy           182 0.777
##  3 Action           1195 0.766
##  4 Science Fiction   116 0.756
##  5 Adventure         474 0.738
##  6 Thriller          259 0.723
##  7 Animation         188 0.720
##  8 Family             79 0.707
##  9 Crime             333 0.685
## 10 Western            44 0.654
## 11 Drama            1934 0.632
## 12 Comedy           1568 0.630
## 13 War                50 0.546
## 14 Mystery            91 0.527
## 15 Romance           176 0.516
## 16 Music              49 0.484
## 17 Documentary       188 0.481
## 18 Horror            403 0.470
## 19 <NA>               23 0.453

9.7 Additional Data and Visualizations

## # A tibble: 21 × 9
##    genres        count mean_budget median_budget sd_budget iqr_budget min_budget
##    <chr>         <int>       <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
##  1 Adventure       957   63865792.      40000000 62204025.   82250400          5
##  2 Animation       292   63660027.      52000000 54238774.   74500000         30
##  3 Fantasy         510   61990664.      40000000 61886126.   70275000          8
##  4 Family          530   57989444.      40000000 52626867.   62000000         12
##  5 Science Fict…   634   52353195.      30000000 57038951.   64500000          7
##  6 Action         1414   49730128.      30000000 53845276.   59000000          1
##  7 Thriller       1502   32363496.      20000000 36729052.   38000000          1
##  8 War             203   31885662.      18000000 35213561.   38250000          4
##  9 History         235   30050228.      18339750 31938033.   30000000          8
## 10 Western          89   29583281.      10500000 44941203.   31231215     200000
## # ℹ 11 more rows
## # ℹ 2 more variables: max_budget <dbl>, skew_budget <dbl>

## [1] 0.120132

9.8 New Function not seen in class

To enhance the analysis of movie success factors, K-Means Clustering was applied to the dataset using the numerical features: budget, revenue, vote average, and runtime. K-Means is an unsupervised machine learning technique that groups data points into clusters based on their similarities.

In this case, K-Means clustering helped identify distinct groups of movies with similar characteristics. The number of clusters was determined using the elbow method, which plots the within-cluster sum of squares for various cluster numbers and helps find the optimal 𝑘 k value.

The clustering results were visualized in a scatterplot, where movies were grouped based on budget and revenue. This visualization allows decision-makers to see clusters of movies that share similar characteristics, such as high-budget blockbusters or low-budget indie films. By analyzing these clusters, film studios can target specific market segments more effectively and improve their production and marketing strategies.

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 369900)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 369900)
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations

10 Conclusions and Recommendations

-> Movies with larger budgets typically yield higher revenues. -> Action, Animation, and Sci-Fi genres consistently outperform others in terms of revenue generation.

For optimal success, film studios should strategically choose genres and budget ranges within these optimal thresholds and align their marketing strategies to target these high-performing segments.