1 Executive Summary

This report aims to analyze critical factors behind a movie’s success using the Movies Dataset from Kaggle. It provides comprehensive data cleaning, categorization, exploration, correlation analyses, and clustering, culminating in actionable insights for major film production companies.

2 1. Data Loading and Initial Overview

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
movies <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/movies_metadata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 45466 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (14): belongs_to_collection, genres, homepage, imdb_id, original_langua...
## dbl   (7): budget, id, popularity, revenue, runtime, vote_average, vote_count
## lgl   (2): adult, video
## date  (1): release_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
credits <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/credits.csv")
## Rows: 45476 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): cast, crew
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
keywords <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/keywords.csv")
## Rows: 46419 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): keywords
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
links_small <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/links_small.csv")
## Rows: 9125 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): imdbId
## dbl (2): movieId, tmdbId
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ratings_small_csv <- read_csv("C:/Users/ferpa/OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey/Documents/RStudio/Business Analytics/ratings_small.csv")
## Rows: 100004 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): userId, movieId, rating, timestamp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(movies)
## Rows: 45,466
## Columns: 24
## $ adult                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ belongs_to_collection <chr> "{'id': 10194, 'name': 'Toy Story Collection', '…
## $ budget                <dbl> 30000000, 65000000, 0, 16000000, 0, 60000000, 58…
## $ genres                <chr> "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'n…
## $ homepage              <chr> "http://toystory.disney.com/toy-story", NA, NA, …
## $ id                    <dbl> 862, 8844, 15602, 31357, 11862, 949, 11860, 4532…
## $ imdb_id               <chr> "tt0114709", "tt0113497", "tt0113228", "tt011488…
## $ original_language     <chr> "en", "en", "en", "en", "en", "en", "en", "en", …
## $ original_title        <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "Wai…
## $ overview              <chr> "Led by Woody, Andy's toys live happily in his r…
## $ popularity            <dbl> 21.946943, 17.015539, 11.712900, 3.859495, 8.387…
## $ poster_path           <chr> "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", "/vzmL6fP7aP…
## $ production_companies  <chr> "[{'name': 'Pixar Animation Studios', 'id': 3}]"…
## $ production_countries  <chr> "[{'iso_3166_1': 'US', 'name': 'United States of…
## $ release_date          <date> 1995-10-30, 1995-12-15, 1995-12-22, 1995-12-22,…
## $ revenue               <dbl> 373554033, 262797249, 0, 81452156, 76578911, 187…
## $ runtime               <dbl> 81, 104, 101, 127, 106, 170, 127, 97, 106, 130, …
## $ spoken_languages      <chr> "[{'iso_639_1': 'en', 'name': 'English'}]", "[{'…
## $ status                <chr> "Released", "Released", "Released", "Released", …
## $ tagline               <chr> NA, "Roll the dice and unleash the excitement!",…
## $ title                 <chr> "Toy Story", "Jumanji", "Grumpier Old Men", "Wai…
## $ video                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ vote_average          <dbl> 7.7, 6.9, 6.5, 6.1, 5.7, 7.7, 6.2, 5.4, 5.5, 6.6…
## $ vote_count            <dbl> 5415, 2413, 92, 34, 173, 1886, 141, 45, 174, 119…

3 2. Data Cleaning

3.1 2.1 Numeric Conversion

## tibble [45,466 × 3] (S3: tbl_df/tbl/data.frame)
##  $ budget      : num [1:45466] 3.0e+07 6.5e+07 0.0 1.6e+07 0.0 6.0e+07 5.8e+07 0.0 3.5e+07 5.8e+07 ...
##  $ revenue     : num [1:45466] 3.74e+08 2.63e+08 0.00 8.15e+07 7.66e+07 ...
##  $ release_date: Date[1:45466], format: "1995-10-30" "1995-12-15" ...

3.2 2.2 Missing Value Inspection

##       budget      revenue release_date 
##            3            6           90

4 3. Outlier Detection and Categorization

4.1 3.1 Identify Outliers

## # A tibble: 0 × 6
## # ℹ 6 variables: budget <dbl>, popularity <dbl>, runtime <dbl>, revenue <dbl>,
## #   vote_average <dbl>, vote_count <dbl>

4.2 3.2 Budget Categorization

4.3 3.3 Popularity Categorization

4.4 3.4 Runtime Categorization

5 4. Final Dataset Filtering

6 5. Genre Cleaning and Parsing

## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten

7 6. Dataset Integration

8 7. Handling Missing Values

## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## [1] 0
## 
##  iter imp variable
##   1   1
##   1   2
##   1   3
##   1   4
##   1   5
##   2   1
##   2   2
##   2   3
##   2   4
##   2   5
##   3   1
##   3   2
##   3   3
##   3   4
##   3   5
##   4   1
##   4   2
##   4   3
##   4   4
##   4   5
##   5   1
##   5   2
##   5   3
##   5   4
##   5   5

9 8. Exploratory Analysis

9.1 8.1 Genre Distribution

9.2 8.2 Revenue Over Time

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

9.3 8.3 Correlation Analysis

## corrplot 0.95 loaded

9.4 8.4 Clustering

9.5 8.5 Return on Investment (ROI)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0        1        2     5572        4 12396383     2023

9.6 8.6 Correlation by Genre

## # A tibble: 19 × 3
##    main_genre          N     r
##    <chr>           <int> <dbl>
##  1 History            39 0.850
##  2 Fantasy           182 0.777
##  3 Action           1195 0.766
##  4 Science Fiction   116 0.756
##  5 Adventure         474 0.738
##  6 Thriller          259 0.723
##  7 Animation         188 0.720
##  8 Family             79 0.707
##  9 Crime             333 0.685
## 10 Western            44 0.654
## 11 Drama            1934 0.632
## 12 Comedy           1568 0.630
## 13 War                50 0.546
## 14 Mystery            91 0.527
## 15 Romance           176 0.516
## 16 Music              49 0.484
## 17 Documentary       188 0.481
## 18 Horror            403 0.470
## 19 <NA>               23 0.453

##Extra Data and Charts

## # A tibble: 21 × 9
##    genres        count mean_budget median_budget sd_budget iqr_budget min_budget
##    <chr>         <int>       <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
##  1 Adventure       957   63865792.      40000000 62204025.   82250400          5
##  2 Animation       292   63660027.      52000000 54238774.   74500000         30
##  3 Fantasy         510   61990664.      40000000 61886126.   70275000          8
##  4 Family          530   57989444.      40000000 52626867.   62000000         12
##  5 Science Fict…   634   52353195.      30000000 57038951.   64500000          7
##  6 Action         1414   49730128.      30000000 53845276.   59000000          1
##  7 Thriller       1502   32363496.      20000000 36729052.   38000000          1
##  8 War             203   31885662.      18000000 35213561.   38250000          4
##  9 History         235   30050228.      18339750 31938033.   30000000          8
## 10 Western          89   29583281.      10500000 44941203.   31231215     200000
## # ℹ 11 more rows
## # ℹ 2 more variables: max_budget <dbl>, skew_budget <dbl>

## [1] 0.120132

##New Function not seen in class To complement the statistical analysis of movie success factors, a Principal Component Analysis (PCA) was performed on the numerical features: budget, revenue, vote average, vote count, and runtime. PCA is an advanced statistical technique used to reduce the dimensionality of data while retaining as much variability as possible.

In this case, PCA allowed us to identify underlying patterns in the dataset by projecting movies into a 2D space, based on combinations of the original variables (called principal components). The first two principal components explained a significant portion of the variance across all features.

The scatterplot generated using the ggfortify package visualizes movies colored by their popularity index. It reveals clusters and gradients of movie performance, suggesting that combinations of budget, revenue, and vote count are tightly linked to popularity classification. This helps decision-makers visually identify outliers and latent groups of successful or unsuccessful movies.

10 9. Conclusions and Recommendations

To maximize success, studios should strategically select genres and budgets within the identified optimal ranges and focus marketing efforts accordingly.