1. Background and Problem Definition

The film industry is a multi-billion-dollar business in which production budgets and marketing decisions often rest on assumptions about what makes a movie successful. This project analyzes movie attributes (budget, runtime, genre, IMDb-style rating, and vote count) to determine how they relate to box office revenue. The goal is to build a predictive model that estimates revenue from these features.

2. Data Wrangling, Munging and Cleaning

# Load required libraries
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(corrplot)

## corrplot 0.95 loaded

# Import dataset
df <- read_csv("movie_data.csv")

## Rows: 500 Columns: 10

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): title
## dbl (9): year, budget, length, rating, votes, Action, Comedy, Drama, revenue
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Structure and first rows
str(df)

## spc_tbl_ [500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ title  : chr [1:500] "Movie 0" "Movie 1" "Movie 2" "Movie 3" ...
##  $ year   : num [1:500] 1982 2008 2014 2018 1997 ...
##  $ budget : num [1:500] 26242501 16202750 35232184 25733102 12445982 ...
##  $ length : num [1:500] 95.3 101.7 101.6 82.3 82.3 ...
##  $ rating : num [1:500] 8.32 8.64 3.53 2.04 1.91 ...
##  $ votes  : num [1:500] 48967 73771 93341 30384 28077 ...
##  $ Action : num [1:500] 0 1 1 0 1 0 0 0 0 0 ...
##  $ Comedy : num [1:500] 0 1 1 0 1 1 1 1 1 1 ...
##  $ Drama  : num [1:500] 1 0 1 1 1 1 0 0 0 1 ...
##  $ revenue: num [1:500] 33430000 25930000 35530000 10410000 5570000 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   title = col_character(),
##   ..   year = col_double(),
##   ..   budget = col_double(),
##   ..   length = col_double(),
##   ..   rating = col_double(),
##   ..   votes = col_double(),
##   ..   Action = col_double(),
##   ..   Comedy = col_double(),
##   ..   Drama = col_double(),
##   ..   revenue = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(df)

# Check for missing values
colSums(is.na(df))

##   title    year  budget  length  rating   votes  Action  Comedy   Drama revenue 
##       0       0       0       0       0       0       0       0       0       0

# Convert genre columns to factors
df <- df %>%
  mutate(
    Action = factor(Action),
    Comedy = factor(Comedy),
    Drama = factor(Drama)
  )
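
Beyond the missing-value check, a few range and coding checks can catch problems that `is.na()` does not. The following is a minimal sketch, not part of the original analysis: `check_movie_data()` is a hypothetical helper, the 1 to 10 rating scale is an assumption based on the observed values, and the toy data frame stands in for `df` so the example runs without `movie_data.csv`.

```r
# Hypothetical helper: returns a named logical vector, TRUE where a check passes.
check_movie_data <- function(data, genre_cols = c("Action", "Comedy", "Drama")) {
  # Genre flags may be numeric or factor, so compare their character form
  genre_ok <- vapply(
    genre_cols,
    function(g) all(as.character(data[[g]]) %in% c("0", "1")),
    logical(1)
  )
  c(
    no_missing      = !any(is.na(data)),
    unique_titles   = !anyDuplicated(data$title),
    positive_budget = all(data$budget > 0),
    rating_in_range = all(data$rating >= 1 & data$rating <= 10),  # assumed scale
    genres_binary   = all(genre_ok)
  )
}

# Toy stand-in for df (first three rows shown by str() above)
toy <- data.frame(
  title  = c("Movie 0", "Movie 1", "Movie 2"),
  budget = c(26242501, 16202750, 35232184),
  rating = c(8.32, 8.64, 3.53),
  Action = factor(c(0, 1, 1)),
  Comedy = factor(c(0, 1, 1)),
  Drama  = factor(c(1, 0, 1))
)

checks <- check_movie_data(toy)
print(checks)
```

Calling `check_movie_data(df)` on the real data would run the same checks; any `FALSE` entry points to a column worth inspecting before modeling.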

3. Exploratory Data Analysis

# Summary statistics
summary(df)

##     title                year          budget             length      
##  Length:500         Min.   :1980   Min.   : 2055277   Min.   : 60.00  
##  Class :character   1st Qu.:1989   1st Qu.:23317399   1st Qu.: 89.85  
##  Mode  :character   Median :2000   Median :29653526   Median :100.10  
##                     Mean   :2000   Mean   :29756334   Mean   : 99.57  
##                     3rd Qu.:2010   3rd Qu.:36909386   3rd Qu.:108.58  
##                     Max.   :2019   Max.   :65715792   Max.   :141.50  
##      rating          votes       Action  Comedy  Drama      revenue         
##  Min.   :1.001   Min.   : 1073   0:354   0:291   0:240   Min.   :  2120000  
##  1st Qu.:3.267   1st Qu.:24420   1:146   1:209   1:260   1st Qu.: 18277500  
##  Median :5.444   Median :48114                           Median : 32365000  
##  Mean   :5.517   Mean   :49294                           Mean   : 36536880  
##  3rd Qu.:7.721   3rd Qu.:73749                           3rd Qu.: 49127500  
##  Max.   :9.994   Max.   :99935                           Max.   :140410000

# Top 5 highest revenue movies
df %>% arrange(desc(revenue)) %>% head(5)

# Ratings distribution
ggplot(df, aes(x = rating)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Movie Ratings", x = "Rating", y = "Count")

# Budget vs Revenue
ggplot(df, aes(x = budget, y = revenue)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Budget vs Revenue", x = "Budget (USD)", y = "Revenue (USD)")

## `geom_smooth()` using formula = 'y ~ x'

4. Data Visualization

# Correlation matrix
numeric_vars <- df %>% select(budget, length, rating, votes, revenue)
cor_matrix <- cor(numeric_vars)
corrplot::corrplot(cor_matrix, method = "circle", type = "upper")

# Boxplot: Revenue by Genre (Drama)
ggplot(df, aes(x = Drama, y = revenue)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Revenue by Drama Genre", x = "Drama (1 = Yes)", y = "Revenue")
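
The boxplot above covers only Drama. As a rough sketch of how the same comparison could be tabulated for all three genre flags at once, the snippet below computes the median revenue for each level of each flag using base R. The toy data frame and its values are illustrative assumptions, not rows from `movie_data.csv`.

```r
# Toy stand-in for df (illustrative values only)
toy <- data.frame(
  revenue = c(10, 20, 30, 40),
  Action  = factor(c(0, 0, 1, 1)),
  Comedy  = factor(c(0, 1, 0, 1)),
  Drama   = factor(c(1, 1, 1, 0))
)

genre_cols <- c("Action", "Comedy", "Drama")

# One column per genre flag, one row per flag level ("0" = no, "1" = yes)
median_by_genre <- sapply(
  genre_cols,
  function(g) tapply(toy$revenue, toy[[g]], median)
)
print(median_by_genre)
```

Replacing `toy` with `df` should give the corresponding table for the full dataset, which can be read alongside the boxplot.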

5. Linear Regression Model

# Linear model
model <- lm(revenue ~ budget + rating + votes + length + Action + Comedy + Drama, data = df)
summary(model)

## 
## Call:
## lm(formula = revenue ~ budget + rating + votes + length + Action + 
##     Comedy + Drama, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -22771717  -7338127   -137756   6661208  51439284 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.286e+07  3.757e+06  -8.745   <2e-16 ***
## budget       1.219e+00  4.539e-02  26.855   <2e-16 ***
## rating       6.698e+06  1.780e+05  37.634   <2e-16 ***
## votes       -1.429e+01  1.623e+01  -0.881    0.379    
## length      -3.917e+04  3.241e+04  -1.208    0.228    
## Action1      1.409e+06  1.013e+06   1.391    0.165    
## Comedy1      1.349e+06  9.312e+05   1.448    0.148    
## Drama1      -3.812e+05  9.202e+05  -0.414    0.679    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10240000 on 492 degrees of freedom
## Multiple R-squared:  0.8104, Adjusted R-squared:  0.8077 
## F-statistic: 300.4 on 7 and 492 DF,  p-value: < 2.2e-16
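
To make the coefficient table concrete, the sketch below applies the rounded estimates printed above to one hypothetical movie (a $30 million Action film, rating 7, 50,000 votes, 100 minutes). The movie's attributes are assumptions chosen for illustration; only the coefficients come from the model output.

```r
# Rounded estimates copied from the summary(model) output above
coefs <- c(
  intercept = -3.286e7,
  budget    =  1.219,
  rating    =  6.698e6,
  votes     = -14.29,
  length    = -3.917e4,
  Action1   =  1.409e6,
  Comedy1   =  1.349e6,
  Drama1    = -3.812e5
)

# Hypothetical movie (illustrative values, not a row of the dataset)
movie <- c(
  intercept = 1,      # multiplies the intercept term
  budget    = 30e6,
  rating    = 7,
  votes     = 50000,
  length    = 100,
  Action1   = 1,      # Action film
  Comedy1   = 0,
  Drama1    = 0
)

# Linear predictor: sum of coefficient * feature value
predicted <- sum(coefs * movie[names(coefs)])
print(predicted)  # 47373500, i.e. about $47.4 million
```

Read this way, each additional budget dollar is associated with about $1.22 of revenue and each rating point with about $6.7 million, holding the other variables fixed. In practice the same number would come from `predict(model, newdata = ...)`, with the genre columns supplied as factors using the same levels as in `df`.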

# Add predictions
df$predicted_revenue <- predict(model)

# Plot Actual vs Predicted
ggplot(df, aes(x = revenue, y = predicted_revenue)) +
  geom_point(color = "darkgreen", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Revenue", x = "Actual Revenue", y = "Predicted Revenue")
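
The predictions plotted above are computed on the same rows used to fit the model, so they describe in-sample fit. One common way to gauge predictive accuracy is to fit on a random subset and score on the held-out rows. The following is a minimal sketch of that idea, not the method used in this report: `holdout_rmse()` is a hypothetical helper, the 80/20 split is an arbitrary choice, and the simulated `toy` data stands in for `df` so the example runs without `movie_data.csv`.

```r
# Hypothetical helper: fit lm() on a random training subset and
# return the root mean squared error on the remaining rows.
holdout_rmse <- function(data, formula, train_frac = 0.8, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  train_idx <- sample(n, size = floor(train_frac * n))
  fit <- lm(formula, data = data[train_idx, , drop = FALSE])
  test <- data[-train_idx, , drop = FALSE]
  response <- all.vars(formula)[1]          # name of the left-hand-side variable
  pred <- predict(fit, newdata = test)
  sqrt(mean((test[[response]] - pred)^2))
}

# Simulated stand-in for df (coefficients and noise level are assumptions)
set.seed(42)
toy <- data.frame(
  budget = runif(200, 2e6, 6e7),
  rating = runif(200, 1, 10)
)
toy$revenue <- -3e7 + 1.2 * toy$budget + 6.5e6 * toy$rating +
  rnorm(200, sd = 1e7)

rmse <- holdout_rmse(toy, revenue ~ budget + rating)
print(rmse)
```

On the real data, `holdout_rmse(df, revenue ~ budget + rating + votes + length + Action + Comedy + Drama)` could be compared with the in-sample residual standard error; a much larger hold-out error would suggest overfitting.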

6. Final Paper/Slides and Presentation

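This report analyzed the key predictors of box office success using a simulated movie dataset of 500 films. After exploratory analysis and fitting a linear regression model, we found that budget and rating are strong predictors of revenue (both p < 2e-16), and that the model explains about 81% of the variance in revenue (R-squared = 0.8104, adjusted R-squared = 0.8077). Vote count, runtime, and the genre indicators (Action, Comedy, Drama) were not statistically significant at the 5% level once budget and rating were included.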

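The predicted revenue values aligned reasonably well with the actual revenues, especially in the mid and high ranges, although individual estimates can still be far off: the residual standard error is about $10.2 million, and the largest underprediction is about $51.4 million. These figures are in-sample, because the predictions were computed on the same data used to fit the model. Future improvements could involve evaluation on held-out data, more flexible machine learning models such as random forests or gradient boosting, and feature engineering on cast, release date, or director popularity.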

This report and the accompanying slides provide a foundation for understanding which movie attributes matter most for financial performance.