The film industry is a multi-billion-dollar business in which production budgets and marketing decisions are often driven by assumptions about what makes a movie successful. This project analyzes several movie attributes (budget, runtime, genre, vote count, and IMDb-style rating) to determine how they relate to box office revenue. The goal is to build a predictive model that estimates revenue from these features.
# Load required libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(corrplot)
## corrplot 0.95 loaded
# Import dataset
df <- read_csv("movie_data.csv")
## Rows: 500 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): title
## dbl (9): year, budget, length, rating, votes, Action, Comedy, Drama, revenue
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Structure and first rows
str(df)
## spc_tbl_ [500 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ title : chr [1:500] "Movie 0" "Movie 1" "Movie 2" "Movie 3" ...
## $ year : num [1:500] 1982 2008 2014 2018 1997 ...
## $ budget : num [1:500] 26242501 16202750 35232184 25733102 12445982 ...
## $ length : num [1:500] 95.3 101.7 101.6 82.3 82.3 ...
## $ rating : num [1:500] 8.32 8.64 3.53 2.04 1.91 ...
## $ votes : num [1:500] 48967 73771 93341 30384 28077 ...
## $ Action : num [1:500] 0 1 1 0 1 0 0 0 0 0 ...
## $ Comedy : num [1:500] 0 1 1 0 1 1 1 1 1 1 ...
## $ Drama : num [1:500] 1 0 1 1 1 1 0 0 0 1 ...
## $ revenue: num [1:500] 33430000 25930000 35530000 10410000 5570000 ...
## - attr(*, "spec")=
## .. cols(
## .. title = col_character(),
## .. year = col_double(),
## .. budget = col_double(),
## .. length = col_double(),
## .. rating = col_double(),
## .. votes = col_double(),
## .. Action = col_double(),
## .. Comedy = col_double(),
## .. Drama = col_double(),
## .. revenue = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(df)
# Check for missing values
colSums(is.na(df))
## title year budget length rating votes Action Comedy Drama revenue
## 0 0 0 0 0 0 0 0 0 0
# Convert genre columns to factors
df <- df %>%
  mutate(
    Action = factor(Action),
    Comedy = factor(Comedy),
    Drama  = factor(Drama)
  )
# Summary statistics
summary(df)
## title year budget length
## Length:500 Min. :1980 Min. : 2055277 Min. : 60.00
## Class :character 1st Qu.:1989 1st Qu.:23317399 1st Qu.: 89.85
## Mode :character Median :2000 Median :29653526 Median :100.10
## Mean :2000 Mean :29756334 Mean : 99.57
## 3rd Qu.:2010 3rd Qu.:36909386 3rd Qu.:108.58
## Max. :2019 Max. :65715792 Max. :141.50
## rating votes Action Comedy Drama revenue
## Min. :1.001 Min. : 1073 0:354 0:291 0:240 Min. : 2120000
## 1st Qu.:3.267 1st Qu.:24420 1:146 1:209 1:260 1st Qu.: 18277500
## Median :5.444 Median :48114 Median : 32365000
## Mean :5.517 Mean :49294 Mean : 36536880
## 3rd Qu.:7.721 3rd Qu.:73749 3rd Qu.: 49127500
## Max. :9.994 Max. :99935 Max. :140410000
# Top 5 highest revenue movies
df %>% arrange(desc(revenue)) %>% head(5)
# Ratings distribution
ggplot(df, aes(x = rating)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Movie Ratings", x = "Rating", y = "Count")
# Budget vs Revenue
ggplot(df, aes(x = budget, y = revenue)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Budget vs Revenue", x = "Budget (USD)", y = "Revenue (USD)")
## `geom_smooth()` using formula = 'y ~ x'
# Correlation matrix
numeric_vars <- df %>% select(budget, length, rating, votes, revenue)
cor_matrix <- cor(numeric_vars)
corrplot::corrplot(cor_matrix, method = "circle", type = "upper")
# Boxplot: Revenue by Genre (Drama)
ggplot(df, aes(x = Drama, y = revenue)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Revenue by Drama Genre", x = "Drama (1 = Yes)", y = "Revenue (USD)")
# Linear model
model <- lm(revenue ~ budget + rating + votes + length + Action + Comedy + Drama, data = df)
summary(model)
##
## Call:
## lm(formula = revenue ~ budget + rating + votes + length + Action +
## Comedy + Drama, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22771717 -7338127 -137756 6661208 51439284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.286e+07 3.757e+06 -8.745 <2e-16 ***
## budget 1.219e+00 4.539e-02 26.855 <2e-16 ***
## rating 6.698e+06 1.780e+05 37.634 <2e-16 ***
## votes -1.429e+01 1.623e+01 -0.881 0.379
## length -3.917e+04 3.241e+04 -1.208 0.228
## Action1 1.409e+06 1.013e+06 1.391 0.165
## Comedy1 1.349e+06 9.312e+05 1.448 0.148
## Drama1 -3.812e+05 9.202e+05 -0.414 0.679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10240000 on 492 degrees of freedom
## Multiple R-squared: 0.8104, Adjusted R-squared: 0.8077
## F-statistic: 300.4 on 7 and 492 DF, p-value: < 2.2e-16
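With eight rows in the coefficient table, it can help to filter it programmatically rather than read the p-values by eye. The following is a minimal sketch in base R; the helper name `significant_terms` and the 5% threshold are assumptions made for illustration, and the demonstration uses the built-in `mtcars` data as a stand-in, not the movie data.

```r
# Hypothetical helper: keep only the coefficient rows whose p-value is below alpha.
significant_terms <- function(fit, alpha = 0.05) {
  coefs <- coef(summary(fit))  # matrix with Estimate, Std. Error, t value, Pr(>|t|)
  keep <- coefs[, "Pr(>|t|)"] < alpha
  coefs[keep, , drop = FALSE]
}

# Stand-in demonstration on built-in data (not the movie dataset)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_sig <- significant_terms(demo_fit)
print(demo_sig)
```

Calling `significant_terms(model)` on the model fitted above should, going by the printed summary, keep only the intercept, `budget`, and `rating` rows.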
# Add predictions
df$predicted_revenue <- predict(model)
# Plot Actual vs Predicted
ggplot(df, aes(x = revenue, y = predicted_revenue)) +
  geom_point(color = "darkgreen", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Revenue", x = "Actual Revenue", y = "Predicted Revenue")
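The scatter plot gives a visual impression of fit; a numeric summary can complement it. Below is a small sketch of two common error measures, root mean squared error (RMSE) and mean absolute error (MAE). The helper name `error_metrics` and the toy vectors are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: RMSE and MAE between actual and predicted values.
error_metrics <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  resid <- actual - predicted
  c(rmse = sqrt(mean(resid^2)), mae = mean(abs(resid)))
}

# Toy demonstration with made-up numbers
demo_metrics <- error_metrics(c(10, 20, 30), c(12, 18, 33))
print(demo_metrics)
```

On the report's data this would be called as `error_metrics(df$revenue, df$predicted_revenue)`. The resulting RMSE should come out slightly below the residual standard error reported by `summary(model)`, because RMSE divides the residual sum of squares by the number of rows while the residual standard error divides by the residual degrees of freedom.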
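This report analyzed the key predictors of box office success using a simulated movie dataset of 500 films. After exploratory data analysis and fitting a linear regression model, we found that budget and rating are strong, statistically significant predictors of revenue (p < 2e-16 for both). Votes, length, and the Action, Comedy, and Drama genre indicators were not significant at the 5% level. Overall, the model explains about 81% of the variance in revenue (R-squared = 0.8104, adjusted R-squared = 0.8077).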
The predicted revenue values aligned reasonably well with the actual revenues, especially in the mid and high ranges. Note that these predictions are in-sample: they were generated for the same rows used to fit the model. Future improvements could include more flexible machine learning models such as random forests or gradient boosting, along with feature engineering on cast, release date, or director popularity.
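One inexpensive step before moving to more complex models is to check how a model performs on rows it was not fitted to. The sketch below shows an 80/20 train/test split in base R. Everything in it is an assumption for illustration: the data frame `sim` is randomly generated stand-in data that borrows two of the report's column names, and the seed, split ratio, and generating coefficients are arbitrary.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in data; replace `sim` with the real `df` to evaluate the actual model.
n <- 200
sim <- data.frame(
  budget = runif(n, 2e6, 6e7),
  rating = runif(n, 1, 10)
)
# Arbitrary generating process chosen only so the example has signal to fit
sim$revenue <- 1.2 * sim$budget + 6e6 * sim$rating + rnorm(n, sd = 1e7)

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows
fit <- lm(revenue ~ budget + rating, data = train)
test_pred <- predict(fit, newdata = test)

# Out-of-sample RMSE
holdout_rmse <- sqrt(mean((test$revenue - test_pred)^2))
print(holdout_rmse)
```

A held-out RMSE close to the in-sample error would suggest the linear model generalizes about as well as the fitted values imply; a much larger value would point to overfitting.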
This report and the accompanying slides provide a foundation for understanding which movie attributes matter most in financial performance.