This dataset contains information on the top 250 movies of all time, including the movie title, release year, audience rating, total number of votes, runtime in minutes, movie rating certificate (G, PG, R, etc.), genre(s), and gross earnings. The data come from IMDB.

I will analyze the relationship between gross revenue and audience rating to determine whether the audience’s opinion of a movie’s quality is an accurate predictor of its revenue, or whether revenue can be better predicted by another variable.

Load Packages and Data

#load libraries
library(readr)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ stringr   1.6.0
## ✔ forcats   1.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)

#load data
source_data <- read_csv("movies.csv")
## Rows: 250 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): title, certificate, genre, gross_total
## dbl (3): year, rating, runtime
## num (1): votes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean Data

#check for NA values
colSums(is.na(source_data))
##       title        year      rating       votes     runtime certificate 
##           0           0           0           0           0           0 
##       genre gross_total 
##           0           4
#select only the relevant columns
movies_clean <- source_data |>
  select(title, year, rating, runtime, certificate, gross_total)

#remove $ and M characters from gross_total and change data type to numeric
movies_clean <- movies_clean |>
  mutate(gross_total = str_remove_all(gross_total, "[\\$M]")) |>
  mutate(gross_total = as.numeric(gross_total))

#create factor of "certificate" column to ensure variables are ordered correctly in visualizations
movies_clean <- movies_clean |> 
  mutate(certificate = factor(certificate, levels = c("G", "PG", "PG-13", "R", "NC-17", "Unrated")))

#create "years since 1920" column for proper analysis
movies_clean <- movies_clean |>
  mutate(year_short = year - 1920)

#filter out NA values in gross_total
movies_clean <- movies_clean |> 
  filter(!is.na(gross_total))

#check
head(movies_clean)
## # A tibble: 6 × 7
##   title                   year rating runtime certificate gross_total year_short
##   <chr>                  <dbl>  <dbl>   <dbl> <fct>             <dbl>      <dbl>
## 1 The Shawshank Redempt…  1994    9.3     142 R                 29.3          74
## 2 The Godfather           1972    9.2     175 R                251.           52
## 3 The Dark Knight         2008    9.1     152 PG-13           1008.           88
## 4 The Godfather Part II   1974    9       202 R                 48.2          54
## 5 12 Angry Men            1957    9        96 Unrated            0.01         37
## 6 The Lord of the Rings…  2003    9       201 PG-13           1149.           83
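
One thing the head() check does not reveal is whether any certificate value fell outside the listed factor levels, because factor() silently turns unlisted values into NA. The sketch below shows one way to check for that on a small made-up data frame. The titles, the "Approved" label, and the names toy, toy_clean, certificate_f, and unmatched are assumptions for illustration, not values taken from movies.csv.

```r
library(dplyr)
library(stringr)

# Made-up rows in the same "$...M" format as gross_total (not real data)
toy <- data.frame(
  title       = c("A", "B", "C"),
  certificate = c("PG-13", "R", "Approved"),  # "Approved" is not a listed level
  gross_total = c("$28.34M", "$534.86M", NA)
)

toy_clean <- toy |>
  mutate(
    # strip "$" and "M", then convert the remaining text to a number
    gross_total = as.numeric(str_remove_all(gross_total, "[\\$M]")),
    # keep the raw certificate and store the factor in a separate column
    certificate_f = factor(
      certificate,
      levels = c("G", "PG", "PG-13", "R", "NC-17", "Unrated")
    )
  )

# Rows whose certificate was present but did not match any level
unmatched <- toy_clean |>
  filter(is.na(certificate_f), !is.na(certificate))

unmatched
```

If a check like this returns any rows for the real data, those movies would be dropped from plots and summaries grouped by certificate, so the level list may need to be extended.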

Linear Regression Model

#multiple linear regression model

lrm_movies <- lm(gross_total ~ year_short + rating + runtime, data = movies_clean)
summary(lrm_movies)
## 
## Call:
## lm(formula = gross_total ~ year_short + rating + runtime, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -558.20 -190.85  -66.23   75.09 2364.72 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.016e+03  7.833e+02  -2.573   0.0107 *  
## year_short   5.891e+00  8.592e-01   6.857 5.82e-11 ***
## rating       2.203e+02  9.639e+01   2.285   0.0232 *  
## runtime      9.371e-02  6.177e-01   0.152   0.8796    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 337.3 on 242 degrees of freedom
## Multiple R-squared:  0.1893, Adjusted R-squared:  0.1793 
## F-statistic: 18.84 on 3 and 242 DF,  p-value: 5.159e-11
#linear regression model, just years and gross earnings

lrm_movies2 <- lm(gross_total ~ year_short, data = movies_clean)
summary(lrm_movies2)
## 
## Call:
## lm(formula = gross_total ~ year_short, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.39 -204.04  -62.73   61.81 2384.61 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -183.6681    62.6447  -2.932  0.00369 ** 
## year_short     6.0455     0.8559   7.063 1.69e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 339.9 on 244 degrees of freedom
## Multiple R-squared:  0.1698, Adjusted R-squared:  0.1664 
## F-statistic: 49.89 on 1 and 244 DF,  p-value: 1.69e-11
autoplot(lrm_movies2)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The first linear regression models gross earnings as a function of release year, audience rating, and runtime to determine which predictor has the strongest association with gross earnings. Audience rating is statistically significant at the .05 level (p = .0232), but release year has the strongest association, with a p-value of 5.82e-11. Runtime is not a significant predictor (p = .8796).

For the linear regression between the release year and the gross earnings, where y is the gross earnings in millions of dollars and x is the number of years since 1920, the equation is:

y = -183.6681 + 6.0455x
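
As a worked example of this equation, the sketch below plugs two release years into the fitted line. The coefficients are copied from the summary(lrm_movies2) output; the helper name predict_gross is hypothetical and only used here for illustration.

```r
# Coefficients copied from the summary(lrm_movies2) output
intercept <- -183.6681
slope     <- 6.0455

# Hypothetical helper: converts a calendar year to years since 1920
# and returns the predicted gross earnings in millions
predict_gross <- function(year) {
  intercept + slope * (year - 1920)
}

pred_2000 <- predict_gross(2000)  # about 299.97
pred_1940 <- predict_gross(1940)  # about -62.76

pred_2000
pred_1940
```

The prediction for 1940 is negative, which is not a possible value for gross earnings. This suggests the straight line should not be taken literally for older releases.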

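While the p-value for this association is very small (1.69e-11), meaning the association is highly significant, the R-squared value is quite low (0.1698), so release year explains only about 17% of the variation in gross earnings. It is also slightly lower than the R-squared value of the original model (0.1893), as expected when predictors are removed. Further, the residual plots indicate that the residuals in this regression model are not evenly spread across the range of fitted values. This means that this is likely not a reliable model.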

Also, there is one major possible confounding variable in this association: the rate of inflation. To analyze trends in gross earnings more accurately, the amounts should be adjusted for inflation so that earnings from different years can be compared on the same scale.
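
A minimal sketch of what such an adjustment could look like is shown below. The price index values, the base index, and the toy movies are all made up for illustration and are not real CPI figures or rows from this dataset. A real analysis would join an official annual price index by release year instead.

```r
library(dplyr)

# Placeholder price index: made-up values, NOT real CPI figures
price_index <- data.frame(
  year  = c(1972, 1994, 2008),
  index = c(50, 150, 200)
)
base_index <- 250  # hypothetical index value for the chosen base year

# Made-up movies with gross earnings in millions of nominal dollars
toy_movies <- data.frame(
  title       = c("Movie A", "Movie B", "Movie C"),
  year        = c(1972, 1994, 2008),
  gross_total = c(100, 30, 1000)
)

# Scale each gross figure by the ratio of base-year prices to release-year prices
toy_adjusted <- toy_movies |>
  left_join(price_index, by = "year") |>
  mutate(gross_adjusted = gross_total * base_index / index)

toy_adjusted
```

With these placeholder numbers, the 1972 movie goes from 100 to 500 in base-year dollars, which shows how much an adjustment of this kind can change the ranking of older releases.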

Data Visualizations

#scatterplot of gross earnings over time, color coded by certificate rating
movies_time <- movies_clean |>
  ggplot(aes(year, gross_total, color = certificate)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  labs(x = "Release Year", y = "Box Office Earnings (in Millions)",
       title ="Movie Revenue Over Time",
       caption = "Source: IMDB",
       color = "Rating")
movies_time

#boxplots based on rating
movies_box <- movies_clean |>
  ggplot(aes(certificate, gross_total, fill = certificate)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Accent") +
  theme_minimal() +
  labs(x = "Movie Rating", y = "Movie Earnings (in Millions)",
       title ="Movie Revenue by Rating Category",
       caption = "Source: IMDB",
       fill = "Movie Rating")

movies_box

Conclusion

To clean this dataset, I first selected only the relevant columns. To make the gross earnings variable numeric, I used the mutate function to remove the “$” and “M” characters from each value and then converted the entire column to a numeric data type. Next, I converted the “certificate” column to a factor with explicitly ordered levels so that the categories would appear in a sensible order in the visualizations. I also created an additional column measuring the release year in years since 1920 to aid the analysis, since the calendar year has an arbitrary zero point and no movie in the dataset was released before 1920. Finally, I used the filter function to remove the 4 movies with NA values for gross earnings so that they would not interfere with the analysis. To confirm that the correct columns had been selected and their data types updated, I used the head function.

The first data visualization shows the gross earnings of each movie by its release year, with each movie color-coded by certificate rating. Gross earnings have risen sharply since the 1970s, and the five highest-grossing movies released since 2000 are all rated PG-13. As noted before, I think that this dataset should be adjusted for inflation in order to be analyzed more appropriately.

Based on this visualization, I also created a box plot that shows the 5-number summaries for all movies, grouped by certificate type. This visualization shows noticeably higher earnings for PG-13 movies than for any other rating certificate, suggesting that the rating may be associated with gross earnings. This difference has not been formally tested.

I would also have liked to analyze the gross earnings by genre. However, many of the movies had multiple genres listed, so this column contained many unique combinations of values that would have made the analysis very difficult. I would also like to better understand how to correctly complete a regression analysis using a categorical variable, because I would like to look further into the relationship between the rating certificate and the gross earnings.
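
As a possible starting point for both follow-up ideas, the sketch below uses made-up data. It assumes the genres are stored as comma-separated text such as "Crime, Drama", which may not match the real genre column. For the categorical regression, lm() accepts a factor directly and estimates one coefficient per non-reference level.

```r
library(dplyr)
library(tidyr)

# --- Splitting multi-genre rows (toy data, assumed "A, B" format) ---
toy_genre <- data.frame(
  title       = c("A", "B"),
  genre       = c("Crime, Drama", "Drama"),
  gross_total = c(100, 50)
)

# One row per movie-genre pair; a movie with two genres appears twice
genre_long <- separate_rows(toy_genre, genre, sep = ",\\s*")

genre_summary <- genre_long |>
  group_by(genre) |>
  summarise(mean_gross = mean(gross_total))

# --- Regression on a categorical predictor (toy data) ---
toy_cert <- data.frame(
  gross_total = c(10, 20, 110, 130, 40, 60),
  certificate = factor(
    c("PG", "PG", "PG-13", "PG-13", "R", "R"),
    levels = c("PG", "PG-13", "R")
  )
)

# The first level ("PG") is the reference category:
# the intercept is its mean, and each other coefficient is the
# difference between that level's mean and the reference mean
fit <- lm(gross_total ~ certificate, data = toy_cert)
coef(fit)
```

In this toy example the intercept is 15 (the PG mean), and the PG-13 and R coefficients are 105 and 35, the differences from the PG mean. Because each movie is counted once per genre after splitting, genre-level summaries built this way overlap and should be read with that in mind.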