Top 200 Movies of 2023

Author

M Sullivan

Source: https://mattcraig.substack.com/p/the-state-of-the-movies-2023-what

This data set ranks the top 200 movies released in 2023 by their total gross, or box office income. It also includes the number of theaters that showed each movie, along with each movie’s release date and distributor. I intend to examine which distributors released the highest-ranked movies of 2023 and whether films released in certain months grossed more than films released in other months. The source for this data set is Box Office Mojo.

Load the Library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tinytex)

Clean the dataset

setwd("C:/Users/micha/OneDrive/Documents/DATA 110")
Top200Movies2023 <- read_csv("Top200Movies2023.csv")
Rows: 200 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Title, Theaters, Total Gross, Release Date, Distributor
dbl (1): Rank

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
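# Remove thousands separators from the Total Gross and Theaters columns.
Top200Movies2023$`Total Gross` <- gsub(",", "", as.character(Top200Movies2023$`Total Gross`))
Top200Movies2023$Theaters <- gsub(",", "", as.character(Top200Movies2023$Theaters))
# Attempt to remove the dollar sign. In a regular expression "$" is the end-of-string
# anchor, so this call leaves the values unchanged (the output below still shows "$").
Top200Movies2023$`Total Gross` <- gsub("$", "", Top200Movies2023$`Total Gross`)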
Top200Movies2023
# A tibble: 200 × 6
    Rank Title                 Theaters `Total Gross` `Release Date` Distributor
   <dbl> <chr>                 <chr>    <chr>         <chr>          <chr>      
 1     1 Barbie                4337     $594254460    7/21/2023 0:00 Warner Bro…
 2     2 The Super Mario Bros… 4371     $574759600    4/5/2023 0:00  Universal …
 3     3 Spider-Man: Across t… 4332     $381178195    6/2/2023 0:00  Columbia P…
 4     4 Guardians of the Gal… 4450     $358995815    5/5/2023 0:00  Walt Disne…
 5     5 Oppenheimer           3761     $300144670    7/21/2023 0:00 Universal …
 6     6 The Little Mermaid    4320     $297895447    5/26/2023 0:00 Walt Disne…
 7     7 Avatar: The Way of W… 4340     $684075767    12/16/2023 0:… 20th Centu…
 8     8 Ant-Man and the Wasp… 4345     $214504909    2/17/2023 0:00 Walt Disne…
 9     9 John Wick: Chapter 4  3855     $187131806    3/24/2023 0:00 Lionsgate  
10    10 Sound of Freedom      3411     $180587629    7/4/2023 0:00  Angel Stud…
# ℹ 190 more rows
head(Top200Movies2023)
# A tibble: 6 × 6
   Rank Title                  Theaters `Total Gross` `Release Date` Distributor
  <dbl> <chr>                  <chr>    <chr>         <chr>          <chr>      
1     1 Barbie                 4337     $594254460    7/21/2023 0:00 Warner Bro…
2     2 The Super Mario Bros.… 4371     $574759600    4/5/2023 0:00  Universal …
3     3 Spider-Man: Across th… 4332     $381178195    6/2/2023 0:00  Columbia P…
4     4 Guardians of the Gala… 4450     $358995815    5/5/2023 0:00  Walt Disne…
5     5 Oppenheimer            3761     $300144670    7/21/2023 0:00 Universal …
6     6 The Little Mermaid     4320     $297895447    5/26/2023 0:00 Walt Disne…

This shows the first six rows of the data set. Every column except Rank is stored as text (chr), including Theaters and Total Gross.

ggplot(data = Top200Movies2023,
       aes(x = Title,
           y = `Total Gross`,
           colour = Distributor)) +
  geom_point(aes(shape = Distributor), alpha = 0.8) +
  labs(title = "2023 Movie Grossing by Distributor",
       x = "Title",
       y = "Total Gross")
Warning: The shape palette can deal with a maximum of 6 discrete values because more
than 6 becomes difficult to discriminate
ℹ you have requested 52 values. Consider specifying shapes manually if you need
  that many have them.
Warning: Removed 162 rows containing missing values or values outside the scale range
(`geom_point()`).
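The shape warning above arises because 52 distributors are mapped to a palette of only 6 shapes; the 162 removed rows are most likely the points whose distributor received no shape. One possible way around this, sketched below under stated assumptions, is to drop the shape mapping and lump all but the largest distributors into an "Other" group with `forcats::fct_lump_n()`. The data frame, the studio names, the gross figures, and the column name `Distributor_top` are invented for illustration and are not taken from the data set; the sketch also assumes the gross column has already been converted to numeric.

```r
library(forcats)
library(ggplot2)

# Toy stand-in for the cleaned data. Studio names and grosses are invented.
movies <- data.frame(
  Rank = 1:8,
  Gross = c(600, 500, 400, 300, 200, 100, 50, 25) * 1e6,
  Distributor = c("Studio A", "Studio B", "Studio A", "Studio C",
                  "Studio D", "Studio B", "Studio E", "Studio F")
)

# Keep the 3 distributors with the largest summed gross (w = Gross weights
# each film by its gross); every other distributor becomes "Other".
movies$Distributor_top <- fct_lump_n(movies$Distributor, n = 3, w = movies$Gross)

# Colour alone distinguishes the groups, so the 6-shape limit no longer applies.
# Rank is used on the x-axis because 200 film titles would not be legible.
p <- ggplot(movies, aes(x = Rank, y = Gross, colour = Distributor_top)) +
  geom_point(alpha = 0.8) +
  labs(title = "2023 Movie Grossing by Distributor",
       x = "Rank",
       y = "Total Gross",
       colour = "Distributor")
p
```

With the real data, `n = 10` would match the goal of showing only the top 10 distributors.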

ggplot(data = Top200Movies2023) + 
  geom_bar(mapping = aes(x = Distributor, y = `Total Gross`), stat = "identity")

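I cleaned the dataset by removing the commas (,) from the Total Gross and Theaters columns with the gsub function after the Top200Movies2023 dataset was loaded. I also attempted to remove the dollar sign ($) from the Total Gross column, but the attempt did not succeed: the printed tibble still shows the dollar sign, because in a regular expression `$` matches the end of the string rather than a literal dollar sign. As a result, Total Gross remains a character column instead of a number.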
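A minimal sketch of two ways the dollar sign could be stripped and the column converted to a number. The three sample values are copied from the tibble above, with the thousands separators re-added as an assumption about how the raw CSV stores them; the variable names are illustrative only.

```r
library(readr)

# Sample values from the printed tibble; the commas are an assumption
# about the raw CSV format.
raw_gross <- c("$594,254,460", "$574,759,600", "$381,178,195")

# Why the original call has no effect: "$" is treated as a regex anchor.
unchanged <- gsub("$", "", "$594254460")

# Option 1: base R. fixed = TRUE makes gsub() match the literal characters.
no_commas <- gsub(",", "", raw_gross, fixed = TRUE)
gross_base <- as.numeric(gsub("$", "", no_commas, fixed = TRUE))

# Option 2: readr::parse_number() drops the currency symbol and the
# grouping marks in a single step.
gross_readr <- parse_number(raw_gross)

gross_base
```

Either result could be assigned back to the `Total Gross` column, after which ggplot2 would treat the axis as continuous rather than as 200 separate text labels.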

The first visualization simply outlines each film distributor that released a film in 2023 in a color-coded legend. The second visualization is meant to be a scatterplot in which each dot represents a movie: the x-axis is the title of each film, and the y-axis is the total domestic gross. The final visualization is supposed to be a bar graph of the highest-grossing film distributors. Walt Disney appears to have grossed the most money in 2023. Four of the top 10 highest-grossing films were Disney releases: three are listed under Walt Disney and one, Avatar: The Way of Water, under its 20th Century label. By comparison, several distributors, such as Oscilloscope and Quiver Distribution, made less than $700,000 at the box office.
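The bar graph described above adds up gross per distributor, which can be made explicit by summarising before plotting. The following is a sketch only: the data frame, the studio names, and the figures are invented, the column names `total_gross` and `films` are illustrative, and the gross column is assumed to be numeric already.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the cleaned data. Studio names and grosses are invented.
movies <- data.frame(
  Distributor = c("Studio A", "Studio B", "Studio A", "Studio C",
                  "Studio D", "Studio B", "Studio E", "Studio F"),
  Gross = c(600, 500, 400, 300, 200, 100, 50, 25) * 1e6
)

# Total gross and film count per distributor, keeping the 3 largest.
top_distributors <- movies %>%
  group_by(Distributor) %>%
  summarise(total_gross = sum(Gross), films = n()) %>%
  slice_max(total_gross, n = 3)

# Horizontal bars ordered by total gross, with the axis labelled in millions.
p <- ggplot(top_distributors,
            aes(x = reorder(Distributor, total_gross), y = total_gross)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::label_dollar(scale = 1e-6, suffix = "M")) +
  labs(x = "Distributor", y = "Total Gross")
p
```

Flipping the coordinates keeps long distributor names readable, and summarising first means each bar is one row rather than a stack of individual films.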

I wish all of my visualizations were clearer and more focused. I wanted to show a comprehensive and detailed scatterplot in which each film was assigned a color that matched its distributor in the legend. This would have made it much easier to compare which distributors had successful years at the box office. I also wish the x-axis on my bar graph were easier to read and that the y-axis had a reasonable scale. I wanted both visualizations to represent only the top 10 highest-grossing film distributors, which would have made the data appear much less cluttered by eliminating the majority of the distributors. I was also unable to construct a visualization that measured total gross against the date each film was released. It would have been interesting to analyze whether films made more money during certain months and lived up to the term “summer blockbuster.”
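One way the release-month comparison might be approached is to parse the Release Date text with lubridate and sum gross by month. The sketch below uses five rows copied from the tibble printed earlier (only films whose titles are not truncated in the output); the data frame and column names are illustrative, and the gross values are entered as numbers on the assumption that the column has been cleaned.

```r
library(dplyr)
library(lubridate)

# Five rows copied from the printed tibble, with gross entered as numeric.
sample_movies <- data.frame(
  Title = c("Barbie", "Oppenheimer", "The Little Mermaid",
            "John Wick: Chapter 4", "Sound of Freedom"),
  Gross = c(594254460, 300144670, 297895447, 187131806, 180587629),
  Release_Date = c("7/21/2023 0:00", "7/21/2023 0:00", "5/26/2023 0:00",
                   "3/24/2023 0:00", "7/4/2023 0:00")
)

# mdy_hm() parses "month/day/year hour:minute" text; month() extracts 1-12.
monthly <- sample_movies %>%
  mutate(release = mdy_hm(Release_Date),
         month = month(release)) %>%
  group_by(month) %>%
  summarise(total_gross = sum(Gross), films = n())

monthly
```

Applied to all 200 films, a table like `monthly` could be plotted as a bar chart with month on the x-axis to see whether the summer months stand out.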