Data Dive Week 2 - Justin Segrest

Setting up R and Loading Data set

First we bring in all the libraries we will be using. Then we load the data set we have downloaded.

#Load Libraries
library(tidyr)
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(knitr)

#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")

The next step for our data set is to clean it and format it so that we can begin to work through it.

#create a new table separating the released column into two release date/country
movies_ <- movies_raw |>
  separate(released, into = c("release_new","country_released"), sep=" \\(") |>
  mutate(country_released = str_remove(country_released, "\\)$")) |>    #remove the end parathensis
  mutate(release_date=mdy(release_new)) |>         #then change the date to an easier format
  rename(country_filmed=country)            #rename column for ease of understanding
  
movies_

## # A tibble: 7,668 × 17
##    name    rating genre  year release_new country_released score  votes director
##    <chr>   <chr>  <chr> <dbl> <chr>       <chr>            <dbl>  <dbl> <chr>   
##  1 The Sh… R      Drama  1980 June 13, 1… United States      8.4 9.27e5 Stanley…
##  2 The Bl… R      Adve…  1980 July 2, 19… United States      5.8 6.5 e4 Randal …
##  3 Star W… PG     Acti…  1980 June 20, 1… United States      8.7 1.20e6 Irvin K…
##  4 Airpla… PG     Come…  1980 July 2, 19… United States      7.7 2.21e5 Jim Abr…
##  5 Caddys… R      Come…  1980 July 25, 1… United States      7.3 1.08e5 Harold …
##  6 Friday… R      Horr…  1980 May 9, 1980 United States      6.4 1.23e5 Sean S.…
##  7 The Bl… R      Acti…  1980 June 20, 1… United States      7.9 1.88e5 John La…
##  8 Raging… R      Biog…  1980 December 1… United States      8.2 3.30e5 Martin …
##  9 Superm… PG     Acti…  1980 June 19, 1… United States      6.8 1.01e5 Richard…
## 10 The Lo… R      Biog…  1980 May 16, 19… United States      7   1   e4 Walter …
## # ℹ 7,658 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## #   budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## #   release_date <date>

Numeric Summaries

The first column I wanted to look at was “score.” This column is a collection of the average IMDB score each movie and is on a scale of 0 to 10.

movies_ |>
  summarize(
    mean_score = mean(score, na.rm=TRUE),
    min_score = min(score, na.rm=TRUE),
    max_score = max(score, na.rm=TRUE),
    quantile_score = quantile(score, na.rm=TRUE)
  )

## # A tibble: 5 × 4
##   mean_score min_score max_score quantile_score
##        <dbl>     <dbl>     <dbl>          <dbl>
## 1       6.39       1.9       9.3            1.9
## 2       6.39       1.9       9.3            5.8
## 3       6.39       1.9       9.3            6.5
## 4       6.39       1.9       9.3            7.1
## 5       6.39       1.9       9.3            9.3

We can see in the above output that the minimum average score was 1.9 while the maximum score was 9.3. It could be an interesting question later to see if there are common variables between low scoring movies and high scoring movies such as budget, production company, etc. The mean score is 6.39 which is interesting because your first thought would be that it should be closer to 5 if all of these movies from each year are randomly selected. This is another interesting question to look into in the future.

The next column I looked at was “gross,” which is the gross revenue of the movie (just in the United States according to the data set author).

movies_ |>
  summarize(
    mean_gross = mean(gross, na.rm=TRUE),
    median_gross = median(gross, na.rm=TRUE),
    min_gross = min(gross, na.rm=TRUE),
    max_gross = max(gross, na.rm=TRUE),
    quantile_gross = quantile(gross, na.rm=TRUE)
  )

## # A tibble: 5 × 5
##   mean_gross median_gross min_gross  max_gross quantile_gross
##        <dbl>        <dbl>     <dbl>      <dbl>          <dbl>
## 1  78500541.     20205757       309 2847246203           309 
## 2  78500541.     20205757       309 2847246203       4532056.
## 3  78500541.     20205757       309 2847246203      20205757 
## 4  78500541.     20205757       309 2847246203      76016692.
## 5  78500541.     20205757       309 2847246203    2847246203

In this output we see that the minimum gross revenue for a movie was a measly $309 while the max gross is well over $2 billion! Because of that huge disparity I decided to look at both mean and median. The mean is $78,500,541 while the median is $20,205,757, showing that the highest grossing movies are bringing up the average revenue by a lot. Later I look at the correlation between gross revenue and budget using a scatter plot but it could be interesting to delve into it more and see the relationships between revenue and budgets.

3 Questions to investigate

Question 1 - How many movies are in each year in the data set.

I felt this was a good question as it would show which movies might need to be left out in the future in terms of not having enough data and skewing results.

movies_ |>
  tabyl(year) |>
  adorn_pct_formatting()

##  year   n percent
##  1980  92    1.2%
##  1981 113    1.5%
##  1982 126    1.6%
##  1983 144    1.9%
##  1984 168    2.2%
##  1985 200    2.6%
##  1986 200    2.6%
##  1987 200    2.6%
##  1988 200    2.6%
##  1989 200    2.6%
##  1990 200    2.6%
##  1991 200    2.6%
##  1992 200    2.6%
##  1993 200    2.6%
##  1994 200    2.6%
##  1995 200    2.6%
##  1996 200    2.6%
##  1997 200    2.6%
##  1998 200    2.6%
##  1999 200    2.6%
##  2000 200    2.6%
##  2001 200    2.6%
##  2002 200    2.6%
##  2003 200    2.6%
##  2004 200    2.6%
##  2005 200    2.6%
##  2006 200    2.6%
##  2007 200    2.6%
##  2008 200    2.6%
##  2009 200    2.6%
##  2010 200    2.6%
##  2011 200    2.6%
##  2012 200    2.6%
##  2013 200    2.6%
##  2014 200    2.6%
##  2015 200    2.6%
##  2016 200    2.6%
##  2017 200    2.6%
##  2018 200    2.6%
##  2019 200    2.6%
##  2020  25    0.3%

This output shows that the only outlier years are 1980-1984 and 2020. The rest of the years in the data set have 200 movies per year which should lead to easy data manipulation in the future.

Question 2 - How many movies have the top 50 production companies produced?

I wanted to investigate this one as I was curious what the top production companies were in the data set as well as how many movies those top companies account for.

movies_ |>
  tabyl(company, show_na=FALSE) |>
  adorn_pct_formatting() |>
  arrange(desc(n)) |>
  slice(1:51)

##                                  company   n percent
##                       Universal Pictures 377    4.9%
##                             Warner Bros. 334    4.4%
##                        Columbia Pictures 332    4.3%
##                       Paramount Pictures 320    4.2%
##                    Twentieth Century Fox 240    3.1%
##                          New Line Cinema 174    2.3%
##                      Touchstone Pictures 132    1.7%
##                Metro-Goldwyn-Mayer (MGM) 125    1.6%
##                     Walt Disney Pictures 123    1.6%
##                         TriStar Pictures  94    1.2%
##                      Dreamworks Pictures  76    1.0%
##                                  Miramax  74    1.0%
##                        Fox 2000 Pictures  63    0.8%
##                           Orion Pictures  60    0.8%
##                     Summit Entertainment  60    0.8%
##                Castle Rock Entertainment  59    0.8%
##                                Lionsgate  58    0.8%
##                 Fox Searchlight Pictures  57    0.7%
##                          Dimension Films  56    0.7%
##                              Screen Gems  56    0.7%
##                       Hollywood Pictures  49    0.6%
##                           Focus Features  45    0.6%
##                           United Artists  41    0.5%
##                  New Regency Productions  40    0.5%
##               Morgan Creek Entertainment  36    0.5%
##                                BBC Films  32    0.4%
##                       Revolution Studios  32    0.4%
##                    The Weinstein Company  30    0.4%
##                     DreamWorks Animation  28    0.4%
##                      Alcon Entertainment  26    0.3%
##                         Carolco Pictures  25    0.3%
##                               EuropaCorp  24    0.3%
##               Dino De Laurentiis Company  20    0.3%
##                       New World Pictures  20    0.3%
##                                CBS Films  19    0.2%
##                       Channel Four Films  19    0.2%
##                 Golan-Globus Productions  19    0.2%
##                    Imagine Entertainment  19    0.2%
##                                   Canal+  18    0.2%
##                   Golden Harvest Company  18    0.2%
##            Polygram Filmed Entertainment  18    0.2%
##  De Laurentiis Entertainment Group (DEG)  17    0.2%
##                         Lions Gate Films  17    0.2%
##            PolyGram Filmed Entertainment  17    0.2%
##                              StudioCanal  17    0.2%
##            Walt Disney Animation Studios  17    0.2%
##                         Relativity Media  16    0.2%
##                         The Cannon Group  16    0.2%
##                         Caravan Pictures  15    0.2%
##                                  Gaumont  15    0.2%
##                         Millennium Films  15    0.2%

In this output we can see that all the well known names are near the top with Universal Pictures being the top production company with 377 movies and almost making up 5% of the data set. In the future when analyzing different production companies it might be a good idea to take only the top ~10 as they all have over 100 movies and could lead to some usable analysis.

Question 3 - What individual date had the most amount of movies released on that day? And what were those movies?

This question began with just the first sentence but once I completed the code,but once I saw that only one date had 9 movies released on it, I thought it would be interesting to see what movies all were competing against each other.

most_movies <- movies_ |>
  tabyl(release_date, show_na=FALSE) |>
  arrange(desc(n)) |>
  slice(1:1)    #now have the top one day for movies

most_date<- most_movies$release_date[1]   #get just the date

most_movie_day <- movies_ |>
  filter(release_date == most_date)        #now have all 9 movies!

kable(most_movie_day)

name	rating	genre	year	release_new	country_released	score	votes	director	writer	star	country_filmed	budget	gross	company	runtime	release_date
Kaos	R	Comedy	1984	February 14, 1986	United States	7.9	2700	Paolo Taviani	Luigi Pirandello	Margarita Lozano	Italy	NA	NA	Filmtre	187	1986-02-14
Winter Flight	NA	Drama	1984	February 14, 1986	United States	7.3	67	Roy Battersby	Alan Janes	Reece Dinsdale	United Kingdom	NA	2729	Enigma Productions	89	1986-02-14
La Cage aux Folles 3: The Wedding	PG-13	Comedy	1985	February 14, 1986	United States	4.7	760	Georges Lautner	Philippe Nicaud	Michel Serrault	France	NA	345280	Columbia Films	87	1986-02-14
Turtle Diary	PG	Comedy	1985	February 14, 1986	United States	6.9	726	John Irvin	Russell Hoban	Glenda Jackson	United Kingdom	NA	2156772	Britannic	97	1986-02-14
The Delta Force	R	Action	1986	February 14, 1986	United States	5.6	19000	Menahem Golan	James Bruner	Chuck Norris	United States	1.2e+07	17768900	Golan-Globus Productions	125	1986-02-14
Wildcats	R	Comedy	1986	February 14, 1986	United States	6.1	12000	Michael Ritchie	Ezra Sacks	Goldie Hawn	United States	NA	26285544	Hawn / Sylbert Movie Company	106	1986-02-14
TerrorVision	R	Comedy	1986	February 14, 1986	United States	5.7	6600	Ted Nicolaou	Ted Nicolaou	Diane Franklin	United States	NA	320256	Empire Pictures	85	1986-02-14
Quicksilver	PG	Crime	1986	February 14, 1986	United States	5.7	4500	Thomas Michael Donnelly	Thomas Michael Donnelly	Kevin Bacon	United States	NA	7634909	Columbia Pictures	105	1986-02-14
Knights of the City	R	Action	1986	February 14, 1986	United States	4.8	153	Dominic Orlando	Leon Isaac Kennedy	Leon Isaac Kennedy	United States	NA	601451	Grace	87	1986-02-14

After all that code we finally end up with the 9 movies that released on February 14, 1986, the most on a single day in the entire data set! Some interesting tidbits, 4 movies released on this day made over $1 million, with Wildcats making over $26 million and being the most successful of all 9. If you add up all the run times from all 9 movies it is 968 minutes or just over 16 hours. Theoretically someone could have seen all 9 on the same day!

Visual Summaries

Graph 1 - Scatter plot of Gross Revenue and Budget

ggplot(movies_, mapping=aes(x = budget, y = gross)) +
  geom_point(color = "black") +
  geom_smooth(method = "gam", color = "blue", se=FALSE) +
  geom_abline(slope=1, intercept=0, color="green", linetype="dashed") +
  scale_x_continuous(labels = label_dollar()) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Scatterplot of Gross Revenue vs Budget",
    x = "Budget",
    y = "Gross Revenue",
    subtitle = "Above the dashed line: Profitable | Below the dashed line: Loss"
  ) +
  theme_minimal()

I really like how this graph turned out, as it shows the Gross Revenue on the y-axis against the Budget on the x-axis. The blue line is the trend line showing that there is a positive correlation between a higher budget and higher revenue. This makes sense as companies are more likely to give more money to projects or directors they are more confident in, although you still do have some movies that under perform. The green dashed line shows if a movie made money or lost money (doesn’t account for marketing or other factors besides budget). We see that most movies past $100,000,000 seem to make a slight profit at the least while every movie over $250 million turned a profit. Some things to investigate in the future are what are the most profitable movies in the data set as well as what are the biggest under performers.

Graph 2 - Box plots of gross revenue for the 8 most popular genres

genres_keep<- c("Action","Adventure","Animation","Biography","Comedy","Crime","Drama","Horror")
movie_boxplot <- movies_ |>
  filter(genre %in% genres_keep)
ggplot(movie_boxplot, mapping=aes(x = genre,y = gross)) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Boxplot of the top 8 most popular genres vs gross revenue",
    x = "Genre",
    y = "Gross Revenue",
  ) +
  geom_boxplot(outlier.color = "red")

This graph shows the top 8 most popular genres against gross revenue. I choose the 8 most popular genres based on the amount in the data set, these 8 genres make up 98.2% of the data set, so I feel that they do a good job of showing the data. The obvious problem with this graph is that we have serious outlines skewing the data. Most movies simply do not make millions of dollars meaning that when we have certain films making upwards of $1 billion, it causes the box plot to be situated at the bottom. I still thought it was important to include this graph as it shows in the future that more filtering might need to be done so that the average movies are better represented instead of just the highest grossing. It would be an interesting investigation in the future to find which are the highest grossing films of each genre, especially genres like horror, crime, or drama which have huge outliers.