First we bring in all the libraries we will be using. Then we load the data set we have downloaded.
#Load Libraries
library(tidyr)
library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(janitor)
library(ggplot2)
library(scales)
library(knitr)
#Load in the dataset
movies_raw <- read_csv("/Users/jus10segrest/Downloads/iu indy/stat for data science/movies.csv")
The next step for our data set is to clean it and format it so that we can begin to work through it.
#create a new table separating the released column into two release date/country
movies_ <- movies_raw |>
separate(released, into = c("release_new","country_released"), sep=" \\(") |>
mutate(country_released = str_remove(country_released, "\\)$")) |> #remove the end parathensis
mutate(release_date=mdy(release_new)) |> #then change the date to an easier format
rename(country_filmed=country) #rename column for ease of understanding
movies_
## # A tibble: 7,668 × 17
## name rating genre year release_new country_released score votes director
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 The Sh… R Drama 1980 June 13, 1… United States 8.4 9.27e5 Stanley…
## 2 The Bl… R Adve… 1980 July 2, 19… United States 5.8 6.5 e4 Randal …
## 3 Star W… PG Acti… 1980 June 20, 1… United States 8.7 1.20e6 Irvin K…
## 4 Airpla… PG Come… 1980 July 2, 19… United States 7.7 2.21e5 Jim Abr…
## 5 Caddys… R Come… 1980 July 25, 1… United States 7.3 1.08e5 Harold …
## 6 Friday… R Horr… 1980 May 9, 1980 United States 6.4 1.23e5 Sean S.…
## 7 The Bl… R Acti… 1980 June 20, 1… United States 7.9 1.88e5 John La…
## 8 Raging… R Biog… 1980 December 1… United States 8.2 3.30e5 Martin …
## 9 Superm… PG Acti… 1980 June 19, 1… United States 6.8 1.01e5 Richard…
## 10 The Lo… R Biog… 1980 May 16, 19… United States 7 1 e4 Walter …
## # ℹ 7,658 more rows
## # ℹ 8 more variables: writer <chr>, star <chr>, country_filmed <chr>,
## # budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>,
## # release_date <date>
The first column I wanted to look at was “score.” This column is a collection of the average IMDB score each movie and is on a scale of 0 to 10.
movies_ |>
summarize(
mean_score = mean(score, na.rm=TRUE),
min_score = min(score, na.rm=TRUE),
max_score = max(score, na.rm=TRUE),
quantile_score = quantile(score, na.rm=TRUE)
)
## # A tibble: 5 × 4
## mean_score min_score max_score quantile_score
## <dbl> <dbl> <dbl> <dbl>
## 1 6.39 1.9 9.3 1.9
## 2 6.39 1.9 9.3 5.8
## 3 6.39 1.9 9.3 6.5
## 4 6.39 1.9 9.3 7.1
## 5 6.39 1.9 9.3 9.3
We can see in the above output that the minimum average score was 1.9 while the maximum score was 9.3. It could be an interesting question later to see if there are common variables between low scoring movies and high scoring movies such as budget, production company, etc. The mean score is 6.39 which is interesting because your first thought would be that it should be closer to 5 if all of these movies from each year are randomly selected. This is another interesting question to look into in the future.
The next column I looked at was “gross,” which is the gross revenue of the movie (just in the United States according to the data set author).
movies_ |>
summarize(
mean_gross = mean(gross, na.rm=TRUE),
median_gross = median(gross, na.rm=TRUE),
min_gross = min(gross, na.rm=TRUE),
max_gross = max(gross, na.rm=TRUE),
quantile_gross = quantile(gross, na.rm=TRUE)
)
## # A tibble: 5 × 5
## mean_gross median_gross min_gross max_gross quantile_gross
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 78500541. 20205757 309 2847246203 309
## 2 78500541. 20205757 309 2847246203 4532056.
## 3 78500541. 20205757 309 2847246203 20205757
## 4 78500541. 20205757 309 2847246203 76016692.
## 5 78500541. 20205757 309 2847246203 2847246203
In this output we see that the minimum gross revenue for a movie was a measly $309 while the max gross is well over $2 billion! Because of that huge disparity I decided to look at both mean and median. The mean is $78,500,541 while the median is $20,205,757, showing that the highest grossing movies are bringing up the average revenue by a lot. Later I look at the correlation between gross revenue and budget using a scatter plot but it could be interesting to delve into it more and see the relationships between revenue and budgets.
Question 1 - How many movies are in each year in the data set.
I felt this was a good question as it would show which movies might need to be left out in the future in terms of not having enough data and skewing results.
movies_ |>
tabyl(year) |>
adorn_pct_formatting()
## year n percent
## 1980 92 1.2%
## 1981 113 1.5%
## 1982 126 1.6%
## 1983 144 1.9%
## 1984 168 2.2%
## 1985 200 2.6%
## 1986 200 2.6%
## 1987 200 2.6%
## 1988 200 2.6%
## 1989 200 2.6%
## 1990 200 2.6%
## 1991 200 2.6%
## 1992 200 2.6%
## 1993 200 2.6%
## 1994 200 2.6%
## 1995 200 2.6%
## 1996 200 2.6%
## 1997 200 2.6%
## 1998 200 2.6%
## 1999 200 2.6%
## 2000 200 2.6%
## 2001 200 2.6%
## 2002 200 2.6%
## 2003 200 2.6%
## 2004 200 2.6%
## 2005 200 2.6%
## 2006 200 2.6%
## 2007 200 2.6%
## 2008 200 2.6%
## 2009 200 2.6%
## 2010 200 2.6%
## 2011 200 2.6%
## 2012 200 2.6%
## 2013 200 2.6%
## 2014 200 2.6%
## 2015 200 2.6%
## 2016 200 2.6%
## 2017 200 2.6%
## 2018 200 2.6%
## 2019 200 2.6%
## 2020 25 0.3%
This output shows that the only outlier years are 1980-1984 and 2020. The rest of the years in the data set have 200 movies per year which should lead to easy data manipulation in the future.
Question 2 - How many movies have the top 50 production companies produced?
I wanted to investigate this one as I was curious what the top production companies were in the data set as well as how many movies those top companies account for.
movies_ |>
tabyl(company, show_na=FALSE) |>
adorn_pct_formatting() |>
arrange(desc(n)) |>
slice(1:51)
## company n percent
## Universal Pictures 377 4.9%
## Warner Bros. 334 4.4%
## Columbia Pictures 332 4.3%
## Paramount Pictures 320 4.2%
## Twentieth Century Fox 240 3.1%
## New Line Cinema 174 2.3%
## Touchstone Pictures 132 1.7%
## Metro-Goldwyn-Mayer (MGM) 125 1.6%
## Walt Disney Pictures 123 1.6%
## TriStar Pictures 94 1.2%
## Dreamworks Pictures 76 1.0%
## Miramax 74 1.0%
## Fox 2000 Pictures 63 0.8%
## Orion Pictures 60 0.8%
## Summit Entertainment 60 0.8%
## Castle Rock Entertainment 59 0.8%
## Lionsgate 58 0.8%
## Fox Searchlight Pictures 57 0.7%
## Dimension Films 56 0.7%
## Screen Gems 56 0.7%
## Hollywood Pictures 49 0.6%
## Focus Features 45 0.6%
## United Artists 41 0.5%
## New Regency Productions 40 0.5%
## Morgan Creek Entertainment 36 0.5%
## BBC Films 32 0.4%
## Revolution Studios 32 0.4%
## The Weinstein Company 30 0.4%
## DreamWorks Animation 28 0.4%
## Alcon Entertainment 26 0.3%
## Carolco Pictures 25 0.3%
## EuropaCorp 24 0.3%
## Dino De Laurentiis Company 20 0.3%
## New World Pictures 20 0.3%
## CBS Films 19 0.2%
## Channel Four Films 19 0.2%
## Golan-Globus Productions 19 0.2%
## Imagine Entertainment 19 0.2%
## Canal+ 18 0.2%
## Golden Harvest Company 18 0.2%
## Polygram Filmed Entertainment 18 0.2%
## De Laurentiis Entertainment Group (DEG) 17 0.2%
## Lions Gate Films 17 0.2%
## PolyGram Filmed Entertainment 17 0.2%
## StudioCanal 17 0.2%
## Walt Disney Animation Studios 17 0.2%
## Relativity Media 16 0.2%
## The Cannon Group 16 0.2%
## Caravan Pictures 15 0.2%
## Gaumont 15 0.2%
## Millennium Films 15 0.2%
In this output we can see that all the well known names are near the top with Universal Pictures being the top production company with 377 movies and almost making up 5% of the data set. In the future when analyzing different production companies it might be a good idea to take only the top ~10 as they all have over 100 movies and could lead to some usable analysis.
Question 3 - What individual date had the most amount of movies released on that day? And what were those movies?
This question began with just the first sentence but once I completed the code,but once I saw that only one date had 9 movies released on it, I thought it would be interesting to see what movies all were competing against each other.
most_movies <- movies_ |>
tabyl(release_date, show_na=FALSE) |>
arrange(desc(n)) |>
slice(1:1) #now have the top one day for movies
most_date<- most_movies$release_date[1] #get just the date
most_movie_day <- movies_ |>
filter(release_date == most_date) #now have all 9 movies!
kable(most_movie_day)
name | rating | genre | year | release_new | country_released | score | votes | director | writer | star | country_filmed | budget | gross | company | runtime | release_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Kaos | R | Comedy | 1984 | February 14, 1986 | United States | 7.9 | 2700 | Paolo Taviani | Luigi Pirandello | Margarita Lozano | Italy | NA | NA | Filmtre | 187 | 1986-02-14 |
Winter Flight | NA | Drama | 1984 | February 14, 1986 | United States | 7.3 | 67 | Roy Battersby | Alan Janes | Reece Dinsdale | United Kingdom | NA | 2729 | Enigma Productions | 89 | 1986-02-14 |
La Cage aux Folles 3: The Wedding | PG-13 | Comedy | 1985 | February 14, 1986 | United States | 4.7 | 760 | Georges Lautner | Philippe Nicaud | Michel Serrault | France | NA | 345280 | Columbia Films | 87 | 1986-02-14 |
Turtle Diary | PG | Comedy | 1985 | February 14, 1986 | United States | 6.9 | 726 | John Irvin | Russell Hoban | Glenda Jackson | United Kingdom | NA | 2156772 | Britannic | 97 | 1986-02-14 |
The Delta Force | R | Action | 1986 | February 14, 1986 | United States | 5.6 | 19000 | Menahem Golan | James Bruner | Chuck Norris | United States | 1.2e+07 | 17768900 | Golan-Globus Productions | 125 | 1986-02-14 |
Wildcats | R | Comedy | 1986 | February 14, 1986 | United States | 6.1 | 12000 | Michael Ritchie | Ezra Sacks | Goldie Hawn | United States | NA | 26285544 | Hawn / Sylbert Movie Company | 106 | 1986-02-14 |
TerrorVision | R | Comedy | 1986 | February 14, 1986 | United States | 5.7 | 6600 | Ted Nicolaou | Ted Nicolaou | Diane Franklin | United States | NA | 320256 | Empire Pictures | 85 | 1986-02-14 |
Quicksilver | PG | Crime | 1986 | February 14, 1986 | United States | 5.7 | 4500 | Thomas Michael Donnelly | Thomas Michael Donnelly | Kevin Bacon | United States | NA | 7634909 | Columbia Pictures | 105 | 1986-02-14 |
Knights of the City | R | Action | 1986 | February 14, 1986 | United States | 4.8 | 153 | Dominic Orlando | Leon Isaac Kennedy | Leon Isaac Kennedy | United States | NA | 601451 | Grace | 87 | 1986-02-14 |
After all that code we finally end up with the 9 movies that released on February 14, 1986, the most on a single day in the entire data set! Some interesting tidbits, 4 movies released on this day made over $1 million, with Wildcats making over $26 million and being the most successful of all 9. If you add up all the run times from all 9 movies it is 968 minutes or just over 16 hours. Theoretically someone could have seen all 9 on the same day!
Graph 1 - Scatter plot of Gross Revenue and Budget
ggplot(movies_, mapping=aes(x = budget, y = gross)) +
geom_point(color = "black") +
geom_smooth(method = "gam", color = "blue", se=FALSE) +
geom_abline(slope=1, intercept=0, color="green", linetype="dashed") +
scale_x_continuous(labels = label_dollar()) +
scale_y_continuous(labels = label_dollar()) +
labs(
title = "Scatterplot of Gross Revenue vs Budget",
x = "Budget",
y = "Gross Revenue",
subtitle = "Above the dashed line: Profitable | Below the dashed line: Loss"
) +
theme_minimal()
I really like how this graph turned out, as it shows the Gross Revenue on the y-axis against the Budget on the x-axis. The blue line is the trend line showing that there is a positive correlation between a higher budget and higher revenue. This makes sense as companies are more likely to give more money to projects or directors they are more confident in, although you still do have some movies that under perform. The green dashed line shows if a movie made money or lost money (doesn’t account for marketing or other factors besides budget). We see that most movies past $100,000,000 seem to make a slight profit at the least while every movie over $250 million turned a profit. Some things to investigate in the future are what are the most profitable movies in the data set as well as what are the biggest under performers.
Graph 2 - Box plots of gross revenue for the 8 most popular genres
genres_keep<- c("Action","Adventure","Animation","Biography","Comedy","Crime","Drama","Horror")
movie_boxplot <- movies_ |>
filter(genre %in% genres_keep)
ggplot(movie_boxplot, mapping=aes(x = genre,y = gross)) +
scale_y_continuous(labels = label_dollar()) +
labs(
title = "Boxplot of the top 8 most popular genres vs gross revenue",
x = "Genre",
y = "Gross Revenue",
) +
geom_boxplot(outlier.color = "red")
This graph shows the top 8 most popular genres against gross revenue. I choose the 8 most popular genres based on the amount in the data set, these 8 genres make up 98.2% of the data set, so I feel that they do a good job of showing the data. The obvious problem with this graph is that we have serious outlines skewing the data. Most movies simply do not make millions of dollars meaning that when we have certain films making upwards of $1 billion, it causes the box plot to be situated at the bottom. I still thought it was important to include this graph as it shows in the future that more filtering might need to be done so that the average movies are better represented instead of just the highest grossing. It would be an interesting investigation in the future to find which are the highest grossing films of each genre, especially genres like horror, crime, or drama which have huge outliers.