Henry Yang (): Data search, define questions, summary statistics, RStudio.

Kyle Wong (): Data search, define questions, summary statistics, RStudio.

Sasinathaya Aphichatphokhin (): Data search, define questions, summary statistics, documentation.

The Data: We are interested in which factors affect the gross earnings of movies in the United States over a span of 10 years. Our response variable is gross earnings, and our explanatory variables are the month of release, the movie genre, and the number of tickets sold. Month of release and genre are categorical variables, the number of tickets sold is discrete, and gross earnings is continuous. Movie gross ranges from $24 to $158 million, and the number of tickets sold ranges from 2 to 94 million.
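The ranges above can only be computed once the formatted values are converted to numbers. The following is a minimal sketch in base R, assuming the gross is stored as strings with a dollar sign and thousands separators. The example strings are made up, and the formatting in the real file may differ.

```r
# Made-up example strings; the real formatting in the data file may differ.
gross.raw <- c("$24", "$1,250", "$158")

# Remove the dollar sign and commas, then convert to numeric.
gross.num <- as.numeric(gsub("[$,]", "", gross.raw))

# Smallest and largest value, as used when reporting the range of a variable.
gross.range <- range(gross.num)
print(gross.range)
```

The appendix uses `parse_number()` from readr for the same conversion.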

We obtained the US movie gross earnings data set from the-numbers.com. There are 15 genre categories: Action, Adventure, Black Comedy, Comedy, Concert, Documentary, Drama, Horror, Multiple Genre, Musical, Reality, Romantic Comedy, Thriller/Suspense, and Western, plus movies with no listed genre, which are labelled "Unknown" in our plots. The sample size, that is, the number of movies, is 5,434. The data cover releases from 2011 to 2020.

A limitation of the source is that we were unable to verify the data pool from which it gathered its data. A confounding factor for gross earnings is the production budget, which we did not include among our explanatory variables. Budget figures are generally both difficult to find and unreliable, because studios and filmmakers often keep them confidential and rarely release accurate values.

The Questions:

  1. Do movie gross earnings vary significantly with the time of year at which the movies were released?

  2. To what extent are movie gross earnings influenced by the movie genre?

  3. Does the number of movie tickets sold have a linear relationship with movie gross earnings?

The Summary Statistics:

Q1: The relationship between the month of release and gross earnings. The bar graph of gross (in millions) vs. month of release shows that, from 2011 to 2020, gross earnings are highest for December releases and lowest for January releases.
Two factors complicate this comparison. The COVID pandemic in 2020 heavily reduced gross from June onwards, and our data record gross per title rather than gross per title per month, which may misrepresent a movie's release-month performance. In addition, some genres might do well in specific months, such as Romance in February, which the data do not account for. We will investigate the relationship further with these confounding factors in mind.
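One possible way to follow up on Q1 is to compare the mean gross per title in each release month instead of the monthly total, and to repeat the summary with 2020 releases left out. The following is a minimal sketch in base R on a small made-up data frame. The column names `Month`, `Year`, and `GrossM` are hypothetical, and the values are purely illustrative and not taken from our data set.

```r
# Toy data: values are invented for illustration only.
toy <- data.frame(
  Month  = c("Jan", "Jan", "Dec", "Dec", "Dec", "Jun"),
  Year   = c(2015, 2020, 2015, 2016, 2020, 2020),
  GrossM = c(10, 2, 40, 60, 5, 1)   # gross in millions of dollars
)

# Mean gross per title by release month, all years.
mean.by.month <- tapply(toy$GrossM, toy$Month, mean)

# Same summary with 2020 releases removed to limit the pandemic effect.
pre2020 <- toy[toy$Year < 2020, ]
mean.by.month.pre2020 <- tapply(pre2020$GrossM, pre2020$Month, mean)

print(mean.by.month)
print(mean.by.month.pre2020)
```

Comparing the two summaries gives a rough sense of how much the 2020 releases pull the monthly figures down.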

Q2: The relationship between gross earnings and genre. The bar graph of gross earnings vs. genre over the 10 years shows large differences in total gross between some genres, such as comedy and adventure, while other genres are much closer. A confounding factor is that some genres may contain more movies than others because of differences in popularity. We will account for this by comparing the ratio of gross to number of movies per genre, as shown in the table.
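The ratio of gross to number of movies per genre is the mean gross per title in that genre. A minimal sketch of the calculation in base R follows, using a small made-up data frame. The column name `GrossM` is hypothetical, and the values are illustrative only.

```r
# Toy data: values are invented for illustration only.
toy <- data.frame(
  Genre  = c("Comedy", "Comedy", "Comedy", "Adventure", "Adventure", "Western"),
  GrossM = c(10, 20, 30, 90, 110, 15)   # gross in millions of dollars
)

total.gross <- tapply(toy$GrossM, toy$Genre, sum)     # total gross per genre
n.movies    <- tapply(toy$GrossM, toy$Genre, length)  # number of titles per genre

# Ratio of total gross to number of movies, i.e. mean gross per title.
gross.per.movie <- total.gross / n.movies

print(gross.per.movie)
```

In this toy example Comedy has the most titles but the lower gross per title, which is the kind of effect the ratio is meant to expose.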

Q3: The relationship between gross earnings and number of tickets sold. The scatter plot shows a fairly clear positive linear relationship between gross and the number of tickets sold, especially for the first 50 million tickets sold. A confounding variable is ticket price, which varies by region/state; its effect becomes visible as the number of tickets exceeds 50 million. Better-known movies are shown in more regions/states of the US, so their ticket sales span a wider range of prices. Taking this confounding variable into account, we will further investigate the linearity.
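Because gross is roughly the number of tickets sold times the average ticket price, the slope of a fitted line can be read as an average price per ticket. One way the linearity could be examined is with a correlation coefficient and a simple linear regression, sketched below in base R. The data frame and its column names `TicketsM` and `GrossM` are hypothetical, and the values are made up for illustration.

```r
# Toy data: values are invented for illustration only.
toy <- data.frame(
  TicketsM = c(1, 2, 3, 4, 5),       # tickets sold, in millions
  GrossM   = c(8, 19, 26, 37, 45)    # gross, in millions of dollars
)

# Strength of the linear association.
r <- cor(toy$TicketsM, toy$GrossM)

# Least-squares line: GrossM = intercept + slope * TicketsM.
fit <- lm(GrossM ~ TicketsM, data = toy)
slope <- coef(fit)[["TicketsM"]]   # dollars per ticket, since both axes are in millions

# Residuals that grow or curve with TicketsM would suggest departures from linearity.
res <- resid(fit)

print(r)
print(slope)
```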

Reference: https://www.the-numbers.com/

Appendix:

knitr::opts_chunk$set(echo = FALSE)
# Load packages once; tidyverse attaches tidyr, readr (for parse_number), and dplyr.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))

# Q1: total gross by release month.
df = read.csv("df2.csv")

# Column 2 of df holds the release month as a three-letter abbreviation.
# Aug and Sep are not part of this summary.
months = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Oct", "Nov", "Dec")

# Sum the gross of all titles released in each month; parse_number() extracts
# the numeric value from the formatted Gross strings.
sum.gross = sapply(months, function(m) sum(parse_number(df[df[,2] == m, "Gross"])),
                   USE.NAMES = FALSE)
sum.gross.month = toupper(months)

sum.bymonth = data.frame(sum.gross, sum.gross.month)

barplot(sum.gross, main = "Gross Earning by Release Month",
        names.arg = sum.gross.month,
        ylab = "Millions of Dollars", col = "#69b3a2")

# Q2: total gross by genre.
df = read.csv("df2.csv")
df2 = aggregate(parse_number(df$Gross), by = list(genre = df$Genre), FUN = sum)

# aggregate() orders groups alphabetically, so titles with an empty Genre come
# first; they are labelled "Unknown" (assumes a missing genre is stored as "").
genre.labels = as.character(df2$genre)
genre.labels[genre.labels == ""] = "Unknown"

barplot(df2$x, names.arg = genre.labels, las = 2)
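
# Q3: gross vs. number of tickets sold.
options(warn = -1)  # suppress coercion warnings while plotting
df = read.csv("df2.csv")
# Gross is read as formatted text, so convert it with parse_number() as in the
# other plots instead of relying on implicit coercion, which drops some values.
plot(df$Ticket, parse_number(df$Gross), xlim = c(0, 150),
     xlab = "Millions of Tickets", ylab = "Gross (Millions)")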