Introduction

Main Question: How have movies progressed over time, and which movies have sold the most tickets and grossed the most money? I will collect the data by scraping it from the website listed below. I compare movies by their lifetime gross, the year they were released, and their estimated ticket sales, and I use visualizations to assess the data.

Website used with data

#download file from web source
mojo_url <- "https://www.boxofficemojo.com/chart/top_lifetime_gross_adjusted/?adjust_gross_to=2020"

Data Scraping

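# packages that provide the functions used in this report (the library
# calls are not shown in the source, so this list is an assumption)
library(httr)       # GET()
library(XML)        # htmlParse(), readHTMLTable()
library(tidyverse)  # %>%, stringr, dplyr, ggplot2

# fetch the page and parse the raw response body as HTML
mojo_url_doc <- 
  GET(mojo_url)$content %>% 
  rawToChar() %>% 
  htmlParse()

# read every HTML table on the page; the chart is the first one
imdb_test_tables <- readHTMLTable(mojo_url_doc, stringsAsFactors = FALSE)
mojo <- imdb_test_tables[[1]]
# drop spaces and punctuation from the column names
names(mojo) <- str_replace_all(names(mojo), "[^\\w]", "")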

Data Cleanup and Wrangling

# strip every non-digit character (currency symbols, commas), then convert to numeric;
# each column is converted exactly once
mojo$Rank <- as.numeric(str_replace_all(mojo$Rank, "[^\\d]", ""))
mojo$AdjLifetimeGross <- as.numeric(str_replace_all(mojo$AdjLifetimeGross, "[^\\d]", ""))
mojo$LifetimeGross <- as.numeric(str_replace_all(mojo$LifetimeGross, "[^\\d]", ""))
mojo$EstNumTickets <- as.numeric(str_replace_all(mojo$EstNumTickets, "[^\\d]", ""))
mojo$Year <- as.numeric(str_replace_all(mojo$Year, "[^\\d]", ""))
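
The same cleanup can be written once and applied to several columns. The sketch below is only an illustration: `to_number` is a hypothetical helper name, the toy data frame stands in for the scraped table, and the "$" and "," formatting of the raw strings is an assumption about what the page returns. It uses base R `gsub()` so that it runs without any packages.

```r
# Hypothetical helper (not part of the original report): remove every
# non-digit character, then convert what is left to a number.
to_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

# Toy stand-in for the scraped table; the string formatting is assumed.
raw <- data.frame(
  Title = c("Gone with the Wind", "Titanic"),
  AdjLifetimeGross = c("$1,895,421,694", "$1,270,101,626"),
  EstNumTickets = c("202,286,200", "135,549,800"),
  Year = c("1939", "1997"),
  stringsAsFactors = FALSE
)

# Convert the numeric-looking columns in one pass and leave Title untouched.
num_cols <- c("AdjLifetimeGross", "EstNumTickets", "Year")
raw[num_cols] <- lapply(raw[num_cols], to_number)

str(raw)  # the three converted columns are now of type num
```

Because the pattern removes everything that is not a digit, it would also drop decimal points and minus signs. That is harmless for whole-number columns like these, but it would need adjusting for other data.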

mojo %>%
  group_by(Year) %>%
  summarize(GrossTicketSales = sum(AdjLifetimeGross))
## # A tibble: 74 x 2
##     Year GrossTicketSales
##    <dbl>            <dbl>
##  1  1921        430255408
##  2  1937       1021330000
##  3  1939       1895421694
##  4  1940        631568921
##  5  1941       1229031979
##  6  1942        596985188
##  7  1945        587921587
##  8  1946        993411148
##  9  1950        565024118
## 10  1952        562200000
## # … with 64 more rows

Data analysis and visualizations

How has movie box office revenue changed over time? (visualization)

This ggplot scatter plot shows each movie's adjusted lifetime gross by release year, with point size scaled to estimated ticket sales. The points are fairly scattered in the years up to 1975. After 1975 they are less scattered and more clumped together.

mojo %>%
  ggplot(aes(x=Year, y=AdjLifetimeGross)) +
  geom_point(aes(size=EstNumTickets)) +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE)

What movie sold the most tickets? (visualization)

We can see that Gone with the Wind sold the most tickets among the movies selected.

mojo %>%
  arrange(-EstNumTickets) %>%
  .[1:5, ]
##   Rank                              Title AdjLifetimeGross LifetimeGross
## 1    1                 Gone with the Wind       1895421694     200852579
## 2    2 Star Wars: Episode IV - A New Hope       1668979715     460998507
## 3    3                 The Sound of Music       1335086324     159287539
## 4    4         E.T. the Extra-Terrestrial       1329174791     435110554
## 5    5                            Titanic       1270101626     659363944
##   EstNumTickets Year
## 1     202286200 1939
## 2     178119500 1977
## 3     142485200 1965
## 4     141854300 1982
## 5     135549800 1997

What are the top 5 movies that sold the most tickets? (visualization)

In this visual, I chose a bar graph to distinguish each of the movies and show which sold the most tickets. Gone with the Wind sold about 202 million tickets. The closest competitor, Star Wars: Episode IV - A New Hope, sold about 178 million, and the remaining three sold between 135 and 143 million each.

mojo %>%
  arrange(-EstNumTickets) %>%
  head(5) %>%  # gives us first five values, we can also use .[1:5, ]
  ggplot(aes(x=Title, y=EstNumTickets, fill="r")) +
  geom_bar(stat="identity")
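
By default ggplot2 arranges a character x-axis alphabetically, so the bars do not appear in order of ticket sales. The sketch below shows one way to sort them with `reorder()`. It is an illustration rather than the report's own method: the `top5` data frame copies the five rows from the table printed above so that the example runs on its own, and the fill color and axis label are arbitrary choices.

```r
library(ggplot2)

# Five rows copied from the table printed above.
top5 <- data.frame(
  Title = c("Gone with the Wind", "Star Wars: Episode IV - A New Hope",
            "The Sound of Music", "E.T. the Extra-Terrestrial", "Titanic"),
  EstNumTickets = c(202286200, 178119500, 142485200, 141854300, 135549800)
)

# reorder() turns Title into a factor whose levels follow ascending
# EstNumTickets, so the best seller becomes the last level.
top5$Title <- reorder(top5$Title, top5$EstNumTickets)

p <- ggplot(top5, aes(x = Title, y = EstNumTickets)) +
  geom_col(fill = "steelblue") +  # geom_col() is geom_bar(stat = "identity")
  coord_flip() +                  # horizontal bars keep long titles readable
  labs(x = NULL, y = "Estimated tickets sold")
print(p)
```

After `coord_flip()`, the last factor level is drawn at the top, so the best-selling title appears first when reading down the chart.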

How many movies were made over time? (visualization)

This histogram shows how many of the movies in this dataset were released in each year. I chose a histogram to show the years over time and the number of movies made. Overall, as the years went on, consistently more movies were made.

mojo %>%
  ggplot(aes(x=Year)) +
  geom_histogram(bins=length(unique(mojo$Year))) +
  ggtitle("Movies Being Made over Time") +
  ylab("Number of Movies")

Top 5 Years in Ticket Sales, Adjusted for Inflation (visualization)

In this visual, we can see that gross ticket sales in 2019 were higher than in any other year in the data: about $3.75 billion in adjusted dollars, ahead of 2015 at about $3.40 billion.

mojo %>%
  group_by(Year) %>%
  summarize(GrossTicketSales = sum(AdjLifetimeGross)) %>%
  arrange(-GrossTicketSales) %>%
  head(5)
## # A tibble: 5 x 2
##    Year GrossTicketSales
##   <dbl>            <dbl>
## 1  2019       3751792976
## 2  2015       3396696212
## 3  1965       3199111959
## 4  1977       3105966663
## 5  2016       3094334745
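
mojo %>%
  group_by(Year) %>%
  summarize(GrossTicketSales = sum(AdjLifetimeGross)) %>%
  arrange(-GrossTicketSales) %>%
  head(5) %>%  # keep the five years with the highest adjusted gross
  ggplot(aes(x=factor(Year), y=GrossTicketSales)) +  # factor() gives one bar per year
  geom_bar(stat="identity")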

What statistical method would I use if I continued my analysis?

To further validate my findings, I would continue to analyze the distribution of the movies and how they compare from 1975 to 2020. I would be interested in continuing to research the gross profit of these movies and comparing the years. I would also look more closely at the overall revenue these movies have produced. For a statistical method, I would use a regression model. It would be an efficient way to find the impact these variables have on one another: I want to know which variables influence the others and which matter the most.
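
A regression of this kind could be started with base R's `lm()`. The sketch below is a minimal illustration under stated assumptions, not a finished analysis: it fits estimated tickets against release year on the five rows copied from the table printed earlier, which is far too little data for real inference and is used only so that the example runs on its own. The choice of response and predictor is also an assumption.

```r
# Five rows copied from the table printed earlier in this report.
top5 <- data.frame(
  Year = c(1939, 1977, 1965, 1982, 1997),
  EstNumTickets = c(202286200, 178119500, 142485200, 141854300, 135549800)
)

# Ordinary least squares: estimated tickets as a linear function of release year.
fit <- lm(EstNumTickets ~ Year, data = top5)

print(summary(fit))  # coefficient table, standard errors, R-squared
print(coef(fit))     # intercept and slope (change in tickets per year)
```

On the full data the same call would take `data = mojo`, and further predictors could be added on the right-hand side of the formula.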