Main Question: How have movies progressed over time, and which movies have sold the most tickets and grossed the most money?
I will collect the data by scraping it from the website linked below. I compare movies by their lifetime gross, the year they were made, and their estimated ticket sales, and I use visualizations to assess the data.
Website used with data
#download file from web source
mojo_url <- "https://www.boxofficemojo.com/chart/top_lifetime_gross_adjusted/?adjust_gross_to=2020"
Data Scraping
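# Load packages: httr (GET), XML (HTML parsing), stringr, dplyr, ggplot2
library(httr)
library(XML)
library(stringr)
library(dplyr)
library(ggplot2)
mojo_url_doc <-
  GET(mojo_url)$content %>%
  rawToChar() %>%
  htmlParse()
imdb_test_tables <- readHTMLTable(mojo_url_doc, stringsAsFactors = FALSE)
mojo <- imdb_test_tables[[1]]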
# Strip non-word characters from the column names so they are plain identifiers
names(mojo) <- str_replace_all(names(mojo), "[^\\w]", "")
# Strip everything but digits (currency symbols, commas) and convert to numeric.
# Each column is converted exactly once.
mojo$Rank <- as.numeric(str_replace_all(mojo$Rank, "[^\\d]", ""))
mojo$AdjLifetimeGross <- as.numeric(str_replace_all(mojo$AdjLifetimeGross, "[^\\d]", ""))
mojo$LifetimeGross <- as.numeric(str_replace_all(mojo$LifetimeGross, "[^\\d]", ""))
mojo$EstNumTickets <- as.numeric(str_replace_all(mojo$EstNumTickets, "[^\\d]", ""))
mojo$Year <- as.numeric(str_replace_all(mojo$Year, "[^\\d]", ""))
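The cleaning lines above all follow one pattern: remove every non-digit character, then convert the result to numeric. A minimal sketch of that pattern as a helper, using base R's `gsub()` in place of `str_replace_all()`. The name `clean_number` and the sample strings are hypothetical illustrations, not values taken from the scraped page.

```r
# Hypothetical helper: keep only the digits of a string and return a number.
# Note that this also drops decimal points and minus signs, which is fine
# for whole-dollar amounts, ranks, and years but not for general numbers.
clean_number <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

# Made-up sample strings in the style of a formatted table cell
cleaned <- clean_number(c("$1,895,421,694", "202,286,200", "1939"))
print(cleaned)
```

With a helper like this, each column conversion would reduce to a call such as `clean_number(mojo$Year)`.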
mojo %>%
group_by(Year) %>%
summarize(GrossTicketSales = sum(AdjLifetimeGross))
## # A tibble: 74 x 2
## Year GrossTicketSales
## <dbl> <dbl>
## 1 1921 430255408
## 2 1937 1021330000
## 3 1939 1895421694
## 4 1940 631568921
## 5 1941 1229031979
## 6 1942 596985188
## 7 1945 587921587
## 8 1946 993411148
## 9 1950 565024118
## 10 1952 562200000
## # … with 64 more rows
This ggplot scatterplot shows how much money the movies grossed (adjusted lifetime gross) across release years, with point size showing estimated ticket sales. Through 1975 the points are fairly scattered. After 1975 they are less scattered and clump closer together.
mojo %>%
ggplot(aes(x=Year, y=AdjLifetimeGross)) +
geom_point(aes(size=EstNumTickets)) +
geom_smooth(method = "loess", formula = y ~ x, se = TRUE)
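The scatter described above can also be checked with a number instead of by eye. A minimal sketch, assuming 1975 as the cutoff and using only the five rows printed in the top-five table of this report as stand-in data. The names `sample_movies`, `Era`, and `era_spread` are illustrative and not part of the original analysis; with the full `mojo` table the same calls would apply, and five rows are far too few to draw conclusions from.

```r
# Stand-in data: five rows copied from the top-five table in this report
sample_movies <- data.frame(
  Title = c("Gone with the Wind", "Star Wars: Episode IV - A New Hope",
            "The Sound of Music", "E.T. the Extra-Terrestrial", "Titanic"),
  AdjLifetimeGross = c(1895421694, 1668979715, 1335086324,
                       1329174791, 1270101626),
  Year = c(1939, 1977, 1965, 1982, 1997)
)

# Label each movie by era; 1975 is the cutoff mentioned in the text
sample_movies$Era <- ifelse(sample_movies$Year <= 1975,
                            "through 1975", "after 1975")

# Standard deviation of adjusted gross within each era: a larger value
# means the points are more spread out along the vertical axis
era_spread <- tapply(sample_movies$AdjLifetimeGross, sample_movies$Era, sd)
print(era_spread)
```

A larger standard deviation for the earlier era would agree with the visual impression that the older movies are more scattered.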
We can see that Gone with the Wind sold the most tickets among the movies selected.
mojo %>%
arrange(-EstNumTickets) %>%
.[1:5, ]
## Rank Title AdjLifetimeGross LifetimeGross
## 1 1 Gone with the Wind 1895421694 200852579
## 2 2 Star Wars: Episode IV - A New Hope 1668979715 460998507
## 3 3 The Sound of Music 1335086324 159287539
## 4 4 E.T. the Extra-Terrestrial 1329174791 435110554
## 5 5 Titanic 1270101626 659363944
## EstNumTickets Year
## 1 202286200 1939
## 2 178119500 1977
## 3 142485200 1965
## 4 141854300 1982
## 5 135549800 1997
In this visual, I chose a bar graph to distinguish the movies and show which sold the most tickets. Gone with the Wind sold about 202 million tickets, roughly 24 million more than the next closest, Star Wars: Episode IV - A New Hope.
mojo %>%
arrange(-EstNumTickets) %>%
head(5) %>% # gives us first five values, we can also use .[1:5, ]
ggplot(aes(x=Title, y=EstNumTickets, fill="r")) +
geom_bar(stat="identity")
This histogram shows how the movies in the data are spread over the years. I chose a histogram to show the number of movies in each year. Overall, the number of movies in the data increases in later years.
mojo %>%
ggplot(aes(x=Year)) +
geom_histogram(bins=length(unique(mojo$Year))) +
ggtitle("Movies Being Made over Time") +
ylab("Number of Movies")
This table shows that gross ticket sales were highest in 2019, at about 3.75 billion in adjusted dollars, ahead of 2015 at about 3.40 billion and every other year in the data.
mojo %>%
group_by(Year) %>%
summarize(GrossTicketSales = sum(AdjLifetimeGross)) %>%
arrange(-GrossTicketSales) %>%
head(5)
## # A tibble: 5 x 2
## Year GrossTicketSales
## <dbl> <dbl>
## 1 2019 3751792976
## 2 2015 3396696212
## 3 1965 3199111959
## 4 1977 3105966663
## 5 2016 3094334745
What statistical method would I use if I continued my analysis?
To further validate my findings, I would continue to analyze the distribution of the movies and how they compare from 1975 to 2020. I would be interested in continuing to research the gross profit of these movies and comparing it across years, as well as the overall revenue these movies have produced. I would also use a regression model. It would be an efficient way to find the impact these variables have on one another: I want to know which variables influence the others and which matter the most.
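As a starting point for the regression idea, here is a minimal sketch using base R's `lm()`. It assumes a simple linear model of estimated tickets on release year, which is only one of many possible specifications, and it uses the five rows printed in the top-five table as stand-in data. The names `sample_movies` and `fit` are illustrative; five points are far too few for real conclusions, so the full `mojo` table would be needed in practice.

```r
# Stand-in data: tickets and years copied from the top-five table in this report
sample_movies <- data.frame(
  EstNumTickets = c(202286200, 178119500, 142485200, 141854300, 135549800),
  Year = c(1939, 1977, 1965, 1982, 1997)
)

# Fit a simple linear regression: estimated tickets as a function of year
fit <- lm(EstNumTickets ~ Year, data = sample_movies)

# The Year coefficient estimates the change in tickets per additional year
print(coef(fit))

# R-squared: share of the variation in tickets explained by year alone
print(summary(fit)$r.squared)
```

Adding more predictors, for example `LifetimeGross`, would be a matter of extending the formula, though variables that are nearly functions of one another should not be combined in the same model.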