In this project, I scraped, cleaned, and transformed data of Marvel Cinematic Universe films from https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films. Afterwards, I analyzed the data and visualized my findings with ggplot2. ***This information is valid until April 2019.

Loading packages

library(rvest)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggrepel)
library(gridExtra)
library(zoo)
library(reshape2)
library(car)
library(plotly)

Scraping two tables from Wikipedia: Critical Response table and Box Office Performance table. Afterwards, I cleaned and merged them into a single data frame.

movie_table <- read_html("https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films")
boxoff <- html_table(html_nodes(movie_table, "table")[[10]],fill=T)[c(-1,-24),c(-3,-4,-6,-7,-9)] 
critical <- html_table(html_nodes(movie_table, "table")[[11]],fill=T)[-23,]
merged <- merge(boxoff,critical,by.x="Film",by.y="Film")
colnames(merged)[3] <- "Box.Office.Gross.Worldwide"
colnames(merged)[2] <- "Year"
colnames(merged) <- make.names(colnames(merged))
merged
The data in the table is not ready for analysis because there are symbols (commas, dollar signs, spaces) and useless words. Now I will start by cleaning the data by keeping the columns: Box office gross worldwide, release year, Budget, Rotten Tomatoes scores, and Metacritic scores. Also, I will create a ‘Revenue’ column in the table.

Removing the days and months while keeping the year and converting to numeric values

a=regexpr("[[:digit:]]{4}", merged$Year)
merged$Year <- regmatches(merged$Year,a) %>% as.numeric() 

Removing the “$” and “,” from the gross column and converting to numeric values

merged$Box.Office.Gross.Worldwide <- gsub("(\\$|,)", "", merged$Box.Office.Gross.Worldwide) %>% as.numeric()

Removed symbols and converted the net budget to numeric values by multiplying by 1000000

c=gsub("(\\$|\\s{1}million)", "", merged$Budget) 
merged$Budget <- gsub("[[:digit:]]{3}(\\–){1}","",c) %>% as.numeric()*1000000

Removed symbols and only kept the scores as numeric values

merged$Rotten.Tomatoes <- substr(merged$Rotten.Tomatoes,1,2) %>% as.numeric()

Removed symbols and only kept the scores as numeric values

merged$`Metacritic` <- substr(merged$`Metacritic`,1,2) %>% as.numeric()

Adding new ‘Revenue’ feature to the merged data frame

merged <- mutate(merged, Revenue=Box.Office.Gross.Worldwide-Budget)

Cleaned and merged table

write.csv(merged, "CleanedMarvel.csv")
merged

Moving averages of Box Office Gross Worldwide and Budget over time

The budget seems to be fairly constant throughout the years however gross has fallen slightly in 2015 - 2017 and begins increasing again in 2017 more at a higher rate than previous years

ggplot(data=merged, aes(x=Year)) + 
  geom_point(aes(y=Box.Office.Gross.Worldwide)) +
  geom_smooth(method="loess", aes(y=Box.Office.Gross.Worldwide, colour="Gross_Loess")) +
    geom_smooth(method="lm", se=F,aes(y=Box.Office.Gross.Worldwide, colour="Gross_Linear")) +
  geom_point(data=merged,aes(y=Budget, colour="Budget_MovingAvg")) +
  geom_line(data=merged,aes(y=rollapplyr(merged$Budget,width=10,FUN=mean,partial=T),colour="Budget_MovingAvg")) +
  scale_x_continuous(breaks=c(2008:2019)) +
  labs(title="Box Office Gross worldwide and Budget income over time for MCU movies") +
  ylab(label="Dollars") 

  #scale_colour_manual(name="", values=c(Gross_MovingAvg="#00BA38", Budget_MovingAvg="#F8766D", Gross_Loess="#619CFF")) 

What is the distribution of revenue for Marvel movies?

The red dashed line represents the mean revenue and the black surve represents the density curve. Before taking the log of the revenue, the distribution appears to be positively skewed to the right. After the transformation, the distribution appears to be closer to a normal distribution.

ggplot(data=merged, aes(x=Revenue)) +
  geom_density(color="black") +
  geom_vline(xintercept = mean(merged$Revenue), color = "red", linetype = "dashed") +
  labs(title="Distribution of Revenue for MCU movies") +
  xlab(label="Revenue in Dollars") +
  ylab(label="Density")


ggplot(data=merged, aes(x=log(Revenue))) +
  geom_density(color="black") +
  geom_vline(xintercept = mean(log(merged$Revenue)), color = "red", linetype = "dashed") +
  labs(title="Distribution of log(Revenue) for MCU movies") +
  xlab(label="log(Revenue) in Dollars") +
  ylab(label="Density")

Which years did MCU movies bring in the most revenue?

The graphs show that 2018 and 2019 MCU movies brought in the highest Revenue income. However there were 2 MCU released in 2019 and 3 in 2018. A film released in 2019 generated significantly more revenue than movies from previous years. (Avengers: Endgame)

merged %>%
  group_by(Year) %>%
  summarize(Total.Revenue=sum(Revenue)) %>%
  ggplot(aes(x=Year, y=Total.Revenue)) + 
  geom_bar(stat="identity", fill="#F8766D", position="dodge") +
  geom_text(aes(label=Total.Revenue), position=position_dodge(width=0.1), size=2.3, vjust=-.25,) +
  scale_x_continuous(breaks=c(2008:2019)) +
  labs(title="Total Revenue vs Year of MCU movies") +
  ylab(label="Total Revenue")


ggplot(merged, aes(x=Film, y=Revenue)) + 
  geom_bar(stat="identity") + facet_wrap(~Year) + 
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  labs(title="Revenue of MCU movies from 2008-2019")

Relationship of Budget and Gross Revenue

The linear regression model suggests that they have a positive relationship and the moving average and loess curve of budget is rising but starts dropping when revenue reaches ~$1.75 million. The summary of the simple linear regression with Budget as the predictor and Revenue as the response shows that Budget is a a significant factor that influences the revenue positively. However since the sample size is small, it is not clear. The residual plot is not very unsual, however there are a few outliers (4 and 3). The normal QQ plot shows that the residuals are approximately normally dsitributed.

ggplot(merged, aes(x=Budget, y=Revenue)) +
  geom_point() +
  geom_smooth(aes(colour="Loess"), method="loess",size=.6, se=F) +
  geom_smooth(aes(colour="Linear Regression"), method="lm", se=F, size=.6) +
  xlab(label="Budget (Dollars)") +
  ylab(label="Revenue (Dollars)") +
  labs(title="REvenue vs Budget for MCU movies") 

revfit <- lm(Revenue ~ Budget,data=merged)
summary(revfit)

Call:
lm(formula = Revenue ~ Budget, data = merged)

Residuals:
       Min         1Q     Median         3Q        Max 
-588368455 -173496574  -38885727  137761684  816134505 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.263e+08  2.108e+08  -1.548    0.137    
Budget       5.348e+00  9.685e-01   5.522  2.1e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 336500000 on 20 degrees of freedom
Multiple R-squared:  0.6039,    Adjusted R-squared:  0.5841 
F-statistic: 30.49 on 1 and 20 DF,  p-value: 2.097e-05
plot(revfit,1:2)

Do Rotten Tomatoes and Metacritic agree?

They seem to have a positive relatiionship, however it is not proportional. Metacritic appears to give lower ratings than Rotten Tomatoes.

ggplot(merged, aes(x=Metacritic, Rotten.Tomatoes)) +
  geom_point( size=.5) +
  geom_text_repel(label=merged$Film, size=3) +
  xlim(50,100) +
  ylim(50,100) +
  geom_line(colour="darkred", alpha=.6, method="loess", stat="smooth", size=1.3) +
  labs(title="Metacritic ratings vs Rotten Tomatoes ratings of MCU Movies") +
  ylab(label="Rotten Tomatoes scores") +
  xlab(label="Metacritic scores")


ggplot(merged) +
  geom_bar(aes(x=reorder(Film, Rotten.Tomatoes), y=Rotten.Tomatoes, fill="Rotten Tomatoes"), stat="identity", alpha=1) +
  geom_bar(aes(x=reorder(Film, Metacritic), y=Metacritic, fill="Metacritic"), stat="identity",alpha=1) +
  coord_flip() +
  scale_color_manual(values=c(a="red",b="blue")) +
  scale_fill_discrete("") +
  labs(title="Ratings of MCU movies") +
  ylab("Rating") +
  xlab("Movies")

How have the ratings evolved over time?

In general, Rotten Tomatoes reviewers appears to give more generous ratings than Metacritic reviewers. The graphs suggests that the ratings are increasing the but have fallen a bit since 2018.

plot1 <- ggplot(merged, aes(x=Year, y=Metacritic)) + 
  geom_point() +
  geom_smooth(method="loess") +
  scale_x_continuous(breaks=c(2008:2019)) +
  theme(axis.text.x=element_text(angle = 60, hjust=1)) +
  ylim(50,100) +
  labs(title="Metacritic rating scores \n over time of MCU movies") +
  ylab(label="Metacritic Scores")

plot2 <- ggplot(merged, aes(x=Year, y=Rotten.Tomatoes)) + 
  geom_point() +
  geom_smooth(method="loess") +
  scale_x_continuous(breaks=c(2008:2019)) +
  ylim(50,100) +
  theme(axis.text.x=element_text(angle = 60, hjust=1)) +
  labs(title="Rotten Tomatoes rating scores \n over time of MCU movies") +
  ylab(label="Rotten Tomatoes scores")

grid.arrange(plot2, plot1, ncol=2)

Conlusion

  1. Budget appears to be somewhat constant while Gross is on an upward trend
  2. Rotten tomatoes tends to give higher ratings than Metacritic
  3. The highest by Rotten Tomatoes and Metacritic is Black Panther
  4. The movie that brought in the most revenue was Avengers:Endgame in 2019
  5. Budget is a significant predictor of Revenue
---
title: "Exploratory Data Analysis of Marvel Cinematic Universe films from 2008 - 2019"
author: "Alexander Lo"
date: "26/03/2019"
output:
  html_notebook: default
  html_document: default
---

```{r,cache=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
knitr::opts_chunk$set(message=FALSE,warning=FALSE)
```

In this project, I scraped, cleaned, and transformed data of Marvel Cinematic Universe films from https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films. Afterwards, I analyzed the data and visualized my findings with ggplot2. ***This information is valid until April 2019.

Loading packages
```{r message=FALSE}
library(rvest)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggrepel)
library(gridExtra)
library(zoo)
library(reshape2)
```


Scraping two tables from Wikipedia: Critical Response table and Box Office Performance table. Afterwards, I cleaned and merged them into a single data frame.
```{r}
movie_table <- read_html("https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films")
boxoff <- html_table(html_nodes(movie_table, "table")[[10]],fill=T)[c(-1,-24),c(-3,-4,-6,-7,-9)] 
critical <- html_table(html_nodes(movie_table, "table")[[11]],fill=T)[-23,]
merged <- merge(boxoff,critical,by.x="Film",by.y="Film")
colnames(merged)[3] <- "Box.Office.Gross.Worldwide"
colnames(merged)[2] <- "Year"
colnames(merged) <- make.names(colnames(merged))
merged
```

##### The data in the table is not ready for analysis because there are symbols (commas, dollar signs, spaces) and useless words. Now I will start by cleaning the data by keeping the columns: Box office gross worldwide, release year, Budget, Rotten Tomatoes scores, and Metacritic scores. Also, I will create a 'Revenue' column in the table.


Removing the days and months while keeping the year and converting to numeric values
```{r}
a=regexpr("[[:digit:]]{4}", merged$Year)
merged$Year <- regmatches(merged$Year,a) %>% as.numeric() 
```

Removing the "$" and "," from the gross column and converting to numeric values
```{r}
merged$Box.Office.Gross.Worldwide <- gsub("(\\$|,)", "", merged$Box.Office.Gross.Worldwide) %>% as.numeric()
```

Removed symbols and converted the net budget to numeric values by multiplying by 1000000
```{r}
c=gsub("(\\$|\\s{1}million)", "", merged$Budget) 
merged$Budget <- gsub("[[:digit:]]{3}(\\–){1}","",c) %>% as.numeric()*1000000
```

Removed symbols and only kept the scores as numeric values
```{r}
merged$Rotten.Tomatoes <- substr(merged$Rotten.Tomatoes,1,2) %>% as.numeric()
```

Removed symbols and only kept the scores as numeric values
```{r}
merged$`Metacritic` <- substr(merged$`Metacritic`,1,2) %>% as.numeric()
```

Adding new 'Revenue' feature to the merged data frame
```{r}
merged <- mutate(merged, Revenue=Box.Office.Gross.Worldwide-Budget)
```

Cleaned and merged table
```{r}
write.csv(merged, "CleanedMarvel.csv")
merged
```

### Moving averages of Box Office Gross Worldwide and Budget over time
The budget seems to be fairly constant throughout the years however gross has fallen slightly in 2015 - 2017 and begins increasing again in 2017 more at a higher rate than previous years
```{r}
ggplot(data=merged, aes(x=Year)) + 
  geom_point(aes(y=Box.Office.Gross.Worldwide)) +
  geom_smooth(method="loess", aes(y=Box.Office.Gross.Worldwide, colour="Gross_Loess")) +
    geom_smooth(method="lm", se=F,aes(y=Box.Office.Gross.Worldwide, colour="Gross_Linear")) +
  geom_point(data=merged,aes(y=Budget, colour="Budget_MovingAvg")) +
  geom_line(data=merged,aes(y=rollapplyr(merged$Budget,width=10,FUN=mean,partial=T),colour="Budget_MovingAvg")) +
  scale_x_continuous(breaks=c(2008:2019)) +
  labs(title="Box Office Gross worldwide and Budget income over time for MCU movies") +
  ylab(label="Dollars") 
  #scale_colour_manual(name="", values=c(Gross_MovingAvg="#00BA38", Budget_MovingAvg="#F8766D", Gross_Loess="#619CFF")) 
```

### What is the distribution of revenue for Marvel movies? 
The red dashed line represents the mean revenue and the black surve represents the density curve. Before taking the log of the revenue, the distribution appears to be positively skewed to the right. After the transformation, the distribution appears to be closer to a normal distribution.
```{r}
ggplot(data=merged, aes(x=Revenue)) +
  geom_density(color="black") +
  geom_vline(xintercept = mean(merged$Revenue), color = "red", linetype = "dashed") +
  labs(title="Distribution of Revenue for MCU movies") +
  xlab(label="Revenue in Dollars") +
  ylab(label="Density")

ggplot(data=merged, aes(x=log(Revenue))) +
  geom_density(color="black") +
  geom_vline(xintercept = mean(log(merged$Revenue)), color = "red", linetype = "dashed") +
  labs(title="Distribution of log(Revenue) for MCU movies") +
  xlab(label="log(Revenue) in Dollars") +
  ylab(label="Density")
```

### Which years did MCU movies bring in the most revenue? 
The graphs show that 2018 and 2019 MCU movies brought in the highest Revenue income. However there were 2 MCU released in 2019 and 3 in 2018. A film released in 2019 generated significantly more revenue than movies from previous years. (Avengers: Endgame)
```{r}
merged %>%
  group_by(Year) %>%
  summarize(Total.Revenue=sum(Revenue)) %>%
  ggplot(aes(x=Year, y=Total.Revenue)) + 
  geom_bar(stat="identity", fill="#F8766D", position="dodge") +
  geom_text(aes(label=Total.Revenue), position=position_dodge(width=0.1), size=2.3, vjust=-.25,) +
  scale_x_continuous(breaks=c(2008:2019)) +
  labs(title="Total Revenue vs Year of MCU movies") +
  ylab(label="Total Revenue")

ggplot(merged, aes(x=Film, y=Revenue)) + 
  geom_bar(stat="identity") + facet_wrap(~Year) + 
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  labs(title="Revenue of MCU movies from 2008-2019")
```

### Relationship of Budget and Gross Revenue
The linear regression model suggests that they have a positive relationship and the moving average and loess curve of budget is rising but starts dropping when revenue reaches ~$1.75 million. The summary of the simple linear regression with Budget as the predictor and Revenue as the response shows that Budget is a a significant factor that influences the revenue positively. However since the sample size is small, it is not clear.  The residual plot is not very unsual, however there are a few outliers (4 and 3). The normal QQ plot shows that the residuals are approximately normally dsitributed.
```{r}
ggplot(merged, aes(x=Budget, y=Revenue)) +
  geom_point() +
  geom_smooth(aes(colour="Loess"), method="loess",size=.6, se=F) +
  geom_smooth(aes(colour="Linear Regression"), method="lm", se=F, size=.6) +
  xlab(label="Budget (Dollars)") +
  ylab(label="Revenue (Dollars)") +
  labs(title="REvenue vs Budget for MCU movies") 
revfit <- lm(Revenue ~ Budget,data=merged)
summary(revfit)
plot(revfit,1:2)
```

### Do Rotten Tomatoes and Metacritic agree? 
They seem to have a positive relatiionship, however it is not proportional. Metacritic appears to give lower ratings than Rotten Tomatoes.
```{r}
ggplot(merged, aes(x=Metacritic, Rotten.Tomatoes)) +
  geom_point( size=.5) +
  geom_text_repel(label=merged$Film, size=3) +
  xlim(50,100) +
  ylim(50,100) +
  geom_line(colour="darkred", alpha=.6, method="loess", stat="smooth", size=1.3) +
  labs(title="Metacritic ratings vs Rotten Tomatoes ratings of MCU Movies") +
  ylab(label="Rotten Tomatoes scores") +
  xlab(label="Metacritic scores")

ggplot(merged) +
  geom_bar(aes(x=reorder(Film, Rotten.Tomatoes), y=Rotten.Tomatoes, fill="Rotten Tomatoes"), stat="identity", alpha=1) +
  geom_bar(aes(x=reorder(Film, Metacritic), y=Metacritic, fill="Metacritic"), stat="identity",alpha=1) +
  coord_flip() +
  scale_color_manual(values=c(a="red",b="blue")) +
  scale_fill_discrete("") +
  labs(title="Ratings of MCU movies") +
  ylab("Rating") +
  xlab("Movies")

```

### How have the ratings evolved over time?
In general, Rotten Tomatoes reviewers appears to give more generous ratings than Metacritic reviewers. The graphs suggests that the ratings are increasing the but have fallen a bit since 2018.
```{r}
plot1 <- ggplot(merged, aes(x=Year, y=Metacritic)) + 
  geom_point() +
  geom_smooth(method="loess") +
  scale_x_continuous(breaks=c(2008:2019)) +
  theme(axis.text.x=element_text(angle = 60, hjust=1)) +
  ylim(50,100) +
  labs(title="Metacritic rating scores \n over time of MCU movies") +
  ylab(label="Metacritic Scores")

plot2 <- ggplot(merged, aes(x=Year, y=Rotten.Tomatoes)) + 
  geom_point() +
  geom_smooth(method="loess") +
  scale_x_continuous(breaks=c(2008:2019)) +
  ylim(50,100) +
  theme(axis.text.x=element_text(angle = 60, hjust=1)) +
  labs(title="Rotten Tomatoes rating scores \n over time of MCU movies") +
  ylab(label="Rotten Tomatoes scores")

grid.arrange(plot2, plot1, ncol=2)
```

### Conlusion
1. Budget appears to be somewhat constant while Gross is on an upward trend
2. Rotten tomatoes tends to give higher ratings than Metacritic
3. The highest by Rotten Tomatoes and Metacritic is Black Panther
4. The movie that brought in the most revenue was Avengers:Endgame in 2019
5. Budget is a significant predictor of Revenue

