Here we will import our data and do some preliminary analysis on it in order to get familiar with it.

library(readxl)

## Warning: package 'readxl' was built under R version 3.4.4

Final_Dataset<-read_xlsx("C:/Users/User/Downloads/Final Dataset.xlsx")
str(Final_Dataset)

## Classes 'tbl_df', 'tbl' and 'data.frame':    87 obs. of  15 variables:
##  $ Rank                      : num  670 641 298 886 100 300 659 568 568 137 ...
##  $ Year                      : num  2006 2006 2006 2006 2006 ...
##  $ Ceremony                  : num  79 79 79 79 79 80 80 80 80 80 ...
##  $ Award                     : chr  "Actor in a Leading Role" "Actor in a Leading Role" "Actress in a Leading Role" "Director" ...
##  $ Winner                    : num  0 0 0 0 1 1 0 0 0 1 ...
##  $ Name                      : chr  "Leonardo DiCaprio" "Will Smith" "Meryl Streep" "Alejandro GonzÃ¡lez IÃ±Ã¡rritu" ...
##  $ Film                      : chr  "Blood Diamond" "The Pursuit of Happyness" "The Devil Wears Prada" "Babel" ...
##  $ Genre                     : chr  "Adventure,Drama,Thriller" "Biography,Drama" "Comedy,Drama" "Drama" ...
##  $ Director                  : chr  "Edward Zwick" "Gabriele Muccino" "David Frankel" "Alejandro González Iñárritu" ...
##  $ Year (IMDB-Movie-Data.csv): num  2006 2006 2006 2006 2006 ...
##  $ Runtime (Minutes)         : num  143 117 109 143 151 158 100 96 96 122 ...
##  $ Rating                    : num  8 8 6.8 7.5 8.5 8.1 7.7 7.5 7.5 8.1 ...
##  $ Votes                     : num  422014 361105 302268 253417 937414 ...
##  $ Revenue (Millions)        : chr  "57.37" "162.59" "124.73" "34.299999999999997" ...
##  $ Metascore                 : chr  "64" "64" "62" "69" ...

View(Final_Dataset)

\[DATA\\PREPROCESSING\]

Let’s see the summary of all records in our dataset

summary(Final_Dataset)

##       Rank            Year         Ceremony        Award          
##  Min.   : 68.0   Min.   :2006   Min.   :79.00   Length:87         
##  1st Qu.:250.0   1st Qu.:2010   1st Qu.:82.50   Class :character  
##  Median :350.0   Median :2012   Median :85.00   Mode  :character  
##  Mean   :436.3   Mean   :2011   Mean   :84.33                     
##  3rd Qu.:671.5   3rd Qu.:2014   3rd Qu.:87.00                     
##  Max.   :952.0   Max.   :2015   Max.   :88.00                     
##      Winner           Name               Film              Genre          
##  Min.   :0.0000   Length:87          Length:87          Length:87         
##  1st Qu.:0.0000   Class :character   Class :character   Class :character  
##  Median :0.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.2299                                                           
##  3rd Qu.:0.0000                                                           
##  Max.   :1.0000                                                           
##    Director         Year (IMDB-Movie-Data.csv) Runtime (Minutes)
##  Length:87          Min.   :2006               Min.   : 91.0    
##  Class :character   1st Qu.:2010               1st Qu.:115.0    
##  Mode  :character   Median :2012               Median :122.0    
##                     Mean   :2011               Mean   :127.7    
##                     3rd Qu.:2014               3rd Qu.:142.0    
##                     Max.   :2015               Max.   :180.0    
##      Rating          Votes        Revenue (Millions)  Metascore        
##  Min.   :6.600   Min.   : 77995   Length:87          Length:87         
##  1st Qu.:7.400   1st Qu.:241791   Class :character   Class :character  
##  Median :7.800   Median :352801   Mode  :character   Mode  :character  
##  Mean   :7.698   Mean   :389004                                        
##  3rd Qu.:8.000   3rd Qu.:520491                                        
##  Max.   :8.500   Max.   :959065

Since we do not have any NA in our dataset, let do some preprocessing by changing changing the name of certains columns and converting columns that R misread; But, before that we will separate the “Gender” column elements by only keeping the first element and remove the 2 others

#changing the columns' name
Final_Dataset<-edit(Final_Dataset)

## Warning in edit.data.frame(Final_Dataset): NAs introduced by coercion

#The below package allow us to separate the elements of "Genre" column and keep the first element
library(tidyr)

## Warning: package 'tidyr' was built under R version 3.4.4

Final_Dataset<-separate(Final_Dataset, Genre, into = "Genre", sep = ",", error=Silence)

## Warning: Expected 1 pieces. Additional pieces discarded in 77 rows [1, 2,
## 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, ...].

Final_Dataset$Winner<- as.factor(Final_Dataset$Winner)
Final_Dataset$Genre<-as.factor(Final_Dataset$Genre)
Final_Dataset$Award<-as.factor(Final_Dataset$Award)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.4

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

glimpse(Final_Dataset, n=10)

## Observations: 87
## Variables: 15
## $ Rank                         <dbl> 670, 641, 298, 886, 100, 300, 659...
## $ Year                         <dbl> 2006, 2006, 2006, 2006, 2006, 200...
## $ Ceremony                     <dbl> 79, 79, 79, 79, 79, 80, 80, 80, 8...
## $ Award                        <fct> Actor in a Leading Role, Actor in...
## $ Winner                       <fct> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, ...
## $ Name                         <chr> "Leonardo DiCaprio", "Will Smith"...
## $ Film                         <chr> "Blood Diamond", "The Pursuit of ...
## $ Genre                        <fct> Adventure, Biography, Comedy, Dra...
## $ Director                     <chr> "Edward Zwick", "Gabriele Muccino...
## $ `Year (IMDB-Movie-Data.csv)` <dbl> 2006, 2006, 2006, 2006, 2006, 200...
## $ Runtime                      <dbl> 143, 117, 109, 143, 151, 158, 100...
## $ Rating                       <dbl> 8.0, 8.0, 6.8, 7.5, 8.5, 8.1, 7.7...
## $ Votes                        <dbl> 422014, 361105, 302268, 253417, 9...
## $ Revenue                      <dbl> 57.37, 162.59, 124.73, 34.30, 132...
## $ Metascore                    <chr> "64", "64", "62", "69", "85", "92...

Let’s see the distribution of our data

Our Revenue is right skewed and kind of Bimodal (2 peaks), which mean that our MEAN values is greater than the MEDIAN and we might have 2 differents groups of revenue(Not 100% sure)

library(lattice)
densityplot(as.numeric(Final_Dataset$Revenue))

In overall, most of our data in Votes accross the Genre are right skewed excep the “COMEDY” genre. In addition, Biography, Adventure and Drama seems to be bi or mul-modal distributed

densityplot(~Votes|Genre, data=Final_Dataset, color='black', group=Genre)

The runtime is a little bit right skewed & unimodal of course, and most visually, we can say that most the runtime reside between 100 and 150 minutes

histogram(Final_Dataset$Runtime, plot.points=FALSE)

Here we will summarize our data by finding out the average revenue per each year as well as the total revenue per each year from 2006 up to 2015

library(dplyr)

YearRevenue<-Final_Dataset %>% group_by(Year) %>% summarise(TotalRevPerYear = sum(as.numeric(Revenue), na.rm = TRUE), 
                                      Avg_Rev_Year = mean(as.numeric(Revenue), na.rm=TRUE))

## Warning: package 'bindrcpp' was built under R version 3.4.4

YearRevenue %>% arrange(desc(Avg_Rev_Year))

## # A tibble: 10 x 3
##     Year TotalRevPerYear Avg_Rev_Year
##    <dbl>           <dbl>        <dbl>
##  1  2009           1336         191. 
##  2  2013           1406.        128. 
##  3  2012           1092.        121. 
##  4  2008            432.        108. 
##  5  2010           1149         104. 
##  6  2006            511.        102. 
##  7  2014            979.         89.0
##  8  2015           1019.         78.4
##  9  2007            459.         76.5
## 10  2011            681.         75.7

\[Plotting\] #A picture worth thousand words– Time to visualize our above finding

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.4

ggplot(YearRevenue, aes(TotalRevPerYear, Avg_Rev_Year, size=.9)) + geom_jitter()+ facet_wrap(~Year)

# plot of Year against the Total Revenue per year 
plot(YearRevenue$Year, YearRevenue$TotalRevPerYear)

library(lattice)
#Plots of Average revenue per year against Year
xyplot(YearRevenue$Avg_Rev_Year~YearRevenue$Year)

This below code give us the distribution of our movies accordingly. We see that 2010 and 2015 are the years that produced more movies; 2008 and 2006 are the years that produced less movies according to our graph</>

table(Final_Dataset$Year)

## 
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 
##    5    6    4    7   12    9    9   11   11   13

barplot(table(Final_Dataset$Year), xlab="Movie per year", col="black")

This following summary gives us the the total revenue and the average revenue of each “Genre” of movie produced

Genre_Revenue<-Final_Dataset %>% group_by(Genre) %>% summarise(Total_Revenue = sum(as.numeric(Revenue),na.rm=TRUE),
                                       Avg_Revenue= mean(as.numeric(Revenue), na.rm = TRUE))
Genre_Revenue

## # A tibble: 6 x 3
##   Genre     Total_Revenue Avg_Revenue
##   <fct>             <dbl>       <dbl>
## 1 Action            1358.       339. 
## 2 Adventure         1430.       119. 
## 3 Biography         2095.        95.2
## 4 Comedy            1115.       101. 
## 5 Crime              989.       110. 
## 6 Drama             2078.        74.2

As we already know, a PICTURE worth THoUSAND number Mdr

library(ggplot2)
ggplot(Genre_Revenue, aes(Total_Revenue, Avg_Revenue)) + geom_point(aes(color=Genre, size=0.9))

Time for some digging, as we see in the previous graph, one of our average revenue for GENRE is far away from the rest, let’s see what GENRE is it.

#For the outlier in the Y axis
Genre_Revenue[which(Genre_Revenue$Avg_Revenue>250),]

## # A tibble: 1 x 3
##   Genre  Total_Revenue Avg_Revenue
##   <fct>          <dbl>       <dbl>
## 1 Action         1358.        339.

As we can, action genre revenue is considerably higher than the other genre, it has almost twice the revenue of the other genres

tapply(as.numeric(Final_Dataset$Revenue),Final_Dataset$Genre, mean, na.rm=TRUE) %>% barplot()

# what is the genre that is more produced in our dataset.

We see that Drama movies are a lot in this dataset, while Action is not. Even though Action movies are the lowest Genre produced, they generated the highest revenue

ggplot(Final_Dataset, aes(Genre)) + geom_bar()

Revenue distribution per years and per Genre

ggplot(Final_Dataset, aes(x = Year, y = as.numeric(Revenue),fill=Genre)) +
  geom_bar(stat = "identity", na.rm=TRUE)

## Warning: Removed 1 rows containing missing values (position_stack).

#For the outlier in the X axis
Genre_Revenue[which(Genre_Revenue$Total_Revenue>1800),]

## # A tibble: 2 x 3
##   Genre     Total_Revenue Avg_Revenue
##   <fct>             <dbl>       <dbl>
## 1 Biography         2095.        95.2
## 2 Drama             2078.        74.2

Let’s see what year has generated highest average revenue

xyplot(YearRevenue$Avg_Rev_Year~YearRevenue$Year)

The above plot shows us that the average revenue in 2009 was considerably higher than the rest of the years

Now let see if Revenue generated depend on the vote received

We see that there is a slight relationship between the vote and revenue. Our revenue increased slightly based on how much votes we get. </>

ggplot(Final_Dataset, aes(Votes, as.numeric(Revenue), color=Genre)) + geom_point(na.rm=TRUE)

However, we see that we have some Genres that are far away from the rest, let’s see what movies are in those Genre, andget some other information about those movies.

Final_Dataset1<-Final_Dataset[,c("Year","Award","Winner","Name","Film","Genre","Director",
                                 "Runtime", "Rating","Votes","Revenue")]

Final_Dataset1[which(as.numeric(Final_Dataset1$Votes)>750000),]

##    Year                   Award Winner              Name
## 5  2006                Director      1   Martin Scorsese
## 19 2009                Director      0     James Cameron
## 21 2009                Director      0 Quentin Tarantino
## 54 2013 Actor in a Leading Role      0 Leonardo DiCaprio
## 63 2013                Director      0   Martin Scorsese
##                       Film     Genre          Director Runtime Rating
## 5             The Departed     Crime   Martin Scorsese     151    8.5
## 19                  Avatar    Action     James Cameron     162    7.8
## 21    Inglourious Basterds Adventure Quentin Tarantino     153    8.3
## 54 The Wolf of Wall Street Biography   Martin Scorsese     180    8.2
## 63 The Wolf of Wall Street Biography   Martin Scorsese     180    8.2
##     Votes Revenue
## 5  937414  132.37
## 19 935408  760.51
## 21 959065  120.52
## 54 865134  116.87
## 63 865134  116.87

The below plot seem to show that the more Rating a movie get, the more revenue it generate. We also see that Drama and Biography seem to be more represented compared to the other GENRE

ggplot(Final_Dataset, aes(Rating,as.numeric(Revenue), color=Genre)) + geom_point(na.rm = TRUE)

p>Let’s see the information fo the one movie has generated more revenue according to the above graph

Final_Dataset1[which(as.numeric(Final_Dataset1$Revenue)>400),] %>% head()

##    Year    Award Winner          Name   Film  Genre      Director Runtime
## 19 2009 Director      0 James Cameron Avatar Action James Cameron     162
##    Rating  Votes Revenue
## 19    7.8 935408  760.51

What is the Runtime of our movies ?

As we can see, most of the movies produced have runtime between 110 min and 140. It seems like our plot is clustered in 3 sections:

Movies with Runtime between 50 to 100 minutes\
Moview with Runtime between 108 to 150 minutes\
Movies with Runtime from 160 to 180 minutes\

ggplot(Final_Dataset,aes(Runtime,as.numeric(Revenue), color=Genre)) + geom_point(na.rm=TRUE)

#Now, let’s see How many movies have been nominated for the Best Director category?\ We have 36 movies that were nominated as the best awarded movie.

Final_Dataset2<-Final_Dataset1[(Final_Dataset1$Award=="Director"),]
head(Final_Dataset2,6)

##    Year    Award Winner                           Name
## 4  2006 Director      0 Alejandro GonzÃ¡lez IÃ±Ã¡rritu
## 5  2006 Director      1                Martin Scorsese
## 9  2007 Director      0                  Jason Reitman
## 10 2007 Director      1       Joel Coen and Ethan Coen
## 11 2007 Director      0           Paul Thomas Anderson
## 14 2008 Director      0                  David Fincher
##                                   Film  Genre                    Director
## 4                                Babel  Drama Alejandro González Iñárritu
## 5                         The Departed  Crime             Martin Scorsese
## 9                                 Juno Comedy               Jason Reitman
## 10              No Country for Old Men  Crime                  Ethan Coen
## 11                 There Will Be Blood  Drama        Paul Thomas Anderson
## 14 The Curious Case of Benjamin Button  Drama               David Fincher
##    Runtime Rating  Votes Revenue
## 4      143    7.5 253417   34.30
## 5      151    8.5 937414  132.37
## 9       96    7.5 432461  143.49
## 10     122    8.1 660286   74.27
## 11     158    8.1 400682   40.22
## 14     166    7.8 485075  127.49

As you can see in this graph, only 25% of our nominated movies won the “BEST DIRECTOR”, while the 75% did not.

ggplot(Final_Dataset2, aes(Winner,fill=Winner), stat="count") + geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = ..count.., y= ..prop..), stat= "count", vjust = -.5)+
    labs(y = "Percent", fill="Winner")

What is the average rating of a movie that has been nominated for the Best Director category?\

Final_Dataset2 %>% summarize(Average_rating= mean(Rating)) %>% glimpse()

## Observations: 1
## Variables: 1
## $ Average_rating <dbl> 7.822222

Of the movies that were nominated for Best Director, were they also nominated for other Oscar categories?\

As we can see we have 9 movies in the “Best Director” nominees that were nominated in other oscar categories

#Here we are creating a dataset for only the movies that were nominated 
Nominees<-Final_Dataset2$Film

# Now we will see which movies is repeated IN the both "Nominees" and "Final_Dataset" and stored it in "both"
both<- Final_Dataset1[which(Nominees %in% Final_Dataset1$Film),] 

# Now we will be able to see how many movies from Best Director are also nominated in other oscar categories
Multiple_Nomination<-both[which(duplicated(both$Film)),]
Multiple_Nomination %>% select(Film, Genre,Director)

##                                   Film     Genre             Director
## 9                                 Juno    Comedy        Jason Reitman
## 11                 There Will Be Blood     Drama Paul Thomas Anderson
## 14 The Curious Case of Benjamin Button     Drama        David Fincher
## 20                     The Hurt Locker     Drama      Kathryn Bigelow
## 22                       Up in the Air     Drama        Jason Reitman
## 30                          Black Swan     Drama     Darren Aronofsky
## 32                   The King's Speech Biography           Tom Hooper
## 33                  The Social Network Biography        David Fincher
## 34                           True Grit Adventure           Ethan Coen

Exploratory Analysis

The Miners

November 24, 2019