Here we will import our data and do some preliminary analysis on it in order to get familiar with it.
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.4
Final_Dataset<-read_xlsx("C:/Users/User/Downloads/Final Dataset.xlsx")
str(Final_Dataset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 87 obs. of 15 variables:
## $ Rank : num 670 641 298 886 100 300 659 568 568 137 ...
## $ Year : num 2006 2006 2006 2006 2006 ...
## $ Ceremony : num 79 79 79 79 79 80 80 80 80 80 ...
## $ Award : chr "Actor in a Leading Role" "Actor in a Leading Role" "Actress in a Leading Role" "Director" ...
## $ Winner : num 0 0 0 0 1 1 0 0 0 1 ...
## $ Name : chr "Leonardo DiCaprio" "Will Smith" "Meryl Streep" "Alejandro González Iñárritu" ...
## $ Film : chr "Blood Diamond" "The Pursuit of Happyness" "The Devil Wears Prada" "Babel" ...
## $ Genre : chr "Adventure,Drama,Thriller" "Biography,Drama" "Comedy,Drama" "Drama" ...
## $ Director : chr "Edward Zwick" "Gabriele Muccino" "David Frankel" "Alejandro González Iñárritu" ...
## $ Year (IMDB-Movie-Data.csv): num 2006 2006 2006 2006 2006 ...
## $ Runtime (Minutes) : num 143 117 109 143 151 158 100 96 96 122 ...
## $ Rating : num 8 8 6.8 7.5 8.5 8.1 7.7 7.5 7.5 8.1 ...
## $ Votes : num 422014 361105 302268 253417 937414 ...
## $ Revenue (Millions) : chr "57.37" "162.59" "124.73" "34.299999999999997" ...
## $ Metascore : chr "64" "64" "62" "69" ...
View(Final_Dataset)
\[DATA\\PREPROCESSING\]
Let’s see the summary of all records in our dataset
summary(Final_Dataset)
## Rank Year Ceremony Award
## Min. : 68.0 Min. :2006 Min. :79.00 Length:87
## 1st Qu.:250.0 1st Qu.:2010 1st Qu.:82.50 Class :character
## Median :350.0 Median :2012 Median :85.00 Mode :character
## Mean :436.3 Mean :2011 Mean :84.33
## 3rd Qu.:671.5 3rd Qu.:2014 3rd Qu.:87.00
## Max. :952.0 Max. :2015 Max. :88.00
## Winner Name Film Genre
## Min. :0.0000 Length:87 Length:87 Length:87
## 1st Qu.:0.0000 Class :character Class :character Class :character
## Median :0.0000 Mode :character Mode :character Mode :character
## Mean :0.2299
## 3rd Qu.:0.0000
## Max. :1.0000
## Director Year (IMDB-Movie-Data.csv) Runtime (Minutes)
## Length:87 Min. :2006 Min. : 91.0
## Class :character 1st Qu.:2010 1st Qu.:115.0
## Mode :character Median :2012 Median :122.0
## Mean :2011 Mean :127.7
## 3rd Qu.:2014 3rd Qu.:142.0
## Max. :2015 Max. :180.0
## Rating Votes Revenue (Millions) Metascore
## Min. :6.600 Min. : 77995 Length:87 Length:87
## 1st Qu.:7.400 1st Qu.:241791 Class :character Class :character
## Median :7.800 Median :352801 Mode :character Mode :character
## Mean :7.698 Mean :389004
## 3rd Qu.:8.000 3rd Qu.:520491
## Max. :8.500 Max. :959065
Since we do not have any NA in our dataset, let do some preprocessing by changing changing the name of certains columns and converting columns that R misread; But, before that we will separate the “Gender” column elements by only keeping the first element and remove the 2 others
#changing the columns' name
Final_Dataset<-edit(Final_Dataset)
## Warning in edit.data.frame(Final_Dataset): NAs introduced by coercion
#The below package allow us to separate the elements of "Genre" column and keep the first element
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
Final_Dataset<-separate(Final_Dataset, Genre, into = "Genre", sep = ",", error=Silence)
## Warning: Expected 1 pieces. Additional pieces discarded in 77 rows [1, 2,
## 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, ...].
Final_Dataset$Winner<- as.factor(Final_Dataset$Winner)
Final_Dataset$Genre<-as.factor(Final_Dataset$Genre)
Final_Dataset$Award<-as.factor(Final_Dataset$Award)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(Final_Dataset, n=10)
## Observations: 87
## Variables: 15
## $ Rank <dbl> 670, 641, 298, 886, 100, 300, 659...
## $ Year <dbl> 2006, 2006, 2006, 2006, 2006, 200...
## $ Ceremony <dbl> 79, 79, 79, 79, 79, 80, 80, 80, 8...
## $ Award <fct> Actor in a Leading Role, Actor in...
## $ Winner <fct> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, ...
## $ Name <chr> "Leonardo DiCaprio", "Will Smith"...
## $ Film <chr> "Blood Diamond", "The Pursuit of ...
## $ Genre <fct> Adventure, Biography, Comedy, Dra...
## $ Director <chr> "Edward Zwick", "Gabriele Muccino...
## $ `Year (IMDB-Movie-Data.csv)` <dbl> 2006, 2006, 2006, 2006, 2006, 200...
## $ Runtime <dbl> 143, 117, 109, 143, 151, 158, 100...
## $ Rating <dbl> 8.0, 8.0, 6.8, 7.5, 8.5, 8.1, 7.7...
## $ Votes <dbl> 422014, 361105, 302268, 253417, 9...
## $ Revenue <dbl> 57.37, 162.59, 124.73, 34.30, 132...
## $ Metascore <chr> "64", "64", "62", "69", "85", "92...
Let’s see the distribution of our data
Our Revenue is right skewed and kind of Bimodal (2 peaks), which mean that our MEAN values is greater than the MEDIAN and we might have 2 differents groups of revenue(Not 100% sure)
library(lattice)
densityplot(as.numeric(Final_Dataset$Revenue))
In overall, most of our data in Votes accross the Genre are right skewed excep the “COMEDY” genre. In addition, Biography, Adventure and Drama seems to be bi or mul-modal distributed
densityplot(~Votes|Genre, data=Final_Dataset, color='black', group=Genre)
The runtime is a little bit right skewed & unimodal of course, and most visually, we can say that most the runtime reside between 100 and 150 minutes
histogram(Final_Dataset$Runtime, plot.points=FALSE)
Here we will summarize our data by finding out the average revenue per each year as well as the total revenue per each year from 2006 up to 2015
library(dplyr)
YearRevenue<-Final_Dataset %>% group_by(Year) %>% summarise(TotalRevPerYear = sum(as.numeric(Revenue), na.rm = TRUE),
Avg_Rev_Year = mean(as.numeric(Revenue), na.rm=TRUE))
## Warning: package 'bindrcpp' was built under R version 3.4.4
YearRevenue %>% arrange(desc(Avg_Rev_Year))
## # A tibble: 10 x 3
## Year TotalRevPerYear Avg_Rev_Year
## <dbl> <dbl> <dbl>
## 1 2009 1336 191.
## 2 2013 1406. 128.
## 3 2012 1092. 121.
## 4 2008 432. 108.
## 5 2010 1149 104.
## 6 2006 511. 102.
## 7 2014 979. 89.0
## 8 2015 1019. 78.4
## 9 2007 459. 76.5
## 10 2011 681. 75.7
\[Plotting\] #A picture worth thousand words– Time to visualize our above finding
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
ggplot(YearRevenue, aes(TotalRevPerYear, Avg_Rev_Year, size=.9)) + geom_jitter()+ facet_wrap(~Year)
# plot of Year against the Total Revenue per year
plot(YearRevenue$Year, YearRevenue$TotalRevPerYear)
library(lattice)
#Plots of Average revenue per year against Year
xyplot(YearRevenue$Avg_Rev_Year~YearRevenue$Year)
This below code give us the distribution of our movies accordingly. We see that 2010 and 2015 are the years that produced more movies; 2008 and 2006 are the years that produced less movies according to our graph</>
table(Final_Dataset$Year)
##
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 5 6 4 7 12 9 9 11 11 13
barplot(table(Final_Dataset$Year), xlab="Movie per year", col="black")
This following summary gives us the the total revenue and the average revenue of each “Genre” of movie produced
Genre_Revenue<-Final_Dataset %>% group_by(Genre) %>% summarise(Total_Revenue = sum(as.numeric(Revenue),na.rm=TRUE),
Avg_Revenue= mean(as.numeric(Revenue), na.rm = TRUE))
Genre_Revenue
## # A tibble: 6 x 3
## Genre Total_Revenue Avg_Revenue
## <fct> <dbl> <dbl>
## 1 Action 1358. 339.
## 2 Adventure 1430. 119.
## 3 Biography 2095. 95.2
## 4 Comedy 1115. 101.
## 5 Crime 989. 110.
## 6 Drama 2078. 74.2
library(ggplot2)
ggplot(Genre_Revenue, aes(Total_Revenue, Avg_Revenue)) + geom_point(aes(color=Genre, size=0.9))
Time for some digging, as we see in the previous graph, one of our average revenue for GENRE is far away from the rest, let’s see what GENRE is it.
#For the outlier in the Y axis
Genre_Revenue[which(Genre_Revenue$Avg_Revenue>250),]
## # A tibble: 1 x 3
## Genre Total_Revenue Avg_Revenue
## <fct> <dbl> <dbl>
## 1 Action 1358. 339.
As we can, action genre revenue is considerably higher than the other genre, it has almost twice the revenue of the other genres
tapply(as.numeric(Final_Dataset$Revenue),Final_Dataset$Genre, mean, na.rm=TRUE) %>% barplot()
# what is the genre that is more produced in our dataset.
We see that Drama movies are a lot in this dataset, while Action is not. Even though Action movies are the lowest Genre produced, they generated the highest revenue
ggplot(Final_Dataset, aes(Genre)) + geom_bar()
ggplot(Final_Dataset, aes(x = Year, y = as.numeric(Revenue),fill=Genre)) +
geom_bar(stat = "identity", na.rm=TRUE)
## Warning: Removed 1 rows containing missing values (position_stack).
#For the outlier in the X axis
Genre_Revenue[which(Genre_Revenue$Total_Revenue>1800),]
## # A tibble: 2 x 3
## Genre Total_Revenue Avg_Revenue
## <fct> <dbl> <dbl>
## 1 Biography 2095. 95.2
## 2 Drama 2078. 74.2
xyplot(YearRevenue$Avg_Rev_Year~YearRevenue$Year)
The above plot shows us that the average revenue in 2009 was considerably higher than the rest of the years
We see that there is a slight relationship between the vote and revenue. Our revenue increased slightly based on how much votes we get. </>
ggplot(Final_Dataset, aes(Votes, as.numeric(Revenue), color=Genre)) + geom_point(na.rm=TRUE)
However, we see that we have some Genres that are far away from the rest, let’s see what movies are in those Genre, andget some other information about those movies.
Final_Dataset1<-Final_Dataset[,c("Year","Award","Winner","Name","Film","Genre","Director",
"Runtime", "Rating","Votes","Revenue")]
Final_Dataset1[which(as.numeric(Final_Dataset1$Votes)>750000),]
## Year Award Winner Name
## 5 2006 Director 1 Martin Scorsese
## 19 2009 Director 0 James Cameron
## 21 2009 Director 0 Quentin Tarantino
## 54 2013 Actor in a Leading Role 0 Leonardo DiCaprio
## 63 2013 Director 0 Martin Scorsese
## Film Genre Director Runtime Rating
## 5 The Departed Crime Martin Scorsese 151 8.5
## 19 Avatar Action James Cameron 162 7.8
## 21 Inglourious Basterds Adventure Quentin Tarantino 153 8.3
## 54 The Wolf of Wall Street Biography Martin Scorsese 180 8.2
## 63 The Wolf of Wall Street Biography Martin Scorsese 180 8.2
## Votes Revenue
## 5 937414 132.37
## 19 935408 760.51
## 21 959065 120.52
## 54 865134 116.87
## 63 865134 116.87
The below plot seem to show that the more Rating a movie get, the more revenue it generate. We also see that Drama and Biography seem to be more represented compared to the other GENRE
ggplot(Final_Dataset, aes(Rating,as.numeric(Revenue), color=Genre)) + geom_point(na.rm = TRUE)
Final_Dataset1[which(as.numeric(Final_Dataset1$Revenue)>400),] %>% head()
## Year Award Winner Name Film Genre Director Runtime
## 19 2009 Director 0 James Cameron Avatar Action James Cameron 162
## Rating Votes Revenue
## 19 7.8 935408 760.51
As we can see, most of the movies produced have runtime between 110 min and 140. It seems like our plot is clustered in 3 sections:
ggplot(Final_Dataset,aes(Runtime,as.numeric(Revenue), color=Genre)) + geom_point(na.rm=TRUE)
#Now, let’s see How many movies have been nominated for the Best Director category?\ We have 36 movies that were nominated as the best awarded movie.
Final_Dataset2<-Final_Dataset1[(Final_Dataset1$Award=="Director"),]
head(Final_Dataset2,6)
## Year Award Winner Name
## 4 2006 Director 0 Alejandro González Iñárritu
## 5 2006 Director 1 Martin Scorsese
## 9 2007 Director 0 Jason Reitman
## 10 2007 Director 1 Joel Coen and Ethan Coen
## 11 2007 Director 0 Paul Thomas Anderson
## 14 2008 Director 0 David Fincher
## Film Genre Director
## 4 Babel Drama Alejandro González Iñárritu
## 5 The Departed Crime Martin Scorsese
## 9 Juno Comedy Jason Reitman
## 10 No Country for Old Men Crime Ethan Coen
## 11 There Will Be Blood Drama Paul Thomas Anderson
## 14 The Curious Case of Benjamin Button Drama David Fincher
## Runtime Rating Votes Revenue
## 4 143 7.5 253417 34.30
## 5 151 8.5 937414 132.37
## 9 96 7.5 432461 143.49
## 10 122 8.1 660286 74.27
## 11 158 8.1 400682 40.22
## 14 166 7.8 485075 127.49
As you can see in this graph, only 25% of our nominated movies won the “BEST DIRECTOR”, while the 75% did not.
ggplot(Final_Dataset2, aes(Winner,fill=Winner), stat="count") + geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = ..count.., y= ..prop..), stat= "count", vjust = -.5)+
labs(y = "Percent", fill="Winner")
Final_Dataset2 %>% summarize(Average_rating= mean(Rating)) %>% glimpse()
## Observations: 1
## Variables: 1
## $ Average_rating <dbl> 7.822222
As we can see we have 9 movies in the “Best Director” nominees that were nominated in other oscar categories
#Here we are creating a dataset for only the movies that were nominated
Nominees<-Final_Dataset2$Film
# Now we will see which movies is repeated IN the both "Nominees" and "Final_Dataset" and stored it in "both"
both<- Final_Dataset1[which(Nominees %in% Final_Dataset1$Film),]
# Now we will be able to see how many movies from Best Director are also nominated in other oscar categories
Multiple_Nomination<-both[which(duplicated(both$Film)),]
Multiple_Nomination %>% select(Film, Genre,Director)
## Film Genre Director
## 9 Juno Comedy Jason Reitman
## 11 There Will Be Blood Drama Paul Thomas Anderson
## 14 The Curious Case of Benjamin Button Drama David Fincher
## 20 The Hurt Locker Drama Kathryn Bigelow
## 22 Up in the Air Drama Jason Reitman
## 30 Black Swan Drama Darren Aronofsky
## 32 The King's Speech Biography Tom Hooper
## 33 The Social Network Biography David Fincher
## 34 True Grit Adventure Ethan Coen