library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(tidyr)
library(stringr)
library(stringdist)
library(mlr)
## Loading required package: ParamHelpers
library(ggplot2)
library(ggthemes)
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggthemes':
##
## theme_map
## The following object is masked from 'package:ggplot2':
##
## ggsave
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
Each initial dataset underwent pre-processing procedures before merging. Using the summarizeColumns function we were able to constantly evaluate the progress of each procedure after every refinement. The first dataset, oscars
, did not have many issues and minimal editing was conducted. The second dataset, movies
, contained many anomalies which were later addressed after merging. We used partial string matching to merge both datasets. This worked quite efficiently although there was one minor mishap due to overfitting. Among the remaining variables, there were a few variables which were converted to different data types due to appropriateness of the type of data. NA values were not present in the merged dataset, however NA values were created to replace values storing 0 for budget
and revenue
. These were renamed as these values indicated an absence of data rather than zero dollars. We concluded the tidying process, by removing several redundant variables. A final summary of the dataset presented some new potential anomalies with unusual min values, these were addressed when handling outliers. Histograms and boxplots were created for all numeric variables to further understand the data, remove inconsistencies and emphasis potential outliers. We discovered a few observations with errors for budget
, these were removed. However, overall the data did not appear to have many unusual outliers. For every numerical variable, winners and nominees of Best Picture were segregrated to investigate patterns and correlations. A new variable was created, profit
, which would deem to be a good indicator of a film’s commercial success. However films with large profits did not necessarily ensure a Best Picture award, with most being just nominees. We explored correlations between numerical variables by creating scatterplox matrices, here we found the largest correlation could be found between a film’s budget and revenue. This correlation among others increased when subsetting the films that won Best Picture.
This report will focus on Academy Award winning and nominated films. We explore two datasets, the first containing metadata of over 45,000 films listed on TMDB (www.themoviedb.org). This dataset can be found at https://www.kaggle.com/rounakbanik/the-movies-dataset and was collected using the TMDB API. The dataset consists of movies released on or before July 2017.
The variables for this dataset include
adult
: boolean determining if film is an adult filmbelongs_to_collection
: information regarding sequelsbudget
: amount of money spent to produce the filmgenres
: genre classificationhomepage
: direct url link to film’s homepageid
: id number assigned by TMDBimdb_id
: id number assigned by IMDBoriginal_language
original_title
overview
: plot summary of filmpopularity
: a score given between 1 and 10 of the current popularity of the filmposter_path
: direct url link to film’s poster imageproduction_companies
production_countries
release_date
: given as Y:M:Drevenue
: total revenue of film after releaseruntime
: duration of film given in minutesspoken_languages
status
: status of film (Released, Rumoured, Post Production, Planned, In Production, Cancelled)tagline
title
video
: boolean determining if the trailer of the film is on TMDBvote_average
: average user vote of filmvote_count
: count of user votesWe also explore the Academy Awards Database containing the official record of past Academy Award winners and nominees. This dataset contains records from 1927 to 2015 (9964 observations). This dataset can be found at https://www.kaggle.com/theacademy/academy-awards.
The variables for this dataset include
Year
- the year of the ceremonyCeremony
- ordinal number of ceremonyAward
- type of award (Best Picture, Best Actor, Best Actress, etc.)Winner
- “1” indicates film won this awardName
- who or what is receiving award (name of film, name of actor, etc.)Film
- title of filmWe read both datasets into R using read.csv function and name the dataset containing the TMDB database as movies
and name the dataset containing the Academy Awards database as oscars
. We observe the oscars
dataset first.
setwd("/Users/Justin/Desktop/DP - ASS3")
movies <- read.csv("movies_metadata.csv")
oscars <- read.csv("database.csv")
head(oscars)
## Year Ceremony Award Winner Name Film
## 1 1927 1 Actor NA Richard Barthelmess The Noose
## 2 1927 1 Actor 1 Emil Jannings The Last Command
## 3 1927 1 Actress NA Louise Dresser A Ship Comes In
## 4 1927 1 Actress 1 Janet Gaynor 7th Heaven
## 5 1927 1 Actress NA Gloria Swanson Sadie Thompson
## 6 1927 1 Art Direction NA Rochus Gliese Sunrise
dim(oscars)
## [1] 9964 6
With summarizeColumns
, we notice the following anomalies:
Ceremony
may provide redundant information. The Year
column represents similar information, however with more specificity.Winner
contains 7643 NA values. We assume these NA values represent a nominee who did not win. This may be better represented as a boolean or factor variable with renamed values.As many films can be nominated for various awards and many actors/actresses may be nominated for several films over many years, we expect to have a varying number of levels among all factor variables.
summarizeColumns(oscars)
## name type na mean disp median mad min max nlevs
## 1 Year factor 0 NA 0.9813328 NA NA 3 186 95
## 2 Ceremony integer 0 45.4718 24.8413920 44 32.6172 1 88 0
## 3 Award factor 0 NA 0.9569450 NA NA 1 429 114
## 4 Winner integer 7643 1.0000 0.0000000 1 0.0000 1 1 0
## 5 Name factor 0 NA 0.9980931 NA NA 1 19 5754
## 6 Film factor 0 NA 0.9665797 NA NA 1 333 6361
oscars
As we are interested in specific films that have been nominated for Best Picture, we begin by subsetting the oscars
dataset. Before 1962, the title for the award category, “Best Picture” had changed many times and initially was named “Outstanding Picture”. In total there are 5 different titles for the same award, we ensure each name change has been considered when subsetting the oscars
dataset. We rename this dataset, best_picture
. Now that we have filtered our original dataset, the Award
column becomes redundant as all the rows essentially hold the same value for the same award. This also makes the Name
and Film
columns share the same information as Name
tells us what or who is receiving the award and Film
tells us the associated film. We are only dealing with film awards, therefore we only need one of these columns to represent the name of the film. We decide to remove Ceremony
, Award
and Film
by using the select function.
The new dimensions of the best_picture
are 528 observations and 3 columns. With summarizeColumns
, we notice that the number of levels of Name
is 524 whereas the number of total observations is 528. We investigate by creating a table of number of occurences of each value in Name
. Essentially this should determine the possibility of any duplicate values. We subset the newly created table to display only values with more than one occurence. The table returns 4 films, all with frequencies of 2 which would explain the inequality. This may be due to film titles with identical names, we create an additional column, name_year
which will store the concatenation of the name of the film and the year. Using the summarizeColumns function, we notice that the number of levels of this new column is equal to the number of total observations, this ensures that every film included is distinct. We conclude tidying the best_picture
dataset by addressing the NA values in Winner
. We change the NA values to 0 and convert the variable type to factor. Values are renamed for better readability.
best_picture <- oscars[ which(oscars$Award =='Outstanding Picture' | oscars$Award =='Outstanding Production' | oscars$Award =='Outstanding Motion Picture' | oscars$Award =='Best Motion Picture' | oscars$Award =='Best Picture'), ]
best_picture <- subset(best_picture, select= c("Year", "Winner", "Name"))
dim(best_picture)
## [1] 528 3
summarizeColumns(best_picture)
## name type na mean disp median mad min max nlevs
## 1 Year factor 0 NA 0.9772727 NA NA 0 12 89
## 2 Winner integer 439 1 0.0000000 1 0 1 1 0
## 3 Name factor 0 NA 0.9962121 NA NA 0 2 524
n_occur <- data.frame(table(best_picture$Name))
subset(n_occur, Freq > 1)
## Var1 Freq
## 1329 Cleopatra 2
## 2163 Heaven Can Wait 2
## 3249 Mutiny on the Bounty 2
## 3805 Romeo and Juliet 2
best_picture <- transform(best_picture,name_year=paste0(best_picture$Name,best_picture$Year))
summarizeColumns(best_picture)
## name type na mean disp median mad min max nlevs
## 1 Year factor 0 NA 0.9772727 NA NA 0 12 89
## 2 Winner integer 439 1 0.0000000 1 0 1 1 0
## 3 Name factor 0 NA 0.9962121 NA NA 0 2 524
## 4 name_year factor 0 NA 0.9981061 NA NA 1 1 528
best_picture$Winner[is.na(best_picture$Winner)] <- "0"
best_picture$Winner <- factor(best_picture$Winner, labels=c("Nominated","Won"))
head(best_picture)
## Year Winner Name name_year
## 20 1928 Nominated The Racket The Racket1928
## 21 1927 Nominated 7th Heaven 7th Heaven1927
## 22 1927 Won Wings Wings1927
## 63 1929 Nominated Alibi Alibi1929
## 64 1928 Nominated In Old Arizona In Old Arizona1928
## 65 1929 Won The Broadway Melody The Broadway Melody1929
Using the head function, we see that there are several variables, many of which storing irrelevant information. We decide to select columns of interest.
With summarizeColumns
, we notice the following anomalies:
id
column. This may be due to duplicate values. We expect duplicate values in original_title
as there is a good chance that many films share the same name.adult
appears to only consist of binary values. There may be more suitable format types to represent these values.budget
should be a continuous variable, there may be a format type more suitable to represent these values.revenue
, runtime
and vote_average
contain NA values.head(movies)
## adult
## 1 False
## 2 False
## 3 False
## 4 False
## 5 False
## 6 False
## belongs_to_collection
## 1 {'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}
## 2
## 3 {'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}
## 4
## 5 {'id': 96871, 'name': 'Father of the Bride Collection', 'poster_path': '/nts4iOmNnq7GNicycMJ9pSAn204.jpg', 'backdrop_path': '/7qwE57OVZmMJChBpLEbJEmzUydk.jpg'}
## 6
## budget
## 1 30000000
## 2 65000000
## 3 0
## 4 16000000
## 5 0
## 6 60000000
## genres
## 1 [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
## 2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
## 3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
## 4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
## 5 [{'id': 35, 'name': 'Comedy'}]
## 6 [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
## homepage id imdb_id original_language
## 1 http://toystory.disney.com/toy-story 862 tt0114709 en
## 2 8844 tt0113497 en
## 3 15602 tt0113228 en
## 4 31357 tt0114885 en
## 5 11862 tt0113041 en
## 6 949 tt0113277 en
## original_title
## 1 Toy Story
## 2 Jumanji
## 3 Grumpier Old Men
## 4 Waiting to Exhale
## 5 Father of the Bride Part II
## 6 Heat
## overview
## 1 Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.
## 2 When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.
## 3 A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.
## 4 Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive "good man" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe.
## 5 Just when George Banks has recovered from his daughter's wedding, he receives the news that she's pregnant ... and that George's wife, Nina, is expecting too. He was planning on selling their home, but that's a plan that -- like George -- will have to change with the arrival of both a grandchild and a kid of his own.
## 6 Obsessive master thief, Neil McCauley leads a top-notch crew on various insane heists throughout Los Angeles while a mentally unstable detective, Vincent Hanna pursues him without rest. Each man recognizes and respects the ability and the dedication of the other even though they are aware their cat-and-mouse game may end in violence.
## popularity poster_path
## 1 21.946943 /rhIRbceoE9lR4veEXuwCC2wARtG.jpg
## 2 17.015539 /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
## 3 11.7129 /6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
## 4 3.859495 /16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
## 5 8.387519 /e64sOI48hQXyru7naBFyssKFxVd.jpg
## 6 17.924927 /zMyfPUelumio3tiDKPffaUpsQTD.jpg
## production_companies
## 1 [{'name': 'Pixar Animation Studios', 'id': 3}]
## 2 [{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]
## 3 [{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]
## 4 [{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]
## 5 [{'name': 'Sandollar Productions', 'id': 5842}, {'name': 'Touchstone Pictures', 'id': 9195}]
## 6 [{'name': 'Regency Enterprises', 'id': 508}, {'name': 'Forward Pass', 'id': 675}, {'name': 'Warner Bros.', 'id': 6194}]
## production_countries release_date
## 1 [{'iso_3166_1': 'US', 'name': 'United States of America'}] 1995-10-30
## 2 [{'iso_3166_1': 'US', 'name': 'United States of America'}] 1995-12-15
## 3 [{'iso_3166_1': 'US', 'name': 'United States of America'}] 1995-12-22
## 4 [{'iso_3166_1': 'US', 'name': 'United States of America'}] 1995-12-22
## 5 [{'iso_3166_1': 'US', 'name': 'United States of America'}] 1995-02-10
## 6 [{'iso_3166_1': 'US', 'name': 'United States of America'}] 1995-12-15
## revenue runtime
## 1 373554033 81
## 2 262797249 104
## 3 0 101
## 4 81452156 127
## 5 76578911 106
## 6 187436818 170
## spoken_languages
## 1 [{'iso_639_1': 'en', 'name': 'English'}]
## 2 [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]
## 3 [{'iso_639_1': 'en', 'name': 'English'}]
## 4 [{'iso_639_1': 'en', 'name': 'English'}]
## 5 [{'iso_639_1': 'en', 'name': 'English'}]
## 6 [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'es', 'name': 'Español'}]
## status
## 1 Released
## 2 Released
## 3 Released
## 4 Released
## 5 Released
## 6 Released
## tagline
## 1
## 2 Roll the dice and unleash the excitement!
## 3 Still Yelling. Still Fighting. Still Ready for Love.
## 4 Friends are the people who let you be yourself... and never let you forget it.
## 5 Just When His World Is Back To Normal... He's In For The Surprise Of His Life!
## 6 A Los Angeles Crime Saga
## title video vote_average vote_count
## 1 Toy Story False 7.7 5415
## 2 Jumanji False 6.9 2413
## 3 Grumpier Old Men False 6.5 92
## 4 Waiting to Exhale False 6.1 34
## 5 Father of the Bride Part II False 5.7 173
## 6 Heat False 7.7 1886
dim(movies)
## [1] 45466 24
movies <- subset(movies, select= c("original_title","adult", "budget", "id", "release_date", "revenue", "runtime", "vote_average"))
summarizeColumns(movies)
## name type na mean disp median mad min
## 1 original_title factor 0 NA 9.998240e-01 NA NA 1
## 2 adult factor 0 NA 2.639335e-04 NA NA 1
## 3 budget factor 0 NA 1.955967e-01 NA NA 1
## 4 id factor 0 NA 9.999340e-01 NA NA 1
## 5 release_date factor 0 NA 9.970088e-01 NA NA 1
## 6 revenue numeric 6 1.120935e+07 6.433225e+07 0 0.00000 0
## 7 runtime numeric 263 9.412820e+01 3.840781e+01 95 16.30860 0
## 8 vote_average numeric 6 5.618207e+00 1.924216e+00 6 1.33434 0
## max nlevs
## 1 8 43373
## 2 45454 5
## 3 36573 1226
## 4 3 45436
## 5 136 17337
## 6 2787965087 0
## 7 1256 0
## 8 10 0
movies
We begin tidying the movies
dataset by applying the distinct function, this will ensure that duplicate values with the same id are removed. The best_picture
dataset contains films strictly between 1927 and 2015, therefore to minimize the size of movies
we filter out films outside that timeframe. The release_date
provides us with the corresponding day, month and year of release, we extract the year for filtering purposes. We also create a new variable, name_year
that stores the concatenation of the name of the film and the year. Now we can use this variable as a key for merging and will ensure that films are matched by film title and year.
Further tidying will be applied after merging. This will address the previously listed anomalies.
movies <- movies %>% distinct(id, .keep_all = TRUE)
movies$release_date <- as.numeric(substring(movies$release_date,1,4))
movies <- subset(movies, movies$release_date >= 1927 & movies$release_date <= 2015)
movies <- transform(movies,name_year=paste0(movies$original_title,movies$release_date))
head(movies)
## original_title adult budget id release_date revenue
## 1 Toy Story False 30000000 862 1995 373554033
## 2 Jumanji False 65000000 8844 1995 262797249
## 3 Grumpier Old Men False 0 15602 1995 0
## 4 Waiting to Exhale False 16000000 31357 1995 81452156
## 5 Father of the Bride Part II False 0 11862 1995 76578911
## 6 Heat False 60000000 949 1995 187436818
## runtime vote_average name_year
## 1 81 7.7 Toy Story1995
## 2 104 6.9 Jumanji1995
## 3 101 6.5 Grumpier Old Men1995
## 4 127 6.1 Waiting to Exhale1995
## 5 106 5.7 Father of the Bride Part II1995
## 6 170 7.7 Heat1995
We use the amatch function for merging. The amatch function is specifically used for partial string matching. This will ensure that delimiter issues or misspellings do not affect the outcome. As mentioned earlier, we use the newly created variable, name_year
as the key. We create a function to search through the movies
dataset for values that have partial string matches with the values in best_picture
. For every match the function finds, the rbind function is used to add joining variables to a new dataset, merge
.
The dimensions of the merge
are 489 observations and 13 columns. We see a decrease of values as compared to the best_picture
dataset which had 528 observations. The missing observations may not be in the movies
dataset.
am<-amatch(movies$name_year,best_picture$name_year,maxDist = 1)
merge<-data.frame()
for (i in 1:dim(movies)[1])
{
if(!is.na(best_picture[am[i],]))
{
merge <-rbind(merge,data.frame(movies[i,],best_picture[am[i],]))
}
}
dim(merge)
## [1] 489 13
With summarizeColumns
, we notice that the number of levels of name_year.1
is 487 whereas the number of total observations is name_year
is 489. This is unusual as these two variables who have equal number of levels. We check the values in name_year.1
and find a film named Room which occurs three times in merge
. When we observed films with more than one occurence within best_picture
, the film, Room was not present. We investigate values with the same film name. When we subset the merge
dataset, we see that the values for name_year
are all different. They incude Room2015, Room2005 and Zoom2015. It appears that the amatch function may have overfitted. To ensure that these films should be included in merge
we check if they exist in best_picture
. We see that only Room2015 exists, therefore we remove the misspecified values. We rename the updated the dataset oscar_metadata
.
summarizeColumns(merge)
## name type na mean disp median
## 1 original_title factor 0 NA 9.959100e-01 NA
## 2 adult factor 0 NA 0.000000e+00 NA
## 3 budget factor 0 NA 6.441718e-01 NA
## 4 id factor 0 NA 9.979550e-01 NA
## 5 release_date numeric 0 1.970168e+03 2.765924e+01 1969.0
## 6 revenue numeric 0 9.480191e+07 2.145595e+08 19255967.0
## 7 runtime numeric 0 1.242924e+02 2.799031e+01 120.0
## 8 vote_average numeric 0 6.900204e+00 9.570810e-01 7.1
## 9 name_year factor 0 NA 9.979550e-01 NA
## 10 Year factor 0 NA 9.775051e-01 NA
## 11 Winner factor 0 NA 1.595092e-01 NA
## 12 Name factor 0 NA 9.938650e-01 NA
## 13 name_year.1 factor 0 NA 9.938650e-01 NA
## mad min max nlevs
## 1 NA 0.0 2.0 484
## 2 NA 0.0 489.0 1
## 3 NA 0.0 174.0 147
## 4 NA 0.0 1.0 489
## 5 4.00302e+01 1927.0 2015.0 0
## 6 2.85489e+07 0.0 2787965087.0 0
## 7 2.37216e+01 0.0 248.0 0
## 8 7.41300e-01 2.5 8.5 0
## 9 NA 0.0 1.0 489
## 10 NA 0.0 11.0 89
## 11 NA 78.0 411.0 2
## 12 NA 0.0 3.0 483
## 13 NA 0.0 3.0 487
n_occur2 <- data.frame(table(merge$name_year.1))
subset(n_occur2, Freq > 1)
## Var1 Freq
## 314 Room2015 3
subset(merge, merge$name_year.1 == "Room2015")
## original_title adult budget id release_date revenue runtime
## 32124 Room False 6000000 264644 2015 35401758 117
## 36661 Room False 0 69917 2005 0 0
## 37876 Zoom False 0 351065 2015 0 96
## vote_average name_year Year Winner Name name_year.1
## 32124 8.1 Room2015 2015 Nominated Room Room2015
## 36661 2.5 Room2005 2015 Nominated Room Room2015
## 37876 5.7 Zoom2015 2015 Nominated Room Room2015
subset(best_picture, best_picture$Name == "Room" | best_picture$Name == "Zoom")
## Year Winner Name name_year
## 9920 2015 Nominated Room Room2015
oscar_metadata <-merge [!(merge$id == "69917" | merge$id == "351065"), ]
We review oscar_metadata
using summarizeColumns. We see that name_year.1
has equal number of levels as name_year
. Now that we have ensured that our dataset contains only distinct observations, we remove the following columns - name_year
and name_year.1
. release_year
and Year
have the same information, therefore we remove Year
and original_title
and Name
have the same information, therefore we remove Name
. We also remove Adult
as it only has one level, indicating constant values. We see that revenue
has a min value of 0 which is unusual and may indicate an absence of data rather than zero dollars. We decide to convert all 0 values in revenue
to NA. We do the same for budget
and convert budget
to a numeric variable as it is continuous. release_date
represents a categorical feature therefore we convert it to a factor variable. The head and summarizeColumns functions are used to view the cleaner update. The only anomaly that remains is that budget
and revenue
have extremely low min values. These may be data entry errors and will be addressed when outliers are handled.
summarizeColumns(oscar_metadata)
## name type na mean disp median
## 1 original_title factor 0 NA 9.958932e-01 NA
## 2 adult factor 0 NA 0.000000e+00 NA
## 3 budget factor 0 NA 6.468172e-01 NA
## 4 id factor 0 NA 9.979466e-01 NA
## 5 release_date numeric 0 1.970004e+03 2.759570e+01 1.968e+03
## 6 revenue numeric 0 9.519124e+07 2.149142e+08 2.000e+07
## 7 runtime numeric 0 1.246057e+02 2.744356e+01 1.200e+02
## 8 vote_average numeric 0 6.911704e+00 9.363963e-01 7.100e+00
## 9 name_year factor 0 NA 9.979466e-01 NA
## 10 Year factor 0 NA 9.774127e-01 NA
## 11 Winner factor 0 NA 1.601643e-01 NA
## 12 Name factor 0 NA 9.958932e-01 NA
## 13 name_year.1 factor 0 NA 9.979466e-01 NA
## mad min max nlevs
## 1 NA 0.0 2.0 483
## 2 NA 0.0 487.0 1
## 3 NA 0.0 172.0 147
## 4 NA 0.0 1.0 487
## 5 3.85476e+01 1927.0 2015.0 0
## 6 2.96520e+07 0.0 2787965087.0 0
## 7 2.37216e+01 66.0 248.0 0
## 8 7.41300e-01 2.5 8.5 0
## 9 NA 0.0 1.0 487
## 10 NA 0.0 11.0 89
## 11 NA 78.0 409.0 2
## 12 NA 0.0 2.0 483
## 13 NA 0.0 1.0 487
oscar_metadata <- subset(oscar_metadata, select= -c(adult, name_year, name_year.1, Year, Name))
oscar_metadata$revenue[oscar_metadata$revenue==0] <- NA
oscar_metadata$budget[oscar_metadata$budget==0] <- NA
oscar_metadata$budget <- as.numeric(as.vector(oscar_metadata$budget))
oscar_metadata$release_date <- as.factor(oscar_metadata$release_date)
head(oscar_metadata)
## original_title budget id release_date revenue runtime
## 17 Sense and Sensibility 16500000 4584 1995 135000000 136
## 34 Babe 30000000 9598 1995 254134910 89
## 109 Braveheart 72000000 197 1995 210000000 177
## 110 Taxi Driver 1300000 103 1976 28262574 114
## 148 Apollo 13 52000000 568 1995 355237933 140
## 257 Star Wars 11000000 11 1977 775398007 121
## vote_average Winner
## 17 7.2 Nominated
## 34 6.0 Nominated
## 109 7.7 Won
## 110 8.1 Nominated
## 148 7.3 Nominated
## 257 8.1 Nominated
summarizeColumns(oscar_metadata)
## name type na mean disp median
## 1 original_title factor 0 NA 9.958932e-01 NA
## 2 budget numeric 172 2.363877e+07 3.613357e+07 11000000.0
## 3 id factor 0 NA 9.979466e-01 NA
## 4 release_date factor 0 NA 9.774127e-01 NA
## 5 revenue numeric 181 1.514972e+08 2.550293e+08 68690110.5
## 6 runtime numeric 0 1.246057e+02 2.744356e+01 120.0
## 7 vote_average numeric 0 6.911704e+00 9.363963e-01 7.1
## 8 Winner factor 0 NA 1.601643e-01 NA
## mad min max nlevs
## 1 NA 0.0 2.0 483
## 2 1.363992e+07 4.0 237000000.0 0
## 3 NA 0.0 1.0 487
## 4 NA 2.0 11.0 89
## 5 8.541032e+07 25.0 2787965087.0 0
## 6 2.372160e+01 66.0 248.0 0
## 7 7.413000e-01 2.5 8.5 0
## 8 NA 78.0 409.0 2
We explore the distribution of all numeric variables to better understand the data.
The histogram below displays a slightly left skew, this is expected as most films nominated for Best Picture tend to be praised by the general public. The mean vote_average
value is 6.91. We see that values do not exceed below 2.5 or beyond 8.5. When we look at the observations holding minimum values, we see that the three films listed are quite old and that these films were nominated for Best Picture, but were not winners in their respective years. The lack of success for these films may have been better explained if there was information about budget or revenue. We see that “The Shawshank Redemption” and “The Godfather” have the highest vote_average
scores with values of 8.5. Both films with large revenue
values compared to budget
indicating a success cinematic release. “The Godfather” was awarded Best Picture in 1972, however “The Shawshank Redemption” failed to win despite being well-received among users.
theme_set(theme_solarized())
mean_vote <- oscar_metadata %>%
mutate(Label = paste0("µ: ",prettyNum((mean(vote_average)))))
ggplot(oscar_metadata, aes(x = vote_average, fill="grey")) + geom_histogram(colour = "white",bins = 100) +
labs(title = 'Histogram of Vote Average') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) +
scale_fill_manual(values = c("#696969")) + guides(fill=FALSE) +
geom_vline(data=oscar_metadata, aes(xintercept=mean(vote_average)),linetype="dashed", size=1.3, colour="red", alpha=0.7) +
ylab("Frequency") + geom_text(data = mean_vote, aes(x = mean(vote_average), y = 40, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
xlab("Vote Average") + xlim(0, 10)
subset(oscar_metadata, oscar_metadata$vote_average == min(oscar_metadata$vote_average, na.rm=TRUE))
## original_title budget id release_date revenue runtime
## 16683 Trader Horn NA 84296 1931 NA 122
## 25714 Flirtation Walk NA 84100 1934 NA 97
## 27047 One Night of Love NA 84097 1934 NA 84
## vote_average Winner
## 16683 2.5 Nominated
## 25714 2.5 Nominated
## 27047 2.5 Nominated
subset(oscar_metadata, oscar_metadata$vote_average == max(oscar_metadata$vote_average, na.rm=TRUE))
## original_title budget id release_date revenue runtime
## 315 The Shawshank Redemption 2.5e+07 278 1994 28341469 142
## 835 The Godfather 6.0e+06 238 1972 245066411 175
## vote_average Winner
## 315 8.5 Nominated
## 835 8.5 Won
When we segregrate the data by winners and nominees, we see that the distribution is spread out more for nominated films. This can be explained by the disproportion of values. As there can only be one winner per year and multiple nominees per year, we expect the distribution to be less variant for winners. The histogram for Nominated
follows the previous histogram with a negative skew. The boxplot for Nominated
displays multiple outliers after 5.0. There is no need to handle these outliers as they sit well within the suitable values for vote_average (between 1 and 10). The distribution for Won
centres around 7.5, which is slightly higher than the mean of the entire distribution (6.91). We see 3 outliers that exceed below ~5.0, despite having such low vote_average
scores, these films managed to win Best Picture
in their respective years. We take a closer look at these outliers and see that these films are from the early 1930’s. We infer that older films tend to have the lower scores for vote_average
. We explore correlations between variables later on. Despite these films being voted quite badly, “The Broadway Melody” and “Cavalcade” both have higher revenue
values than budget
values, indicating that these films were still a success.
theme_set(theme_gray())
p1 <- ggplot(oscar_metadata, aes(x = factor(1), y = vote_average)) +
geom_boxplot(width = .50) + scale_y_continuous(limits = c(0, 10)) + theme(axis.title.y=element_blank()) +
facet_grid(. ~ Winner)
p2 <- ggplot(oscar_metadata, aes(x = vote_average)) +
geom_histogram(colour="white",aes(vote_average), fill="dodgerblue", alpha = 1/2,bins = 50) +
scale_x_continuous(limits = c(0, 10)) + ggtitle("Distribution of Vote Average") + facet_grid(. ~ Winner) +
theme(axis.title.x=element_blank())
plot_grid(p2, p1 + coord_flip() +
theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()),
ncol=1, align="v", rel_heights = c(2,1))
winning_films <- subset(oscar_metadata, oscar_metadata$Winner == "Won")
boxplot.stats(winning_films$vote_average)$out
## [1] 5.0 5.1 4.6
subset_winning_outliers <- subset(winning_films, winning_films$vote_average <= 5.1)
head(subset_winning_outliers[order(subset_winning_outliers$vote_average, decreasing = TRUE),])
## original_title budget id release_date revenue runtime
## 1819 Cimarron NA 42861 1931 NA 123
## 1817 The Broadway Melody 379000 65203 1929 4358000 100
## 1821 Cavalcade 1180280 56164 1933 7630000 110
## vote_average Winner
## 1819 5.1 Won
## 1817 5.0 Won
## 1821 4.6 Won
The histogram of runtime
displays a slightly right skewed normal distribution with a wider spread of values. The mean runtime
value is 125 minutes or 2 hours and 5 minutes. When we explore minimum values, we find that the shortest film in this dataset, “She Done Him Wrong” is 66 minutes long. The film with the longest run time is “Cleopatra” (1963) with a run time of over 3 hours. We explore these extreme values to ensure they are not outliers.
theme_set(theme_solarized())
mean_runtime <- oscar_metadata %>%
mutate(Label = paste0("µ: ",prettyNum((mean(runtime)))))
ggplot(oscar_metadata, aes(x = runtime, fill="grey")) + geom_histogram(colour = "white",bins = 100) +
labs(title = 'Histogram of Run Time') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) +
scale_fill_manual(values = c("#696969")) + guides(fill=FALSE) +
geom_vline(data=oscar_metadata, aes(xintercept=mean(runtime)),linetype="dashed", size=1.3, colour="red", alpha=0.7) +
ylab("Frequency") + geom_text(data = mean_runtime, aes(x = mean(runtime), y = 25, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
xlab("Run Time (minutes)") + xlim(0, 250)
subset(oscar_metadata, oscar_metadata$runtime == min(oscar_metadata$runtime, na.rm=TRUE))
## original_title budget id release_date revenue runtime
## 8345 She Done Him Wrong 2e+05 43595 1933 2200000 66
## vote_average Winner
## 8345 5.2 Nominated
subset(oscar_metadata, oscar_metadata$runtime == max(oscar_metadata$runtime, na.rm=TRUE))
## original_title budget id release_date revenue runtime
## 4146 Cleopatra 31115000 8095 1963 7.1e+07 248
## vote_average Winner
## 4146 6.7 Nominated
The histogram for Nominated
follows the previous histogram with a positive skew. The boxplot for Nominated
displays multiple outliers above 175 minutes. The distribution for Won
is relatively flat, however these films tend to be longer than nominated films. The boxplot for Won
returns one outlier. As we saw previously, the film with the maximum value of 248 minutes didd not appear to be an error, therefore we do not handle these outliers as they fit within a suitable range for a film’s run time.
theme_set(theme_gray())
p3 <- ggplot(oscar_metadata, aes(x = factor(1), y = runtime)) +
geom_boxplot(width = .50) + scale_y_continuous(limits = c(0, 250)) + theme(axis.title.y=element_blank()) +
facet_grid(. ~ Winner)
p4 <- ggplot(oscar_metadata, aes(x = runtime)) +
geom_histogram(colour="white",aes(runtime), fill="dodgerblue", alpha = 1/2,bins = 50) +
scale_x_continuous(limits = c(0, 250)) + ggtitle("Distribution of Runtime") + facet_grid(. ~ Winner) + theme(axis.title.x=element_blank())
plot_grid(p4, p3 + coord_flip() +
theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()), ncol=1, align="v", rel_heights = c(2,1))
We previously converted 0 values of budget
to NA. We remove these observations when we investigate outliers. The boxlot of budget
displays several outliers over a budget of 50 million dollars. As mentioned before we found rather unusual minimum values when reviewing the last summary of the whole dataset. Values as low as $4. We explore this by subsetting the dataset to return only values lower than 100,000. We discover 4 films with budgets of 4, 424, 340 and 762. All listed films are from the early 1930’s and may be storing incorrect budget values. Three of these films do not a value for revenue and the film that does, has a revenue of 25 dollars. We decide to remove these errors from the dataset and review the boxplot after the update. We also divide the budget
values by a million for better readability. There appears to be little to no change to the distribution of budget
values. The boxplot returns 30 outliers. We explore the 10 maximum outliers and find that these values are all voted quite highly and return higher revenue
values. Despite the amount of outliers, we decide to keep of all values as they appear to be legitimate. Interestingly enough, only one of these films had won Best Picture (“Titanic”), suggesting that perhaps having a large budget does not correlate with winning Best Picture.
budget_noNAs <- oscar_metadata[!is.na(oscar_metadata$budget),]
boxplot(budget_noNAs$budget, main="Box Plot of Budget", ylab="Budget",col = "grey")
subset(oscar_metadata, budget < 100000)
## original_title budget id release_date revenue runtime
## 950 A Farewell to Arms 4 22649 1932 25 89
## 4791 Little Women 424 39938 1933 NA 115
## 7766 The Divorcee 340 80708 1930 NA 84
## 14371 One Hundred Men and a Girl 762 61647 1937 NA 84
## vote_average Winner
## 950 6.2 Nominated
## 4791 6.0 Nominated
## 7766 5.6 Nominated
## 14371 4.9 Nominated
oscar_metadata <-oscar_metadata [!(oscar_metadata$id == "22649" | oscar_metadata$id == "39938" |
oscar_metadata$id == "80708" | oscar_metadata$id == "61647"), ]
oscar_metadata$budget <- oscar_metadata$budget/1000000
budget_noNAs <- oscar_metadata[!is.na(oscar_metadata$budget),]
boxplot(budget_noNAs$budget, main="Box Plot of Budget", ylab="Budget",col = "grey")
outlier_budget <- boxplot.stats(budget_noNAs$budget)$out
length(outlier_budget)
## [1] 30
subset_budget_outliers <- subset(oscar_metadata, oscar_metadata$budget > min(outlier_budget))
head(subset_budget_outliers[order(subset_budget_outliers$budget, decreasing = TRUE),],10)
## original_title budget id
## 14547 Avatar 237 19995
## 1639 Titanic 200 597
## 15343 Toy Story 3 200 10193
## 13720 Up 175 14160
## 18085 Hugo 170 44826
## 15475 Inception 160 27205
## 6796 Master and Commander: The Far Side of the World 150 8619
## 13216 The Curious Case of Benjamin Button 150 4922
## 26537 Mad Max: Fury Road 150 76341
## 31845 The Revenant 135 281957
## release_date revenue runtime vote_average Winner
## 14547 2009 2787965087 162 7.2 Nominated
## 1639 1997 1845034188 194 7.5 Won
## 15343 2010 1066969703 103 7.6 Nominated
## 13720 2009 735099082 96 7.8 Nominated
## 18085 2011 185770160 126 7.0 Nominated
## 15475 2010 825532764 148 8.1 Nominated
## 6796 2003 212011111 138 6.9 Nominated
## 13216 2008 333932083 166 7.3 Nominated
## 26537 2015 378858340 120 7.3 Nominated
## 31845 2015 532950503 156 7.3 Nominated
The histogram of budget
displays a positively skewed distribution with values reaching close to 250 million dollars. The mean budget
value is ~24 million dollars.
mean_budget <- budget_noNAs %>%
mutate(Label = paste0("µ: ",prettyNum((mean(budget)))))
ggplot(budget_noNAs, aes(x = budget, fill="grey")) + geom_histogram(colour = "white",bins = 100) +
labs(title = 'Histogram of Budget') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) +
scale_fill_manual(values = c("#696969")) + guides(fill=FALSE) +
geom_vline(data=budget_noNAs, aes(xintercept=mean(budget)),linetype="dashed", size=1.3, colour="red", alpha=0.7) +
ylab("Frequency") + geom_text(data = mean_budget, aes(x = mean(budget), y = 65, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
xlab("Budget (in millions)") + xlim(0, 250)
The histogram for Nominated
follows the previous histogram with a positive skew. The boxplot for Nominated
displays multiple outliers above 75 million dollars. The distribution for Won
follows a similar pattern, however values rarely exceed 100 million dollars. The boxplot for Won
returns several outliers above 50 million dollars.
theme_set(theme_gray())
p5 <- ggplot(budget_noNAs, aes(x = factor(1), y = budget)) +
geom_boxplot(width = .50) + scale_y_continuous(limits = c(0, 250)) + theme(axis.title.y=element_blank()) +
facet_grid(. ~ Winner)
p6 <- ggplot(budget_noNAs, aes(x = budget)) +
geom_histogram(colour="white",aes(budget), fill="dodgerblue", alpha = 1/2,bins = 50) +
scale_x_continuous(limits = c(0, 250)) + ggtitle("Distribution of Budget") + facet_grid(. ~ Winner) +
theme(axis.title.x=element_blank())
plot_grid(p6, p5 + coord_flip() +
theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()),
ncol=1, align="v", rel_heights = c(2,1))
Similar to budget
, we previously converted 0 values of revenue
to NA. We remove these observations when we investigate outliers. As mentioned before we found rather unusual minimum values when reviewing the last summary of the whole dataset. Values as low as $25. We explore this by calling the summarizeColumns function, here we see that the min value for revenue
is 83,305 dollars. This is a suitable revenue, therefore we decide to keep this observation and review the boxplot for revenue
. We also divide the revenue
values by a million for better readability.
The boxlot of revenue
displays several outliers over a revenue of 500 million dollars. Overall, the boxplot returns 24 outliers. We explore the 10 maximum outliers and find that these values are all voted quite highly. Despite the amount of outliers, we decide to keep of all values as they appear to be legitimate. The only films to had won Best Picture out of these maximum outlier values were “Titanic” and “The Lord of the Rings: The Return of the King”.
revenue_noNAs <- oscar_metadata[!is.na(oscar_metadata$revenue),]
summarizeColumns(revenue_noNAs)
## name type na mean disp median
## 1 original_title factor 0 NA 9.934426e-01 NA
## 2 budget numeric 17 2.558674e+01 3.717385e+01 14.0
## 3 id factor 0 NA 9.967213e-01 NA
## 4 release_date factor 0 NA 9.672131e-01 NA
## 5 revenue numeric 0 1.519939e+08 2.553001e+08 68706993.0
## 6 runtime numeric 0 1.314951e+02 2.819739e+01 126.0
## 7 vote_average numeric 0 7.289180e+00 5.688004e-01 7.3
## 8 Winner factor 0 NA 2.229508e-01 NA
## mad min max nlevs
## 1 NA 0.0 2.0 304
## 2 1.630860e+01 0.2 237.0 0
## 3 NA 0.0 1.0 305
## 4 NA 0.0 10.0 82
## 5 8.526425e+07 83305.0 2787965087.0 0
## 6 2.223900e+01 66.0 248.0 0
## 7 4.447800e-01 4.6 8.5 0
## 8 NA 68.0 237.0 2
oscar_metadata$revenue <- oscar_metadata$revenue/1000000
revenue_noNAs <- oscar_metadata[!is.na(oscar_metadata$revenue),]
boxplot(revenue_noNAs$revenue, main="Box Plot of Revenue", ylab="Revenue",col = "grey")
outlier_revenue <- boxplot.stats(revenue_noNAs$revenue)$out
length(outlier_revenue)
## [1] 24
subset_revenue_outliers <- subset(oscar_metadata, oscar_metadata$revenue > min(outlier_revenue))
head(subset_revenue_outliers[order(subset_revenue_outliers$revenue, decreasing = TRUE),],10)
## original_title budget id
## 14547 Avatar 237 19995
## 1639 Titanic 200 597
## 7000 The Lord of the Rings: The Return of the King 94 122
## 15343 Toy Story 3 200 10193
## 5814 The Lord of the Rings: The Two Towers 79 121
## 4863 The Lord of the Rings: The Fellowship of the Ring 93 120
## 15475 Inception 160 27205
## 257 Star Wars 11 11
## 13720 Up 175 14160
## 21581 Gravity 105 49047
## release_date revenue runtime vote_average Winner
## 14547 2009 2787.9651 162 7.2 Nominated
## 1639 1997 1845.0342 194 7.5 Won
## 7000 2003 1118.8890 201 8.1 Won
## 15343 2010 1066.9697 103 7.6 Nominated
## 5814 2002 926.2874 179 8.0 Nominated
## 4863 2001 871.3684 178 8.0 Nominated
## 15475 2010 825.5328 148 8.1 Nominated
## 257 1977 775.3980 121 8.1 Nominated
## 13720 2009 735.0991 96 7.8 Nominated
## 21581 2013 716.3927 91 7.3 Nominated
The histogram of revenue
displays a positively skewed distribution with values reaching close to 2 billion dollars. The mean revenue
value is ~152 million dollars. When comparing this mean to budget
, we see that overall films tend to profit ~128 million dollars.
mean_revenue <- revenue_noNAs %>%
mutate(Label = paste0("µ: ",prettyNum((mean(revenue)))))
ggplot(revenue_noNAs, aes(x = revenue, fill="grey")) + geom_histogram(colour = "white",bins = 100) +
labs(title = 'Histogram of Revenue') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) +
scale_fill_manual(values = c("#696969")) + guides(fill=FALSE) +
geom_vline(data=revenue_noNAs, aes(xintercept=mean(revenue)),linetype="dashed", size=1.3, colour="red", alpha=0.7) +
ylab("Frequency") + geom_text(data = mean_revenue, aes(x = mean(revenue), y = 65, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
xlab("Revenue (in millions)") + xlim(0, 2800)
The histogram for Nominated
follows the previous histogram with a positive skew. The boxplot for Nominated
displays multiple outliers above 400 million dollars. The distribution for Won
follows a similar pattern. The boxplot for Won
returns three outliers above 500 million dollars.
theme_set(theme_gray())
p7 <- ggplot(revenue_noNAs, aes(x = factor(1), y = revenue)) +
geom_boxplot(width = .50) + scale_y_continuous(limits = c(0, 2800)) + theme(axis.title.y=element_blank()) +
facet_grid(. ~ Winner)
p8 <- ggplot(revenue_noNAs, aes(x = revenue)) +
geom_histogram(colour="white",aes(revenue), fill="dodgerblue", alpha = 1/2,bins = 50) +
scale_x_continuous(limits = c(0, 2800)) + ggtitle("Distribution of Revenue") + facet_grid(. ~ Winner) +
theme(axis.title.x=element_blank())
plot_grid(p8, p7 + coord_flip() +
theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()),
ncol=1, align="v", rel_heights = c(2,1))
We create a new variable, named profit
. This is calculated by subtracting the budget
from the revenue
. Before creating profit
, we have to remove all NA values. The summary belows confirms this. We also change the name of the updated dataset to oscar_metadata2
.
The boxlot of profit
displays several outliers over a profit of 500 million dollars. Overall, the boxplot returns 24 outliers. We explore the 10 maximum outliers and find that these values are all voted quite highly. Despite the amount of outliers, we decide to keep of all values as they appear to be legitimate. The only films to had won Best Picture out of these maximum outlier values were “Titanic”, “The Lord of the Rings: The Return of the King” and “Forrest Gump”.
The summarizeColumns function is used to view a summary of the dataset with the new variable. We see that profit
has a negative minimum value, which would indicate films that did not make a profit after cinematic release. We explore these unsuccessful films. These films vary in release_date
and have slightly lower vote_average
values. None of these films won Best Picture.
oscar_metadata2 <- na.omit(oscar_metadata)
summarizeColumns(oscar_metadata2)
## name type na mean disp median mad
## 1 original_title factor 0 NA 0.9930556 NA NA
## 2 budget numeric 0 25.586745 37.1738463 14.00000 16.30860
## 3 id factor 0 NA 0.9965278 NA NA
## 4 release_date factor 0 NA 0.9652778 NA NA
## 5 revenue numeric 0 158.945729 260.9980755 75.99815 92.68516
## 6 runtime numeric 0 132.166667 28.7471253 127.00000 23.72160
## 7 vote_average numeric 0 7.297917 0.5785322 7.40000 0.51891
## 8 Winner factor 0 NA 0.2326389 NA NA
## min max nlevs
## 1 0.00 2.000 287
## 2 0.20 237.000 0
## 3 0.00 1.000 288
## 4 0.00 10.000 81
## 5 0.95 2787.965 0
## 6 66.00 248.000 0
## 7 4.60 8.500 0
## 8 67.00 221.000 2
oscar_metadata2$profit <- oscar_metadata2$revenue - oscar_metadata2$budget
boxplot(oscar_metadata2$profit, main="Box Plot of Profit", ylab="Profit",col = "grey")
outlier_profit <- boxplot.stats(oscar_metadata2$profit)$out
length(outlier_profit)
## [1] 24
subset_profit_outliers <- subset(oscar_metadata2, oscar_metadata2$profit > min(outlier_profit))
head(subset_profit_outliers[order(subset_profit_outliers$profit, decreasing = TRUE),],10)
## original_title budget id
## 14547 Avatar 237 19995
## 1639 Titanic 200 597
## 7000 The Lord of the Rings: The Return of the King 94 122
## 15343 Toy Story 3 200 10193
## 5814 The Lord of the Rings: The Two Towers 79 121
## 4863 The Lord of the Rings: The Fellowship of the Ring 93 120
## 257 Star Wars 11 11
## 15475 Inception 160 27205
## 2647 The Sixth Sense 40 745
## 352 Forrest Gump 55 13
## release_date revenue runtime vote_average Winner profit
## 14547 2009 2787.9651 162 7.2 Nominated 2550.9651
## 1639 1997 1845.0342 194 7.5 Won 1645.0342
## 7000 2003 1118.8890 201 8.1 Won 1024.8890
## 15343 2010 1066.9697 103 7.6 Nominated 866.9697
## 5814 2002 926.2874 179 8.0 Nominated 847.2874
## 4863 2001 871.3684 178 8.0 Nominated 778.3684
## 257 1977 775.3980 121 8.1 Nominated 764.3980
## 15475 2010 825.5328 148 8.1 Nominated 665.5328
## 2647 1999 672.8063 107 7.7 Nominated 632.8063
## 352 1994 677.9454 142 8.2 Won 622.9454
summarizeColumns(oscar_metadata2)
## name type na mean disp median mad
## 1 original_title factor 0 NA 0.9930556 NA NA
## 2 budget numeric 0 25.586745 37.1738463 14.00000 16.30860
## 3 id factor 0 NA 0.9965278 NA NA
## 4 release_date factor 0 NA 0.9652778 NA NA
## 5 revenue numeric 0 158.945729 260.9980755 75.99815 92.68516
## 6 runtime numeric 0 132.166667 28.7471253 127.00000 23.72160
## 7 vote_average numeric 0 7.297917 0.5785322 7.40000 0.51891
## 8 Winner factor 0 NA 0.2326389 NA NA
## 9 profit numeric 0 133.358984 235.2069421 53.54244 70.89049
## min max nlevs
## 1 0.00000 2.000 287
## 2 0.20000 237.000 0
## 3 0.00000 1.000 288
## 4 0.00000 10.000 81
## 5 0.95000 2787.965 0
## 6 66.00000 248.000 0
## 7 4.60000 8.500 0
## 8 67.00000 221.000 2
## 9 -29.71009 2550.965 0
losers <- subset(oscar_metadata2, oscar_metadata2$profit < 0)
losers[order(losers$profit),]
## original_title budget id release_date revenue runtime
## 2890 The Insider 90.000 9008 1999 60.28991 157
## 9540 The Aviator 116.000 2567 2004 102.00000 170
## 2026 Doctor Dolittle 18.000 16081 1967 9.00000 152
## 2630 The Mission 24.500 11416 1986 17.21802 126
## 1188 The Right Stuff 27.000 9549 1983 21.50000 193
## 8556 Mutiny on the Bounty 19.000 11085 1962 13.68000 178
## 3917 The Alamo 12.000 11209 1960 7.90000 167
## 9973 Wilson 2.995 84084 1944 2.00000 154
## vote_average Winner profit
## 2890 7.3 Nominated -29.710088
## 9540 7.0 Nominated -14.000000
## 2026 6.0 Nominated -9.000000
## 2630 7.1 Nominated -7.281977
## 1188 7.4 Nominated -5.500000
## 8556 6.5 Nominated -5.320000
## 3917 6.5 Nominated -4.100000
## 9973 5.5 Nominated -0.995000
The histogram of profit
displays a positively skewed distribution with values reaching over 2.5 billion dollars. The mean profit
value is ~133 million dollars.
mean_profit <- oscar_metadata2 %>%
mutate(Label = paste0("µ: ",prettyNum((mean(profit)))))
ggplot(oscar_metadata2, aes(x = profit, fill="grey")) + geom_histogram(colour = "white",bins = 100) +
labs(title = 'Histogram of Profit') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) +
scale_fill_manual(values = c("#696969")) + guides(fill=FALSE) +
geom_vline(data=oscar_metadata2, aes(xintercept=mean(profit)),linetype="dashed", size=1.3, colour="red", alpha=0.7) +
ylab("Frequency") + geom_text(data = mean_profit, aes(x = mean(profit), y = 65, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
xlab("Profit (in millions)") + xlim(-30, 2600)
## Warning: Removed 1 rows containing missing values (geom_bar).
The histogram for Revenue
follows the previous histogram with a positive skew. The boxplot for Nominated
displays multiple outliers above 400 million dollars. The distribution for Won
follows a similar pattern. The boxplot for Won
returns three outliers above 500 million dollars.
theme_set(theme_gray())
p9 <- ggplot(oscar_metadata2, aes(x = factor(1), y = profit)) +
geom_boxplot(width = .50) + scale_y_continuous(limits = c(-30, 2600)) + theme(axis.title.y=element_blank()) +
facet_grid(. ~ Winner)
p10 <- ggplot(oscar_metadata2, aes(x = profit)) +
geom_histogram(colour="white",aes(profit), fill="dodgerblue", alpha = 1/2,bins = 50) +
scale_x_continuous(limits = c(-30, 2600)) + ggtitle("Distribution of Profit") + facet_grid(. ~ Winner) +
theme(axis.title.x=element_blank())
plot_grid(p10, p9 + coord_flip() +
theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()),
ncol=1, align="v", rel_heights = c(2,1))
## Warning: Removed 2 rows containing missing values (geom_bar).
We explore correlations between corresponding numeric variables. We convert release_date
to numeric for the purpose of this scatterplot matrix. We can use this visualisation to quickly scan for the strongest relationship between variables. The strongest linear correlation is between budget
and revenue
with a value of 0.731. This indicates films that spend more money to create the film, tend to gain more revenue. This also indicates that films that spent less can expect a lower revenue.
We expect to see correlations for profit
with budget
and revenue
, as profit
is dependent on these variables. There is slight correlation between budget
and release_date
with a value of 0.49. We infer that older films did not have the same budget as newer films.
oscar_metadata2$release_date <- as.numeric(as.vector(oscar_metadata2$release_date))
ggpairs(oscar_metadata2, columns = c(2,4:7,9),axisLabels = "internal")
We segregrate the data by winners and nominees and create new subsets renamed respectively. We find that all correlations are higher when we focus on solely the award-winning films. Correlations with vote_average
appear to have changed significantly as opposed to nominated films. We also see slightly increased correlations with runtime
.
The strongest linear correlation for both plots is between budget
and revenue
. In comparison to the entire oscar_metadata2
dataset, the correlation within winners
increases by 13% and the correlation within nominees
decreases by 2%.
nominees <- subset(oscar_metadata2, oscar_metadata2$Winner == "Nominated")
winners <- subset(oscar_metadata2, oscar_metadata2$Winner == "Won")
ggpairs(nominees, columns = c(2,4:7,9),axisLabels = "internal", title = "Nominees")
ggpairs(winners, columns = c(2,4:7,9),axisLabels = "internal", title = "Winners")