Required packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(tidyr)
library(stringr)
library(stringdist)
library(mlr)
## Loading required package: ParamHelpers
library(ggplot2)
library(ggthemes)
library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggthemes':
## 
##     theme_map
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

Executive Summary

Each initial dataset underwent pre-processing procedures before merging. Using the summarizeColumns function we were able to constantly evaluate the progress of each procedure after every refinement. The first dataset, oscars, did not have many issues and minimal editing was conducted. The second dataset, movies, contained many anomalies which were later addressed after merging. We used partial string matching to merge both datasets. This worked quite efficiently although there was one minor mishap due to overfitting. Among the remaining variables, there were a few variables which were converted to different data types due to appropriateness of the type of data. NA values were not present in the merged dataset, however NA values were created to replace values storing 0 for budget and revenue. These were renamed as these values indicated an absence of data rather than zero dollars. We concluded the tidying process, by removing several redundant variables. A final summary of the dataset presented some new potential anomalies with unusual min values, these were addressed when handling outliers. Histograms and boxplots were created for all numeric variables to further understand the data, remove inconsistencies and emphasis potential outliers. We discovered a few observations with errors for budget, these were removed. However, overall the data did not appear to have many unusual outliers. For every numerical variable, winners and nominees of Best Picture were segregrated to investigate patterns and correlations. A new variable was created, profit, which would deem to be a good indicator of a film’s commercial success. However films with large profits did not necessarily ensure a Best Picture award, with most being just nominees. We explored correlations between numerical variables by creating scatterplox matrices, here we found the largest correlation could be found between a film’s budget and revenue. This correlation among others increased when subsetting the films that won Best Picture.

Data

This report will focus on Academy Award winning and nominated films. We explore two datasets, the first containing metadata of over 45,000 films listed on TMDB (www.themoviedb.org). This dataset can be found at https://www.kaggle.com/rounakbanik/the-movies-dataset and was collected using the TMDB API. The dataset consists of movies released on or before July 2017.

The variables for this dataset include

We also explore the Academy Awards Database containing the official record of past Academy Award winners and nominees. This dataset contains records from 1927 to 2015 (9964 observations). This dataset can be found at https://www.kaggle.com/theacademy/academy-awards.

The variables for this dataset include

We read both datasets into R using read.csv function and name the dataset containing the TMDB database as movies and name the dataset containing the Academy Awards database as oscars. We observe the oscars dataset first.

setwd("/Users/Justin/Desktop/DP - ASS3")
movies <- read.csv("movies_metadata.csv")
oscars <- read.csv("database.csv")

head(oscars)
##   Year Ceremony         Award Winner                Name             Film
## 1 1927        1         Actor     NA Richard Barthelmess        The Noose
## 2 1927        1         Actor      1       Emil Jannings The Last Command
## 3 1927        1       Actress     NA      Louise Dresser  A Ship Comes In
## 4 1927        1       Actress      1        Janet Gaynor       7th Heaven
## 5 1927        1       Actress     NA      Gloria Swanson   Sadie Thompson
## 6 1927        1 Art Direction     NA       Rochus Gliese          Sunrise
dim(oscars)
## [1] 9964    6

Data Preprocessing

Academy Awards Database

With summarizeColumns, we notice the following anomalies:

  • The Ceremony may provide redundant information. The Year column represents similar information, however with more specificity.
  • The Winner contains 7643 NA values. We assume these NA values represent a nominee who did not win. This may be better represented as a boolean or factor variable with renamed values.

As many films can be nominated for various awards and many actors/actresses may be nominated for several films over many years, we expect to have a varying number of levels among all factor variables.

summarizeColumns(oscars)
##       name    type   na    mean       disp median     mad min max nlevs
## 1     Year  factor    0      NA  0.9813328     NA      NA   3 186    95
## 2 Ceremony integer    0 45.4718 24.8413920     44 32.6172   1  88     0
## 3    Award  factor    0      NA  0.9569450     NA      NA   1 429   114
## 4   Winner integer 7643  1.0000  0.0000000      1  0.0000   1   1     0
## 5     Name  factor    0      NA  0.9980931     NA      NA   1  19  5754
## 6     Film  factor    0      NA  0.9665797     NA      NA   1 333  6361

Tidying oscars

As we are interested in specific films that have been nominated for Best Picture, we begin by subsetting the oscars dataset. Before 1962, the title for the award category, “Best Picture” had changed many times and initially was named “Outstanding Picture”. In total there are 5 different titles for the same award, we ensure each name change has been considered when subsetting the oscars dataset. We rename this dataset, best_picture. Now that we have filtered our original dataset, the Award column becomes redundant as all the rows essentially hold the same value for the same award. This also makes the Name and Film columns share the same information as Name tells us what or who is receiving the award and Film tells us the associated film. We are only dealing with film awards, therefore we only need one of these columns to represent the name of the film. We decide to remove Ceremony, Award and Film by using the select function.

The new dimensions of the best_picture are 528 observations and 3 columns. With summarizeColumns, we notice that the number of levels of Name is 524 whereas the number of total observations is 528. We investigate by creating a table of number of occurences of each value in Name. Essentially this should determine the possibility of any duplicate values. We subset the newly created table to display only values with more than one occurence. The table returns 4 films, all with frequencies of 2 which would explain the inequality. This may be due to film titles with identical names, we create an additional column, name_year which will store the concatenation of the name of the film and the year. Using the summarizeColumns function, we notice that the number of levels of this new column is equal to the number of total observations, this ensures that every film included is distinct. We conclude tidying the best_picture dataset by addressing the NA values in Winner. We change the NA values to 0 and convert the variable type to factor. Values are renamed for better readability.

best_picture <- oscars[ which(oscars$Award =='Outstanding Picture' | oscars$Award =='Outstanding Production' | oscars$Award =='Outstanding Motion Picture' | oscars$Award =='Best Motion Picture' | oscars$Award =='Best Picture'), ]

best_picture <- subset(best_picture, select= c("Year", "Winner", "Name"))

dim(best_picture)
## [1] 528   3
summarizeColumns(best_picture)
##     name    type  na mean      disp median mad min max nlevs
## 1   Year  factor   0   NA 0.9772727     NA  NA   0  12    89
## 2 Winner integer 439    1 0.0000000      1   0   1   1     0
## 3   Name  factor   0   NA 0.9962121     NA  NA   0   2   524
n_occur <- data.frame(table(best_picture$Name))
subset(n_occur, Freq > 1)
##                       Var1 Freq
## 1329            Cleopatra     2
## 2163      Heaven Can Wait     2
## 3249 Mutiny on the Bounty     2
## 3805     Romeo and Juliet     2
best_picture <- transform(best_picture,name_year=paste0(best_picture$Name,best_picture$Year))

summarizeColumns(best_picture)
##        name    type  na mean      disp median mad min max nlevs
## 1      Year  factor   0   NA 0.9772727     NA  NA   0  12    89
## 2    Winner integer 439    1 0.0000000      1   0   1   1     0
## 3      Name  factor   0   NA 0.9962121     NA  NA   0   2   524
## 4 name_year  factor   0   NA 0.9981061     NA  NA   1   1   528
best_picture$Winner[is.na(best_picture$Winner)] <- "0"
best_picture$Winner <- factor(best_picture$Winner, labels=c("Nominated","Won"))

head(best_picture)
##    Year    Winner                Name               name_year
## 20 1928 Nominated          The Racket          The Racket1928
## 21 1927 Nominated          7th Heaven          7th Heaven1927
## 22 1927       Won               Wings               Wings1927
## 63 1929 Nominated               Alibi               Alibi1929
## 64 1928 Nominated      In Old Arizona      In Old Arizona1928
## 65 1929       Won The Broadway Melody The Broadway Melody1929

TMDB Movie Database

Using the head function, we see that there are several variables, many of which storing irrelevant information. We decide to select columns of interest.

With summarizeColumns, we notice the following anomalies:

  • There are 45466 observations in this dataset, however there are 45436 levels in the id column. This may be due to duplicate values. We expect duplicate values in original_title as there is a good chance that many films share the same name.
  • The adult appears to only consist of binary values. There may be more suitable format types to represent these values.
  • The budget should be a continuous variable, there may be a format type more suitable to represent these values.
  • The revenue, runtime and vote_average contain NA values.
  • It is highly unlikely that a film has 0 revenue, this suggests that there may be some data entry errors.
head(movies)
##   adult
## 1 False
## 2 False
## 3 False
## 4 False
## 5 False
## 6 False
##                                                                                                                                             belongs_to_collection
## 1           {'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}
## 2                                                                                                                                                                
## 3     {'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}
## 4                                                                                                                                                                
## 5 {'id': 96871, 'name': 'Father of the Bride Collection', 'poster_path': '/nts4iOmNnq7GNicycMJ9pSAn204.jpg', 'backdrop_path': '/7qwE57OVZmMJChBpLEbJEmzUydk.jpg'}
## 6                                                                                                                                                                
##     budget
## 1 30000000
## 2 65000000
## 3        0
## 4 16000000
## 5        0
## 6 60000000
##                                                                                                                     genres
## 1                         [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
## 2                        [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
## 3                                                         [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
## 4                            [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
## 5                                                                                           [{'id': 35, 'name': 'Comedy'}]
## 6 [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]
##                               homepage    id   imdb_id original_language
## 1 http://toystory.disney.com/toy-story   862 tt0114709                en
## 2                                       8844 tt0113497                en
## 3                                      15602 tt0113228                en
## 4                                      31357 tt0114885                en
## 5                                      11862 tt0113041                en
## 6                                        949 tt0113277                en
##                original_title
## 1                   Toy Story
## 2                     Jumanji
## 3            Grumpier Old Men
## 4           Waiting to Exhale
## 5 Father of the Bride Part II
## 6                        Heat
##                                                                                                                                                                                                                                                                                                                                                                                                      overview
## 1                                                                                             Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.
## 2 When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.
## 3                                                                     A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.
## 4                                                                                                                              Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive "good man" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe.
## 5                                                                              Just when George Banks has recovered from his daughter's wedding, he receives the news that she's pregnant ... and that George's wife, Nina, is expecting too. He was planning on selling their home, but that's a plan that -- like George -- will have to change with the arrival of both a grandchild and a kid of his own.
## 6                                                             Obsessive master thief, Neil McCauley leads a top-notch crew on various insane heists throughout Los Angeles while a mentally unstable detective, Vincent Hanna pursues him without rest. Each man recognizes and respects the ability and the dedication of the other even though they are aware their cat-and-mouse game may end in violence.
##   popularity                      poster_path
## 1  21.946943 /rhIRbceoE9lR4veEXuwCC2wARtG.jpg
## 2  17.015539 /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
## 3    11.7129 /6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
## 4   3.859495 /16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
## 5   8.387519 /e64sOI48hQXyru7naBFyssKFxVd.jpg
## 6  17.924927 /zMyfPUelumio3tiDKPffaUpsQTD.jpg
##                                                                                                                  production_companies
## 1                                                                                      [{'name': 'Pixar Animation Studios', 'id': 3}]
## 2 [{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]
## 3                                                     [{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]
## 4                                                                     [{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]
## 5                                        [{'name': 'Sandollar Productions', 'id': 5842}, {'name': 'Touchstone Pictures', 'id': 9195}]
## 6             [{'name': 'Regency Enterprises', 'id': 508}, {'name': 'Forward Pass', 'id': 675}, {'name': 'Warner Bros.', 'id': 6194}]
##                                         production_countries release_date
## 1 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   1995-10-30
## 2 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   1995-12-15
## 3 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   1995-12-22
## 4 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   1995-12-22
## 5 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   1995-02-10
## 6 [{'iso_3166_1': 'US', 'name': 'United States of America'}]   1995-12-15
##     revenue runtime
## 1 373554033      81
## 2 262797249     104
## 3         0     101
## 4  81452156     127
## 5  76578911     106
## 6 187436818     170
##                                                                    spoken_languages
## 1                                          [{'iso_639_1': 'en', 'name': 'English'}]
## 2 [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]
## 3                                          [{'iso_639_1': 'en', 'name': 'English'}]
## 4                                          [{'iso_639_1': 'en', 'name': 'English'}]
## 5                                          [{'iso_639_1': 'en', 'name': 'English'}]
## 6  [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'es', 'name': 'Español'}]
##     status
## 1 Released
## 2 Released
## 3 Released
## 4 Released
## 5 Released
## 6 Released
##                                                                          tagline
## 1                                                                               
## 2                                      Roll the dice and unleash the excitement!
## 3                           Still Yelling. Still Fighting. Still Ready for Love.
## 4 Friends are the people who let you be yourself... and never let you forget it.
## 5 Just When His World Is Back To Normal... He's In For The Surprise Of His Life!
## 6                                                       A Los Angeles Crime Saga
##                         title video vote_average vote_count
## 1                   Toy Story False          7.7       5415
## 2                     Jumanji False          6.9       2413
## 3            Grumpier Old Men False          6.5         92
## 4           Waiting to Exhale False          6.1         34
## 5 Father of the Bride Part II False          5.7        173
## 6                        Heat False          7.7       1886
dim(movies)
## [1] 45466    24
movies <- subset(movies, select= c("original_title","adult", "budget", "id", "release_date", "revenue", "runtime", "vote_average"))

summarizeColumns(movies)
##             name    type  na         mean         disp median      mad min
## 1 original_title  factor   0           NA 9.998240e-01     NA       NA   1
## 2          adult  factor   0           NA 2.639335e-04     NA       NA   1
## 3         budget  factor   0           NA 1.955967e-01     NA       NA   1
## 4             id  factor   0           NA 9.999340e-01     NA       NA   1
## 5   release_date  factor   0           NA 9.970088e-01     NA       NA   1
## 6        revenue numeric   6 1.120935e+07 6.433225e+07      0  0.00000   0
## 7        runtime numeric 263 9.412820e+01 3.840781e+01     95 16.30860   0
## 8   vote_average numeric   6 5.618207e+00 1.924216e+00      6  1.33434   0
##          max nlevs
## 1          8 43373
## 2      45454     5
## 3      36573  1226
## 4          3 45436
## 5        136 17337
## 6 2787965087     0
## 7       1256     0
## 8         10     0

Tidying movies

We begin tidying the movies dataset by applying the distinct function, this will ensure that duplicate values with the same id are removed. The best_picture dataset contains films strictly between 1927 and 2015, therefore to minimize the size of movies we filter out films outside that timeframe. The release_date provides us with the corresponding day, month and year of release, we extract the year for filtering purposes. We also create a new variable, name_year that stores the concatenation of the name of the film and the year. Now we can use this variable as a key for merging and will ensure that films are matched by film title and year.

Further tidying will be applied after merging. This will address the previously listed anomalies.

movies <- movies %>% distinct(id, .keep_all = TRUE)
 
movies$release_date <- as.numeric(substring(movies$release_date,1,4))
movies <- subset(movies, movies$release_date >= 1927 & movies$release_date <= 2015)
movies <- transform(movies,name_year=paste0(movies$original_title,movies$release_date))

head(movies)
##                original_title adult   budget    id release_date   revenue
## 1                   Toy Story False 30000000   862         1995 373554033
## 2                     Jumanji False 65000000  8844         1995 262797249
## 3            Grumpier Old Men False        0 15602         1995         0
## 4           Waiting to Exhale False 16000000 31357         1995  81452156
## 5 Father of the Bride Part II False        0 11862         1995  76578911
## 6                        Heat False 60000000   949         1995 187436818
##   runtime vote_average                       name_year
## 1      81          7.7                   Toy Story1995
## 2     104          6.9                     Jumanji1995
## 3     101          6.5            Grumpier Old Men1995
## 4     127          6.1           Waiting to Exhale1995
## 5     106          5.7 Father of the Bride Part II1995
## 6     170          7.7                        Heat1995

Merging

We use the amatch function for merging. The amatch function is specifically used for partial string matching. This will ensure that delimiter issues or misspellings do not affect the outcome. As mentioned earlier, we use the newly created variable, name_year as the key. We create a function to search through the movies dataset for values that have partial string matches with the values in best_picture. For every match the function finds, the rbind function is used to add joining variables to a new dataset, merge.

The dimensions of the merge are 489 observations and 13 columns. We see a decrease of values as compared to the best_picture dataset which had 528 observations. The missing observations may not be in the movies dataset.

am<-amatch(movies$name_year,best_picture$name_year,maxDist = 1)
merge<-data.frame()
for (i in 1:dim(movies)[1]) 
  {
  
      if(!is.na(best_picture[am[i],]))
      {
        merge <-rbind(merge,data.frame(movies[i,],best_picture[am[i],]))
      }
  
}

dim(merge)
## [1] 489  13

Tidying - Removing Duplicates

With summarizeColumns, we notice that the number of levels of name_year.1 is 487 whereas the number of total observations is name_year is 489. This is unusual as these two variables who have equal number of levels. We check the values in name_year.1 and find a film named Room which occurs three times in merge. When we observed films with more than one occurence within best_picture, the film, Room was not present. We investigate values with the same film name. When we subset the merge dataset, we see that the values for name_year are all different. They incude Room2015, Room2005 and Zoom2015. It appears that the amatch function may have overfitted. To ensure that these films should be included in merge we check if they exist in best_picture. We see that only Room2015 exists, therefore we remove the misspecified values. We rename the updated the dataset oscar_metadata.

summarizeColumns(merge)
##              name    type na         mean         disp     median
## 1  original_title  factor  0           NA 9.959100e-01         NA
## 2           adult  factor  0           NA 0.000000e+00         NA
## 3          budget  factor  0           NA 6.441718e-01         NA
## 4              id  factor  0           NA 9.979550e-01         NA
## 5    release_date numeric  0 1.970168e+03 2.765924e+01     1969.0
## 6         revenue numeric  0 9.480191e+07 2.145595e+08 19255967.0
## 7         runtime numeric  0 1.242924e+02 2.799031e+01      120.0
## 8    vote_average numeric  0 6.900204e+00 9.570810e-01        7.1
## 9       name_year  factor  0           NA 9.979550e-01         NA
## 10           Year  factor  0           NA 9.775051e-01         NA
## 11         Winner  factor  0           NA 1.595092e-01         NA
## 12           Name  factor  0           NA 9.938650e-01         NA
## 13    name_year.1  factor  0           NA 9.938650e-01         NA
##            mad    min          max nlevs
## 1           NA    0.0          2.0   484
## 2           NA    0.0        489.0     1
## 3           NA    0.0        174.0   147
## 4           NA    0.0          1.0   489
## 5  4.00302e+01 1927.0       2015.0     0
## 6  2.85489e+07    0.0 2787965087.0     0
## 7  2.37216e+01    0.0        248.0     0
## 8  7.41300e-01    2.5          8.5     0
## 9           NA    0.0          1.0   489
## 10          NA    0.0         11.0    89
## 11          NA   78.0        411.0     2
## 12          NA    0.0          3.0   483
## 13          NA    0.0          3.0   487
n_occur2 <- data.frame(table(merge$name_year.1))
subset(n_occur2, Freq > 1)
##         Var1 Freq
## 314 Room2015    3
subset(merge, merge$name_year.1 == "Room2015")
##       original_title adult  budget     id release_date  revenue runtime
## 32124           Room False 6000000 264644         2015 35401758     117
## 36661           Room False       0  69917         2005        0       0
## 37876           Zoom False       0 351065         2015        0      96
##       vote_average name_year Year    Winner Name name_year.1
## 32124          8.1  Room2015 2015 Nominated Room    Room2015
## 36661          2.5  Room2005 2015 Nominated Room    Room2015
## 37876          5.7  Zoom2015 2015 Nominated Room    Room2015
subset(best_picture, best_picture$Name == "Room" | best_picture$Name == "Zoom")
##      Year    Winner Name name_year
## 9920 2015 Nominated Room  Room2015
oscar_metadata <-merge [!(merge$id == "69917" | merge$id == "351065"), ]

Tidying - Data Type Conversion

We review oscar_metadata using summarizeColumns. We see that name_year.1 has equal number of levels as name_year. Now that we have ensured that our dataset contains only distinct observations, we remove the following columns - name_year and name_year.1. release_year and Year have the same information, therefore we remove Year and original_title and Name have the same information, therefore we remove Name. We also remove Adult as it only has one level, indicating constant values. We see that revenue has a min value of 0 which is unusual and may indicate an absence of data rather than zero dollars. We decide to convert all 0 values in revenue to NA. We do the same for budget and convert budget to a numeric variable as it is continuous. release_date represents a categorical feature therefore we convert it to a factor variable. The head and summarizeColumns functions are used to view the cleaner update. The only anomaly that remains is that budget and revenue have extremely low min values. These may be data entry errors and will be addressed when outliers are handled.

summarizeColumns(oscar_metadata)
##              name    type na         mean         disp    median
## 1  original_title  factor  0           NA 9.958932e-01        NA
## 2           adult  factor  0           NA 0.000000e+00        NA
## 3          budget  factor  0           NA 6.468172e-01        NA
## 4              id  factor  0           NA 9.979466e-01        NA
## 5    release_date numeric  0 1.970004e+03 2.759570e+01 1.968e+03
## 6         revenue numeric  0 9.519124e+07 2.149142e+08 2.000e+07
## 7         runtime numeric  0 1.246057e+02 2.744356e+01 1.200e+02
## 8    vote_average numeric  0 6.911704e+00 9.363963e-01 7.100e+00
## 9       name_year  factor  0           NA 9.979466e-01        NA
## 10           Year  factor  0           NA 9.774127e-01        NA
## 11         Winner  factor  0           NA 1.601643e-01        NA
## 12           Name  factor  0           NA 9.958932e-01        NA
## 13    name_year.1  factor  0           NA 9.979466e-01        NA
##            mad    min          max nlevs
## 1           NA    0.0          2.0   483
## 2           NA    0.0        487.0     1
## 3           NA    0.0        172.0   147
## 4           NA    0.0          1.0   487
## 5  3.85476e+01 1927.0       2015.0     0
## 6  2.96520e+07    0.0 2787965087.0     0
## 7  2.37216e+01   66.0        248.0     0
## 8  7.41300e-01    2.5          8.5     0
## 9           NA    0.0          1.0   487
## 10          NA    0.0         11.0    89
## 11          NA   78.0        409.0     2
## 12          NA    0.0          2.0   483
## 13          NA    0.0          1.0   487
oscar_metadata <- subset(oscar_metadata, select= -c(adult, name_year, name_year.1, Year, Name))

oscar_metadata$revenue[oscar_metadata$revenue==0] <- NA

oscar_metadata$budget[oscar_metadata$budget==0] <- NA
oscar_metadata$budget <- as.numeric(as.vector(oscar_metadata$budget))

oscar_metadata$release_date <- as.factor(oscar_metadata$release_date)

head(oscar_metadata)
##            original_title   budget   id release_date   revenue runtime
## 17  Sense and Sensibility 16500000 4584         1995 135000000     136
## 34                   Babe 30000000 9598         1995 254134910      89
## 109            Braveheart 72000000  197         1995 210000000     177
## 110           Taxi Driver  1300000  103         1976  28262574     114
## 148             Apollo 13 52000000  568         1995 355237933     140
## 257             Star Wars 11000000   11         1977 775398007     121
##     vote_average    Winner
## 17           7.2 Nominated
## 34           6.0 Nominated
## 109          7.7       Won
## 110          8.1 Nominated
## 148          7.3 Nominated
## 257          8.1 Nominated
summarizeColumns(oscar_metadata)
##             name    type  na         mean         disp     median
## 1 original_title  factor   0           NA 9.958932e-01         NA
## 2         budget numeric 172 2.363877e+07 3.613357e+07 11000000.0
## 3             id  factor   0           NA 9.979466e-01         NA
## 4   release_date  factor   0           NA 9.774127e-01         NA
## 5        revenue numeric 181 1.514972e+08 2.550293e+08 68690110.5
## 6        runtime numeric   0 1.246057e+02 2.744356e+01      120.0
## 7   vote_average numeric   0 6.911704e+00 9.363963e-01        7.1
## 8         Winner  factor   0           NA 1.601643e-01         NA
##            mad  min          max nlevs
## 1           NA  0.0          2.0   483
## 2 1.363992e+07  4.0  237000000.0     0
## 3           NA  0.0          1.0   487
## 4           NA  2.0         11.0    89
## 5 8.541032e+07 25.0 2787965087.0     0
## 6 2.372160e+01 66.0        248.0     0
## 7 7.413000e-01  2.5          8.5     0
## 8           NA 78.0        409.0     2

Visualisations

We explore the distribution of all numeric variables to better understand the data.

Vote Average

The histogram below displays a slightly left skew, this is expected as most films nominated for Best Picture tend to be praised by the general public. The mean vote_average value is 6.91. We see that values do not exceed below 2.5 or beyond 8.5. When we look at the observations holding minimum values, we see that the three films listed are quite old and that these films were nominated for Best Picture, but were not winners in their respective years. The lack of success for these films may have been better explained if there was information about budget or revenue. We see that “The Shawshank Redemption” and “The Godfather” have the highest vote_average scores with values of 8.5. Both films with large revenue values compared to budget indicating a success cinematic release. “The Godfather” was awarded Best Picture in 1972, however “The Shawshank Redemption” failed to win despite being well-received among users.

theme_set(theme_solarized())

mean_vote <- oscar_metadata %>%
  mutate(Label = paste0("µ: ",prettyNum((mean(vote_average)))))

ggplot(oscar_metadata, aes(x = vote_average, fill="grey")) + geom_histogram(colour = "white",bins = 100) + 
  labs(title = 'Histogram of Vote Average') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) + 
  scale_fill_manual(values = c("#696969")) +  guides(fill=FALSE) + 
  geom_vline(data=oscar_metadata, aes(xintercept=mean(vote_average)),linetype="dashed", size=1.3, colour="red", alpha=0.7) + 
  ylab("Frequency") +  geom_text(data = mean_vote, aes(x = mean(vote_average),  y = 40, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
  xlab("Vote Average") + xlim(0, 10)

subset(oscar_metadata, oscar_metadata$vote_average == min(oscar_metadata$vote_average, na.rm=TRUE))
##          original_title budget    id release_date revenue runtime
## 16683       Trader Horn     NA 84296         1931      NA     122
## 25714   Flirtation Walk     NA 84100         1934      NA      97
## 27047 One Night of Love     NA 84097         1934      NA      84
##       vote_average    Winner
## 16683          2.5 Nominated
## 25714          2.5 Nominated
## 27047          2.5 Nominated
subset(oscar_metadata, oscar_metadata$vote_average == max(oscar_metadata$vote_average, na.rm=TRUE))
##               original_title  budget  id release_date   revenue runtime
## 315 The Shawshank Redemption 2.5e+07 278         1994  28341469     142
## 835            The Godfather 6.0e+06 238         1972 245066411     175
##     vote_average    Winner
## 315          8.5 Nominated
## 835          8.5       Won

Vote Average - Comparing Winning and Nominated films

When we segregrate the data by winners and nominees, we see that the distribution is spread out more for nominated films. This can be explained by the disproportion of values. As there can only be one winner per year and multiple nominees per year, we expect the distribution to be less variant for winners. The histogram for Nominated follows the previous histogram with a negative skew. The boxplot for Nominated displays multiple outliers after 5.0. There is no need to handle these outliers as they sit well within the suitable values for vote_average (between 1 and 10). The distribution for Won centres around 7.5, which is slightly higher than the mean of the entire distribution (6.91). We see 3 outliers that exceed below ~5.0, despite having such low vote_average scores, these films managed to win Best Picture in their respective years. We take a closer look at these outliers and see that these films are from the early 1930’s. We infer that older films tend to have the lower scores for vote_average. We explore correlations between variables later on. Despite these films being voted quite badly, “The Broadway Melody” and “Cavalcade” both have higher revenue values than budget values, indicating that these films were still a success.

theme_set(theme_gray())

p1 <- ggplot(oscar_metadata, aes(x = factor(1), y = vote_average)) +
  geom_boxplot(width = .50) + scale_y_continuous(limits = c(0, 10)) + theme(axis.title.y=element_blank()) +
  facet_grid(. ~ Winner)

p2 <- ggplot(oscar_metadata, aes(x = vote_average)) +
  geom_histogram(colour="white",aes(vote_average), fill="dodgerblue", alpha = 1/2,bins = 50) +
  scale_x_continuous(limits = c(0, 10)) + ggtitle("Distribution of Vote Average") + facet_grid(. ~ Winner) +
  theme(axis.title.x=element_blank())

plot_grid(p2, p1 + coord_flip() + 
theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()), 
ncol=1, align="v", rel_heights = c(2,1))

winning_films <- subset(oscar_metadata, oscar_metadata$Winner == "Won")
boxplot.stats(winning_films$vote_average)$out
## [1] 5.0 5.1 4.6
subset_winning_outliers <- subset(winning_films, winning_films$vote_average <= 5.1)
head(subset_winning_outliers[order(subset_winning_outliers$vote_average, decreasing = TRUE),])
##           original_title  budget    id release_date revenue runtime
## 1819            Cimarron      NA 42861         1931      NA     123
## 1817 The Broadway Melody  379000 65203         1929 4358000     100
## 1821           Cavalcade 1180280 56164         1933 7630000     110
##      vote_average Winner
## 1819          5.1    Won
## 1817          5.0    Won
## 1821          4.6    Won

Run Time

The histogram of runtime displays a slightly right skewed normal distribution with a wider spread of values. The mean runtime value is 125 minutes or 2 hours and 5 minutes. When we explore minimum values, we find that the shortest film in this dataset, “She Done Him Wrong” is 66 minutes long. The film with the longest run time is “Cleopatra” (1963) with a run time of over 3 hours. We explore these extreme values to ensure they are not outliers.

theme_set(theme_solarized())

mean_runtime <- oscar_metadata %>%
  mutate(Label = paste0("µ: ",prettyNum((mean(runtime)))))

ggplot(oscar_metadata, aes(x = runtime, fill="grey")) + geom_histogram(colour = "white",bins = 100) + 
  labs(title = 'Histogram of Run Time') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) + 
  scale_fill_manual(values = c("#696969")) +  guides(fill=FALSE) + 
  geom_vline(data=oscar_metadata, aes(xintercept=mean(runtime)),linetype="dashed", size=1.3, colour="red", alpha=0.7) + 
  ylab("Frequency") +  geom_text(data = mean_runtime, aes(x = mean(runtime),  y = 25, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
  xlab("Run Time (minutes)") + xlim(0, 250)

subset(oscar_metadata, oscar_metadata$runtime == min(oscar_metadata$runtime, na.rm=TRUE))
##          original_title budget    id release_date revenue runtime
## 8345 She Done Him Wrong  2e+05 43595         1933 2200000      66
##      vote_average    Winner
## 8345          5.2 Nominated
subset(oscar_metadata, oscar_metadata$runtime == max(oscar_metadata$runtime, na.rm=TRUE))
##      original_title   budget   id release_date revenue runtime
## 4146      Cleopatra 31115000 8095         1963 7.1e+07     248
##      vote_average    Winner
## 4146          6.7 Nominated

Run Time - Comparing Winning and Nominated films

The histogram for Nominated follows the previous histogram with a positive skew. The boxplot for Nominated displays multiple outliers above 175 minutes. The distribution for Won is relatively flat, however these films tend to be longer than nominated films. The boxplot for Won returns one outlier. As we saw previously, the film with the maximum value of 248 minutes didd not appear to be an error, therefore we do not handle these outliers as they fit within a suitable range for a film’s run time.

theme_set(theme_gray())

p3 <- ggplot(oscar_metadata, aes(x = factor(1), y = runtime)) +
  geom_boxplot(width = .50) + scale_y_continuous(limits = c(0, 250)) + theme(axis.title.y=element_blank()) +
  facet_grid(. ~ Winner)

p4 <- ggplot(oscar_metadata, aes(x = runtime)) +
  geom_histogram(colour="white",aes(runtime), fill="dodgerblue", alpha = 1/2,bins = 50) +
  scale_x_continuous(limits = c(0, 250)) + ggtitle("Distribution of Runtime") + facet_grid(. ~ Winner) + theme(axis.title.x=element_blank())

plot_grid(p4, p3 + coord_flip() + 
            theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()), ncol=1, align="v", rel_heights = c(2,1))

Budget - Outliers

We previously converted 0 values of budget to NA. We remove these observations when we investigate outliers. The boxlot of budget displays several outliers over a budget of 50 million dollars. As mentioned before we found rather unusual minimum values when reviewing the last summary of the whole dataset. Values as low as $4. We explore this by subsetting the dataset to return only values lower than 100,000. We discover 4 films with budgets of 4, 424, 340 and 762. All listed films are from the early 1930’s and may be storing incorrect budget values. Three of these films do not a value for revenue and the film that does, has a revenue of 25 dollars. We decide to remove these errors from the dataset and review the boxplot after the update. We also divide the budget values by a million for better readability. There appears to be little to no change to the distribution of budget values. The boxplot returns 30 outliers. We explore the 10 maximum outliers and find that these values are all voted quite highly and return higher revenue values. Despite the amount of outliers, we decide to keep of all values as they appear to be legitimate. Interestingly enough, only one of these films had won Best Picture (“Titanic”), suggesting that perhaps having a large budget does not correlate with winning Best Picture.

budget_noNAs <- oscar_metadata[!is.na(oscar_metadata$budget),]

boxplot(budget_noNAs$budget, main="Box Plot of Budget", ylab="Budget",col = "grey") 

subset(oscar_metadata, budget < 100000)
##                   original_title budget    id release_date revenue runtime
## 950           A Farewell to Arms      4 22649         1932      25      89
## 4791                Little Women    424 39938         1933      NA     115
## 7766                The Divorcee    340 80708         1930      NA      84
## 14371 One Hundred Men and a Girl    762 61647         1937      NA      84
##       vote_average    Winner
## 950            6.2 Nominated
## 4791           6.0 Nominated
## 7766           5.6 Nominated
## 14371          4.9 Nominated
oscar_metadata <-oscar_metadata [!(oscar_metadata$id == "22649" | oscar_metadata$id == "39938" |
                                   oscar_metadata$id == "80708" | oscar_metadata$id == "61647"), ]

oscar_metadata$budget <- oscar_metadata$budget/1000000
budget_noNAs <- oscar_metadata[!is.na(oscar_metadata$budget),]

boxplot(budget_noNAs$budget, main="Box Plot of Budget", ylab="Budget",col = "grey") 

outlier_budget <- boxplot.stats(budget_noNAs$budget)$out
length(outlier_budget)
## [1] 30
subset_budget_outliers <- subset(oscar_metadata, oscar_metadata$budget > min(outlier_budget))
head(subset_budget_outliers[order(subset_budget_outliers$budget, decreasing = TRUE),],10)
##                                        original_title budget     id
## 14547                                          Avatar    237  19995
## 1639                                          Titanic    200    597
## 15343                                     Toy Story 3    200  10193
## 13720                                              Up    175  14160
## 18085                                            Hugo    170  44826
## 15475                                       Inception    160  27205
## 6796  Master and Commander: The Far Side of the World    150   8619
## 13216             The Curious Case of Benjamin Button    150   4922
## 26537                              Mad Max: Fury Road    150  76341
## 31845                                    The Revenant    135 281957
##       release_date    revenue runtime vote_average    Winner
## 14547         2009 2787965087     162          7.2 Nominated
## 1639          1997 1845034188     194          7.5       Won
## 15343         2010 1066969703     103          7.6 Nominated
## 13720         2009  735099082      96          7.8 Nominated
## 18085         2011  185770160     126          7.0 Nominated
## 15475         2010  825532764     148          8.1 Nominated
## 6796          2003  212011111     138          6.9 Nominated
## 13216         2008  333932083     166          7.3 Nominated
## 26537         2015  378858340     120          7.3 Nominated
## 31845         2015  532950503     156          7.3 Nominated

Budget - Histogram

The histogram of budget displays a positively skewed distribution with values reaching close to 250 million dollars. The mean budget value is ~24 million dollars.

mean_budget <- budget_noNAs %>%
  mutate(Label = paste0("µ: ",prettyNum((mean(budget)))))

ggplot(budget_noNAs, aes(x = budget, fill="grey")) + geom_histogram(colour = "white",bins = 100) + 
  labs(title = 'Histogram of Budget') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) + 
  scale_fill_manual(values = c("#696969")) +  guides(fill=FALSE) + 
  geom_vline(data=budget_noNAs, aes(xintercept=mean(budget)),linetype="dashed", size=1.3, colour="red", alpha=0.7) + 
  ylab("Frequency") +  geom_text(data = mean_budget, aes(x = mean(budget),  y = 65, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
  xlab("Budget (in millions)") + xlim(0, 250)

Budget - Comparing Winning and Nominated films

The histogram for Nominated follows the previous histogram with a positive skew. The boxplot for Nominated displays multiple outliers above 75 million dollars. The distribution for Won follows a similar pattern, however values rarely exceed 100 million dollars. The boxplot for Won returns several outliers above 50 million dollars.

theme_set(theme_gray())

p5 <- ggplot(budget_noNAs, aes(x = factor(1), y = budget)) +
  geom_boxplot(width = .50) + scale_y_continuous(limits = c(0, 250)) + theme(axis.title.y=element_blank()) +
  facet_grid(. ~ Winner)

p6 <- ggplot(budget_noNAs, aes(x = budget)) +
  geom_histogram(colour="white",aes(budget), fill="dodgerblue", alpha = 1/2,bins = 50) +
  scale_x_continuous(limits = c(0, 250)) + ggtitle("Distribution of Budget") + facet_grid(. ~ Winner) +
  theme(axis.title.x=element_blank())

plot_grid(p6, p5 + coord_flip() + 
            theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()), 
          ncol=1, align="v", rel_heights = c(2,1))

Revenue - Outliers

Similar to budget, we previously converted 0 values of revenue to NA. We remove these observations when we investigate outliers. As mentioned before we found rather unusual minimum values when reviewing the last summary of the whole dataset. Values as low as $25. We explore this by calling the summarizeColumns function, here we see that the min value for revenue is 83,305 dollars. This is a suitable revenue, therefore we decide to keep this observation and review the boxplot for revenue. We also divide the revenue values by a million for better readability.

The boxlot of revenue displays several outliers over a revenue of 500 million dollars. Overall, the boxplot returns 24 outliers. We explore the 10 maximum outliers and find that these values are all voted quite highly. Despite the amount of outliers, we decide to keep of all values as they appear to be legitimate. The only films to had won Best Picture out of these maximum outlier values were “Titanic” and “The Lord of the Rings: The Return of the King”.

revenue_noNAs <- oscar_metadata[!is.na(oscar_metadata$revenue),]

summarizeColumns(revenue_noNAs)
##             name    type na         mean         disp     median
## 1 original_title  factor  0           NA 9.934426e-01         NA
## 2         budget numeric 17 2.558674e+01 3.717385e+01       14.0
## 3             id  factor  0           NA 9.967213e-01         NA
## 4   release_date  factor  0           NA 9.672131e-01         NA
## 5        revenue numeric  0 1.519939e+08 2.553001e+08 68706993.0
## 6        runtime numeric  0 1.314951e+02 2.819739e+01      126.0
## 7   vote_average numeric  0 7.289180e+00 5.688004e-01        7.3
## 8         Winner  factor  0           NA 2.229508e-01         NA
##            mad     min          max nlevs
## 1           NA     0.0          2.0   304
## 2 1.630860e+01     0.2        237.0     0
## 3           NA     0.0          1.0   305
## 4           NA     0.0         10.0    82
## 5 8.526425e+07 83305.0 2787965087.0     0
## 6 2.223900e+01    66.0        248.0     0
## 7 4.447800e-01     4.6          8.5     0
## 8           NA    68.0        237.0     2
oscar_metadata$revenue <- oscar_metadata$revenue/1000000
revenue_noNAs <- oscar_metadata[!is.na(oscar_metadata$revenue),]

boxplot(revenue_noNAs$revenue, main="Box Plot of Revenue", ylab="Revenue",col = "grey") 

outlier_revenue <- boxplot.stats(revenue_noNAs$revenue)$out
length(outlier_revenue)
## [1] 24
subset_revenue_outliers <- subset(oscar_metadata, oscar_metadata$revenue > min(outlier_revenue))
head(subset_revenue_outliers[order(subset_revenue_outliers$revenue, decreasing = TRUE),],10)
##                                          original_title budget    id
## 14547                                            Avatar    237 19995
## 1639                                            Titanic    200   597
## 7000      The Lord of the Rings: The Return of the King     94   122
## 15343                                       Toy Story 3    200 10193
## 5814              The Lord of the Rings: The Two Towers     79   121
## 4863  The Lord of the Rings: The Fellowship of the Ring     93   120
## 15475                                         Inception    160 27205
## 257                                           Star Wars     11    11
## 13720                                                Up    175 14160
## 21581                                           Gravity    105 49047
##       release_date   revenue runtime vote_average    Winner
## 14547         2009 2787.9651     162          7.2 Nominated
## 1639          1997 1845.0342     194          7.5       Won
## 7000          2003 1118.8890     201          8.1       Won
## 15343         2010 1066.9697     103          7.6 Nominated
## 5814          2002  926.2874     179          8.0 Nominated
## 4863          2001  871.3684     178          8.0 Nominated
## 15475         2010  825.5328     148          8.1 Nominated
## 257           1977  775.3980     121          8.1 Nominated
## 13720         2009  735.0991      96          7.8 Nominated
## 21581         2013  716.3927      91          7.3 Nominated

Revenue - Histogram

The histogram of revenue displays a positively skewed distribution with values reaching close to 2 billion dollars. The mean revenue value is ~152 million dollars. When comparing this mean to budget, we see that overall films tend to profit ~128 million dollars.

mean_revenue <- revenue_noNAs %>%
  mutate(Label = paste0("µ: ",prettyNum((mean(revenue)))))

ggplot(revenue_noNAs, aes(x = revenue, fill="grey")) + geom_histogram(colour = "white",bins = 100) + 
  labs(title = 'Histogram of Revenue') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) + 
  scale_fill_manual(values = c("#696969")) +  guides(fill=FALSE) + 
  geom_vline(data=revenue_noNAs, aes(xintercept=mean(revenue)),linetype="dashed", size=1.3, colour="red", alpha=0.7) + 
  ylab("Frequency") +  geom_text(data = mean_revenue, aes(x = mean(revenue),  y = 65, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
  xlab("Revenue (in millions)") + xlim(0, 2800)

Revenue - Comparing Winning and Nominated films

The histogram for Nominated follows the previous histogram with a positive skew. The boxplot for Nominated displays multiple outliers above 400 million dollars. The distribution for Won follows a similar pattern. The boxplot for Won returns three outliers above 500 million dollars.

theme_set(theme_gray())

p7 <- ggplot(revenue_noNAs, aes(x = factor(1), y = revenue)) +
  geom_boxplot(width = .50) + scale_y_continuous(limits = c(0, 2800)) + theme(axis.title.y=element_blank()) +
  facet_grid(. ~ Winner)

p8 <- ggplot(revenue_noNAs, aes(x = revenue)) +
  geom_histogram(colour="white",aes(revenue), fill="dodgerblue", alpha = 1/2,bins = 50) +
  scale_x_continuous(limits = c(0, 2800)) + ggtitle("Distribution of Revenue") + facet_grid(. ~ Winner) +
  theme(axis.title.x=element_blank())

plot_grid(p8, p7 + coord_flip() + 
            theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()), 
          ncol=1, align="v", rel_heights = c(2,1))

Profit

We create a new variable, named profit. This is calculated by subtracting the budget from the revenue. Before creating profit, we have to remove all NA values. The summary belows confirms this. We also change the name of the updated dataset to oscar_metadata2.

The boxlot of profit displays several outliers over a profit of 500 million dollars. Overall, the boxplot returns 24 outliers. We explore the 10 maximum outliers and find that these values are all voted quite highly. Despite the amount of outliers, we decide to keep of all values as they appear to be legitimate. The only films to had won Best Picture out of these maximum outlier values were “Titanic”, “The Lord of the Rings: The Return of the King” and “Forrest Gump”.

The summarizeColumns function is used to view a summary of the dataset with the new variable. We see that profit has a negative minimum value, which would indicate films that did not make a profit after cinematic release. We explore these unsuccessful films. These films vary in release_date and have slightly lower vote_average values. None of these films won Best Picture.

oscar_metadata2 <- na.omit(oscar_metadata)
summarizeColumns(oscar_metadata2)
##             name    type na       mean        disp    median      mad
## 1 original_title  factor  0         NA   0.9930556        NA       NA
## 2         budget numeric  0  25.586745  37.1738463  14.00000 16.30860
## 3             id  factor  0         NA   0.9965278        NA       NA
## 4   release_date  factor  0         NA   0.9652778        NA       NA
## 5        revenue numeric  0 158.945729 260.9980755  75.99815 92.68516
## 6        runtime numeric  0 132.166667  28.7471253 127.00000 23.72160
## 7   vote_average numeric  0   7.297917   0.5785322   7.40000  0.51891
## 8         Winner  factor  0         NA   0.2326389        NA       NA
##     min      max nlevs
## 1  0.00    2.000   287
## 2  0.20  237.000     0
## 3  0.00    1.000   288
## 4  0.00   10.000    81
## 5  0.95 2787.965     0
## 6 66.00  248.000     0
## 7  4.60    8.500     0
## 8 67.00  221.000     2
oscar_metadata2$profit <- oscar_metadata2$revenue - oscar_metadata2$budget

boxplot(oscar_metadata2$profit, main="Box Plot of Profit", ylab="Profit",col = "grey") 

outlier_profit <- boxplot.stats(oscar_metadata2$profit)$out
length(outlier_profit)
## [1] 24
subset_profit_outliers <- subset(oscar_metadata2, oscar_metadata2$profit > min(outlier_profit))
head(subset_profit_outliers[order(subset_profit_outliers$profit, decreasing = TRUE),],10)
##                                          original_title budget    id
## 14547                                            Avatar    237 19995
## 1639                                            Titanic    200   597
## 7000      The Lord of the Rings: The Return of the King     94   122
## 15343                                       Toy Story 3    200 10193
## 5814              The Lord of the Rings: The Two Towers     79   121
## 4863  The Lord of the Rings: The Fellowship of the Ring     93   120
## 257                                           Star Wars     11    11
## 15475                                         Inception    160 27205
## 2647                                    The Sixth Sense     40   745
## 352                                        Forrest Gump     55    13
##       release_date   revenue runtime vote_average    Winner    profit
## 14547         2009 2787.9651     162          7.2 Nominated 2550.9651
## 1639          1997 1845.0342     194          7.5       Won 1645.0342
## 7000          2003 1118.8890     201          8.1       Won 1024.8890
## 15343         2010 1066.9697     103          7.6 Nominated  866.9697
## 5814          2002  926.2874     179          8.0 Nominated  847.2874
## 4863          2001  871.3684     178          8.0 Nominated  778.3684
## 257           1977  775.3980     121          8.1 Nominated  764.3980
## 15475         2010  825.5328     148          8.1 Nominated  665.5328
## 2647          1999  672.8063     107          7.7 Nominated  632.8063
## 352           1994  677.9454     142          8.2       Won  622.9454
summarizeColumns(oscar_metadata2)
##             name    type na       mean        disp    median      mad
## 1 original_title  factor  0         NA   0.9930556        NA       NA
## 2         budget numeric  0  25.586745  37.1738463  14.00000 16.30860
## 3             id  factor  0         NA   0.9965278        NA       NA
## 4   release_date  factor  0         NA   0.9652778        NA       NA
## 5        revenue numeric  0 158.945729 260.9980755  75.99815 92.68516
## 6        runtime numeric  0 132.166667  28.7471253 127.00000 23.72160
## 7   vote_average numeric  0   7.297917   0.5785322   7.40000  0.51891
## 8         Winner  factor  0         NA   0.2326389        NA       NA
## 9         profit numeric  0 133.358984 235.2069421  53.54244 70.89049
##         min      max nlevs
## 1   0.00000    2.000   287
## 2   0.20000  237.000     0
## 3   0.00000    1.000   288
## 4   0.00000   10.000    81
## 5   0.95000 2787.965     0
## 6  66.00000  248.000     0
## 7   4.60000    8.500     0
## 8  67.00000  221.000     2
## 9 -29.71009 2550.965     0
losers <- subset(oscar_metadata2, oscar_metadata2$profit < 0)
losers[order(losers$profit),]
##            original_title  budget    id release_date   revenue runtime
## 2890          The Insider  90.000  9008         1999  60.28991     157
## 9540          The Aviator 116.000  2567         2004 102.00000     170
## 2026      Doctor Dolittle  18.000 16081         1967   9.00000     152
## 2630          The Mission  24.500 11416         1986  17.21802     126
## 1188      The Right Stuff  27.000  9549         1983  21.50000     193
## 8556 Mutiny on the Bounty  19.000 11085         1962  13.68000     178
## 3917            The Alamo  12.000 11209         1960   7.90000     167
## 9973               Wilson   2.995 84084         1944   2.00000     154
##      vote_average    Winner     profit
## 2890          7.3 Nominated -29.710088
## 9540          7.0 Nominated -14.000000
## 2026          6.0 Nominated  -9.000000
## 2630          7.1 Nominated  -7.281977
## 1188          7.4 Nominated  -5.500000
## 8556          6.5 Nominated  -5.320000
## 3917          6.5 Nominated  -4.100000
## 9973          5.5 Nominated  -0.995000

Profit - Histogram

The histogram of profit displays a positively skewed distribution with values reaching over 2.5 billion dollars. The mean profit value is ~133 million dollars.

mean_profit <- oscar_metadata2 %>%
  mutate(Label = paste0("µ: ",prettyNum((mean(profit)))))

ggplot(oscar_metadata2, aes(x = profit, fill="grey")) + geom_histogram(colour = "white",bins = 100) +
  labs(title = 'Histogram of Profit') + theme_solarized() + theme(plot.title = element_text(hjust = 0.5, face="bold")) + 
  scale_fill_manual(values = c("#696969")) +  guides(fill=FALSE) + 
  geom_vline(data=oscar_metadata2, aes(xintercept=mean(profit)),linetype="dashed", size=1.3, colour="red", alpha=0.7) + 
  ylab("Frequency") +  geom_text(data = mean_profit, aes(x = mean(profit),  y = 65, label = Label), family = "sans", size = 4.5, hjust = -.2, colour = "darksalmon") +
  xlab("Profit (in millions)") + xlim(-30, 2600)
## Warning: Removed 1 rows containing missing values (geom_bar).

Profit - Comparing Winning and Nominated films

The histogram for Revenue follows the previous histogram with a positive skew. The boxplot for Nominated displays multiple outliers above 400 million dollars. The distribution for Won follows a similar pattern. The boxplot for Won returns three outliers above 500 million dollars.

theme_set(theme_gray())

p9 <- ggplot(oscar_metadata2, aes(x = factor(1), y = profit)) +
  geom_boxplot(width = .50) + scale_y_continuous(limits = c(-30, 2600)) + theme(axis.title.y=element_blank()) +
  facet_grid(. ~ Winner)

p10 <- ggplot(oscar_metadata2, aes(x = profit)) +
  geom_histogram(colour="white",aes(profit), fill="dodgerblue", alpha = 1/2,bins = 50) +
  scale_x_continuous(limits = c(-30, 2600)) + ggtitle("Distribution of Profit") + facet_grid(. ~ Winner) +
  theme(axis.title.x=element_blank())

plot_grid(p10, p9 + coord_flip() + 
            theme(axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank()), 
          ncol=1, align="v", rel_heights = c(2,1))
## Warning: Removed 2 rows containing missing values (geom_bar).

Multivariate Visualisation

We explore correlations between corresponding numeric variables. We convert release_date to numeric for the purpose of this scatterplot matrix. We can use this visualisation to quickly scan for the strongest relationship between variables. The strongest linear correlation is between budget and revenue with a value of 0.731. This indicates films that spend more money to create the film, tend to gain more revenue. This also indicates that films that spent less can expect a lower revenue.

We expect to see correlations for profit with budget and revenue, as profit is dependent on these variables. There is slight correlation between budget and release_date with a value of 0.49. We infer that older films did not have the same budget as newer films.

oscar_metadata2$release_date <- as.numeric(as.vector(oscar_metadata2$release_date))

ggpairs(oscar_metadata2, columns = c(2,4:7,9),axisLabels = "internal")

Multivariate Visualisation - Comparing Winning and Nominated films

We segregrate the data by winners and nominees and create new subsets renamed respectively. We find that all correlations are higher when we focus on solely the award-winning films. Correlations with vote_average appear to have changed significantly as opposed to nominated films. We also see slightly increased correlations with runtime.

The strongest linear correlation for both plots is between budget and revenue. In comparison to the entire oscar_metadata2 dataset, the correlation within winners increases by 13% and the correlation within nominees decreases by 2%.

nominees <- subset(oscar_metadata2, oscar_metadata2$Winner == "Nominated")
winners <- subset(oscar_metadata2, oscar_metadata2$Winner == "Won")

ggpairs(nominees, columns = c(2,4:7,9),axisLabels = "internal",  title = "Nominees")

ggpairs(winners, columns = c(2,4:7,9),axisLabels = "internal", title = "Winners")