Capstone EDA

The Data

library(tidyverse)
library(caret)
library(DMwR2)
library(e1071)
library(ipred)
library(xgboost)
library(kableExtra)
library(ggthemes)
library(egg)
library(forcats)
library(summarytools)
library(forecast)
library(doParallel)

Reading in the data,

df <- read.csv("movies_clean_v4.csv")
kable(head(df,1)) |> kable_styling()
title year rated released runtime genre director writer actors plot language country awards poster ratings metascore imdbrating imdbvotes imdbid type dvd boxoffice production website response directorid actorid budget num_directors num_actors actor_rev5 actor_rev10 director_films5 actor_films5 actor_films10 director_rev10 director_films10 director_rev5
Spider-Man 2002 PG-13 2002-05-03 121 Action Sam Raimi Stan Lee, Steve Ditko, David Koepp Tobey Maguire, Kirsten Dunst, Willem Dafoe After being bitten by a genetically-modified spider, a shy teenager gains spider-like abilities that he uses to fight injustice as a masked superhero and face a vengeful enemy. English United States Nominated for 2 Oscars. 17 wins & 63 nominations total https://m.media-amazon.com/images/M/MV5BZDEyN2NhMjgtMjdhNi00MmNlLWE5YTgtZGE4MzNjMTRlMGEwXkEyXkFqcGdeQXVyNDUyOTg3Njg@._V1_SX300.jpg {‘Source’: ‘Internet Movie Database’, ‘Value’: ‘7.4/10’} 73 7.4 852,452 tt0145487 movie 25 Apr 2013 407022860 NA True [‘nm0000600’] [‘nm0001497’, ‘nm0000379’, ‘nm0000353’] 1.39e+08 1 3 0 0 0 0 0 0 0 0

First we’ll subset the desired columns,

colnames(df)
##  [1] "title"            "year"             "rated"            "released"        
##  [5] "runtime"          "genre"            "director"         "writer"          
##  [9] "actors"           "plot"             "language"         "country"         
## [13] "awards"           "poster"           "ratings"          "metascore"       
## [17] "imdbrating"       "imdbvotes"        "imdbid"           "type"            
## [21] "dvd"              "boxoffice"        "production"       "website"         
## [25] "response"         "directorid"       "actorid"          "budget"          
## [29] "num_directors"    "num_actors"       "actor_rev5"       "actor_rev10"     
## [33] "director_films5"  "actor_films5"     "actor_films10"    "director_rev10"  
## [37] "director_films10" "director_rev5"
cols <- c("year","rated","released","runtime","genre","language","country","boxoffice","budget","num_directors","num_actors","actor_films5","actor_films10","actor_rev5","actor_rev10","director_films5","director_films10","director_rev5","director_rev10")

df <- df |>
  select(all_of(cols))

kable(head(df)) |> kable_styling()
year rated released runtime genre language country boxoffice budget num_directors num_actors actor_films5 actor_films10 actor_rev5 actor_rev10 director_films5 director_films10 director_rev5 director_rev10
2002 PG-13 2002-05-03 121 Action English United States 407022860 1.39e+08 1 3 0 0 0 0 0 0 0 0
2002 PG 2002-05-16 142 Action English United States 310676740 1.15e+08 1 3 1 1 108638745 108638745 0 0 0 0
2002 PG 2002-11-15 161 Adventure English, Latin United Kingdom, United States 262641637 1.00e+08 1 3 0 0 0 0 0 0 0 0
2002 PG-13 2002-08-02 106 Drama English, Portuguese United States 227966634 7.20e+07 1 3 1 1 78122718 78122718 0 0 0 0
2002 PG 2002-08-02 95 Comedy English, Greek Canada, United States 241438208 5.00e+06 1 3 0 0 0 0 0 0 0 0
2002 PG-13 2002-12-18 179 Action English, Sindarin, Old English New Zealand, United States 342952511 9.40e+07 1 3 0 0 0 0 0 0 0 0

The following columns will be converted to factors:

rated - contains the film’s ratings.
genre - contains the film’s genre. num_actors - the number of leading actors per Imdb.
num_directors - the number of directors per Imdb.

The following column will be converted to date:

released - contains the film’s release date

df <- df |>
  mutate(rated = as.factor(rated),
         genre = as.factor(genre),
         num_actors = as.factor(num_actors),
         num_directors = as.factor(num_directors),
         released = as.Date(released))

Extracting the month from release date,

df <- df |>
  mutate(month = as.factor(months(released)))

Let’s see how many unique values there are in the language and country columnns,

First, language:

n_distinct(df$language)
## [1] 688

We take a look at the unique languages and notice that there are some languages that contain ““, and some that contain”None”. First we convert the empty strings to “None”, then we update the variable column: if a film contains the language English, we label it “English”, if the film contains the language “None”, we leave it as “None”, otherwise, we label it “Foreign”.

# replace empty strings with "None"
df$language <- ifelse(df$language == "", "None", df$language)

# convert language values to "English", "Foreign", or "None"
df$language <- as.factor(ifelse(grepl("English", df$language), 
                                "English",
                                ifelse(df$language == "None",
                                       "None","Foreign")))

kable(sort(table(df$language), decreasing=TRUE), col.names=NULL) |>
  kable_styling()
English 3490
Foreign 157
None 10

Now we take a look at country.

n_distinct(df$country)
## [1] 618

One of the issues here, like with the language column, is that there is no hierarchy given - countries are listed in alphabetical order. As such, we’ll assume that if the United States is listed, the film can be considered domestic, whereas if United States is not listed, it will be considered foreign. We also observe that there are some missing observations that contain empty strings. We’ll fill those as “None”.

# fill empty rows
df$country <- ifelse(df$country == "", "None", df$country)

# recode the country variable
df$country <- as.factor(ifelse(grepl("United States", df$country), 
                               "United States",
                               ifelse(df$country == "None", 
                                      "None","Foreign")))

kable(sort(table(df$country), decreasing=TRUE), col.names=NULL) |>
  kable_styling()
United States 3307
Foreign 346
None 4

Lastly, we examine our columns stored as double precision. We observe that the actor and director revenue columns contain some floats, so we leave those alone. We change the datatype for the budget column from double precision to integer.

df$budget <- as.integer(df$budget)

The data we are interested in is from 2012 on. We collected data beginning in 2002 in order to calculate the actor and director earnings for the previous 5- to 10- years. So now we subset the dataframe to only records from 2012 on and drop the year and released column altogether.

df <- df |>
  filter(year >= 2012) |>
  select(-c(year,released))

Exploratory Data Analysis

We now have a dataframe consisting of 1859 observations and 18 variables. Let’s take a look first at our missing values.

colnames(df)[colSums(is.na(df)) > 0]
## [1] "budget"

Only budget contains NAs,

colMeans(is.na(df)) %>% .[. > 0]
##    budget 
## 0.2383002

About 24% of the films are missing budget numbers. However, this missingness might prove informative:

ggplot(df, aes(y = ifelse(is.na(budget), "Without", "With"), x = boxoffice*0.000001)) + 
  geom_boxplot(fill = "lightblue") +
  labs(x = "(in millions)",
       y = "",
       title = "Revenue Distribution With/Without Budget") +
  theme_few() + 
  theme(plot.title = element_text(hjust = 0.5))

We can see that the films that do not have budget data demonstrate noticeably lower revenue which is out outcome variable, so we will keep the NAs for our model.

Let’s view the summary statistics for the dataset,

summary(df)
##        rated        runtime            genre        language   
##  R        :738   Min.   : 39.0   Action   :519   English:1732  
##  PG-13    :695   1st Qu.: 97.0   Comedy   :352   Foreign: 118  
##  PG       :271   Median :108.0   Drama    :295   None   :   9  
##  Not Rated: 74   Mean   :111.3   Biography:173                 
##           : 41   3rd Qu.:122.0   Animation:155                 
##  G        : 18   Max.   :195.0   Adventure: 91                 
##  (Other)  : 22                   (Other)  :274                 
##           country       boxoffice             budget          num_directors
##  Foreign      : 236   Min.   :   159014   Min.   :   100000   1:1667       
##  None         :   4   1st Qu.:  4941632   1st Qu.: 12000000   2: 167       
##  United States:1619   Median : 20777061   Median : 30000000   3:  25       
##                       Mean   : 54189923   Mean   : 52157266                
##                       3rd Qu.: 60093010   3rd Qu.: 68000000                
##                       Max.   :936662225   Max.   :356000000                
##                                           NA's   :443                      
##  num_actors  actor_films5    actor_films10     actor_rev5       
##  0:   6     Min.   : 0.000   Min.   : 0.00   Min.   :0.000e+00  
##  1:   6     1st Qu.: 1.000   1st Qu.: 2.00   1st Qu.:1.104e+07  
##  2:   1     Median : 4.000   Median : 8.00   Median :2.138e+08  
##  3:1846     Mean   : 5.455   Mean   :10.09   Mean   :4.326e+08  
##             3rd Qu.: 9.000   3rd Qu.:16.00   3rd Qu.:6.438e+08  
##             Max.   :27.000   Max.   :44.00   Max.   :4.470e+09  
##                                                                 
##   actor_rev10        director_films5 director_films10 director_rev5      
##  Min.   :0.000e+00   Min.   :0.000   Min.   : 0.00    Min.   :0.000e+00  
##  1st Qu.:4.814e+07   1st Qu.:0.000   1st Qu.: 0.00    1st Qu.:0.000e+00  
##  Median :4.503e+08   Median :0.000   Median : 1.00    Median :0.000e+00  
##  Mean   :7.592e+08   Mean   :0.759   Mean   : 1.53    Mean   :6.138e+07  
##  3rd Qu.:1.168e+09   3rd Qu.:1.000   3rd Qu.: 2.00    3rd Qu.:6.549e+07  
##  Max.   :7.316e+09   Max.   :9.000   Max.   :14.00    Max.   :2.174e+09  
##                                                                          
##  director_rev10            month    
##  Min.   :0.000e+00   August   :185  
##  1st Qu.:0.000e+00   October  :182  
##  Median :1.273e+07   November :172  
##  Mean   :1.244e+08   September:163  
##  3rd Qu.:1.504e+08   December :158  
##  Max.   :2.693e+09   February :155  
##                      (Other)  :844

Looking at only the numeric variables,

descr(df,
      stats = c("min","q1","med","mean","q3","sd","max"),
      round.digits = 2,
      transpose = TRUE,
      headings = FALSE,
      style = "rmarkdown") |>
  kable(format = "html", digits = 2) |>
  kable_styling(full_width = FALSE) |>
  scroll_box(height = "200px", width = "100%")
Min Q1 Median Mean Q3 Std.Dev Max
actor_films10 0 2 8 10.09 16 9.28 44
actor_films5 0 1 4 5.45 9 5.16 27
actor_rev10 0 48091713 450255348 759167254.42 1168551880 897987469.29 7316416397
actor_rev5 0 10726630 213778031 432617771.15 644587426 571830292.26 4469634688
boxoffice 159014 4936819 20777061 54189922.78 60311495 92859139.66 936662225
budget 100000 12000000 30000000 52157266.12 68000000 58391659.06 356000000
director_films10 0 0 1 1.53 2 1.99 14
director_films5 0 0 0 0.76 1 1.07 9
director_rev10 0 0 12727256 124370528.27 150394119 230007052.55 2693332806
director_rev5 0 0 0 61377871.21 65565279 135630420.74 2173799662
runtime 39 97 108 111.27 122 19.84 195

We can see that there is a wide range of values across the numeric variables. We see there is a film category that contains an empty string, so we convert those to “None”.

# replace empty strings with "None"
df$rated <- fct_recode(df$rated, "None" = "", .default = "Other")
## Warning: Unknown levels in `f`: Other
kable(sort(prop.table(table(df$rated)), decreasing=TRUE), col.names=NULL) |>
  kable_styling()
R 0.3969876
PG-13 0.3738569
PG 0.1457773
Not Rated 0.0398063
None 0.0220549
G 0.0096826
TV-MA 0.0032275
Unrated 0.0032275
TV-14 0.0026896
TV-PG 0.0021517
16+ 0.0005379
NC-17 0.0000000
df |> 
  select(rated) |>
  mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), "R/PG/PG-13", "Other")) |>
  table() |>
  prop.table() |>
  sort(decreasing = TRUE) |>
  kable(col.names=NULL) |>
  kable_styling() 
R/PG/PG-13 0.9166218
Other 0.0833782

Nearly 80% of all films in our dataset are rated R or PG-13.

df |>
  select(rated, boxoffice) |>
  mutate(rated = as.character(rated)) |>
  mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), rated, "Other")) |>
  ggplot(aes(x = boxoffice*0.000001, y = rated)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "(in millions)",
       y = "",
       title = "Revenue By Rating") +
  theme_few() + 
  theme(plot.title = element_text(hjust = 0.5))

df |>
  select(rated, boxoffice) |>
  mutate(rated = as.character(rated)) |>
  mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), rated, "Other")) |>
  group_by(rated) |>
  summarise(median = median(boxoffice)) |>
  arrange(desc(median)) |>
  kable() |>
  kable_styling()
rated median
PG 46700633
PG-13 32015231
R 15841514
Other 2553002

For the purpose of this study, given that we are examining a film’s performance represented as a function of domestic box office, we restrict our dataset to films that contain valid ratings.

df <- df |>
  filter(rated %in% c("G","NC-17","PG","PG-13","R")) |>
  mutate(rated = droplevels(rated))

Nearly all films are in English (93.2%) or from the United States (87.1%).

kable(sort(round(prop.table(table(df$language)),3), decreasing=TRUE), col.names=NULL) |>
  kable_styling()
English 0.974
Foreign 0.025
None 0.001
kable(sort(round(prop.table(table(df$country)),3), decreasing=TRUE), col.names=NULL) |>
  kable_styling()
United States 0.924
Foreign 0.075
None 0.001
p1 = ggplot(df, aes(y = ifelse(language == "English", "English", "Other"), 
                    x = boxoffice*0.000001)) + 
  geom_boxplot(fill = "lightblue") +
  labs(x = "(in millions)",
       y = "",
       title = "Revenue by Language") +
  theme_few() + 
  theme(plot.title = element_text(hjust = 0.5))

p2 = ggplot(df, aes(y = ifelse(country == "United States", "United States", "Other"), 
                    x = boxoffice*0.000001)) + 
  geom_boxplot(fill = "lightblue") +
  labs(x = "(in millions)",
       y = "",
       title = "Revenue by Country") +
  theme_few() + 
  theme(plot.title = element_text(hjust = 0.5))

ggarrange(p1, p2, ncol=2)

Nearly all films list 3 leading actors (99.3%), and 1 director (89.7%).

kable(sort(round(prop.table(table(df$num_actors)),3), decreasing=TRUE), col.names=NULL) |>
  kable_styling()
3 0.995
1 0.003
0 0.001
2 0.001
kable(sort(round(prop.table(table(df$num_directors)),3), decreasing=TRUE), col.names=NULL) |>
  kable_styling()
1 0.898
2 0.092
3 0.010

Looking at distribution of the genre variable,

df$genre |>
  table() |>
  prop.table() |>
  sort(decreasing = TRUE) |>
  kable(col.names = NULL) |>
  kable_styling()
Action 0.2729384
Comedy 0.1945412
Drama 0.1614402
Biography 0.0952381
Animation 0.0836237
Adventure 0.0516841
Horror 0.0493612
Crime 0.0412311
Documentary 0.0389082
Fantasy 0.0040650
Thriller 0.0029036
Mystery 0.0017422
Family 0.0005807
History 0.0005807
Music 0.0005807
Musical 0.0005807
Romance 0.0000000
Sci-Fi 0.0000000
Short 0.0000000
Talk-Show 0.0000000
df |>
  mutate(genre = as.character(genre)) |>
  mutate(genre = ifelse(genre %in% c("Action","Comedy","Drama"), genre, "Other")) |>
  select(genre) |>
  table() |>
  prop.table() |>
  kable(col.names = NULL) |>
  kable_styling()
Action 0.2729384
Comedy 0.1945412
Drama 0.1614402
Other 0.3710801

We take all genres that represent less than 1% of films and combine them into the category “Other”,

other_genres <- df |>
  count(genre) |>
  mutate(pct = n/sum(n)) |>
  filter(pct < 0.01) |>
  pull(genre)

df <- df |>
  mutate(genre = fct_collapse(genre, Other = other_genres),
         genre = droplevels(genre)) 

Let’s take a look at the distribution of the numeric variables,

df |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_density(fill="lightblue") +
  labs(title = "Distribution Plots - Numerical Predictors") +
  facet_wrap(~ key, scales = "free") +
  theme_few() +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5))

We can see that all of the numerical predictors are highly skewed - the runtime variable is the only one that is nearly normal in its distribution, though it is still right-skewed.

Now let’s take a look at the relationship between the numerical predictors and the response variable (boxoffice):

featurePlot(x = select(df, where(is.numeric), -boxoffice),
            y = df$boxoffice,
            plot = "scatter",
            jitter = TRUE,
            labels = c("",""),
            col = "lightblue",
            main = "Feature Plot - Numerical Predictors")

Some of the models we are using do not handle missing data well. So we create a categorical variable for the missing budget data to denote if the budget data is provided or not.

df <- df |>
  mutate(bud_avail = as.factor(ifelse(is.na(budget), "No", "Yes")))

Modeling

Modeling for this report continues here.