Capstone EDA

The Data

library(tidyverse)
library(caret)
library(DMwR2)
library(e1071)
library(ipred)
library(xgboost)
library(kableExtra)
library(ggthemes)
library(egg)
library(forcats)
library(summarytools)
library(forecast)
library(doParallel)

Reading in the data,

df <- read.csv("movies_clean_v4.csv")
kable(head(df,1)) |> kable_styling()

title	year	rated	released	runtime	genre	director	writer	actors	plot	language	country	awards	poster	ratings	metascore	imdbrating	imdbvotes	imdbid	type	dvd	boxoffice	production	website	response	directorid	actorid	budget	num_directors	num_actors	actor_rev5	actor_rev10	director_films5	actor_films5	actor_films10	director_rev10	director_films10	director_rev5
Spider-Man	2002	PG-13	2002-05-03	121	Action	Sam Raimi	Stan Lee, Steve Ditko, David Koepp	Tobey Maguire, Kirsten Dunst, Willem Dafoe	After being bitten by a genetically-modified spider, a shy teenager gains spider-like abilities that he uses to fight injustice as a masked superhero and face a vengeful enemy.	English	United States	Nominated for 2 Oscars. 17 wins & 63 nominations total	https://m.media-amazon.com/images/M/MV5BZDEyN2NhMjgtMjdhNi00MmNlLWE5YTgtZGE4MzNjMTRlMGEwXkEyXkFqcGdeQXVyNDUyOTg3Njg@._V1_SX300.jpg	{‘Source’: ‘Internet Movie Database’, ‘Value’: ‘7.4/10’}	73	7.4	852,452	tt0145487	movie	25 Apr 2013	407022860		NA	True	[‘nm0000600’]	[‘nm0001497’, ‘nm0000379’, ‘nm0000353’]	1.39e+08	1	3	0	0	0	0	0	0	0	0

First we’ll subset the desired columns,

colnames(df)

##  [1] "title"            "year"             "rated"            "released"        
##  [5] "runtime"          "genre"            "director"         "writer"          
##  [9] "actors"           "plot"             "language"         "country"         
## [13] "awards"           "poster"           "ratings"          "metascore"       
## [17] "imdbrating"       "imdbvotes"        "imdbid"           "type"            
## [21] "dvd"              "boxoffice"        "production"       "website"         
## [25] "response"         "directorid"       "actorid"          "budget"          
## [29] "num_directors"    "num_actors"       "actor_rev5"       "actor_rev10"     
## [33] "director_films5"  "actor_films5"     "actor_films10"    "director_rev10"  
## [37] "director_films10" "director_rev5"

cols <- c("year","rated","released","runtime","genre","language","country","boxoffice","budget","num_directors","num_actors","actor_films5","actor_films10","actor_rev5","actor_rev10","director_films5","director_films10","director_rev5","director_rev10")

df <- df |>
  select(all_of(cols))

kable(head(df)) |> kable_styling()

year	rated	released	runtime	genre	language	country	boxoffice	budget	num_directors	num_actors	actor_films5	actor_films10	actor_rev5	actor_rev10
2002	PG-13	2002-05-03	121	Action	English	United States	407022860	1.39e+08	1	3	0	0	0	0
2002	PG	2002-05-16	142	Action	English	United States	310676740	1.15e+08	1	3	1	1	108638745	108638745
2002	PG	2002-11-15	161	Adventure	English, Latin	United Kingdom, United States	262641637	1.00e+08	1	3	0	0	0	0
2002	PG-13	2002-08-02	106	Drama	English, Portuguese	United States	227966634	7.20e+07	1	3	1	1	78122718	78122718
2002	PG	2002-08-02	95	Comedy	English, Greek	Canada, United States	241438208	5.00e+06	1	3	0	0	0	0
2002	PG-13	2002-12-18	179	Action	English, Sindarin, Old English	New Zealand, United States	342952511	9.40e+07	1	3	0	0	0	0

The following columns will be converted to factors:

rated - contains the film’s ratings.
genre - contains the film’s genre. num_actors - the number of leading actors per Imdb.
num_directors - the number of directors per Imdb.

The following column will be converted to date:

released - contains the film’s release date

df <- df |>
  mutate(rated = as.factor(rated),
         genre = as.factor(genre),
         num_actors = as.factor(num_actors),
         num_directors = as.factor(num_directors),
         released = as.Date(released))

Extracting the month from release date,

df <- df |>
  mutate(month = as.factor(months(released)))

Let’s see how many unique values there are in the language and country columnns,

First, language:

n_distinct(df$language)

## [1] 688

We take a look at the unique languages and notice that there are some languages that contain ““, and some that contain”None”. First we convert the empty strings to “None”, then we update the variable column: if a film contains the language English, we label it “English”, if the film contains the language “None”, we leave it as “None”, otherwise, we label it “Foreign”.

# replace empty strings with "None"
df$language <- ifelse(df$language == "", "None", df$language)

# convert language values to "English", "Foreign", or "None"
df$language <- as.factor(ifelse(grepl("English", df$language), 
                                "English",
                                ifelse(df$language == "None",
                                       "None","Foreign")))

kable(sort(table(df$language), decreasing=TRUE), col.names=NULL) |>
  kable_styling()

English	3490
Foreign	157
None	10

Now we take a look at country.

n_distinct(df$country)

## [1] 618

One of the issues here, like with the language column, is that there is no hierarchy given - countries are listed in alphabetical order. As such, we’ll assume that if the United States is listed, the film can be considered domestic, whereas if United States is not listed, it will be considered foreign. We also observe that there are some missing observations that contain empty strings. We’ll fill those as “None”.

# fill empty rows
df$country <- ifelse(df$country == "", "None", df$country)

# recode the country variable
df$country <- as.factor(ifelse(grepl("United States", df$country), 
                               "United States",
                               ifelse(df$country == "None", 
                                      "None","Foreign")))

kable(sort(table(df$country), decreasing=TRUE), col.names=NULL) |>
  kable_styling()

United States	3307
Foreign	346
None	4

Lastly, we examine our columns stored as double precision. We observe that the actor and director revenue columns contain some floats, so we leave those alone. We change the datatype for the budget column from double precision to integer.

df$budget <- as.integer(df$budget)

The data we are interested in is from 2012 on. We collected data beginning in 2002 in order to calculate the actor and director earnings for the previous 5- to 10- years. So now we subset the dataframe to only records from 2012 on and drop the year and released column altogether.

df <- df |>
  filter(year >= 2012) |>
  select(-c(year,released))

Exploratory Data Analysis

We now have a dataframe consisting of 1859 observations and 18 variables. Let’s take a look first at our missing values.

colnames(df)[colSums(is.na(df)) > 0]

## [1] "budget"

Only budget contains NAs,

colMeans(is.na(df)) %>% .[. > 0]

##    budget 
## 0.2383002

About 24% of the films are missing budget numbers. However, this missingness might prove informative:

ggplot(df, aes(y = ifelse(is.na(budget), "Without", "With"), x = boxoffice*0.000001)) + 
  geom_boxplot(fill = "lightblue") +
  labs(x = "(in millions)",
       y = "",
       title = "Revenue Distribution With/Without Budget") +
  theme_few() + 
  theme(plot.title = element_text(hjust = 0.5))

We can see that the films that do not have budget data demonstrate noticeably lower revenue which is out outcome variable, so we will keep the NAs for our model.

Let’s view the summary statistics for the dataset,

summary(df)

##        rated        runtime            genre        language   
##  R        :738   Min.   : 39.0   Action   :519   English:1732  
##  PG-13    :695   1st Qu.: 97.0   Comedy   :352   Foreign: 118  
##  PG       :271   Median :108.0   Drama    :295   None   :   9  
##  Not Rated: 74   Mean   :111.3   Biography:173                 
##           : 41   3rd Qu.:122.0   Animation:155                 
##  G        : 18   Max.   :195.0   Adventure: 91                 
##  (Other)  : 22                   (Other)  :274                 
##           country       boxoffice             budget          num_directors
##  Foreign      : 236   Min.   :   159014   Min.   :   100000   1:1667       
##  None         :   4   1st Qu.:  4941632   1st Qu.: 12000000   2: 167       
##  United States:1619   Median : 20777061   Median : 30000000   3:  25       
##                       Mean   : 54189923   Mean   : 52157266                
##                       3rd Qu.: 60093010   3rd Qu.: 68000000                
##                       Max.   :936662225   Max.   :356000000                
##                                           NA's   :443                      
##  num_actors  actor_films5    actor_films10     actor_rev5       
##  0:   6     Min.   : 0.000   Min.   : 0.00   Min.   :0.000e+00  
##  1:   6     1st Qu.: 1.000   1st Qu.: 2.00   1st Qu.:1.104e+07  
##  2:   1     Median : 4.000   Median : 8.00   Median :2.138e+08  
##  3:1846     Mean   : 5.455   Mean   :10.09   Mean   :4.326e+08  
##             3rd Qu.: 9.000   3rd Qu.:16.00   3rd Qu.:6.438e+08  
##             Max.   :27.000   Max.   :44.00   Max.   :4.470e+09  
##                                                                 
##   actor_rev10        director_films5 director_films10 director_rev5      
##  Min.   :0.000e+00   Min.   :0.000   Min.   : 0.00    Min.   :0.000e+00  
##  1st Qu.:4.814e+07   1st Qu.:0.000   1st Qu.: 0.00    1st Qu.:0.000e+00  
##  Median :4.503e+08   Median :0.000   Median : 1.00    Median :0.000e+00  
##  Mean   :7.592e+08   Mean   :0.759   Mean   : 1.53    Mean   :6.138e+07  
##  3rd Qu.:1.168e+09   3rd Qu.:1.000   3rd Qu.: 2.00    3rd Qu.:6.549e+07  
##  Max.   :7.316e+09   Max.   :9.000   Max.   :14.00    Max.   :2.174e+09  
##                                                                          
##  director_rev10            month    
##  Min.   :0.000e+00   August   :185  
##  1st Qu.:0.000e+00   October  :182  
##  Median :1.273e+07   November :172  
##  Mean   :1.244e+08   September:163  
##  3rd Qu.:1.504e+08   December :158  
##  Max.   :2.693e+09   February :155  
##                      (Other)  :844

Looking at only the numeric variables,

descr(df,
      stats = c("min","q1","med","mean","q3","sd","max"),
      round.digits = 2,
      transpose = TRUE,
      headings = FALSE,
      style = "rmarkdown") |>
  kable(format = "html", digits = 2) |>
  kable_styling(full_width = FALSE) |>
  scroll_box(height = "200px", width = "100%")

	Min	Q1	Median	Mean	Q3	Std.Dev	Max
actor_films10	0	2	8	10.09	16	9.28	44
actor_films5	0	1	4	5.45	9	5.16	27
actor_rev10	0	48091713	450255348	759167254.42	1168551880	897987469.29	7316416397
actor_rev5	0	10726630	213778031	432617771.15	644587426	571830292.26	4469634688
boxoffice	159014	4936819	20777061	54189922.78	60311495	92859139.66	936662225
budget	100000	12000000	30000000	52157266.12	68000000	58391659.06	356000000
director_films10	0	0	1	1.53	2	1.99	14
director_films5	0	0	0	0.76	1	1.07	9
director_rev10	0	0	12727256	124370528.27	150394119	230007052.55	2693332806
director_rev5	0	0	0	61377871.21	65565279	135630420.74	2173799662
runtime	39	97	108	111.27	122	19.84	195

We can see that there is a wide range of values across the numeric variables. We see there is a film category that contains an empty string, so we convert those to “None”.

# replace empty strings with "None"
df$rated <- fct_recode(df$rated, "None" = "", .default = "Other")

## Warning: Unknown levels in `f`: Other

kable(sort(prop.table(table(df$rated)), decreasing=TRUE), col.names=NULL) |>
  kable_styling()

R	0.3969876
PG-13	0.3738569
PG	0.1457773
Not Rated	0.0398063
None	0.0220549
G	0.0096826
TV-MA	0.0032275
Unrated	0.0032275
TV-14	0.0026896
TV-PG	0.0021517
16+	0.0005379
NC-17	0.0000000

df |> 
  select(rated) |>
  mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), "R/PG/PG-13", "Other")) |>
  table() |>
  prop.table() |>
  sort(decreasing = TRUE) |>
  kable(col.names=NULL) |>
  kable_styling()

R/PG/PG-13	0.9166218
Other	0.0833782

Nearly 80% of all films in our dataset are rated R or PG-13.

df |>
  select(rated, boxoffice) |>
  mutate(rated = as.character(rated)) |>
  mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), rated, "Other")) |>
  ggplot(aes(x = boxoffice*0.000001, y = rated)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "(in millions)",
       y = "",
       title = "Revenue By Rating") +
  theme_few() + 
  theme(plot.title = element_text(hjust = 0.5))

df |>
  select(rated, boxoffice) |>
  mutate(rated = as.character(rated)) |>
  mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), rated, "Other")) |>
  group_by(rated) |>
  summarise(median = median(boxoffice)) |>
  arrange(desc(median)) |>
  kable() |>
  kable_styling()

rated	median
PG	46700633
PG-13	32015231
R	15841514
Other	2553002

For the purpose of this study, given that we are examining a film’s performance represented as a function of domestic box office, we restrict our dataset to films that contain valid ratings.

df <- df |>
  filter(rated %in% c("G","NC-17","PG","PG-13","R")) |>
  mutate(rated = droplevels(rated))

Nearly all films are in English (93.2%) or from the United States (87.1%).

kable(sort(round(prop.table(table(df$language)),3), decreasing=TRUE), col.names=NULL) |>
  kable_styling()

English	0.974
Foreign	0.025
None	0.001

kable(sort(round(prop.table(table(df$country)),3), decreasing=TRUE), col.names=NULL) |>
  kable_styling()

United States	0.924
Foreign	0.075
None	0.001

p1 = ggplot(df, aes(y = ifelse(language == "English", "English", "Other"), 
                    x = boxoffice*0.000001)) + 
  geom_boxplot(fill = "lightblue") +
  labs(x = "(in millions)",
       y = "",
       title = "Revenue by Language") +
  theme_few() + 
  theme(plot.title = element_text(hjust = 0.5))

p2 = ggplot(df, aes(y = ifelse(country == "United States", "United States", "Other"), 
                    x = boxoffice*0.000001)) + 
  geom_boxplot(fill = "lightblue") +
  labs(x = "(in millions)",
       y = "",
       title = "Revenue by Country") +
  theme_few() + 
  theme(plot.title = element_text(hjust = 0.5))

ggarrange(p1, p2, ncol=2)

Nearly all films list 3 leading actors (99.3%), and 1 director (89.7%).

kable(sort(round(prop.table(table(df$num_actors)),3), decreasing=TRUE), col.names=NULL) |>
  kable_styling()

3	0.995
1	0.003
0	0.001
2	0.001

kable(sort(round(prop.table(table(df$num_directors)),3), decreasing=TRUE), col.names=NULL) |>
  kable_styling()

1	0.898
2	0.092
3	0.010

Looking at distribution of the genre variable,

df$genre |>
  table() |>
  prop.table() |>
  sort(decreasing = TRUE) |>
  kable(col.names = NULL) |>
  kable_styling()

Action	0.2729384
Comedy	0.1945412
Drama	0.1614402
Biography	0.0952381
Animation	0.0836237
Adventure	0.0516841
Horror	0.0493612
Crime	0.0412311
Documentary	0.0389082
Fantasy	0.0040650
Thriller	0.0029036
Mystery	0.0017422
Family	0.0005807
History	0.0005807
Music	0.0005807
Musical	0.0005807
Romance	0.0000000
Sci-Fi	0.0000000
Short	0.0000000
Talk-Show	0.0000000

df |>
  mutate(genre = as.character(genre)) |>
  mutate(genre = ifelse(genre %in% c("Action","Comedy","Drama"), genre, "Other")) |>
  select(genre) |>
  table() |>
  prop.table() |>
  kable(col.names = NULL) |>
  kable_styling()

Action	0.2729384
Comedy	0.1945412
Drama	0.1614402
Other	0.3710801

We take all genres that represent less than 1% of films and combine them into the category “Other”,

other_genres <- df |>
  count(genre) |>
  mutate(pct = n/sum(n)) |>
  filter(pct < 0.01) |>
  pull(genre)

df <- df |>
  mutate(genre = fct_collapse(genre, Other = other_genres),
         genre = droplevels(genre))

Let’s take a look at the distribution of the numeric variables,

df |>
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(x = value)) +
  geom_density(fill="lightblue") +
  labs(title = "Distribution Plots - Numerical Predictors") +
  facet_wrap(~ key, scales = "free") +
  theme_few() +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(hjust = 0.5))

We can see that all of the numerical predictors are highly skewed - the runtime variable is the only one that is nearly normal in its distribution, though it is still right-skewed.

Now let’s take a look at the relationship between the numerical predictors and the response variable (boxoffice):

featurePlot(x = select(df, where(is.numeric), -boxoffice),
            y = df$boxoffice,
            plot = "scatter",
            jitter = TRUE,
            labels = c("",""),
            col = "lightblue",
            main = "Feature Plot - Numerical Predictors")

Some of the models we are using do not handle missing data well. So we create a categorical variable for the missing budget data to denote if the budget data is provided or not.

df <- df |>
  mutate(bud_avail = as.factor(ifelse(is.na(budget), "No", "Yes")))

Modeling

Modeling for this report continues here.