Capstone EDA
The Data
library(tidyverse)
library(caret)
library(DMwR2)
library(e1071)
library(ipred)
library(xgboost)
library(kableExtra)
library(ggthemes)
library(egg)
library(forcats)
library(summarytools)
library(forecast)
library(doParallel)
Reading in the data,
title | year | rated | released | runtime | genre | director | writer | actors | plot | language | country | awards | poster | ratings | metascore | imdbrating | imdbvotes | imdbid | type | dvd | boxoffice | production | website | response | directorid | actorid | budget | num_directors | num_actors | actor_rev5 | actor_rev10 | director_films5 | actor_films5 | actor_films10 | director_rev10 | director_films10 | director_rev5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Spider-Man | 2002 | PG-13 | 2002-05-03 | 121 | Action | Sam Raimi | Stan Lee, Steve Ditko, David Koepp | Tobey Maguire, Kirsten Dunst, Willem Dafoe | After being bitten by a genetically-modified spider, a shy teenager gains spider-like abilities that he uses to fight injustice as a masked superhero and face a vengeful enemy. | English | United States | Nominated for 2 Oscars. 17 wins & 63 nominations total | https://m.media-amazon.com/images/M/MV5BZDEyN2NhMjgtMjdhNi00MmNlLWE5YTgtZGE4MzNjMTRlMGEwXkEyXkFqcGdeQXVyNDUyOTg3Njg@._V1_SX300.jpg | {‘Source’: ‘Internet Movie Database’, ‘Value’: ‘7.4/10’} | 73 | 7.4 | 852,452 | tt0145487 | movie | 25 Apr 2013 | 407022860 | NA | True | [‘nm0000600’] | [‘nm0001497’, ‘nm0000379’, ‘nm0000353’] | 1.39e+08 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
First we’ll subset the desired columns,
## [1] "title" "year" "rated" "released"
## [5] "runtime" "genre" "director" "writer"
## [9] "actors" "plot" "language" "country"
## [13] "awards" "poster" "ratings" "metascore"
## [17] "imdbrating" "imdbvotes" "imdbid" "type"
## [21] "dvd" "boxoffice" "production" "website"
## [25] "response" "directorid" "actorid" "budget"
## [29] "num_directors" "num_actors" "actor_rev5" "actor_rev10"
## [33] "director_films5" "actor_films5" "actor_films10" "director_rev10"
## [37] "director_films10" "director_rev5"
cols <- c("year","rated","released","runtime","genre","language","country","boxoffice","budget","num_directors","num_actors","actor_films5","actor_films10","actor_rev5","actor_rev10","director_films5","director_films10","director_rev5","director_rev10")
df <- df |>
select(all_of(cols))
kable(head(df)) |> kable_styling()
year | rated | released | runtime | genre | language | country | boxoffice | budget | num_directors | num_actors | actor_films5 | actor_films10 | actor_rev5 | actor_rev10 | director_films5 | director_films10 | director_rev5 | director_rev10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2002 | PG-13 | 2002-05-03 | 121 | Action | English | United States | 407022860 | 1.39e+08 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2002 | PG | 2002-05-16 | 142 | Action | English | United States | 310676740 | 1.15e+08 | 1 | 3 | 1 | 1 | 108638745 | 108638745 | 0 | 0 | 0 | 0 |
2002 | PG | 2002-11-15 | 161 | Adventure | English, Latin | United Kingdom, United States | 262641637 | 1.00e+08 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2002 | PG-13 | 2002-08-02 | 106 | Drama | English, Portuguese | United States | 227966634 | 7.20e+07 | 1 | 3 | 1 | 1 | 78122718 | 78122718 | 0 | 0 | 0 | 0 |
2002 | PG | 2002-08-02 | 95 | Comedy | English, Greek | Canada, United States | 241438208 | 5.00e+06 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2002 | PG-13 | 2002-12-18 | 179 | Action | English, Sindarin, Old English | New Zealand, United States | 342952511 | 9.40e+07 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The following columns will be converted to factors:
rated
- contains the film’s ratings.
genre
- contains the film’s genre. num_actors
- the number of leading actors per Imdb.
num_directors
- the number of directors per Imdb.
The following column will be converted to date:
released
- contains the film’s release date
df <- df |>
mutate(rated = as.factor(rated),
genre = as.factor(genre),
num_actors = as.factor(num_actors),
num_directors = as.factor(num_directors),
released = as.Date(released))
Extracting the month from release date,
Let’s see how many unique values there are in the
language
and country
columnns,
First, language:
## [1] 688
We take a look at the unique languages and notice that there are some
languages that contain ““, and some that contain”None”. First we convert
the empty strings to “None”, then we update the variable
column: if a film contains the language English, we label it “English”,
if the film contains the language “None”, we leave it as “None”,
otherwise, we label it “Foreign”.
# replace empty strings with "None"
df$language <- ifelse(df$language == "", "None", df$language)
# convert language values to "English", "Foreign", or "None"
df$language <- as.factor(ifelse(grepl("English", df$language),
"English",
ifelse(df$language == "None",
"None","Foreign")))
kable(sort(table(df$language), decreasing=TRUE), col.names=NULL) |>
kable_styling()
English | 3490 |
Foreign | 157 |
None | 10 |
Now we take a look at country.
## [1] 618
One of the issues here, like with the language
column,
is that there is no hierarchy given - countries are listed in
alphabetical order. As such, we’ll assume that if the United States is
listed, the film can be considered domestic, whereas if United States is
not listed, it will be considered foreign. We also observe that there
are some missing observations that contain empty strings. We’ll fill
those as “None”.
# fill empty rows
df$country <- ifelse(df$country == "", "None", df$country)
# recode the country variable
df$country <- as.factor(ifelse(grepl("United States", df$country),
"United States",
ifelse(df$country == "None",
"None","Foreign")))
kable(sort(table(df$country), decreasing=TRUE), col.names=NULL) |>
kable_styling()
United States | 3307 |
Foreign | 346 |
None | 4 |
Lastly, we examine our columns stored as double precision. We observe
that the actor and director revenue columns contain some floats, so we
leave those alone. We change the datatype for the budget
column from double precision to integer.
The data we are interested in is from 2012 on. We collected data beginning in 2002 in order to calculate the actor and director earnings for the previous 5- to 10- years. So now we subset the dataframe to only records from 2012 on and drop the year and released column altogether.
Exploratory Data Analysis
We now have a dataframe consisting of 1859 observations and 18 variables. Let’s take a look first at our missing values.
## [1] "budget"
Only budget contains NAs,
## budget
## 0.2383002
About 24% of the films are missing budget numbers. However, this missingness might prove informative:
ggplot(df, aes(y = ifelse(is.na(budget), "Without", "With"), x = boxoffice*0.000001)) +
geom_boxplot(fill = "lightblue") +
labs(x = "(in millions)",
y = "",
title = "Revenue Distribution With/Without Budget") +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
We can see that the films that do not have budget data demonstrate noticeably lower revenue which is out outcome variable, so we will keep the NAs for our model.
Let’s view the summary statistics for the dataset,
## rated runtime genre language
## R :738 Min. : 39.0 Action :519 English:1732
## PG-13 :695 1st Qu.: 97.0 Comedy :352 Foreign: 118
## PG :271 Median :108.0 Drama :295 None : 9
## Not Rated: 74 Mean :111.3 Biography:173
## : 41 3rd Qu.:122.0 Animation:155
## G : 18 Max. :195.0 Adventure: 91
## (Other) : 22 (Other) :274
## country boxoffice budget num_directors
## Foreign : 236 Min. : 159014 Min. : 100000 1:1667
## None : 4 1st Qu.: 4941632 1st Qu.: 12000000 2: 167
## United States:1619 Median : 20777061 Median : 30000000 3: 25
## Mean : 54189923 Mean : 52157266
## 3rd Qu.: 60093010 3rd Qu.: 68000000
## Max. :936662225 Max. :356000000
## NA's :443
## num_actors actor_films5 actor_films10 actor_rev5
## 0: 6 Min. : 0.000 Min. : 0.00 Min. :0.000e+00
## 1: 6 1st Qu.: 1.000 1st Qu.: 2.00 1st Qu.:1.104e+07
## 2: 1 Median : 4.000 Median : 8.00 Median :2.138e+08
## 3:1846 Mean : 5.455 Mean :10.09 Mean :4.326e+08
## 3rd Qu.: 9.000 3rd Qu.:16.00 3rd Qu.:6.438e+08
## Max. :27.000 Max. :44.00 Max. :4.470e+09
##
## actor_rev10 director_films5 director_films10 director_rev5
## Min. :0.000e+00 Min. :0.000 Min. : 0.00 Min. :0.000e+00
## 1st Qu.:4.814e+07 1st Qu.:0.000 1st Qu.: 0.00 1st Qu.:0.000e+00
## Median :4.503e+08 Median :0.000 Median : 1.00 Median :0.000e+00
## Mean :7.592e+08 Mean :0.759 Mean : 1.53 Mean :6.138e+07
## 3rd Qu.:1.168e+09 3rd Qu.:1.000 3rd Qu.: 2.00 3rd Qu.:6.549e+07
## Max. :7.316e+09 Max. :9.000 Max. :14.00 Max. :2.174e+09
##
## director_rev10 month
## Min. :0.000e+00 August :185
## 1st Qu.:0.000e+00 October :182
## Median :1.273e+07 November :172
## Mean :1.244e+08 September:163
## 3rd Qu.:1.504e+08 December :158
## Max. :2.693e+09 February :155
## (Other) :844
Looking at only the numeric variables,
descr(df,
stats = c("min","q1","med","mean","q3","sd","max"),
round.digits = 2,
transpose = TRUE,
headings = FALSE,
style = "rmarkdown") |>
kable(format = "html", digits = 2) |>
kable_styling(full_width = FALSE) |>
scroll_box(height = "200px", width = "100%")
Min | Q1 | Median | Mean | Q3 | Std.Dev | Max | |
---|---|---|---|---|---|---|---|
actor_films10 | 0 | 2 | 8 | 10.09 | 16 | 9.28 | 44 |
actor_films5 | 0 | 1 | 4 | 5.45 | 9 | 5.16 | 27 |
actor_rev10 | 0 | 48091713 | 450255348 | 759167254.42 | 1168551880 | 897987469.29 | 7316416397 |
actor_rev5 | 0 | 10726630 | 213778031 | 432617771.15 | 644587426 | 571830292.26 | 4469634688 |
boxoffice | 159014 | 4936819 | 20777061 | 54189922.78 | 60311495 | 92859139.66 | 936662225 |
budget | 100000 | 12000000 | 30000000 | 52157266.12 | 68000000 | 58391659.06 | 356000000 |
director_films10 | 0 | 0 | 1 | 1.53 | 2 | 1.99 | 14 |
director_films5 | 0 | 0 | 0 | 0.76 | 1 | 1.07 | 9 |
director_rev10 | 0 | 0 | 12727256 | 124370528.27 | 150394119 | 230007052.55 | 2693332806 |
director_rev5 | 0 | 0 | 0 | 61377871.21 | 65565279 | 135630420.74 | 2173799662 |
runtime | 39 | 97 | 108 | 111.27 | 122 | 19.84 | 195 |
We can see that there is a wide range of values across the numeric variables. We see there is a film category that contains an empty string, so we convert those to “None”.
# replace empty strings with "None"
df$rated <- fct_recode(df$rated, "None" = "", .default = "Other")
## Warning: Unknown levels in `f`: Other
R | 0.3969876 |
PG-13 | 0.3738569 |
PG | 0.1457773 |
Not Rated | 0.0398063 |
None | 0.0220549 |
G | 0.0096826 |
TV-MA | 0.0032275 |
Unrated | 0.0032275 |
TV-14 | 0.0026896 |
TV-PG | 0.0021517 |
16+ | 0.0005379 |
NC-17 | 0.0000000 |
df |>
select(rated) |>
mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), "R/PG/PG-13", "Other")) |>
table() |>
prop.table() |>
sort(decreasing = TRUE) |>
kable(col.names=NULL) |>
kable_styling()
R/PG/PG-13 | 0.9166218 |
Other | 0.0833782 |
Nearly 80% of all films in our dataset are rated R or PG-13.
df |>
select(rated, boxoffice) |>
mutate(rated = as.character(rated)) |>
mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), rated, "Other")) |>
ggplot(aes(x = boxoffice*0.000001, y = rated)) +
geom_boxplot(fill = "lightblue") +
labs(x = "(in millions)",
y = "",
title = "Revenue By Rating") +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
df |>
select(rated, boxoffice) |>
mutate(rated = as.character(rated)) |>
mutate(rated = ifelse(rated %in% c("R","PG","PG-13"), rated, "Other")) |>
group_by(rated) |>
summarise(median = median(boxoffice)) |>
arrange(desc(median)) |>
kable() |>
kable_styling()
rated | median |
---|---|
PG | 46700633 |
PG-13 | 32015231 |
R | 15841514 |
Other | 2553002 |
For the purpose of this study, given that we are examining a film’s performance represented as a function of domestic box office, we restrict our dataset to films that contain valid ratings.
Nearly all films are in English (93.2%) or from the United States (87.1%).
kable(sort(round(prop.table(table(df$language)),3), decreasing=TRUE), col.names=NULL) |>
kable_styling()
English | 0.974 |
Foreign | 0.025 |
None | 0.001 |
kable(sort(round(prop.table(table(df$country)),3), decreasing=TRUE), col.names=NULL) |>
kable_styling()
United States | 0.924 |
Foreign | 0.075 |
None | 0.001 |
p1 = ggplot(df, aes(y = ifelse(language == "English", "English", "Other"),
x = boxoffice*0.000001)) +
geom_boxplot(fill = "lightblue") +
labs(x = "(in millions)",
y = "",
title = "Revenue by Language") +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
p2 = ggplot(df, aes(y = ifelse(country == "United States", "United States", "Other"),
x = boxoffice*0.000001)) +
geom_boxplot(fill = "lightblue") +
labs(x = "(in millions)",
y = "",
title = "Revenue by Country") +
theme_few() +
theme(plot.title = element_text(hjust = 0.5))
ggarrange(p1, p2, ncol=2)
Nearly all films list 3 leading actors (99.3%), and 1 director (89.7%).
kable(sort(round(prop.table(table(df$num_actors)),3), decreasing=TRUE), col.names=NULL) |>
kable_styling()
3 | 0.995 |
1 | 0.003 |
0 | 0.001 |
2 | 0.001 |
kable(sort(round(prop.table(table(df$num_directors)),3), decreasing=TRUE), col.names=NULL) |>
kable_styling()
1 | 0.898 |
2 | 0.092 |
3 | 0.010 |
Looking at distribution of the genre
variable,
df$genre |>
table() |>
prop.table() |>
sort(decreasing = TRUE) |>
kable(col.names = NULL) |>
kable_styling()
Action | 0.2729384 |
Comedy | 0.1945412 |
Drama | 0.1614402 |
Biography | 0.0952381 |
Animation | 0.0836237 |
Adventure | 0.0516841 |
Horror | 0.0493612 |
Crime | 0.0412311 |
Documentary | 0.0389082 |
Fantasy | 0.0040650 |
Thriller | 0.0029036 |
Mystery | 0.0017422 |
Family | 0.0005807 |
History | 0.0005807 |
Music | 0.0005807 |
Musical | 0.0005807 |
Romance | 0.0000000 |
Sci-Fi | 0.0000000 |
Short | 0.0000000 |
Talk-Show | 0.0000000 |
df |>
mutate(genre = as.character(genre)) |>
mutate(genre = ifelse(genre %in% c("Action","Comedy","Drama"), genre, "Other")) |>
select(genre) |>
table() |>
prop.table() |>
kable(col.names = NULL) |>
kable_styling()
Action | 0.2729384 |
Comedy | 0.1945412 |
Drama | 0.1614402 |
Other | 0.3710801 |
We take all genres that represent less than 1% of films and combine them into the category “Other”,
other_genres <- df |>
count(genre) |>
mutate(pct = n/sum(n)) |>
filter(pct < 0.01) |>
pull(genre)
df <- df |>
mutate(genre = fct_collapse(genre, Other = other_genres),
genre = droplevels(genre))
Let’s take a look at the distribution of the numeric variables,
df |>
keep(is.numeric) |>
gather() |>
ggplot(aes(x = value)) +
geom_density(fill="lightblue") +
labs(title = "Distribution Plots - Numerical Predictors") +
facet_wrap(~ key, scales = "free") +
theme_few() +
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.title = element_text(hjust = 0.5))
We can see that all of the numerical predictors are highly skewed - the
runtime
variable is the only one that is nearly normal in
its distribution, though it is still right-skewed.
Now let’s take a look at the relationship between the numerical
predictors and the response variable (boxoffice
):
featurePlot(x = select(df, where(is.numeric), -boxoffice),
y = df$boxoffice,
plot = "scatter",
jitter = TRUE,
labels = c("",""),
col = "lightblue",
main = "Feature Plot - Numerical Predictors")
Some of the models we are using do not handle missing data well. So we create a categorical variable for the missing budget data to denote if the budget data is provided or not.
Modeling
Modeling for this report continues here.