For this project, I set some requirements for the movies that I’m going to select. The movies must:
1. Be in English
2. Be rated by users (meaning the site has some sort of star or number system)
3. Have more than 100,000 votes
4. Belong to a genre with at least 30 movies
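As a rough sketch, the first three requirements can be written as query parameters on the IMDb search URL that the scraping step uses. The helper name `build_search_url` and its arguments are hypothetical; the parameter names are copied from the URL in the scraping code, and IMDb may change them at any time. The fourth requirement can only be checked after the results come back.

```r
# Hypothetical helper: turn one genre name into the IMDb search URL.
build_search_url <- function(genre, min_votes = 100000, per_page = 250) {
  paste0(
    "https://www.imdb.com/search/title/",
    "?genres=", tolower(genre),
    "&start=1&count=", per_page,
    "&languages=en",            # requirement 1: English
    "&sort=user_rating,desc",   # requirement 2: ranked by user rating
    "&title_type=feature",
    # requirement 3: minimum vote count (format() avoids "1e+05")
    "&num_votes=", format(min_votes, scientific = FALSE), ","
  )
}

url <- build_search_url("Sci-Fi")
print(url)
```

Keeping the thresholds as arguments makes it easy to see which part of the URL enforces which requirement.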
library("base64enc")
library("furrr")
library("data.table")
library("rvest")
library("tidyverse")
library("stringr")
library("dplyr")
library("rjson")
library("stats")
library("openintro")
library("infer")
library("heatmaply")
library(R.utils)
library("psych")
library("ggpubr")
library("rstatix")
genres <- list("Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Drama", "Family", "Fantasy", "Film-Noir", "History", "Horror", "Music", "Musical", "Mystery", "Romance", "Sci-Fi", "Short-Film", "Sport", "Superhero", "Thriller", "War", "Western")
# I like to measure the time it takes to run since sometimes it takes a while and I get distracted
start_time <- Sys.time()
# We are going to use the Furrr package to allow for multiple cores to get the data
plan(multisession(workers = 10))
#we are going to save the data as a list called 'titles'
titles <- future_map(genres, function(x) {
p = str_c('https://www.imdb.com/search/title/?genres=',tolower(x),'&start=1&count=250&languages=en&sort=user_rating,desc&title_type=feature&num_votes=100000,', sep="")
URL_p = read_html(p)
# Let's get the title id
title_id <- URL_p %>% html_nodes( ".lister-item-image > a > img") %>%
html_attr("data-tconst")
# Let's get the genre
genre <- rep(c(x), times = length(title_id))
# Let's get the rank
rank <- (1:length(title_id))
# Let's add it all to a list
my_list <- list(title_id, genre, rank)
})
end_time <- Sys.time()
paste("Your dataframe has been built and it took", round(difftime(end_time, start_time, units = "secs")), "seconds to complete.")
## [1] "Your dataframe has been built and it took 6 seconds to complete."
# Let's turn the list to a Dataframe
title_list <- rbindlist(titles)
#Let's change the columns names
colnames(title_list) <- c("title_id", "genre", "rank")
#Let's shut down the extra workers by switching back to a sequential plan
plan(sequential)
Let’s see what we are working with
summary(as.factor(title_list[[2]]))
## Action Adventure Animation Biography Comedy Crime Drama
## 250 250 134 126 250 250 250
## Family Fantasy Film-Noir History Horror Music Musical
## 213 250 5 83 190 45 64
## Mystery Romance Sci-Fi Short-Film Sport Superhero Thriller
## 246 250 250 250 49 250 250
## War Western
## 85 29
It looks like everything but Film-Noir (5) and Western (29) has at least 30 entries. I’m going to leave them in but not take any results from them too seriously.
And now we are going to scrape every page. Note that while doing this I ran into several issues with missing data, such as revenue and Metascore, so I had to add logic to make sure they were on the page before scraping them.
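The "check before you scrape" logic can be sketched as a small helper that returns an empty string when a selector matches nothing. This is a minimal illustration on a toy page, not the exact logic used in this project; the helper name `text_or_blank` and the `.gross-usa` selector are made up for the example.

```r
library(rvest)

# Hypothetical helper: text of the first node matching a CSS selector,
# or "" when the page has no such node.
text_or_blank <- function(page, css) {
  found <- html_text(html_nodes(page, css))
  if (length(found) == 0) "" else trimws(found[[1]])
}

# Toy page standing in for a title page that has a Metascore but no revenue.
page <- read_html(
  '<html><body><div class="metacriticScore"><span> 84 </span></div></body></html>'
)

metascore <- text_or_blank(page, ".metacriticScore > span")
revenue   <- text_or_blank(page, ".gross-usa")  # made-up selector, not on the page
```

Because every field comes back as a length-one string, each scraped movie produces a list of the same shape, which keeps the later `rbindlist()` step from failing on ragged rows.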
In addition, I originally ran it as a loop, but then found out about the furrr package and realized I could map it and save a lot of time. However, I had to limit the number of cores in use because on the initial trials my IP was flagged as a bot and I was blacklisted.
If this were a larger and more frequent data pull, it would need some sort of code that automatically cycled through different VPN IPs so you could get the data quicker and without being blocked.
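A gentler alternative to rotating IPs is to slow down and retry when a request fails. The sketch below shows that swapped-in technique (retry with a growing pause), not the VPN approach described above. The helper `fetch_with_retry` is hypothetical, and the toy `flaky` function stands in for a real call such as `read_html(url)`.

```r
# Hypothetical helper: call fetch(), pausing longer after each failure.
fetch_with_retry <- function(fetch, max_tries = 3, base_delay = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    # Wait 1x, 2x, 4x ... the base delay before the next attempt.
    if (attempt < max_tries) Sys.sleep(base_delay * 2^(attempt - 1))
  }
  NA  # give up and let the caller handle the missing value
}

# Toy fetcher that fails twice before succeeding.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("blocked")
  "page content"
}

page <- fetch_with_retry(flaky, base_delay = 0.01)
```

In a real run the base delay would be on the order of seconds, and it would be combined with a small number of workers.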
#Make sure you saved this file in a directory for this to work
# getwd() gives the current directory without moving the working directory up a level
dataPath <- getwd()
if (file.exists(paste(dataPath,'/imdb.csv',sep = ""))) {
print("Already Exists")
movies <- read.csv(paste(dataPath,'/imdb.csv',sep = ""))
} else {
start_time <- Sys.time()
plan(multisession(workers = 7))
movie_list <- future_map(as.list(title_list$title_id), function(x) {
p = str_c('https://www.imdb.com/title/',x, sep="")
URL_p = read_html(p)
# Let's get the title
title <- URL_p %>% html_nodes( ".star-rating-widget") %>%
html_attr("data-title")
# Let's get the title id
title_id <- x
# Let's get the certificate
cert <- URL_p %>% html_nodes(".subtext") %>%
html_text() %>%
strsplit(" |\\\n")
certificate <- cert[[1]][[22]]
# Let's get the metascore
if (str_detect(URL_p,'Metascore')) {
metascore <- URL_p %>% html_nodes(".metacriticScore > span") %>%
html_text()
} else {
metascore <-""
}
# Let's get the revenue
if (str_detect(URL_p,'Gross USA')) {
revenue <- URL_p %>% html_nodes(xpath='//*[@id="titleDetails"]/div[9]/text()[2]') %>%
html_text()
revenue <- str_trim(revenue)
} else {
revenue <- ""
}
# In order to map to work, we need to save all that information in a list
my_list <- list(title, title_id, certificate ,metascore, revenue)
})
end_time <- Sys.time()
paste("Your dataframe has been built and it took", round(difftime(end_time, start_time, units = "mins")), "minutes to complete.")
plan(sequential)
#Convert list to dataframe and change the column titles
movies <- rbindlist(movie_list)
colnames(movies) <- c("title",
"title_id",
"certificate",
"metascore",
"revenue")
# There were a few issues with the data, so we are going to get a unique list of the title ids and then fix any that were not pulled in correctly, as well as fix some of the errors I noticed in each column.
#Get rid of all duplicate rows, since this is just for data mapping
nrow(movies)
movies <- movies %>% distinct(title_id, .keep_all = TRUE)
nrow(movies)
## Certificate seems to have an error, it did not pull in 'Not Ranked'
movies %>% distinct(certificate, .keep_all = TRUE)
l <- movies %>% filter(str_detect(certificate, "Not"))
l$certificate <- "Not Ranked"
clean <- movies %>% filter(!str_detect(certificate, "Not"))
movies <- rbind(clean, l)
## I also had an issue with some of the revenues
l <- movies %>% filter(!str_detect(revenue, "\\$"))
l$revenue <- ""
clean <- movies %>% filter(str_detect(revenue, "\\$"))
movies <- rbind(clean, l)
# And an issue where GP is there instead of PG
l <- movies %>% filter(str_detect(certificate, "GP"))
l$certificate <- "PG"
clean <- movies %>% filter(!str_detect(certificate, "GP"))
movies <- rbind(clean, l)
}
## [1] "Already Exists"
# Use the current working directory as the data path
dataPath <- getwd()
if (file.exists(paste(dataPath,'/title_ratings.tsv',sep = ""))) {
print("Already Exists")
df_ratings <- read_tsv(paste(dataPath,'/title_ratings.tsv',sep = ""), na = "\\N", quote = '')
df_titles <- read_tsv(paste(dataPath,'/titles.tsv',sep = ""), na = "\\N", quote = '')
} else {
#Download files
download.file(url = "https://datasets.imdbws.com/title.ratings.tsv.gz",
mode = "wb",
destfile=file.path(dataPath, "title_ratings.tsv.gz"))
download.file(url = "https://datasets.imdbws.com/title.basics.tsv.gz",
mode = "wb",
destfile=file.path(dataPath, "titles.tsv.gz"))
setwd(dataPath)
gunzip("title_ratings.tsv.gz", remove = FALSE)
gunzip("titles.tsv.gz", remove = FALSE)
df_ratings <- read_tsv(paste(dataPath,'/title_ratings.tsv',sep = ""), na = "\\N", quote = '')
df_titles <- read_tsv(paste(dataPath,'/titles.tsv',sep = ""), na = "\\N", quote = '')
}
## [1] "Already Exists"
Now I am going to select my data and combine it all into one large dataset
df_titles <- df_titles %>% select(tconst, startYear, runtimeMinutes)
movies_clean <- title_list %>% left_join(df_titles, by = c("title_id" = "tconst"))
movies_clean <- movies_clean %>% left_join(df_ratings, by = c("title_id" = "tconst"))
movies_clean <- movies_clean %>% left_join(movies, by = c("title_id" = "title_id"))
rm(df_ratings)
rm(df_titles)
movies_clean <- movies_clean %>% select(title_id, title, genre, rank,averageRating, numVotes, metascore, revenue, certificate, startYear, runtimeMinutes)
Now let’s get some summary statistics on our data
movies_clean <- movies_clean %>% mutate_if(sapply(movies_clean, is.character), as.factor)
summary(movies_clean)
## title_id title genre rank
## tt0103639: 9 Aladdin : 13 Action : 250 Min. : 1.0
## tt0129167: 9 Beauty and the Beast: 11 Adventure: 250 1st Qu.: 47.0
## tt2380307: 9 The Lion King : 10 Comedy : 250 Median :104.0
## tt0312004: 8 Coco : 9 Crime : 250 Mean :110.8
## tt2096673: 8 The Iron Giant : 9 Drama : 250 3rd Qu.:171.5
## tt2948356: 8 (Other) :3951 Fantasy : 250 Max. :250.0
## (Other) :3968 NA's : 16 (Other) :2519
## averageRating numVotes metascore revenue
## Min. :2.800 Min. : 99691 72 : 114 : 288
## 1st Qu.:7.200 1st Qu.: 147222 66 : 113 $210,460,015: 9
## Median :7.600 Median : 241593 65 : 109 $217,350,219: 9
## Mean :7.532 Mean : 367309 68 : 108 $23,180,087 : 9
## 3rd Qu.:8.000 3rd Qu.: 469186 76 : 108 $341,268,248: 8
## Max. :9.300 Max. :2385745 (Other):3451 (Other) :3680
## NA's : 16 NA's : 16
## certificate startYear runtimeMinutes
## R :1707 Min. :1921 Min. : 64.0
## PG-13 :1045 1st Qu.:1995 1st Qu.:101.0
## PG : 789 Median :2005 Median :116.0
## G : 264 Mean :2000 Mean :119.1
## Approved: 73 3rd Qu.:2013 3rd Qu.:132.0
## (Other) : 125 Max. :2021 Max. :242.0
## NA's : 16
hist(movies_clean$averageRating)
It looks like overall, the data is fine for statistical analysis; we’ll just have to check it at the genre level.
Now we can see that the average rating overall is 7.532 and the median is 7.6. Overall, the data is slightly left-skewed but roughly normally distributed. So now we have to check each genre against the three ANOVA conditions (independent observations, approximately normal data within each group, and constant variance across groups) to make sure we can analyze them for bias.
ggplot(movies_clean, aes(x = averageRating)) +
geom_histogram(fill = "white", colour = "black", bins = 10) +
facet_grid(genre ~ ., scales = "free_y", margins = TRUE)
histBy(averageRating ~ genre, data=movies_clean) #formula input
+ All the movie genres seem to follow an approximately normal distribution.
3. Constant variance.
par(las=2)
boxPlot(movies_clean$averageRating, fact = movies_clean$genre,
col = COL[1,2], ylab = "Average Movie Rating")
+ It looks like all the genres have roughly constant variance.
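The boxplot check can be backed up with numbers. The sketch below uses made-up toy data in place of `movies_clean`, compares the standard deviation of each genre, and runs Bartlett's test from base R. It is one possible check under those assumptions, not the method used for the conclusions in this write-up.

```r
# Toy data standing in for movies_clean (values are made up).
set.seed(42)
toy <- data.frame(
  genre = rep(c("Action", "Drama", "Horror"), each = 50),
  averageRating = c(rnorm(50, 7.4, 0.5), rnorm(50, 7.9, 0.5), rnorm(50, 6.8, 0.6))
)

# Standard deviation of the ratings within each genre.
sd_by_genre <- tapply(toy$averageRating, toy$genre, sd)

# Ratio of the largest to the smallest group SD; values near 1 suggest similar spread.
sd_ratio <- max(sd_by_genre) / min(sd_by_genre)

# Bartlett's test of equal variances (note: it is sensitive to non-normal data).
bt <- bartlett.test(averageRating ~ genre, data = toy)

print(sd_by_genre)
print(sd_ratio)
print(bt$p.value)
```

A ratio well above 1, or a very small Bartlett p-value, would be a reason to look at the spread of individual genres more closely.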
Now that we validated that the conditions are met, let’s run the analysis.
one.way <- aov(averageRating ~ genre, data = movies_clean)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 22 640.9 29.130 106.3 <2e-16 ***
## Residuals 3996 1095.4 0.274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s run this with a bootstrap to see if we come to the same conclusion
library("lmboot")
moviesAnova <- ANOVA.boot(averageRating ~ genre, B = 1000, type = "residual", wild.dist = "normal",
seed = 3112 , data = movies_clean, keep.boot.resp = FALSE)
## Warning in ANOVA.boot(averageRating ~ genre, B = 1000, type = "residual", : Number of bootstrap samples is recommended to be more than the number of observations.
moviesAnova$`p-values`
## [1] 0
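The bootstrap p-value is 0, meaning that none of the 1,000 bootstrap samples produced an F statistic as extreme as the observed one. We again reject the null hypothesis that every genre has the same mean rating, so there is most definitely bias in the data.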
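To see how a resampling p-value can come out as exactly 0, here is a minimal sketch on toy data. It uses a simple permutation test (shuffling the ratings across genres), which is a swapped-in, simpler technique than the residual bootstrap that `ANOVA.boot` performs. The data and the helper `f_stat` are made up for illustration.

```r
set.seed(1)

# Toy ratings for three genres with clearly different means (made-up values).
ratings <- c(rnorm(30, 7.0, 0.5), rnorm(30, 8.0, 0.5), rnorm(30, 7.5, 0.5))
genre   <- factor(rep(c("a", "b", "c"), each = 30))

# Hypothetical helper: F statistic from a one-way ANOVA.
f_stat <- function(y, g) summary(aov(y ~ g))[[1]][["F value"]][1]

obs_f <- f_stat(ratings, genre)

# Shuffle the ratings to mimic "genre does not matter" and recompute F each time.
B <- 500
perm_f <- replicate(B, f_stat(sample(ratings), genre))

# p-value: share of shuffled F statistics at least as large as the observed one.
p_val <- mean(perm_f >= obs_f)
print(p_val)
```

With `B` resamples the smallest non-zero value this estimate can take is `1 / B`, so a reported 0 is better read as "smaller than `1 / B`" than as a literal zero.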
Now that we have confirmed that there is bias, the next thing to do is run a ‘pairwise t test’ to see where the values differ. Basically, the ‘pairwise t test’ tests each genre against every other genre, with the null hypothesis that the true difference in means is 0. In this case, that means there is no difference between the average ratings of the two genres. I’m going to apply a Bonferroni correction since there are a lot of genres and therefore a lot of pairs.
The Bonferroni correction is a method to counteract the problem of multiple comparisons. Statistical hypothesis testing is based on rejecting the null hypothesis if the likelihood of the observed data under the null hypothesis is low. If multiple hypotheses are tested, the chance of observing a rare event increases, and therefore the likelihood of incorrectly rejecting a null hypothesis (i.e., making a Type I error) increases.
The Bonferroni correction compensates by testing each hypothesis at a significance level of alpha divided by the number of tests performed. Equivalently, each p-value is multiplied by the number of tests (capped at 1) and compared with the original alpha.
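As a small worked example of the correction, the sketch below counts the pairwise comparisons for the 23 genres and adjusts a few made-up raw p-values, both by hand and with base R's `p.adjust()`. The raw p-values are invented for illustration only.

```r
n_genres <- 23
n_pairs  <- choose(n_genres, 2)   # number of genre pairs: 253
alpha    <- 0.05

# Per-test significance level under Bonferroni.
per_test_alpha <- alpha / n_pairs

# Made-up raw p-values for three of the 253 comparisons.
raw_p <- c(0.00001, 0.0004, 0.03)

# Manual adjustment: multiply by the number of tests and cap at 1.
adj_p <- pmin(1, raw_p * n_pairs)

# p.adjust() gives the same answer when told the total number of tests.
adj_p_builtin <- p.adjust(raw_p, method = "bonferroni", n = n_pairs)

print(per_test_alpha)
print(adj_p)
```

Only the first of these three comparisons would still count as significant at the 0.05 level after adjustment.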
pwc <- movies_clean %>%
pairwise_t_test(
averageRating ~ genre, pool.sd = FALSE,
p.adjust.method = "bonferroni")
# Keep the pairs that are NOT significantly different after the Bonferroni adjustment
pwc_matches <- pwc %>% filter(p.adj > .05) %>% arrange(desc(p))
pwc_matches
This shows all the pairs of genres whose ratings are not significantly different, along with their p-values. Let’s take a look at that as a boxplot.
pwc1 = pwc_matches %>% select(p,group1)
pwc2 = pwc_matches %>% select(p,group2)
names(pwc2)[2] <- "group1"
pwc_groups <- rbind(pwc1,pwc2)
pwc_groups$group1 <- as.factor(pwc_groups$group1)
ggboxplot(pwc_groups, x = "group1", y = "p", add = "point") +
rotate_x_text()
par(las=2)
barplot(sort(summary(pwc_groups$group1), decreasing = T))
Cinemascore
In order to get this data I had to call the ‘unofficial API’. To be transparent, nothing on the website prohibits this, although it does sit somewhere between ‘publicly available on the internet’ and ‘a little unethical’. Since it is for academic purposes, my moral compass is going to let it slide.
How it was done.
While digging through the website, I noticed that each search required a minimum of two letters. The website would then send these letters, encoded, to an API, which would send back the movie title with the rating information.
Once the request was sent, there was a link in the browser’s developer tools that showed the results as JSON.
I realized that the encoded search terms followed a simple scheme by trying out several letter combinations (ab: YWI=, ba: YmE=) and noticing a consistent pattern, which meant the encoding was happening on the front end.
Since it was happening on the front end, I then dug into the JS to find the encoding method, which was Base64. Strictly speaking, Base64 is an encoding rather than encryption.
Running a few letters and matching them with the site confirmed that it was Base64.
I figured it would be easier to get all movies rather than create a list and search for them, so I created a list containing every possible two-letter combination.
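The encoding step and the list of search terms can be sketched in a few lines. The helper name `encode_term` is hypothetical. `base64encode()` comes from the base64enc package already loaded in this project, and the two expected strings are the ones observed on the site.

```r
library(base64enc)

# All 26 * 26 two-letter search terms.
combo <- as.vector(outer(letters, letters, paste0))

# Hypothetical helper: Base64-encode a search term as the site's front end appears to.
encode_term <- function(term) base64encode(charToRaw(term))

print(encode_term("ab"))  # "YWI="
print(encode_term("ba"))  # "YmE="
print(length(combo))      # 676
```

Building the terms with `outer()` avoids the nested loop and produces a plain character vector, which can be passed straight to a `for` loop or a mapping function.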
cinemascore <- data.frame(title=character(),
grade=character(),
year=character())
combo <- list()
for (i in letters[1:26]){
for (j in letters[1:26]) {
combo <- append(combo, str_c(i,j, sep = ""))
}
}
dataPath <- getwd()
if (file.exists(paste(dataPath,'/Cinamascore.csv',sep = ""))) {
print("Already Exists")
cinemascore = read.csv2(file.path(dataPath, 'Cinamascore.csv'), sep = ",")
} else {
for (i in unique(combo)) {
# Read in the names as base64
query = str_c("https://api.cinemascore.com/guest/search/title/", base64encode(charToRaw(i)), sep="")
# get the webpage content
Desc <- read_html(query) %>% html_text()
# Convert that data to a list from JSON
data <- fromJSON(Desc)
# Turn the nested list into a dataframe
cinema <- as.data.frame(do.call(rbind, data))
# Append the data to the cinemascore dataframe, skipping empty responses
if (all(cinema$TITLE != 'No Results')) {
cinemascore <- rbind(cinemascore, cinema)
}
}
cinemascore <- distinct(cinemascore)
cinemascore <- as.data.frame(lapply(cinemascore, unlist))
cinemascore <-data.frame(lapply(cinemascore, factor))
g <- c("A+","A","A-","B+", "B", "B-","C+","C","C-","D+","D", "D-","F")
Score <- c(1,2,3,4,5,6,7,8,9,10,11,12,13)
Grade <- data.frame(g,Score)
cinemascore <- cinemascore %>% left_join(Grade, by = c("GRADE" = "g"))
}
## [1] "Already Exists"
cinemascore %>% ggplot(aes(x=Score)) +
geom_bar()
Let’s combine the dataframes
#Match title style with the movies tab
cinemascore$TITLE <- str_to_title(cinemascore$TITLE)
#make it so that both databases can be joined
cinemascore$title_match <- str_replace(cinemascore$TITLE, ", The$","")
movies_clean$title_match <- str_replace(movies_clean$title, "^The ","")
#make it so that we are not working on individual movies
movies_unique <- movies_clean %>% distinct(title_id, .keep_all = T)
# Join tables
movies_all <- movies_unique %>% left_join(cinemascore, by = "title_match")
# Filter out unneeded columns
movies_all <- movies_all %>% select(-TITLE, -GRADE, -YEAR, -title_match)
movies_all$metascore <- as.numeric(as.character(movies_all$metascore))
## Warning: NAs introduced by coercion
#Now let's normalize the data
movies_all$imdb_norm <- heatmaply::normalize(movies_all$averageRating, range = c(0, 1))
movies_all$meta_norm <- heatmaply::normalize(movies_all$metascore, range = c(0, 1))
movies_all$cine_norm <- heatmaply::normalize(movies_all$Score, range = c(0, 1))
#Finally, let's filter the data for movies that exist in all three, and also just metascore and Imdb (since there are more)
movies_imc <- movies_all %>% filter(cine_norm >= 0 & meta_norm >= 0)
movies_im <- movies_all %>% filter( meta_norm >= 0)
movies_im$difference <- movies_im$imdb_norm - movies_im$meta_norm
mean(movies_im$difference)
## [1] 0.06037913
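To make the normalized difference easier to interpret, here is a minimal sketch of min-max scaling on made-up scores. It assumes that `heatmaply::normalize()` rescales a numeric vector to the 0 to 1 range in this way; the helper `min_max` is hypothetical and is not a drop-in replacement for it.

```r
# Hypothetical helper: min-max scaling to [0, 1], ignoring missing values.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up ratings: IMDb on a 1-10 scale, Metascore on a 0-100 scale.
imdb <- c(6.5, 7.0, 8.0, 9.0)
meta <- c(50, 60, NA, 90)

imdb_norm  <- min_max(imdb)
meta_norm  <- min_max(meta)
difference <- imdb_norm - meta_norm

# Average gap over the movies that have both scores.
print(mean(difference, na.rm = TRUE))
```

Under this kind of scaling, 0 and 1 are the lowest and highest values seen in each column, so a difference is a fraction of each source's observed range, not a gap in raw rating points.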
The average movie scores about 0.06 higher on IMDB than on Metacritic on the normalized 0 to 1 scale (roughly 6 percentage points). Let’s see what it looks like for each genre.
movies_im %>% group_by(genre) %>% summarise(Average_Difference = mean(difference)) %>% arrange(desc(Average_Difference))
Let’s graph it so that we can see it better.
par(las=2)
boxPlot(movies_im$difference, fact = movies_im$genre,
col = COL[1,2], ylab = "Average Movie Rating Difference: IMDB vs Metascore")
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
#### IMDB and Cinemascore Analysis
For all movies that match up between all three movie databases, I want to see the mean differences between the databases
## difference between IMDB and Metacritic
movies_imc$difference_IM <- movies_imc$imdb_norm - movies_imc$meta_norm
## difference between IMDB and Cinemascore
movies_imc$difference_IC <- movies_imc$imdb_norm - movies_imc$cine_norm
## difference between Metacritic and Cinemascore
movies_imc$difference_MC <- movies_imc$meta_norm - movies_imc$cine_norm
mean(movies_imc$difference_IM)
## [1] 0.07737769
mean(movies_imc$difference_IC)
## [1] 0.4123628
mean(movies_imc$difference_MC)
## [1] 0.3349851
Interestingly, it looks like the biggest difference is between Cinemascore and IMDB. (Because fewer movies matched across all three databases, the difference between IMDB and Metascore has changed from the earlier figure.) Let’s dig deeper:
par(las=2)
boxPlot(movies_imc$difference_IC, fact = movies_imc$genre,
col = COL[1,2], ylab = "Rating Difference: IMDB vs Cinemascore")
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
Let’s see if this is a trend overall or a historic trend
ggplot(movies_imc, aes(x=startYear, y=difference_IC)) +
geom_point()
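Since this is ‘IMDB - Cinemascore’, positive values are meant to reflect movies that were rated higher on IMDB than by Cinemascore. Keep in mind that Score was coded with A+ as 1 and F as 13, so cine_norm gets larger as the grade gets worse, which inflates this difference for well-graded movies. A possible reason for any remaining gap is that IMDB reviewers do not necessarily reflect everyday Americans the way Cinemascore audiences do.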
Let’s see if there is any correlation between revenue and any of the columns
movies_clean_analysis = movies_clean
movies_clean_analysis$revenue <- str_replace(movies_clean_analysis$revenue, "\\$","")
movies_clean_analysis$revenue <- str_replace_all(movies_clean_analysis$revenue, ",","")
movies_clean_analysis$revenue <- as.numeric(as.character(movies_clean_analysis$revenue))
is.numeric(movies_clean_analysis$revenue)
## [1] TRUE
movies_clean_analysis$metascore <- as.numeric(as.character(movies_clean_analysis$metascore))
## Warning: NAs introduced by coercion
m_full <- lm(revenue ~ genre + rank + averageRating + numVotes + metascore
+ certificate + startYear, data = movies_clean_analysis)
summary(m_full)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.210722e+09 3.016046e+08 -10.6454680 4.390218e-26
## genreAdventure 4.484039e+06 1.018746e+07 0.4401527 6.598525e-01
## genreAnimation 3.519146e+07 1.401923e+07 2.5102275 1.210815e-02
## genreBiography -3.627599e+07 1.289124e+07 -2.8140036 4.918993e-03
## genreComedy -3.247354e+07 1.021574e+07 -3.1787764 1.491313e-03
## genreCrime -3.019845e+07 1.024592e+07 -2.9473622 3.225224e-03
## genreDrama -6.687728e+07 1.088402e+07 -6.1445359 8.878986e-10
## genreFamily 4.796150e+06 1.208565e+07 0.3968465 6.915038e-01
## genreFantasy -9.354689e+06 1.052255e+07 -0.8890135 3.740543e-01
## genreHistory -2.417603e+07 1.550214e+07 -1.5595286 1.189578e-01
## genreHorror -2.491945e+07 1.320335e+07 -1.8873582 5.919122e-02
## genreMusic -1.934022e+07 1.962072e+07 -0.9857040 3.243435e-01
## genreMusical 2.375422e+07 1.782891e+07 1.3323432 1.828304e-01
## genreMystery -3.516312e+07 1.075526e+07 -3.2693898 1.087769e-03
## genreRomance -4.286892e+07 1.030715e+07 -4.1591452 3.267708e-05
## genreSci-Fi -6.404347e+06 1.015297e+07 -0.6307858 5.282200e-01
## genreShort-Film -6.563109e+07 1.127230e+07 -5.8223315 6.302735e-09
## genreSport -3.075350e+07 1.922103e+07 -1.5999921 1.096866e-01
## genreSuperhero -6.563109e+07 1.127230e+07 -5.8223315 6.302735e-09
## genreThriller -3.841325e+07 1.017359e+07 -3.7757802 1.620394e-04
## genreWar -5.533795e+06 1.536245e+07 -0.3602158 7.187066e-01
## genreWestern -9.476416e+06 2.453848e+07 -0.3861859 6.993815e-01
## rank 2.095542e+05 5.283318e+04 3.9663380 7.438208e-05
## averageRating -2.141435e+07 7.096301e+06 -3.0176782 2.564733e-03
## numVotes 2.113088e+02 7.858563e+00 26.8889856 1.723048e-145
## metascore 3.490563e+05 1.657417e+05 2.1060260 3.527003e-02
## certificateG -8.459628e+06 2.883928e+07 -0.2933370 7.692812e-01
## certificateNC-17 -1.335184e+08 8.312035e+07 -1.6063269 1.082884e-01
## certificateNot Ranked -1.315532e+08 3.801418e+07 -3.4606357 5.450632e-04
## certificatePassed 3.904891e+07 3.788738e+07 1.0306573 3.027697e-01
## certificatePG -6.684651e+06 2.869573e+07 -0.2329493 8.158138e-01
## certificatePG-13 -6.162168e+07 2.907186e+07 -2.1196334 3.410410e-02
## certificateR -1.340483e+08 2.879727e+07 -4.6548957 3.356388e-06
## startYear 1.736489e+06 1.462160e+05 11.8761874 6.033102e-32
Interesting. It looks like the two biggest factors for revenue are the number of votes a movie has received (probably because the more popular a movie is, the more money it makes) and its release year, which may be because the revenue numbers weren’t normalized.
The worst predictors were certificate and genre (with the exception of short-film, superhero, and drama).
Let’s see if there is any correlation between average rating and any of the columns
m_full2 <- lm(averageRating ~ genre + rank + revenue + numVotes + metascore
+ certificate + startYear, data = movies_clean_analysis)
summary(m_full2)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.429205e+01 6.721304e-01 21.2638043 1.054244e-94
## genreAdventure 5.093727e-02 2.368597e-02 2.1505253 3.157899e-02
## genreAnimation -6.472015e-01 3.083988e-02 -20.9858647 1.998979e-92
## genreBiography -3.240652e-01 2.954112e-02 -10.9699715 1.425760e-27
## genreComedy 6.028523e-03 2.379865e-02 0.2533136 8.000401e-01
## genreCrime -4.448692e-02 2.385325e-02 -1.8650259 6.225774e-02
## genreDrama 4.047431e-01 2.455601e-02 16.4824415 6.209334e-59
## genreFamily -5.479737e-01 2.661799e-02 -20.5865897 3.404255e-89
## genreFantasy -3.109620e-01 2.393685e-02 -12.9909330 9.302626e-38
## genreHistory -6.341825e-01 3.451989e-02 -18.3715078 3.553693e-72
## genreHorror -9.409480e-01 2.650499e-02 -35.5007860 1.739542e-237
## genreMusic -7.569146e-01 4.390416e-02 -17.2401578 4.232467e-64
## genreMusical -7.451230e-01 3.961719e-02 -18.8080758 2.093771e-75
## genreMystery -4.276857e-01 2.403983e-02 -17.7907124 5.624173e-68
## genreRomance -2.512413e-01 2.367384e-02 -10.6126116 6.178654e-26
## genreSci-Fi -2.217862e-01 2.333512e-02 -9.5043976 3.524315e-21
## genreShort-Film 4.538601e-01 2.525454e-02 17.9714242 2.853992e-69
## genreSport -8.292318e-01 4.258003e-02 -19.4746659 1.863271e-80
## genreSuperhero 4.538601e-01 2.525454e-02 17.9714242 2.853992e-69
## genreThriller 7.483689e-02 2.368180e-02 3.1601017 1.590007e-03
## genreWar -5.953797e-01 3.435846e-02 -17.3284719 1.027205e-64
## genreWestern -8.996366e-01 5.511716e-02 -16.3222606 7.242229e-58
## rank -5.179365e-03 8.855990e-05 -58.4843102 0.000000e+00
## revenue -1.158992e-10 3.840676e-11 -3.0176782 2.564733e-03
## numVotes 2.363066e-07 1.962177e-08 12.0430810 8.720623e-33
## metascore 8.095884e-03 3.618668e-04 22.3725519 5.096855e-104
## certificateG 1.617980e-01 6.703970e-02 2.4134650 1.585046e-02
## certificateNC-17 1.404941e-01 1.934269e-01 0.7263421 4.676755e-01
## certificateNot Ranked 2.484132e-01 8.848624e-02 2.8073658 5.021292e-03
## certificatePassed 8.857746e-02 8.814254e-02 1.0049344 3.149949e-01
## certificatePG 1.837219e-01 6.668966e-02 2.7548790 5.900422e-03
## certificatePG-13 1.642478e-01 6.762032e-02 2.4289706 1.518960e-02
## certificateR 1.991095e-01 6.711186e-02 2.9668299 3.028345e-03
## startYear -3.427057e-03 3.419921e-04 -10.0208659 2.460347e-23
So the most significant factors here are the genre, rank (the most significant, with a p-value so small that it is reported as 0), number of votes, metascore, and start year (apparently the older the movie, the better). Interestingly, revenue was significant but not by much, while the certificates again had little impact.