Semester One Capstone

Overview

This project has two goals: to complete the requirements of Data 606 and Data 607. To do so, it combines data engineering and data science. I will pull the top 200 (if available) movies for each genre on IMDB, then look up the same movies on two other movie review sources, Metacritic and Cinemascore. The aim is to see whether IMDB is biased towards a particular genre, whether Metacritic aligns with IMDB, and how both compare with Cinemascore, a company that prides itself on estimating ratings when movies come out.

For this project, I have set some requirements for the movies that I’m going to select. The movies must: 1. Be in English 2. Be ranked by users (meaning the site has some sort of star or number system) 3. Have more than 100,000 reviews 4. Belong to a category with at least 30 movies

The Plan

Extract the Movie Data and Run an ANOVA analysis

IMDB & Metacritic: We are going to scrape the web for the top 100 in each genre
We are going to download the data about movies from IMDB and match it with the titles from the website
Run a Statistical Analysis to see whether these websites are biased towards certain movies

Extra Analysis

We are going to utilize the Cinemascore API to get the movies from there
We are going to explore the data and see what insights we can find in the data

For example, Cinemascore is a professional survey group that estimates the ratings of movies. Let’s see how their estimates hold up against the online reviews.

Load necessary packages

library("base64enc")
library("furrr")
library("data.table")
library("rvest")
library("tidyverse")
library("stringr")
library("dplyr")
library("rjson")
library("stats")
library("openintro")
library("infer")
library("heatmaply")
library("R.utils")
library("psych")
library("ggpubr")
library("rstatix")

Extract the Movie Data and Run an ANOVA analysis

IMDB data

Make a list of the genres we are going to be using to pull in the movie data

genres <- list("Action",    "Adventure",    "Animation",    "Biography",    "Comedy",   "Crime",    "Drama",    "Family",   "Fantasy",  "Film-Noir",    "History",  "Horror",   "Music",    "Musical",  "Mystery",  "Romance",  "Sci-Fi",   "Short-Film",   "Sport",    "Superhero",    "Thriller", "War",  "Western")

Now we have to get a list of all the movie ID numbers

# I like to measure the time it takes to run since sometimes it takes a while and I get distracted
start_time <- Sys.time()
# We are going to use the furrr package to allow for multiple cores to get the data
plan(multisession, workers = 10)

#we are going to save the data as a list called 'titles'
titles <- future_map(genres, function(x) {
  p = str_c('https://www.imdb.com/search/title/?genres=',tolower(x),'&start=1&count=250&languages=en&sort=user_rating,desc&title_type=feature&num_votes=100000,', sep="")
  URL_p = read_html(p)
  
    # Let's get the title id
  title_id <- URL_p  %>% html_nodes( ".lister-item-image > a > img") %>% 
      html_attr("data-tconst")
    # Let's get the genre
  genre <- rep(c(x), times = length(title_id))
    # Let's get the rank
  rank <- (1:length(title_id))
    # Let's add it all to a list
  my_list <- list(title_id, genre, rank)
  
  
})
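end_time <- Sys.time()
# Ask for seconds explicitly so the message always matches the unit
elapsed <- round(as.numeric(difftime(end_time, start_time, units = "secs")))
paste("Your dataframe has been built and it took", elapsed, "seconds to complete.")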

## [1] "Your dataframe has been built and it took 6 seconds to complete."

# Let's turn the list to a Dataframe
title_list <- rbindlist(titles)

#Let's change the columns names
colnames(title_list) <- c("title_id", "genre", "rank") 

#Let's close all the extra workers by switching back to sequential processing
plan(sequential)

Let’s see what we are working with

summary(as.factor(title_list[[2]]))

##     Action  Adventure  Animation  Biography     Comedy      Crime      Drama 
##        250        250        134        126        250        250        250 
##     Family    Fantasy  Film-Noir    History     Horror      Music    Musical 
##        213        250          5         83        190         45         64 
##    Mystery    Romance     Sci-Fi Short-Film      Sport  Superhero   Thriller 
##        246        250        250        250         49        250        250 
##        War    Western 
##         85         29

It looks like every genre except Film-Noir (5) and Western (29) has at least 30 entries. I’m going to leave them in but not take any results from them too seriously.

Next, we are going to scrape every movie page and collect some data about each movie. While doing this I ran into several issues with missing data, such as revenue and Metascore, so I had to add logic to make sure they were on the page before scraping them.

In addition, I originally ran it as a loop, but then found out about the furrr package and realized I could map it and save a lot of time. However, I had to limit the number of cores in use because on the initial trials my IP was flagged as a bot and I was blacklisted.

If this were a larger and more frequent data pull, it would need code that automatically cycles through different VPN IPs so the data could be pulled more quickly without being blocked.
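A gentler alternative to cycling IPs is to slow down and retry when a request fails. The sketch below is not part of the original pull: `with_retry` is a hypothetical helper, and the `flaky` function merely stands in for a request that is blocked a couple of times before it succeeds.

```r
# Hypothetical helper: call `f()` up to `max_tries` times, waiting longer
# after each failure (wait, 2 * wait, 4 * wait, ...).
with_retry <- function(f, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(wait * 2^(attempt - 1))
    }
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Stand-in for a flaky request: fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporarily blocked")
  "page content"
}

page <- with_retry(flaky, max_tries = 5, wait = 0.01)
page
```

In the scraper, the same wrapper could be applied to the page download, for example `with_retry(function() read_html(p))`.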

#Make sure you saved this file in a directory for this to work
# Use the current working directory as the data directory; setwd("..") would also move the session up one level
dataPath <- getwd()
if (file.exists(file.path(dataPath, 'imdb.csv'))) {
  print("Already Exists")
  movies <- read.csv(file.path(dataPath, 'imdb.csv'))
} else {
  start_time <- Sys.time()

  plan(multisession, workers = 7)

  movie_list <- future_map(as.list(title_list$title_id), function(x) {
    p <- str_c('https://www.imdb.com/title/', x, sep = "")
    URL_p <- read_html(p)

    # Let's get the title
    title <- URL_p %>% html_nodes(".star-rating-widget") %>%
      html_attr("data-title")

    # Let's get the title id
    title_id <- x

    # Let's get the certificate
    # (index 22 is hard-coded and depends on the page layout)
    cert <- URL_p %>% html_nodes(".subtext") %>%
      html_text() %>%
      strsplit(" |\\\n")
    certificate <- cert[[1]][[22]]

    # Let's get the metascore, if the page has one
    if (str_detect(URL_p, 'Metascore')) {
      metascore <- URL_p %>% html_nodes(".metacriticScore > span") %>%
        html_text()
    } else {
      metascore <- ""
    }

    # Let's get the revenue, if the page has it
    if (str_detect(URL_p, 'Gross USA')) {
      revenue <- URL_p %>% html_nodes(xpath = '//*[@id="titleDetails"]/div[9]/text()[2]') %>%
        html_text()
      revenue <- str_trim(revenue)
    } else {
      revenue <- ""
    }

    # In order for map to work, we need to return all that information in a list
    list(title, title_id, certificate, metascore, revenue)
  })
  end_time <- Sys.time()
  # Ask for minutes explicitly, and print() so the message shows inside the else block
  elapsed <- round(as.numeric(difftime(end_time, start_time, units = "mins")))
  print(paste("Your dataframe has been built and it took", elapsed, "minutes to complete."))

  # Shut down the extra workers by switching back to sequential processing
  plan(sequential)

  #Convert list to dataframe and change the column titles
  movies <- rbindlist(movie_list)
  colnames(movies) <- c("title",
                      "title_id",
                       "certificate", 
                       "metascore",
                       "revenue")

  # There were a few issues with the data, so we are going to get a unique list of the title ids, fix any that were not pulled in correctly, and fix some of the errors I noticed in each column.
  
  #Get rid of all duplicate rows, since this is just for data mapping
  nrow(movies)
  movies <- movies  %>% distinct(title_id, .keep_all = TRUE)
  nrow(movies)
  
  ## Certificate seems to have an error, it did not pull in 'Not Ranked'
  movies %>% distinct(certificate, .keep_all = TRUE)
  
  l <- movies %>% filter(str_detect(certificate, "Not")) 
  l$certificate <- "Not Ranked"
  clean <- movies %>% filter(!str_detect(certificate, "Not"))
  
  movies <- rbind(clean, l)
  
  ## I also had an issue with some of the revenues
  l <- movies %>% filter(!str_detect(revenue, "\\$")) 
  l$revenue <- ""
  clean <- movies %>% filter(str_detect(revenue, "\\$"))
  
  movies <- rbind(clean, l)
  
  # And an issue where GP is there instead of PG
  l <- movies %>% filter(str_detect(certificate, "GP")) 
  l$certificate <- "PG"
  clean <- movies %>% filter(!str_detect(certificate, "GP"))
  
  movies <- rbind(clean, l)
}

## [1] "Already Exists"

Now, I’m going to pull from IMDB’s movie database, which has more information about each movie, and match it up with what we already have. I pulled two files: one with information about the titles and one with information about the ratings.

# Use the current working directory as the data directory
dataPath <- getwd()

if (file.exists(paste(dataPath,'/title_ratings.tsv',sep = ""))) {
  print("Already Exists")
  df_ratings <- read_tsv(paste(dataPath,'/title_ratings.tsv',sep = ""), na = "\\N", quote = '')
  df_titles <- read_tsv(paste(dataPath,'/titles.tsv',sep = ""), na = "\\N", quote = '')
} else {
  
  #Download files
  download.file(url = "https://datasets.imdbws.com/title.ratings.tsv.gz",
    mode = "wb",
    destfile=file.path(dataPath, "title_ratings.tsv.gz"))

  download.file(url = "https://datasets.imdbws.com/title.basics.tsv.gz",
    mode = "wb",
    destfile=file.path(dataPath, "titles.tsv.gz"))

  setwd(dataPath)

  gunzip("title_ratings.tsv.gz", remove = FALSE)
  gunzip("titles.tsv.gz", remove = FALSE)

  df_ratings <- read_tsv(paste(dataPath,'/title_ratings.tsv',sep = ""), na = "\\N", quote = '')
  df_titles <- read_tsv(paste(dataPath,'/titles.tsv',sep = ""), na = "\\N", quote = '')
}

## [1] "Already Exists"

Now I am going to select my data and combine it all into one large dataset

df_titles <- df_titles %>% select(tconst, startYear, runtimeMinutes)

movies_clean <-  title_list %>% left_join(df_titles, by = c("title_id" = "tconst"))

movies_clean <- movies_clean %>% left_join(df_ratings, by = c("title_id" = "tconst"))

movies_clean <- movies_clean %>% left_join(movies, by = c("title_id" = "title_id"))

rm(df_ratings)
rm(df_titles)

movies_clean <- movies_clean %>% select(title_id, title, genre, rank,averageRating, numVotes, metascore, revenue, certificate, startYear, runtimeMinutes)

ANOVA Test

Write hypotheses for evaluating whether the average rating varies across all 23 genres.

H naught: the average rating is the same for every genre
H alt: at least one genre’s average rating differs from the others
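In symbols, writing $\mu_i$ for the mean IMDB rating of genre $i$ (the numbering of the genres is an arbitrary assumption; any labelling works), the hypotheses read:

```latex
\begin{aligned}
H_0 &: \mu_1 = \mu_2 = \cdots = \mu_{23} \\
H_A &: \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
\end{aligned}
```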

Now let’s get some summary statistics on our data

movies_clean <- movies_clean %>%  mutate_if(sapply(movies_clean, is.character), as.factor)
summary(movies_clean)

##       title_id                     title            genre           rank      
##  tt0103639:   9   Aladdin             :  13   Action   : 250   Min.   :  1.0  
##  tt0129167:   9   Beauty and the Beast:  11   Adventure: 250   1st Qu.: 47.0  
##  tt2380307:   9   The Lion King       :  10   Comedy   : 250   Median :104.0  
##  tt0312004:   8   Coco                :   9   Crime    : 250   Mean   :110.8  
##  tt2096673:   8   The Iron Giant      :   9   Drama    : 250   3rd Qu.:171.5  
##  tt2948356:   8   (Other)             :3951   Fantasy  : 250   Max.   :250.0  
##  (Other)  :3968   NA's                :  16   (Other)  :2519                  
##  averageRating      numVotes         metascore            revenue    
##  Min.   :2.800   Min.   :  99691   72     : 114               : 288  
##  1st Qu.:7.200   1st Qu.: 147222   66     : 113   $210,460,015:   9  
##  Median :7.600   Median : 241593   65     : 109   $217,350,219:   9  
##  Mean   :7.532   Mean   : 367309   68     : 108   $23,180,087 :   9  
##  3rd Qu.:8.000   3rd Qu.: 469186   76     : 108   $341,268,248:   8  
##  Max.   :9.300   Max.   :2385745   (Other):3451   (Other)     :3680  
##                                    NA's   :  16   NA's        :  16  
##    certificate     startYear    runtimeMinutes 
##  R       :1707   Min.   :1921   Min.   : 64.0  
##  PG-13   :1045   1st Qu.:1995   1st Qu.:101.0  
##  PG      : 789   Median :2005   Median :116.0  
##  G       : 264   Mean   :2000   Mean   :119.1  
##  Approved:  73   3rd Qu.:2013   3rd Qu.:132.0  
##  (Other) : 125   Max.   :2021   Max.   :242.0  
##  NA's    :  16

hist(movies_clean$averageRating)

It looks like overall, the data is fine for statistical analysis; we’ll just have to check it at the genre level.

We can see that the overall average rating is 7.532 and the median is 7.6, so the data is slightly left skewed but approximately normal. Now we have to check each genre for three conditions to make sure we can analyze them for bias.

Independence

I’m going to add the assumption here that each reviewer rated a given movie only once, so that no movie has multiple ratings from the same person.

Approximately normal

ggplot(movies_clean, aes(x = averageRating)) +
  geom_histogram(fill = "white", colour = "black", bins = 10) +
  facet_grid(genre ~ ., scales = "free_y", margins = TRUE)

histBy(averageRating ~ genre, data=movies_clean) #formula input

All the movie genres seem to follow an approximately normal distribution.

Constant variance

par(las=2)
boxPlot(movies_clean$averageRating, fact = movies_clean$genre, 
        col = COL[1,2], ylab = "Average Movie Rating")

It looks like all the genres have roughly constant variance.
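The boxplot check can be backed up with numbers. This is a minimal sketch on simulated data (the `toy` data frame is made up and only stands in for `movies_clean`): it compares the per-genre standard deviations and runs Bartlett's test from the stats package. A common rule of thumb treats the variances as similar enough when the largest standard deviation is less than about twice the smallest.

```r
# Toy data standing in for movies_clean (values are simulated, not real ratings)
set.seed(1)
toy <- data.frame(
  genre = rep(c("Action", "Drama", "Horror"), each = 50),
  averageRating = c(rnorm(50, 7.5, 0.5), rnorm(50, 7.9, 0.5), rnorm(50, 7.0, 0.5))
)

# Standard deviation of the ratings within each genre
sd_by_genre <- tapply(toy$averageRating, toy$genre, sd)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(sd_by_genre) / min(sd_by_genre)

# Bartlett's test: the null hypothesis is that all group variances are equal
bt <- bartlett.test(averageRating ~ genre, data = toy)

sd_by_genre
sd_ratio
bt$p.value
```

Bartlett's test is sensitive to departures from normality, so `fligner.test()` (also in stats) may be a safer choice for skewed ratings.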

Now that we validated that the conditions are met, let’s run the analysis.

one.way <- aov(averageRating ~ genre, data = movies_clean)

summary(one.way)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## genre         22  640.9  29.130   106.3 <2e-16 ***
## Residuals   3996 1095.4   0.274                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than the default significance level of 0.05, so we reject the null hypothesis: average ratings differ between genres, which suggests there is bias in this data towards specific genres.
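As a check on the ANOVA table, the F statistic is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom (22 = 23 genres minus 1). Plugging in the rounded values from the output:

```latex
F = \frac{SS_{\text{genre}} / df_{\text{genre}}}{SS_{\text{residual}} / df_{\text{residual}}}
  = \frac{640.9 / 22}{1095.4 / 3996}
  \approx \frac{29.13}{0.274}
  \approx 106.3
```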

Let’s run this with a bootstrap to see if we come to the same conclusion

library("lmboot")
moviesAnova <- ANOVA.boot(averageRating ~ genre, B = 1000, type = "residual", wild.dist = "normal", 
            seed = 3112 , data = movies_clean, keep.boot.resp = FALSE)

## Warning in ANOVA.boot(averageRating ~ genre, B = 1000, type = "residual", : Number of bootstrap samples is recommended to be more than the number of observations.

moviesAnova$`p-values`

## [1] 0

The bootstrap p-value is 0: none of the 1,000 bootstrap samples produced an F statistic as extreme as the observed one. So we again reject the null hypothesis, and there is most definitely bias in the data.

Now that we know and have confirmed that there is bias, the next thing to do is run a ‘pairwise t test’ to see where the values differ. Basically, the ‘pairwise t test’ tests each genre against every other genre, with the null hypothesis that the true mean difference is 0; in this case, that there is no difference between the average ratings of the two genres. I’m going to apply a Bonferroni correction, since there are a lot of genres and therefore a lot of pairs.

The Bonferroni correction is a method to counteract the problem of multiple comparisons. Statistical hypothesis testing is based on rejecting the null hypothesis if the likelihood of the observed data under the null hypothesis is low. If multiple hypotheses are tested, the chance of observing a rare event increases, and therefore the likelihood of incorrectly rejecting a null hypothesis (i.e., making a Type I error) increases.

With the Bonferroni correction, each individual test is judged against a significance level of alpha divided by the number of tests performed (equivalently, each p-value is multiplied by the number of tests before being compared with alpha).
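To make the size of the correction concrete, here is a small sketch using base R's `p.adjust()`. The raw p-values are made up for illustration; only the number of genres comes from this project.

```r
# Number of pairwise comparisons among 23 genres
n_genres <- 23
n_tests <- choose(n_genres, 2)

# Bonferroni-adjusted significance threshold for each individual test
alpha <- 0.05
alpha_per_test <- alpha / n_tests

# Equivalent view: multiply each raw p-value by the number of tests (capped at 1).
# The raw p-values here are made up for illustration.
raw_p <- c(0.00001, 0.0004, 0.02)
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_tests)

n_tests         # 253
alpha_per_test  # about 0.000198
adj_p           # 0.00253 0.10120 1.00000
```

With 253 pairs, a raw p-value of 0.0004 would pass an uncorrected 0.05 threshold but not the corrected one.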

pwc <- movies_clean %>% 
  pairwise_t_test(
    averageRating ~ genre, pool.sd = FALSE,
    p.adjust.method = "bonferroni")

# Keep the pairs that are not significantly different after the Bonferroni adjustment
pwc_matches <- pwc %>% filter(p.adj > .05) %>% arrange(desc(p))

pwc_matches

This shows the genre pairs whose average ratings are not significantly different, along with their p-values. Let’s take a look at that as a boxplot.

pwc1 = pwc_matches %>% select(p,group1)
pwc2 = pwc_matches %>% select(p,group2)
names(pwc2)[2] <- "group1"

pwc_groups <- rbind(pwc1,pwc2)
pwc_groups$group1 <- as.factor(pwc_groups$group1) 


ggboxplot(pwc_groups, x = "group1", y = "p", add = "point") +
  rotate_x_text()

par(las=2)
barplot(sort(summary(pwc_groups$group1), decreasing = T))

Extra Analysis

Cinemascore

To get this data I had to call the ‘unofficial API’. To be transparent, nothing on the website legally prevents me from doing this, although it definitely crosses the line from ‘it’s on the internet’ into a little unethical. Since it is for academic purposes, my moral compass is going to let it slide.

How it was done.

1. While digging through the website, I noticed that each search term required a minimum of two letters; the website would then send these letters, encoded, to an API, which would send back the movie title with the rating information.
2. Once the request was sent, there was a link in the developer tools that showed the results as JSON.
3. I realized the encoded words followed a simple scheme by trying out several combinations of letter patterns (ab: YWI=, ba: YmE=) and noticing that the pattern was being applied on the front end.
4. Since it was happening on the front end, I then dug into the JS to find the encoding method, which was Base64. (Base64 is an encoding rather than encryption, so no key is needed to reproduce it.)
5. Running a few letters and matching them with the site, I confirmed it was Base64.
6. I figured it would be easier to get all movies rather than create a list and search for them, so I created a list that contained all possible two-letter combinations.
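The encoding step and the list of search terms can be sketched as follows. `encode_term` is a hypothetical helper name used only for illustration; it wraps the same `base64encode(charToRaw(...))` call used in the scraping loop.

```r
library("base64enc")

# Encode a search term the same way the site's front end does
encode_term <- function(term) {
  base64encode(charToRaw(term))
}

encode_term("ab")  # "YWI="
encode_term("ba")  # "YmE="

# All 26 * 26 two-letter search terms, built without an explicit loop
pairs <- expand.grid(second = letters, first = letters, stringsAsFactors = FALSE)
combos <- paste0(pairs$first, pairs$second)

length(combos)   # 676
head(combos, 3)  # "aa" "ab" "ac"
```

Building the combinations in one vectorized step avoids growing a list inside nested loops, though for 676 items either approach is fast enough.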

cinemascore <- data.frame(title=character(),
                       grade=character(),
                       year=character()) 


combo <- list()
for (i in letters[1:26]){
  for (j in letters[1:26]) {
    combo <- append(combo, str_c(i,j, sep = ""))  
  }
}

# Use the current working directory as the data directory
dataPath <- getwd()

if (file.exists(file.path(dataPath, 'Cinamascore.csv'))) {
  print("Already Exists")
  # Read the same file that was just checked for, rather than a hard-coded absolute path
  cinemascore <- read.csv(file.path(dataPath, 'Cinamascore.csv'))
} else {
    for (i in unique(combo)) {
    # Read in the names as base64
    query = str_c("https://api.cinemascore.com/guest/search/title/", base64encode(charToRaw(i)), sep="")
    # get the webpage content
    Desc <- read_html(query)  %>%  html_text()
    # Convert that data to a list from JSON
    data <- fromJSON(Desc)
    # Turn the nested list into a dataframe
    cinema <- as.data.frame(do.call(rbind, data))
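    # Append the data to the running dataframe, unless the search had no results
    # (compare only the first title so the condition has length one)
    if (cinema$TITLE[[1]] != 'No Results') {
      cinemascore <- rbind(cinemascore, cinema)
    }  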
  }
  
  cinemascore <- distinct(cinemascore)
  cinemascore <- as.data.frame(lapply(cinemascore, unlist))
  cinemascore <-data.frame(lapply(cinemascore, factor))
  
  
  g <- c("A+","A","A-","B+", "B", "B-","C+","C","C-","D+","D", "D-","F")
  Score <- c(1,2,3,4,5,6,7,8,9,10,11,12,13)
  Grade <- data.frame(g,Score)
  
  
  cinemascore <- cinemascore %>% left_join(Grade, by = c("GRADE" = "g"))
  
}

## [1] "Already Exists"

cinemascore %>% ggplot(aes(x=Score)) +
  geom_bar()

Let’s combine the dataframes

#Match title style with the movies tab
cinemascore$TITLE <- str_to_title(cinemascore$TITLE)

#make it so that both databases can be joined
cinemascore$title_match <- str_replace(cinemascore$TITLE, ", The","")

movies_clean$title_match <- str_replace(movies_clean$title, "The ","")

#make it so that we are not working on individual movies
movies_unique <- movies_clean %>% distinct(title_id, .keep_all = T)

# Join tables
movies_all <-  movies_unique %>% left_join(cinemascore, by = "title_match")

# Filter out unneeded columns
movies_all <- movies_all %>% select(-TITLE, -GRADE, -YEAR, -title_match)

movies_all$metascore <- as.numeric(as.character(movies_all$metascore))

## Warning: NAs introduced by coercion

#Now let's normalize the data
movies_all$imdb_norm <- heatmaply::normalize(movies_all$averageRating, range = c(0, 1))
movies_all$meta_norm <- heatmaply::normalize(movies_all$metascore, range = c(0, 1))
movies_all$cine_norm <- heatmaply::normalize(movies_all$Score, range = c(0, 1))

#Finally, let's filter the data for movies that exist in all three, and also just Metascore and IMDB (since there are more)

movies_imc <- movies_all %>% filter(!is.na(cine_norm) & !is.na(meta_norm))

movies_im <- movies_all %>% filter(!is.na(meta_norm))
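For reference, the normalization step amounts to min-max scaling. The sketch below is a plain base R version, not the heatmaply implementation; `min_max` is a hypothetical helper and the input values are illustrative. It also shows one possible adjustment that the analysis here does not apply: because the Cinemascore `Score` runs from 1 (A+) to 13 (F), subtracting the scaled value from 1 would put it in the same direction as the IMDB and Metacritic scales.

```r
# Min-max scaling: map the smallest value to 0 and the largest to 1,
# ignoring missing values (NA stays NA)
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

ratings <- c(2.8, 7.6, NA, 9.3)   # illustrative IMDB-style ratings
min_max(ratings)

# Cinemascore grades are coded 1 (A+) to 13 (F), so a lower Score is better.
# Flipping the scaled value puts it in the same direction as the ratings.
score <- c(1, 7, 13)              # A+, C+, F
1 - min_max(score)                # 1.0 0.5 0.0
```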

IMDB and MetaScore Analysis

movies_im$difference <- movies_im$imdb_norm - movies_im$meta_norm
mean(movies_im$difference)

## [1] 0.06037913

On the normalized 0 to 1 scale, the average movie scores about 0.06 (6 percentage points) higher on IMDB than on Metacritic. Let’s see what it looks like for each genre.

movies_im %>% group_by(genre) %>% summarise(Average_Difference = mean(difference)) %>% arrange(desc(Average_Difference))

Let’s graph it so that we can see it better.

par(las=2)
boxPlot(movies_im$difference, fact = movies_im$genre, 
        col = COL[1,2], ylab = "Average Movie Rating Difference: IMDB vs Metascore")

## Warning in min(x): no non-missing arguments to min; returning Inf

## Warning in max(x): no non-missing arguments to max; returning -Inf

IMDB and Cinemascore Analysis

For all movies that match up between all three movie databases, I want to see the mean differences between the databases

## difference between IMDB and Metacritic
movies_imc$difference_IM <- movies_imc$imdb_norm - movies_imc$meta_norm
## difference between IMDB and Cinemascore
movies_imc$difference_IC <- movies_imc$imdb_norm - movies_imc$cine_norm
## difference between Metacritic and Cinemascore
movies_imc$difference_MC <- movies_imc$meta_norm - movies_imc$cine_norm
mean(movies_imc$difference_IM)

## [1] 0.07737769

mean(movies_imc$difference_IC)

## [1] 0.4123628

mean(movies_imc$difference_MC)

## [1] 0.3349851

Interestingly, it looks like the biggest difference is between Cinemascore and IMDB. (Because fewer movies matched across all three databases, the difference between IMDB and Metascore has changed from the earlier value.) Let’s dig deeper:

par(las=2)
boxPlot(movies_imc$difference_IC, fact = movies_imc$genre, 
        col = COL[1,2], ylab = "Rating Difference: IMDB vs Cinemascore")

## Warning in min(x): no non-missing arguments to min; returning Inf

## Warning in max(x): no non-missing arguments to max; returning -Inf

## Warning in min(x): no non-missing arguments to min; returning Inf

## Warning in max(x): no non-missing arguments to max; returning -Inf

## Warning in min(x): no non-missing arguments to min; returning Inf

## Warning in max(x): no non-missing arguments to max; returning -Inf

Let’s see if this is a trend overall or a historic trend

ggplot(movies_imc, aes(x=startYear, y=difference_IC)) +
  geom_point()

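Since this is ‘IMDB - Cinemascore’, positive values reflect movies that were rated higher on IMDB than by Cinemascore. One possible reason is that IMDB reviewers do not necessarily reflect everyday Americans the way Cinemascore’s audiences do. Keep in mind, though, that Score codes A+ as 1 and F as 13, so cine_norm runs in the opposite direction from the IMDB and Metacritic scales (assuming the cached file uses the same coding), which by itself pushes this difference up for well-graded movies.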

IMDB Analysis

Let’s see if there is any correlation between revenue and any of the columns

movies_clean_analysis = movies_clean
movies_clean_analysis$revenue <- str_replace(movies_clean_analysis$revenue, "\\$","")
movies_clean_analysis$revenue <- str_replace_all(movies_clean_analysis$revenue, ",","")
movies_clean_analysis$revenue <- as.numeric(as.character(movies_clean_analysis$revenue))
is.numeric(movies_clean_analysis$revenue)

## [1] TRUE

movies_clean_analysis$metascore <- as.numeric(as.character(movies_clean_analysis$metascore))

## Warning: NAs introduced by coercion

m_full <- lm(revenue ~ genre + rank + averageRating + numVotes + metascore 
             + certificate + startYear, data = movies_clean_analysis)
summary(m_full)$coefficient

##                            Estimate   Std. Error     t value      Pr(>|t|)
## (Intercept)           -3.210722e+09 3.016046e+08 -10.6454680  4.390218e-26
## genreAdventure         4.484039e+06 1.018746e+07   0.4401527  6.598525e-01
## genreAnimation         3.519146e+07 1.401923e+07   2.5102275  1.210815e-02
## genreBiography        -3.627599e+07 1.289124e+07  -2.8140036  4.918993e-03
## genreComedy           -3.247354e+07 1.021574e+07  -3.1787764  1.491313e-03
## genreCrime            -3.019845e+07 1.024592e+07  -2.9473622  3.225224e-03
## genreDrama            -6.687728e+07 1.088402e+07  -6.1445359  8.878986e-10
## genreFamily            4.796150e+06 1.208565e+07   0.3968465  6.915038e-01
## genreFantasy          -9.354689e+06 1.052255e+07  -0.8890135  3.740543e-01
## genreHistory          -2.417603e+07 1.550214e+07  -1.5595286  1.189578e-01
## genreHorror           -2.491945e+07 1.320335e+07  -1.8873582  5.919122e-02
## genreMusic            -1.934022e+07 1.962072e+07  -0.9857040  3.243435e-01
## genreMusical           2.375422e+07 1.782891e+07   1.3323432  1.828304e-01
## genreMystery          -3.516312e+07 1.075526e+07  -3.2693898  1.087769e-03
## genreRomance          -4.286892e+07 1.030715e+07  -4.1591452  3.267708e-05
## genreSci-Fi           -6.404347e+06 1.015297e+07  -0.6307858  5.282200e-01
## genreShort-Film       -6.563109e+07 1.127230e+07  -5.8223315  6.302735e-09
## genreSport            -3.075350e+07 1.922103e+07  -1.5999921  1.096866e-01
## genreSuperhero        -6.563109e+07 1.127230e+07  -5.8223315  6.302735e-09
## genreThriller         -3.841325e+07 1.017359e+07  -3.7757802  1.620394e-04
## genreWar              -5.533795e+06 1.536245e+07  -0.3602158  7.187066e-01
## genreWestern          -9.476416e+06 2.453848e+07  -0.3861859  6.993815e-01
## rank                   2.095542e+05 5.283318e+04   3.9663380  7.438208e-05
## averageRating         -2.141435e+07 7.096301e+06  -3.0176782  2.564733e-03
## numVotes               2.113088e+02 7.858563e+00  26.8889856 1.723048e-145
## metascore              3.490563e+05 1.657417e+05   2.1060260  3.527003e-02
## certificateG          -8.459628e+06 2.883928e+07  -0.2933370  7.692812e-01
## certificateNC-17      -1.335184e+08 8.312035e+07  -1.6063269  1.082884e-01
## certificateNot Ranked -1.315532e+08 3.801418e+07  -3.4606357  5.450632e-04
## certificatePassed      3.904891e+07 3.788738e+07   1.0306573  3.027697e-01
## certificatePG         -6.684651e+06 2.869573e+07  -0.2329493  8.158138e-01
## certificatePG-13      -6.162168e+07 2.907186e+07  -2.1196334  3.410410e-02
## certificateR          -1.340483e+08 2.879727e+07  -4.6548957  3.356388e-06
## startYear              1.736489e+06 1.462160e+05  11.8761874  6.033102e-32

Interestingly, it looks like the two biggest factors for revenue are the number of votes a movie has received (probably because the more popular a movie is, the more money it makes) and the year, which may be because the revenue numbers weren’t normalized.

The weakest predictors overall were certificate and genre, although several levels are individually significant (most strongly Drama, Short-Film, and Superhero among genres, and R among certificates).

Let’s see if there is any correlation between average rating and any of the columns

m_full2 <- lm(averageRating ~ genre + rank + revenue + numVotes + metascore 
             + certificate + startYear, data = movies_clean_analysis)
summary(m_full2)$coefficient

##                            Estimate   Std. Error     t value      Pr(>|t|)
## (Intercept)            1.429205e+01 6.721304e-01  21.2638043  1.054244e-94
## genreAdventure         5.093727e-02 2.368597e-02   2.1505253  3.157899e-02
## genreAnimation        -6.472015e-01 3.083988e-02 -20.9858647  1.998979e-92
## genreBiography        -3.240652e-01 2.954112e-02 -10.9699715  1.425760e-27
## genreComedy            6.028523e-03 2.379865e-02   0.2533136  8.000401e-01
## genreCrime            -4.448692e-02 2.385325e-02  -1.8650259  6.225774e-02
## genreDrama             4.047431e-01 2.455601e-02  16.4824415  6.209334e-59
## genreFamily           -5.479737e-01 2.661799e-02 -20.5865897  3.404255e-89
## genreFantasy          -3.109620e-01 2.393685e-02 -12.9909330  9.302626e-38
## genreHistory          -6.341825e-01 3.451989e-02 -18.3715078  3.553693e-72
## genreHorror           -9.409480e-01 2.650499e-02 -35.5007860 1.739542e-237
## genreMusic            -7.569146e-01 4.390416e-02 -17.2401578  4.232467e-64
## genreMusical          -7.451230e-01 3.961719e-02 -18.8080758  2.093771e-75
## genreMystery          -4.276857e-01 2.403983e-02 -17.7907124  5.624173e-68
## genreRomance          -2.512413e-01 2.367384e-02 -10.6126116  6.178654e-26
## genreSci-Fi           -2.217862e-01 2.333512e-02  -9.5043976  3.524315e-21
## genreShort-Film        4.538601e-01 2.525454e-02  17.9714242  2.853992e-69
## genreSport            -8.292318e-01 4.258003e-02 -19.4746659  1.863271e-80
## genreSuperhero         4.538601e-01 2.525454e-02  17.9714242  2.853992e-69
## genreThriller          7.483689e-02 2.368180e-02   3.1601017  1.590007e-03
## genreWar              -5.953797e-01 3.435846e-02 -17.3284719  1.027205e-64
## genreWestern          -8.996366e-01 5.511716e-02 -16.3222606  7.242229e-58
## rank                  -5.179365e-03 8.855990e-05 -58.4843102  0.000000e+00
## revenue               -1.158992e-10 3.840676e-11  -3.0176782  2.564733e-03
## numVotes               2.363066e-07 1.962177e-08  12.0430810  8.720623e-33
## metascore              8.095884e-03 3.618668e-04  22.3725519 5.096855e-104
## certificateG           1.617980e-01 6.703970e-02   2.4134650  1.585046e-02
## certificateNC-17       1.404941e-01 1.934269e-01   0.7263421  4.676755e-01
## certificateNot Ranked  2.484132e-01 8.848624e-02   2.8073658  5.021292e-03
## certificatePassed      8.857746e-02 8.814254e-02   1.0049344  3.149949e-01
## certificatePG          1.837219e-01 6.668966e-02   2.7548790  5.900422e-03
## certificatePG-13       1.642478e-01 6.762032e-02   2.4289706  1.518960e-02
## certificateR           1.991095e-01 6.711186e-02   2.9668299  3.028345e-03
## startYear             -3.427057e-03 3.419921e-04 -10.0208659  2.460347e-23

So the most significant factors here are the genre, rank (the most significant; its p-value is so small that it is reported as 0), number of votes, metascore, and start year (apparently the older the movie, the better). Interestingly, revenue was significant but not by much, while the certificates were at most weakly significant.

Semester One Capstone

Joshua Hummell

3/19/2021
