Project Introduction

Entertainment provides a way to escape the monotony of everyday life. The beauty of entertainment is that there are so many different forms through which to be entertained, whether externally or internally, and for me TV and movies have always been the most consistent one. I see magic in film and television: the magic of being able to exist temporarily in an entirely separate world, to be transported into some alternate reality and experience a story about made-up characters, or about historical events seen through a creative lens. I do not consider myself a movie critic, nor do I have extensive knowledge of how movies are made, but sitting down with friends or family, making a bag of popcorn, and having a movie night is a crucial way for me to decompress from everyday life, and it helps keep me sane. So when presented with the opportunity to choose a topic for a Final Project in an Analytics Programming class, I knew that I wanted to do something related to entertainment, and movies, which have had such a profound impact on my life, seemed like a captivating topic.

The reality of movies today is that the number of choices is insane: not only do we have an overwhelming number of titles to choose from, there is also a whole variety of streaming services on which those titles live. When my friends and I are trying to decide on a movie to watch, the first question is always “Well what streaming service do we want?” This brings me to the point of this project: through an objective lens, how does the “success” of different streaming services’ movies compare? It is challenging to be objective about something as intrinsically subjective as enjoyment of a film. Often, I will watch a movie that has received terrible reviews just out of curiosity, and thoroughly enjoy it. Luckily, we have a variety of platforms that have already made the “success” of a movie measurable, such as IMDB, TMDB, and Rotten Tomatoes. These platforms collect many people’s reviews and feelings about a certain film and aggregate them into a single number, such as a rating from 1 to 10. The quantitative data used in this project consists of scores from IMDB and TMDB. To limit my scope to a reasonable degree, the streaming services I will be attempting to compare are the 3 services that have the largest subscriber count: Netflix, Amazon Prime, and Disney+.

Structured Data Used

The structured data used for this project was found on Kaggle, all wrangled and published by the same source (thank you to Kaggle user Diego Enrique). I will be utilizing 3 different tables, one for each of the 3 streaming services. Much to my delight, each of these tables contains identical columns of data. For the sake of efficiency, this allows me to create a super table that stacks all 3 of the original data tables, with an added column recording which service a particular movie is on. I will conduct analysis using all 4 of these tables. Each Kaggle dataset page lists 2 tables: credits.csv and titles.csv. Only the titles.csv tables were used for this analysis.
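As a rough illustration of the super table idea, here is a minimal Python sketch using only the standard library. The three tiny CSV strings are made-up stand-ins for the real titles.csv files, and the function names (`read_titles`, `build_super_table`) are hypothetical; this is not the code used for the project.

```python
import csv
import io

# Tiny stand-ins for the three titles.csv files. The rows are made up
# for illustration and are NOT taken from the Kaggle data.
NETFLIX_CSV = "id,title,imdb_score\ntm1,Example A,7.1\n"
AMAZON_CSV = "id,title,imdb_score\ntm2,Example B,5.4\n"
DISNEY_CSV = "id,title,imdb_score\ntm3,Example C,6.8\n"


def read_titles(csv_text, service):
    """Parse one titles table and tag every row with its service."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["service"] = service  # the added column
    return rows


def build_super_table(tables):
    """Stack the per-service tables; this works because their columns match."""
    combined = []
    for service, csv_text in tables.items():
        combined.extend(read_titles(csv_text, service))
    return combined


super_table = build_super_table(
    {"Netflix": NETFLIX_CSV, "Amazon": AMAZON_CSV, "Disney": DISNEY_CSV}
)
for row in super_table:
    print(row["service"], row["title"], row["imdb_score"])
```

With the real files, the CSV strings would be replaced by the contents of each service's titles.csv.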

Links to Data Tables:

Netflix Data

Amazon Prime Data

Disney+ Data

Data Dictionary

This data dictionary applies to all 3 of the tables

id: Title ID found on JustWatch, which acts as a primary key for the data

title: Title of the Movie/TV Show

show_type: Is it a MOVIE or TV SHOW

description: A brief description

release_year: The release year

age_certification: PG-13, PG etc

runtime: length of the movie from start to finish

genres: genre or list of genres associated

production_countries: Country(ies) that helped in production of the film

seasons: # of seasons if a show (not used in analysis for this data since we only focused on movies)

imdb_score: score on IMDB (Internet Movie Database)

imdb_votes: number of people who voted on the IMDB score for that film

tmdb_score: score on TMDB (The Movie Database)

tmdb_popularity: ranking model based on a multitude of factors including votes and views for the day (learn more here)

Statistical Analysis

Summary Statistics

For a brief overview of the two quantitative variables in our data, I have included the mean, median, and standard deviation for both the IMDB and TMDB scores. These are the results:

| Streaming Service | Average IMDB Movie Rating | Average TMDB Movie Rating | Median IMDB Movie Rating | Median TMDB Movie Rating | Standard Deviation IMDB Movie Rating | Standard Deviation TMDB Movie Rating |
| --- | --- | --- | --- | --- | --- | --- |
| Amazon | 5.78 | 5.81 | 5.9 | 5.99 | 1.3 | 1.43 |
| Disney | 6.45 | 6.62 | 6.5 | 6.70 | 1.0 | 1.03 |
| Netflix | 6.27 | 6.33 | 6.4 | 6.47 | 1.1 | 1.17 |

From these brief summary statistics, a few things stood out to me. First, it is interesting that for all 3 of the streaming services, the average and median scores from TMDB are marginally higher than those from IMDB. This could potentially be a result of TMDB being a less popular site that is used more by movie fanatics, so the typical TMDB user might enjoy movies slightly more on average than the typical IMDB user. Both the average and median movie ratings for Disney+ are higher than those of the other 2 streaming services, which could be interpreted as the average movie on Disney+ in this dataset being rated slightly higher than the average movie on the other 2. Amazon has the highest standard deviation, which means that there is more variation in the ratings of films on Amazon than on Disney+ or Netflix. This could also be related to the number of movies found in these 3 datasets: Amazon has 9,322 films, Netflix has 3,831, and Disney only has 1,314.
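For readers who want to reproduce this kind of summary, here is a minimal Python sketch of grouping scores by service and computing the mean, median, and standard deviation. The scores are made up, the `summarize` helper is hypothetical, and it assumes the sample standard deviation with missing scores already dropped; the original analysis may have handled either differently.

```python
from statistics import mean, median, stdev

# Made-up (service, imdb_score) pairs; these are not the real data.
rows = [
    ("Amazon", 4.0), ("Amazon", 6.0), ("Amazon", 8.0),
    ("Disney", 6.0), ("Disney", 6.5), ("Disney", 7.0),
]


def summarize(rows):
    """Return mean, median and sample SD of the scores for each service."""
    by_service = {}
    for service, score in rows:
        by_service.setdefault(service, []).append(score)
    return {
        service: {
            "mean": round(mean(scores), 2),
            "median": round(median(scores), 2),
            "sd": round(stdev(scores), 2),  # sample SD (assumption)
        }
        for service, scores in by_service.items()
    }


summary = summarize(rows)
for service, stats in summary.items():
    print(service, stats)
```

In this toy data both services have similar centers, but the wider spread of the Amazon scores shows up as a larger standard deviation, which is the pattern described above.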

IMDB Ratings Distribution

My next analytical tool for viewing the combined data across these 3 platforms is a simple histogram. From this, we can see that the average score is around the 6/10 mark that we saw in the previous table. The histogram is roughly normally distributed, with the bulk of the movies leaning toward the higher end of this ten-point scale and a longer tail toward the low end (technically a slight left, or negative, skew). There are also very few movies that score in the 9-10 range, and more movies in the far lower range.
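The direction of a skew can also be checked numerically rather than read off the histogram. The sketch below, on made-up scores, computes the moment skewness (the mean of the cubed z-scores); a negative value indicates a longer tail on the low end, and in that case the mean typically sits below the median. The `skewness` helper is hypothetical and was not part of the original analysis.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Population (moment) skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Made-up scores: most sit near 6-8, with one very low score in the tail.
scores = [2.0, 5.0, 6.0, 6.0, 7.0, 7.0, 7.0, 8.0]

print("mean:", mean(scores))
print("median:", median(scores))
print("skewness:", round(skewness(scores), 2))
```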

International Production Diversity
| Service | American-Made Movies | Foreign Movies | % of Foreign Movies |
| --- | --- | --- | --- |
| Amazon | 5420 | 3902 | 41.86 |
| Disney | 1202 | 112 | 8.52 |
| Netflix | 1438 | 2393 | 62.46 |

There is a field in this dataset that I found really interesting, called production_countries. The table above compares the movies that were produced in any way within the United States with those produced entirely outside of the US, and the results are fascinating to me. Netflix has by a significant margin the largest relative concentration of international films, whereas Disney produces its films almost exclusively domestically.
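The classification behind this comparison can be sketched in a few lines of Python. This assumes production_countries is stored as a stringified list of country codes such as "['US', 'GB']", which is an assumption about the file format, and the helper names are hypothetical. Note that in this sketch a movie with no countries listed would be counted as foreign; the original analysis may have treated such rows differently. The last lines recompute the percentages from the counts in the table.

```python
import ast


def is_us_made(production_countries):
    """True if 'US' appears anywhere in the stringified country list."""
    return "US" in ast.literal_eval(production_countries)


def count_by_origin(rows):
    """Return {service: (us_count, foreign_count)} for (service, countries) rows."""
    counts = {}
    for service, countries in rows:
        us, foreign = counts.get(service, (0, 0))
        if is_us_made(countries):
            us += 1
        else:
            foreign += 1
        counts[service] = (us, foreign)
    return counts


def foreign_percentage(us_count, foreign_count):
    """Share of movies produced entirely outside the US, in percent."""
    return round(100 * foreign_count / (us_count + foreign_count), 2)


# Made-up rows, only to exercise the classification.
toy_rows = [
    ("Netflix", "['US']"), ("Netflix", "['IN']"),
    ("Netflix", "['GB', 'US']"), ("Netflix", "['KR']"),
]
toy_counts = count_by_origin(toy_rows)
print(toy_counts)

# Recomputing the percentages from the counts reported in the table.
for service, (us, foreign) in {
    "Amazon": (5420, 3902), "Disney": (1202, 112), "Netflix": (1438, 2393)
}.items():
    print(service, foreign_percentage(us, foreign))
```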

Sci-Fi and Fantasy Genre Comparison

My two favorite genres of films are Sci-Fi and Fantasy. In this visualization, I singled out these two genres using a binary variable, with the right box plot showing films that have either a Sci-Fi or a Fantasy genre associated with them. I wanted to see if my personal enjoyment of these two genres is consistent with the ratings that these genres of movies have received on IMDB. The mean score for all other movies is slightly higher than the mean score for my two favorite genres, so on average Sci-Fi and Fantasy movies score marginally lower. The IQR for the Sci-Fi and Fantasy genres seems to be wider as well, meaning that there is slightly more variation among the scores for Sci-Fi and Fantasy movies. There are also 7 outlying movies on the bottom end of the IMDB score spectrum within the Sci-Fi and Fantasy movies, and not a single outlier on the higher scoring end, which I find quite interesting.
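To make the binary variable and the box plot vocabulary concrete, here is a small Python sketch. It assumes the genres column is a stringified list with labels such as 'scifi' and 'fantasy' (an assumption about the file), and it uses the common 1.5 × IQR rule to flag outliers, which is what most box plot implementations draw. The scores are made up and the helper names are hypothetical.

```python
import ast
from statistics import quantiles

TARGET_GENRES = {"scifi", "fantasy"}  # assumed genre labels


def is_scifi_or_fantasy(genres):
    """Binary flag: does the stringified genre list contain either genre?"""
    return bool(TARGET_GENRES & set(ast.literal_eval(genres)))


def iqr_outliers(scores):
    """Return (IQR, low outliers, high outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(scores, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    low = [s for s in scores if s < low_fence]
    high = [s for s in scores if s > high_fence]
    return iqr, low, high


# Made-up scores with one very low-rated movie.
scores = [1.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
iqr, low, high = iqr_outliers(scores)
print("IQR:", iqr, "low outliers:", low, "high outliers:", high)
print(is_scifi_or_fantasy("['scifi', 'drama']"))
```

Quartile conventions differ slightly between tools, so the exact fences from this sketch may not match the ones in the plot.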

Sci-Fi and Fantasy Cross-Service Comparison

Sticking with this comparison of Sci-Fi and Fantasy movies against the rest of the films, I wanted to investigate the average IMDB rating across the 3 different streaming services. For all other genres, the disparity in scores is really small, with the average ratings all hovering around 6. For movies within those 2 genres, however, there is a little bit more of a disparity (namely with Amazon). Amazon seems to have the worst selection of Sci-Fi and Fantasy movies, with Disney at the top of this list. Putting this into perspective with the films on each service, this isn’t very surprising. As of February 1st, 2023, Netflix has acquired the rights to host the Lord of the Rings movies, which gives these ratings a massive boost, as all 3 of the movies fall within the top 5 Sci-Fi and Fantasy movies by IMDB score (Rocketry: The Nambi Effect and The Empire Strikes Back make up the other 2 on the top 5 list). Disney+ being the leader in this list is also not very surprising, as Sci-Fi is absolutely their jam: they host the entire Star Wars and Marvel series.

Movie Review Scraping and Sentiment Scoring

To dive a little deeper into public opinion, I scraped the top 25 reviews from IMDB for each of 6 films. “Top 25” is determined without any filters or sorting applied manually, and is entirely based on IMDB’s internal sorting algorithm. For each review, the data scraped includes the name of the review, the date the review was posted, the movie being reviewed, and all of the text contained within the review. Sticking with my favoritism toward Sci-Fi and Fantasy, and to limit the scope of this analysis so as to be respectful of IMDB’s site and not scrape too much data, I have focused on the highest and lowest scoring Sci-Fi and Fantasy film for each streaming service. The resulting movies whose reviews will be analyzed are as follows:

Netflix:

Highest Rated = The Lord of the Rings: The Return of the King (9/10 Stars)

Lowest Rated = Aerials (1.5/10 Stars)

Amazon Prime Video:

Highest Rated = Rocketry: The Nambi Effect (8.8/10 Stars)

Lowest Rated = Finding Jesus (1.1/10 Stars)

Disney+:

Highest Rated = Star Wars: The Empire Strikes Back (8.7/10 Stars)

Lowest Rated = Kazaam (3.1/10 Stars)

Length of Reviews

The first thing I was curious about was how the success of a movie correlates with how much people have to say about it. Since the 6 movies under investigation sit at the two opposite ends of the spectrum, I wanted to know whether people have more to say about movies they watched and hated, or about movies they watched and loved.

Wow, okay, so people have much more to say about the movies on the higher end of this scale. The power of fandom could have a significant impact on these results. Star Wars and Lord of the Rings have two of the strongest fanbases among Sci-Fi and Fantasy movies. Both are very established and well-known films, so even among those who aren’t avid fans there is still a lot of discussion and information about both of them. Interestingly, reviews of Kazaam run a few hundred words longer on average than reviews of the third highly rated film, Rocketry: The Nambi Effect. This might be due to the fact that Shaquille O’Neal is the star of Kazaam, and he could be a polarizing movie star for some, causing them to talk at greater length about the film. Rocketry: The Nambi Effect was also released in 2022, so there might not be an established fanbase for this movie just yet.
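The review length comparison boils down to counting words per review and averaging by movie. Here is a minimal Python sketch of that calculation; the reviews are made up, splitting on whitespace is a simplifying assumption about what counts as a word, and the helper names are hypothetical.

```python
from statistics import mean

# Made-up (movie, review_text) pairs standing in for the scraped reviews.
reviews = [
    ("Movie A", "A sweeping epic with a finale that earns every minute"),
    ("Movie A", "Still the best of the trilogy"),
    ("Movie B", "Not good"),
]


def word_count(text):
    """Count whitespace-separated words in one review."""
    return len(text.split())


def mean_review_length(reviews):
    """Average number of words per review, for each movie."""
    lengths = {}
    for movie, text in reviews:
        lengths.setdefault(movie, []).append(word_count(text))
    return {movie: mean(counts) for movie, counts in lengths.items()}


average_lengths = mean_review_length(reviews)
print(average_lengths)
```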

NRC Sentiment Scoring

To analyze the actual content of these reviews a little bit more, I wanted to conduct some sentiment scoring that would investigate the emotion contained within these reviews. To do this, I will be using the NRC lexicon. For those unfamiliar, the NRC lexicon is a list of English words, each of which is associated with one or more human emotions (anger, fear, anticipation, trust, surprise, sadness, joy, disgust) and with a sentiment (positive, negative). To start off this analysis, I just wanted to look at the most frequently recurring words across all of these reviews.

| word | n |
| --- | --- |
| luke | 89 |
| story | 81 |
| empire | 68 |
| time | 64 |
| characters | 56 |
| vader | 56 |
| trilogy | 55 |
| acting | 52 |
| scenes | 52 |
| battle | 50 |

Star Wars is absolutely dominating this section. A lot of the Star Wars reviewers needed to touch on Luke, Vader, and the Empire, which isn’t surprising considering that The Empire Strikes Back is the movie with Luke and Vader’s fight scene. We do see some more generic words in this section, although the lower rated movies don’t seem to have enough consistency across their reviews to make it onto this list.
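A word frequency table like this one can be produced by tokenizing the reviews, dropping common stop words, and counting what remains. The following Python sketch shows the idea on two made-up sentences; the stop-word set is a tiny hypothetical list, not the one used in the original analysis, and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

# A tiny hypothetical stop-word list; real analyses use much longer ones.
STOP_WORDS = {"a", "and", "is", "of", "the", "to"}


def top_words(texts, n=10):
    """Return the n most frequent non-stop-words across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up review snippets.
texts = ["Luke and Vader duel.", "The story of Luke is the story of hope."]
most_frequent = top_words(texts, n=2)
print(most_frequent)
```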

The real intrigue of the NRC lexicon is the emotional scoring of texts, so this is what I wanted to look into next. First, let’s just get a basic overview of how sentiment is distributed among these reviews.

NRC Overview
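| Sentiment | Count | Percent of scoreable words |
| --- | --- | --- |
| positive | 1732 | 22.31 |
| negative | 994 | 12.80 |
| trust | 928 | 11.95 |
| anticipation | 874 | 11.26 |
| joy | 831 | 10.70 |
| fear | 625 | 8.05 |
| anger | 551 | 7.10 |
| sadness | 476 | 6.13 |
| disgust | 388 | 5.00 |
| surprise | 366 | 4.71 |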

We can see that there are significantly more positive words used in these reviews than negative ones. As we learned earlier in this document, people have much more to say about a good movie than a bad one. With reviews for Star Wars and LoTR dominating the word count, this goes a long way toward explaining why we have many more positive sentiments than negative. Among the eight emotions, words associated with trust and anticipation occur the most frequently, which makes sense, as effective movies will hold the viewer in anticipation, and trust is a major theme of The Empire Strikes Back. This analysis is more meaningful when split up by movie, so I wanted to dive a little bit deeper into the emotions in the reviews for each of these movies individually.
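The mechanics of lexicon-based scoring can be sketched as a lookup followed by a count. In the Python sketch below, `TOY_LEXICON` is a hypothetical three-word stand-in and its word-to-category associations are invented for illustration; they are not real NRC entries. The percentage helper divides each count by the total number of matches, which is consistent with the percentages in the table (for example, 1732 out of 7765 total matches is 22.31%).

```python
from collections import Counter

# Hypothetical stand-in for the NRC lexicon; NOT real NRC entries.
TOY_LEXICON = {
    "hero": {"positive", "trust"},
    "boring": {"negative"},
    "battle": {"negative", "anger"},
}


def score_tokens(tokens, lexicon):
    """Count how often each sentiment/emotion category is matched."""
    counts = Counter()
    for token in tokens:
        for category in lexicon.get(token, ()):
            counts[category] += 1
    return counts


def as_percentages(counts):
    """Express each category count as a percentage of all matches."""
    total = sum(counts.values())
    return {c: round(100 * n / total, 2) for c, n in counts.items()}


toy_counts = score_tokens(["hero", "boring", "hero", "battle", "the"], TOY_LEXICON)
print(dict(toy_counts))

# Recomputing the percentages from the counts reported in the table.
reported_counts = {
    "positive": 1732, "negative": 994, "trust": 928, "anticipation": 874,
    "joy": 831, "fear": 625, "anger": 551, "sadness": 476,
    "disgust": 388, "surprise": 366,
}
reported_percentages = as_percentages(reported_counts)
print(reported_percentages)
```

Running `score_tokens` separately on the reviews of each film would give a per-movie breakdown.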

NRC per Movie

For each of the 3 highly rated movies, the number of positive words vastly outweighs the number of negative words, which is to be expected. For Finding Jesus and Aerials, we can see that there are more negative words used in the reviews than positive ones, which again makes sense, since people typically use positive words for movies they enjoyed rather than for movies they didn’t. Surprisingly, Kazaam has more positive words than negative, so maybe this movie was enjoyed slightly more than the score suggests. Aerials has fear as the most heavily associated emotion, which makes sense given that it is an alien invasion thriller, and I find it fascinating that we can see a reflection of the genre in the reviews. Finding Jesus has anticipation and anger as its two most associated emotions, and it seems to be the least enjoyed movie out of these 3.

Conclusions

From this investigation into films, we have learned that Disney+ has the highest average IMDB rating for Sci-Fi and Fantasy movies compared to Netflix and Amazon, although Sci-Fi and Fantasy movies as a whole have lower IMDB ratings than other genres. Out of the 3 service providers, Netflix has the highest proportion of internationally sourced movies, with Disney+ producing almost all of their films domestically. Using IMDB movie review data and sentiment scoring, we found that people are more likely to talk at length about films they love than films they dislike. We also analyzed the different emotions and typical words that people used in reviews of Kazaam, Finding Jesus, Aerials, LoTR: The Return of the King, Star Wars: The Empire Strikes Back, and Rocketry: The Nambi Effect. Thank you so much for taking the time to read this. I hope you enjoyed it and found some of it as interesting as I did!