Python Review of IMDB Data

R Markdown

This is my analysis of the IMDB movie dataset from Kaggle (https://www.kaggle.com/PromptCloudHQ/imdb-data#IMDB-Movie-Data.csv), which covers 1,000 movies released from 2006 to 2016. I will analyze this data through visualizations.

Information about the Data

First, let’s see how many rows and columns the data contains, as well as what type of data is in each column.
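The original code chunks are not shown in this report, so here is a minimal sketch of how this check could be done with pandas. The small `movies` frame is a hypothetical stand-in with made-up rows; the real analysis would load the Kaggle file instead (the file name in the comment is taken from the link above).

```python
import pandas as pd

# Hypothetical stand-in rows (made up for illustration). With the real file:
# movies = pd.read_csv("IMDB-Movie-Data.csv")
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Year": [2014, 2012, 2016],
    "Rating": [8.1, 7.0, 7.3],
    "Revenue (Millions)": [333.13, None, 138.12],
})

print(movies.shape)   # tuple of (number of rows, number of columns)
print(movies.dtypes)  # data type of each column
```

On the full dataset, these two calls would produce output in the same form as the shape and column types listed below.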

## (1000, 12)

## Rank                    int64
## Title                  object
## Genre                  object
## Description            object
## Director               object
## Actors                 object
## Year                    int64
## Runtime (Minutes)       int64
## Rating                float64
## Votes                   int64
## Revenue (Millions)    float64
## Metascore             float64
## dtype: object

Let’s see what the first few rows of data look like.

##    Rank                    Title  ... Revenue (Millions) Metascore
## 0     1  Guardians of the Galaxy  ...             333.13      76.0
## 1     2               Prometheus  ...             126.46      65.0
## 2     3                    Split  ...             138.12      62.0
## 3     4                     Sing  ...             270.32      59.0
## 4     5            Suicide Squad  ...             325.02      40.0
## 
## [5 rows x 12 columns]

Next, let’s check whether there are any NA values.

## Rank                    0
## Title                   0
## Genre                   0
## Description             0
## Director                0
## Actors                  0
## Year                    0
## Runtime (Minutes)       0
## Rating                  0
## Votes                   0
## Revenue (Millions)    128
## Metascore              64
## dtype: int64

Revenue (Millions) has 128 missing values and Metascore has 64. Because missing values would affect our analysis, we drop those rows and rerun the check to make sure none remain.
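A minimal sketch of the count, drop, and recheck steps, assuming pandas and a hypothetical four-row `movies` frame with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full dataset (made-up rows).
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Revenue (Millions)": [333.13, np.nan, 138.12, 270.32],
    "Metascore": [76.0, 65.0, np.nan, 59.0],
})

na_counts = movies.isna().sum()  # missing values per column
print(na_counts)

clean = movies.dropna()          # drop every row with at least one NA
print(clean.isna().sum())        # every column should now report 0
```

Note that `dropna()` removes a row if any column is missing, so a movie with a revenue figure but no Metascore is dropped as well.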

## Rank                  0
## Title                 0
## Genre                 0
## Description           0
## Director              0
## Actors                0
## Year                  0
## Runtime (Minutes)     0
## Rating                0
## Votes                 0
## Revenue (Millions)    0
## Metascore             0
## dtype: int64

First Analysis: Directors’ Total Revenues

Now we are going to dig into the data by looking at the top 10 directors by total revenue, along with their average revenue per film, to see who these directors are and what kind of revenue they bring in, both in total and on average.
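One way to compute these figures is a group-by on the director column. This is a sketch rather than the original code: the director names and revenues below are made up, and the output column names `total`, `average`, and `films` are my own labels.

```python
import pandas as pd

# Hypothetical stand-in data (made-up directors and revenues).
movies = pd.DataFrame({
    "Director": ["Director A", "Director A", "Director B",
                 "Director C", "Director C"],
    "Revenue (Millions)": [100.0, 300.0, 250.0, 50.0, 70.0],
})

by_director = (
    movies.groupby("Director")["Revenue (Millions)"]
    .agg(total="sum", average="mean", films="count")  # one row per director
    .sort_values("total", ascending=False)            # rank by total revenue
    .head(10)                                         # keep the top 10
)
print(by_director)
```

Keeping the film count next to the average helps when reading the chart, since a high average can come from a single hit.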

[Figure: bar chart of the top 10 directors by revenue]

From this analysis, you can see that J.J. Abrams is the top director by total revenue and also has the second-highest average revenue per film. The highest average belongs to Joss Whedon, who is 6th on the list, showing that he has directed some very high-grossing movies.

Movies by Genre Analysis

Another aspect of the data is the genres assigned to each movie. Most movies are grouped into several genres, so we will split these out into one record per movie and genre, which increases the total number of records we are analyzing. Our first visualization shows, by year, which genres were most common in theaters.
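The split could be sketched as follows. This assumes the `Genre` column holds comma-separated labels (the report does not show the raw values), and it uses `explode`, which may differ from how the original code did the split. The rows are made up.

```python
import pandas as pd

# Hypothetical stand-in data; assumes genres are comma-separated strings.
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Action,Adventure,Sci-Fi", "Drama", "Action,Comedy"],
    "Year": [2015, 2015, 2016],
})

# Turn each genre string into a list, then give every genre its own row.
genres = movies.assign(Genre=movies["Genre"].str.split(",")).explode("Genre")

# Count movie-genre records per year; missing combinations become 0.
counts = genres.groupby(["Year", "Genre"]).size().unstack(fill_value=0)
print(counts)
```

Here three movies become six records, which is why counts by genre add up to more than the number of movies.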

[Figure: number of movies per genre by year]

This visualization is a bit overwhelming because of all the different colors from the many genres included in the analysis. Most genres move in line with each other through the years and then all jump in 2016. That jump could come from more genre labels being used to describe each movie rather than from more movies being released. Overall, the visualization doesn’t tell us much about genres and movies over the past 10 years.

Total and Average Revenues by Genres

Digging further into these genres and the money they produce, let’s take a different view of the data. This time we use a bar graph that shows total revenue on one side and average revenue on the other.
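A sketch of such a chart with matplotlib, using two side-by-side panels (the original figure may have been laid out differently). The genre rows are made up, and the `Agg` backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in: one row per movie-genre record (made-up values).
genres = pd.DataFrame({
    "Genre": ["Action", "Action", "Animation", "Drama", "Drama", "Drama"],
    "Revenue (Millions)": [200.0, 100.0, 250.0, 30.0, 60.0, 90.0],
})

by_genre = (
    genres.groupby("Genre")["Revenue (Millions)"]
    .agg(total="sum", average="mean")
    .sort_values("total", ascending=False)
)

# Left panel: total revenue. Right panel: average revenue per movie.
fig, (ax_total, ax_avg) = plt.subplots(1, 2, figsize=(10, 4))
ax_total.bar(by_genre.index, by_genre["total"])
ax_total.set_title("Total revenue (millions)")
ax_avg.bar(by_genre.index, by_genre["average"])
ax_avg.set_title("Average revenue per movie (millions)")
for ax in (ax_total, ax_avg):
    ax.tick_params(axis="x", rotation=45)  # keep genre labels readable
fig.tight_layout()
```

In this toy data, Animation has a lower total than Action but a higher average, which is the kind of pattern the two panels make easy to spot.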

[Figure: bar chart of total and average revenue by genre; x-axis labels: Adventure, Action, Drama, Comedy, Sci-Fi, Fantasy, Thriller, Animation, Crime, Family]

From this visualization, we can see that Adventure and Action movies are the most prevalent and the highest grossing, which is what I would have presumed. They also have a decent average revenue per movie. What is interesting is that Animation movies, while not high in total revenue, have a much higher average revenue per movie. This suggests they are popular, although the data has no budget column, so profitability itself cannot be confirmed.

Genres by Heatmap

Another visualization we will look at is the heatmap, which color-codes each cell according to its value. Below is a heatmap of average revenue by genre and year. This might help us dig into the high Animation average we saw in the previous visualization.
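The grid behind such a heatmap can be built with a pivot table. This sketch only builds the genre-by-year table of averages from made-up rows; drawing it (for example with `seaborn.heatmap` or matplotlib's `imshow`) is left as a comment, since the original plotting code is not shown.

```python
import pandas as pd

# Hypothetical stand-in: one row per movie-genre record (made-up values).
genres = pd.DataFrame({
    "Genre": ["Animation", "Animation", "Action", "Action", "Sport"],
    "Year": [2013, 2013, 2013, 2016, 2009],
    "Revenue (Millions)": [400.0, 300.0, 150.0, 90.0, 255.0],
})

# Rows are genres, columns are years, cells are mean revenue.
heat = genres.pivot_table(
    index="Genre",
    columns="Year",
    values="Revenue (Millions)",
    aggfunc="mean",
)
print(heat)
# A plotting call such as seaborn.heatmap(heat) could then color each cell.
```

Genre and year combinations with no movies come out as NaN, which a heatmap typically leaves blank.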

From this heatmap, we can see that 2013 was a very high-grossing year for Animation. There are some other highlights, like Sport in 2009. In general, it looks like Animation had some very good years for average revenue. Action and Adventure, which are high in total gross dollars, do not have many high-average years, especially in 2016.

To get more detail, below is a snapshot of the 2013 Animation movies, followed by the 2009 Sport movie.

##                     Title    Genre_x  ...  Revenue (Millions) Metascore
## 415                Frozen  Animation  ...                 400      74.0
## 980   Monsters University  Animation  ...                 268      65.0
## 1799      Despicable Me 2  Animation  ...                 368      62.0
## 
## [3 rows x 13 columns]

##               Title Genre_x  Rank  ...   Votes Revenue (Millions) Metascore
## 745  The Blind Side   Sport   311  ...  237221                255      53.0
## 
## [1 rows x 13 columns]

Word Cloud

Now for the last visualization, we’ll take a look at the word cloud of the description text for the top 25 movies to see if there are any interesting trends in words used for very popular movies.
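A word cloud sizes each word by how often it appears, so a plain frequency count shows the same information as a table. The sketch below swaps the word cloud for `collections.Counter`; the two descriptions and the stop-word list are made up for illustration.

```python
from collections import Counter
import re

# Hypothetical descriptions (made up); the real input would be the
# Description column of the top 25 movies.
descriptions = [
    "A hero must stop a dark villain and save the world.",
    "Two friends learn to fight and stop a new threat to the world.",
]

# Small illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "must", "two"}

words = []
for text in descriptions:
    tokens = re.findall(r"[a-z']+", text.lower())  # lowercase word tokens
    words += [w for w in tokens if w not in STOPWORDS]

freq = Counter(words)
print(freq.most_common(5))  # the words a word cloud would draw largest
```

Removing common words first matters, because otherwise words like "the" would dominate the picture.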

[Figure: word cloud of the description text for the top 25 movies]

From this view, there are some interesting words. Stop seems to be the biggest word, along with learn, new, world, fight, dark, help, and heroes. It would seem that the popular movies have a protagonist and an antagonist who must be stopped through a fight. Some specific words like Knight, Joker, and Ultron speak to these antagonists.

Conclusion

Overall, I think these visualizations show some insights that would be hard to uncover by just looking at the raw data. If there were additional columns for these movies, such as the release month and day, you could also look at the seasonality of when the most popular movies are released in theaters.

Eric Rhoades

April 1 2020