• 1 Introduction
  • 2 Data Preparation
    • 2.1 Data Input
    • 2.2 Data Inspection
    • 2.3 Data Cleansing & Coercions
  • 3 Data Visualization and Analysis
    • 3.1 Top 5 High Gross Movies
    • 3.2 Top 5 High Rated Movies
    • 3.3 Top 5 Highest Gross & Rating Movies
  • 4 Conclusion and Recommendation

1 Introduction

IMDb is an online database of information related to films, television programs, home videos, video games, and online streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.

Data Description

In this project we use the IMDB dataset of the top 1000 movies and TV shows from https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

The dataset contains the following information:

  1. Poster_Link - Link to the poster image used by IMDb
  2. Series_Title - Name of the movie
  3. Released_Year - Year in which the movie was released
  4. Certificate - Certificate earned by the movie
  5. Runtime - Total runtime of the movie
  6. Genre - Genre of the movie
  7. IMDB_Rating - Rating of the movie on the IMDb site
  8. Overview - Short plot summary
  9. Meta_score - Score earned by the movie
  10. Director - Name of the director
  11. Star1 - Name of the first star
  12. Star2 - Name of the second star
  13. Star3 - Name of the third star
  14. Star4 - Name of the fourth star
  15. No_of_Votes - Total number of votes
  16. Gross - Money earned by the movie

Data Shape

  • number of rows: 1000
  • number of columns: 16

Problem Statement

What type of movies are getting good rating and gross amount?

2 Data Preparation

2.1 Data Input

Import the CSV file. Make sure the file is stored in the data_input folder inside the R project directory.

imdb <- read.csv("data_input/imdb_top_1000.csv")
imdb <- subset(imdb, select = -c(Poster_Link,Overview)) #drop "Poster Link" & "Overview" columns
head(imdb)
#>   Series_Title                                   Released_Year  Certificate  Runtime
#>   <chr>                                          <chr>          <chr>        <chr>
#> 1 The Shawshank Redemption                       1994           A            142 min
#> 2 The Godfather                                  1972           A            175 min
#> 3 The Dark Knight                                2008           UA           152 min
#> 4 The Godfather: Part II                         1974           A            202 min
#> 5 12 Angry Men                                   1957           U            96 min
#> 6 The Lord of the Rings: The Return of the King  2003           U            201 min

2.2 Data Inspection

str(imdb) # check the data type of each column
#> 'data.frame':    1000 obs. of  14 variables:
#>  $ Series_Title : chr  "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
#>  $ Released_Year: chr  "1994" "1972" "2008" "1974" ...
#>  $ Certificate  : chr  "A" "A" "UA" "A" ...
#>  $ Runtime      : chr  "142 min" "175 min" "152 min" "202 min" ...
#>  $ Genre        : chr  "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
#>  $ IMDB_Rating  : num  9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
#>  $ Meta_score   : int  80 100 84 90 96 94 94 94 74 66 ...
#>  $ Director     : chr  "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
#>  $ Star1        : chr  "Tim Robbins" "Marlon Brando" "Christian Bale" "Al Pacino" ...
#>  $ Star2        : chr  "Morgan Freeman" "Al Pacino" "Heath Ledger" "Robert De Niro" ...
#>  $ Star3        : chr  "Bob Gunton" "James Caan" "Aaron Eckhart" "Robert Duvall" ...
#>  $ Star4        : chr  "William Sadler" "Diane Keaton" "Michael Caine" "Diane Keaton" ...
#>  $ No_of_Votes  : int  2343110 1620367 2303232 1129952 689845 1642758 1826188 1213505 2067042 1854740 ...
#>  $ Gross        : chr  "28,341,469" "134,966,411" "534,858,444" "57,300,000" ...
colSums(is.na(imdb)) # check the number of missing values in each column
#>  Series_Title Released_Year   Certificate       Runtime         Genre 
#>             0             0             0             0             0 
#>   IMDB_Rating    Meta_score      Director         Star1         Star2 
#>             0           157             0             0             0 
#>         Star3         Star4   No_of_Votes         Gross 
#>             0             0             0             0

From our inspection we can conclude:

  1. The imdb data contains 1,000 rows and 14 columns after dropping 2 columns (Poster_Link and Overview).
  2. Gross, the last (14th) column, is stored as character but should be numeric, so it must be converted before further processing.
  3. Meta_score is the average score given by film critics; 157 movies do not have one.

2.3 Data Cleansing & Coercions

From the data inspection, we need to convert Gross to numeric using as.numeric(). Before we can do that, we need to remove the comma separators using gsub().

imdb$Gross <- as.numeric(gsub(",","",imdb$Gross))

str(imdb)
#> 'data.frame':    1000 obs. of  14 variables:
#>  $ Series_Title : chr  "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
#>  $ Released_Year: chr  "1994" "1972" "2008" "1974" ...
#>  $ Certificate  : chr  "A" "A" "UA" "A" ...
#>  $ Runtime      : chr  "142 min" "175 min" "152 min" "202 min" ...
#>  $ Genre        : chr  "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
#>  $ IMDB_Rating  : num  9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
#>  $ Meta_score   : int  80 100 84 90 96 94 94 94 74 66 ...
#>  $ Director     : chr  "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
#>  $ Star1        : chr  "Tim Robbins" "Marlon Brando" "Christian Bale" "Al Pacino" ...
#>  $ Star2        : chr  "Morgan Freeman" "Al Pacino" "Heath Ledger" "Robert De Niro" ...
#>  $ Star3        : chr  "Bob Gunton" "James Caan" "Aaron Eckhart" "Robert Duvall" ...
#>  $ Star4        : chr  "William Sadler" "Diane Keaton" "Michael Caine" "Diane Keaton" ...
#>  $ No_of_Votes  : int  2343110 1620367 2303232 1129952 689845 1642758 1826188 1213505 2067042 1854740 ...
#>  $ Gross        : num  2.83e+07 1.35e+08 5.35e+08 5.73e+07 4.36e+06 ...

Missing value check:

colSums(is.na(imdb))
#>  Series_Title Released_Year   Certificate       Runtime         Genre 
#>             0             0             0             0             0 
#>   IMDB_Rating    Meta_score      Director         Star1         Star2 
#>             0           157             0             0             0 
#>         Star3         Star4   No_of_Votes         Gross 
#>             0             0             0           169

Missing values appear in Gross after the type conversion. This most likely happens because empty entries were stored as empty strings while the column was character, so is.na() did not count them; as.numeric() turns them into NA.
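The effect can be reproduced on a small toy vector. This is a minimal sketch, not the dataset itself: `gross_raw` and its empty second entry are assumptions made for illustration.

```r
# Toy values; the empty second entry is a hypothetical stand-in for a missing Gross
gross_raw <- c("28,341,469", "", "57,300,000")

# While the column is character, an empty string is not counted as NA
na_before <- sum(is.na(gross_raw))

# Remove the thousands separators, then convert; "" becomes NA
gross_num <- as.numeric(gsub(",", "", gross_raw))
na_after <- sum(is.na(gross_num))

c(before = na_before, after = na_after)
```

Counting `gross_raw == ""` before the conversion is one way to see such entries during inspection.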

Because we cannot analyze films without Gross, we remove the rows where it is missing by keeping only the rows for which is.na() is FALSE.

imdb <- imdb[!is.na(imdb$Gross),]

colSums(is.na(imdb)) #check if there's still any missing value
#>  Series_Title Released_Year   Certificate       Runtime         Genre 
#>             0             0             0             0             0 
#>   IMDB_Rating    Meta_score      Director         Star1         Star2 
#>             0            81             0             0             0 
#>         Star3         Star4   No_of_Votes         Gross 
#>             0             0             0             0
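A note on dropping rows with missing values: negative indexing with which() removes every row when nothing is missing, because -which(...) is then an empty index, whereas a logical mask keeps the data intact. A minimal sketch on toy data (the data frame `df` is hypothetical):

```r
# Toy data frame with no missing values in Gross
df <- data.frame(id = 1:3, Gross = c(10, 20, 30))

# which() returns integer(0) here, and negating it is still integer(0),
# so this selects zero rows instead of keeping all of them
dropped_all <- df[-which(is.na(df$Gross)), ]

# A logical mask behaves as intended in both cases
kept_all <- df[!is.na(df$Gross), ]

c(with_which = nrow(dropped_all), with_mask = nrow(kept_all))
```

The two forms give the same result whenever at least one value is missing, which is the case for the imdb data here.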

Check the data shape:

dim(imdb)
#> [1] 831  14

Finally, we have imdb data with 831 rows and 14 columns, ready to be processed and analyzed.

3 Data Visualization and Analysis

Libraries needed to perform data visualization:

library(ggplot2)
library(scales)

3.1 Top 5 High Gross Movies

  • Which 5 movies have the highest gross?
top5_gross <- imdb[order(imdb$Gross,decreasing = T),][1:5,]
top5_gross
#>     Series_Title                                Released_Year  Certificate  Runtime
#>     <chr>                                       <chr>          <chr>        <chr>
#> 478 Star Wars: Episode VII - The Force Awakens  2015           U            138 min
#> 60  Avengers: Endgame                           2019           UA           181 min
#> 624 Avatar                                      2009           UA           162 min
#> 61  Avengers: Infinity War                      2018           UA           149 min
#> 653 Titanic                                     1997           UA           194 min
summary(top5_gross)
#>  Series_Title       Released_Year      Certificate          Runtime         
#>  Length:5           Length:5           Length:5           Length:5          
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>     Genre            IMDB_Rating     Meta_score     Director        
#>  Length:5           Min.   :7.80   Min.   :68.0   Length:5          
#>  Class :character   1st Qu.:7.80   1st Qu.:75.0   Class :character  
#>  Mode  :character   Median :7.90   Median :78.0   Mode  :character  
#>                     Mean   :8.06   Mean   :76.8                     
#>                     3rd Qu.:8.40   3rd Qu.:80.0                     
#>                     Max.   :8.40   Max.   :83.0                     
#>     Star1              Star2              Star3              Star4          
#>  Length:5           Length:5           Length:5           Length:5          
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>   No_of_Votes          Gross          
#>  Min.   : 809955   Min.   :659325379  
#>  1st Qu.: 834477   1st Qu.:678815482  
#>  Median : 860823   Median :760507625  
#>  Mean   : 934068   Mean   :778736742  
#>  3rd Qu.:1046089   3rd Qu.:858373000  
#>  Max.   :1118998   Max.   :936662225
ggplot(top5_gross, aes(x = Gross,y = reorder(Series_Title, Gross))) +
  geom_col(aes(fill = Gross), show.legend = F) +
  labs(title = "Top 5 Movies based on Gross Revenue",
       x = "Gross Revenue",
       y = NULL) +
  geom_label(aes(label = comma(Gross)), hjust = 1.05) +
  scale_fill_gradient(low = "red", high = "black") +
  theme_minimal()

From the data visualization above, we can conclude:

  1. Top 5 High Gross Movies have an average gross of $778,736,742, with an average rating of 8.06.

  2. Star Wars: Episode VII - The Force Awakens has the highest gross at $936,662,225.

  3. The success of Star Wars is well explained by Investopedia:

    When Star Wars: Episode VII—The Force Awakens was released in 2015, after a gap of 10 years, Walmart ran an advertisement campaign titled “A New Generation of Fans is Born.” The retail behemoth had it right. The longevity and endurance of the Star Wars franchise mean that it spans a broad age range from a 4-year-old kid, newly introduced to the franchise through its latest movie installment, to the 70-year-old Grandpa reliving its thrills with the original. Targeting multiple demographics is also a smart strategy. It evokes a range of emotions—nostalgia, happiness, excitement—in fans.

    source: https://www.investopedia.com/articles/investing/102215/why-star-wars-franchise-so-valuable.asp

  4. 3 out of 5 films are franchise films (Star Wars and the Marvel Cinematic Universe's Avengers). In general, franchise films have large communities and fan bases, which can contribute substantially to gross.

  5. Almost all of the films (4 out of 5) were released after 2000, when the number of cinemas and access to films were increasing every year. This is also a factor in the high gross amounts.

  6. How about Avatar? The key to Avatar’s success lies in its 2009 release, when CGI of that quality was extraordinary compared with what audiences are used to from 2018 onward.

    Avatar came out in 2009 and took the world by surprise at just how far technology had come. It showed everyone (whether they were in the industry or audience) what levels CGI could be taken to. In 2021, it’s no big deal when a movie is mostly computer-generated, but many would argue that Avatar is the film that gave way for every film that followed it to push technology further than before. James Cameron said that this was a movie that he had been planning, creating, and waiting to make for years. What held him up for so long?

    source: https://gamerant.com/avatar-regain-spot-highest-grossing-film/
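The averages and shares quoted in this list can be recomputed directly. The sketch below rebuilds a small data frame from the summary() output above (with five rows, each quartile is one of the observations) and matches the values to titles by their descending order; the data frame `top5` and the title pattern used to flag franchises are assumptions for illustration.

```r
# Gross values read off the summary() output, in descending order
top5 <- data.frame(
  Series_Title = c("Star Wars: Episode VII - The Force Awakens",
                   "Avengers: Endgame",
                   "Avatar",
                   "Avengers: Infinity War",
                   "Titanic"),
  Gross = c(936662225, 858373000, 760507625, 678815482, 659325379),
  stringsAsFactors = FALSE
)

# Average gross of the five films, formatted with thousands separators
mean_gross <- mean(top5$Gross)
formatC(mean_gross, format = "d", big.mark = ",")

# Share of titles matching a simple franchise pattern (Star Wars or Avengers)
franchise_share <- mean(grepl("Star Wars|Avengers", top5$Series_Title))
franchise_share
```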

3.2 Top 5 High Rated Movies

  • Which 5 movies have the highest rating?
top5_rating <- imdb[order(imdb$IMDB_Rating,decreasing = T),][1:5,]
top5_rating
#>   Series_Title              Released_Year  Certificate  Runtime  Genre                 IMDB_Rating
#>   <chr>                     <chr>          <chr>        <chr>    <chr>                 <dbl>
#> 1 The Shawshank Redemption  1994           A            142 min  Drama                 9.3
#> 2 The Godfather             1972           A            175 min  Crime, Drama          9.2
#> 3 The Dark Knight           2008           UA           152 min  Action, Crime, Drama  9.0
#> 4 The Godfather: Part II    1974           A            202 min  Crime, Drama          9.0
#> 5 12 Angry Men              1957           U            96 min   Crime, Drama          9.0
summary(top5_rating)
#>  Series_Title       Released_Year      Certificate          Runtime         
#>  Length:5           Length:5           Length:5           Length:5          
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>     Genre            IMDB_Rating    Meta_score    Director        
#>  Length:5           Min.   :9.0   Min.   : 80   Length:5          
#>  Class :character   1st Qu.:9.0   1st Qu.: 84   Class :character  
#>  Mode  :character   Median :9.0   Median : 90   Mode  :character  
#>                     Mean   :9.1   Mean   : 90                     
#>                     3rd Qu.:9.2   3rd Qu.: 96                     
#>                     Max.   :9.3   Max.   :100                     
#>     Star1              Star2              Star3              Star4          
#>  Length:5           Length:5           Length:5           Length:5          
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>   No_of_Votes          Gross          
#>  Min.   : 689845   Min.   :  4360000  
#>  1st Qu.:1129952   1st Qu.: 28341469  
#>  Median :1620367   Median : 57300000  
#>  Mean   :1617301   Mean   :151965265  
#>  3rd Qu.:2303232   3rd Qu.:134966411  
#>  Max.   :2343110   Max.   :534858444
ggplot(top5_rating, aes(x = IMDB_Rating,y = reorder(Series_Title, IMDB_Rating))) +
  geom_col(aes(fill = IMDB_Rating), show.legend = F) +
  labs(title = "Top 5 Movies based on IMDB Rating",
       x = "IMDB Rating",
       y = NULL) +
  geom_label(aes(label = IMDB_Rating), hjust = 1.05) +
  scale_fill_gradient(low = "red", high = "black") +
  theme_minimal()

From the data visualization above, we can conclude:

  1. The top 5 highest-rated movies have an average gross of $151,965,265 and an average rating of 9.1.
  2. The Shawshank Redemption has the highest rating at 9.3.
  3. As with the top 5 highest-grossing films, franchise films also rate highly, such as The Godfather and Batman (The Dark Knight). So community and fan base are important factors contributing to film rating and revenue.
  4. Unlike the top 5 by gross, almost all of these films (4 out of 5) were released before 2000. Why are old movies rated so high? One possible reason is the narrower spectrum of reviews for these films. Film review sites such as IMDB were only launched in 1993 and became popular from 1996 onward, so before that the spectrum of reviews for films was very limited.
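One way to look at how ratings relate to release year is to average IMDB_Rating per decade. The sketch below uses only the five rows shown in the table above; the helper columns `year` and `decade` are assumptions added for illustration, and Released_Year is converted from character first.

```r
# Years and ratings of the top 5 rated films, copied from the table above
top5 <- data.frame(
  Released_Year = c("1994", "1972", "2008", "1974", "1957"),
  IMDB_Rating   = c(9.3, 9.2, 9.0, 9.0, 9.0),
  stringsAsFactors = FALSE
)

# Released_Year is character in the data, so convert before doing arithmetic
top5$year   <- as.numeric(top5$Released_Year)
top5$decade <- floor(top5$year / 10) * 10

# Share of the five films released before 2000
share_before_2000 <- mean(top5$year < 2000)

# Mean rating per decade
rating_by_decade <- aggregate(IMDB_Rating ~ decade, data = top5, FUN = mean)
rating_by_decade
```

Applied to the full imdb data frame, the same aggregate() call would show whether average ratings actually drift over time.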

3.3 Top 5 Highest Gross & Rating Movies

Based on data on average ratings, the average film review score fell each year and started to level out around 1996, 1998, and 1999. Those are the respective launch years of IMDb, Rotten Tomatoes, and Metacritic. To get a more precise result, the analysis below is restricted to films released after 1996.

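# Released_Year is stored as character, so convert it before comparing numerically
release_year <- suppressWarnings(as.numeric(imdb$Released_Year))
imdb_slice <- imdb[which(release_year > 1996),]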
head(imdb_slice)
#>    Series_Title                                       Released_Year  Certificate  Runtime
#>    <chr>                                              <chr>          <chr>        <chr>
#> 3  The Dark Knight                                    2008           UA           152 min
#> 6  The Lord of the Rings: The Return of the King      2003           U            201 min
#> 9  Inception                                          2010           UA           148 min
#> 10 Fight Club                                         1999           A            139 min
#> 11 The Lord of the Rings: The Fellowship of the Ring  2001           U            178 min
#> 14 The Lord of the Rings: The Two Towers              2002           UA           179 min
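Released_Year is stored as character (see the str() output in 2.2), so a comparison such as Released_Year > 1996 is carried out on strings rather than on numbers. A minimal sketch of a numeric alternative follows; the vector `years_chr` and its non-numeric entry "PG" are hypothetical and only show how such values would be handled.

```r
# Toy years as character; "PG" is a hypothetical malformed entry
years_chr <- c("1994", "2008", "PG", "1999")

# Non-numeric entries become NA (the coercion warning is suppressed on purpose)
years_num <- suppressWarnings(as.numeric(years_chr))

# which() skips the NA positions, so no all-NA rows appear in a subset
keep <- which(years_num > 1996)
years_chr[keep]
```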
  • What is the IMDB rating range on the data?
unique(imdb_slice$IMDB_Rating)
#>  [1] 9.0 8.9 8.8 8.7 8.6 8.5 8.4 8.3 8.2 8.1 8.0 7.9 7.8 7.7 7.6

Based on the results above, the IMDB rating ranges from 7.6 to 9.0. If we assume that a high rating is one above 8.5, the top 5 highest-grossing films among them can be identified with the following command:

imdb_rate <- imdb_slice[imdb_slice$IMDB_Rating > 8.5,]
imdb_top5 <- imdb_rate[order(imdb_rate$Gross,decreasing = T),][1:5,]
imdb_top5
#>    Series_Title                                       Released_Year  Certificate  Runtime
#>    <chr>                                              <chr>          <chr>        <chr>
#> 3  The Dark Knight                                    2008           UA           152 min
#> 6  The Lord of the Rings: The Return of the King      2003           U            201 min
#> 14 The Lord of the Rings: The Two Towers              2002           UA           179 min
#> 11 The Lord of the Rings: The Fellowship of the Ring  2001           U            178 min
#> 9  Inception                                          2010           UA           148 min
summary(imdb_top5)
#>  Series_Title       Released_Year      Certificate          Runtime         
#>  Length:5           Length:5           Length:5           Length:5          
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>     Genre            IMDB_Rating     Meta_score     Director        
#>  Length:5           Min.   :8.70   Min.   :74.0   Length:5          
#>  Class :character   1st Qu.:8.80   1st Qu.:84.0   Class :character  
#>  Mode  :character   Median :8.80   Median :87.0   Mode  :character  
#>                     Mean   :8.84   Mean   :86.2                     
#>                     3rd Qu.:8.90   3rd Qu.:92.0                     
#>                     Max.   :9.00   Max.   :94.0                     
#>     Star1              Star2              Star3              Star4          
#>  Length:5           Length:5           Length:5           Length:5          
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>   No_of_Votes          Gross          
#>  Min.   :1485555   Min.   :292576195  
#>  1st Qu.:1642758   1st Qu.:315544750  
#>  Median :1661481   Median :342551365  
#>  Mean   :1832014   Mean   :372675332  
#>  3rd Qu.:2067042   3rd Qu.:377845905  
#>  Max.   :2303232   Max.   :534858444
ggplot(imdb_top5, aes(x = Gross,y = reorder(Series_Title, Gross))) +
  geom_col(aes(fill = Gross), show.legend = F) +
  labs(title = "Top 5 Movies based on Rating & Gross",
       x = "Gross Revenue",
       y = NULL) +
  geom_label(aes(label = comma(Gross)), hjust = 1.05) +
  scale_fill_gradient(low = "red", high = "black") +
  theme_minimal()

From the data visualization above, we can conclude:

  1. The 5 top movies by rating and gross have an average gross of $372,675,332 and an average rating of 8.84.

  2. The Dark Knight (Batman franchise) has the highest rating and gross. The success of this film cannot be separated from the role of one of its actors, Heath Ledger, whose portrayal of the Joker was voted the best of all time on LADbible.

  3. All of the top 5 films belong to the Action genre.

  4. Both The Dark Knight and Inception were directed by Christopher Nolan.

  5. It’s surprising that the Star Wars and Avengers franchises don’t make the top 5. Although they have high gross, the top 5 highest-grossing films have ratings between 7.8 and 8.4, below the 8.5 cutoff used here.

  6. Animation and CGI don’t seem to be enough to get high ratings. A film needs a rich plot and a lot of character development to attract audiences, as The Lord of the Rings has done.

    The Lord of the Rings movies gave some characters more development and others a greater role in the story. The movies, too, therefore, deserve a lot of credit for bringing the story to a new audience.

    source: https://gamerant.com/lotr-why-still-popular/
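Statements about who directed the top 5 films can be checked by tabulating the Director column. The sketch below is self-contained and uses a hand-typed data frame: the Christopher Nolan entries follow this report, while the Peter Jackson entries for The Lord of the Rings films are typed in from general knowledge rather than read from the dataset.

```r
# Hand-typed stand-in for imdb_top5 (Director values are not read from the CSV)
top5 <- data.frame(
  Series_Title = c("The Dark Knight",
                   "The Lord of the Rings: The Return of the King",
                   "The Lord of the Rings: The Two Towers",
                   "The Lord of the Rings: The Fellowship of the Ring",
                   "Inception"),
  Director = c("Christopher Nolan", "Peter Jackson", "Peter Jackson",
               "Peter Jackson", "Christopher Nolan"),
  stringsAsFactors = FALSE
)

# Number of top-5 films per director
director_counts <- table(top5$Director)
director_counts
```

On the real data, table(imdb_top5$Director) gives the same kind of summary without retyping anything.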

4 Conclusion and Recommendation

1. What type of movies are getting good rating and gross amount?

  • The Dark Knight, The Lord of the Rings, and Inception are the highest-rated and highest-grossing films, with an average gross of $372,675,332 and an average rating of 8.84.
  • The Dark Knight took first place with a gross of $534,858,444 and a rating of 9.0, followed by the 3 The Lord of the Rings films and Inception. The five films are in the Action genre and were released between 2000 and 2010. The absence of films after 2010 could be due to the widening spectrum of audience reviews and the narrowing scope for innovation in films.
  • Some of the success factors of the five films can come from the actors and directors, such as Christopher Nolan, who directed The Dark Knight and Inception, and Heath Ledger, who was voted the best portrayal of the Joker in the Batman franchise. Another factor that is no less important is the story line and character development, as in The Lord of the Rings. Although its gross is not larger than that of the Star Wars or Avengers films, The Lord of the Rings proved its popularity by taking 3 of the top 5 places for rating and gross.

2. Recommendation for Further Analysis

  • Film analysis can be limited by release year to reduce the effects of inflation and of developments in film technology.
  • Film analysis can be done for genres other than Action.
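As a starting point for a per-genre analysis, the comma-separated Genre column can be split so that each film counts once for every genre it lists. This is a minimal sketch on the first four rows shown in the str() output in 2.2; the data frames `toy` and `long` are hypothetical names, and averaging is only one of several possible summaries.

```r
# First four rows of the data, copied from the str() output
toy <- data.frame(
  Series_Title = c("The Shawshank Redemption", "The Godfather",
                   "The Dark Knight", "The Godfather: Part II"),
  Genre        = c("Drama", "Crime, Drama", "Action, Crime, Drama", "Crime, Drama"),
  IMDB_Rating  = c(9.3, 9.2, 9.0, 9.0),
  Gross        = c(28341469, 134966411, 534858444, 57300000),
  stringsAsFactors = FALSE
)

# Split "Crime, Drama" into c("Crime", "Drama") for every film
genre_list <- strsplit(toy$Genre, ",\\s*")

# One row per film-genre pair
long <- data.frame(
  Genre       = unlist(genre_list),
  IMDB_Rating = rep(toy$IMDB_Rating, lengths(genre_list)),
  Gross       = rep(toy$Gross, lengths(genre_list)),
  stringsAsFactors = FALSE
)

# Mean rating and mean gross per genre
genre_summary <- aggregate(cbind(IMDB_Rating, Gross) ~ Genre, data = long, FUN = mean)
genre_summary
```

Because a film with several genres contributes to each of them, the per-genre means are not independent of one another.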