IMDb is an online database of information related to films, television programs, home videos, video games, and online streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.
Data Description
In this project we use the IMDB dataset of the top 1000 movies and TV shows from https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows
The dataset contains the following information:
Data Shape
Problem Statement
What types of movies get both a good rating and a high gross?
Import the CSV data. Make sure the data is placed in the same folder as our R project.
imdb <- read.csv("data_input/imdb_top_1000.csv")
imdb <- subset(imdb, select = -c(Poster_Link, Overview)) # drop the "Poster_Link" and "Overview" columns
head(imdb)
| | Series_Title <chr> | Released_Year <chr> | Certificate <chr> | Runtime <chr> |
|---|---|---|---|---|
| 1 | The Shawshank Redemption | 1994 | A | 142 min |
| 2 | The Godfather | 1972 | A | 175 min |
| 3 | The Dark Knight | 2008 | UA | 152 min |
| 4 | The Godfather: Part II | 1974 | A | 202 min |
| 5 | 12 Angry Men | 1957 | U | 96 min |
| 6 | The Lord of the Rings: The Return of the King | 2003 | U | 201 min |
str(imdb) # check the data type of each column
#> 'data.frame': 1000 obs. of 14 variables:
#> $ Series_Title : chr "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
#> $ Released_Year: chr "1994" "1972" "2008" "1974" ...
#> $ Certificate : chr "A" "A" "UA" "A" ...
#> $ Runtime : chr "142 min" "175 min" "152 min" "202 min" ...
#> $ Genre : chr "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
#> $ IMDB_Rating : num 9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
#> $ Meta_score : int 80 100 84 90 96 94 94 94 74 66 ...
#> $ Director : chr "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
#> $ Star1 : chr "Tim Robbins" "Marlon Brando" "Christian Bale" "Al Pacino" ...
#> $ Star2 : chr "Morgan Freeman" "Al Pacino" "Heath Ledger" "Robert De Niro" ...
#> $ Star3 : chr "Bob Gunton" "James Caan" "Aaron Eckhart" "Robert Duvall" ...
#> $ Star4 : chr "William Sadler" "Diane Keaton" "Michael Caine" "Diane Keaton" ...
#> $ No_of_Votes : int 2343110 1620367 2303232 1129952 689845 1642758 1826188 1213505 2067042 1854740 ...
#> $ Gross : chr "28,341,469" "134,966,411" "534,858,444" "57,300,000" ...
colSums(is.na(imdb)) # check the number of missing values in each column
#> Series_Title Released_Year Certificate Runtime Genre
#> 0 0 0 0 0
#> IMDB_Rating Meta_score Director Star1 Star2
#> 0 157 0 0 0
#> Star3 Star4 No_of_Votes Gross
#> 0 0 0 0
From our inspection we can conclude:
The imdb data contains 1,000 rows and 14 columns after we drop 2 columns (“Poster_Link” and “Overview”).
Gross should have a numeric data type, so it must be converted before the data is processed.
Following the data inspection, we need to convert Gross to numeric using as.numeric(). Before we can do that, we need to remove the comma separators using gsub().
imdb$Gross <- as.numeric(gsub(",", "", imdb$Gross)) # strip the commas, then convert to numeric
str(imdb)
#> 'data.frame': 1000 obs. of 14 variables:
#> $ Series_Title : chr "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
#> $ Released_Year: chr "1994" "1972" "2008" "1974" ...
#> $ Certificate : chr "A" "A" "UA" "A" ...
#> $ Runtime : chr "142 min" "175 min" "152 min" "202 min" ...
#> $ Genre : chr "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
#> $ IMDB_Rating : num 9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
#> $ Meta_score : int 80 100 84 90 96 94 94 94 74 66 ...
#> $ Director : chr "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
#> $ Star1 : chr "Tim Robbins" "Marlon Brando" "Christian Bale" "Al Pacino" ...
#> $ Star2 : chr "Morgan Freeman" "Al Pacino" "Heath Ledger" "Robert De Niro" ...
#> $ Star3 : chr "Bob Gunton" "James Caan" "Aaron Eckhart" "Robert Duvall" ...
#> $ Star4 : chr "William Sadler" "Diane Keaton" "Michael Caine" "Diane Keaton" ...
#> $ No_of_Votes : int 2343110 1620367 2303232 1129952 689845 1642758 1826188 1213505 2067042 1854740 ...
#> $ Gross : num 2.83e+07 1.35e+08 5.35e+08 5.73e+07 4.36e+06 ...
Missing value check:
colSums(is.na(imdb))
#> Series_Title Released_Year Certificate Runtime Genre
#> 0 0 0 0 0
#> IMDB_Rating Meta_score Director Star1 Star2
#> 0 157 0 0 0
#> Star3 Star4 No_of_Votes Gross
#> 0 0 0 169
Missing values appear in Gross after changing its data type. This happens because the blank entries were previously stored as empty character strings, which is.na() does not count as missing; as.numeric() turns them into NA.
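The effect can be reproduced on a toy vector. This is a minimal sketch, assuming the blank entries in the raw column are empty strings; `gross_raw` and `gross_num` are illustrative names, not objects from the analysis.

```r
# Toy stand-in for the raw Gross column; the empty string mimics a blank CSV cell (assumption)
gross_raw <- c("28,341,469", "", "534,858,444")

# An empty string is a valid character value, so is.na() reports nothing missing
na_before <- sum(is.na(gross_raw))

# Remove the thousands separators, then convert; the empty string becomes NA
gross_num <- as.numeric(gsub(",", "", gross_raw))
na_after <- sum(is.na(gross_num))

print(c(before = na_before, after = na_after))
```

If the blanks should be visible as missing from the start, read.csv() also accepts an na.strings argument that could list the empty string.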
Because we cannot analyze films without a Gross value, we need to find the rows with missing values and remove them from the data frame using -which().
imdb <- imdb[-which(is.na(imdb$Gross)), ]
colSums(is.na(imdb)) # check whether any missing values remain
#> Series_Title Released_Year Certificate Runtime Genre
#> 0 0 0 0 0
#> IMDB_Rating Meta_score Director Star1 Star2
#> 0 81 0 0 0
#> Star3 Star4 No_of_Votes Gross
#> 0 0 0 0
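One caveat with the -which() idiom is worth noting: when no value is missing, which() returns an empty vector, and indexing with it drops every row. The sketch below uses hypothetical toy data frames (`toy`, `no_na`) to compare it with logical indexing, which behaves the same here when NAs exist and stays safe when they do not.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(title = c("A", "B", "C"), Gross = c(10, NA, 30),
                  stringsAsFactors = FALSE)

# Logical indexing keeps the rows where Gross is present
toy_clean <- toy[!is.na(toy$Gross), ]

# Edge case: a data frame with no missing Gross values
no_na <- data.frame(title = c("A", "B"), Gross = c(1, 2),
                    stringsAsFactors = FALSE)
rows_which   <- nrow(no_na[-which(is.na(no_na$Gross)), ])  # empty index selects no rows
rows_logical <- nrow(no_na[!is.na(no_na$Gross), ])         # all rows are kept

print(c(which = rows_which, logical = rows_logical))
```

In this analysis the column does contain missing values, so the result is the same either way; the logical form mainly helps if the code is reused on cleaner data.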
Check the data shape:
dim(imdb)
#> [1] 831 14
Finally, the imdb data contains 831 rows and 14 columns and is ready to be processed and analyzed.
Libraries needed to perform data visualization:
library(ggplot2)
library(scales)
top5_gross <- imdb[order(imdb$Gross, decreasing = TRUE), ][1:5, ]
top5_gross
| | Series_Title <chr> | Released_Year <chr> | Certificate <chr> | Runtime <chr> |
|---|---|---|---|---|
| 478 | Star Wars: Episode VII - The Force Awakens | 2015 | U | 138 min |
| 60 | Avengers: Endgame | 2019 | UA | 181 min |
| 624 | Avatar | 2009 | UA | 162 min |
| 61 | Avengers: Infinity War | 2018 | UA | 149 min |
| 653 | Titanic | 1997 | UA | 194 min |
summary(top5_gross)
#> Series_Title Released_Year Certificate Runtime
#> Length:5 Length:5 Length:5 Length:5
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> Genre IMDB_Rating Meta_score Director
#> Length:5 Min. :7.80 Min. :68.0 Length:5
#> Class :character 1st Qu.:7.80 1st Qu.:75.0 Class :character
#> Mode :character Median :7.90 Median :78.0 Mode :character
#> Mean :8.06 Mean :76.8
#> 3rd Qu.:8.40 3rd Qu.:80.0
#> Max. :8.40 Max. :83.0
#> Star1 Star2 Star3 Star4
#> Length:5 Length:5 Length:5 Length:5
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> No_of_Votes Gross
#> Min. : 809955 Min. :659325379
#> 1st Qu.: 834477 1st Qu.:678815482
#> Median : 860823 Median :760507625
#> Mean : 934068 Mean :778736742
#> 3rd Qu.:1046089 3rd Qu.:858373000
#> Max. :1118998 Max. :936662225
ggplot(top5_gross, aes(x = Gross,y = reorder(Series_Title, Gross))) +
geom_col(aes(fill = Gross), show.legend = F) +
labs(title = "Top 5 Movies based on Gross Revenue",
x = "Gross Revenue",
y = NULL) +
geom_label(aes(label = comma(Gross)), hjust = 1.05) +
scale_fill_gradient(low = "red", high = "black") +
theme_minimal()
From the data visualization above, we can conclude:
The top 5 highest-grossing movies have an average gross of $778,736,742 and an average rating of 8.06.
Star Wars: Episode VII - The Force Awakens has the highest gross at $936,662,225.
Investopedia explains the success of Star Wars well:
When Star Wars: Episode VII—The Force Awakens was released in 2015, after a gap of 10 years, Walmart ran an advertisement campaign titled “A New Generation of Fans is Born.” The retail behemoth had it right. The longevity and endurance of the Star Wars franchise mean that it spans a broad age range from a 4-year-old kid, newly introduced to the franchise through its latest movie installment, to the 70-year-old Grandpa reliving its thrills with the original. Targeting multiple demographics is also a smart strategy. It evokes a range of emotions—nostalgia, happiness, excitement—in fans.
source: https://www.investopedia.com/articles/investing/102215/why-star-wars-franchise-so-valuable.asp
3 out of 5 films are franchise or series films (Star Wars and the Marvel Cinematic Universe's Avengers). In general, franchise films have large communities and fan bases, which can contribute substantially to the gross.
Four of the five films were released after 2000, a period in which the number of cinemas and access to films increased every year. This is likely also a factor in the high gross amounts.
How about Avatar? The key to Avatar’s success lies in its 2009 release, when CGI of that quality was extraordinary compared with how commonplace it had become by 2018 and later.
Avatar came out in 2009 and took the world by surprise at just how far technology had come. It showed everyone (whether they were in the industry or audience) what levels CGI could be taken to. In 2021, it’s no big deal when a movie is mostly computer-generated, but many would argue that Avatar is the film that gave way for every film that followed it to push technology further than before. James Cameron said that this was a movie that he had been planning, creating, and waiting to make for years. What held him up for so long?
source: https://gamerant.com/avatar-regain-spot-highest-grossing-film/
top5_rating <- imdb[order(imdb$IMDB_Rating, decreasing = TRUE), ][1:5, ]
top5_rating
| | Series_Title <chr> | Released_Year <chr> | Certificate <chr> | Runtime <chr> | Genre <chr> | IMDB_Rating <dbl> |
|---|---|---|---|---|---|---|
| 1 | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 |
| 2 | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 |
| 3 | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 |
| 4 | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 |
| 5 | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 |
summary(top5_rating)
#> Series_Title Released_Year Certificate Runtime
#> Length:5 Length:5 Length:5 Length:5
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> Genre IMDB_Rating Meta_score Director
#> Length:5 Min. :9.0 Min. : 80 Length:5
#> Class :character 1st Qu.:9.0 1st Qu.: 84 Class :character
#> Mode :character Median :9.0 Median : 90 Mode :character
#> Mean :9.1 Mean : 90
#> 3rd Qu.:9.2 3rd Qu.: 96
#> Max. :9.3 Max. :100
#> Star1 Star2 Star3 Star4
#> Length:5 Length:5 Length:5 Length:5
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> No_of_Votes Gross
#> Min. : 689845 Min. : 4360000
#> 1st Qu.:1129952 1st Qu.: 28341469
#> Median :1620367 Median : 57300000
#> Mean :1617301 Mean :151965265
#> 3rd Qu.:2303232 3rd Qu.:134966411
#> Max. :2343110 Max. :534858444
ggplot(top5_rating, aes(x = IMDB_Rating,y = reorder(Series_Title, IMDB_Rating))) +
geom_col(aes(fill = IMDB_Rating), show.legend = F) +
labs(title = "Top 5 Movies based on IMDB Rating",
x = "IMDB Rating",
y = NULL) +
geom_label(aes(label = IMDB_Rating), hjust = 1.05) +
scale_fill_gradient(low = "red", high = "black") +
theme_minimal()
From the data visualization above, we can conclude:
Based on data on average ratings, the average value of film reviews fell each year and started to level out around 1996, 1998, and 1999. Those are the respective launch years for IMDb, Rotten Tomatoes, and Metacritic. To get a more precise analysis result, the imdb data is restricted to films released after 1996.
imdb_slice <- imdb[imdb$Released_Year > 1996, ]
head(imdb_slice)
| | Series_Title <chr> | Released_Year <chr> | Certificate <chr> | Runtime <chr> |
|---|---|---|---|---|
| 3 | The Dark Knight | 2008 | UA | 152 min |
| 6 | The Lord of the Rings: The Return of the King | 2003 | U | 201 min |
| 9 | Inception | 2010 | UA | 148 min |
| 10 | Fight Club | 1999 | A | 139 min |
| 11 | The Lord of the Rings: The Fellowship of the Ring | 2001 | U | 178 min |
| 14 | The Lord of the Rings: The Two Towers | 2002 | UA | 179 min |
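Because Released_Year is stored as character (see the str() output), the comparison with 1996 is carried out on strings rather than numbers. That works for well-formed four-digit years, but any non-numeric entry would be compared alphabetically. The following is a minimal sketch of a numeric comparison; the vector `years` is illustrative, and its last entry is a hypothetical malformed value.

```r
# Illustrative year strings; "PG" stands in for a hypothetical malformed entry
years <- c("1994", "2008", "1999", "PG")

# Convert to numeric; entries that are not numbers become NA (warning suppressed)
year_num <- suppressWarnings(as.numeric(years))

# Keep only rows with a valid year strictly after 1996
keep <- !is.na(year_num) & year_num > 1996

print(data.frame(years, year_num, keep))
```

Applied to the real data, the same logical vector could be used as the row index in place of the character comparison.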
unique(imdb_slice$IMDB_Rating)
#> [1] 9.0 8.9 8.8 8.7 8.6 8.5 8.4 8.3 8.2 8.1 8.0 7.9 7.8 7.7 7.6
Based on the results above, the IMDB rating ranges from 7.6 to 9.0. If we assume that a high rating is one above 8.5, then the top 5 films with the highest gross can be identified through the following commands:
imdb_rate <- imdb_slice[imdb_slice$IMDB_Rating > 8.5, ]
imdb_top5 <- imdb_rate[order(imdb_rate$Gross, decreasing = TRUE), ][1:5, ]
imdb_top5
| | Series_Title <chr> | Released_Year <chr> | Certificate <chr> | Runtime <chr> |
|---|---|---|---|---|
| 3 | The Dark Knight | 2008 | UA | 152 min |
| 6 | The Lord of the Rings: The Return of the King | 2003 | U | 201 min |
| 14 | The Lord of the Rings: The Two Towers | 2002 | UA | 179 min |
| 11 | The Lord of the Rings: The Fellowship of the Ring | 2001 | U | 178 min |
| 9 | Inception | 2010 | UA | 148 min |
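The sort-then-take-five pattern appears several times in this analysis and could be wrapped in a small helper. The sketch below is one possible way to do it; `top_n_by` is a hypothetical name, and the toy data frame is for illustration only. It uses head() instead of [1:5, ] because head() returns only the rows that exist, whereas indexing past the end of a filtered data frame produces rows filled with NA.

```r
# Hypothetical helper: sort a data frame by one column (descending) and keep the first n rows
top_n_by <- function(df, col, n = 5) {
  ordered <- df[order(df[[col]], decreasing = TRUE), , drop = FALSE]
  head(ordered, n)  # never returns more rows than the data frame has
}

# Toy data, for illustration only
toy <- data.frame(title = c("A", "B", "C"), Gross = c(5, 20, 10),
                  stringsAsFactors = FALSE)

top2 <- top_n_by(toy, "Gross", n = 2)  # two highest-grossing rows
top9 <- top_n_by(toy, "Gross", n = 9)  # asks for more rows than exist

print(top2)
print(nrow(top9))
```

With such a helper, a call like top_n_by(imdb_rate, "Gross") would stand in for the indexing expression used above.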
summary(imdb_top5)
#> Series_Title Released_Year Certificate Runtime
#> Length:5 Length:5 Length:5 Length:5
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> Genre IMDB_Rating Meta_score Director
#> Length:5 Min. :8.70 Min. :74.0 Length:5
#> Class :character 1st Qu.:8.80 1st Qu.:84.0 Class :character
#> Mode :character Median :8.80 Median :87.0 Mode :character
#> Mean :8.84 Mean :86.2
#> 3rd Qu.:8.90 3rd Qu.:92.0
#> Max. :9.00 Max. :94.0
#> Star1 Star2 Star3 Star4
#> Length:5 Length:5 Length:5 Length:5
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> No_of_Votes Gross
#> Min. :1485555 Min. :292576195
#> 1st Qu.:1642758 1st Qu.:315544750
#> Median :1661481 Median :342551365
#> Mean :1832014 Mean :372675332
#> 3rd Qu.:2067042 3rd Qu.:377845905
#> Max. :2303232 Max. :534858444
ggplot(imdb_top5, aes(x = Gross,y = reorder(Series_Title, Gross))) +
geom_col(aes(fill = Gross), show.legend = F) +
labs(title = "Top 5 Movies based on Rating & Gross",
x = "Gross Revenue",
y = NULL) +
geom_label(aes(label = comma(Gross)), hjust = 1.05) +
scale_fill_gradient(low = "red", high = "black") +
theme_minimal()
From the data visualization above, we can conclude:
The top 5 movies by rating and gross have an average gross of $372,675,332 and an average rating of 8.84.
The Dark Knight (Batman franchise) has the highest rating and gross. The success of this film cannot be separated from the role of one of its actors, Heath Ledger, whose performance was voted the best portrayal of the Joker of all time at LADbible.
All of the top 5 films belong to the Action genre.
Both The Dark Knight and Inception were directed by Christopher Nolan, while the three The Lord of the Rings films were directed by Peter Jackson.
It’s surprising that the Star Wars and Avengers franchises don’t make the top 5. Although they have a high gross, their ratings do not exceed the 8.5 threshold used here.
Animation and CGI don’t seem to be enough to earn high ratings; a film also needs a rich plot and strong character development to attract audiences, as The Lord of the Rings has done.
The Lord of the Rings movies gave some characters more development and others a greater role in the story. The movies, therefore, deserve a lot of credit for bringing the story to a new audience.
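A claim about shared genres, such as the Action observation above, can be checked directly from the Genre column, which stores several genres in one comma-separated string. The sketch below is a minimal illustration using three Genre strings taken from the str() output; the variable names are assumptions, and in the analysis the input would be imdb_top5$Genre.

```r
# Genre strings as they appear in the str() output earlier in this report
genre <- c("Drama", "Crime, Drama", "Action, Crime, Drama")

# Split each string into its individual genres
genre_list <- strsplit(genre, ", ", fixed = TRUE)

# Count how often each genre occurs across the films
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)

# Check whether every film lists a given genre
all_drama  <- all(vapply(genre_list, function(g) "Drama" %in% g, logical(1)))
all_action <- all(vapply(genre_list, function(g) "Action" %in% g, logical(1)))

print(genre_counts)
print(c(all_drama = all_drama, all_action = all_action))
```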
1. What types of movies get both a good rating and a high gross?
2. Recommendation for Further Analysis