The Internet Movie Database (abbreviated IMDb) is an online database of information related to films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews. We decided to analyse IMDb's Top 250 movies to answer a few questions, such as which genres dominate the list and which genres have the highest budget and gross.
IMDb publishes its Top 250 list online, and web scraping is an increasingly popular way to collect such data. We use R because it is easy to use and has a rich repository of packages; rvest is a widely used R package for web scraping. To simplify things further, we used the SelectorGadget add-on for Chrome, which suggests a CSS selector for the page elements we want to extract.
To start, the Top 250 data is stored in a CSV file, ‘IMDB Top 250.csv’. Our analysis begins by reading every movie link from that file and extracting additional information from the IMDb website.
First, we install the required packages and load their libraries.
We use the rvest package to extract the data, along with the XML package, ggplot2 for plotting and stringr for string manipulation.
library("rvest")
## Warning: package 'rvest' was built under R version 3.2.5
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.2.5
library("XML")
## Warning: package 'XML' was built under R version 3.2.5
##
## Attaching package: 'XML'
## The following object is masked from 'package:rvest':
##
## xml
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.2.5
library("stringr")
## Warning: package 'stringr' was built under R version 3.2.5
Now we can load the Top 250 data set from the CSV file. After loading it, we inspect the data set before performing any analysis: the number of observations and variables, the column names, the data types of the variables, and a few sample observations.
setwd("E:\\RDataSet")
top250 <- read.csv("IMDB Top 250.csv", header = TRUE, stringsAsFactors = FALSE)
dim(top250)
## [1] 250 6
head(top250)
## movie.name
## 1 The Shawshank Redemption
## 2 The Godfather
## 3 The Godfather: Part II
## 4 The Dark Knight
## 5 12 Angry Men
## 6 Schindler's List
## movie.cast
## 1 Frank Darabont (dir.), Tim Robbins, Morgan Freeman
## 2 Francis Ford Coppola (dir.), Marlon Brando, Al Pacino
## 3 Francis Ford Coppola (dir.), Al Pacino, Robert De Niro
## 4 Christopher Nolan (dir.), Christian Bale, Heath Ledger
## 5 Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb
## 6 Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes
## movie.link
## 1 http://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1
## 2 http://www.imdb.com/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2
## 3 http://www.imdb.com/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_3
## 4 http://www.imdb.com/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_4
## 5 http://www.imdb.com/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_5
## 6 http://www.imdb.com/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_6
## year votes rating
## 1 1994 1717585 9.2
## 2 1972 1174018 9.2
## 3 1974 804491 9.0
## 4 2008 1703840 8.9
## 5 1957 457238 8.9
## 6 1993 879547 8.9
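The column names and data types mentioned above are not shown by dim() and head() alone. A minimal sketch of how they could be inspected follows; sample250 is a hypothetical stand-in built from three rows of the head() output, so the snippet runs without the CSV file.

```r
# Hypothetical stand-in for top250, built from three rows of the head() output
sample250 <- data.frame(
  movie.name = c("The Shawshank Redemption", "The Godfather", "The Godfather: Part II"),
  year       = c(1994L, 1972L, 1974L),
  votes      = c(1717585L, 1174018L, 804491L),
  rating     = c(9.2, 9.2, 9.0),
  stringsAsFactors = FALSE
)

names(sample250)                       # column names
col.types <- sapply(sample250, class)  # data type of each column
col.types
str(sample250)                         # compact overview of structure and sample values
summary(sample250$rating)              # quick numeric summary of one column
```

The same calls should work unchanged on top250 once it has been read from the file.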
We fetch all movies released from 1996 to 1998, inclusive, and output only the movie name, year of release and rating.
# rows released 1996-1998 (inclusive); columns referenced explicitly instead of via attach()
top250[top250$year >= 1996 & top250$year <= 1998, c("movie.name", "year", "rating")]
## movie.name year rating
## 26 La vita è bella 1997 8.6
## 29 Saving Private Ryan 1998 8.5
## 31 American History X 1998 8.5
## 65 Mononoke-hime 1997 8.4
## 97 L.A. Confidential 1997 8.3
## 107 Good Will Hunting 1997 8.2
## 119 Bacheha-Ye aseman 1997 8.2
## 136 Lock, Stock and Two Smoking Barrels 1998 8.2
## 151 The Big Lebowski 1998 8.1
## 155 Trainspotting 1996 8.1
## 157 Fargo 1996 8.1
## 206 The Truman Show 1998 8.0
We use CSS selectors to extract the attributes of each movie from its IMDb page. The page structure is not consistent across movies, and some attributes, such as the tagline, budget and gross, are missing for some movies. To handle those cases we use the MissingVal function, which returns the string ‘NA’ when no data is found for an attribute.
MissingVal <- function(arg) {
  # a selector that matched nothing yields a zero-length vector
  if (length(arg) == 0) {
    arg <- "NA"
  }
  return(arg)
}
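To see why the helper is needed, here is a small sketch of what rvest returns when a selector matches nothing. The HTML fragment and the .title and .tagline selectors are made up for illustration; they are not IMDb's markup.

```r
library(rvest)

# Same helper as defined above
MissingVal <- function(arg) {
  if (length(arg) == 0) {
    arg <- "NA"
  }
  return(arg)
}

# Hypothetical page fragment: it has a title but no tagline element
page <- read_html("<html><body><h1 class='title'>Some Movie</h1></body></html>")

title   <- html_text(html_nodes(page, ".title"))
tagline <- html_text(html_nodes(page, ".tagline"))  # selector matches nothing

length(tagline)              # 0: a zero-length character vector
MissingVal(title)            # "Some Movie", returned unchanged
MissingVal(tagline)          # the string "NA"
is.na(MissingVal(tagline))   # FALSE: the string "NA" is not R's missing value
```

Because the helper returns the text "NA" rather than a true NA, later numeric steps need to treat that string as missing explicitly.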
For our analysis we need the genre, director, stars, year, budget and gross of each movie. The tagline and (partial) storyline help us understand the movie better, so we pull those as well. A movie can have more than one genre, so we capture every genre by storing one row per genre. The final data set can therefore have more than 250 rows.
movie.df <- data.frame(NULL)
for (i in seq_len(nrow(top250))) {
  # HTML page of each movie
  movie.page <- read_html(as.character(top250$movie.link[i]))
  # movie name from the top250 data set
  movie.name <- top250$movie.name[i]
  # director(s), concatenated with ","
  movie.director <- html_nodes(movie.page, ".summary_text+ .credit_summary_item .itemprop")
  director <- paste(html_text(movie.director), collapse = ",")
  # cast, concatenated with " ,"
  cast <- html_nodes(movie.page, ".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop")
  cast <- paste(html_text(cast), collapse = " ,")
  # tagline
  movie.tagline <- html_nodes(movie.page, "#titleStoryLine .txt-block:nth-child(8)")
  movie.tagline <- MissingVal(
    gsub("\\s+$", "",                          # remove trailing whitespace
      gsub("(.*Taglines:\n|See more.*)", "",   # remove the "Taglines:" label and the "See more" text
        html_text(movie.tagline)
    )))
  # storyline
  movie.storyline <- html_node(movie.page, "#titleStoryLine p")
  movie.storyline <- MissingVal(
    gsub("\n|\\s+$", "",                       # remove newlines and trailing whitespace
      gsub("Written by.*", "",                 # remove the "Written by" credit at the end
        html_text(movie.storyline)
    )))
  # genres: keep only the links that point to a /genre/ page
  movie.genre <- html_nodes(movie.page, ".canwrap a")
  movie.genre <- movie.genre[grepl("/genre/", sapply(html_attrs(movie.genre), `[[`, 'href'))]
  movie.genre <- str_trim(html_text(movie.genre))
  # budget and gross: locate the matching blocks in the details section
  movie.budget <- html_nodes(movie.page, "#titleDetails .txt-block")
  budgetPos <- which(grepl("Budget:", movie.budget))
  GrossPos <- which(grepl("Gross:", movie.budget))
  movieBudget <- html_text(movie.budget[budgetPos])
  movieGross <- html_text(movie.budget[GrossPos])
  # extract the amount with its currency prefix ($, £ or INR)
  budgetMatch <- regexpr("(\\$|£|INR\\s+)(\\d+,)+\\d+", movieBudget)
  movie.budget <- MissingVal(regmatches(movieBudget, budgetMatch))
  grossMatch <- regexpr("(\\$|£|INR\\s+)(\\d+,)+\\d+", movieGross)
  movie.gross <- MissingVal(regmatches(movieGross, grossMatch))
  # one row per genre: the single-valued fields are repeated for each genre
  dft <- data.frame(movie.name, director, cast, movie.tagline, movie.genre, movie.storyline, movie.budget, movie.gross)
  movie.df <- rbind(movie.df, dft)
}
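The one-row-per-genre layout comes from data.frame() recycling length-one values against the genre vector. The sketch below illustrates that with made-up values, and also shows an alternative to calling rbind() inside the loop: collecting the pieces in a list and binding once. The build_rows helper and the movie names are hypothetical.

```r
# Hypothetical values standing in for one scraped movie page
movie.name  <- "Example Movie"
movie.genre <- c("Crime", "Drama")

# The single name is recycled to match the two genres: one row per genre
dft <- data.frame(movie.name, movie.genre, stringsAsFactors = FALSE)
nrow(dft)

# Alternative to rbind() inside the loop: build the pieces, then bind them once
build_rows <- function(name, genres) {
  data.frame(movie.name = name, movie.genre = genres, stringsAsFactors = FALSE)
}
pieces <- list(
  build_rows("Example Movie", c("Crime", "Drama")),
  build_rows("Another Example", c("Action", "Sci-Fi", "Thriller"))
)
movie.rows <- do.call(rbind, pieces)
nrow(movie.rows)
```

Binding once avoids copying the growing data frame on every iteration, which may matter more as the number of pages grows.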
We save the extracted data to a CSV file.
write.csv(movie.df, 'IMDB_Movie_Dataframe.csv', row.names = TRUE)
We are interested in how many movies of each genre appear in the IMDb Top 250 list, so we create a frequency table of the genres.
table(movie.df$movie.genre)
##
## Crime Drama Action Thriller Biography History Adventure
## 56 176 36 63 26 17 62
## Fantasy Western Sci-Fi Comedy Mystery Family War
## 33 10 31 42 36 24 30
## Animation Romance Horror Music Musical Film-Noir Sport
## 20 24 5 2 5 7 7
We are also interested in which genres appear most often in the IMDb Top 250 list and which appear least often.
The plot shows that Drama, with 176 entries, appears far more often than any other genre, nearly five times as often as Action (36). Music (2) is the rarest genre, and Horror, Musical, Film-Noir and Sport each appear fewer than ten times.
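The plotting code is not included here, so the following is only a sketch of how such a chart could be drawn with ggplot2. The counts are copied from the table(movie.df$movie.genre) output; genre.counts and genre.plot are names introduced for this example.

```r
library(ggplot2)

# Genre counts copied from the frequency table output
genre.counts <- data.frame(
  genre = c("Crime", "Drama", "Action", "Thriller", "Biography", "History",
            "Adventure", "Fantasy", "Western", "Sci-Fi", "Comedy", "Mystery",
            "Family", "War", "Animation", "Romance", "Horror", "Music",
            "Musical", "Film-Noir", "Sport"),
  count = c(56, 176, 36, 63, 26, 17, 62, 33, 10, 31, 42, 36, 24, 30,
            20, 24, 5, 2, 5, 7, 7),
  stringsAsFactors = FALSE
)

# Bars ordered by count so the most and least frequent genres are easy to spot
genre.plot <- ggplot(genre.counts, aes(x = reorder(genre, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Genre", y = "Number of Top 250 entries")

print(genre.plot)
```

Flipping the axes keeps the 21 genre labels readable; this is a presentation choice and may differ from the original plot.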
Next, we plot the average budget per genre to see which genres have the largest budgets. Because the budget and gross are missing for some movies, we leave those movies out.
The bar graph shows that Action, Animation, Sci-Fi, Adventure and Fantasy movies have the highest average budgets. This is plausible, since such movies rely heavily on special effects and newer filming technology, which raise production costs.
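The scraped budgets are strings such as "$25,000,000", so they have to be converted to numbers before they can be averaged. A minimal sketch of that step follows, using made-up rows in the same shape as movie.df. The amounts are illustrative only, and to_number, toy.df and avg.budget are names introduced for this example.

```r
# Made-up rows in the same shape as movie.df; the amounts are illustrative only
toy.df <- data.frame(
  movie.genre  = c("Action", "Action", "Drama", "Drama", "Romance"),
  movie.budget = c("$100,000,000", "$50,000,000", "$20,000,000", "NA", "$10,000,000"),
  stringsAsFactors = FALSE
)

# Keep only the digits, then convert; the string "NA" becomes "" and then NA
to_number <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}
toy.df$budget.num <- to_number(toy.df$movie.budget)

# Drop movies without a budget, then average per genre
with.budget <- toy.df[!is.na(toy.df$budget.num), ]
avg.budget  <- aggregate(budget.num ~ movie.genre, data = with.budget, FUN = mean)
avg.budget
```

The same helper could be applied to movie.gross. Note that the scraping regex accepts $, £ and INR amounts, and this sketch drops the currency symbol, so mixed currencies would need converting to a common unit before the averages are comparable.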
Plotting the average gross per genre shows which genres earn the most among the Top 250 movies.
We can conclude that Action, Animation, Sci-Fi and Fantasy movies have a higher average gross than Romance movies.