TOP 250 IMDB Movies

Motivation for this Analysis

The Internet Movie Database (IMDb) is an online database of information about films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews. We set out to analyse the Top 250 movies on IMDb, with a few questions in mind about genre, budget, gross earnings and so on.

Problem Context and Formulation

IMDb publishes a list of its Top 250 movies, and web scraping is becoming a popular way to collect such data. R is our language of choice because it is easy to use and has a rich repository of packages. rvest is a well-known R package for web scraping. To simplify things further, we used the SelectorGadget add-on for Chrome, which gives the CSS selector for whatever piece of information is required.
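As a minimal sketch of how a CSS selector drives rvest, the snippet below parses a small inline HTML fragment rather than a live IMDb page. The markup is a made-up stand-in, and the `.credit_summary_item .itemprop` selector is only modeled on the selectors used later in this analysis; real pages may be structured differently.

```r
library(rvest)  # read_html() is available once rvest is loaded

# Hypothetical HTML fragment standing in for a downloaded movie page
page <- read_html('
  <html><body>
    <div class="credit_summary_item">
      <span class="itemprop">Frank Darabont</span>
    </div>
    <div class="credit_summary_item">
      <span class="itemprop">Tim Robbins</span>
      <span class="itemprop">Morgan Freeman</span>
    </div>
  </body></html>')

# Select every element matching the CSS selector, then pull out its text
nodes  <- html_nodes(page, ".credit_summary_item .itemprop")
people <- html_text(nodes)
people
```

The same two calls, `html_nodes()` followed by `html_text()`, are what the extraction loop in this analysis applies to each movie page.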

To start with, the IMDb Top 250 data has been saved in a CSV file, ‘IMDB Top 250.csv’. Our analysis begins by reading each movie's link from that file and extracting additional information from the IMDb website.

First, we need to install all required packages and load respective libraries.

Packages required

We use the rvest package, along with the XML package, to extract the data, ggplot2 to plot graphs, and stringr for string manipulation.

library("rvest")

## Warning: package 'rvest' was built under R version 3.2.5

## Loading required package: xml2

## Warning: package 'xml2' was built under R version 3.2.5

library("XML")

## Warning: package 'XML' was built under R version 3.2.5

## 
## Attaching package: 'XML'

## The following object is masked from 'package:rvest':
## 
##     xml

library("ggplot2")

## Warning: package 'ggplot2' was built under R version 3.2.5

library("stringr")

## Warning: package 'stringr' was built under R version 3.2.5

We are now ready to load the Top 250 movie data set from the .csv file. After loading it, we inspect the data set in more detail before performing any analysis: the number of observations and variables, the column names, the data types of the variables, a few sample observations and so on.

setwd("E:\\RDataSet")

top250 <- read.csv("IMDB Top 250.csv", header = TRUE, stringsAsFactors = FALSE)

dim(top250)

## [1] 250   6

head(top250)

##                 movie.name
## 1 The Shawshank Redemption
## 2            The Godfather
## 3   The Godfather: Part II
## 4          The Dark Knight
## 5             12 Angry Men
## 6         Schindler's List
##                                               movie.cast
## 1     Frank Darabont (dir.), Tim Robbins, Morgan Freeman
## 2  Francis Ford Coppola (dir.), Marlon Brando, Al Pacino
## 3 Francis Ford Coppola (dir.), Al Pacino, Robert De Niro
## 4 Christopher Nolan (dir.), Christian Bale, Heath Ledger
## 5          Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb
## 6    Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes
##                                                                                                                                                               movie.link
## 1 http://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1
## 2 http://www.imdb.com/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2
## 3 http://www.imdb.com/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_3
## 4 http://www.imdb.com/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_4
## 5 http://www.imdb.com/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_5
## 6 http://www.imdb.com/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0HNPENS738XVV5RCMREM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_6
##   year   votes rating
## 1 1994 1717585    9.2
## 2 1972 1174018    9.2
## 3 1974  804491    9.0
## 4 2008 1703840    8.9
## 5 1957  457238    8.9
## 6 1993  879547    8.9
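The other checks mentioned earlier (column names and data types) could be sketched as follows. To keep the snippet runnable on its own, it uses a small stand-in data frame, `sample250`, built from the first three rows printed above; that name and the reduced set of columns are assumptions made for illustration only.

```r
# Stand-in for top250, built from the first three rows printed above
# (movie.cast and movie.link are omitted for brevity)
sample250 <- data.frame(
  movie.name = c("The Shawshank Redemption", "The Godfather",
                 "The Godfather: Part II"),
  year       = c(1994L, 1972L, 1974L),
  votes      = c(1717585L, 1174018L, 804491L),
  rating     = c(9.2, 9.2, 9.0),
  stringsAsFactors = FALSE
)

names(sample250)           # column names
sapply(sample250, class)   # data type of each column
str(sample250)             # compact overview: dimensions, types, first values
summary(sample250$rating)  # range and quartiles of a numeric column
```

On the real data set the same calls would be applied to `top250` directly.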

Movies Released Between 1996 and 1998

We fetch the details of all movies released between 1996 and 1998, both years inclusive. We output only the movie name, year of release and rating.

# index the data frame directly instead of using attach(), which can mask other objects
top250[top250$year >= 1996 & top250$year <= 1998, c("movie.name", "year", "rating")]

##                              movie.name year rating
## 26                      La vita è bella 1997    8.6
## 29                  Saving Private Ryan 1998    8.5
## 31                   American History X 1998    8.5
## 65                        Mononoke-hime 1997    8.4
## 97                    L.A. Confidential 1997    8.3
## 107                   Good Will Hunting 1997    8.2
## 119                   Bacheha-Ye aseman 1997    8.2
## 136 Lock, Stock and Two Smoking Barrels 1998    8.2
## 151                    The Big Lebowski 1998    8.1
## 155                       Trainspotting 1996    8.1
## 157                               Fargo 1996    8.1
## 206                     The Truman Show 1998    8.0

MissingVal Function

We use CSS selectors to extract the values of different attributes of a movie from IMDb. However, the structure of the web page is not consistent across movies, and some attributes, such as Movie Tagline, Movie Budget and Movie Gross, are missing for some movies. To handle those cases we use the MissingVal function, which returns the string ‘NA’ when there is no data for a particular attribute of a movie.

MissingVal<-function(arg){
  if(length(arg)==0){
    arg<-"NA"
  }
  return(arg)
}
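A short usage sketch of the helper follows. The function is repeated so the snippet runs on its own, and the example inputs are made up. One caveat worth stating: the function returns the string "NA", not R's missing value `NA`.

```r
# Same helper as above, repeated so the snippet runs on its own
MissingVal <- function(arg) {
  if (length(arg) == 0) {
    arg <- "NA"
  }
  return(arg)
}

empty.result <- character(0)     # what html_text() yields when no node matches
MissingVal(empty.result)         # falls back to the string "NA"
MissingVal("An example tagline") # non-empty input is returned unchanged

# The fallback is a character string, so is.na() does not flag it
is.na(MissingVal(empty.result))
```

Any later numeric work on these columns therefore has to treat the literal text "NA" as missing explicitly.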

Data Extraction

To perform our analysis we need the genre, director, stars, year, total budget and gross. The tagline and (partial) storyline help us understand each movie better, so we pull those details too. Each movie can have more than one genre, so we capture all of a movie's genres by storing one row per genre. The final data set can therefore have more than 250 rows.
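The one-row-per-genre layout relies on how `data.frame()` recycles length-one values. A minimal sketch, using an illustrative movie and genre vector rather than scraped data:

```r
# A single movie name is recycled to match the length of the genre vector
genres <- c("Crime", "Drama")   # illustrative genres, not scraped here
one.movie <- data.frame(
  movie.name  = "The Godfather",
  movie.genre = genres,
  stringsAsFactors = FALSE
)

one.movie
nrow(one.movie)  # one row per genre
```

Binding such per-movie frames together is what lets the final data set grow beyond 250 rows.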

movie.df<-data.frame(NULL)

for(i in 1:250){
  
  #html page of each movie
  movie.page<-read_html(as.character(top250$movie.link[i]))
  
  #movie name from dataset top250
  movie.name<-top250$movie.name[i]
  
  #movie_director
  movie.director<-html_nodes(movie.page,".summary_text+ .credit_summary_item .itemprop")
  director<-paste(html_text(movie.director),collapse=",") #concatenating the director's name by ","
  
  #movie_cast
  cast<-html_nodes(movie.page,".credit_summary_item~ .credit_summary_item+ .credit_summary_item .itemprop")
  cast<-paste(html_text(cast),collapse = " ,")
  
  #movie_tagline
  movie.tagline <- html_nodes(movie.page,"#titleStoryLine .txt-block:nth-child(8)")
  movie.tagline<-MissingVal(
                            gsub("\\s+$","", #removing trailing whitespace from the tagline
                                      gsub("(.*Taglines:\n|See more.*)","", #removing the "Taglines:" label and the "See more" text
                                                 html_text(movie.tagline)
                                                 )))
  
  #get the storyline of the movies
  movie.storyline<-html_node(movie.page,"#titleStoryLine p")
  movie.storyline<-MissingVal(
                              gsub("\n|\\s+$","", #removing newlines and trailing whitespace
                                   gsub("Written by.*","", #removing the "Written by" credit at the end
                                           html_text(movie.storyline)
                                        )))
  
  #movie_genre
  movie.genre<-html_nodes(movie.page,".canwrap a")
  movie.genre<-movie.genre[grepl("/genre/",sapply(html_attrs(movie.genre),`[[`,'href'))]
  movie.genre<-str_trim(html_text(movie.genre))
  
  
  #movie_budget
  movie.budget <- html_nodes(movie.page,"#titleDetails .txt-block")
  budgetPos<-which(grepl("Budget:", movie.budget))
  GrossPos<-which(grepl("Gross:", movie.budget))
  movieBudget<-html_text(movie.budget[budgetPos])
  movieGross<-html_text(movie.budget[GrossPos])
  
  budgetMatch<-regexpr("(\\$|£|INR\\s+)(\\d+,)+\\d+",movieBudget)
  movie.budget<-MissingVal(regmatches(movieBudget,budgetMatch))
  
  grossMatch<-regexpr("(\\$|£|INR\\s+)(\\d+,)+\\d+",movieGross)
  movie.gross<-MissingVal(regmatches(movieGross,grossMatch))
  
  
  dft<-data.frame(movie.name,director,cast,movie.tagline,movie.genre,movie.storyline,movie.budget,movie.gross)
  movie.df<-rbind(movie.df,dft)
  
}
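To see what the budget and gross pattern in the loop actually matches, it can be tried on a few strings in isolation. This is only a sketch: `extract_amount` is a hypothetical helper that is not part of the original code, and the input texts are made up.

```r
# Same pattern as in the loop: a currency marker followed by comma-grouped digits
money.pattern <- "(\\$|£|INR\\s+)(\\d+,)+\\d+"

# Hypothetical helper wrapping the regexpr()/regmatches() pair used in the loop
extract_amount <- function(text) {
  hit   <- regexpr(money.pattern, text)
  found <- regmatches(text, hit)        # character(0) when nothing matches
  if (length(found) == 0) "NA" else found
}

extract_amount("Budget: $25,000,000 (estimated)")  # made-up input text
extract_amount("Budget: not listed")               # no amount, falls back to "NA"

# Turning a matched amount into a number by dropping everything except digits
as.numeric(gsub("[^0-9]", "", "$25,000,000"))
```

Because the pattern requires at least one comma-separated group, an amount written without commas (for example "$500") would not match and would end up as "NA".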

Saving the extracted data to a CSV file.

write.csv(movie.df,'IMDB_Movie_Dataframe.csv',row.names = TRUE)

Movie Count versus Genre

We are interested in the number of movies of each genre in the IMDb Top 250 list, so we create a table containing the frequency of each genre.

table(movie.df$movie.genre)

## 
##     Crime     Drama    Action  Thriller Biography   History Adventure 
##        56       176        36        63        26        17        62 
##   Fantasy   Western    Sci-Fi    Comedy   Mystery    Family       War 
##        33        10        31        42        36        24        30 
## Animation   Romance    Horror     Music   Musical Film-Noir     Sport 
##        20        24         5         2         5         7         7

We are also interested in finding out which genres appear most often in the IMDb Top 250 list and which appear least often.

The plot shows that Drama (176 movies) occurs far more often than Action (36) in the IMDb Top 250 list. Music (2) and Sport (7) movies are far less common than most other genres in the list.
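The code for this plot is not shown here. One way such a chart could be drawn with ggplot2 is sketched below, using the frequencies from the table() output above. The data frame name `genre.counts`, the axis labels and the styling are assumptions, not the original figure.

```r
library(ggplot2)

# Genre frequencies copied from the table() output above
genre.counts <- data.frame(
  genre = c("Crime", "Drama", "Action", "Thriller", "Biography", "History",
            "Adventure", "Fantasy", "Western", "Sci-Fi", "Comedy", "Mystery",
            "Family", "War", "Animation", "Romance", "Horror", "Music",
            "Musical", "Film-Noir", "Sport"),
  count = c(56, 176, 36, 63, 26, 17, 62, 33, 10, 31, 42, 36, 24, 30,
            20, 24, 5, 2, 5, 7, 7),
  stringsAsFactors = FALSE
)

# Horizontal bars, ordered by count so the most frequent genre ends up on top
genre.plot <- ggplot(genre.counts, aes(x = reorder(genre, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Genre", y = "Number of movies", title = "Genres in the IMDb Top 250")
print(genre.plot)
```

Ordering the bars by count makes the gap between Drama and the rest visible at a glance. A movie with several genres is counted once under each of them.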

Next we plot the average budget per genre to see which genres have the largest budgets. Because budget and gross are missing for a few movies, we exclude those movies.

The bar graph shows that Action, Animation, Sci-Fi, Adventure and Fantasy movies have high budgets. This makes sense, as these movies rely on the latest technology and special effects during filming, which require larger budgets.

Plotting the average gross per genre gives an insight into which genres earn the most in the Top 250 list.

So we can conclude that Action, Animation, Sci-Fi and Fantasy movies have higher gross than Romance movies.
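The code behind the budget and gross charts is not shown here. A minimal sketch of how the per-genre averages could be computed and plotted is given below, under stated assumptions: `toy.df` mimics the shape of movie.df but every value in it is made up, `mean_by_genre` is a hypothetical helper, and only US dollar amounts are kept because amounts in different currencies are not directly comparable.

```r
library(ggplot2)

# Toy rows in the same shape as movie.df; every value here is made up
toy.df <- data.frame(
  movie.genre  = c("Action", "Action", "Drama", "Drama", "Romance"),
  movie.budget = c("$150,000,000", "$90,000,000", "$25,000,000", "NA", "£3,000,000"),
  movie.gross  = c("$400,000,000", "NA", "$60,000,000", "$10,000,000", "$8,000,000"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean of a money column per genre, US dollar amounts only
mean_by_genre <- function(df, column) {
  usd    <- grepl("^\\$", df[[column]])   # drops "NA" and non-USD values
  amount <- as.numeric(gsub("[^0-9]", "", df[[column]][usd]))
  aggregate(amount, by = list(genre = df$movie.genre[usd]), FUN = mean)
}

avg.budget <- mean_by_genre(toy.df, "movie.budget")  # columns: genre, x
avg.gross  <- mean_by_genre(toy.df, "movie.gross")

# Bar chart of the average budget per genre, in millions of dollars
budget.plot <- ggplot(avg.budget, aes(x = genre, y = x / 1e6)) +
  geom_bar(stat = "identity") +
  labs(x = "Genre", y = "Average budget (million USD)")
print(budget.plot)
```

Swapping `avg.budget` for `avg.gross` in the plotting call gives the corresponding gross chart. How the original analysis handled non-dollar amounts is not stated, so the dollar-only filter is a choice made for this sketch.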