Analyzing TMDB Dataset v1.0

Dataset source: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data

R Markdown

tmdb_5000_movies <- read.csv("tmdb_5000_movies.csv")
moviedata<-tmdb_5000_movies

We extract the genre from the movie data, keeping only the first genre listed because the later positions are NA for a large share of movies.

Similarly, we extract a single value for Production, Keywords, Languages, etc.

We check the prevalence of NAs in the split columns and, on that basis, keep a single extracted value for each of these fields.

Data Cleaning

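# Split the raw genres string on runs of non-alphanumeric characters (the default sep of separate()).
# Piece 1 is empty and pieces 2-4 are "id", the id value and "name", so piece 5 is the first genre name.
Genre<-moviedata$genres
Genre<-as.data.frame(Genre)
Genre<-separate(Genre, col = Genre, into=c("1","2","3","4","5","6"))
##View(Genre)
moviedata$genres<-Genre$`5`
##View(moviedata)
# Same approach for keywords; note that a multi-word name is cut at its first word.
Keywords<-moviedata$keywords
Keywords<-as.data.frame(Keywords)
Keywords<-separate(Keywords, col = Keywords, into=c("1","2","3","4","5","6"))
moviedata$keywords<-Keywords$`5`
##View(moviedata)
# Production companies are split on ":" instead of the default separator.
Production<-moviedata$production_companies
Production<-as.data.frame(Production)
##View(Production)
test<-separate(Production, col = Production, into=c("1","2","3","4","5","6"), sep = ":")
##View(test)
# Count how many rows are NA in piece 4 (and, below, piece 2) to decide which piece to keep.
table(is.na(test[,4]))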

## 
## FALSE  TRUE 
##  3386  1417

table(is.na(test[,2]))

## 
## FALSE  TRUE 
##  4452   351

Production<-separate(Production, col = Production, into=c("1","2","3","4","5","6"), sep = ":")
##View(Production)
moviedata$production_companies<-Production$`2`

head(moviedata, n=1)

##      budget genres                    homepage    id keywords
## 1 237000000 Action http://www.avatarmovie.com/ 19995  culture
##   original_language original_title
## 1                en         Avatar
##                                                                                                                                                                          overview
## 1 In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.
##   popularity             production_companies
## 1   150.4376  "Ingenious Film Partners", "id"
##                                                                                         production_countries
## 1 [{"iso_3166_1": "US", "name": "United States of America"}, {"iso_3166_1": "GB", "name": "United Kingdom"}]
##   release_date    revenue runtime
## 1   2009-12-10 2787965087     162
##                                                                         spoken_languages
## 1 [{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\\u00f1ol"}]
##     status                     tagline  title vote_average vote_count
## 1 Released Enter the World of Pandora. Avatar          7.2      11800
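
The production_companies value in the output above still carries residue from the colon split (`"Ingenious Film Partners", "id"`). A minimal sketch of one alternative follows, assuming each raw cell is a JSON-like string and that the first `"name": "..."` pair is the value wanted. The helper `first_name` is hypothetical and not part of the original analysis, and the sample strings are illustrative only.

```r
# Hypothetical helper: pull the first "name" value out of a JSON-like string.
# Returns NA when the cell contains no "name" entry (for example "[]").
first_name <- function(x) {
  x <- as.character(x)
  hits <- regmatches(x, regexec('"name":[[:space:]]*"([^"]*)"', x))
  vapply(
    hits,
    function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
    character(1)
  )
}

# Illustrative input: one populated cell and one empty list.
raw <- c(
  '[{"name": "Ingenious Film Partners", "id": 289}, {"name": "Dune Entertainment", "id": 444}]',
  '[]'
)
first_name(raw)
```

Because the pattern captures everything between the quotes, multi-word names are kept whole rather than cut at the first word.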

Quantitative Information

Let’s now separate the textual and quantitative data in the movie file.

Quantitivedata<-sqldf("select budget, genres, keywords, original_language, original_title, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, vote_average, vote_count from moviedata")

Textual Information

Textualdata<-sqldf("select genres, keywords, original_title, overview, tagline from moviedata")
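
The Quantitivedata selection above still includes character columns such as genres and status. As a sketch, strictly numeric columns can also be picked by type instead of by name. The data frame `toy` is a hypothetical stand-in for moviedata, and its second row is made up for illustration.

```r
# Hypothetical stand-in for moviedata with mixed column types.
toy <- data.frame(
  budget = c(237000000, 0),
  original_title = c("Avatar", "Example"),
  runtime = c(162, NA),
  stringsAsFactors = FALSE
)

# Flag numeric columns, then split the data frame by that flag.
is_num <- vapply(toy, is.numeric, logical(1))
numeric_part <- toy[, is_num, drop = FALSE]
text_part <- toy[, !is_num, drop = FALSE]

names(numeric_part)
names(text_part)
```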

Exploratory analysis of movie ratings

##Exploratory analysis of movie ratings
ggplot(moviedata, aes(vote_average)) + geom_density(color="darkblue", fill="cornflowerblue")

Exploratory analysis of Budget

##Exploratory analysis of Budget
ggplot(moviedata, aes(budget)) + geom_density(color="darkblue", fill="cornflowerblue")

Exploratory analysis of Revenue

##Exploratory analysis of Revenue
ggplot(moviedata, aes(revenue)) + geom_density(color="darkblue", fill="cornflowerblue")

Exploratory analysis of Runtime

##Exploratory analysis of Runtime
ggplot(moviedata, aes(runtime)) + geom_density(color="darkblue", fill="cornflowerblue")
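
Budget and revenue span several orders of magnitude, so a linear-axis density tends to collapse into a spike near zero. A sketch of a log-scale variant follows, assuming that zero or missing values are placeholders that can be dropped for plotting (an assumption to check against the data). The data frame `toy` is a hypothetical stand-in for moviedata.

```r
library(ggplot2)

# Hypothetical stand-in for moviedata$revenue, including a zero and an NA.
toy <- data.frame(revenue = c(0, 5e4, 2e6, 8e7, 2787965087, NA))

# Keep only strictly positive values, since log10 is undefined at zero.
positive <- subset(toy, !is.na(revenue) & revenue > 0)

# Same density plot as above, drawn on a log10 x-axis.
p <- ggplot(positive, aes(x = revenue)) +
  geom_density(color = "darkblue", fill = "cornflowerblue") +
  scale_x_log10()
p
```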

Rubric Questions

1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?

We are exploring the TMDB dataset from Kaggle and plan a combined analysis that covers both quantitative and textual analysis.

1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed)

We will try to understand how different variables affect the quantitative variables in the dataset and draw insights from those relationships, for example how genre and keywords affect a movie’s budget, revenue and ratings. For the textual analysis, we will attempt to identify how the sentiment of a movie (analysed using the keywords, summary and tagline) affects its budget, runtime and ratings.

1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.

A combination of textual and quantitative analysis will help us derive insights in a holistic manner. (A textual analysis of the movie content itself would have been ideal.)

1.4 Explain how your analysis will help the consumer of your analysis.

The analysis can be used by production houses, directors and movie investors to analyse the success of a movie based on its runtime, keywords and context.

2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.

library(stringr)
library(tidyr)
library(ggplot2)
library(tm)
library(sqldf)

2.2 Messages and warnings resulting from loading the package are suppressed.

echo=FALSE indicates that the code will not be shown in the final document; similarly, warning=FALSE and message=FALSE suppress warnings and messages.
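
One way to apply these options document-wide, sketched here under the assumption that the report is knit with knitr, is to set them once in a setup chunk and wrap the library calls so that startup messages stay out of the output:

```r
# Typically placed in the first chunk of the .Rmd file.
# Suppresses messages and warnings for every later chunk.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)

# Load all packages upfront without printing their startup messages.
suppressPackageStartupMessages({
  library(stringr)
  library(tidyr)
  library(ggplot2)
  library(tm)
  library(sqldf)
})
```

Setting the options globally avoids repeating them in each chunk header; individual chunks can still override them where output is wanted.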

2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package)

library(stringr): string manipulation, separation and subsetting
library(tidyr): data manipulation and subsetting
library(ggplot2): creating visualizations
library(tm): text mining
library(sqldf): using SQL for easy data manipulation and subsetting

3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.

The dataset used here has details about movies in terms of their genres, release date, language, revenue, etc., and has been obtained from Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data

3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).

The data being used was generated using The Movie Database API. The intent behind generating this database was to provide information about the variables that define a movie and the way they affect the movie’s success. The database has 20 variables. The columns ‘keywords’ and ‘genres’ hold an inconsistent number of values per row: a large share of rows have a single value for genre, whereas the rest have multiple values. When the values are split into separate columns during cleaning, this leaves a large number of NAs in the resulting table. The same applies to ‘keywords’.

3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.

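We have imported the dataset from a CSV file using read.csv and then cleaned the genres, keywords and production_companies columns, keeping a single value for each because the later positions are mostly NA. The spoken_languages and production_countries columns are still in their raw form in the output shown above.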

3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.

We use head(moviedata, n=1) to show the first row of the cleaned dataset.

3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

Summary for every variable in a tabular form.
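
A minimal sketch of how such a table could be assembled in base R, assuming the variables of interest are the numeric columns. The data frame `toy` is a hypothetical stand-in for moviedata with illustrative values, and `summarise_numeric` is an illustrative helper, not a function from the original analysis.

```r
# Illustrative helper: one row of summary statistics per numeric column.
summarise_numeric <- function(df) {
  num <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]
  data.frame(
    variable = names(num),
    n_missing = vapply(num, function(v) sum(is.na(v)), numeric(1)),
    min = vapply(num, min, numeric(1), na.rm = TRUE),
    median = vapply(num, median, numeric(1), na.rm = TRUE),
    mean = vapply(num, mean, numeric(1), na.rm = TRUE),
    max = vapply(num, max, numeric(1), na.rm = TRUE),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Hypothetical stand-in for moviedata.
toy <- data.frame(
  runtime = c(162, 90, NA),
  vote_average = c(7.2, 6.0, 5.4),
  title = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
summarise_numeric(toy)
```

Non-numeric columns such as title are skipped; they would need a separate summary (for example, counts of distinct values).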

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

A combination of textual and quantitative analysis will help us derive insights in a holistic manner. We will construct a model to evaluate the success of a movie using different combinations of budget, keywords, sentiment, ratings, etc.

4.2 What types of plots and tables will help you to illustrate the findings to your questions?

Correlation plots, density plots and scatter plots, with possible use of a dendrogram, and tables to present the results of the text analysis.
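
As a sketch of the input a correlation plot would need, a pairwise correlation matrix can be computed in base R. The data frame `toy` is a hypothetical stand-in for the numeric columns of moviedata, and its values are illustrative only.

```r
# Hypothetical numeric columns, with one missing budget value.
toy <- data.frame(
  budget = c(1, 2, 3, 4, NA),
  revenue = c(2, 4, 6, 8, 10),
  runtime = c(120, 95, 110, 100, 130)
)

# Use every pair of rows where both variables are present,
# so a single NA does not discard the whole row for other pairs.
corr <- cor(toy, use = "pairwise.complete.obs")
round(corr, 2)
```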

4.3 What do you not know how to do right now that you need to learn to answer your questions?

One of the ideas that we have for exploring the data involves an analysis of the overview and tagline fields in the database. To do so, we would need a working knowledge of text mining in R.
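
A first step in that direction can be sketched without any text-mining package: tokenise a text field and count word frequencies in base R. The stop-word list here is a tiny illustrative one (an assumption; a real analysis would likely use a fuller list), and `overview` holds example strings rather than the actual column.

```r
# Example strings standing in for moviedata$overview.
overview <- c(
  "A paraplegic Marine is dispatched to the moon Pandora.",
  "The Marine becomes torn between following orders and protecting the moon."
)

# Tiny illustrative stop-word list (assumption, not a standard list).
stop_words <- c("a", "the", "is", "to", "and", "between")

# Lower-case, split on anything that is not a letter, drop empties and stop words.
tokens <- unlist(strsplit(tolower(overview), "[^a-z]+"))
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]

# Frequency table, most common words first.
word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq, 3)
```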

4.4 Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?

We are open to pivoting into more exploratory questions in subsequent stages of the project, although currently we are considering only text mining.