TMDB Movie - Data Visualisation

1. Libraries

library(ggplot2)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

2. Kaggle Dataset

TMBD 5000 Movie Dataset

3. Pre-Processing Data

Workflow:

Reading your data
Take a peek at your data
Inspect your structure of data
Formulate questions (strategic business value)
Answer questions efficiently

Data Visualization Data (csv, database, images) -> preprocessing -> eda (exploratory data analysis) -> visualization / plotting

3.1 Pre-processing

mv is a dataframe with 4803 observations and 20 variables

mv <- read.csv("data_input/tmdb_5000_movies.csv", stringsAsFactors = FALSE)
# str(mv)

mv[,c("original_language", "status")] <- lapply(mv[,c("original_language", "status")], as.factor)
mv$release_date <- ymd(mv$release_date)

3.2 Exploratory Data Analysis

Start with question: What is top 10 movie with big revenue? And please visualize it.

top10 <- head(mv[order(mv$revenue, decreasing = TRUE), c("title", "revenue")], n = 10)
top10

##                          title    revenue
## 1                       Avatar 2787965087
## 26                     Titanic 1845034188
## 17                The Avengers 1519557910
## 29              Jurassic World 1513528810
## 45                   Furious 7 1506249360
## 8      Avengers: Age of Ultron 1405403694
## 125                     Frozen 1274219009
## 32                  Iron Man 3 1215439994
## 547                    Minions 1156730962
## 27  Captain America: Civil War 1153304495

rm(mv) #remove unused data to preserve memory

3.3 Data Visualization with ggplot2

top10$title <- reorder(top10$title, as.numeric(top10$revenue))

# Formating in Billions
top10$revenue <- paste(format(round(top10$revenue / 1e9, 2), trim = TRUE), "B")

ggplot(top10, aes(title, revenue)) +
  geom_col(position = "dodge", aes(fill = revenue)) +
  coord_flip() +
  labs(x = "Movie Name", y = "Revenues in USD", title = "Top 10 Movie with Most Revenues All Time")

4. Final Thoughts

This is my first doing data visualization using Kaggle dataset

What I Learn?

Doing pre-processing raw data
Using lubridate() for date formatting
Subsetting data based on column that will be used title and revenue
Reordering data based on most revenue and limit it to 10 rows
Formating revenue numeric values into Billion(B) format
Data visualization with ggplot2 using geom_col() + coord_flip() + labs()