Introduction

Background

Understanding how film trends change over time is useful for anyone who follows the industry. In this report, we use the R programming language to analyze film datasets and look for patterns, ranging from shifts in genres to the popularity of films over time and audience preferences.

In this analysis, we work with data available on IMDb. IMDb, which stands for the Internet Movie Database, is an online resource that provides comprehensive information about films, including ratings, reviews, cast members, and much more.

It is important to note that IMDb is only one source of data and opinions, which are inherently subjective. Nonetheless, using this data as a guiding reference still lets us draw useful insights about the titles in the dataset.

We will perform analysis using the R programming language. R is a powerful and popular programming language for data analysis and statistics. With R, we can manipulate data, visualize it, and conduct in-depth statistical analysis on the chosen film dataset.

We will use several R packages and functions to address relevant questions about the dataset. For example, we can use ‘dplyr’ for data manipulation and ‘ggplot2’ for creating informative visualizations, both of which are part of the ‘tidyverse’ collection of packages.

Through this analysis, we will explore various aspects of the films, such as ratings, duration, genre, and other interesting elements.

So, let’s commence this analysis by utilizing the R programming language to explore and analyze the selected film dataset.

Report Contents:

Introduction

Background

Dataset

Converting Multiple .csv Files

Data Exploration and Data Wrangling

Creating Dataframe Object
Creating the new column
Converting Data Type
Removing Duplicated Data
Handling N/A Values

Getting The Insight

Top 10 with The Highest Gross Revenue
Understanding the Relationship: Gross Revenue, Rating, and Votes

email : rusdipermana2@gmail.com

Dataset

Converting Multiple .csv Files

The dataset consists of multiple .csv files obtained from kaggle.com. Here is the list of CSV files that will be merged:

action_series.csv
adventure_series.csv
animation_series.csv
biography_series.csv
comedy_series.csv
crime_series.csv
documentary_series.csv
drama_series.csv
family_series.csv
fantasy_series.csv
history_series.csv
horror_series.csv
music_series.csv
musical_series.csv
mystery_series.csv
romance_series.csv
sci-fi_series.csv
sport_series.csv
superhero_series.csv
thriller_series.csv
war_series.csv
western_series.csv

The various files will be consolidated into a single file to make the analysis easier. The ‘tidyverse’ library is used to read and merge them, producing one unified dataset that serves as the basis for the rest of the report.

library(tidyverse)  # provides %>%, map_df(), read_csv(), and write_csv()

# Folder where the CSV files are stored
folder_path <- "data_input/data_series"

# Read every CSV file in the folder (all columns as character) and combine them
merged_df <- folder_path %>%
  dir(pattern = "\\.csv$") %>%
  map_df(~ read_csv(file.path(folder_path, .),
                    col_types = cols(.default = "character")))

# Save the merged dataframe as a CSV file
write_csv(merged_df, "data_input/imdbTV.csv")

In the above process, the objective is to merge multiple .csv files into a single file. These files are located in the directory “data_input/data_series”. The merged file will be saved in the “data_input” directory with the name “imdbTV.csv”. By merging the files, the data is consolidated and can be further analyzed.
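The same list, read, and stack pattern can be tried on a small scale before running it on the real folder. The sketch below is only an illustration: it uses base R instead of the tidyverse so that it runs without extra packages, and the folder `demo_series`, the two tiny files, and the `source_file` column are hypothetical names that do not exist in the original data.

```r
# Hypothetical demo: two tiny genre files written to a temporary folder
demo_dir <- file.path(tempdir(), "demo_series")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Title = c("A", "B"), Rating = c("8.1", "7.4")),
          file.path(demo_dir, "action_series.csv"), row.names = FALSE)
write.csv(data.frame(Title = c("B", "C"), Rating = c("7.4", "6.9")),
          file.path(demo_dir, "drama_series.csv"), row.names = FALSE)

# List every .csv file, read each one as character, and stack the results
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
demo_merged <- do.call(rbind, lapply(csv_files, function(path) {
  df <- read.csv(path, colClasses = "character")
  df$source_file <- basename(path)  # hypothetical column: remembers the origin file
  df
}))

nrow(demo_merged)  # 4 rows: title "B" appears once per genre file
```

Recording the origin file is optional, but it makes it easier to see later why the same title shows up more than once after merging.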

Overview of the Dataset

This comprehensive dataset encompasses a wide range of information about TV series sourced from IMDb. It comprises vital details including the series’ title, IMDb ID, release year, genre, cast members, synopsis, rating, runtime, certificate classification, number of votes, and gross revenue statistics.

Source Dataset: Accessing the Original Dataset on IMDb’s Website

Data Exploration and Data Wrangling

Creating Dataframe

To optimize computational efficiency, the data.table library is used to create the dataframe object. data.table is preferred here for its speed when processing large datasets. Let’s create the imdbTV dataframe object from the merged .csv file:

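library(data.table)  # provides fread() and setnames()

# Read the merged file, then replace spaces in column names with periods
imdbTV <- fread("data_input/imdbTV.csv")
setnames(imdbTV, colnames(imdbTV), gsub(" ", ".", colnames(imdbTV)))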

The fread function from the data.table library reads the “imdbTV.csv” file and creates the imdbTV dataframe object. The setnames function then replaces any spaces in the column names with periods (.).

Creating the new column “Weighted Rating”

The weighted rating is calculated from the Rating and Number.of.Votes columns: Rating is multiplied by Number.of.Votes and divided by the sum of Number.of.Votes and a constant C = 1000. Because of this, a title with far fewer than 1000 votes receives a weighted rating close to 0, while a title with many more than 1000 votes keeps a weighted rating close to its original Rating. The result is stored in a new column called Weighted_Rating in the imdbTV dataframe, giving a measure that reflects both how highly a title is rated and how many people rated it.

# Calculate the weighted rating using the constant C = 1000
imdbTV <- imdbTV %>%
  mutate(Weighted_Rating = Rating * Number.of.Votes / (Number.of.Votes + 1000))
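To see how the constant affects the result, the formula can be applied to a few made-up titles that share the same rating but differ in vote count. This is a minimal sketch: the helper `weighted_rating()` and the `demo` values are hypothetical and only restate the formula used above.

```r
# Hypothetical helper restating the formula: Rating * votes / (votes + C)
weighted_rating <- function(rating, votes, C = 1000) {
  rating * votes / (votes + C)
}

# Made-up titles with the same rating but very different vote counts
demo <- data.frame(
  Rating = c(9.0, 9.0, 9.0),
  Number.of.Votes = c(10, 1000, 100000)
)
demo$Weighted_Rating <- weighted_rating(demo$Rating, demo$Number.of.Votes)

round(demo$Weighted_Rating, 3)
# [1] 0.089 4.500 8.911
```

With exactly C votes the weighted rating is half of the original rating, so the choice of C decides how many votes a title needs before its rating is taken at close to face value.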

Converting Data Type

The Runtime and Gross.Revenue columns are stored as character and need to be converted before analysis. The Runtime column represents the duration of the film in minutes, while the Gross.Revenue column represents the gross revenue of the film in dollars. After removing the “ min” suffix and the thousands separators, Runtime is converted to integer (int) and Gross.Revenue to numeric (num). This makes both columns usable in calculations and keeps the duration and revenue figures consistent.

Before converting Data Type

typeof(imdbTV$Runtime)

[1] "character"

typeof(imdbTV$Gross.Revenue)

[1] "character"

Converting Process

imdbTV <- imdbTV %>%
  mutate(
    Runtime = as.integer(gsub(",", "", gsub(" min", "", Runtime))),
    Gross.Revenue = as.numeric(gsub(",", "", Gross.Revenue))
  )

After converting Data Type

typeof(imdbTV$Runtime)

[1] "integer"

typeof(imdbTV$Gross.Revenue)

[1] "double"
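The effect of the two gsub() calls is easier to follow on a handful of sample strings. The values below are made up for illustration and are not taken from the dataset; the point is that an empty string becomes NA after conversion, which is one way missing values can arise in these two columns.

```r
# Made-up raw values in the same textual style as the original columns
runtime_raw <- c("45 min", "1,440 min", "")
gross_raw <- c("1,250,000", "", "980")

# Strip the " min" suffix and the thousands separators, then convert
runtime <- as.integer(gsub(",", "", gsub(" min", "", runtime_raw)))
gross <- as.numeric(gsub(",", "", gross_raw))

runtime          # 45 1440 NA
gross            # 1250000 NA 980
typeof(runtime)  # "integer"
typeof(gross)    # "double"
```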

Removing Duplicated Data

The next step is to check for duplicates, which is crucial for data integrity. Identifying and removing duplicate entries ensures that each title is counted only once in the analysis.

# Check for duplicate values in the IMDb.ID column
sum(duplicated(imdbTV$IMDb.ID))

[1] 127631

It is interesting to note that the author of this dataset obtained the data through web scraping from imdbtv.com, specifically from various genre category pages. Because a single film can belong to several genres, it can appear on several genre pages. For example, a film like Avengers: Endgame may belong to the genres of Action, Adventure, Drama, and Sci-Fi, so it is scraped once per genre page. As a result, the scraping process leads to redundant data entries.

To address this issue, the redundant data will be removed to ensure a relevant dataset for analysis. By eliminating the redundant entries, we can create a more streamlined and accurate dataset that is suitable for further analysis.

imdbTV <- imdbTV[!duplicated(imdbTV$IMDb.ID), ]

After the removal process:

# Check for duplicate values after removal
sum(duplicated(imdbTV$IMDb.ID))

[1] 0
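One detail worth knowing is that duplicated() flags only the second and later occurrences of an ID, so the filter keeps the first row for each title and drops the rest. The sketch below shows this on made-up IDs; the `Scraped.From` column is hypothetical and stands in for any column whose value might differ between copies of the same title.

```r
# Made-up rows: tt001 and tt002 each appear on two genre pages
demo <- data.frame(
  IMDb.ID = c("tt001", "tt002", "tt001", "tt003", "tt002"),
  Scraped.From = c("action", "action", "drama", "drama", "sci-fi"),  # hypothetical column
  stringsAsFactors = FALSE
)

sum(duplicated(demo$IMDb.ID))  # 2 rows are repeats of an earlier ID

# Keep only the first occurrence of each ID
deduped <- demo[!duplicated(demo$IMDb.ID), ]

deduped$IMDb.ID       # "tt001" "tt002" "tt003"
deduped$Scraped.From  # "action" "action" "drama"
```

If the copies of a title are identical in every column, nothing is lost. If they differ, only the values from the first copy survive.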

Handling N/A Values

Before proceeding, the dataset is checked for missing (NA) values in each column. Knowing which columns are incomplete lets us decide how to handle them, for example by imputation or removal, and keeps the analysis reliable.

# Count the number of NA values in each column
jumlah_na_per_kolom <- apply(imdbTV, 2, function(x) sum(is.na(x)))

# Display the number of NA values per column
print(jumlah_na_per_kolom)

          Title         IMDb.ID    Release.Year           Genre            Cast 
              0               0               6               0             630 
       Synopsis          Rating         Runtime     Certificate Number.of.Votes 
              0               0           11279           48905               0 
  Gross.Revenue Weighted_Rating 
          95944               0

While there are empty or NA values present in some columns, this is expected due to certain rows lacking specific information. For instance, in the Gross.Revenue column, which represents the gross revenue of a film, not all films have available data for this field. Therefore, in this case, no specific handling will be performed for empty or NA values, as the data still holds significant value for analysis.

It is important to note that the presence of empty or NA values in certain columns does not diminish the overall usefulness of the dataset for analysis. The remaining data provides valuable insights and can still be utilized effectively to derive meaningful conclusions and patterns. However, it is crucial to be aware of these missing values and consider their potential impact on the analysis and interpretations made based on the dataset.
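Leaving the NA values in place means each later calculation has to state how it treats them. As a minimal sketch on made-up data (the `demo` dataframe below is hypothetical), the share of missing values per column can be reported next to the counts, and summary functions can be told to skip NA explicitly:

```r
# Made-up data with missing runtime and revenue values
demo <- data.frame(
  Title = c("A", "B", "C", "D"),
  Runtime = c(45L, NA, 60L, NA),
  Gross.Revenue = c(NA, NA, NA, 1250000)
)

# Count and share of NA values per column
na_count <- colSums(is.na(demo))
na_share <- round(colMeans(is.na(demo)), 2)

na_count  # Title 0, Runtime 2, Gross.Revenue 3
na_share  # Title 0.00, Runtime 0.50, Gross.Revenue 0.75

# Without na.rm = TRUE the mean would be NA
mean(demo$Runtime, na.rm = TRUE)  # 52.5
```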

Data Structure:

Title: Film title in character format (chr).
IMDb.ID: IMDb film ID in character format (chr).
Release.Year: Film release year in character format (chr).
Genre: Film genre in character format (chr).
Cast: Film cast members in character format (chr).
Synopsis: Film synopsis in character format (chr).
Rating: Film rating value in numeric format (num).
Runtime: Film duration in minutes in integer format (int).
Certificate: Film certificate in character format (chr).
Number.of.Votes: Number of votes received by the film in integer format (int).
Gross.Revenue: Film gross revenue in dollars in numeric format (num).
Weighted_Rating: Weighted rating calculated from Rating and Number.of.Votes in numeric format (num).

Getting The Insight

Top 10 with The Highest Gross Revenue

Let’s start by gaining insights from the top 10 films with the highest gross revenue. By exploring this list, we can obtain valuable insights into the financial success and popularity of these films.

library(scales)  # provides dollar_format()

# Sorting by Gross.Revenue in descending order
imdbTV_sorted <- imdbTV %>% arrange(desc(Gross.Revenue))

# Selecting the top 10 films with the highest gross revenue
top_10_films <- head(imdbTV_sorted, 10)

# Creating a ranking plot using ggplot2
# The fill legend is hidden, so no colorbar guide is configured
rank_plot <- ggplot(data = top_10_films, aes(y = reorder(Title, Gross.Revenue), x = Gross.Revenue, fill = Gross.Revenue)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Films with the Highest Gross Revenue", y = "Film", x = "Gross Revenue") +
  theme_minimal() +
  scale_x_continuous(labels = dollar_format(prefix = "$"), expand = c(0, 0)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
  scale_fill_gradientn(colors = c("#E6F5FF", "#99C2FF", "#6699FF", "#3366FF", "#0033FF"), guide = "none")

# Display the plot
rank_plot

The insights that can be drawn are that films with major franchises like “Star Wars,” “Avengers,” and “Spider-Man” have strong appeal in the market and are capable of generating high gross revenue. The film “Avatar” has also proven to be one of the highest-grossing films, indicating its popularity among audiences. Additionally, films with themes such as action, adventure, and science fiction tend to receive positive responses from viewers and generate significant revenue.

Understanding the Relationship: Gross Revenue, Rating, and Votes

By analyzing the correlation between Gross Revenue, Weighted Rating, and Number of Votes, we aim to gain deeper insights into the relationship among these variables. This analysis will enable us to determine if there is a significant association between a film’s gross revenue, its weighted rating, and the number of votes it receives. Understanding this relationship can provide valuable insights into the factors influencing a film’s financial success and audience reception.

Weighted Rating vs Gross Revenue

clean_gross <- imdbTV[!is.na(imdbTV$Gross.Revenue), ]
correlation <- cor(clean_gross$Weighted_Rating, clean_gross$Gross.Revenue, use = "pairwise.complete.obs")
correlation

[1] 0.2979338

Number.of.Votes vs Gross Revenue

correlation <- cor(clean_gross$Number.of.Votes, clean_gross$Gross.Revenue, use = "pairwise.complete.obs")
correlation

[1] 0.6569887

An upward trend line indicates a positive relationship between the two analyzed variables. In this context, the upward trend line in the relationship between Weighted Rating or Number of Votes and Gross Revenue suggests that as the Weighted Rating increases or the Number of Votes grows, the film’s gross revenue is likely to increase.

In other words, the upward trend line indicates a positive correlation between the level of financial success of a film (in terms of gross revenue) and both Weighted Rating and Number of Votes. A higher Weighted Rating or a larger number of votes received by a film corresponds to a higher possibility of achieving higher gross revenue.

The upward trend line provides insight that factors such as popularity, positive audience reception, and audience engagement can influence the financial success of a film.

The correlation between Weighted Rating and Gross Revenue is 0.2979338, indicating a weak-to-moderate positive relationship. Films with higher weighted ratings tend to have higher gross revenue, but the relationship is not strong.

On the other hand, the correlation between Number of Votes and Gross Revenue is 0.6569887, indicating a strong positive relationship. This implies that films with a larger number of votes tend to generate higher gross revenue.

These correlation values provide valuable insights into the association between weighted rating, number of votes, and gross revenue, helping us understand the factors that contribute to a film’s financial success and audience engagement.
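The values above are Pearson correlations, which is the default method of cor(). Pearson correlation can be dominated by a few very large values, and revenue figures are often highly skewed, so a rank-based Spearman correlation may be a useful cross-check. The sketch below uses made-up numbers, not the imdbTV data, and only illustrates the `method` and `use` arguments of cor().

```r
# Made-up data: one blockbuster dominates the revenue column, one vote count is missing
demo <- data.frame(
  Number.of.Votes = c(100, 500, 1000, 5000, 20000, NA),
  Gross.Revenue = c(1e5, 4e5, 3e5, 9e6, 2e8, 5e6)
)

# Pearson (default): measures linear association on the raw values
pearson <- cor(demo$Number.of.Votes, demo$Gross.Revenue,
               use = "complete.obs")

# Spearman: measures association between ranks, less sensitive to extreme values
spearman <- cor(demo$Number.of.Votes, demo$Gross.Revenue,
                use = "complete.obs", method = "spearman")

pearson
spearman  # 0.9 for this made-up data
```

If both coefficients point in the same direction and are of similar size, the conclusion does not depend on a handful of extreme titles.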

Insight Point

Action and Adventure: Many films in the top 10 belong to the action and adventure genres, such as Star Wars: Episode VII - The Force Awakens, Avengers: Endgame, Spider-Man: No Way Home, and Avatar. These genres tend to attract a wide audience and have a strong commercial appeal, contributing to their high gross revenue.
Sci-Fi and Fantasy: Films like Avatar, Avengers: Infinity War, and Avatar: The Way of Water fall under the sci-fi and fantasy genres. These genres often offer visually stunning worlds and imaginative storytelling, captivating audiences and driving their financial success.
Genre Combination: Several films in the top 10 showcase a combination of genres. For example, Avengers: Endgame and Avengers: Infinity War blend action, adventure, and sci-fi elements, while Black Panther combines action, adventure, and sci-fi with a focus on cultural representation. This suggests that combining genres can appeal to a wider audience and increase the revenue potential of a film.
Positive Correlation: There is a positive correlation between the Weighted Rating and Gross Revenue. This means that as the Weighted Rating of a film increases, there is a tendency for its Gross Revenue to also increase. Films with higher ratings have a higher potential to generate more revenue.
Stronger Correlation: The correlation between the Number of Votes and Gross Revenue is stronger than the correlation between Weighted Rating and Gross Revenue. This indicates that the number of votes a film receives is more closely associated with its gross revenue than its weighted rating is. Films with a larger number of votes are more likely to have higher revenue.
Importance of Audience Engagement: Both Weighted Rating and Number of Votes serve as indicators of audience engagement and interest in a film. The insights suggest that audience engagement plays a significant role in determining a film’s financial success. Films that receive more attention and involvement from the audience, as reflected in higher ratings and a larger number of votes, tend to have higher gross revenue.
Revenue Potential: The analysis provides valuable insights into the revenue potential of films. Filmmakers and industry professionals can utilize these insights to identify the factors that contribute to a film’s financial success. By focusing on aspects such as increasing audience engagement, improving the overall quality and appeal of films, and generating positive ratings and votes, they can enhance the chances of achieving higher gross revenue.

About IMDb TV

IMDb stands for Internet Movie Database. It is a website that collects information related to films, television series, video productions, and entertainment events. IMDb was founded in 1990 and has become one of the most popular and trusted sources of information in the entertainment industry.

On IMDb, users can find lists of films and TV shows along with details such as title, synopsis, release dates, cast, production crew, reviews, and ratings. The site also provides information about awards, box office performance, trailers, and various interesting facts about film and television productions.