Introduction


Background

In filmmaking, understanding how trends change over time is crucial. Using the R programming language, we analyze film datasets to surface those trends, ranging from shifts in genres to the popularity of films over time and audience preferences.

In this analysis, we explore film and TV series data available on IMDb. IMDb, which stands for the Internet Movie Database, is an online resource that provides comprehensive information about films, including ratings, reviews, cast members, and much more.

It is important to note that IMDb is only one source of data and opinions, and its ratings are inherently subjective. Nonetheless, the data serves as a useful reference for a thorough analysis.

We will perform analysis using the R programming language. R is a powerful and popular programming language for data analysis and statistics. With R, we can manipulate data, visualize it, and conduct in-depth statistical analysis on the chosen film dataset.

We will use several R packages to address questions about the dataset. For example, ‘dplyr’ handles data manipulation and ‘ggplot2’ creates informative visualizations; both are loaded as part of the ‘tidyverse’ collection of packages.

Through this analysis, we will explore aspects such as ratings, duration, genre, and other elements of the titles in the dataset.

So, let’s begin by using R to explore and analyze the selected dataset.


The Report Content:

Introduction
  • Background
  • Dataset
  • Converting Multiple .csv Files
  • Data Exploratory and Data Wrangling
    • Creating Dataframe Object
    • Creating the new column
    • Converting Data Type
    • Removing Duplicated Data
    • Handling N/A Values
  • Getting The Insight
    • Top 10 with The Highest Gross Revenue
    • Understanding the Relationship: Gross Revenue, Rating, and Votes

    Dataset



    Converting Multiple .csv Files

    The dataset consists of multiple .csv files obtained from kaggle.com. The files to be merged are stored in a single folder.

    These files will be consolidated into a single file to facilitate analysis. The ‘tidyverse’ library is used to read and combine them, producing one unified dataset that serves as the foundation for the rest of the report.

    library(tidyverse)

    # Folder where the CSV files are stored
    folder_path <- "data_input/data_series"
    
    # Read every CSV file in the folder (all columns as character) and combine them
    merged_df <- folder_path %>%
      dir(pattern = "\\.csv$") %>%
      map_df(~ read_csv(file.path(folder_path, .),
                        col_types = cols(.default = "character")))
    
    # Save the merged dataframe as a single CSV file
    write_csv(merged_df, "data_input/imdbTV.csv")

    In the above process, the objective is to merge multiple .csv files into a single file. These files are located in the directory “data_input/data_series”. The merged file will be saved in the “data_input” directory with the name “imdbTV.csv”. By merging the files, the data is consolidated and can be further analyzed.
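The merge step can be tried end to end without the real files. The sketch below is a base R illustration, not the report's own code: it writes two small, made-up CSV files to a temporary folder, reads each one with all columns as character, and row-binds them. The file names, the sample rows, and the extra `source_file` column are assumptions added for illustration.

```r
# Create a temporary folder holding two small, made-up CSV files
folder_path <- tempfile("data_series_demo")
dir.create(folder_path)
write.csv(data.frame(Title = c("A", "B"), Rating = c("8.1", "7.4")),
          file.path(folder_path, "action.csv"), row.names = FALSE)
write.csv(data.frame(Title = c("B", "C"), Rating = c("7.4", "6.9")),
          file.path(folder_path, "drama.csv"), row.names = FALSE)

# List every CSV file in the folder
csv_files <- list.files(folder_path, pattern = "\\.csv$", full.names = TRUE)

# Read one file as character columns and record which file it came from
read_one <- function(path) {
  df <- read.csv(path, colClasses = "character", stringsAsFactors = FALSE)
  df$source_file <- basename(path)  # hypothetical helper column
  df
}

# Row-bind all files into a single dataframe
merged_demo <- do.call(rbind, lapply(csv_files, read_one))
print(merged_demo)
```

Reading everything as character first avoids type clashes between files; the proper types can be restored after merging. Title "B" appears twice in the result because it sits in both files, which is the kind of overlap a later duplicate check has to deal with.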


    Overview of the Dataset


    This comprehensive dataset encompasses a wide range of information about TV series sourced from IMDb. It comprises vital details including the series’ title, IMDb ID, release year, genre, cast members, synopsis, rating, runtime, certificate classification, number of votes, and gross revenue statistics.

    Source Dataset: Accessing the Original Dataset on IMDb’s Website

    Data Exploratory and Data Wrangling

    Creating Dataframe

    To keep loading fast, the data.table library is used to create the dataframe object. Its fread function is preferred here because it reads large files quickly. Let’s create the imdbTV dataframe object from the merged .csv file:

    library(data.table)

    imdbTV <- fread("data_input/imdbTV.csv")
    setnames(imdbTV, colnames(imdbTV), gsub(" ", ".", colnames(imdbTV)))

    The fread function from the data.table library reads the “imdbTV.csv” file and creates the imdbTV dataframe object. The setnames function then replaces any spaces in the column names with periods (.) in the imdbTV dataframe.
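To make the renaming step concrete, here is a minimal base R sketch of what the gsub call does to a set of column names. The raw names are an assumption, inferred from the period-separated names that appear later in the report.

```r
# Column names as they might appear in the raw CSV (assumed for illustration)
raw_names <- c("Title", "IMDb ID", "Number of Votes", "Gross Revenue")

# Replace every space with a period, as the setnames() call does
clean_names <- gsub(" ", ".", raw_names, fixed = TRUE)
print(clean_names)
```

Names without spaces can be used directly after the `$` operator, for example `imdbTV$Gross.Revenue`, without backticks.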

    Creating the new column “Weighted Rating”

    A weighted rating is calculated for each title in the imdbTV dataframe from the Rating and Number.of.Votes columns. The Rating is multiplied by the Number.of.Votes and divided by the sum of the Number.of.Votes and a constant C = 1000, that is, Weighted_Rating = Rating × Number.of.Votes / (Number.of.Votes + C). Titles with few votes are pulled toward zero, while titles with many votes keep a value close to their original rating, so a high rating based on only a handful of votes carries less weight. The result is appended as a new column called Weighted_Rating, which reflects both audience reception and popularity.

    # Compute the weighted rating using the constant C = 1000
    imdbTV <- imdbTV %>%
      mutate(Weighted_Rating = Rating * Number.of.Votes / (Number.of.Votes + 1000))
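A small worked example may help show how the constant behaves. The sketch below wraps the same formula in a hypothetical helper function, `weighted_rating`, and applies it to three made-up vote counts that share a raw rating of 8.0.

```r
# Weighted rating with constant C (1000 in the report)
weighted_rating <- function(rating, votes, C = 1000) {
  rating * votes / (votes + C)
}

# Three hypothetical titles with the same raw rating but different vote counts
votes <- c(100, 1000, 99000)
demo_weighted <- weighted_rating(8.0, votes)
print(demo_weighted)
```

With 100 votes the weighted rating is about 0.73, with exactly C = 1000 votes it is half the raw rating (4.0), and with 99,000 votes it is 7.92. A larger C would demand more votes before a title approaches its raw rating.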

    Converting Data Type

    The Runtime and Gross.Revenue columns are stored as character data and must be converted before they can be used in calculations. The Runtime column represents the duration in minutes, while the Gross.Revenue column represents the gross revenue in dollars. After cleansing (removing the “ min” suffix and the thousands separators), Runtime is converted to integer and Gross.Revenue is converted to numeric (double). This step ensures that the duration and revenue figures are represented consistently.

    Before converting Data Type

    typeof(imdbTV$Runtime)
    [1] "character"
    typeof(imdbTV$Gross.Revenue)
    [1] "character"

    Converting Process

    imdbTV <- imdbTV %>%
      mutate(
        Runtime = as.integer(gsub(",", "", gsub(" min", "", Runtime))),
        Gross.Revenue = as.numeric(gsub(",", "", Gross.Revenue))
      )

    After converting Data Type

    typeof(imdbTV$Runtime)
    [1] "integer"
    typeof(imdbTV$Gross.Revenue)
    [1] "double"
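The cleansing logic can be checked on a few sample strings before running it on the full dataset. The values below are made up; only the gsub pattern mirrors the conversion shown above.

```r
# Made-up raw values in the same textual form as the scraped columns
runtime_raw <- c("45 min", "1,440 min", NA)
gross_raw   <- c("1,234,567", "2,500", NA)

# Strip the " min" suffix and thousands separators, then convert
runtime <- as.integer(gsub(",", "", gsub(" min", "", runtime_raw)))
gross   <- as.numeric(gsub(",", "", gross_raw))

print(runtime)
print(typeof(runtime))
print(gross)
print(typeof(gross))
```

Missing inputs stay NA after conversion, so the number of NA values in these columns does not change unless a value contains other non-numeric characters.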

    Removing Duplicated Data

    The next step is to check for duplicated data, which is crucial for data integrity. Identifying and removing duplicate entries ensures that each title is counted only once in the analysis.

    # Count duplicated values in the IMDb.ID column
    sum(duplicated(imdbTV$IMDb.ID))
    [1] 127631

    Note that the author of this dataset obtained the data through web scraping from imdbtv.com, specifically from the individual genre category pages. Because a single film can belong to several genres, it can appear on several of those pages. For example, a film like Avengers: Endgame may belong to the genres of Action, Adventure, Drama, and Sci-Fi. As a result, the scraping process produces redundant entries for the same IMDb.ID.

    To address this issue, the redundant entries are removed so that each title appears once, giving a more streamlined dataset for further analysis.

    imdbTV <- imdbTV[!duplicated(imdbTV$IMDb.ID), ]

    After the removal process :

    # Count duplicated values again after removal
    sum(duplicated(imdbTV$IMDb.ID))
    [1] 0
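The behavior of duplicated() is easy to see on a toy table. The IDs and genres below are hypothetical; the filtering expression is the same one used above, applied to a plain data frame.

```r
# Hypothetical scrape result: the same ID shows up under several genre pages
demo <- data.frame(
  IMDb.ID = c("tt001", "tt002", "tt001", "tt003", "tt002"),
  Genre   = c("Action", "Drama", "Sci-Fi", "Comedy", "Romance"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each ID
n_dup <- sum(duplicated(demo$IMDb.ID))
print(n_dup)

# Keep only the first occurrence of each ID
deduped <- demo[!duplicated(demo$IMDb.ID), ]
print(deduped)
```

Only the first row for each ID survives, so which row is kept depends on the order in which the files were merged.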

    Handling N/A Values

    Before proceeding, each column is checked for missing (NA) or empty values. This check shows where the data is incomplete so that appropriate measures, such as imputation or removal, can be taken if needed to maintain the quality of the dataset.

    # Count the number of NA values in each column
    jumlah_na_per_kolom <- apply(imdbTV, 2, function(x) sum(is.na(x)))
    
    # Display the number of NA values per column
    print(jumlah_na_per_kolom)
              Title         IMDb.ID    Release.Year           Genre            Cast 
                  0               0               6               0             630 
           Synopsis          Rating         Runtime     Certificate Number.of.Votes 
                  0               0           11279           48905               0 
      Gross.Revenue Weighted_Rating 
              95944               0 

    While there are empty or NA values present in some columns, this is expected due to certain rows lacking specific information. For instance, in the Gross.Revenue column, which represents the gross revenue of a film, not all films have available data for this field. Therefore, in this case, no specific handling will be performed for empty or NA values, as the data still holds significant value for analysis.

    It is important to note that the presence of empty or NA values in certain columns does not diminish the overall usefulness of the dataset for analysis. The remaining data provides valuable insights and can still be utilized effectively to derive meaningful conclusions and patterns. However, it is crucial to be aware of these missing values and consider their potential impact on the analysis and interpretations made based on the dataset.
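Leaving NA values in place means each later calculation has to decide how to treat them. The sketch below uses a made-up three-row table to show two common base R options: skipping NA values inside a summary with `na.rm = TRUE`, and restricting to fully complete rows with `complete.cases()`. Neither is presented as the report's chosen method.

```r
# Made-up table with gaps similar to those in the real dataset
demo <- data.frame(
  Rating        = c(8.1, 7.4, 6.9),
  Runtime       = c(45L, NA, 60L),
  Gross.Revenue = c(1500000, NA, NA)
)

# NA count per column (colSums works directly on the logical matrix)
na_per_column <- colSums(is.na(demo))
print(na_per_column)

# Option 1: ignore NA values inside a single summary
mean_gross <- mean(demo$Gross.Revenue, na.rm = TRUE)
print(mean_gross)

# Option 2: keep only rows with no missing value in any column
complete_rows <- demo[complete.cases(demo), ]
print(nrow(complete_rows))
```

Option 1 keeps as many rows as possible for each calculation, while option 2 can discard most of the data when one column, such as Gross.Revenue here, is sparsely populated.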


    Data Structure:


    Getting The Insight

    Top 10 with The Highest Gross Revenue

    Let’s start by gaining insights from the top 10 films with the highest gross revenue. By exploring this list, we can obtain valuable insights into the financial success and popularity of these films.


    library(scales)  # provides dollar_format()

    # Sorting by Gross.Revenue in descending order (NA values are placed last)
    imdbTV_sorted <- imdbTV %>% arrange(desc(Gross.Revenue))
    
    # Selecting the top 10 films with the highest gross revenue
    top_10_films <- head(imdbTV_sorted, 10)

    # Creating a ranking plot using ggplot2; the fill legend is hidden,
    # so no colorbar guide is configured
    rank_plot <- ggplot(data = top_10_films, aes(y = reorder(Title, Gross.Revenue), x = Gross.Revenue, fill = Gross.Revenue)) +
      geom_col() +
      labs(title = "Top 10 Films with the Highest Gross Revenue", y = "Film", x = "Gross Revenue") +
      theme_minimal() +
      scale_x_continuous(labels = dollar_format(prefix = "$"), expand = c(0, 0)) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
      scale_fill_gradientn(colors = c("#E6F5FF", "#99C2FF", "#6699FF", "#3366FF", "#0033FF"), guide = "none")

    # Display the plot
    rank_plot

    The insights that can be drawn are that films from major franchises like “Star Wars,” “Avengers,” and “Spider-Man” have strong appeal in the market and are capable of generating high gross revenue. The film “Avatar” has also proven to be one of the highest-grossing films, indicating its popularity among audiences. Additionally, the list suggests that films with themes such as action, adventure, and science fiction tend to generate significant revenue.
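The sort-then-take-the-head selection behind the ranking can be sketched in base R on a toy table. The titles and revenue values are made up; the point is that rows with a missing Gross.Revenue fall to the end of the ordering and therefore never enter the top of the list.

```r
# Made-up titles, one of them without revenue data
demo <- data.frame(
  Title         = c("A", "B", "C", "D"),
  Gross.Revenue = c(200, NA, 900, 500),
  stringsAsFactors = FALSE
)

# Sort by revenue, highest first; order() places NA values last by default
sorted <- demo[order(demo$Gross.Revenue, decreasing = TRUE), ]

# Take the top 2 rows
top_2 <- head(sorted, 2)
print(top_2)
```

Because so many rows in the real dataset lack revenue data, the ranking only covers titles for which a figure is available.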

    Understanding the Relationship: Gross Revenue, Rating, and Votes

    By analyzing the correlation between Gross Revenue, Weighted Rating, and Number of Votes, we aim to gain deeper insights into the relationship among these variables. This analysis will enable us to determine if there is a significant association between a film’s gross revenue, its weighted rating, and the number of votes it receives. Understanding this relationship can provide valuable insights into the factors influencing a film’s financial success and audience reception.

    Weighted Rating vs Gross Revenue

    clean_gross <- imdbTV[!is.na(imdbTV$Gross.Revenue), ]
    correlation <- cor(clean_gross$Weighted_Rating, clean_gross$Gross.Revenue, use = "pairwise.complete.obs")
    correlation
    [1] 0.2979338


    Number.of.Votes vs Gross Revenue

    correlation <- cor(clean_gross$Number.of.Votes, clean_gross$Gross.Revenue, use = "pairwise.complete.obs")
    correlation
    [1] 0.6569887


    An upward trend line indicates a positive relationship between the two analyzed variables. In this context, the upward trend line in the relationship between Weighted Rating or Number of Votes and Gross Revenue suggests that as the Weighted Rating increases or the Number of Votes grows, the film’s gross revenue is likely to increase.

    In other words, the upward trend line indicates a positive correlation between the level of financial success of a film (in terms of gross revenue) and both Weighted Rating and Number of Votes. A higher Weighted Rating or a larger number of votes received by a film corresponds to a higher possibility of achieving higher gross revenue.

    The upward trend line provides insight that factors such as popularity, positive audience reception, and audience engagement can influence the financial success of a film.

    The correlation between Weighted Rating and Gross Revenue is 0.2979338, indicating a weak-to-moderate positive relationship. This suggests that films with higher weighted ratings tend to have higher gross revenue, although the correlation is not very strong.

    On the other hand, the correlation between Number of Votes and Gross Revenue is 0.6569887, indicating a strong positive relationship. This implies that films with a larger number of votes tend to generate higher gross revenue.

    These correlation values provide valuable insights into the association between weighted rating, number of votes, and gross revenue, helping us understand the factors that contribute to a film’s financial success and audience engagement.
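One way to compare the two coefficients is to square them. For a linear relationship between two variables, the squared Pearson correlation is the share of variance in one variable that is accounted for by the other. The sketch below applies this to the two values reported above; it is an added illustration rather than part of the original analysis.

```r
# Correlation coefficients reported in the analysis
r_rating <- 0.2979338  # Weighted Rating vs Gross Revenue
r_votes  <- 0.6569887  # Number of Votes vs Gross Revenue

# Squared correlations: share of variance accounted for by a linear fit
shared_variance <- c(rating = r_rating^2, votes = r_votes^2)
print(round(shared_variance, 3))
```

Weighted Rating accounts for roughly 9% of the variation in gross revenue, while Number of Votes accounts for roughly 43%. Both figures describe association only and do not show that votes or ratings cause higher revenue.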


    Insight Point

    About IMDb TV


    IMDb stands for Internet Movie Database. It is a website that collects information related to films, television series, video productions, and entertainment events. IMDb was founded in 1990 and has become one of the most popular and trusted sources of information in the entertainment industry.

    On IMDb, users can find lists of films and TV shows along with details such as title, synopsis, release dates, cast, production crew, reviews, and ratings. The site also provides information about awards, box office performance, trailers, and various interesting facts about film and television productions.