Introduction

Netflix, Inc. is a media services provider and production company headquartered in Los Gatos, California. It was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California. The company’s primary business is its subscription-based streaming service, which offers online streaming of a library of films and television series and is used by people around the world. Data analysis has helped attract more users to the platform, and many users spend much of their viewing time on Netflix.

In this report, I explore the Netflix dataset, which is available on Kaggle, through visualizations and graphs built with R libraries. I chose this dataset because I am interested in movies and shows, and I will be sharing this project on public movie forums with other movie lovers. The goal is to understand the trends in movies and TV shows on Netflix. I endeavor to answer the following questions through data analysis:

Company background sourced from Wikipedia.

Dataset Description

This dataset comes from Kaggle, and Netflix is the collector of the data. Shivam Bansal created and uploaded the dataset to Kaggle, where it is in the public domain. It lists the movies and TV shows available on Flixable, a third-party Netflix search engine. The dataset has 8,807 rows and 12 variables, including the genre, cast, director, countries where the title is available to watch, and a summary of what the title is about. The columns and their descriptions are listed below:

Data Quality

This section describes the quality of the dataset, covering its missingness, strengths, and weaknesses. Tables and figures are presented to describe the data, illustrate the distributions of key variables, and present important results.

Strengths

This data was collected for the analysis of movies and series. It is a tidy dataset, meaning that each variable has its own column, each observation has its own row, and each value has its own cell, so it is appropriate for most analyses. Another strength is that the dataset has both categorical and continuous variables, so the information can be investigated and reported by category, which makes for informative plots. In addition, the description variable can be used to find similar movies and TV shows through text similarity in further analysis. The sample size is another strength, since it makes it possible to analyze and compare the data.

Weaknesses

There is some missingness in the dataset, as seen in Table 1, so some data cleaning was required. The cleaning process involved identifying incorrect, incomplete, inaccurate, irrelevant, or missing pieces of data and modifying, replacing, or deleting them as needed. Table 1 shows that only director, cast, country, date added, rating, and duration have missing values, and that director has the most, with 2,634 cases. Country is missing in 831 cases and cast in 825. Date added, rating, and duration together account for 17 missing values (10, 4, and 3, respectively). Another weakness of this dataset is data duplication, which could be fixed by merging duplicate records into a single entry.

Table 1: Missing Values Count by Variable

Variable        Missing Values Count
Show ID                            0
Type                               0
Title                              0
Director                        2634
Cast                             825
Country                          831
Date Added                        10
Release Year                       0
Rating                             4
Duration                           3
Genre                              0
Description                        0
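A per-variable count like the one in Table 1 can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used for this report: the data frame `netflix` and its values are made-up stand-ins, and it assumes that empty strings should be counted as missing alongside NA.

```r
# Toy stand-in for the Netflix data (values are hypothetical).
netflix <- data.frame(
  title    = c("A", "B", "C", "D"),
  director = c("Jane Doe", NA, "", "John Roe"),
  country  = c("India", "United States", NA, "Japan"),
  stringsAsFactors = FALSE
)

# A cell counts as missing if it is NA or an empty string (assumption).
is_missing <- is.na(netflix) | netflix == ""

# Count missing cells per column and sort from most to least missing.
missing_counts <- sort(colSums(is_missing), decreasing = TRUE)
print(missing_counts)
```

Sorting the counts puts the variable with the most missing values first, which is the order in which the weaknesses are discussed above.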

Data Exploration

Descriptives and Demographics

I analyzed the Netflix TV show and movie data, and the results are below. Table 2 summarizes the characteristics of the titles sampled for this study. Before starting the analysis, I cleaned the data. All data summaries and analyses were conducted in R.

First, I looked at the types of content and the top 10 countries. The patchwork package was used to combine the two graphs into one figure, Figure 1. As Figure 1.a shows, there are more than twice as many movies as TV shows on Netflix. The bar chart gives the percentage of each content type: 68.4% of the content on Netflix is movies. The largest number of titles (both movies and TV shows) carry the TV-MA rating. Release year is also an important variable, since it shows when a title on the platform was released.

The visualizations use the Netflix color palette and Coolors, a web tool that lets users with no design background generate matching color schemes quickly. These choices made the visualizations more readable.

Table 2: Demographic Table
Characteristic    N = 936¹
Type
    Movie 936 (100%)
Rating
    TV-14 265 (28%)
    TV-MA 553 (59%)
    TV-PG 118 (13%)
Genre
    Documentaries 289 (31%)
    Dramas, International Movies 329 (35%)
    Stand-Up Comedy 318 (34%)
Release Year 2017 (2015, 2019)
Duration
    100 min 11 (1.2%)
    101 min 12 (1.3%)
    102 min 13 (1.4%)
    103 min 8 (0.9%)
    104 min 7 (0.7%)
    105 min 10 (1.1%)
    106 min 10 (1.1%)
    107 min 13 (1.4%)
    108 min 7 (0.7%)
    109 min 2 (0.2%)
    110 min 7 (0.7%)
    111 min 4 (0.4%)
    112 min 6 (0.6%)
    113 min 8 (0.9%)
    114 min 9 (1.0%)
    115 min 6 (0.6%)
    116 min 9 (1.0%)
    117 min 7 (0.7%)
    118 min 9 (1.0%)
    119 min 4 (0.4%)
    12 min 1 (0.1%)
    120 min 5 (0.5%)
    121 min 7 (0.7%)
    122 min 3 (0.3%)
    123 min 5 (0.5%)
    124 min 4 (0.4%)
    125 min 3 (0.3%)
    126 min 5 (0.5%)
    127 min 5 (0.5%)
    128 min 2 (0.2%)
    129 min 1 (0.1%)
    130 min 3 (0.3%)
    131 min 2 (0.2%)
    132 min 6 (0.6%)
    133 min 1 (0.1%)
    135 min 5 (0.5%)
    136 min 5 (0.5%)
    137 min 2 (0.2%)
    139 min 1 (0.1%)
    14 min 1 (0.1%)
    140 min 2 (0.2%)
    141 min 3 (0.3%)
    142 min 1 (0.1%)
    143 min 6 (0.6%)
    144 min 1 (0.1%)
    145 min 1 (0.1%)
    146 min 1 (0.1%)
    148 min 2 (0.2%)
    149 min 1 (0.1%)
    15 min 1 (0.1%)
    150 min 2 (0.2%)
    151 min 3 (0.3%)
    155 min 1 (0.1%)
    156 min 1 (0.1%)
    157 min 2 (0.2%)
    158 min 2 (0.2%)
    159 min 2 (0.2%)
    160 min 1 (0.1%)
    161 min 3 (0.3%)
    162 min 1 (0.1%)
    164 min 1 (0.1%)
    165 min 1 (0.1%)
    168 min 1 (0.1%)
    17 min 1 (0.1%)
    174 min 1 (0.1%)
    180 min 1 (0.1%)
    182 min 1 (0.1%)
    185 min 1 (0.1%)
    186 min 1 (0.1%)
    20 min 1 (0.1%)
    209 min 1 (0.1%)
    22 min 1 (0.1%)
    24 min 4 (0.4%)
    25 min 1 (0.1%)
    26 min 1 (0.1%)
    28 min 1 (0.1%)
    29 min 4 (0.4%)
    30 min 3 (0.3%)
    31 min 1 (0.1%)
    32 min 1 (0.1%)
    33 min 1 (0.1%)
    34 min 1 (0.1%)
    35 min 2 (0.2%)
    38 min 1 (0.1%)
    40 min 6 (0.6%)
    41 min 2 (0.2%)
    42 min 1 (0.1%)
    44 min 5 (0.5%)
    45 min 3 (0.3%)
    46 min 3 (0.3%)
    47 min 2 (0.2%)
    48 min 2 (0.2%)
    49 min 4 (0.4%)
    50 min 7 (0.7%)
    51 min 5 (0.5%)
    52 min 10 (1.1%)
    53 min 12 (1.3%)
    54 min 13 (1.4%)
    55 min 7 (0.7%)
    56 min 7 (0.7%)
    57 min 8 (0.9%)
    58 min 10 (1.1%)
    59 min 15 (1.6%)
    60 min 16 (1.7%)
    61 min 18 (1.9%)
    62 min 16 (1.7%)
    63 min 22 (2.4%)
    64 min 14 (1.5%)
    65 min 11 (1.2%)
    66 min 20 (2.1%)
    67 min 13 (1.4%)
    68 min 9 (1.0%)
    69 min 15 (1.6%)
    70 min 14 (1.5%)
    71 min 14 (1.5%)
    72 min 9 (1.0%)
    73 min 6 (0.6%)
    74 min 9 (1.0%)
    75 min 4 (0.4%)
    76 min 11 (1.2%)
    77 min 10 (1.1%)
    78 min 13 (1.4%)
    79 min 8 (0.9%)
    80 min 7 (0.7%)
    81 min 12 (1.3%)
    82 min 11 (1.2%)
    83 min 7 (0.7%)
    84 min 15 (1.6%)
    85 min 7 (0.7%)
    86 min 13 (1.4%)
    87 min 13 (1.4%)
    88 min 14 (1.5%)
    89 min 12 (1.3%)
    90 min 17 (1.8%)
    91 min 13 (1.4%)
    92 min 15 (1.6%)
    93 min 16 (1.7%)
    94 min 14 (1.5%)
    95 min 11 (1.2%)
    96 min 18 (1.9%)
    97 min 18 (1.9%)
    98 min 11 (1.2%)
    99 min 19 (2.0%)
Country
    Argentina 14 (1.6%)
    Argentina, Chile 1 (0.1%)
    Argentina, France, United States, Germany, Qatar 1 (0.1%)
    Argentina, Uruguay, Spain, France 1 (0.1%)
    Australia 3 (0.3%)
    Australia, United States 1 (0.1%)
    Austria 2 (0.2%)
    Austria, Czech Republic 1 (0.1%)
    Belgium, France 1 (0.1%)
    Belgium, France, Netherlands 1 (0.1%)
    Brazil 13 (1.5%)
    Cameroon 1 (0.1%)
    Canada 8 (0.9%)
    Canada, India 1 (0.1%)
    Canada, India, Thailand, United States, United Arab Emirates 1 (0.1%)
    Canada, Nigeria 1 (0.1%)
    Canada, United States 3 (0.3%)
    Canada, United States, United Kingdom 1 (0.1%)
    Chile 3 (0.3%)
    Chile, Peru 1 (0.1%)
    Chile, Spain, Argentina, Germany 1 (0.1%)
    China 1 (0.1%)
    China, Germany, India, United States 1 (0.1%)
    Colombia 4 (0.5%)
    Croatia, Slovenia, Serbia, Montenegro 1 (0.1%)
    Czech Republic, Slovakia 1 (0.1%)
    Czech Republic, United States 1 (0.1%)
    Denmark 3 (0.3%)
    Denmark, France, Poland 1 (0.1%)
    Denmark, Sweden, Israel, United States 1 (0.1%)
    Denmark, United States 1 (0.1%)
    Egypt 12 (1.4%)
    Egypt, France 1 (0.1%)
    Finland, Sweden, Norway, Latvia, Germany 1 (0.1%)
    France 11 (1.3%)
    France, Algeria 1 (0.1%)
    France, Belgium 2 (0.2%)
    France, Belgium, Luxembourg, Cambodia, 1 (0.1%)
    France, Egypt 1 (0.1%)
    France, Netherlands, Singapore 1 (0.1%)
    France, New Zealand 1 (0.1%)
    France, Norway, Lebanon, Belgium 1 (0.1%)
    Germany 8 (0.9%)
    Germany, Australia 1 (0.1%)
    Germany, Czech Republic 1 (0.1%)
    Germany, United States, Sweden 1 (0.1%)
    Ghana 1 (0.1%)
    Hong Kong 5 (0.6%)
    Hungary 1 (0.1%)
    Iceland 1 (0.1%)
    India 122 (14%)
    India, Australia 1 (0.1%)
    India, France 1 (0.1%)
    India, United Kingdom, Canada, United States 1 (0.1%)
    Indonesia 13 (1.5%)
    Indonesia, Netherlands 1 (0.1%)
    Indonesia, South Korea, Singapore 1 (0.1%)
    Indonesia, United Kingdom 1 (0.1%)
    Ireland 1 (0.1%)
    Ireland, United Kingdom 1 (0.1%)
    Italy 5 (0.6%)
    Italy, Belgium 1 (0.1%)
    Italy, France 1 (0.1%)
    Italy, Switzerland, Albania, Poland 1 (0.1%)
    Japan 7 (0.8%)
    Kenya, United States 1 (0.1%)
    Lebanon 2 (0.2%)
    Lebanon, United Arab Emirates 1 (0.1%)
    Lebanon, United States, United Arab Emirates 1 (0.1%)
    Malaysia 4 (0.5%)
    Mexico 23 (2.7%)
    Namibia 1 (0.1%)
    Netherlands 4 (0.5%)
    Netherlands, Belgium, Germany, Jordan 1 (0.1%)
    Netherlands, United States 1 (0.1%)
    New Zealand 1 (0.1%)
    Nigeria 18 (2.1%)
    Nigeria, United Kingdom 1 (0.1%)
    Norway, Germany 1 (0.1%)
    Pakistan 1 (0.1%)
    Peru, Germany, Norway 1 (0.1%)
    Philippines 3 (0.3%)
    Poland 3 (0.3%)
    Poland, United States 1 (0.1%)
    Romania, France, Switzerland, Germany 1 (0.1%)
    Russia 1 (0.1%)
    Singapore 1 (0.1%)
    Singapore, Japan, France 1 (0.1%)
    South Africa 5 (0.6%)
    South Korea 5 (0.6%)
    Spain 10 (1.2%)
    Spain, Mexico, France 1 (0.1%)
    Spain, Switzerland 1 (0.1%)
    Sweden 1 (0.1%)
    Sweden, Czech Republic, United Kingdom, Denmark, Netherlands 1 (0.1%)
    Sweden, United States 2 (0.2%)
    Taiwan 2 (0.2%)
    Thailand 1 (0.1%)
    Turkey 7 (0.8%)
    Turkey, France, Germany, Poland 1 (0.1%)
    United Kingdom 56 (6.5%)
    United Kingdom, 1 (0.1%)
    United Kingdom, Germany, Canada 1 (0.1%)
    United Kingdom, Hong Kong 1 (0.1%)
    United Kingdom, Lithuania 1 (0.1%)
    United Kingdom, United States 1 (0.1%)
    United States 389 (45%)
    United States, 1 (0.1%)
    United States, Australia 1 (0.1%)
    United States, Australia, China 1 (0.1%)
    United States, Australia, South Africa, United Kingdom 1 (0.1%)
    United States, Bermuda, Ecuador 1 (0.1%)
    United States, Botswana 1 (0.1%)
    United States, Canada 1 (0.1%)
    United States, China, United Kingdom 1 (0.1%)
    United States, Denmark 1 (0.1%)
    United States, Japan 1 (0.1%)
    United States, Mexico 1 (0.1%)
    United States, Nigeria 1 (0.1%)
    United States, Senegal 1 (0.1%)
    United States, United Kingdom 4 (0.5%)
    United States, United Kingdom, Germany 1 (0.1%)
    United States, Uruguay 1 (0.1%)
    United States, Venezuela 1 (0.1%)
    Uruguay, Argentina, Spain 1 (0.1%)
    Vietnam 1 (0.1%)
    Unknown 69
¹ n (%); Median (IQR)

As mentioned earlier, Netflix is a streaming service used around the world. Its service is widely available in many countries, including but not limited to the United States, India, South Korea, and Japan. I visualized the top 10 countries: Figure 1.b represents the amount of media (both TV shows and movies) in different countries. The United States leads in the amount of content on Netflix, “Others” is in second place (this category includes missing data), and India is in third place. For this plot, fct_reorder() reorders the factor levels by sorting along another variable, fct_lump() lumps the less frequent factor levels together into “Other”, and fct_explicit_na() gives the missing values noted earlier an explicit level. I chose bar charts because they are a good choice for showing the relationship between a numeric and a categorical variable.
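The factor handling described above can be sketched with forcats on a small made-up vector. This is only an illustration under assumptions: the country values are hypothetical, the order of the steps is a guess rather than the order used for Figure 1.b, and fct_na_value_to_level() is used in place of fct_explicit_na(), which it supersedes in forcats 1.0.0.

```r
library(forcats)

# Hypothetical country values; NA stands for a title with no country recorded.
country <- c("United States", "United States", "United States",
             "India", "India", "Japan", "Brazil", NA)
f <- factor(country)

# Turn missing values into an explicit level so they are not dropped.
f <- fct_na_value_to_level(f, level = "Missing")

# Keep the 2 most frequent levels and lump everything else into "Other".
f <- fct_lump_n(f, n = 2, other_level = "Other")

# Count titles per level.
counts <- as.data.frame(table(country = f))

# Reorder the levels by count so bars would be drawn in sorted order.
counts$country <- fct_reorder(counts$country, counts$Freq)
print(counts)
```

Because the missing values are made explicit before lumping, they end up inside “Other” in this sketch, which is one way the “Others” bar could come to include missing data.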

Figure 1: Respective percentages of TV shows and movies on Netflix (a) and number of titles in different countries (b)

The next analysis is on content duration. As discussed, there are two types of content, movies and TV shows, and they use different units: 2,410 titles use “season” as the measurement for duration, and 5,377 use minutes. I decided to process only the movies. Because duration is a continuous variable, I used a histogram, drawn with geom_histogram(), to visualize its distribution. As Figure 2 shows, durations of 90-99 min are the most common, followed by 100-109 min, 80-89 min, and 110-119 min. gghighlight() was used to highlight the bars with the highest counts. The red line in the histogram shows that the distribution is centered around 100 minutes for this type of content.
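Separating the two duration units and grouping movie durations into 10-minute ranges can be sketched in base R. The duration strings below are hypothetical examples of the two formats, and the bin edges are chosen only for illustration; this is not the code behind Figure 2.

```r
# Hypothetical duration strings in the two formats described above.
duration <- c("90 min", "95 min", "104 min", "2 Seasons", "1 Season", "87 min")

# Movies are the entries measured in minutes.
is_movie <- grepl("min$", duration)

# Strip the unit and convert the remaining text to a number.
minutes <- as.integer(sub(" min$", "", duration[is_movie]))

# Group into 10-minute bins: [80, 90), [90, 100), [100, 110).
bins <- cut(minutes, breaks = seq(80, 110, by = 10), right = FALSE)
bin_counts <- table(bins)
print(bin_counts)
```

Using right = FALSE makes each bin include its lower edge, so a 90-minute movie falls in the 90-99 range, matching how the ranges are described in the text.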

Figure 2: Movie duration

As noted in the overview of the dataset, Figure 3 shows that the earliest release year recorded for a movie or TV show is 1925 and the latest is 2022. Figure 3 also indicates that most of the movies were released after 2000. As the two graphs show, the most titles were added to the streaming service from 2016 to 2019, with the spike in 2019, followed by 2020. Note that this refers to titles added, not viewer numbers. The increase stopped in 2020 with the Covid-19 pandemic, likely because of the decrease in movie and TV show production.

Figure 3: The trends of TV shows and movies over the years

Lastly, I wanted to know whether there is any relationship between the country and the duration of movies. I initially hypothesized that movie durations are the same across countries. I selected the top 10 countries, and Figure 4 shows the duration of their movies. The plot shows that movies produced in India tend to be the longest, with an average duration of 127 min. The figure plots the raw data points together with a summary showing the mean duration, so that both the raw data and a summary are visible.
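The mean duration per country that underlies such a comparison can be sketched in base R. The data frame `movies` and its values are invented for illustration and do not reproduce the figures reported above.

```r
# Hypothetical movie durations (in minutes) for three countries.
movies <- data.frame(
  country = c("India", "India", "United States", "United States", "Japan"),
  minutes = c(130, 120, 95, 101, 88)
)

# Mean duration per country, sorted from longest to shortest.
mean_duration <- sort(tapply(movies$minutes, movies$country, mean),
                      decreasing = TRUE)
print(round(mean_duration, 1))
```

Plotting these means on top of the raw points, as in Figure 4, shows both the typical duration in each country and the spread around it.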

Figure 4: Relationship between the country and duration of movies

Conclusion

By analyzing the Netflix dataset with different plots, such as histograms, boxplots, and bar charts, I found that 68.4% of the content on Netflix is movies. The top three countries creating TV shows for Netflix are the United States, the United Kingdom, and Japan, and the top three countries creating movies are the United States, India, and the United Kingdom. Movies produced in India are the longest on average. The most common movie duration is 90 to 99 minutes. Very few movies and TV shows were released before 2000, and the number of titles added spiked in 2019. I plan to do more analysis on this dataset to find further relationships between variables.

References

Coding References

Analyses were conducted using the R Statistical language (version 4.2.1; R Core Team, 2022) on macOS Big Sur … 10.16, using the packages lubridate (version 1.9.2; Grolemund G, Wickham H, 2011), report (version 0.5.6; Makowski D et al., 2023), tibble (version 3.1.8; Müller K, Wickham H, 2022), RColorBrewer (version 1.1.3; Neuwirth E, 2022), patchwork (version 1.1.2; Pedersen T, 2022), gtsummary (version 1.7.0; Sjoberg D et al., 2021), skimr (version 2.1.5; Waring E et al., 2022), ggplot2 (version 3.4.1; Wickham H, 2016), stringr (version 1.5.0; Wickham H, 2022), forcats (version 1.0.0; Wickham H, 2023), tidyverse (version 2.0.0; Wickham H et al., 2019), dplyr (version 1.1.0; Wickham H et al., 2023), purrr (version 1.0.1; Wickham H, Henry L, 2023), readr (version 2.1.4; Wickham H et al., 2023), scales (version 1.2.1; Wickham H, Seidel D, 2022), tidyr (version 1.3.0; Wickham H et al., 2023), gghighlight (version 0.4.0; Yutani H, 2022) and kableExtra (version 1.3.4; Zhu H, 2021).

References

  1. (https://rmarkdown.rstudio.com/lesson-1.html)

  2. (https://r4ds.had.co.nz/r-markdown.html#table)

  3. (https://thriv.github.io/biodatasci2018/r-refresher-tidy-eda.html#about_the_data)

  4. (https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)

  5. (https://github.com/280220/Netflix-Trend-Analysis)

  6. (https://www.roelpeters.be/scale-ggplot-y-axis-millions-or-thousands-r/)

  7. (http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/)

  8. (https://www.kaggle.com/datasets/shivamb/netflix-shows/code)

  9. (https://www.kaggle.com/code/khsamaha/netflix-tv-shows-and-movies-eda-and-word-cloud/report)

  10. (https://www.kaggle.com/code/vikassingh1996/netflix-movies-and-shows-plotly-recommender-sys/notebook)

  11. (https://bookdown.org/anshul302/HE802-MGHIHP-Spring2020/intro.html)

  12. (https://socviz.co/modeling.html)

  13. (https://r4ds.had.co.nz/exploratory-data-analysis.html)