Introduction
Netflix, Inc. is a media services provider and production company headquartered in Los Gatos, California. It was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California. The company’s primary business is its subscription-based streaming service, which offers online streaming of a library of films and television series and is used by people around the world. In this report, I explore the Netflix dataset, which can be found on Kaggle, through visualizations and graphs built with R libraries. I chose this dataset because I am into movies and shows, and I will be sharing this project on public forums with other movie-lovers. Data analysis has helped the platform attract more users, and many users tend to spend most of their viewing time on Netflix. With this in mind, I would like to explore the dataset to understand the trends in movies and TV shows on Netflix. I endeavor to answer the following questions through data analysis:
Source: Wikipedia
This dataset is from Kaggle, and Netflix is the original collector of the data. Shivam Bansal created and uploaded the dataset to Kaggle, and it is in the public domain. It lists all the movies and TV shows available on Flixable, a third-party search engine for Netflix. The dataset has 8,807 rows and 12 variables. It includes information about the genre, cast, director, countries where the show is available to watch, and a summary of what the show is about. The columns and their descriptions are listed below:
The purpose of this section is to describe the quality of the dataset. Missingness, strengths, and weaknesses will be discussed. A number of tables and figures will be presented to facilitate cohort description, illustrate the distributions of key variables, and present important results.
The data were collected for the analysis of movies and series. It is a tidy dataset: each variable has its own column, each observation has its own row, and each value has its own cell, so it is appropriate for most analyses. Another strength is that the dataset has both categorical and continuous variables, so we can break the information down by category, report each category separately, and make informative plots. The free-text description variable is also useful: it could be used in further analysis to find similar movies and TV shows through text similarity. The sample size (8,807 titles) is another strength, since it is large enough to analyze and compare groups within the data.
There is some missingness in the dataset, as seen in Table 1, so some data cleaning was required. The data cleaning process involved identifying incorrect, incomplete, inaccurate, irrelevant, or missing pieces of data and modifying, replacing, or deleting them as needed. Table 1 shows that only the variables director, cast, country, date added, rating, and duration have missing values, and that director has the most. Director is missing in 2,634 cases, country in 831 cases, and cast in 825 cases. Date added, rating, and duration together account for 17 missing values (10, 4, and 3, respectively). Another weakness of this dataset is data duplication; however, this could be fixed by merging duplicate records into a single entry.
Variable | Missing Values Count |
---|---|
Show ID | 0 |
Type | 0 |
Title | 0 |
Director | 2634 |
Cast | 825 |
Country | 831 |
Date Added | 10 |
Release Year | 0 |
Rating | 4 |
Duration | 3 |
Genre | 0 |
Description | 0 |
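The per-variable counts in Table 1 can be reproduced with a few lines of base R. The sketch below is illustrative only: the `titles` data frame is a small hypothetical stand-in for the real file, and treating empty strings as missing is an assumption about how the CSV encodes blanks.

```r
# Hypothetical stand-in for the Netflix data (the real file has 8,807 rows).
titles <- data.frame(
  show_id  = c("s1", "s2", "s3", "s4"),
  director = c("A. Director", "", NA, "B. Director"),
  country  = c("India", NA, "United States", ""),
  stringsAsFactors = FALSE
)

# Count values that are NA or empty in each column.
# Assumption: an empty string means "missing" in the exported CSV.
missing_counts <- sapply(titles, function(col) sum(is.na(col) | col == ""))

print(missing_counts)
```

Applied to the full dataset, the same call would return one count per variable, which can then be formatted as a table.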
I analyzed the Netflix TV shows and movie data. Below are my results.
Table 2 shows the characteristics of the titles sampled in this study. Before starting the analysis, we cleaned the data. First, we look at the different types of content and at the top 10 countries (Figure 1); the patchwork package was used to combine the two graphs into one figure. As Figure 1.a shows, there are more than twice as many movies as TV shows on Netflix. We use a bar chart to show the percentage of each content type; 68.4% of the content on Netflix is movies. The largest number of programs (both movies and TV shows) carry the TV-MA rating. Release year is also an important variable, since it tells us when a title on the platform was originally released. The visualizations use the Netflix color palette and Coolors, a web tool that lets users with no design background generate matching colors in seconds; these made the visualizations more readable. All data summaries and analyses were conducted using R.
Characteristic | N = 936¹ |
---|---|
Type | |
Movie | 936 (100%) |
Rating | |
TV-14 | 265 (28%) |
TV-MA | 553 (59%) |
TV-PG | 118 (13%) |
Genre | |
Documentaries | 289 (31%) |
Dramas, International Movies | 329 (35%) |
Stand-Up Comedy | 318 (34%) |
Release Year | 2017 (2015, 2019) |
Duration | |
100 min | 11 (1.2%) |
101 min | 12 (1.3%) |
102 min | 13 (1.4%) |
103 min | 8 (0.9%) |
104 min | 7 (0.7%) |
105 min | 10 (1.1%) |
106 min | 10 (1.1%) |
107 min | 13 (1.4%) |
108 min | 7 (0.7%) |
109 min | 2 (0.2%) |
110 min | 7 (0.7%) |
111 min | 4 (0.4%) |
112 min | 6 (0.6%) |
113 min | 8 (0.9%) |
114 min | 9 (1.0%) |
115 min | 6 (0.6%) |
116 min | 9 (1.0%) |
117 min | 7 (0.7%) |
118 min | 9 (1.0%) |
119 min | 4 (0.4%) |
12 min | 1 (0.1%) |
120 min | 5 (0.5%) |
121 min | 7 (0.7%) |
122 min | 3 (0.3%) |
123 min | 5 (0.5%) |
124 min | 4 (0.4%) |
125 min | 3 (0.3%) |
126 min | 5 (0.5%) |
127 min | 5 (0.5%) |
128 min | 2 (0.2%) |
129 min | 1 (0.1%) |
130 min | 3 (0.3%) |
131 min | 2 (0.2%) |
132 min | 6 (0.6%) |
133 min | 1 (0.1%) |
135 min | 5 (0.5%) |
136 min | 5 (0.5%) |
137 min | 2 (0.2%) |
139 min | 1 (0.1%) |
14 min | 1 (0.1%) |
140 min | 2 (0.2%) |
141 min | 3 (0.3%) |
142 min | 1 (0.1%) |
143 min | 6 (0.6%) |
144 min | 1 (0.1%) |
145 min | 1 (0.1%) |
146 min | 1 (0.1%) |
148 min | 2 (0.2%) |
149 min | 1 (0.1%) |
15 min | 1 (0.1%) |
150 min | 2 (0.2%) |
151 min | 3 (0.3%) |
155 min | 1 (0.1%) |
156 min | 1 (0.1%) |
157 min | 2 (0.2%) |
158 min | 2 (0.2%) |
159 min | 2 (0.2%) |
160 min | 1 (0.1%) |
161 min | 3 (0.3%) |
162 min | 1 (0.1%) |
164 min | 1 (0.1%) |
165 min | 1 (0.1%) |
168 min | 1 (0.1%) |
17 min | 1 (0.1%) |
174 min | 1 (0.1%) |
180 min | 1 (0.1%) |
182 min | 1 (0.1%) |
185 min | 1 (0.1%) |
186 min | 1 (0.1%) |
20 min | 1 (0.1%) |
209 min | 1 (0.1%) |
22 min | 1 (0.1%) |
24 min | 4 (0.4%) |
25 min | 1 (0.1%) |
26 min | 1 (0.1%) |
28 min | 1 (0.1%) |
29 min | 4 (0.4%) |
30 min | 3 (0.3%) |
31 min | 1 (0.1%) |
32 min | 1 (0.1%) |
33 min | 1 (0.1%) |
34 min | 1 (0.1%) |
35 min | 2 (0.2%) |
38 min | 1 (0.1%) |
40 min | 6 (0.6%) |
41 min | 2 (0.2%) |
42 min | 1 (0.1%) |
44 min | 5 (0.5%) |
45 min | 3 (0.3%) |
46 min | 3 (0.3%) |
47 min | 2 (0.2%) |
48 min | 2 (0.2%) |
49 min | 4 (0.4%) |
50 min | 7 (0.7%) |
51 min | 5 (0.5%) |
52 min | 10 (1.1%) |
53 min | 12 (1.3%) |
54 min | 13 (1.4%) |
55 min | 7 (0.7%) |
56 min | 7 (0.7%) |
57 min | 8 (0.9%) |
58 min | 10 (1.1%) |
59 min | 15 (1.6%) |
60 min | 16 (1.7%) |
61 min | 18 (1.9%) |
62 min | 16 (1.7%) |
63 min | 22 (2.4%) |
64 min | 14 (1.5%) |
65 min | 11 (1.2%) |
66 min | 20 (2.1%) |
67 min | 13 (1.4%) |
68 min | 9 (1.0%) |
69 min | 15 (1.6%) |
70 min | 14 (1.5%) |
71 min | 14 (1.5%) |
72 min | 9 (1.0%) |
73 min | 6 (0.6%) |
74 min | 9 (1.0%) |
75 min | 4 (0.4%) |
76 min | 11 (1.2%) |
77 min | 10 (1.1%) |
78 min | 13 (1.4%) |
79 min | 8 (0.9%) |
80 min | 7 (0.7%) |
81 min | 12 (1.3%) |
82 min | 11 (1.2%) |
83 min | 7 (0.7%) |
84 min | 15 (1.6%) |
85 min | 7 (0.7%) |
86 min | 13 (1.4%) |
87 min | 13 (1.4%) |
88 min | 14 (1.5%) |
89 min | 12 (1.3%) |
90 min | 17 (1.8%) |
91 min | 13 (1.4%) |
92 min | 15 (1.6%) |
93 min | 16 (1.7%) |
94 min | 14 (1.5%) |
95 min | 11 (1.2%) |
96 min | 18 (1.9%) |
97 min | 18 (1.9%) |
98 min | 11 (1.2%) |
99 min | 19 (2.0%) |
Country | |
Argentina | 14 (1.6%) |
Argentina, Chile | 1 (0.1%) |
Argentina, France, United States, Germany, Qatar | 1 (0.1%) |
Argentina, Uruguay, Spain, France | 1 (0.1%) |
Australia | 3 (0.3%) |
Australia, United States | 1 (0.1%) |
Austria | 2 (0.2%) |
Austria, Czech Republic | 1 (0.1%) |
Belgium, France | 1 (0.1%) |
Belgium, France, Netherlands | 1 (0.1%) |
Brazil | 13 (1.5%) |
Cameroon | 1 (0.1%) |
Canada | 8 (0.9%) |
Canada, India | 1 (0.1%) |
Canada, India, Thailand, United States, United Arab Emirates | 1 (0.1%) |
Canada, Nigeria | 1 (0.1%) |
Canada, United States | 3 (0.3%) |
Canada, United States, United Kingdom | 1 (0.1%) |
Chile | 3 (0.3%) |
Chile, Peru | 1 (0.1%) |
Chile, Spain, Argentina, Germany | 1 (0.1%) |
China | 1 (0.1%) |
China, Germany, India, United States | 1 (0.1%) |
Colombia | 4 (0.5%) |
Croatia, Slovenia, Serbia, Montenegro | 1 (0.1%) |
Czech Republic, Slovakia | 1 (0.1%) |
Czech Republic, United States | 1 (0.1%) |
Denmark | 3 (0.3%) |
Denmark, France, Poland | 1 (0.1%) |
Denmark, Sweden, Israel, United States | 1 (0.1%) |
Denmark, United States | 1 (0.1%) |
Egypt | 12 (1.4%) |
Egypt, France | 1 (0.1%) |
Finland, Sweden, Norway, Latvia, Germany | 1 (0.1%) |
France | 11 (1.3%) |
France, Algeria | 1 (0.1%) |
France, Belgium | 2 (0.2%) |
France, Belgium, Luxembourg, Cambodia, | 1 (0.1%) |
France, Egypt | 1 (0.1%) |
France, Netherlands, Singapore | 1 (0.1%) |
France, New Zealand | 1 (0.1%) |
France, Norway, Lebanon, Belgium | 1 (0.1%) |
Germany | 8 (0.9%) |
Germany, Australia | 1 (0.1%) |
Germany, Czech Republic | 1 (0.1%) |
Germany, United States, Sweden | 1 (0.1%) |
Ghana | 1 (0.1%) |
Hong Kong | 5 (0.6%) |
Hungary | 1 (0.1%) |
Iceland | 1 (0.1%) |
India | 122 (14%) |
India, Australia | 1 (0.1%) |
India, France | 1 (0.1%) |
India, United Kingdom, Canada, United States | 1 (0.1%) |
Indonesia | 13 (1.5%) |
Indonesia, Netherlands | 1 (0.1%) |
Indonesia, South Korea, Singapore | 1 (0.1%) |
Indonesia, United Kingdom | 1 (0.1%) |
Ireland | 1 (0.1%) |
Ireland, United Kingdom | 1 (0.1%) |
Italy | 5 (0.6%) |
Italy, Belgium | 1 (0.1%) |
Italy, France | 1 (0.1%) |
Italy, Switzerland, Albania, Poland | 1 (0.1%) |
Japan | 7 (0.8%) |
Kenya, United States | 1 (0.1%) |
Lebanon | 2 (0.2%) |
Lebanon, United Arab Emirates | 1 (0.1%) |
Lebanon, United States, United Arab Emirates | 1 (0.1%) |
Malaysia | 4 (0.5%) |
Mexico | 23 (2.7%) |
Namibia | 1 (0.1%) |
Netherlands | 4 (0.5%) |
Netherlands, Belgium, Germany, Jordan | 1 (0.1%) |
Netherlands, United States | 1 (0.1%) |
New Zealand | 1 (0.1%) |
Nigeria | 18 (2.1%) |
Nigeria, United Kingdom | 1 (0.1%) |
Norway, Germany | 1 (0.1%) |
Pakistan | 1 (0.1%) |
Peru, Germany, Norway | 1 (0.1%) |
Philippines | 3 (0.3%) |
Poland | 3 (0.3%) |
Poland, United States | 1 (0.1%) |
Romania, France, Switzerland, Germany | 1 (0.1%) |
Russia | 1 (0.1%) |
Singapore | 1 (0.1%) |
Singapore, Japan, France | 1 (0.1%) |
South Africa | 5 (0.6%) |
South Korea | 5 (0.6%) |
Spain | 10 (1.2%) |
Spain, Mexico, France | 1 (0.1%) |
Spain, Switzerland | 1 (0.1%) |
Sweden | 1 (0.1%) |
Sweden, Czech Republic, United Kingdom, Denmark, Netherlands | 1 (0.1%) |
Sweden, United States | 2 (0.2%) |
Taiwan | 2 (0.2%) |
Thailand | 1 (0.1%) |
Turkey | 7 (0.8%) |
Turkey, France, Germany, Poland | 1 (0.1%) |
United Kingdom | 56 (6.5%) |
United Kingdom, | 1 (0.1%) |
United Kingdom, Germany, Canada | 1 (0.1%) |
United Kingdom, Hong Kong | 1 (0.1%) |
United Kingdom, Lithuania | 1 (0.1%) |
United Kingdom, United States | 1 (0.1%) |
United States | 389 (45%) |
United States, | 1 (0.1%) |
United States, Australia | 1 (0.1%) |
United States, Australia, China | 1 (0.1%) |
United States, Australia, South Africa, United Kingdom | 1 (0.1%) |
United States, Bermuda, Ecuador | 1 (0.1%) |
United States, Botswana | 1 (0.1%) |
United States, Canada | 1 (0.1%) |
United States, China, United Kingdom | 1 (0.1%) |
United States, Denmark | 1 (0.1%) |
United States, Japan | 1 (0.1%) |
United States, Mexico | 1 (0.1%) |
United States, Nigeria | 1 (0.1%) |
United States, Senegal | 1 (0.1%) |
United States, United Kingdom | 4 (0.5%) |
United States, United Kingdom, Germany | 1 (0.1%) |
United States, Uruguay | 1 (0.1%) |
United States, Venezuela | 1 (0.1%) |
Uruguay, Argentina, Spain | 1 (0.1%) |
Vietnam | 1 (0.1%) |
Unknown | 69 |
¹ n (%); Median (IQR)
Netflix is one of the largest online movie and TV show streaming services, and it is available in many countries, including but not limited to the United States, India, South Korea, and Japan. We visualized the top 10 countries. Figure 1.b shows the amount of media (both TV shows and movies) by country. The United States leads in the amount of content on Netflix, “Others” is in second place (this group includes missing data), and India is in third place. For this plot we used fct_reorder() to reorder factor levels by sorting on another variable, fct_lump() to lump the less frequent factor levels together into “Other”, and fct_explicit_na() to give the missing values noted earlier an explicit level. I chose bar charts because they are a good choice for showing the relationship between a numeric and a categorical variable.
Figure 1: Percentages of TV shows and movies on Netflix (a) and number of titles by country (b)
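The lump-and-reorder steps used for the country plot can be sketched as follows. This is a base R approximation of what fct_explicit_na(), fct_lump(), and fct_reorder() do, not the report's actual code; the `country` vector is hypothetical, and `top_n` is set to 2 rather than 10 only to keep the toy output short.

```r
# Hypothetical country values; NA stands for a missing country.
country <- c("United States", "India", "United States", NA, "Japan",
             "United States", "India", "France", "United States",
             "United States")

top_n <- 2  # the report keeps 10 countries; 2 keeps this example readable

# Step 1: give missing values an explicit label
# (the role fct_explicit_na() plays in the report).
country[is.na(country)] <- "Others"

# Step 2: keep the top_n most frequent real countries and lump the rest
# into "Others" (the role fct_lump() plays).
counts <- sort(table(country[country != "Others"]), decreasing = TRUE)
keep <- names(counts)[seq_len(top_n)]
lumped <- ifelse(country %in% keep, country, "Others")

# Step 3: order the groups by frequency so bars appear sorted
# (the role fct_reorder() plays).
lumped_counts <- sort(table(lumped), decreasing = TRUE)

print(lumped_counts)
```

Because missing values and small countries end up in the same group, "Others" can rank above individual countries, which is the pattern described for the second-place bar.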
The next analysis is on content duration. As discussed earlier, there are two types of content, movies and TV shows, and they use different units: 2,410 of the Netflix titles use “season” as the measurement for duration, and 5,377 use “min”. Because the two units are not comparable, I decided to analyze movies only. Duration is a continuous variable, so I used a histogram, drawn with geom_histogram(), to visualize its distribution. As Figure 2 shows, durations of 90-99 min are the most common, followed by 100-109 min, 80-89 min, and 110-119 min. We also used gghighlight() to highlight the bars with the highest counts. The red line in the histogram shows that the distribution is centered around 100 minutes for this type of content.
Figure 2: Movie duration
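Separating the two duration units and binning movie lengths can be sketched as below. The `duration` strings are hypothetical examples that follow the dataset's "<number> <unit>" pattern, and the 10-minute bin width is an assumption chosen to match the ranges quoted in the text.

```r
# Hypothetical duration strings in the "<number> <unit>" format.
duration <- c("90 min", "2 Seasons", "95 min", "1 Season", "104 min", "127 min")

# Split each value into its numeric part and its unit.
value <- as.integer(sub(" .*$", "", duration))
unit <- ifelse(grepl("min", duration), "min", "season")

# Keep movies only, since minutes and seasons are not comparable.
minutes <- value[unit == "min"]

# Bin into 10-minute intervals; [90,100) covers 90-99 min.
bins <- cut(minutes, breaks = seq(0, 250, by = 10), right = FALSE)
bin_counts <- table(droplevels(bins))

print(table(unit))
print(bin_counts)
```

Passing `minutes` to geom_histogram() with a bin width of 10 would draw the same counts as bars.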
As Figure 3 shows, the earliest release year recorded for a movie or TV show is 1925, and the latest is 2022. Figure 3 also indicates that most titles were released after 2000. As the two graphs show, most titles were added to the streaming service between 2016 and 2019, with the spike in 2019 followed by 2020; note that these are titles added, not viewer numbers. In 2020 the increase stopped, likely because the Covid-19 pandemic reduced the production of movies and TV shows.
Figure 3: Trends of TV shows and movies over the years
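Counting titles by the year they were added can be sketched as follows. The rows are hypothetical, and the "Month day, Year" text format for `date_added` is an assumption about the raw file; the year is pulled out with a regular expression so the example does not depend on the locale's month names.

```r
# Hypothetical rows; date_added is assumed to look like "Month day, Year".
titles <- data.frame(
  type = c("Movie", "TV Show", "Movie", "Movie", "TV Show"),
  date_added = c("September 25, 2019", "January 1, 2019",
                 "March 3, 2020", NA, "July 14, 2020"),
  stringsAsFactors = FALSE
)

# Extract the four-digit year at the end of the string; NA stays NA.
titles$year_added <- as.integer(
  sub("^.*([0-9]{4})[[:space:]]*$", "\\1", titles$date_added)
)

# Cross-tabulate titles by year added and type (rows with NA are dropped).
added_by_year <- table(titles$year_added, titles$type)

print(added_by_year)
```

Each column of the resulting table corresponds to one line in a trend plot of titles added per year.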
Lastly, I wanted to know whether there is any relationship between country and movie duration. I initially hypothesized that movie durations are the same across countries. I selected the top 10 countries to examine the relationship between movies and their duration. Figure 4 shows the duration of movies in these 10 countries. Movies produced in India tend to be the longest, with an average duration of 127 min. The plot shows the raw data points together with a summary marking the mean duration, so that both the raw data and a summary are visible.
Figure 4: Relationship between country and duration of movies
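The per-country mean shown as the summary in this plot can be computed as in the sketch below. The `movies` data frame and its values are hypothetical and chosen only to illustrate the calculation; they are not taken from the dataset.

```r
# Hypothetical movie durations in minutes, one row per movie.
movies <- data.frame(
  country = c("India", "India", "United States", "United States",
              "United Kingdom"),
  minutes = c(130, 124, 90, 100, 98),
  stringsAsFactors = FALSE
)

# Mean duration per country, sorted from longest to shortest.
mean_minutes <- sort(
  vapply(split(movies$minutes, movies$country), mean, numeric(1)),
  decreasing = TRUE
)

print(mean_minutes)
```

In a ggplot2 figure, the same summary could be overlaid on the raw points with a mean layer, so readers see both the spread and the average.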
By analysing the Netflix dataset with different plots, such as histograms, boxplots, and bar charts, we found that 68.4% of the content on Netflix is movies. The top 3 countries creating TV shows for Netflix are the United States, the United Kingdom, and Japan, and the top 3 countries creating movies are the United States, India, and the United Kingdom. Movies produced in India are the longest. We also found that most movies run between 90 and 99 minutes, that very few movies and TV shows were released before 2000, and that the number of titles added spiked in 2019. I plan to do more analysis on this dataset to find further relationships between variables.
Analyses were conducted using the R Statistical language (version 4.2.1; R Core Team, 2022) on macOS Big Sur … 10.16, using the packages lubridate (version 1.9.2; Grolemund G, Wickham H, 2011), report (version 0.5.6; Makowski D et al., 2023), tibble (version 3.1.8; Müller K, Wickham H, 2022), RColorBrewer (version 1.1.3; Neuwirth E, 2022), patchwork (version 1.1.2; Pedersen T, 2022), gtsummary (version 1.7.0; Sjoberg D et al., 2021), skimr (version 2.1.5; Waring E et al., 2022), ggplot2 (version 3.4.1; Wickham H, 2016), stringr (version 1.5.0; Wickham H, 2022), forcats (version 1.0.0; Wickham H, 2023), tidyverse (version 2.0.0; Wickham H et al., 2019), dplyr (version 1.1.0; Wickham H et al., 2023), purrr (version 1.0.1; Wickham H, Henry L, 2023), readr (version 2.1.4; Wickham H et al., 2023), scales (version 1.2.1; Wickham H, Seidel D, 2022), tidyr (version 1.3.0; Wickham H et al., 2023), gghighlight (version 0.4.0; Yutani H, 2022) and kableExtra (version 1.3.4; Zhu H, 2021).
https://thriv.github.io/biodatasci2018/r-refresher-tidy-eda.html#about_the_data
https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
https://www.roelpeters.be/scale-ggplot-y-axis-millions-or-thousands-r/
http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/
https://www.kaggle.com/datasets/shivamb/netflix-shows/code