Introduction

Netflix, Inc. is a media services provider and production company headquartered in Los Gatos, California. It was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California. The company’s primary business is its subscription-based streaming service, which offers online streaming of a library of films and television series and is used by people around the world. In this project, I explore the Netflix dataset, which is available on Kaggle, through visualizations and graphs built with R libraries. I chose this dataset because I am interested in movies and shows, and I will be sharing this project on public movie forums with other movie lovers. Data analysis has helped attract more users to the platform, and many users spend most of their time watching shows and movies on Netflix. With this in mind, I explore the dataset to understand the trends in movies and TV shows on Netflix. I aim to answer the following questions through data analysis:

What are the respective percentages of TV shows and movies on Netflix?
What are the trends of TV shows and movies over the years?
Which 10 countries have the most content?
What are the distributions of duration of movies and the number of seasons in TV shows?

Sourced from Wikipedia

Dataset Description

This dataset is from Kaggle; the underlying data come from Netflix. Shivam Bansal created and uploaded the dataset to Kaggle, and it is in the public domain. It lists all the movies and TV shows available on Flixable, a third-party Netflix search engine. The dataset has 8,807 rows and 12 variables. It includes information about the genre, cast, director, countries where the show is available to watch, and a summary of what the show is about. The columns and their descriptions are listed below:

SHOW ID: Unique ID of each show
TYPE: Show category. Could be either a Movie or a TV Show.
TITLE: Name of the show
DIRECTOR: Name of the director(s) of the show
CAST: Names of actors/actresses in the show
COUNTRY: Countries where the show is available to watch on Netflix
DATE ADDED: Date when the show was added on Netflix
RATING: Show rating on Netflix
RELEASE YEAR: Release year of the show
DURATION: Time duration of the show
LISTED IN: Genre of the show
DESCRIPTION: Brief insight into what the show is about

Data Quality

The purpose of this section is to describe the quality of the dataset, including its missingness, strengths, and weaknesses. A number of tables and figures are presented to describe the data, illustrate the distributions of key variables, and present important results.

Strengths

The data were collected specifically about movies and series, which suits this analysis. It is a tidy dataset, meaning that each variable has its own column, each observation has its own row, and each value has its own cell, so it is appropriate for most analyses. Another strength is that the dataset has both categorical and continuous variables, so we can investigate the information by category, report each category separately, and make informative plots. The text variables are also useful: for example, the description variable can be used to find similar movies and TV shows through text similarity in further analysis. The sample size is another strength of this dataset, as it is large enough to analyze and compare groups.

Weaknesses

There is some missingness in the dataset, as seen in Table 1, which required some data cleaning. The data cleaning process involved identifying incorrect, incomplete, inaccurate, irrelevant, or missing pieces of data and modifying, replacing, or deleting them as needed. The table shows that only director, cast, country, date added, rating, and duration have missing values, and director has the most, with 2,634 cases. Country has 831 missing values and cast has 825. Date added, rating, and duration have 10, 4, and 3 missing values respectively, 17 in total. Another weakness of this dataset is data duplication. However, this could be fixed by merging duplicate entries into a single record.

Missing Values Count by Variable
Variable	Missing Values Count
Show ID	0
Type	0
Title	0
Director	2634
Cast	825
Country	831
Date Added	10
Release Year	0
Rating	4
Duration	3
Genre	0
Description	0
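
A tally like the one in Table 1 could be produced along the following lines. This is only a minimal sketch in base R: the small data frame stands in for the real file, its column names and values are made up for illustration, and treating empty strings as missing is an assumption about how gaps are stored.

```r
# Toy stand-in for the Netflix data (hypothetical values, not the real file).
netflix <- data.frame(
  show_id  = c("s1", "s2", "s3", "s4"),
  director = c("A. Director", "", NA, "B. Director"),
  country  = c("India", "United States", "", "Japan"),
  stringsAsFactors = FALSE
)

# Assumption: a value counts as missing if it is NA or an empty string.
is_missing <- is.na(netflix) | netflix == ""

# One count of missing values per column.
missing_counts <- colSums(is_missing)
print(missing_counts)
```

Counting empty strings together with NA matters when a CSV is read without converting blanks to NA; otherwise the blanks would be silently treated as valid values.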

Data Exploration

Descriptives and Demographics

I analyzed the Netflix TV show and movie data; the results are below. Table 2 shows the characteristics of the titles sampled in this study. Before starting the data analysis, we cleaned the data. First, we look at the types of content and the top 10 countries (Figure 1); patchwork was used to combine the two graphs into one figure. As Figure 1.a shows, there are more than twice as many movies as TV shows on Netflix. We use a bar chart to show the percentage of each content type: 68.4% of the content on Netflix is movies. The largest number of programs (both movies and TV shows) carry the TV-MA rating. Release year is also important, because it tells us when content on the platform was released. The visualizations use the Netflix color palette and Coolors, a web tool that lets users with no design background generate matching color schemes in seconds. These made the visualizations more readable. All data summaries and analyses were conducted in R.
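
The share of each content type can be computed directly from the type column before plotting. The sketch below uses base R on a made-up vector; the values are illustrative only and do not reproduce the 68.4% figure.

```r
# Toy stand-in for the TYPE column (hypothetical values).
type <- c("Movie", "Movie", "TV Show", "Movie", "TV Show")

# Counts per type, converted to percentages rounded to one decimal place.
type_share <- round(100 * prop.table(table(type)), 1)
print(type_share)
```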

Demographic Table
Characteristic	N = 936¹
Type
Movie	936 (100%)
Rating
TV-14	265 (28%)
TV-MA	553 (59%)
TV-PG	118 (13%)
Genre
Documentaries	289 (31%)
Dramas, International Movies	329 (35%)
Stand-Up Comedy	318 (34%)
Release Year	2017 (2015, 2019)
Duration
100 min	11 (1.2%)
101 min	12 (1.3%)
102 min	13 (1.4%)
103 min	8 (0.9%)
104 min	7 (0.7%)
105 min	10 (1.1%)
106 min	10 (1.1%)
107 min	13 (1.4%)
108 min	7 (0.7%)
109 min	2 (0.2%)
110 min	7 (0.7%)
111 min	4 (0.4%)
112 min	6 (0.6%)
113 min	8 (0.9%)
114 min	9 (1.0%)
115 min	6 (0.6%)
116 min	9 (1.0%)
117 min	7 (0.7%)
118 min	9 (1.0%)
119 min	4 (0.4%)
12 min	1 (0.1%)
120 min	5 (0.5%)
121 min	7 (0.7%)
122 min	3 (0.3%)
123 min	5 (0.5%)
124 min	4 (0.4%)
125 min	3 (0.3%)
126 min	5 (0.5%)
127 min	5 (0.5%)
128 min	2 (0.2%)
129 min	1 (0.1%)
130 min	3 (0.3%)
131 min	2 (0.2%)
132 min	6 (0.6%)
133 min	1 (0.1%)
135 min	5 (0.5%)
136 min	5 (0.5%)
137 min	2 (0.2%)
139 min	1 (0.1%)
14 min	1 (0.1%)
140 min	2 (0.2%)
141 min	3 (0.3%)
142 min	1 (0.1%)
143 min	6 (0.6%)
144 min	1 (0.1%)
145 min	1 (0.1%)
146 min	1 (0.1%)
148 min	2 (0.2%)
149 min	1 (0.1%)
15 min	1 (0.1%)
150 min	2 (0.2%)
151 min	3 (0.3%)
155 min	1 (0.1%)
156 min	1 (0.1%)
157 min	2 (0.2%)
158 min	2 (0.2%)
159 min	2 (0.2%)
160 min	1 (0.1%)
161 min	3 (0.3%)
162 min	1 (0.1%)
164 min	1 (0.1%)
165 min	1 (0.1%)
168 min	1 (0.1%)
17 min	1 (0.1%)
174 min	1 (0.1%)
180 min	1 (0.1%)
182 min	1 (0.1%)
185 min	1 (0.1%)
186 min	1 (0.1%)
20 min	1 (0.1%)
209 min	1 (0.1%)
22 min	1 (0.1%)
24 min	4 (0.4%)
25 min	1 (0.1%)
26 min	1 (0.1%)
28 min	1 (0.1%)
29 min	4 (0.4%)
30 min	3 (0.3%)
31 min	1 (0.1%)
32 min	1 (0.1%)
33 min	1 (0.1%)
34 min	1 (0.1%)
35 min	2 (0.2%)
38 min	1 (0.1%)
40 min	6 (0.6%)
41 min	2 (0.2%)
42 min	1 (0.1%)
44 min	5 (0.5%)
45 min	3 (0.3%)
46 min	3 (0.3%)
47 min	2 (0.2%)
48 min	2 (0.2%)
49 min	4 (0.4%)
50 min	7 (0.7%)
51 min	5 (0.5%)
52 min	10 (1.1%)
53 min	12 (1.3%)
54 min	13 (1.4%)
55 min	7 (0.7%)
56 min	7 (0.7%)
57 min	8 (0.9%)
58 min	10 (1.1%)
59 min	15 (1.6%)
60 min	16 (1.7%)
61 min	18 (1.9%)
62 min	16 (1.7%)
63 min	22 (2.4%)
64 min	14 (1.5%)
65 min	11 (1.2%)
66 min	20 (2.1%)
67 min	13 (1.4%)
68 min	9 (1.0%)
69 min	15 (1.6%)
70 min	14 (1.5%)
71 min	14 (1.5%)
72 min	9 (1.0%)
73 min	6 (0.6%)
74 min	9 (1.0%)
75 min	4 (0.4%)
76 min	11 (1.2%)
77 min	10 (1.1%)
78 min	13 (1.4%)
79 min	8 (0.9%)
80 min	7 (0.7%)
81 min	12 (1.3%)
82 min	11 (1.2%)
83 min	7 (0.7%)
84 min	15 (1.6%)
85 min	7 (0.7%)
86 min	13 (1.4%)
87 min	13 (1.4%)
88 min	14 (1.5%)
89 min	12 (1.3%)
90 min	17 (1.8%)
91 min	13 (1.4%)
92 min	15 (1.6%)
93 min	16 (1.7%)
94 min	14 (1.5%)
95 min	11 (1.2%)
96 min	18 (1.9%)
97 min	18 (1.9%)
98 min	11 (1.2%)
99 min	19 (2.0%)
Country
Argentina	14 (1.6%)
Argentina, Chile	1 (0.1%)
Argentina, France, United States, Germany, Qatar	1 (0.1%)
Argentina, Uruguay, Spain, France	1 (0.1%)
Australia	3 (0.3%)
Australia, United States	1 (0.1%)
Austria	2 (0.2%)
Austria, Czech Republic	1 (0.1%)
Belgium, France	1 (0.1%)
Belgium, France, Netherlands	1 (0.1%)
Brazil	13 (1.5%)
Cameroon	1 (0.1%)
Canada	8 (0.9%)
Canada, India	1 (0.1%)
Canada, India, Thailand, United States, United Arab Emirates	1 (0.1%)
Canada, Nigeria	1 (0.1%)
Canada, United States	3 (0.3%)
Canada, United States, United Kingdom	1 (0.1%)
Chile	3 (0.3%)
Chile, Peru	1 (0.1%)
Chile, Spain, Argentina, Germany	1 (0.1%)
China	1 (0.1%)
China, Germany, India, United States	1 (0.1%)
Colombia	4 (0.5%)
Croatia, Slovenia, Serbia, Montenegro	1 (0.1%)
Czech Republic, Slovakia	1 (0.1%)
Czech Republic, United States	1 (0.1%)
Denmark	3 (0.3%)
Denmark, France, Poland	1 (0.1%)
Denmark, Sweden, Israel, United States	1 (0.1%)
Denmark, United States	1 (0.1%)
Egypt	12 (1.4%)
Egypt, France	1 (0.1%)
Finland, Sweden, Norway, Latvia, Germany	1 (0.1%)
France	11 (1.3%)
France, Algeria	1 (0.1%)
France, Belgium	2 (0.2%)
France, Belgium, Luxembourg, Cambodia,	1 (0.1%)
France, Egypt	1 (0.1%)
France, Netherlands, Singapore	1 (0.1%)
France, New Zealand	1 (0.1%)
France, Norway, Lebanon, Belgium	1 (0.1%)
Germany	8 (0.9%)
Germany, Australia	1 (0.1%)
Germany, Czech Republic	1 (0.1%)
Germany, United States, Sweden	1 (0.1%)
Ghana	1 (0.1%)
Hong Kong	5 (0.6%)
Hungary	1 (0.1%)
Iceland	1 (0.1%)
India	122 (14%)
India, Australia	1 (0.1%)
India, France	1 (0.1%)
India, United Kingdom, Canada, United States	1 (0.1%)
Indonesia	13 (1.5%)
Indonesia, Netherlands	1 (0.1%)
Indonesia, South Korea, Singapore	1 (0.1%)
Indonesia, United Kingdom	1 (0.1%)
Ireland	1 (0.1%)
Ireland, United Kingdom	1 (0.1%)
Italy	5 (0.6%)
Italy, Belgium	1 (0.1%)
Italy, France	1 (0.1%)
Italy, Switzerland, Albania, Poland	1 (0.1%)
Japan	7 (0.8%)
Kenya, United States	1 (0.1%)
Lebanon	2 (0.2%)
Lebanon, United Arab Emirates	1 (0.1%)
Lebanon, United States, United Arab Emirates	1 (0.1%)
Malaysia	4 (0.5%)
Mexico	23 (2.7%)
Namibia	1 (0.1%)
Netherlands	4 (0.5%)
Netherlands, Belgium, Germany, Jordan	1 (0.1%)
Netherlands, United States	1 (0.1%)
New Zealand	1 (0.1%)
Nigeria	18 (2.1%)
Nigeria, United Kingdom	1 (0.1%)
Norway, Germany	1 (0.1%)
Pakistan	1 (0.1%)
Peru, Germany, Norway	1 (0.1%)
Philippines	3 (0.3%)
Poland	3 (0.3%)
Poland, United States	1 (0.1%)
Romania, France, Switzerland, Germany	1 (0.1%)
Russia	1 (0.1%)
Singapore	1 (0.1%)
Singapore, Japan, France	1 (0.1%)
South Africa	5 (0.6%)
South Korea	5 (0.6%)
Spain	10 (1.2%)
Spain, Mexico, France	1 (0.1%)
Spain, Switzerland	1 (0.1%)
Sweden	1 (0.1%)
Sweden, Czech Republic, United Kingdom, Denmark, Netherlands	1 (0.1%)
Sweden, United States	2 (0.2%)
Taiwan	2 (0.2%)
Thailand	1 (0.1%)
Turkey	7 (0.8%)
Turkey, France, Germany, Poland	1 (0.1%)
United Kingdom	56 (6.5%)
United Kingdom,	1 (0.1%)
United Kingdom, Germany, Canada	1 (0.1%)
United Kingdom, Hong Kong	1 (0.1%)
United Kingdom, Lithuania	1 (0.1%)
United Kingdom, United States	1 (0.1%)
United States	389 (45%)
United States,	1 (0.1%)
United States, Australia	1 (0.1%)
United States, Australia, China	1 (0.1%)
United States, Australia, South Africa, United Kingdom	1 (0.1%)
United States, Bermuda, Ecuador	1 (0.1%)
United States, Botswana	1 (0.1%)
United States, Canada	1 (0.1%)
United States, China, United Kingdom	1 (0.1%)
United States, Denmark	1 (0.1%)
United States, Japan	1 (0.1%)
United States, Mexico	1 (0.1%)
United States, Nigeria	1 (0.1%)
United States, Senegal	1 (0.1%)
United States, United Kingdom	4 (0.5%)
United States, United Kingdom, Germany	1 (0.1%)
United States, Uruguay	1 (0.1%)
United States, Venezuela	1 (0.1%)
Uruguay, Argentina, Spain	1 (0.1%)
Vietnam	1 (0.1%)
Unknown	69
¹ n (%); Median (IQR)
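
Table 2 lists duration as one row per distinct value, in text order (for example, 100 min appears before 12 min), which suggests the column was summarized as text. A possible alternative, sketched below in base R on made-up values, is to convert it to a number of minutes first so that it can be summarized with a median and IQR like release year. This is an assumption about how the column could be handled, not a description of what was done in this report.

```r
# Toy stand-in for the DURATION column of movies (hypothetical values).
duration <- c("100 min", "12 min", "99 min", NA)

# Strip the trailing " min" and convert to numeric; NA stays NA.
duration_min <- as.numeric(sub(" min$", "", duration))

# Text sorting puts "100 min" first; numeric sorting gives 12, 99, 100.
print(sort(duration))
print(sort(duration_min))

# Median and quartiles, ignoring missing values.
duration_summary <- quantile(duration_min, c(0.25, 0.5, 0.75), na.rm = TRUE)
print(duration_summary)
```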

Netflix is among the largest online movie and TV show streaming services in the world. Its service is available in many countries, including the United States, India, South Korea, Japan, and many more. We visualized the top 10 countries. Figure 1.b represents the amount of media (TV shows and movies combined) in different countries. The United States leads in the amount of content on Netflix, “Others” is in second place (this includes missing data), and India is in third place. For this plot, we use fct_reorder() to reorder factor levels by sorting along another variable, fct_lump() to lump the less frequent factor levels together into “Other”, and fct_explicit_na() to give the missing values noted earlier an explicit level. I chose bar charts because they are a good choice for showing the relationship between a numeric and a categorical variable.

Respective Percentages of TV Shows and Movies in Netflix and Number of Contents in Different countries
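
The three forcats steps described above can be sketched as follows. The country vector is made up, keeping only the two most frequent countries is for brevity (the report uses ten), and folding missing values into “Others” is an assumption based on the note that “Others” includes missing data. In forcats 1.0.0, fct_explicit_na() still works but emits a deprecation warning in favor of fct_na_value_to_level().

```r
library(forcats)

# Toy stand-in for the COUNTRY column (hypothetical values, one NA).
country <- c("United States", "United States", "United States", "United States",
             "India", "India", "Japan", "Egypt", NA)

country_f <- factor(country)

# Keep the 2 most frequent countries; lump the rest into "Others".
country_f <- fct_lump(country_f, n = 2, other_level = "Others")

# Give missing values an explicit level (here folded into "Others").
country_f <- fct_explicit_na(country_f, na_level = "Others")

# Count titles per level, then order the levels by that count for plotting.
counts <- as.data.frame(table(country = country_f))
counts$country <- fct_reorder(counts$country, counts$Freq)

print(counts)
print(levels(counts$country))
```

Reordering by count is what makes the bars in a bar chart appear sorted rather than alphabetical.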

The next analysis is on content duration. As discussed earlier, there are two types of content, movies and TV shows, and they measure duration in different units: 2,410 titles use “season” as the unit of duration, and 5,377 use “min”. I decided to process just the movies and used a histogram to show movie duration. As Figure 2 shows, the most common movie durations are 90-99 min, followed by 100-109 min, 80-89 min, and 110-119 min. Duration is a continuous variable, so we used a histogram, drawn with geom_histogram(), to visualize its distribution. We also used gghighlight() to highlight the bins with the maximum values. The red line in the histogram shows that the distribution is centered around 100 minutes for this type of content.

Movie duration
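
A histogram with 10-minute bins such as 80-89 and 90-99 could be built roughly as below. This is a sketch on made-up durations: the bin alignment, the colors, and the use of the mean for the red line are assumptions, and the gghighlight() step is left out.

```r
library(ggplot2)

# Toy stand-in for movie durations in minutes (hypothetical values).
movies <- data.frame(duration_min = c(85, 92, 95, 99, 101, 104, 118, 127))

# Tabulate the same 10-minute bins directly: [80,90), [90,100), ...
bins <- cut(movies$duration_min, breaks = seq(80, 130, by = 10), right = FALSE)
print(table(bins))

# Histogram with bins aligned to multiples of ten; the red line marks the mean
# (assumption: the report does not say which statistic its red line shows).
p <- ggplot(movies, aes(x = duration_min)) +
  geom_histogram(binwidth = 10, boundary = 0, closed = "left",
                 fill = "grey70", colour = "white") +
  geom_vline(xintercept = mean(movies$duration_min), colour = "red") +
  labs(x = "Duration (minutes)", y = "Number of movies")
```

Setting boundary = 0 with closed = "left" keeps a 99-minute movie in the 90-99 bin and a 100-minute movie in the 100-109 bin, which matches how the ranges are described in the text.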

As Figure 3 shows, the earliest release year recorded for a movie or TV show is 1925, and the latest content on the platform is from 2022. Figure 3 indicates that most of the movies were released after 2000. The two graphs also show that most titles were added to the streaming service from 2016 to 2019, with the spike in 2019, followed by 2020. Note that these are counts of titles added, not viewer numbers. The increase stopped in 2020 with the Covid-19 pandemic, likely because of a decrease in movie and TV show production.

The trends of TV Shows and Movies over the years
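
Counting titles by the year they were added requires parsing the date added column first. The sketch below uses lubridate on made-up values; the “Month day, year” text format is an assumption about how the dates are stored, and month names are parsed according to the session locale (English here).

```r
library(lubridate)

# Toy stand-in for the DATE ADDED column (hypothetical values, one NA).
date_added <- c("September 25, 2021", "January 1, 2019", "March 3, 2019", NA)

# Parse month-day-year text into dates, then pull out the year.
year_added <- year(mdy(date_added))

# Number of titles added per year; missing dates are dropped by table().
titles_per_year <- table(year_added)
print(titles_per_year)
```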

Lastly, we want to know whether there is any relationship between country and movie duration. At first, we hypothesized that movie durations are the same across countries. I selected the top 10 countries to examine the relationship between country and movie duration. Figure 4 shows the duration of movies in these 10 countries. The plot shows that movies produced in India tend to be the longest, with an average duration of 127 min. We plotted the raw points together with a summary showing the mean duration, so that both the raw data and a summary are visible.

Relationship between the Country and Duration of Movies
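
A plot of raw points with a mean summary per country could look roughly like the sketch below. The data frame is made up and uses three countries instead of ten, so the means it prints are illustrative and are not the report's results; the jitter width and colors are arbitrary choices.

```r
library(ggplot2)

# Toy stand-in for movies with a single country each (hypothetical values).
movies <- data.frame(
  country      = c("India", "India", "United States", "United States", "Japan"),
  duration_min = c(120, 140, 95, 101, 110)
)

# Mean duration per country, longest first.
mean_duration <- sort(tapply(movies$duration_min, movies$country, mean),
                      decreasing = TRUE)
print(mean_duration)

# Raw points (jittered sideways) plus one red point per country for the mean.
p <- ggplot(movies, aes(x = country, y = duration_min)) +
  geom_jitter(width = 0.1, alpha = 0.6) +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +
  labs(x = "Country", y = "Duration (minutes)")
```

Showing the raw points next to the mean makes it easier to see whether a high average comes from most movies being long or from a few very long ones.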

Conclusion

By analysing the Netflix dataset with different plots such as histograms, boxplots, and bar charts, we found that 68.4% of the content on Netflix is movies. The top 3 countries creating TV shows for Netflix are the United States, the United Kingdom, and Japan, and the top 3 countries creating movies are the United States, India, and the United Kingdom. Movies produced in India are the longest. We also found that the most common movie duration is 90 to 99 minutes, that very few movies and TV shows were released before 2000, and that the spike in titles added was in 2019. I plan to do more analysis on this dataset to find more relationships between variables.

References

Research Resources

Coding References

Analyses were conducted using the R Statistical language (version 4.2.1; R Core Team, 2022) on macOS Big Sur … 10.16, using the packages lubridate (version 1.9.2; Grolemund G, Wickham H, 2011), report (version 0.5.6; Makowski D et al., 2023), tibble (version 3.1.8; Müller K, Wickham H, 2022), RColorBrewer (version 1.1.3; Neuwirth E, 2022), patchwork (version 1.1.2; Pedersen T, 2022), gtsummary (version 1.7.0; Sjoberg D et al., 2021), skimr (version 2.1.5; Waring E et al., 2022), ggplot2 (version 3.4.1; Wickham H, 2016), stringr (version 1.5.0; Wickham H, 2022), forcats (version 1.0.0; Wickham H, 2023), tidyverse (version 2.0.0; Wickham H et al., 2019), dplyr (version 1.1.0; Wickham H et al., 2023), purrr (version 1.0.1; Wickham H, Henry L, 2023), readr (version 2.1.4; Wickham H et al., 2023), scales (version 1.2.1; Wickham H, Seidel D, 2022), tidyr (version 1.3.0; Wickham H et al., 2023), gghighlight (version 0.4.0; Yutani H, 2022) and kableExtra (version 1.3.4; Zhu H, 2021).

References

Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.
Makowski D, Lüdecke D, Patil I, Thériault R (2023). “Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption.” CRAN. https://easystats.github.io/report/.
Müller K, Wickham H (2022). tibble: Simple Data Frames. R package version 3.1.8, https://CRAN.R-project.org/package=tibble.
Neuwirth E (2022). RColorBrewer: ColorBrewer Palettes. R package version 1.1-3, https://CRAN.R-project.org/package=RColorBrewer.
Pedersen T (2022). patchwork: The Composer of Plots. R package version 1.1.2, https://CRAN.R-project.org/package=patchwork.
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Sjoberg D, Whiting K, Curry M, Lavery J, Larmarange J (2021). “Reproducible Summary Tables with the gtsummary Package.” The R Journal, 13, 570-580. https://doi.org/10.32614/RJ-2021-053.
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.0, https://CRAN.R-project.org/package=stringr.
Wickham H (2023). forcats: Tools for Working with Categorical Variables (Factors). R package version 1.0.0, https://CRAN.R-project.org/package=forcats.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.0, https://CRAN.R-project.org/package=dplyr.
Wickham H, Henry L (2023). purrr: Functional Programming Tools. R package version 1.0.1, https://CRAN.R-project.org/package=purrr.
Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.
Wickham H, Seidel D (2022). scales: Scale Functions for Visualization. R package version 1.2.1, https://CRAN.R-project.org/package=scales.
Wickham H, Vaughan D, Girlich M (2023). tidyr: Tidy Messy Data. R package version 1.3.0, https://CRAN.R-project.org/package=tidyr.
Yutani H (2022). gghighlight: Highlight Lines and Points in ‘ggplot2’. R package version 0.4.0, https://CRAN.R-project.org/package=gghighlight.
Zhu H (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.

Kaggle dataset: https://www.kaggle.com/datasets/shivamb/netflix-shows/code

Data Analysis on the Netflix Datasets

Mahshid Arastonejad

12-01-2022
