The Netflix dataset contains information about movies and TV shows available on Netflix. It includes details such as release year, ratings, duration, genres, directors, cast, and countries of production. This dataset is suitable for univariate, bivariate, and multivariate analysis, as well as sampling and demonstrating statistical concepts.
- show_id: Unique identifier for each title.
- type: Type of content (Movie or TV Show).
- title: Name of the movie or TV show.
- director: Director(s) of the title.
- cast: Main cast members.
- country: Country of production.
- date_added: Date when the title was added to Netflix.
- release_year: Original release year of the title.
- rating: Age-based rating of the content (e.g., PG, TV-MA).
- duration: Duration of the title (minutes for movies, seasons for TV shows).
- listed_in: Genre(s) or categories of the title.
- description: Brief summary of the content.

This dataset allows you to explore patterns such as the distribution of movies vs TV shows, trends in release years, ratings by type, genre popularity, and more.
The main objectives of this project are to answer the following questions:
Objective: Examine the distribution of a single categorical variable to understand its frequency and patterns.
Variable: type (Movie or TV Show)
We analyze the type variable to understand the
composition of Netflix content. By counting the number of Movies vs TV
Shows, we can determine which type of content dominates the platform. A
bar chart visualizes the frequency distribution.
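As a rough illustration of the counting step, the sketch below tallies absolute and relative frequencies for a categorical column. The `types` list is made-up toy data, not values from the Netflix dataset, and the variable names are assumptions chosen for this example.

```python
from collections import Counter

# Hypothetical toy values standing in for the `type` column (not real data).
types = ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"]

counts = Counter(types)                               # absolute frequencies
total = sum(counts.values())
shares = {k: v / total for k, v in counts.items()}    # relative frequencies

# Print categories from most to least frequent, as a bar chart would order them.
for label, n in counts.most_common():
    print(f"{label:8s} {n:3d}  {shares[label]:.1%}")
```

The same counts are what a bar chart of type would display as bar heights.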
Objective: Examine numerical variables to understand central tendency, spread, distribution, and demonstrate the Central Limit Theorem (CLT) using random sampling.
Variables:
- release_year → overview of release trends
- duration_num → detailed distribution and CLT
demonstration
| count | mean_duration | median_duration | sd_duration | min_duration | max_duration |
|---|---|---|---|---|---|
| 8804 | 93.45116 | 92 | 46.73873 | 3 | 765 |
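A summary like the one in the table can be sketched as follows. This is a minimal example under stated assumptions: the input values are invented, missing values are assumed to be dropped before summarizing, and the standard deviation is assumed to be the sample (n - 1) version. The `summarize` helper is hypothetical; only the output keys mirror the table's column names.

```python
import statistics

# Hypothetical durations; None marks a missing value (not real data).
duration_num = [90, 104, None, 45, 120, 88, 97, None, 132, 61]

def summarize(values):
    """Return count, mean, median, sample SD, min and max, ignoring missing values."""
    clean = [v for v in values if v is not None]
    return {
        "count": len(clean),
        "mean_duration": statistics.mean(clean),
        "median_duration": statistics.median(clean),
        "sd_duration": statistics.stdev(clean),  # sample SD (n - 1 denominator)
        "min_duration": min(clean),
        "max_duration": max(clean),
    }

summary = summarize(duration_num)
print(summary)
```

Comparing the mean with the median gives a quick hint about skew: a mean noticeably above the median suggests a long right tail.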
Objective: Explore relationships between two variables to understand patterns or differences.
Variables: type (categorical) vs
release_year (numerical)
We analyze the relationship between type and
release_year using a box plot. This shows how release years
are distributed for Movies vs TV Shows, allowing us to observe
differences in trends between the two content types over time.
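The numbers behind a grouped box plot can be sketched without any plotting library. The example below groups made-up (type, release_year) pairs and computes the quartiles that define each box, plus the minimum and maximum for reference. The data and the `five_number_summary` helper are assumptions for illustration; plotting libraries may use a different quartile method and usually draw whiskers with a 1.5 × IQR rule rather than at the extremes.

```python
import statistics
from collections import defaultdict

# Hypothetical (type, release_year) pairs, not real data.
rows = [
    ("Movie", 2010), ("Movie", 2015), ("Movie", 2017), ("Movie", 2018), ("Movie", 2019),
    ("TV Show", 2016), ("TV Show", 2018), ("TV Show", 2019), ("TV Show", 2020), ("TV Show", 2021),
]

# Group release years by content type.
by_type = defaultdict(list)
for content_type, year in rows:
    by_type[content_type].append(year)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

box_stats = {t: five_number_summary(years) for t, years in by_type.items()}
for t, stats in box_stats.items():
    print(t, stats)
```

Putting the two summaries side by side shows at a glance whether one type skews toward more recent release years.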
Objective: Examine interactions between multiple variables simultaneously to uncover complex patterns.
Variables: release_year (numerical),
rating (categorical), type (categorical)
We explore how release_year and rating
interact across content type. A scatter plot colored by
type helps visualize the distribution of ratings over the
years for Movies and TV Shows, highlighting patterns in ratings across
different types of content.
In this section, we examine the distribution of the
duration_num variable (duration of Netflix shows in
minutes) and demonstrate the Central Limit Theorem (CLT) using random
sampling.
Distribution of Duration
First, we visualize the distribution of the original duration data to
understand its shape and spread. This helps identify if the data is
skewed, uniform, or approximately normal.
Sampling & CLT
We then draw 1000 random samples of size 30 each from the
duration_num data, calculate the mean of each sample, and
plot the distribution of these sample means. According to the Central
Limit Theorem, for a population with finite variance, the distribution
of sample means approaches a normal distribution as the sample size
grows, regardless of the shape of the original data.
This analysis demonstrates how the CLT allows us to make inferences about the population mean using sample means.
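The sampling procedure described above can be sketched as a small simulation. The population here is synthetic (an exponential distribution with mean 90, chosen only because it is clearly skewed) and stands in for duration_num. Sampling without replacement and the fixed seed are assumptions, while the 1000 samples of size 30 come from the text.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed population standing in for duration_num (not real data).
population = [random.expovariate(1 / 90) for _ in range(8000)]

N_SAMPLES, SAMPLE_SIZE = 1000, 30  # values taken from the text

# Draw each sample without replacement (an assumption) and record its mean.
sample_means = [
    statistics.mean(random.sample(population, SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

print(f"population mean      : {pop_mean:.2f}")
print(f"mean of sample means : {statistics.mean(sample_means):.2f}")
print(f"SD of sample means   : {statistics.stdev(sample_means):.2f}")
print(f"sigma / sqrt(n)      : {pop_sd / SAMPLE_SIZE ** 0.5:.2f}")
```

If the CLT holds for the data, the mean of the sample means should sit close to the population mean, and their spread should be close to the population standard deviation divided by the square root of the sample size. A histogram of `sample_means` would then look roughly bell-shaped even though the population is skewed.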
In this section, we demonstrate three different sampling techniques
applied to the Netflix dataset — Random Sampling,
Stratified Sampling, and Systematic
Sampling.
Sampling allows us to analyze subsets of data efficiently when working
with large datasets, while still drawing conclusions that reflect the
population trends.
Description:
Random sampling selects a fixed number of records completely at random
from the dataset. Each observation has an equal chance of being chosen,
making it an unbiased technique. However, it may not always represent
all categories if the sample size is small.
| type | title | release_year |
|---|---|---|
| Movie | Uncertain Glory | 2017 |
| Movie | Guest House | 2020 |
| TV Show | Flavorful Origins | 2020 |
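A simple random sample of this kind can be sketched as below. The records are placeholders generated for the example, not rows from the dataset. The sample size of 3 and the seed are assumptions chosen to mirror the three rows shown.

```python
import random

# Hypothetical records standing in for dataset rows (not real data).
records = [
    {"type": "Movie" if i % 3 else "TV Show", "title": f"Title {i}", "release_year": 2000 + i % 22}
    for i in range(100)
]

rng = random.Random(123)            # seeded generator for reproducibility
sample = rng.sample(records, k=3)   # 3 distinct rows, each equally likely to be picked

for row in sample:
    print(row["type"], row["title"], row["release_year"])
```

With a sample this small, it is entirely possible to draw only Movies or only TV Shows, which is the limitation noted in the description.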
Description:
Stratified sampling divides the dataset into groups (or strata) based on
a categorical variable — here, type (Movie or TV
Show).
Then, a fixed fraction is sampled from each stratum to maintain
proportional representation.
This ensures that all groups are represented in the sample, reducing
bias in datasets with uneven category sizes.
| type | title | release_year | rating |
|---|---|---|---|
| Movie | Dedemin Fisi | 2016 | TV-14 |
| Movie | The Age of Shadows | 2016 | TV-MA |
| Movie | Òlòtūré | 2020 | TV-MA |
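One way to sketch stratified sampling is to split the rows by the stratum key and sample the same fraction from each group. The records, the 10% fraction, and the `stratified_sample` helper below are all assumptions made for illustration, not the method used to produce the table.

```python
import random
from collections import defaultdict

# Hypothetical records: 70 movies and 30 TV shows (not real data).
records = (
    [{"type": "Movie", "title": f"Movie {i}"} for i in range(70)]
    + [{"type": "TV Show", "title": f"Show {i}"} for i in range(30)]
)

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample

sample = stratified_sample(records, key="type", fraction=0.1)
print(sum(r["type"] == "Movie" for r in sample), "movies,",
      sum(r["type"] == "TV Show" for r in sample), "TV shows")
```

Because each stratum contributes the same fraction, the 70/30 split of the toy population carries over to the sample.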
Description:
Systematic sampling selects data points at regular intervals (for
example, every 10th row).
It is simple to perform and ensures even coverage of the dataset.
However, if the data has a hidden pattern or periodicity, this method
may introduce bias and fail to represent the dataset accurately.
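Systematic sampling can be sketched with list slicing. The random starting offset is a common refinement and an assumption here, since the text only specifies a fixed interval. The `systematic_sample` helper and the row numbers are hypothetical.

```python
import random

def systematic_sample(rows, step, seed=None):
    """Take every `step`-th row, starting from a random offset in [0, step)."""
    start = random.Random(seed).randrange(step)
    return rows[start::step]

rows = list(range(1, 101))   # hypothetical row numbers 1..100
sample = systematic_sample(rows, step=10, seed=7)
print(sample)
```

If the rows were ordered so that some property repeated every 10 rows, every selected row would share that property, which is the periodicity risk mentioned above.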
Random Sampling:
Simple and unbiased, but might miss less frequent categories if they
occur rarely in the dataset.
Stratified Sampling:
Preserves proportions across categories (e.g., Movies vs TV Shows),
providing better representation for each group — especially useful when
categories are unevenly distributed.
Systematic Sampling:
Easy to perform and ensures regular selection across the dataset.
However, it may miss important patterns if the data has underlying
cycles or periodic trends.
Overall Conclusion:
All three sampling techniques are valuable depending on the analysis
goal:
- When categories are balanced, random sampling is
generally sufficient.
- For datasets with uneven category sizes, stratified
sampling offers the most reliable representation.
- When the row order has no hidden pattern or periodicity, systematic
sampling gives simple, evenly spaced coverage.
Description:
In this section, we apply data wrangling techniques such as
grouping, summarization, and arrangement to analyze trends in the
dataset.
Specifically, we calculate the average duration of
titles by country and type (Movies or TV
Shows).
This helps identify which countries produce longer or shorter content on
average and how many titles each contributes.
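The group, summarize, and arrange steps can be sketched as follows. The rows and country labels are placeholders, and skipping rows with a missing country or duration is an assumption. The output field names `avg_duration` and `n_titles` are also hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows (not real data). TV Show durations are in seasons, movies in minutes.
rows = [
    {"country": "Country A", "type": "Movie", "duration_num": 130},
    {"country": "Country A", "type": "Movie", "duration_num": 150},
    {"country": "Country B", "type": "Movie", "duration_num": 95},
    {"country": "Country B", "type": "Movie", "duration_num": 105},
    {"country": "Country B", "type": "TV Show", "duration_num": 2},
    {"country": None, "type": "Movie", "duration_num": 80},  # missing country: skipped
]

# Group: collect durations per (country, type) pair.
groups = defaultdict(list)
for r in rows:
    if r["country"] is not None and r["duration_num"] is not None:
        groups[(r["country"], r["type"])].append(r["duration_num"])

# Summarize: average duration and number of titles per group.
summary = [
    {"country": c, "type": t, "avg_duration": mean(v), "n_titles": len(v)}
    for (c, t), v in groups.items()
]

# Arrange: longest average duration first.
summary.sort(key=lambda s: s["avg_duration"], reverse=True)

for s in summary:
    print(s)
```

Grouping by type as well as country keeps minutes and seasons in separate groups, so the two units are never averaged together.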
- release_year is right-skewed; type and rating show clear audience patterns.
- [...] type.
- Sampling from duration_num confirms an approximately normal sampling distribution, illustrating the CLT.
- Values of duration are inconsistent (minutes vs seasons), requiring data transformation before statistical comparison.
- Genre (listed_in) is a multi-valued categorical field, limiting certain analyses.
- [...] description using text mining.

Below is an optional interactive Plotly visualization summarizing average duration by country and type.
This analysis of the Netflix Movies & TV Shows dataset demonstrates key data analysis techniques, including univariate, bivariate, and multivariate exploration, data wrangling, sampling, and interactive visualization. The findings highlight trends in content type, release year, country distribution, and movie duration, while also confirming statistical concepts like the Central Limit Theorem. Overall, the project provides a complete and clear demonstration of a data analysis workflow from import and cleaning to interpretation and visualization.