Statistical Analysis and Visualization of Netflix

1 Dataset Description

The Netflix dataset contains information about movies and TV shows available on Netflix. It includes details such as release year, ratings, duration, genres, directors, cast, and countries of production. This dataset is suitable for univariate, bivariate, and multivariate analysis, as well as sampling and demonstrating statistical concepts.

1.1 Key Columns

show_id : Unique identifier for each title.
type : Type of content (Movie or TV Show).
title : Name of the movie or TV show.
director : Director(s) of the title.
cast : Main cast members.
country : Country of production.
date_added : Date when the title was added to Netflix.
release_year : Original release year of the title.
rating : Age-based rating of the content (e.g., PG, TV-MA).
duration : Duration of the title (minutes for movies, seasons for TV shows).
listed_in : Genre(s) or categories of the title.
description : Brief summary of the content.

This dataset allows you to explore patterns such as the distribution of movies vs TV shows, trends in release years, ratings by type, genre popularity, and more.

2 Objectives

The main objectives of this project are to answer the following questions:

Which types of content (Movies vs TV Shows) dominate Netflix’s catalog, and how has this changed over the years?
Are there patterns in release years, ratings, or durations across different types of content?
What genres are the most and least common on Netflix?
How do ratings vary by type, genre, or country of production?
Are there relationships between release year, rating, and type of content?
How complete is the dataset, and how should missing values be handled?
Can sampling or statistical methods, such as the Central Limit Theorem, be demonstrated using Netflix data?

3 Univariate Analysis (Categorical Variable)

Objective: Examine the distribution of a single categorical variable to understand its frequency and patterns.

Variable: type (Movie or TV Show)

We analyze the type variable to understand the composition of Netflix content. By counting the number of Movies vs TV Shows, we can determine which type of content dominates the platform. A bar chart visualizes the frequency distribution.
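The counting step behind the bar chart can be sketched as follows. This is a minimal illustration using only the Python standard library; the rows are hypothetical placeholders, not values from the dataset.

```python
from collections import Counter

# Hypothetical toy rows standing in for the real dataset (values are made up).
rows = [
    {"title": "A", "type": "Movie"},
    {"title": "B", "type": "TV Show"},
    {"title": "C", "type": "Movie"},
    {"title": "D", "type": "Movie"},
]

# Count how many titles fall into each category of `type`.
type_counts = Counter(row["type"] for row in rows)
total = sum(type_counts.values())

# Relative frequencies; the bar heights would encode either counts or shares.
type_share = {label: count / total for label, count in type_counts.items()}

for content_type, count in type_counts.most_common():
    print(f"{content_type}: {count} ({type_share[content_type]:.0%})")
```

Reporting shares alongside raw counts makes the comparison independent of catalog size.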

4 Univariate Analysis (Numerical Variables)

Objective: Examine numerical variables to understand central tendency, spread, distribution, and demonstrate the Central Limit Theorem (CLT) using random sampling.

Variables:
- release_year → overview of release trends
- duration_num (the numeric value extracted from duration) → detailed distribution and CLT demonstration

Summary Statistics of Duration (Minutes)
count  mean_duration  median_duration  sd_duration  min_duration  max_duration
8804   93.45116       92               46.73873     3             765
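A summary like the one above could be produced along these lines. This is a sketch under assumptions: the helper `parse_duration` and the sample values are hypothetical, and restricting the calculation to movies is one possible choice (the report does not state how TV-show durations were treated).

```python
import re
import statistics

def parse_duration(text):
    """Return the leading integer of a string such as '90 min', or None."""
    match = re.match(r"\s*(\d+)", text or "")
    return int(match.group(1)) if match else None

# Hypothetical (type, duration) pairs; None marks a missing entry.
raw = [
    ("Movie", "90 min"),
    ("Movie", "125 min"),
    ("TV Show", "2 Seasons"),
    ("Movie", None),
    ("Movie", "47 min"),
]

# Assumption: keep movies only, so every value is measured in minutes.
movie_minutes = [
    parse_duration(duration)
    for content_type, duration in raw
    if content_type == "Movie" and parse_duration(duration) is not None
]

summary = {
    "count": len(movie_minutes),
    "mean_duration": statistics.mean(movie_minutes),
    "median_duration": statistics.median(movie_minutes),
    "sd_duration": statistics.stdev(movie_minutes),  # sample standard deviation
    "min_duration": min(movie_minutes),
    "max_duration": max(movie_minutes),
}
print(summary)
```

Without the filter, season counts would be averaged together with minutes.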

5 Bivariate Analysis (Two Variables)

Objective: Explore relationships between two variables to understand patterns or differences.

Variables: type (categorical) vs release_year (numerical)

We analyze the relationship between type and release_year using a box plot. This shows how release years are distributed for Movies vs TV Shows, allowing us to observe differences in trends between the two content types over time.
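The quantities a box plot draws (minimum, quartiles, maximum) can be computed per content type as in the sketch below. The rows are hypothetical, and the "inclusive" quartile method is an assumption; plotting libraries may use slightly different quartile conventions.

```python
from collections import defaultdict
import statistics

# Hypothetical (type, release_year) pairs, made up for illustration.
rows = [
    ("Movie", 2001), ("Movie", 2010), ("Movie", 2015),
    ("Movie", 2018), ("Movie", 2019), ("Movie", 2020),
    ("TV Show", 2012), ("TV Show", 2016), ("TV Show", 2018),
    ("TV Show", 2019), ("TV Show", 2020), ("TV Show", 2021),
]

# Group release years by content type.
years_by_type = defaultdict(list)
for content_type, year in rows:
    years_by_type[content_type].append(year)

def five_number_summary(values):
    """Return the five values a box plot is built from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

box_stats = {t: five_number_summary(y) for t, y in years_by_type.items()}
for content_type, stats in box_stats.items():
    print(content_type, stats)
```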

6 Multivariate Analysis (More than Two Variables)

Objective: Examine interactions between multiple variables simultaneously to uncover complex patterns.

Variables: release_year (numerical), rating (categorical), type (categorical)

We explore how release_year and rating interact across content type. A scatter plot colored by type helps visualize the distribution of ratings over the years for Movies and TV Shows, highlighting patterns in ratings across different types of content.
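A numeric companion to the scatter plot is a grouped summary over type and rating, sketched below. The rows are hypothetical, and the choice of the median release year as the summary statistic is an assumption made for illustration.

```python
from collections import defaultdict
import statistics

# Hypothetical toy rows; the values are made up for illustration.
rows = [
    {"type": "Movie", "rating": "TV-MA", "release_year": 2018},
    {"type": "Movie", "rating": "TV-MA", "release_year": 2020},
    {"type": "Movie", "rating": "PG", "release_year": 2005},
    {"type": "TV Show", "rating": "TV-MA", "release_year": 2019},
    {"type": "TV Show", "rating": "TV-14", "release_year": 2016},
    {"type": "TV Show", "rating": "TV-14", "release_year": 2021},
]

# Collect release years for every (type, rating) combination.
years_by_group = defaultdict(list)
for row in rows:
    years_by_group[(row["type"], row["rating"])].append(row["release_year"])

# Summarize each combination by its size and median release year.
summary = {
    key: {"n": len(years), "median_year": statistics.median(years)}
    for key, years in years_by_group.items()
}

for (content_type, rating), stats in sorted(summary.items()):
    print(content_type, rating, stats)
```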

7 Distribution & Central Limit Theorem (CLT)

In this section, we examine the distribution of the duration_num variable (duration of Netflix titles in minutes) and demonstrate the Central Limit Theorem (CLT) using random sampling.

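Distribution of Duration
First, we visualize the distribution of the original duration data to understand its shape and spread. This helps identify whether the data is skewed, uniform, or approximately normal.

Sampling & CLT
We then draw 1000 random samples of size 30 each from the duration_num data, calculate the mean of each sample, and plot the distribution of these sample means. According to the Central Limit Theorem, if the population has finite variance, the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the original data. The sample means should center on the population mean, with a standard deviation of roughly σ/√n, where σ is the population standard deviation and n = 30 is the sample size. A sample size of 30 is a common rule of thumb, although heavily skewed data may need larger samples.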

This analysis demonstrates how the CLT allows us to make inferences about the population mean using sample means.
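The sampling procedure described above (1000 samples of size 30) can be sketched as follows. The population here is a hypothetical right-skewed stand-in generated from an exponential distribution, not the actual duration_num values, and the seed is arbitrary.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed "population" standing in for duration_num.
population = [rng.expovariate(1 / 90) for _ in range(8000)]

n_samples, sample_size = 1000, 30

# Draw each sample without replacement and record its mean.
sample_means = [
    statistics.mean(rng.sample(population, sample_size))
    for _ in range(n_samples)
]

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)
expected_se = pop_sd / sample_size ** 0.5  # sigma / sqrt(n)

print("population mean:       ", round(pop_mean, 2))
print("mean of sample means:  ", round(statistics.mean(sample_means), 2))
print("sd of sample means:    ", round(statistics.stdev(sample_means), 2))
print("sigma / sqrt(n):       ", round(expected_se, 2))
```

A histogram of `sample_means` should look far more symmetric than one of `population`, and its spread should be close to the printed σ/√n value.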

8 Interactive Plotly Histogram

9 Sampling Methods

In this section, we demonstrate three different sampling techniques applied to the Netflix dataset — Random Sampling, Stratified Sampling, and Systematic Sampling.
Sampling allows us to analyze subsets of data efficiently when working with large datasets, while still drawing conclusions that reflect the population trends.

9.1 Random Sampling

Description:
Random sampling selects a fixed number of records completely at random from the dataset. Each observation has an equal chance of being chosen, making it an unbiased technique. However, it may not always represent all categories if the sample size is small.

Random Sample (Preview: 3 Records, Key Columns)
type	title	release_year
Movie	Uncertain Glory	2017
Movie	Guest House	2020
TV Show	Flavorful Origins	2020
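A simple random sample of three records can be drawn as in this sketch. The rows and the seed are hypothetical; `random.sample` draws without replacement, so no record appears twice.

```python
import random

rng = random.Random(7)  # arbitrary seed, used only for reproducibility

# Hypothetical rows standing in for the dataset.
rows = [
    {"show_id": f"s{i}", "type": "Movie" if i % 3 else "TV Show"}
    for i in range(1, 101)
]

# Every row has the same chance of being selected.
sample = rng.sample(rows, k=3)

for row in sample:
    print(row["show_id"], row["type"])
```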

9.2 Stratified Sampling by Type

Description:
Stratified sampling divides the dataset into groups (or strata) based on a categorical variable — here, type (Movie or TV Show).
Then, a fixed fraction is sampled from each stratum to maintain proportional representation.
This ensures that all groups are represented in the sample, reducing bias in datasets with uneven category sizes.

Stratified Sample by Type (10%) - Preview of Key Columns
type	title	release_year	rating
Movie	Dedemin Fisi	2016	TV-14
Movie	The Age of Shadows	2016	TV-MA
Movie	Òlòtūré	2020	TV-MA
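The 10% stratified draw can be sketched as below. The function `stratified_sample` is a hypothetical helper, the rows are made up, and keeping at least one row per stratum is an assumption added so that very small groups are not dropped.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction of rows from every stratum defined by `key`."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one row per stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(7)  # arbitrary seed for reproducibility

# Hypothetical dataset with 70 movies and 30 TV shows.
rows = (
    [{"show_id": f"m{i}", "type": "Movie"} for i in range(70)]
    + [{"show_id": f"t{i}", "type": "TV Show"} for i in range(30)]
)

sample = stratified_sample(rows, key="type", fraction=0.10, rng=rng)
sample_counts = Counter(row["type"] for row in sample)
print(sample_counts)
```

The sample keeps the 70/30 split of the toy population, which a small simple random sample would not guarantee.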

9.3 Systematic Sampling

Description:
Systematic sampling selects data points at regular intervals (for example, every 10th row).
It is simple to perform and ensures even coverage of the dataset.
However, if the data has a hidden pattern or periodicity, this method may introduce bias and fail to represent the dataset accurately.
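Selecting every 10th row can be sketched as follows. The helper `systematic_sample` is hypothetical, and choosing a random starting offset is a common convention assumed here rather than something the report specifies.

```python
import random

def systematic_sample(rows, step, rng):
    """Take every `step`-th row, starting from a random offset below `step`."""
    start = rng.randrange(step)
    return rows[start::step]

rng = random.Random(7)  # arbitrary seed for reproducibility

# Hypothetical dataset: row indices 0..99 stand in for the records.
rows = list(range(100))

sample = systematic_sample(rows, step=10, rng=rng)
print(sample)
```

If the rows were sorted in a way that repeats every 10 records, every selected row would come from the same position in the cycle, which is the bias described above.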

9.4 Summary of Sampling Techniques

Random Sampling:
Simple and unbiased, but a small sample may miss rare categories.
Stratified Sampling:
Preserves proportions across categories (e.g., Movies vs TV Shows), providing better representation for each group — especially useful when categories are unevenly distributed.
Systematic Sampling:
Easy to perform and ensures regular selection across the dataset. However, it can produce a biased sample if the sampling interval lines up with a cycle or periodic trend in the row order.

Overall Conclusion:
All three sampling techniques are valuable depending on the analysis goal:
- When categories are balanced, random sampling is generally sufficient.
- For datasets with uneven category sizes, stratified sampling offers the most reliable representation.
- When the rows have no periodic ordering, systematic sampling is a simple way to spread the sample across the dataset.

10 Data Wrangling and Summary Statistics

Description:
In this section, we apply data wrangling techniques such as grouping, summarization, and arrangement to analyze trends in the dataset.
Specifically, we calculate the average duration of titles by country and type (Movies or TV Shows).
This helps identify which countries produce longer or shorter content on average and how many titles each contributes.
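The group, summarize, and arrange steps can be sketched as below. The rows are hypothetical, and sorting by type and then by descending average is one possible arrangement, not necessarily the one used in the report.

```python
from collections import defaultdict
import statistics

# Hypothetical (country, type, duration_num) rows; values are made up.
rows = [
    ("United States", "Movie", 100),
    ("United States", "Movie", 120),
    ("India", "Movie", 150),
    ("India", "Movie", 130),
    ("United Kingdom", "Movie", 90),
    ("United States", "TV Show", 2),
    ("United Kingdom", "TV Show", 3),
    ("United Kingdom", "TV Show", 1),
]

# Group durations by (country, type).
groups = defaultdict(list)
for country, content_type, duration_num in rows:
    groups[(country, content_type)].append(duration_num)

# Summarize each group: number of titles and average duration.
summary = [
    {"country": country, "type": content_type,
     "n_titles": len(values), "avg_duration": statistics.mean(values)}
    for (country, content_type), values in groups.items()
]

# Arrange: by type, then longest average duration first.
summary.sort(key=lambda r: (r["type"], -r["avg_duration"]))

for record in summary:
    print(record)
```

Keeping `type` in the grouping key prevents minutes and seasons from being averaged together.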

11 Findings

11.1 Summary of Insights

Content Composition: Movies dominate the catalog, while TV Shows form a smaller but notable portion.
Release Trends: Most titles were released after 2010, showing rapid content growth.
Country Distribution: USA contributes the most titles, followed by India and the UK.
Duration and Type: Movies range from under 60 to over 180 minutes; TV Show durations are recorded as a number of seasons rather than minutes.
Univariate Insights: release_year is left-skewed (most titles are recent, with a long tail of older releases); type and rating show clear audience patterns.
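Bivariate/Multivariate: The box plot of release_year by type compares release-year distributions for Movies and TV Shows; scatter plots show clustering by type. Durations are not directly comparable across types because movies are measured in minutes and TV shows in seasons.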
CLT Verification: The means of 1000 random samples of size 30 from duration_num are approximately normally distributed, illustrating the CLT.
Sampling Methods: Random: unbiased; Stratified: preserves proportions; Systematic: simple, may introduce bias.
Data Wrangling: Average duration per country and type highlights regional differences.
Overall: Netflix shows strong content diversity and growth; sampling and CLT concepts are well-demonstrated.

11.2 Limitations and Future Work

Some variables like duration are inconsistent (minutes vs seasons), requiring data transformation before statistical comparison.
Genre (listed_in) is a multi-valued categorical field, so each entry must be split into individual genres before genre frequencies can be counted.
Future work could include:
- Sentiment analysis on description using text mining.
- Trend forecasting on Netflix releases over time.
- Correlation between genre, rating, and content popularity.
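The listed_in limitation noted above can be worked around by splitting each entry before counting. The sketch below assumes the genres in a single entry are separated by commas; the sample strings are hypothetical.

```python
from collections import Counter

# Hypothetical listed_in values; an empty string stands for a missing entry.
listed_in = [
    "Dramas, International Movies",
    "Comedies",
    "Dramas, Comedies",
    "",
]

# Assumption: genres within one entry are comma-separated.
genre_counts = Counter(
    genre.strip()
    for value in listed_in
    for genre in value.split(",")
    if genre.strip()
)

for genre, count in genre_counts.most_common():
    print(genre, count)
```

After splitting, a title with several genres contributes to each of them, so the counts sum to more than the number of titles.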

11.3 Visualization Summary

Below is an optional interactive Plotly visualization summarizing average duration by country and type.

12 Conclusion

This analysis of the Netflix Movies & TV Shows dataset demonstrates key data analysis techniques, including univariate, bivariate, and multivariate exploration, data wrangling, sampling, and interactive visualization. The findings highlight trends in content type, release year, country distribution, and movie duration, while also illustrating statistical concepts like the Central Limit Theorem. Overall, the project provides a complete and clear demonstration of a data analysis workflow from import and cleaning to interpretation and visualization.