Introduction

In this data dive, we investigate individual rows and groups of data using probability and anomaly detection. We create several groupings of the dataset, calculate the probability that a randomly selected row falls in each group, and identify the smallest (rarest) groups that might be considered anomalies. For each grouping, we draw conclusions and provide a testable hypothesis about why certain groups are rarer than others.

Grouping and Investigating Probabilities

We will group the dataset by three different categorical columns, compute probabilities for each group, and investigate the rarest groups. The goal is to identify which groups might represent anomalies based on their low probabilities.

# Load necessary libraries
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# Load the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the dataset
head(tv_data)

Grouping by “original_language”

We’ll first group the data by original_language, calculate the number of shows per language, and assign probabilities.

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
# Group by original_language and calculate the number of shows per language
language_group <- tv_data |>
  group_by(original_language) |>
  summarize(show_count = n(), 
            avg_rating = mean(vote_average, na.rm = TRUE)) |>
  mutate(probability = show_count / sum(show_count)) |>
  arrange(desc(probability))

# Tagging the smallest group as "Rare Group"
language_group <- language_group |>
  mutate(tag = ifelse(probability == min(probability), "Rare Group", "Common Group"))

# Format the probability column as percentages and round to 4 decimal places
language_group_with_probability_percentage <- language_group |>
  mutate(probability = percent(probability, accuracy = 0.0001))

# View the result
language_group_with_probability_percentage

Interpretation:

From the table above, we observe that the majority of TV shows in this dataset are in English, followed by other languages like Chinese and Japanese. These languages account for a large share of the dataset, while languages such as Uzbek and Zhuang have significantly fewer shows. A likely explanation is that the dataset covers TV shows worldwide and English-language shows dominate the international market.

The smallest group, tagged “Rare Group,” is the language with the fewest shows in the dataset (or several languages, if they tie for the minimum). The probability of randomly selecting a show from this group is therefore the lowest in the table.

Visualization:

# Plotting the probability distribution (numeric form is used for sorting)
ggplot(language_group, aes(x = reorder(original_language, -probability), y = probability, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Probability Distribution of TV Shows by Original Language", 
       x = "Original Language", 
       y = "Probability") +
  scale_y_continuous(labels = percent_format(accuracy = 0.01)) +  # Show y-axis as percentages
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

### Visualization: Probability Distribution of the Top 30 Original Languages
top_30_languages <- language_group |> slice_max(order_by = show_count, n = 30)

ggplot(top_30_languages, aes(x = reorder(original_language, -probability), y = probability, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Probability Distribution of TV Shows for the Top 30 Original Languages",
       x = "Original Language", 
       y = "Probability") +
  scale_y_continuous(labels = percent_format(accuracy = 0.01)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Insights:

  • The Frequentist perspective is reflected in how we calculate the probability of selecting a TV show in each language based on relative frequency.

  • The Law of Large Numbers suggests that as the number of sampled shows increases, these relative frequencies will converge to the true underlying probabilities for each language.
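The convergence described above can be illustrated with a small simulation. This is only a sketch: the three language shares in `true_probs` are made-up values, not figures from the TMDB dataset, and `sample_freq()` and `max_gap()` are hypothetical helper names introduced for the illustration.

```r
# Toy illustration of the Law of Large Numbers.
# NOTE: these "true" shares are assumed values, not taken from tv_data.
set.seed(42)
true_probs <- c(en = 0.6, ja = 0.3, uz = 0.1)

# Relative frequency of each language in a random sample of n shows
sample_freq <- function(n) {
  draws  <- sample(names(true_probs), size = n, replace = TRUE, prob = true_probs)
  counts <- table(factor(draws, levels = names(true_probs)))
  setNames(as.numeric(counts) / n, names(true_probs))
}

# Largest absolute gap between the sample frequencies and the assumed shares
max_gap <- function(n) max(abs(sample_freq(n) - true_probs))

# The gap tends to shrink as the sample size grows
sapply(c(100, 10000, 100000), max_gap)
```

With larger samples the gap generally gets smaller, which is the behavior the Law of Large Numbers describes for relative frequencies.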

Visual Interpretation:

The bar chart above visualizes the probability distribution of shows based on their original language. The “Rare Group” is represented by the smallest bar, indicating that shows in this language are much less common compared to the “Common Group”.

This trend suggests that while English and a few other languages dominate the TV show market, more niche or region-specific languages are still represented and may cater to smaller but dedicated audiences.

Further investigation could explore whether this reflects cultural preferences or the accessibility of non-English shows in the global market.

Overall Learning:

The probability distribution of TV shows by their original language reveals that certain languages are highly represented, while others are rare. The language tagged as the “Rare Group” represents the language with the lowest probability. This could indicate that TV shows in this language are niche or are produced less frequently due to market demand.

  • Testable Hypothesis: The underrepresentation of TV shows in this language might be due to a smaller audience, fewer production resources, or less global appeal compared to more common languages like English.

Grouping by “genres”

We now group the data by genre to investigate the distribution of TV shows by genre type and their corresponding probabilities.

# Group by genres and calculate the number of shows in each genre
genre_group <- tv_data |>
  group_by(genres) |>
  summarize(show_count = n(), 
            avg_episodes = mean(number_of_episodes, na.rm = TRUE)) |>
  mutate(probability = show_count / sum(show_count)) |>
  arrange(desc(probability))

# Tagging the smallest group(s) as "Rare Group"
genre_group <- genre_group |>
  mutate(tag = ifelse(near(probability, min(probability)), "Rare Group", "Common Group"))

# Format the probability column as percentages and round to 4 decimal places
genre_group_with_probability_percentage <- genre_group |>
  mutate(probability = percent(probability, accuracy = 0.0001))

# View the result
genre_group_with_probability_percentage

Interpretation:

From the data, we find that genres like Documentary, Drama and Comedy are the most common, while long multi-genre combinations like ‘Sci-Fi & Fantasy, Action & Adventure, Animation, Drama’ are rarer. Because the genres column stores each show’s full genre list as one string, every distinct combination forms its own group. The dominance of Documentary in the dataset suggests its broad appeal, while more niche combinations may cater to smaller, specific audiences.

A hypothesis could be that broad single genres (like Drama or Documentary) apply to many shows, while highly specific multi-genre combinations describe only a few, leading to their lower representation in the dataset.

ggplot(genre_group, aes(x = reorder(genres, -probability), y = probability, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Probability Distribution of TV Shows by Genre", 
       x = "Genres", 
       y = "Probability") +
  scale_y_continuous(labels = percent_format(accuracy = 0.01)) +  # Show y-axis as percentages
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

### Visualization: Probability Distribution of the Top 30 Genres
top_30_genres <- genre_group |> slice_max(order_by = show_count, n = 30)

ggplot(top_30_genres, aes(x = reorder(genres, -probability), y = probability, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Probability Distribution of TV Shows for the Top 30 Genres",
       x = "Genres", 
       y = "Probability") +
  scale_y_continuous(labels = percent_format(accuracy = 0.01)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Visual Interpretation:

This trend suggests that while genres like Drama, Comedy, and Documentary dominate the TV show market, there are still a number of niche genres that cater to smaller, more dedicated audiences. These rarer genres may represent more specialized or experimental content, possibly targeting specific segments of viewers who prefer less mainstream content.

Further investigation could explore whether this distribution reflects audience preferences or production trends in the TV industry.

Insights:

  • Combinatorics: Each TV show can belong to multiple genres, and this is reflected in how genres are distributed. A show may belong to a “combination” of multiple sets (genres), and grouping by the genres column treats each distinct combination as its own group.

  • The rarity of some groups may reflect the audience’s preference for specific combinations of genres (e.g., a pairing such as Drama + Sci-Fi may be more popular than others).
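To look at individual genres rather than whole combinations, the comma-separated string can be split so that each show contributes one row per genre. The sketch below uses a made-up `toy_shows` table and assumes the tidyr package is available (it is not loaded elsewhere in this report); the `", "` separator is inferred from the genre strings shown above.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr is installed; used only for separate_rows()

# Hypothetical toy rows that mimic the comma-separated `genres` column
toy_shows <- tibble(
  title  = c("A", "B", "C", "D"),
  genres = c("Drama", "Drama, Comedy", "Documentary", "Comedy, Drama")
)

# One row per (show, genre) pair
genre_long <- toy_shows |>
  separate_rows(genres, sep = ",\\s*")

# Share of shows that carry each individual genre.
# Shares can sum to more than 1 because a show may have several genres.
single_genre <- genre_long |>
  distinct(title, genres) |>
  count(genres, name = "show_count") |>
  mutate(probability = show_count / nrow(toy_shows)) |>
  arrange(desc(probability))

single_genre
```

In this toy data, "Drama, Comedy" and "Comedy, Drama" would be two separate groups under `group_by(genres)`, but after splitting both shows count toward Drama and toward Comedy.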

Overall Learning:

By grouping TV shows by genres, we see that some genres dominate the dataset, while others are underrepresented. The “Rare Group” in this case might correspond to highly niche genres that do not attract a large audience or are expensive to produce.

  • Testable Hypothesis: Niche genres, such as experimental or foreign-language-specific genres, might be less produced due to a limited target audience or higher production costs.

Grouping by “networks”

We group the data by ‘networks’ and calculate the probability of selecting a show from a specific network.

# Group by networks and calculate the number of shows per network
network_group <- tv_data |>
  group_by(networks) |>
  summarize(show_count = n(), 
            avg_rating = mean(vote_average, na.rm = TRUE)) |>
  mutate(probability = show_count / sum(show_count)) |>
  arrange(desc(probability))

# Tagging the smallest group as "Rare Group"
network_group <- network_group |>
  mutate(tag = ifelse(probability == min(probability), "Rare Group", "Common Group"))

# Format the probability column as percentages and round to 4 decimal places
network_group_with_probability_percentage <- network_group |>
  mutate(probability = percent(probability, accuracy = 0.0001))

# View the result
network_group_with_probability_percentage

Insights:

  • Frequentist Perspective: This reflects how likely a show is to come from a particular network based on relative frequency in the dataset.

  • The “Rare Group” represents networks with fewer shows, possibly indicating smaller networks or networks specializing in niche content.
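Because the tag is assigned to every group whose probability equals the minimum, many networks can share the “Rare Group” label when they tie. The sketch below, on a made-up `toy_groups` table, counts those ties and shows one alternative: tagging everything below a cutoff. The 5% value in `rare_cutoff` is an arbitrary assumption for illustration, not a recommendation.

```r
library(dplyr)

# Hypothetical toy counts with the same columns as network_group
toy_groups <- tibble(
  networks   = c("Net A", "Net B", "Net C", "Net D"),
  show_count = c(50, 30, 1, 1)
) |>
  mutate(probability = show_count / sum(show_count))

# How many groups share the minimum probability?
n_tied <- toy_groups |>
  filter(near(probability, min(probability))) |>
  nrow()

# Alternative tag: flag every group below an assumed cutoff (here 5%)
rare_cutoff <- 0.05
toy_tagged <- toy_groups |>
  mutate(tag = ifelse(probability < rare_cutoff, "Rare Group", "Common Group"))

n_tied
toy_tagged
```

Checking the number of tied groups first makes it clear whether “Rare Group” refers to one network or to a long tail of networks with the same count.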

Visualization:

ggplot(network_group, aes(x = reorder(networks, -probability), y = probability, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Probability Distribution of TV Shows by Network", 
       x = "Networks", 
       y = "Probability") +
  scale_y_continuous(labels = percent_format(accuracy = 0.01)) +  # Show y-axis as percentages
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

### Visualization: Probability Distribution of the Top 40 Networks
top_40_networks <- network_group |> slice_max(order_by = show_count, n = 40)

ggplot(top_40_networks, aes(x = reorder(networks, -probability), y = probability, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Probability Distribution of TV Shows for the Top 40 Networks",
       x = "Networks", 
       y = "Probability") +
  scale_y_continuous(labels = percent_format(accuracy = 0.01)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Visual Interpretation:

This trend suggests that while certain networks are responsible for producing a significant portion of the TV shows in the dataset, there are still networks that produce fewer shows, which may be more specialized or cater to niche audiences. Networks with the largest bars (highest probabilities) are likely the most well-established, having broader reach and the resources to produce a large volume of content.

Further investigation could explore whether the dominance of certain networks reflects industry consolidation or audience preferences. It would be interesting to examine whether networks producing fewer shows target specific genres or regional markets.

Overall Learning:

When grouped by networks, we find that certain networks have produced a much higher number of TV shows, while others have only a small fraction. The smallest group of networks represents those that might be either new to the industry, have less funding, or cater to a specific niche.

  • Testable Hypothesis: The rarity of certain networks could be due to them being less established, or specializing in producing content for smaller, more specific audiences (e.g., documentaries or foreign language content).

Bayesian Perspective and Hypotheses:

  • Bayesian Hypothesis: The probability of observing a TV show in a particular language, genre, or network can be updated as we collect more data, reflecting the Bayesian approach to updating probabilities.

  • Testable Hypothesis: The rarity of specific groups may be influenced by external factors such as audience preferences or production resources. For example, networks with fewer shows may cater to niche audiences or have limited budgets.

We have explored the data using a Frequentist approach, calculating probabilities for different sets (languages, genres, networks). By grouping the data and analyzing rare groups, we can draw conclusions about underrepresented categories and hypothesize reasons for their rarity. The visualizations provide a clear view of how TV shows are distributed across these categories, and further analysis could involve applying Bayesian methods to update these probabilities as new data becomes available.
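One simple way to sketch the Bayesian updating mentioned above is a Beta-Binomial model for the share of a single group. This is a swapped-in textbook technique, not something computed in this report: the uniform Beta(1, 1) prior, the batch counts, and the helper name `update_beta()` are all assumptions made for illustration.

```r
# Beta-Binomial update for the share of one group.
# NOTE: all numbers below are hypothetical, not taken from tv_data.
prior_alpha <- 1  # Beta(1, 1) is a uniform prior on the share
prior_beta  <- 1

# Add observed in-group and out-of-group counts to the Beta parameters
update_beta <- function(alpha, beta, in_group, total) {
  c(alpha = alpha + in_group, beta = beta + (total - in_group))
}

# First batch: 3 of 100 shows fall in the group
post1 <- update_beta(prior_alpha, prior_beta, in_group = 3, total = 100)

# Second batch: 10 of 400 further shows fall in the group
post2 <- update_beta(post1[["alpha"]], post1[["beta"]], in_group = 10, total = 400)

# Posterior mean and 95% credible interval for the group's share
posterior_mean <- post2[["alpha"]] / (post2[["alpha"]] + post2[["beta"]])
cred_int <- qbeta(c(0.025, 0.975), post2[["alpha"]], post2[["beta"]])

posterior_mean
cred_int
```

Each new batch of shows only changes the two Beta parameters, so the estimate of the group's share can be refreshed without recomputing from the full dataset.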

Combinations of Categorical Variables

Investigating original_language and genres

Let’s first create a data frame with all possible combinations of original_language and genres and then compare it with the existing data to identify missing combinations.

# Extract unique values for original_language and genres
languages <- unique(tv_data$original_language)
genres <- unique(tv_data$genres)

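# Create a data frame of all possible combinations of languages and genres
# (stringsAsFactors = FALSE keeps both columns as character, matching tv_data in the later join)
all_combinations <- expand.grid(original_language = languages, genres = genres,
                                stringsAsFactors = FALSE)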

# View the generated combinations
head(all_combinations)

Find Missing Combinations

# Extract existing combinations of original_language and genres from the dataset
existing_combinations <- tv_data |>
  select(original_language, genres) |>
  distinct()

# Identify missing combinations
missing_combinations <- anti_join(all_combinations, existing_combinations, by = c("original_language", "genres"))

# View the missing combinations (if any)
missing_combinations

Interpretation and Significance:

The missing combinations show that certain genre groups never appear with specific languages in this dataset. This might be due to cultural preferences, production limitations, or audience demand, or simply to gaps in the dataset’s coverage.

  • Testable Hypothesis: Some genres may not appeal to certain linguistic audiences, leading to missing combinations in the dataset.
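A quick sanity check on the missing-combination logic is that the number of possible pairs should equal the observed pairs plus the missing pairs. The sketch below runs the same steps on a made-up `toy` table; the variable names ending in `_pairs` are hypothetical and chosen so they do not clash with the objects above.

```r
library(dplyr)

# Hypothetical toy data with the same two columns as tv_data
toy <- tibble(
  original_language = c("en", "en", "ja", "uz"),
  genres            = c("Drama", "Comedy", "Drama", "Drama")
)

# Every possible language-genre pair (character columns, not factors)
all_pairs <- expand.grid(
  original_language = unique(toy$original_language),
  genres            = unique(toy$genres),
  stringsAsFactors  = FALSE
)

# Pairs that actually occur, and pairs that never occur
existing_pairs <- distinct(toy, original_language, genres)
missing_pairs  <- anti_join(all_pairs, existing_pairs,
                            by = c("original_language", "genres"))

# Possible pairs = observed pairs + missing pairs
nrow(all_pairs) == nrow(existing_pairs) + nrow(missing_pairs)

# Share of the grid that is empty
share_missing <- nrow(missing_pairs) / nrow(all_pairs)
share_missing
```

When the genre groups are full combination strings, a large empty share is to be expected, since most specific combinations occur in only a few languages.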

Identify Most/Least Common Combinations

This will count how often each combination of original_language and genres appears in the data and determine the most and least common combinations.

# Count the occurrences of each combination of original_language and genres
combination_counts <- tv_data |>
  group_by(original_language, genres) |>
  summarize(count = n()) |>
  arrange(desc(count))
## `summarise()` has grouped output by 'original_language'. You can override using
## the `.groups` argument.
# View the most and least common combinations
combination_counts

Visualization:

# Heatmap of original_language and genres combinations
ggplot(combination_counts, aes(x = original_language, y = genres, fill = count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Count") +
  labs(title = "Heatmap of Original Language and Genre Combinations", 
       x = "Original Language", 
       y = "Genres") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Limiting the heatmap to the top 2 genre combinations per language:
# combination_counts is still grouped by original_language (see the summarise()
# message above), so slice_max() keeps the 2 largest counts within each language
top_combinations <- combination_counts |> slice_max(order_by = count, n = 2)

# Heatmap of the top 2 genre combinations for each original_language
ggplot(top_combinations, aes(x = original_language, y = genres, fill = count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Count") +
  labs(title = "Heatmap of Top 2 Genre Combinations per Original Language",
       x = "Original Language",
       y = "Genres") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 2),
        axis.text.y = element_text(size = 8),
        plot.title = element_text(hjust = 0.5))

Visual Interpretation:

The heatmap above visualizes the 2 most common genre combinations within each original language, because the counts are still grouped by language when slice_max() is applied. Each cell represents a specific combination of language and genre, with the color intensity indicating the count of shows for that combination. Darker colors represent higher counts, while lighter colors represent lower counts among the combinations shown.

The heatmap highlights the dominant combinations of languages and genres in the dataset, revealing where content production is concentrated. This visualization shows that a few language-genre combinations dominate the dataset, suggesting that certain genres are more commonly produced in specific languages, likely due to cultural preferences, audience demand, or production focus by specific language industries.

Further investigation could explore whether these top combinations reflect global market trends or if they cater primarily to regional markets with high viewership in specific genres.