In this data dive, we explore the importance of properly documenting a dataset and understanding how the documentation helps avoid misinterpretation. We will examine specific columns that are unclear without documentation, explore elements that remain ambiguous even after reading the documentation, and visualize these issues to highlight potential risks or inconsistencies.
The dataset used in this analysis contains various columns that are initially unclear but become understandable after reviewing the documentation. Here, we list three such columns:
Issue:
Initially, it is unclear whether vote_count represents the number of individual users who voted or the total number of ratings each show has received.
Insight from Documentation:
According to the documentation, vote_count is the total number of ratings submitted by users, which can include multiple votes from the same user.
Why did they encode it this way?
This encoding helps with aggregating popularity metrics but could lead to misleading results if user voting frequency isn’t controlled.
Potential Risk:
Without checking the documentation, we might read vote_count as the number of unique voters, which could overstate audience engagement.
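The gap between the two readings can be illustrated with a small sketch. The dataset itself contains no per-user rating records, so the `ratings_log` table below and its `show` and `user_id` columns are hypothetical, made up only to show how a total-ratings count can exceed a unique-voter count.

```r
library(dplyr)

# Hypothetical per-rating log (NOT part of the TMDB dataset): one row per submitted rating.
ratings_log <- data.frame(
  show    = c("Show A", "Show A", "Show A", "Show B"),
  user_id = c(1, 1, 2, 3)
)

# Compare total ratings with the number of distinct users per show.
vote_summary <- ratings_log |>
  group_by(show) |>
  summarize(
    total_ratings = n(),                 # what vote_count is documented to hold
    unique_users  = n_distinct(user_id)  # what a reader might wrongly assume
  )

print(vote_summary)
```

In this toy log, Show A has three ratings from only two distinct users, so reading the total as a head count would overstate its audience.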
Issue:
It is unclear whether original_language refers to the language in which the show was produced or the language in which it was primarily broadcast.
Insight from Documentation:
The documentation clarifies that this field represents the production language, not the broadcast language.
Why did they encode it this way?
This encoding allows for consistent language attribution across different countries. Without the clarification, the field could cause confusion when dealing with multilingual content.
Potential Risk:
Without clarification, we could attribute a TV show to the wrong language market, which would distort any analysis of regional preferences.
Issue:
Initially, it is unclear whether the genres column holds one genre per show or can hold several.
Insight from Documentation:
The documentation reveals that a show can have multiple genres assigned, separated by commas.
Why did they encode it this way?
This method allows each show to be tagged with multiple genres, enabling more granular classification. If we didn’t know this, we might incorrectly assume each show belongs to a single genre.
Potential Risk:
Failing to recognize multiple genres per show could lead to incorrect conclusions about genre popularity.
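One way to avoid this risk is to split the comma-separated values before counting. The sketch below is a minimal illustration on made-up rows (the `genres_toy` data frame and its show names are assumptions, not taken from the dataset), and it uses `tidyr::separate_rows()`, which the original analysis does not load.

```r
library(dplyr)
library(tidyr)

# Toy data mimicking the comma-separated `genres` column; titles and values are made up.
genres_toy <- data.frame(
  name   = c("Show A", "Show B", "Show C", "Show D"),
  genres = c("Drama, Crime", "Drama", "Crime, Drama, Mystery", NA)
)

# Naive approach: each full string is treated as if it were a single genre.
naive_counts <- genres_toy |>
  count(genres, sort = TRUE)

# Split approach: one row per (show, genre) pair, then count individual genres.
genre_counts <- genres_toy |>
  separate_rows(genres, sep = ",\\s*") |>
  filter(!is.na(genres)) |>
  count(genres, sort = TRUE)

print(genre_counts)
```

The naive count treats "Drama, Crime" and "Drama" as unrelated categories, while the split count credits Drama with every show that lists it.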
Issue:
Even after reviewing the documentation, it is unclear whether the spoken_languages column lists all languages spoken throughout the show or only the primary spoken language.
Why is this unclear?
The documentation fails to mention how to handle multilingual shows or shows with frequent code-switching between languages. There is no clear explanation of whether this field lists all languages spoken or just the primary language.
Risks:
Not knowing whether spoken_languages represents all languages or just one could lead to flawed analysis, especially for shows where language diversity is an important feature (e.g., bilingual shows). This could influence how we group shows by language or analyze viewership by language preferences.
We will create a bar plot to visualize the distribution of TV shows by spoken_languages and highlight the ambiguity by flagging the values that list more than one language.
# Load necessary libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Load the dataset (replace this with your dataset path)
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl (2): adult, in_production
## date (2): first_air_date, last_air_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows
head(tv_data)
# Inspect the structure of the dataset
str(tv_data)
## spc_tbl_ [168,639 × 29] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:168639] 1399 71446 66732 1402 63174 ...
## $ name : chr [1:168639] "Game of Thrones" "Money Heist" "Stranger Things" "The Walking Dead" ...
## $ number_of_seasons : num [1:168639] 8 3 4 11 6 7 2 5 6 1 ...
## $ number_of_episodes : num [1:168639] 73 41 34 177 93 137 9 62 116 9 ...
## $ original_language : chr [1:168639] "en" "es" "en" "en" ...
## $ vote_count : num [1:168639] 21857 17836 16161 15432 13870 ...
## $ vote_average : num [1:168639] 8.44 8.26 8.62 8.12 8.49 ...
## $ overview : chr [1:168639] "Seven noble families fight for control of the mythical land of Westeros. Friction between the houses leads to f"| __truncated__ "To carry out the biggest heist in history, a mysterious man called The Professor recruits a band of eight robbe"| __truncated__ "When a young boy vanishes, a small town uncovers a mystery involving secret experiments, terrifying supernatura"| __truncated__ "Sheriff's deputy Rick Grimes awakens from a coma to find a post-apocalyptic world dominated by flesh-eating zom"| __truncated__ ...
## $ adult : logi [1:168639] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ backdrop_path : chr [1:168639] "/2OMB0ynKlyIenMJWI2Dy9IWT4c.jpg" "/gFZriCkpJYsApPZEF3jhxL4yLzG.jpg" "/2MaumbgBlW1NoPo3ZJO38A6v7OS.jpg" "/x4salpjB11umlUOltfNvSSrjSXm.jpg" ...
## $ first_air_date : Date[1:168639], format: "2011-04-17" "2017-05-02" ...
## $ last_air_date : Date[1:168639], format: "2019-05-19" "2021-12-03" ...
## $ homepage : chr [1:168639] "http://www.hbo.com/game-of-thrones" "https://www.netflix.com/title/80192098" "https://www.netflix.com/title/80057281" "http://www.amc.com/shows/the-walking-dead--1002293" ...
## $ in_production : logi [1:168639] FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ original_name : chr [1:168639] "Game of Thrones" "La Casa de Papel" "Stranger Things" "The Walking Dead" ...
## $ popularity : num [1:168639] 1083.9 96.4 185.7 489.7 416.7 ...
## $ poster_path : chr [1:168639] "/1XS1oqL89opfnbLl8WnZY1O1uJx.jpg" "/reEMJA1uzscCbkpeRJeTT2bjqUp.jpg" "/49WJfeN0moxb9IPfGn8AIqMGskD.jpg" "/n7PVu0hSz2sAsVekpOIoCnkWlbn.jpg" ...
## $ type : chr [1:168639] "Scripted" "Scripted" "Scripted" "Scripted" ...
## $ status : chr [1:168639] "Ended" "Ended" "Returning Series" "Ended" ...
## $ tagline : chr [1:168639] "Winter Is Coming" "The perfect robbery." "Every ending has a beginning." "Fight the dead. Fear the living." ...
## $ genres : chr [1:168639] "Sci-Fi & Fantasy, Drama, Action & Adventure" "Crime, Drama" "Drama, Sci-Fi & Fantasy, Mystery" "Action & Adventure, Drama, Sci-Fi & Fantasy" ...
## $ created_by : chr [1:168639] "David Benioff, D.B. Weiss" "Álex Pina" "Matt Duffer, Ross Duffer" "Frank Darabont" ...
## $ languages : chr [1:168639] "en" "es" "en" "en" ...
## $ networks : chr [1:168639] "HBO" "Netflix, Antena 3" "Netflix" "AMC" ...
## $ origin_country : chr [1:168639] "US" "ES" "US" "US" ...
## $ spoken_languages : chr [1:168639] "English" "Español" "English" "English" ...
## $ production_companies: chr [1:168639] "Revolution Sun Studios, Television 360, Generator Entertainment, Bighead Littlehead" "Vancouver Media" "21 Laps Entertainment, Monkey Massacre Productions" "AMC Studios, Circle of Confusion, Valhalla Motion Pictures, Darkwoods Productions, Skybound Entertainment, Idiotbox" ...
## $ production_countries: chr [1:168639] "United Kingdom, United States of America" "Spain" "United States of America" "United States of America" ...
## $ episode_run_time : num [1:168639] 0 70 0 42 45 45 0 0 43 0 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. number_of_seasons = col_double(),
## .. number_of_episodes = col_double(),
## .. original_language = col_character(),
## .. vote_count = col_double(),
## .. vote_average = col_double(),
## .. overview = col_character(),
## .. adult = col_logical(),
## .. backdrop_path = col_character(),
## .. first_air_date = col_date(format = ""),
## .. last_air_date = col_date(format = ""),
## .. homepage = col_character(),
## .. in_production = col_logical(),
## .. original_name = col_character(),
## .. popularity = col_double(),
## .. poster_path = col_character(),
## .. type = col_character(),
## .. status = col_character(),
## .. tagline = col_character(),
## .. genres = col_character(),
## .. created_by = col_character(),
## .. languages = col_character(),
## .. networks = col_character(),
## .. origin_country = col_character(),
## .. spoken_languages = col_character(),
## .. production_companies = col_character(),
## .. production_countries = col_character(),
## .. episode_run_time = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Count shows per distinct spoken_languages value (each value is a single string)
spoken_lang_group <- tv_data |>
  group_by(spoken_languages) |>
  summarize(show_count = n()) |>
  arrange(desc(show_count))
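# Flag values that list more than one language (comma-separated) as ambiguous.
# Note: a missing (NA) value is not flagged, because grepl() returns FALSE for NA.
spoken_lang_group <- spoken_lang_group |>
  mutate(is_unclear = ifelse(grepl(",", spoken_languages), "Multiple Languages (Unclear)", "Single Language (Clear)"))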
# Plotting
ggplot(spoken_lang_group, aes(x = reorder(spoken_languages, -show_count), y = show_count, fill = is_unclear)) +
  geom_col() +
  labs(title = "TV Shows Distribution by Spoken Languages",
       x = "Spoken Languages",
       y = "Show Count",
       fill = "Clarity of Data") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Keep the 40 most common spoken_languages values for a readable chart.
# The is_unclear flag computed above is carried over, so it is not recomputed here.
top_40_spoken_lang_group <- spoken_lang_group |>
  slice_max(order_by = show_count, n = 40)
ggplot(top_40_spoken_lang_group, aes(x = reorder(spoken_languages, -show_count), y = show_count, fill = is_unclear)) +
  geom_col() +
  labs(title = "TV Show Distribution Across the Top 40 Spoken-Language Values",
       x = "Spoken Languages",
       y = "Show Count",
       fill = "Clarity of Data") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
This bar chart shows the number of TV shows for each of the 40 most common spoken_languages values. Bars are colored by whether the value is clear or potentially ambiguous: a value containing a comma is labeled "Multiple Languages (Unclear)", and any other value is labeled "Single Language (Clear)".
The ambiguous values highlight the potential confusion over whether these shows are truly multilingual or whether only the primary language should be considered.
Risk:
The uncertainty in how languages are documented could lead to incorrect conclusions about language diversity in TV shows. For example, shows with multiple spoken languages may be misinterpreted as multilingual even if one language dominates most of the content.
Potential Solution:
To reduce this risk, we could either manually inspect the shows with multiple spoken languages or clarify how to interpret this column in future documentation updates.
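A manual inspection could start from a short list of the affected rows. The sketch below is one possible approach, not the method used in this analysis: it runs on made-up rows (the `tv_toy` data frame is an assumption) that reuse the dataset's `name`, `original_language`, and `spoken_languages` column names, and it counts the comma-separated entries in each value.

```r
library(dplyr)

# Toy rows that mimic the dataset's columns; the values are made up for illustration.
tv_toy <- data.frame(
  name              = c("Show A", "Show B", "Show C", "Show D"),
  original_language = c("en", "es", "en", "ja"),
  spoken_languages  = c("English", "Español, English", NA, "日本語")
)

# Count how many languages each value lists; keep NA where the value is missing.
lang_review <- tv_toy |>
  mutate(
    n_spoken = if_else(
      is.na(spoken_languages),
      NA_integer_,
      lengths(strsplit(spoken_languages, ",\\s*"))
    )
  )

# Rows listing more than one language are the candidates for manual inspection.
to_inspect <- lang_review |>
  filter(n_spoken > 1) |>
  select(name, original_language, spoken_languages, n_spoken)

print(to_inspect)
```

Keeping `original_language` next to `spoken_languages` in the output makes it easier to judge, show by show, whether one language dominates.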