This data dive explores a dataset of TV shows from TMDB (The Movie Database), an online community that collects information on TV shows, movies, and more. The dataset contains detailed information on various TV shows, including metadata like the number of seasons, episodes, languages, networks, ratings, and genres.
The purpose of this analysis is to understand the characteristics of the TV shows in the dataset and investigate patterns and trends in the data. Specifically:
1. Perform summary statistics to get an overview of the numeric and categorical data in the dataset.
2. Investigate relationships between certain key variables to answer some analytical questions about TV shows, such as:
- Do certain languages or genres tend to have higher ratings?
- Is there any correlation between the number of episodes and the average rating of a TV show?
- What are the trends in the distribution of genres across different networks?
3. Visualize the data to better understand its distribution and identify trends or patterns in the relationships between key variables, such as ratings and episode counts.
The dataset has 29 columns. The ones most relevant to this analysis are:
- id: Unique identifier for the TV show.
- name: The name of the TV show.
- number_of_seasons: The number of seasons the TV show has aired.
- number_of_episodes: The total number of episodes across all seasons.
- original_language: The primary language in which the show is produced.
- vote_count: The number of votes the show has received.
- vote_average: The average rating of the show, calculated based on user votes.
- overview: A brief description of the TV show.
- adult: A boolean flag indicating whether the show is for adult audiences.
- tagline: A short slogan or phrase associated with the show.
- genres: The genre(s) associated with the show (e.g., Drama, Crime, Sci-Fi).
- networks: The networks where the show is broadcast.
- spoken_languages: The languages spoken in the show.
- production_companies: The companies involved in producing the show.
- production_countries: The countries where the show was produced.
- episode_run_time: The typical run time of episodes for the show, if available.
The analysis will focus on understanding numeric and categorical features of the TV shows, followed by exploring relationships between some variables. The following questions will be investigated:
1. Does the average rating of a show differ based on its primary language?
2. Is there a correlation between the number of episodes a show has and its average rating?
3. How are different genres distributed across TV networks, and what are the most popular genres for each network?
Let’s load the dataset into R and inspect its structure to get familiar with the available columns.
options(repos = c(CRAN = "https://cran.rstudio.com/"))
# Install Readr
install.packages("readr")
# Load necessary libraries
library(readr)
# Load the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl (2): adult, in_production
## date (2): first_air_date, last_air_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the dataset
head(tv_data)
## # A tibble: 6 × 29
## id name number_of_seasons number_of_episodes original_language vote_count
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1399 Game … 8 73 en 21857
## 2 71446 Money… 3 41 es 17836
## 3 66732 Stran… 4 34 en 16161
## 4 1402 The W… 11 177 en 15432
## 5 63174 Lucif… 6 93 en 13870
## 6 69050 River… 7 137 en 13180
## # ℹ 23 more variables: vote_average <dbl>, overview <chr>, adult <lgl>,
## # backdrop_path <chr>, first_air_date <date>, last_air_date <date>,
## # homepage <chr>, in_production <lgl>, original_name <chr>, popularity <dbl>,
## # poster_path <chr>, type <chr>, status <chr>, tagline <chr>, genres <chr>,
## # created_by <chr>, languages <chr>, networks <chr>, origin_country <chr>,
## # spoken_languages <chr>, production_companies <chr>,
## # production_countries <chr>, episode_run_time <dbl>
Summarizing number_of_episodes and vote_average shows the range, center, and skew of the two main numeric variables:
# Summary for number_of_episodes
summary(tv_data$number_of_episodes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 6.00 24.46 20.00 20839.00
The summary of number_of_episodes shows a wide range in the number of episodes among the TV shows in the dataset. The minimum value of 0 indicates that there are shows with no episodes, which could either be incomplete data or shows that were produced but never aired.
The gap between the median (6) and the mean (24.46), along with the maximum of 20,839, suggests a right-skewed distribution: a few shows with very many episodes are pulling the average up.
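The skew described above can be quantified with a few extra numbers. The sketch below is illustrative only: `describe_skew` is a hypothetical helper, and `episodes` is a made-up vector standing in for tv_data$number_of_episodes.

```r
# Hypothetical helper: summarize how skewed a count variable is.
describe_skew <- function(x) {
  x <- x[!is.na(x)]
  list(
    share_zero       = mean(x == 0),                               # fraction of rows with zero episodes
    upper_quantiles  = quantile(x, c(0.90, 0.99), names = FALSE),  # where the long tail starts
    mean_over_median = mean(x) / median(x)                         # well above 1 suggests right skew
  )
}

# Toy values for illustration only (not taken from the dataset)
episodes <- c(0, 1, 1, 6, 6, 8, 20, 24, 100, 2000)
skew_info <- describe_skew(episodes)
print(skew_info)
```

On the real data the same call would be describe_skew(tv_data$number_of_episodes). Note that the ratio is infinite when the median is 0.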
# Summary for vote_average
summary(tv_data$vote_average)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.334 6.000 10.000
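The summary statistics for vote_average reveal some interesting patterns: the minimum, first quartile, and median are all 0, the mean is 2.334, the third quartile is 6, and the maximum is 10.
This distribution means that at least half of the shows in the dataset have a vote_average of exactly 0. Such shows are either unrated or have received very low ratings, and only a minority have substantive ratings. A value of exactly 0 more plausibly reflects missing votes than a genuinely low score, although that should be checked against vote_count. Either way, the dataset appears to include many obscure or less popular shows.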
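One way to separate unrated shows from genuinely low-rated ones is to summarize only the shows that have received votes. This is a minimal sketch, assuming that a vote_count of 0 marks an unrated show. `rated_only` is a hypothetical helper, and `toy` is an invented stand-in for tv_data.

```r
# Hypothetical helper: keep only shows with at least `min_votes` votes.
rated_only <- function(df, min_votes = 1) {
  df[!is.na(df$vote_count) & df$vote_count >= min_votes, , drop = FALSE]
}

# Toy stand-in for tv_data (values invented for illustration)
toy <- data.frame(
  vote_count   = c(0, 0, 0, 12, 250, 3),
  vote_average = c(0, 0, 0, 7.5, 8.1, 6.0)
)

summary(toy$vote_average)      # summary including the zero-vote rows
rated <- rated_only(toy)
summary(rated$vote_average)    # summary over shows that have votes
```

Applied to the real data, rated_only(tv_data) would show how much of the mass at 0 comes from shows nobody has voted on.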
Summarizing original_language and genres shows which categories dominate the dataset and which are rare:
# Categorical summary for original_language
table(tv_data$original_language)
##
## aa ab af am ar as av az be bg bn bs ca
## 1 13 88 48 2638 5 2 11 3 93 229 29 156
## cn cs cy da de dv el en eo es et eu fa
## 1824 1382 30 2019 7712 2 903 76304 1 5602 52 12 244
## fi fr fy ga gd gl gu he hi ho hr ht hu
## 453 7290 3 25 4 18 20 553 1565 1 121 1 404
## hy hz id is it ja jv ka kk km kn ko ku
## 7 2 459 55 1691 14048 3 54 31 5 20 7820 10
## kv la lb ln lo lt lv mi mk ml mn mo mr
## 1 5 2 1 1 65 49 4 5 38 1 7 37
## ms mt my nb ne nl no or pa pl ps pt rm
## 495 20 112 80 6 2923 1210 14 26 923 1 3551 1
## ro ru se sh si sk sl so sq sr st sv sw
## 113 2963 1 65 6 404 50 2 20 152 1 1455 1
## ta te th ti tl tr ug uk ur uz vi xx za
## 114 84 1835 1 861 1748 1 118 333 1 205 28 1
## zh zu
## 14422 10
The distribution of TV shows by original_language reveals that the dataset is heavily dominated by English-language shows, with 76,304 of the 168,639 rows (about 45%). This is not surprising, given that English content often has a broader global reach and is widely produced.
Other languages with significant representation include:
- Chinese (zh): 14,422 shows
- Japanese (ja): 14,048 shows
- Korean (ko): 7,820 shows
- German (de): 7,712 shows
- French (fr): 7,290 shows
- Spanish (es): 5,602 shows
Languages such as Russian (ru, 2,963 shows) and Arabic (ar, 2,638 shows) have moderate representation. However, there are numerous languages with only a handful of shows, such as Uzbek (uz), Zhuang (za), and Avaric (av), each having just one or two entries.
The dominance of certain languages like English and Japanese could reflect the international popularity of shows from these regions, while the underrepresentation of other languages might suggest limited global production or distribution of content in those languages.
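The raw table() output above is ordered alphabetically, which makes the dominant languages hard to spot. A small sketch of sorting the counts and adding percentage shares follows. `language_shares` is a hypothetical helper, and `langs` is a toy vector, not data from the dataset.

```r
# Hypothetical helper: counts and percentage shares, largest first.
language_shares <- function(x, n = 5) {
  counts <- sort(table(x), decreasing = TRUE)
  shares <- round(100 * counts / sum(counts), 1)
  head(data.frame(
    language = names(counts),
    shows    = as.integer(counts),
    percent  = as.numeric(shares)
  ), n)
}

# Toy vector for illustration only
langs <- c("en", "en", "en", "en", "zh", "zh", "ja", "ja", "ko", "fr")
top_langs <- language_shares(langs, n = 3)
print(top_langs)
```

On the real data, language_shares(tv_data$original_language, n = 10) would list the ten most common languages with their shares. table() drops NA values by default, so the percentages are relative to rows with a recorded language.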
# Categorical summary for genres sorted in descending order
genres_summary <- table(tv_data$genres)
sorted_genres_summary <- sort(genres_summary, decreasing = TRUE)
# Display the top 100 categories
top_100_genres <- head(sorted_genres_summary, 100)
# Show the result
top_100_genres
##
## Documentary
## 17596
## Drama
## 16282
## Comedy
## 10304
## Reality
## 8009
## Animation
## 3326
## Comedy, Drama
## 1932
## Drama, Comedy
## 1878
## Talk
## 1872
## Animation, Comedy
## 1254
## Kids
## 1112
## Crime
## 1037
## News
## 1022
## Crime, Drama
## 985
## Family
## 980
## Drama, Crime
## 921
## Animation, Kids
## 908
## Drama, Family
## 886
## Drama, Mystery
## 778
## Documentary, Reality
## 758
## Action & Adventure
## 727
## Documentary, Crime
## 681
## Mystery
## 635
## Comedy, Family
## 558
## Sci-Fi & Fantasy
## 516
## Soap, Drama
## 512
## Drama, Soap
## 490
## Action & Adventure, Drama
## 479
## Drama, War & Politics
## 464
## Drama, Sci-Fi & Fantasy
## 458
## Reality, Documentary
## 444
## Soap
## 393
## Animation, Action & Adventure, Sci-Fi & Fantasy
## 385
## Family, Drama
## 377
## Drama, Action & Adventure
## 374
## Animation, Sci-Fi & Fantasy
## 367
## Mystery, Drama
## 363
## Kids, Animation
## 358
## Animation, Action & Adventure
## 356
## Documentary, War & Politics
## 351
## Crime, Drama, Mystery
## 312
## Family, Comedy
## 283
## Sci-Fi & Fantasy, Drama
## 281
## Comedy, Animation
## 269
## Comedy, Talk
## 264
## Animation, Drama
## 263
## Crime, Mystery
## 253
## Reality, Comedy
## 233
## Action & Adventure, Sci-Fi & Fantasy
## 225
## Comedy, Reality
## 223
## Crime, Documentary
## 222
## Comedy, Sci-Fi & Fantasy
## 212
## Mystery, Crime
## 203
## Animation, Comedy, Sci-Fi & Fantasy
## 199
## Documentary, Drama
## 194
## Drama, Crime, Mystery
## 170
## Animation, Comedy, Drama
## 165
## Documentary, Comedy
## 164
## War & Politics
## 164
## Reality, Talk
## 159
## War & Politics, Drama
## 158
## Talk, Comedy
## 151
## Animation, Sci-Fi & Fantasy, Action & Adventure
## 146
## Action & Adventure, Animation
## 137
## Comedy, Documentary
## 136
## Action & Adventure, Animation, Sci-Fi & Fantasy
## 133
## Western
## 133
## Reality, Family
## 130
## Family, Kids
## 129
## Kids, Family
## 126
## Documentary, News
## 123
## Animation, Family, Kids
## 118
## Comedy, Drama, Family
## 117
## Documentary, Talk
## 115
## Animation, Family
## 112
## Comedy, Crime
## 111
## Family, Reality
## 111
## Sci-Fi & Fantasy, Action & Adventure
## 111
## Animation, Comedy, Kids
## 107
## Drama, Mystery, Crime
## 106
## Talk, Reality
## 104
## Sci-Fi & Fantasy, Comedy
## 100
## Comedy, Mystery
## 96
## Crime, Mystery, Drama
## 95
## Action & Adventure, Crime
## 93
## Action & Adventure, Comedy
## 91
## Animation, Action & Adventure, Comedy
## 90
## Animation, Kids, Family
## 90
## Comedy, Action & Adventure
## 89
## Animation, Comedy, Action & Adventure
## 85
## Drama, Comedy, Family
## 83
## Soap, Comedy
## 83
## Action & Adventure, Drama, Sci-Fi & Fantasy
## 81
## Mystery, Sci-Fi & Fantasy
## 79
## Animation, Comedy, Family
## 78
## Drama, Mystery, Sci-Fi & Fantasy
## 78
## Documentary, Family
## 77
## Drama, Documentary
## 76
## Mystery, Drama, Crime
## 74
## Action & Adventure, Crime, Drama
## 71
## Sci-Fi & Fantasy, Animation
## 71
The genre distribution is dominated by a few single-genre labels:
- Documentary: 17,596 shows
- Drama: 16,282 shows
- Comedy: 10,304 shows
Together these three labels cover 44,182 shows, roughly 26% of the 168,639 rows, suggesting that factual content, character-driven stories, and humor are the most common types of TV shows.
Other labels like Reality (8,009 shows), Animation (3,326 shows), and combinations such as Comedy, Drama (1,932 shows) are also well-represented.
Genre combinations like Animation, Comedy and Crime, Drama suggest that multi-genre shows are fairly common, particularly for dramatic and comedic content. Note that table() counts each distinct string separately, so Comedy, Drama (1,932) and Drama, Comedy (1,878) appear as different categories even though they describe the same pair of genres. Some niche combinations like Animation, Sci-Fi & Fantasy are less frequent but still present.
The prominence of Documentary and Drama could reflect a broader trend toward factual content and serialized storytelling, while labels like Mystery (635 shows) and War & Politics (164 shows) appear less frequently on their own, which may indicate that they cater to smaller, more specialized audiences.
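Because table() treats every distinct genres string as its own category, the same genre is spread across many rows of the output above. A minimal sketch of tallying individual genres instead follows, assuming the values are separated by a comma and a space as they appear in the output. `count_individual_genres` is a hypothetical helper, and `genres` is a toy vector.

```r
# Hypothetical helper: tally individual genres across all shows,
# regardless of the order in which a show lists them.
count_individual_genres <- function(genres) {
  genres <- genres[!is.na(genres)]
  parts <- strsplit(genres, ",\\s*")   # "Comedy, Drama" -> c("Comedy", "Drama")
  sort(table(trimws(unlist(parts))), decreasing = TRUE)
}

# Toy vector for illustration only
genres <- c("Drama", "Comedy, Drama", "Drama, Comedy", "Documentary", NA)
genre_counts <- count_individual_genres(genres)
print(genre_counts)
```

With this approach a show tagged Comedy, Drama contributes one count to Comedy and one to Drama, so the totals can exceed the number of shows.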
Question: Does the average rating of a show differ based on its language?
# Aggregating vote_average by original_language
aggregate(tv_data$vote_average, by=list(Language=tv_data$original_language), FUN=mean)
## Language x
## 1 aa 0.00000000
## 2 ab 3.43592308
## 3 af 6.47646591
## 4 am 2.35833333
## 5 ar 2.12332563
## 6 as 7.20000000
## 7 av 0.00000000
## 8 az 2.81818182
## 9 be 2.66666667
## 10 bg 4.39077419
## 11 bn 2.97334934
## 12 bs 1.93103448
## 13 ca 2.19060256
## 14 cn 2.15544956
## 15 cs 2.13012084
## 16 cy 2.77633333
## 17 da 2.54437692
## 18 de 1.73996590
## 19 dv 0.00000000
## 20 el 3.35153599
## 21 en 2.12675084
## 22 eo 0.00000000
## 23 es 3.08211871
## 24 et 1.22500000
## 25 eu 3.20833333
## 26 fa 3.19272131
## 27 fi 3.17171302
## 28 fr 2.26773155
## 29 fy 0.00000000
## 30 ga 1.44000000
## 31 gd 0.00000000
## 32 gl 3.03072222
## 33 gu 0.80000000
## 34 he 2.41592948
## 35 hi 3.14637508
## 36 ho 0.00000000
## 37 hr 1.77797521
## 38 ht 0.00000000
## 39 hu 2.35064356
## 40 hy 0.00000000
## 41 hz 5.50000000
## 42 id 1.93763181
## 43 is 3.42360000
## 44 it 3.06983087
## 45 ja 3.69261817
## 46 jv 2.33333333
## 47 ka 3.78024074
## 48 kk 0.32258065
## 49 km 2.00000000
## 50 kn 2.04750000
## 51 ko 2.17177225
## 52 ku 8.00000000
## 53 kv 0.00000000
## 54 la 2.79560000
## 55 lb 3.38600000
## 56 ln 0.00000000
## 57 lo 0.00000000
## 58 lt 1.41538462
## 59 lv 3.25510204
## 60 mi 4.25000000
## 61 mk 3.40000000
## 62 ml 2.52631579
## 63 mn 10.00000000
## 64 mo 5.57142857
## 65 mr 0.97297297
## 66 ms 1.21360404
## 67 mt 0.75000000
## 68 my 0.08928571
## 69 nb 0.89583750
## 70 ne 4.66666667
## 71 nl 1.91611290
## 72 no 2.10793554
## 73 or 0.64285714
## 74 pa 2.42307692
## 75 pl 2.21658072
## 76 ps 0.00000000
## 77 pt 2.32868206
## 78 rm 0.00000000
## 79 ro 4.88791150
## 80 ru 2.97008437
## 81 se 10.00000000
## 82 sh 2.63590769
## 83 si 6.13333333
## 84 sk 1.25779455
## 85 sl 1.28000000
## 86 so 5.00000000
## 87 sq 5.05000000
## 88 sr 3.07360526
## 89 st 10.00000000
## 90 sv 2.80850447
## 91 sw 9.00000000
## 92 ta 3.10204386
## 93 te 3.17380952
## 94 th 2.59456512
## 95 ti 10.00000000
## 96 tl 1.54707201
## 97 tr 3.94363272
## 98 ug 0.00000000
## 99 uk 2.73947458
## 100 ur 6.54834835
## 101 uz 0.00000000
## 102 vi 1.82086829
## 103 xx 2.43632143
## 104 za 0.00000000
## 105 zh 1.70325135
## 106 zu 2.52000000
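The average ratings by original_language show notable differences across languages, but the extremes are driven by tiny groups: mn, se, st, and ti each average 10.0 from a single show, while several languages with only a handful of shows average 0. Among the well-represented languages, Turkish (tr, 3.94), Japanese (ja, 3.69), and Spanish (es, 3.08) average higher than English (en, 2.13), German (de, 1.74), and Chinese (zh, 1.70).
These differences could reflect varying production standards, audience preferences, or voting behavior by region. However, because the means include the many shows with a vote_average of 0, they may say more about how many shows in each language have been rated at all than about how well those shows are received.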
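A more robust comparison would restrict the means to shows that have votes and to languages with enough rated shows. The sketch below is one possible approach, not the method used above. `rating_by_language`, its `min_votes` and `min_shows` thresholds, and the `toy` data frame are all hypothetical.

```r
# Hypothetical helper: mean rating per language, using only shows with votes
# and only languages that have at least `min_shows` such shows.
rating_by_language <- function(df, min_votes = 1, min_shows = 2) {
  rated <- df[!is.na(df$vote_count) & df$vote_count >= min_votes, ]
  means <- tapply(rated$vote_average, rated$original_language, mean)
  n     <- tapply(rated$vote_average, rated$original_language, length)
  out <- data.frame(
    language    = names(means),
    mean_rating = as.numeric(means),
    n_rated     = as.integer(n)
  )
  out <- out[out$n_rated >= min_shows, ]
  out[order(-out$mean_rating), ]          # highest mean first
}

# Toy stand-in for tv_data (values invented for illustration)
toy <- data.frame(
  original_language = c("en", "en", "en", "ja", "ja", "mn", "de"),
  vote_count        = c(100, 0, 40, 10, 25, 1, 0),
  vote_average      = c(8.0, 0, 6.0, 7.0, 8.0, 10.0, 0)
)
by_lang <- rating_by_language(toy)
print(by_lang)
```

In the toy data, the single-show language mn and the unrated de row drop out, so the ranking reflects only languages with at least two rated shows. Suitable thresholds for the real data are a judgment call.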
install.packages("ggplot2")
library(ggplot2)
ggplot(tv_data, aes(x = vote_average)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Vote Average", x = "Vote Average", y = "Count") +
  theme_minimal()
The histogram visualizes the distribution of the vote_average for TV shows in the dataset. The following insights can be drawn from the plot:
The majority of TV shows have a vote average close to 0, which is consistent with the summary statistics where the median and first quartile of vote_average were both 0. This suggests that many shows in the dataset either haven’t received enough votes or have very low ratings.
As you move to the right of the plot, the number of shows decreases significantly, indicating that highly-rated shows are less common.
The sharp drop-off after the initial bins suggests that the dataset is heavily concentrated with low-rated or unrated shows. This could be due to the inclusion of obscure or less popular shows that are not widely reviewed.
The distribution’s shape may also imply a data quality issue, where shows with few or no user ratings are included in the dataset, leading to the dominance of 0-rated entries.
# Scatter plot to show correlation between number_of_episodes and vote_average
# Color is used to differentiate between different original languages
ggplot(tv_data, aes(x = number_of_episodes, y = vote_average, color = original_language)) +
  geom_point(alpha = 0.7) +
  labs(title = "Number of Episodes vs. Vote Average",
       x = "Number of Episodes",
       y = "Vote Average") +
  theme_minimal() +
  scale_color_viridis_d()
The scatter plot illustrates the relationship between the number of episodes and the vote average for TV shows, with points colored by their original language. Several important insights can be drawn from the plot: