This data dive explores a dataset of TV shows from TMDB (The Movie Database), an online community that collects information on TV shows, movies, and more. The dataset contains detailed information on various TV shows, including metadata like the number of seasons, episodes, languages, networks, ratings, and genres.
The purpose of this analysis is to understand the characteristics of the TV shows in the dataset and investigate patterns and trends in the data. Specifically, it will:
1. Perform summary statistics to get an overview of the numeric and categorical data in the dataset.
2. Investigate relationships between certain key variables to answer some analytical questions about TV shows, such as:
- Do certain languages or genres tend to have higher ratings?
- Is there any correlation between the number of episodes and the average rating of a TV show?
- What are the trends in the distribution of genres across different networks?
3. Visualize the data to better understand its distribution and identify trends or patterns in the relationships between key variables, such as ratings and episode counts.
The dataset includes the following columns:
- id: Unique identifier for the TV show.
- name: The name of the TV show.
- number_of_seasons: The number of seasons the TV show has aired.
- number_of_episodes: The total number of episodes across all seasons.
- original_language: The primary language in which the show is produced, stored as a two-letter code (e.g., en, es, ja).
- vote_count: The number of votes the show has received.
- vote_average: The average rating of the show, calculated from user votes.
- overview: A brief description of the TV show.
- adult: A boolean flag indicating whether the show is for adult audiences.
- tagline: A short slogan or phrase associated with the show.
- genres: The genre(s) associated with the show, stored as a comma-separated string (e.g., Drama, Crime, Sci-Fi & Fantasy).
- networks: The networks where the show is broadcast.
- spoken_languages: The languages spoken in the show.
- production_companies: The companies involved in producing the show.
- production_countries: The countries where the show was produced.
- episode_run_time: The typical run time of episodes for the show, if available.
The analysis will focus on understanding numeric and categorical features of the TV shows, followed by exploring relationships between some variables. The following questions will be investigated:
1. Does the average rating of a show differ based on its primary language?
2. Is there a correlation between the number of episodes a show has and its average rating?
3. How are different genres distributed across TV networks, and what are the most popular genres for each network?
Let’s load the dataset into R and inspect its structure to become familiar with the available columns and their types.
options(repos = c(CRAN = "https://cran.rstudio.com/"))
# Install readr only if it is not already available
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
# Load necessary libraries
library(readr)
# Load the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl (2): adult, in_production
## date (2): first_air_date, last_air_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the dataset
head(tv_data)
## # A tibble: 6 × 29
## id name number_of_seasons number_of_episodes original_language vote_count
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1399 Game … 8 73 en 21857
## 2 71446 Money… 3 41 es 17836
## 3 66732 Stran… 4 34 en 16161
## 4 1402 The W… 11 177 en 15432
## 5 63174 Lucif… 6 93 en 13870
## 6 69050 River… 7 137 en 13180
## # ℹ 23 more variables: vote_average <dbl>, overview <chr>, adult <lgl>,
## # backdrop_path <chr>, first_air_date <date>, last_air_date <date>,
## # homepage <chr>, in_production <lgl>, original_name <chr>, popularity <dbl>,
## # poster_path <chr>, type <chr>, status <chr>, tagline <chr>, genres <chr>,
## # created_by <chr>, languages <chr>, networks <chr>, origin_country <chr>,
## # spoken_languages <chr>, production_companies <chr>,
## # production_countries <chr>, episode_run_time <dbl>
Summarizing number_of_episodes and vote_average gives an overview of how long shows typically run and how their ratings are distributed:
# Summary for number_of_episodes
summary(tv_data$number_of_episodes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 6.00 24.46 20.00 20839.00
# Summary for vote_average
summary(tv_data$vote_average)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.334 6.000 10.000
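The median vote_average of 0 suggests that many shows may have received no votes at all, which would pull the mean down. As a hedged sketch, assuming unrated shows are stored with vote_count equal to 0 (this is not confirmed by anything shown here), the share of unrated shows and a summary restricted to rated shows could be computed as follows. The toy data frame is made up for illustration; in this analysis it would be replaced by tv_data.

```r
# Toy stand-in for tv_data; the values are invented for illustration only.
toy <- data.frame(
  vote_count   = c(0, 0, 12, 250, 0, 3),
  vote_average = c(0, 0, 7.5, 8.1, 0, 6.0)
)

# Share of shows with no votes at all
share_unrated <- mean(toy$vote_count == 0)

# Summary restricted to shows with at least one vote
rated_summary <- summary(toy$vote_average[toy$vote_count > 0])

share_unrated
rated_summary
```

If the unrated share turns out to be large, summaries and group means of vote_average are probably better computed on the rated subset only.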
Summarizing original_language and genres shows which languages and genre combinations are most common in the dataset:
# Categorical summary for original_language
table(tv_data$original_language)
##
## aa ab af am ar as av az be bg bn bs ca
## 1 13 88 48 2638 5 2 11 3 93 229 29 156
## cn cs cy da de dv el en eo es et eu fa
## 1824 1382 30 2019 7712 2 903 76304 1 5602 52 12 244
## fi fr fy ga gd gl gu he hi ho hr ht hu
## 453 7290 3 25 4 18 20 553 1565 1 121 1 404
## hy hz id is it ja jv ka kk km kn ko ku
## 7 2 459 55 1691 14048 3 54 31 5 20 7820 10
## kv la lb ln lo lt lv mi mk ml mn mo mr
## 1 5 2 1 1 65 49 4 5 38 1 7 37
## ms mt my nb ne nl no or pa pl ps pt rm
## 495 20 112 80 6 2923 1210 14 26 923 1 3551 1
## ro ru se sh si sk sl so sq sr st sv sw
## 113 2963 1 65 6 404 50 2 20 152 1 1455 1
## ta te th ti tl tr ug uk ur uz vi xx za
## 114 84 1835 1 861 1748 1 118 333 1 205 28 1
## zh zu
## 14422 10
# Categorical summary for genres sorted in descending order
genres_summary <- table(tv_data$genres)
sorted_genres_summary <- sort(genres_summary, decreasing = TRUE)
# Display the top 100 categories
top_100_genres <- head(sorted_genres_summary, 100)
# Show the result
top_100_genres
##
## Documentary
## 17596
## Drama
## 16282
## Comedy
## 10304
## Reality
## 8009
## Animation
## 3326
## Comedy, Drama
## 1932
## Drama, Comedy
## 1878
## Talk
## 1872
## Animation, Comedy
## 1254
## Kids
## 1112
## Crime
## 1037
## News
## 1022
## Crime, Drama
## 985
## Family
## 980
## Drama, Crime
## 921
## Animation, Kids
## 908
## Drama, Family
## 886
## Drama, Mystery
## 778
## Documentary, Reality
## 758
## Action & Adventure
## 727
## Documentary, Crime
## 681
## Mystery
## 635
## Comedy, Family
## 558
## Sci-Fi & Fantasy
## 516
## Soap, Drama
## 512
## Drama, Soap
## 490
## Action & Adventure, Drama
## 479
## Drama, War & Politics
## 464
## Drama, Sci-Fi & Fantasy
## 458
## Reality, Documentary
## 444
## Soap
## 393
## Animation, Action & Adventure, Sci-Fi & Fantasy
## 385
## Family, Drama
## 377
## Drama, Action & Adventure
## 374
## Animation, Sci-Fi & Fantasy
## 367
## Mystery, Drama
## 363
## Kids, Animation
## 358
## Animation, Action & Adventure
## 356
## Documentary, War & Politics
## 351
## Crime, Drama, Mystery
## 312
## Family, Comedy
## 283
## Sci-Fi & Fantasy, Drama
## 281
## Comedy, Animation
## 269
## Comedy, Talk
## 264
## Animation, Drama
## 263
## Crime, Mystery
## 253
## Reality, Comedy
## 233
## Action & Adventure, Sci-Fi & Fantasy
## 225
## Comedy, Reality
## 223
## Crime, Documentary
## 222
## Comedy, Sci-Fi & Fantasy
## 212
## Mystery, Crime
## 203
## Animation, Comedy, Sci-Fi & Fantasy
## 199
## Documentary, Drama
## 194
## Drama, Crime, Mystery
## 170
## Animation, Comedy, Drama
## 165
## Documentary, Comedy
## 164
## War & Politics
## 164
## Reality, Talk
## 159
## War & Politics, Drama
## 158
## Talk, Comedy
## 151
## Animation, Sci-Fi & Fantasy, Action & Adventure
## 146
## Action & Adventure, Animation
## 137
## Comedy, Documentary
## 136
## Action & Adventure, Animation, Sci-Fi & Fantasy
## 133
## Western
## 133
## Reality, Family
## 130
## Family, Kids
## 129
## Kids, Family
## 126
## Documentary, News
## 123
## Animation, Family, Kids
## 118
## Comedy, Drama, Family
## 117
## Documentary, Talk
## 115
## Animation, Family
## 112
## Comedy, Crime
## 111
## Family, Reality
## 111
## Sci-Fi & Fantasy, Action & Adventure
## 111
## Animation, Comedy, Kids
## 107
## Drama, Mystery, Crime
## 106
## Talk, Reality
## 104
## Sci-Fi & Fantasy, Comedy
## 100
## Comedy, Mystery
## 96
## Crime, Mystery, Drama
## 95
## Action & Adventure, Crime
## 93
## Action & Adventure, Comedy
## 91
## Animation, Action & Adventure, Comedy
## 90
## Animation, Kids, Family
## 90
## Comedy, Action & Adventure
## 89
## Animation, Comedy, Action & Adventure
## 85
## Drama, Comedy, Family
## 83
## Soap, Comedy
## 83
## Action & Adventure, Drama, Sci-Fi & Fantasy
## 81
## Mystery, Sci-Fi & Fantasy
## 79
## Animation, Comedy, Family
## 78
## Drama, Mystery, Sci-Fi & Fantasy
## 78
## Documentary, Family
## 77
## Drama, Documentary
## 76
## Mystery, Drama, Crime
## 74
## Action & Adventure, Crime, Drama
## 71
## Sci-Fi & Fantasy, Animation
## 71
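The table above counts each distinct genres string, so "Comedy, Drama" and "Drama, Comedy" appear as separate categories. A minimal sketch of one way to count individual genres instead is shown below; the helper name count_single_genres and the toy vector are hypothetical, and the code assumes genres are separated by commas as in the output above.

```r
# Hypothetical helper: count individual genres from comma-separated strings.
count_single_genres <- function(genres) {
  genres <- genres[!is.na(genres)]                      # drop missing entries
  parts <- unlist(strsplit(genres, ",", fixed = TRUE))  # split on commas
  parts <- trimws(parts)                                # remove surrounding spaces
  parts <- parts[nzchar(parts)]                         # drop empty pieces
  sort(table(parts), decreasing = TRUE)
}

# Toy input, invented for illustration only
toy_genres <- c("Comedy, Drama", "Drama, Comedy", "Drama", NA, "Animation, Kids")
single_counts <- count_single_genres(toy_genres)
single_counts
```

Applied to tv_data$genres, this would fold order variants of the same combination into per-genre totals, at the cost of counting a multi-genre show once for each of its genres.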
Question: Does the average rating of a show differ based on its language?
# Aggregating vote_average by original_language
aggregate(tv_data$vote_average, by=list(Language=tv_data$original_language), FUN=mean)
## Language x
## 1 aa 0.00000000
## 2 ab 3.43592308
## 3 af 6.47646591
## 4 am 2.35833333
## 5 ar 2.12332563
## 6 as 7.20000000
## 7 av 0.00000000
## 8 az 2.81818182
## 9 be 2.66666667
## 10 bg 4.39077419
## 11 bn 2.97334934
## 12 bs 1.93103448
## 13 ca 2.19060256
## 14 cn 2.15544956
## 15 cs 2.13012084
## 16 cy 2.77633333
## 17 da 2.54437692
## 18 de 1.73996590
## 19 dv 0.00000000
## 20 el 3.35153599
## 21 en 2.12675084
## 22 eo 0.00000000
## 23 es 3.08211871
## 24 et 1.22500000
## 25 eu 3.20833333
## 26 fa 3.19272131
## 27 fi 3.17171302
## 28 fr 2.26773155
## 29 fy 0.00000000
## 30 ga 1.44000000
## 31 gd 0.00000000
## 32 gl 3.03072222
## 33 gu 0.80000000
## 34 he 2.41592948
## 35 hi 3.14637508
## 36 ho 0.00000000
## 37 hr 1.77797521
## 38 ht 0.00000000
## 39 hu 2.35064356
## 40 hy 0.00000000
## 41 hz 5.50000000
## 42 id 1.93763181
## 43 is 3.42360000
## 44 it 3.06983087
## 45 ja 3.69261817
## 46 jv 2.33333333
## 47 ka 3.78024074
## 48 kk 0.32258065
## 49 km 2.00000000
## 50 kn 2.04750000
## 51 ko 2.17177225
## 52 ku 8.00000000
## 53 kv 0.00000000
## 54 la 2.79560000
## 55 lb 3.38600000
## 56 ln 0.00000000
## 57 lo 0.00000000
## 58 lt 1.41538462
## 59 lv 3.25510204
## 60 mi 4.25000000
## 61 mk 3.40000000
## 62 ml 2.52631579
## 63 mn 10.00000000
## 64 mo 5.57142857
## 65 mr 0.97297297
## 66 ms 1.21360404
## 67 mt 0.75000000
## 68 my 0.08928571
## 69 nb 0.89583750
## 70 ne 4.66666667
## 71 nl 1.91611290
## 72 no 2.10793554
## 73 or 0.64285714
## 74 pa 2.42307692
## 75 pl 2.21658072
## 76 ps 0.00000000
## 77 pt 2.32868206
## 78 rm 0.00000000
## 79 ro 4.88791150
## 80 ru 2.97008437
## 81 se 10.00000000
## 82 sh 2.63590769
## 83 si 6.13333333
## 84 sk 1.25779455
## 85 sl 1.28000000
## 86 so 5.00000000
## 87 sq 5.05000000
## 88 sr 3.07360526
## 89 st 10.00000000
## 90 sv 2.80850447
## 91 sw 9.00000000
## 92 ta 3.10204386
## 93 te 3.17380952
## 94 th 2.59456512
## 95 ti 10.00000000
## 96 tl 1.54707201
## 97 tr 3.94363272
## 98 ug 0.00000000
## 99 uk 2.73947458
## 100 ur 6.54834835
## 101 uz 0.00000000
## 102 vi 1.82086829
## 103 xx 2.43632143
## 104 za 0.00000000
## 105 zh 1.70325135
## 106 zu 2.52000000
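These means should be read with caution. Several languages with a mean of exactly 10 (mn, se, st, ti) have only one show each in the language table above, and shows with a vote_average of 0 are included in every group. A hedged sketch of a more robust comparison follows: it keeps only shows with at least min_votes votes and languages with at least min_shows such shows. The function name, both thresholds, and the toy data are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical helper: mean rating per language on a filtered subset.
mean_rating_by_language <- function(data, min_votes = 1, min_shows = 2) {
  # Keep shows that actually received votes
  rated <- data[which(data$vote_count >= min_votes), ]
  # Keep languages with enough rated shows
  n_shows <- table(rated$original_language)
  keep <- names(n_shows)[n_shows >= min_shows]
  rated <- rated[rated$original_language %in% keep, ]
  out <- aggregate(vote_average ~ original_language, data = rated, FUN = mean)
  out$n_shows <- as.integer(n_shows[as.character(out$original_language)])
  out[order(-out$vote_average), ]
}

# Toy stand-in for tv_data; the values are invented for illustration only.
toy <- data.frame(
  original_language = c("en", "en", "en", "ja", "ja", "mn"),
  vote_count        = c(100, 0, 50, 20, 10, 1),
  vote_average      = c(8, 0, 6, 7, 9, 10)
)
result <- mean_rating_by_language(toy, min_votes = 1, min_shows = 2)
result
```

In the toy data the single "mn" show is dropped by the min_shows filter and the unrated "en" show no longer drags that language's mean toward 0. Suitable thresholds for the real dataset are a judgment call.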
# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)
ggplot(tv_data, aes(x = vote_average)) +
geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
labs(title = "Distribution of Vote Average", x = "Vote Average", y = "Count") +
theme_minimal()
# Scatter plot to show correlation between number_of_episodes and vote_average
# Color is used to differentiate between different original languages
ggplot(tv_data, aes(x = number_of_episodes, y = vote_average, color = original_language)) +
geom_point(alpha = 0.7) +
labs(title = "Number of Episodes vs. Vote Average",
x = "Number of Episodes",
y = "Vote Average") +
theme_minimal() +
scale_color_viridis_d()
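A scatter plot shows the shape of the relationship but does not quantify it. As a hedged sketch, the correlation could be computed directly with cor(). Spearman's rank correlation is used here as one reasonable choice (not the author's stated method), because the episode counts are heavily skewed, with a maximum of 20839 against a median of 6. The function name, the filtering rules, and the toy data are assumptions for illustration.

```r
# Hypothetical helper: rank correlation between episode count and rating,
# restricted to shows that have at least one vote and one episode.
episode_rating_correlation <- function(data, method = "spearman") {
  rated <- data[which(data$vote_count > 0 & data$number_of_episodes > 0), ]
  cor(rated$number_of_episodes, rated$vote_average,
      method = method, use = "complete.obs")
}

# Toy stand-in for tv_data; the values are invented for illustration only.
toy <- data.frame(
  number_of_episodes = c(10, 20, 30, 40, 0),
  vote_count         = c(5, 5, 5, 5, 0),
  vote_average       = c(6, 7, 9, 8, 0)
)
rho <- episode_rating_correlation(toy)
rho
```

With the real data, the same call on tv_data would return a single coefficient between -1 and 1. Note also that mapping color to all 106 language codes makes the scatter plot legend very large, so restricting the plot to the most common languages may make it easier to read.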