1. Introduction

This data dive explores a dataset of TV shows from TMDB (The Movie Database), an online community that collects information on TV shows, movies, and more. The dataset contains detailed information on various TV shows, including metadata like the number of seasons, episodes, languages, networks, ratings, and genres.

Objective of the Analysis

The purpose of this analysis is to understand the characteristics of the TV shows in the dataset and investigate patterns and trends in the data. Specifically, the analysis will:

1. Perform summary statistics to get an overview of the numeric and categorical data in the dataset.

2. Investigate relationships between certain key variables to answer some analytical questions about TV shows, such as:

- Do certain languages or genres tend to have higher ratings?

- Is there any correlation between the number of episodes and the average rating of a TV show?

- What are the trends in the distribution of genres across different networks?

3. Visualize the data to better understand its distribution and identify trends or patterns in the relationships between key variables, such as ratings and episode counts.

Dataset Overview

The dataset has 29 columns. The ones most relevant to this analysis are listed below, using the column names as they appear in the data:

- id: Unique identifier for the TV show.

- name: The name of the TV show.

- number_of_seasons: The number of seasons the TV show has aired.

- number_of_episodes: The total number of episodes across all seasons.

- original_language: The primary language in which the show is produced.

- vote_count: The number of votes the show has received.

- vote_average: The average rating of the show, calculated based on user votes.

- overview: A brief description of the TV show.

- adult: A boolean flag indicating whether the show is for adult audiences.

- tagline: A short slogan or phrase associated with the show.

- genres: The genre(s) associated with the show (e.g., Drama, Crime, Sci-Fi).

- networks: The networks where the show is broadcast.

- spoken_languages: The languages spoken in the show.

- production_companies: The companies involved in producing the show.

- production_countries: The countries where the show was produced.

- episode_run_time: The typical run time of episodes for the show, if available.

Goals for This Data Dive

The analysis will focus on understanding numeric and categorical features of the TV shows, followed by exploring relationships between some variables. The following questions will be investigated:

1. Does the average rating of a show differ based on its primary language?

2. Is there a correlation between the number of episodes a show has and its average rating?

3. How are different genres distributed across TV networks, and what are the most popular genres for each network?


2. Loading and Inspecting the Dataset

Let’s load the dataset into R and inspect its structure to become familiar with the available columns and their types.

options(repos = c(CRAN = "https://cran.rstudio.com/"))

# Install Readr
install.packages("readr")
## 
## The downloaded binary packages are in
##  /var/folders/hm/vdxhq1nj0b93vcvx_r77cxt00000gp/T//RtmpC5t5Rx/downloaded_packages
# Load necessary libraries
library(readr)

# Load the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")
## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the dataset
head(tv_data)
## # A tibble: 6 × 29
##      id name   number_of_seasons number_of_episodes original_language vote_count
##   <dbl> <chr>              <dbl>              <dbl> <chr>                  <dbl>
## 1  1399 Game …                 8                 73 en                     21857
## 2 71446 Money…                 3                 41 es                     17836
## 3 66732 Stran…                 4                 34 en                     16161
## 4  1402 The W…                11                177 en                     15432
## 5 63174 Lucif…                 6                 93 en                     13870
## 6 69050 River…                 7                137 en                     13180
## # ℹ 23 more variables: vote_average <dbl>, overview <chr>, adult <lgl>,
## #   backdrop_path <chr>, first_air_date <date>, last_air_date <date>,
## #   homepage <chr>, in_production <lgl>, original_name <chr>, popularity <dbl>,
## #   poster_path <chr>, type <chr>, status <chr>, tagline <chr>, genres <chr>,
## #   created_by <chr>, languages <chr>, networks <chr>, origin_country <chr>,
## #   spoken_languages <chr>, production_companies <chr>,
## #   production_countries <chr>, episode_run_time <dbl>
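
Beyond head(), a quick look at dimensions, column types, and missing values often helps before summarizing. The sketch below uses a small made-up data frame called toy as a stand-in for tv_data (the values are hypothetical); the same calls should apply to the real tibble.

```r
# Toy stand-in for tv_data; all values are invented for illustration.
toy <- data.frame(
  name = c("Show A", "Show B", "Show C"),
  number_of_episodes = c(10, NA, 3),
  tagline = c(NA, NA, "A tagline"),
  stringsAsFactors = FALSE
)

dim(toy)                          # number of rows and columns
sapply(toy, class)                # type of each column
na_counts <- colSums(is.na(toy))  # missing values per column
na_counts
```

Columns with many missing values (free-text fields such as tagline are likely candidates) are worth noting before they are used in any comparison.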

3. Numeric Summary

Summarizing number_of_episodes and vote_average shows the range, center, and skew of each variable and flags values that may need cleaning before further analysis.

# Summary for number_of_episodes
summary(tv_data$number_of_episodes)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     1.00     6.00    24.46    20.00 20839.00

The summary of number_of_episodes shows a wide range in the number of episodes among the TV shows in the dataset. The minimum value of 0 indicates that there are shows with no episodes, which could reflect either incomplete data or shows that were produced but never aired.

The mean (24.46) is about four times the median (6), and the maximum is 20,839 episodes. Together these point to a strongly right-skewed distribution: a small number of shows with very many episodes pull the average up.
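
One way to look at the skew more closely is to compare the mean and median, inspect the upper quantiles, and consider a log transform. The sketch below uses a short hypothetical vector rather than the real column; with the dataset loaded, tv_data$number_of_episodes could be substituted for episodes.

```r
# Hypothetical episode counts with one extreme value.
episodes <- c(0, 1, 1, 6, 8, 20, 24, 500)

# A mean far above the median is a sign of right skew.
episode_stats <- c(median = median(episodes), mean = mean(episodes))
episode_stats

# Upper quantiles show how far the tail extends.
upper_tail <- quantile(episodes, probs = c(0.90, 0.99))
upper_tail

# log1p(x) = log(1 + x) compresses the tail and is defined at 0.
log_episodes <- log1p(episodes)
summary(log_episodes)
```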

# Summary for vote_average
summary(tv_data$vote_average)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.334   6.000  10.000

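The summary statistics for vote_average show that the minimum, first quartile, and median are all 0, so at least half of the shows have a vote_average of 0. The mean is 2.334, the third quartile is 6, and the maximum is 10.

A vote_average of 0 most likely marks a show that has received no votes rather than one that was rated poorly, although this should be confirmed against vote_count. Under that reading, the distribution suggests that most shows in the dataset are unrated, with a smaller group carrying meaningful ratings. This could indicate that the dataset includes many obscure or less popular shows.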
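
If zeros in vote_average mostly represent shows with no votes, it may be more informative to summarize rated shows separately. The sketch below assumes that vote_count == 0 identifies unrated shows, which is an assumption about the data rather than something documented here, and uses a toy data frame with invented values.

```r
# Toy stand-in for tv_data; values are invented.
toy <- data.frame(
  vote_count   = c(0, 0, 0, 12, 250),
  vote_average = c(0, 0, 0, 6.5, 8.1)
)

# Assumption: a show with zero votes is unrated, not rated 0.
share_unrated <- mean(toy$vote_count == 0)
share_unrated

# Keep only shows with at least one vote before summarizing ratings.
rated <- subset(toy, vote_count > 0)
summary(rated$vote_average)
```

Comparing summary(toy$vote_average) with the rated-only summary shows how strongly the zeros pull the quartiles and the mean down.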

4. Categorical Summary

Summarizing original_language and genres shows which categories dominate the dataset and which are too sparse to support reliable comparisons.

# Categorical summary for original_language
table(tv_data$original_language)
## 
##    aa    ab    af    am    ar    as    av    az    be    bg    bn    bs    ca 
##     1    13    88    48  2638     5     2    11     3    93   229    29   156 
##    cn    cs    cy    da    de    dv    el    en    eo    es    et    eu    fa 
##  1824  1382    30  2019  7712     2   903 76304     1  5602    52    12   244 
##    fi    fr    fy    ga    gd    gl    gu    he    hi    ho    hr    ht    hu 
##   453  7290     3    25     4    18    20   553  1565     1   121     1   404 
##    hy    hz    id    is    it    ja    jv    ka    kk    km    kn    ko    ku 
##     7     2   459    55  1691 14048     3    54    31     5    20  7820    10 
##    kv    la    lb    ln    lo    lt    lv    mi    mk    ml    mn    mo    mr 
##     1     5     2     1     1    65    49     4     5    38     1     7    37 
##    ms    mt    my    nb    ne    nl    no    or    pa    pl    ps    pt    rm 
##   495    20   112    80     6  2923  1210    14    26   923     1  3551     1 
##    ro    ru    se    sh    si    sk    sl    so    sq    sr    st    sv    sw 
##   113  2963     1    65     6   404    50     2    20   152     1  1455     1 
##    ta    te    th    ti    tl    tr    ug    uk    ur    uz    vi    xx    za 
##   114    84  1835     1   861  1748     1   118   333     1   205    28     1 
##    zh    zu 
## 14422    10

The distribution of TV shows by original_language reveals that the dataset is heavily dominated by English-language shows, with 76,304 entries, or roughly 45% of the 168,639 rows. This is not surprising, given that English content often has a broader global reach and is widely produced.

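Some other languages with significant representation include:

- Chinese (zh): 14,422 shows

- Japanese (ja): 14,048 shows

- Korean (ko): 7,820 shows

- German (de): 7,712 shows

- French (fr): 7,290 shows

- Spanish (es): 5,602 shows

Smaller languages such as Arabic (ar, 2,638 shows) and Russian (ru, 2,963 shows) have moderate representation. However, there are numerous languages with only a handful of shows, such as Uzbek (uz), Zhuang (za), and Avaric (av), each having just one or two entries.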

The dominance of certain languages like English and Japanese could reflect the international popularity of shows from these regions, while the underrepresentation of other languages might suggest limited global production or distribution of content in those languages.
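
The raw table above is ordered alphabetically, which makes the largest languages hard to spot. A minimal sketch of sorting the counts and converting them to percentages follows; langs is a hypothetical vector standing in for tv_data$original_language.

```r
# Hypothetical language codes standing in for tv_data$original_language.
langs <- c("en", "en", "en", "ja", "zh", "en", "ja", "ko")

# Sort counts from most to least frequent.
lang_counts <- sort(table(langs), decreasing = TRUE)

# Convert counts to percentages of all shows.
lang_share <- round(100 * prop.table(lang_counts), 1)

head(lang_counts, 3)
head(lang_share, 3)
```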

# Categorical summary for genres sorted in descending order
genres_summary <- table(tv_data$genres)
sorted_genres_summary <- sort(genres_summary, decreasing = TRUE)

# Display the top 100 categories
top_100_genres <- head(sorted_genres_summary, 100)

# Show the result
top_100_genres
## 
##                                     Documentary 
##                                           17596 
##                                           Drama 
##                                           16282 
##                                          Comedy 
##                                           10304 
##                                         Reality 
##                                            8009 
##                                       Animation 
##                                            3326 
##                                   Comedy, Drama 
##                                            1932 
##                                   Drama, Comedy 
##                                            1878 
##                                            Talk 
##                                            1872 
##                               Animation, Comedy 
##                                            1254 
##                                            Kids 
##                                            1112 
##                                           Crime 
##                                            1037 
##                                            News 
##                                            1022 
##                                    Crime, Drama 
##                                             985 
##                                          Family 
##                                             980 
##                                    Drama, Crime 
##                                             921 
##                                 Animation, Kids 
##                                             908 
##                                   Drama, Family 
##                                             886 
##                                  Drama, Mystery 
##                                             778 
##                            Documentary, Reality 
##                                             758 
##                              Action & Adventure 
##                                             727 
##                              Documentary, Crime 
##                                             681 
##                                         Mystery 
##                                             635 
##                                  Comedy, Family 
##                                             558 
##                                Sci-Fi & Fantasy 
##                                             516 
##                                     Soap, Drama 
##                                             512 
##                                     Drama, Soap 
##                                             490 
##                       Action & Adventure, Drama 
##                                             479 
##                           Drama, War & Politics 
##                                             464 
##                         Drama, Sci-Fi & Fantasy 
##                                             458 
##                            Reality, Documentary 
##                                             444 
##                                            Soap 
##                                             393 
## Animation, Action & Adventure, Sci-Fi & Fantasy 
##                                             385 
##                                   Family, Drama 
##                                             377 
##                       Drama, Action & Adventure 
##                                             374 
##                     Animation, Sci-Fi & Fantasy 
##                                             367 
##                                  Mystery, Drama 
##                                             363 
##                                 Kids, Animation 
##                                             358 
##                   Animation, Action & Adventure 
##                                             356 
##                     Documentary, War & Politics 
##                                             351 
##                           Crime, Drama, Mystery 
##                                             312 
##                                  Family, Comedy 
##                                             283 
##                         Sci-Fi & Fantasy, Drama 
##                                             281 
##                               Comedy, Animation 
##                                             269 
##                                    Comedy, Talk 
##                                             264 
##                                Animation, Drama 
##                                             263 
##                                  Crime, Mystery 
##                                             253 
##                                 Reality, Comedy 
##                                             233 
##            Action & Adventure, Sci-Fi & Fantasy 
##                                             225 
##                                 Comedy, Reality 
##                                             223 
##                              Crime, Documentary 
##                                             222 
##                        Comedy, Sci-Fi & Fantasy 
##                                             212 
##                                  Mystery, Crime 
##                                             203 
##             Animation, Comedy, Sci-Fi & Fantasy 
##                                             199 
##                              Documentary, Drama 
##                                             194 
##                           Drama, Crime, Mystery 
##                                             170 
##                        Animation, Comedy, Drama 
##                                             165 
##                             Documentary, Comedy 
##                                             164 
##                                  War & Politics 
##                                             164 
##                                   Reality, Talk 
##                                             159 
##                           War & Politics, Drama 
##                                             158 
##                                    Talk, Comedy 
##                                             151 
## Animation, Sci-Fi & Fantasy, Action & Adventure 
##                                             146 
##                   Action & Adventure, Animation 
##                                             137 
##                             Comedy, Documentary 
##                                             136 
## Action & Adventure, Animation, Sci-Fi & Fantasy 
##                                             133 
##                                         Western 
##                                             133 
##                                 Reality, Family 
##                                             130 
##                                    Family, Kids 
##                                             129 
##                                    Kids, Family 
##                                             126 
##                               Documentary, News 
##                                             123 
##                         Animation, Family, Kids 
##                                             118 
##                           Comedy, Drama, Family 
##                                             117 
##                               Documentary, Talk 
##                                             115 
##                               Animation, Family 
##                                             112 
##                                   Comedy, Crime 
##                                             111 
##                                 Family, Reality 
##                                             111 
##            Sci-Fi & Fantasy, Action & Adventure 
##                                             111 
##                         Animation, Comedy, Kids 
##                                             107 
##                           Drama, Mystery, Crime 
##                                             106 
##                                   Talk, Reality 
##                                             104 
##                        Sci-Fi & Fantasy, Comedy 
##                                             100 
##                                 Comedy, Mystery 
##                                              96 
##                           Crime, Mystery, Drama 
##                                              95 
##                       Action & Adventure, Crime 
##                                              93 
##                      Action & Adventure, Comedy 
##                                              91 
##           Animation, Action & Adventure, Comedy 
##                                              90 
##                         Animation, Kids, Family 
##                                              90 
##                      Comedy, Action & Adventure 
##                                              89 
##           Animation, Comedy, Action & Adventure 
##                                              85 
##                           Drama, Comedy, Family 
##                                              83 
##                                    Soap, Comedy 
##                                              83 
##     Action & Adventure, Drama, Sci-Fi & Fantasy 
##                                              81 
##                       Mystery, Sci-Fi & Fantasy 
##                                              79 
##                       Animation, Comedy, Family 
##                                              78 
##                Drama, Mystery, Sci-Fi & Fantasy 
##                                              78 
##                             Documentary, Family 
##                                              77 
##                              Drama, Documentary 
##                                              76 
##                           Mystery, Drama, Crime 
##                                              74 
##                Action & Adventure, Crime, Drama 
##                                              71 
##                     Sci-Fi & Fantasy, Animation 
##                                              71

The genre distribution is dominated by a few key genres:

- Documentary: 17,596 shows

- Drama: 16,282 shows

- Comedy: 10,304 shows

These three genres alone account for a large portion of the dataset, suggesting that factual content, character-driven stories, and humor are the most common types of TV shows.

Other genres like Reality (8,009 shows), Animation (3,326 shows), and combinations such as Comedy, Drama (1,932 shows) are also well-represented.

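Additionally, genre combinations like "Animation, Comedy" and "Crime, Drama" suggest that multi-genre shows are fairly common, particularly in the case of dramatic and comedic content. Some niche combinations like "Animation, Sci-Fi & Fantasy" are less frequent but still present. Note that the table counts each exact string separately, so "Comedy, Drama" (1,932 shows) and "Drama, Comedy" (1,878 shows) appear as different categories even though they describe the same pair of genres.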

The over-representation of genres like Documentary and Drama could reflect the global trend toward factual content and serialized storytelling, while more niche genres like Mystery and War & Politics appear less frequently, indicating that they cater to smaller, more specialized audiences.
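
Because table() treats every distinct genre string as its own category, one option is to split each string into individual genres and count those instead. The sketch below assumes that genres are separated by a comma followed by a space, as the output above suggests, and uses a short hypothetical vector in place of tv_data$genres.

```r
# Hypothetical genre strings in the same format as tv_data$genres.
genres <- c("Comedy, Drama", "Drama, Comedy", "Drama",
            "Action & Adventure, Drama", NA)

# Split on ", " (assumed separator) after dropping missing values.
genre_list <- strsplit(genres[!is.na(genres)], ", ", fixed = TRUE)

# Count each individual genre regardless of its position in the string.
genre_counts <- sort(table(unlist(genre_list)), decreasing = TRUE)
genre_counts
```

With this approach a show tagged with several genres contributes to each of them, so the counts sum to more than the number of shows.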

5. Aggregate Function

Question: Does the average rating of a show differ based on its language?

# Aggregating vote_average by original_language
aggregate(tv_data$vote_average, by=list(Language=tv_data$original_language), FUN=mean)
##     Language           x
## 1         aa  0.00000000
## 2         ab  3.43592308
## 3         af  6.47646591
## 4         am  2.35833333
## 5         ar  2.12332563
## 6         as  7.20000000
## 7         av  0.00000000
## 8         az  2.81818182
## 9         be  2.66666667
## 10        bg  4.39077419
## 11        bn  2.97334934
## 12        bs  1.93103448
## 13        ca  2.19060256
## 14        cn  2.15544956
## 15        cs  2.13012084
## 16        cy  2.77633333
## 17        da  2.54437692
## 18        de  1.73996590
## 19        dv  0.00000000
## 20        el  3.35153599
## 21        en  2.12675084
## 22        eo  0.00000000
## 23        es  3.08211871
## 24        et  1.22500000
## 25        eu  3.20833333
## 26        fa  3.19272131
## 27        fi  3.17171302
## 28        fr  2.26773155
## 29        fy  0.00000000
## 30        ga  1.44000000
## 31        gd  0.00000000
## 32        gl  3.03072222
## 33        gu  0.80000000
## 34        he  2.41592948
## 35        hi  3.14637508
## 36        ho  0.00000000
## 37        hr  1.77797521
## 38        ht  0.00000000
## 39        hu  2.35064356
## 40        hy  0.00000000
## 41        hz  5.50000000
## 42        id  1.93763181
## 43        is  3.42360000
## 44        it  3.06983087
## 45        ja  3.69261817
## 46        jv  2.33333333
## 47        ka  3.78024074
## 48        kk  0.32258065
## 49        km  2.00000000
## 50        kn  2.04750000
## 51        ko  2.17177225
## 52        ku  8.00000000
## 53        kv  0.00000000
## 54        la  2.79560000
## 55        lb  3.38600000
## 56        ln  0.00000000
## 57        lo  0.00000000
## 58        lt  1.41538462
## 59        lv  3.25510204
## 60        mi  4.25000000
## 61        mk  3.40000000
## 62        ml  2.52631579
## 63        mn 10.00000000
## 64        mo  5.57142857
## 65        mr  0.97297297
## 66        ms  1.21360404
## 67        mt  0.75000000
## 68        my  0.08928571
## 69        nb  0.89583750
## 70        ne  4.66666667
## 71        nl  1.91611290
## 72        no  2.10793554
## 73        or  0.64285714
## 74        pa  2.42307692
## 75        pl  2.21658072
## 76        ps  0.00000000
## 77        pt  2.32868206
## 78        rm  0.00000000
## 79        ro  4.88791150
## 80        ru  2.97008437
## 81        se 10.00000000
## 82        sh  2.63590769
## 83        si  6.13333333
## 84        sk  1.25779455
## 85        sl  1.28000000
## 86        so  5.00000000
## 87        sq  5.05000000
## 88        sr  3.07360526
## 89        st 10.00000000
## 90        sv  2.80850447
## 91        sw  9.00000000
## 92        ta  3.10204386
## 93        te  3.17380952
## 94        th  2.59456512
## 95        ti 10.00000000
## 96        tl  1.54707201
## 97        tr  3.94363272
## 98        ug  0.00000000
## 99        uk  2.73947458
## 100       ur  6.54834835
## 101       uz  0.00000000
## 102       vi  1.82086829
## 103       xx  2.43632143
## 104       za  0.00000000
## 105       zh  1.70325135
## 106       zu  2.52000000

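The average ratings by original_language show notable differences across languages, but they need careful reading:

- The highest averages come from languages with almost no data. The four languages with a mean of 10 (mn, se, st, ti) and the one with a mean of 9 (sw) each have a single show.

- Among well-represented languages, Turkish (tr, 3.94) and Japanese (ja, 3.69) rank higher than Spanish (es, 3.08), English (en, 2.13), German (de, 1.74), and Chinese (zh, 1.70).

- Every mean includes the shows with a vote_average of 0, so a low average may largely reflect a high share of unrated shows rather than poorly rated ones.

These differences could reflect varying production standards, audience preferences, or voting behaviors based on region, but the table alone cannot separate those explanations from differences in how many shows per language have been rated at all.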
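
A per-language mean is easier to interpret alongside the number of shows behind it. The following sketch computes both for rated shows only and drops languages below a minimum count; the toy data, the vote_count > 0 filter, and the min_rated threshold are all illustrative assumptions rather than part of the original analysis.

```r
# Toy stand-in for tv_data; values are invented.
toy <- data.frame(
  original_language = c("en", "en", "en", "ja", "ja", "mn"),
  vote_count        = c(100, 0, 50, 30, 20, 1),
  vote_average      = c(7, 0, 6, 8, 7, 10)
)

# Assumption: shows with zero votes are unrated and are excluded.
rated <- subset(toy, vote_count > 0)

# Mean rating and number of rated shows per language.
mean_rating <- tapply(rated$vote_average, rated$original_language, mean)
n_rated     <- tapply(rated$vote_average, rated$original_language, length)
by_lang <- data.frame(
  language    = names(mean_rating),
  mean_rating = as.vector(mean_rating),
  n_rated     = as.vector(n_rated)
)

# Hypothetical threshold: keep languages with at least this many rated shows.
min_rated <- 2
reliable <- by_lang[by_lang$n_rated >= min_rated, ]
ranked <- reliable[order(-reliable$mean_rating), ]
ranked
```

On the real data a larger threshold would probably be more appropriate; the value 2 is used here only so the toy example has something to filter.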

6. Visual Summaries

  1. Distribution of Vote Average
install.packages("ggplot2")
## 
## The downloaded binary packages are in
##  /var/folders/hm/vdxhq1nj0b93vcvx_r77cxt00000gp/T//RtmpC5t5Rx/downloaded_packages
library(ggplot2)
ggplot(tv_data, aes(x = vote_average)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Vote Average", x = "Vote Average", y = "Count") +
  theme_minimal()

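The histogram visualizes the distribution of vote_average for TV shows in the dataset. Because the first quartile and the median of vote_average are both 0, the tallest bar by a wide margin is expected at 0, with the remaining shows spread across the rest of the 0 to 10 scale.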

  2. Scatter Plot of Number of Episodes vs. Vote Average, Colored by Original Language
# Scatter plot to show correlation between number_of_episodes and vote_average
# Color is used to differentiate between different original languages
ggplot(tv_data, aes(x = number_of_episodes, y = vote_average, color = original_language)) +
  geom_point(alpha = 0.7) +
  labs(title = "Number of Episodes vs. Vote Average", 
       x = "Number of Episodes", 
       y = "Vote Average") +
  theme_minimal() +
  scale_color_viridis_d()

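The scatter plot illustrates the relationship between the number of episodes and the vote average for TV shows, with points colored by their original language. Two caveats follow from the summaries above. First, the x-axis extends to the 20,839-episode maximum while the median show has 6 episodes, so most points are crowded against the left edge. Second, the aggregate table lists 106 language codes, which is far more than a color scale can distinguish reliably.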
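
To put a number on the episodes-versus-rating question, a rank correlation is one reasonable choice, since it is less sensitive to the extreme episode counts than Pearson's r. The sketch below uses invented values and assumes that shows with zero votes or zero episodes should be excluded; neither choice comes from the original analysis.

```r
# Toy stand-in for tv_data; values are invented.
toy <- data.frame(
  number_of_episodes = c(6, 10, 24, 120, 8, 3000),
  vote_count         = c(40, 12, 300, 80, 0, 5),
  vote_average       = c(7.2, 6.5, 8.0, 7.1, 0, 5.5)
)

# Assumption: drop unrated shows and shows with no episodes.
rated <- subset(toy, vote_count > 0 & number_of_episodes > 0)

# Spearman's rho works on ranks, so the 3000-episode outlier has
# no more influence than any other point.
rho <- cor(rated$number_of_episodes, rated$vote_average,
           method = "spearman")
rho
```

A value near 0 would suggest little monotonic relationship between episode count and rating; the sign and size on the real data would need to be checked rather than assumed.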