Week 2 - Saransh Gupta

1. Introduction

This data dive explores a dataset of TV shows from TMDB (The Movie Database), an online community that collects information on TV shows, movies, and more. The dataset contains detailed information on various TV shows, including metadata like the number of seasons, episodes, languages, networks, ratings, and genres.

Objective of the Analysis

The purpose of this analysis is to understand the characteristics of the TV shows in the dataset and investigate patterns and trends in the data. Specifically, the analysis will:

1. Compute summary statistics to get an overview of the numeric and categorical data in the dataset.

2. Investigate relationships between key variables to answer analytical questions about TV shows, such as:

- Do certain languages or genres tend to have higher ratings?

- Is there any correlation between the number of episodes and the average rating of a TV show?

- What are the trends in the distribution of genres across different networks?

3. Visualize the data to better understand its distribution and identify trends or patterns in the relationships between key variables, such as ratings and episode counts.

Dataset Overview

The dataset has 29 columns. The ones most relevant to this analysis are listed below, using the column names as they appear in the data:

- id: Unique identifier for the TV show.

- name: The name of the TV show.

- number_of_seasons: The number of seasons the TV show has aired.

- number_of_episodes: The total number of episodes across all seasons.

- original_language: The primary language in which the show is produced.

- vote_count: The number of votes the show has received.

- vote_average: The average rating of the show, calculated from user votes.

- overview: A brief description of the TV show.

- adult: A boolean flag indicating whether the show is for adult audiences.

- tagline: A short slogan or phrase associated with the show.

- genres: The genre(s) associated with the show (e.g., Drama, Crime, Sci-Fi & Fantasy), stored as a single comma-separated string.

- networks: The networks where the show is broadcast.

- spoken_languages: The languages spoken in the show.

- production_companies: The companies involved in producing the show.

- production_countries: The countries where the show was produced.

- episode_run_time: The typical run time of episodes for the show, if available.

Goals for This Data Dive

The analysis will focus on understanding numeric and categorical features of the TV shows, followed by exploring relationships between some variables. The following questions will be investigated:

1. Does the average rating of a show differ based on its primary language?

2. Is there a correlation between the number of episodes a show has and its average rating?

3. How are different genres distributed across TV networks, and what are the most popular genres for each network?

2. Loading and Inspecting the Dataset

Let’s load the dataset into R and inspect its structure to become familiar with the available columns and their types.

options(repos = c(CRAN = "https://cran.rstudio.com/"))

# Install readr (only needed once)
install.packages("readr")

# Load necessary libraries
library(readr)

# Load the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows of the dataset
head(tv_data)

## # A tibble: 6 × 29
##      id name   number_of_seasons number_of_episodes original_language vote_count
##   <dbl> <chr>              <dbl>              <dbl> <chr>                  <dbl>
## 1  1399 Game …                 8                 73 en                     21857
## 2 71446 Money…                 3                 41 es                     17836
## 3 66732 Stran…                 4                 34 en                     16161
## 4  1402 The W…                11                177 en                     15432
## 5 63174 Lucif…                 6                 93 en                     13870
## 6 69050 River…                 7                137 en                     13180
## # ℹ 23 more variables: vote_average <dbl>, overview <chr>, adult <lgl>,
## #   backdrop_path <chr>, first_air_date <date>, last_air_date <date>,
## #   homepage <chr>, in_production <lgl>, original_name <chr>, popularity <dbl>,
## #   poster_path <chr>, type <chr>, status <chr>, tagline <chr>, genres <chr>,
## #   created_by <chr>, languages <chr>, networks <chr>, origin_country <chr>,
## #   spoken_languages <chr>, production_companies <chr>,
## #   production_countries <chr>, episode_run_time <dbl>

3. Numeric Summary

Summarizing number_of_episodes and vote_average will help in understanding:

How varied the TV shows are in terms of the number of episodes.

# Summary for number_of_episodes
summary(tv_data$number_of_episodes)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     1.00     6.00    24.46    20.00 20839.00

The central tendency and distribution of ratings across TV shows, which could help identify popular or highly rated shows.

# Summary for vote_average
summary(tv_data$vote_average)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.334   6.000  10.000
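The median of 0 in the summary above means that at least half of the shows have a vote_average of 0. A plausible explanation is that these shows have received no votes at all, rather than having been rated zero. The sketch below, which uses a small made-up data frame (`toy` and its values are illustrative assumptions, not rows from the dataset), shows one way to measure the share of unrated shows and to summarize ratings only for shows with at least one vote.

```r
# Toy data standing in for tv_data; the values are illustrative only
toy <- data.frame(
  vote_count   = c(0, 0, 0, 12, 250, 3),
  vote_average = c(0, 0, 0, 6.5, 8.1, 7.0)
)

# Share of shows that have received no votes
share_unrated <- mean(toy$vote_count == 0)

# Keep only shows with at least one vote;
# which() also drops rows where vote_count is missing
rated <- toy[which(toy$vote_count > 0), ]

summary(toy$vote_average)    # dominated by the zeros
summary(rated$vote_average)  # reflects shows that were actually rated
```

Applied to tv_data, the same filter would show whether the low mean rating is driven by unrated shows or by genuinely low ratings.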

4. Categorical Summary

By summarizing original language and genres, we want to:

Understand the distribution of TV shows by language. Are certain languages more common than others?

# Categorical summary for original_language
table(tv_data$original_language)

## 
##    aa    ab    af    am    ar    as    av    az    be    bg    bn    bs    ca 
##     1    13    88    48  2638     5     2    11     3    93   229    29   156 
##    cn    cs    cy    da    de    dv    el    en    eo    es    et    eu    fa 
##  1824  1382    30  2019  7712     2   903 76304     1  5602    52    12   244 
##    fi    fr    fy    ga    gd    gl    gu    he    hi    ho    hr    ht    hu 
##   453  7290     3    25     4    18    20   553  1565     1   121     1   404 
##    hy    hz    id    is    it    ja    jv    ka    kk    km    kn    ko    ku 
##     7     2   459    55  1691 14048     3    54    31     5    20  7820    10 
##    kv    la    lb    ln    lo    lt    lv    mi    mk    ml    mn    mo    mr 
##     1     5     2     1     1    65    49     4     5    38     1     7    37 
##    ms    mt    my    nb    ne    nl    no    or    pa    pl    ps    pt    rm 
##   495    20   112    80     6  2923  1210    14    26   923     1  3551     1 
##    ro    ru    se    sh    si    sk    sl    so    sq    sr    st    sv    sw 
##   113  2963     1    65     6   404    50     2    20   152     1  1455     1 
##    ta    te    th    ti    tl    tr    ug    uk    ur    uz    vi    xx    za 
##   114    84  1835     1   861  1748     1   118   333     1   205    28     1 
##    zh    zu 
## 14422    10
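The alphabetical table above is hard to scan, although it does show that English (en) accounts for 76,304 shows, roughly 45% of the 168,639 rows, followed by zh (14,422) and ja (14,048). A minimal sketch of sorting the counts and converting them to percentages is given below; the `langs` vector is made-up toy data used only to keep the example self-contained.

```r
# Toy language codes standing in for tv_data$original_language
langs <- c("en", "en", "en", "ja", "ja", "zh", "ko", "en", "zh", "fr")

# Counts sorted from most to least common
lang_counts <- sort(table(langs), decreasing = TRUE)

# Convert the counts to percentages of all shows
lang_share <- round(100 * prop.table(lang_counts), 1)

head(lang_counts, 3)
head(lang_share, 3)
```

Replacing `langs` with `tv_data$original_language` would give the same ranking for the full dataset.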

Analyze the diversity of genres in the dataset. Which genres are most popular, and do they have any relationship to other variables like ratings?

# Categorical summary for genres sorted in descending order
genres_summary <- table(tv_data$genres)
sorted_genres_summary <- sort(genres_summary, decreasing = TRUE)

# Display the top 100 categories
top_100_genres <- head(sorted_genres_summary, 100)

# Show the result
top_100_genres

## 
##                                     Documentary 
##                                           17596 
##                                           Drama 
##                                           16282 
##                                          Comedy 
##                                           10304 
##                                         Reality 
##                                            8009 
##                                       Animation 
##                                            3326 
##                                   Comedy, Drama 
##                                            1932 
##                                   Drama, Comedy 
##                                            1878 
##                                            Talk 
##                                            1872 
##                               Animation, Comedy 
##                                            1254 
##                                            Kids 
##                                            1112 
##                                           Crime 
##                                            1037 
##                                            News 
##                                            1022 
##                                    Crime, Drama 
##                                             985 
##                                          Family 
##                                             980 
##                                    Drama, Crime 
##                                             921 
##                                 Animation, Kids 
##                                             908 
##                                   Drama, Family 
##                                             886 
##                                  Drama, Mystery 
##                                             778 
##                            Documentary, Reality 
##                                             758 
##                              Action & Adventure 
##                                             727 
##                              Documentary, Crime 
##                                             681 
##                                         Mystery 
##                                             635 
##                                  Comedy, Family 
##                                             558 
##                                Sci-Fi & Fantasy 
##                                             516 
##                                     Soap, Drama 
##                                             512 
##                                     Drama, Soap 
##                                             490 
##                       Action & Adventure, Drama 
##                                             479 
##                           Drama, War & Politics 
##                                             464 
##                         Drama, Sci-Fi & Fantasy 
##                                             458 
##                            Reality, Documentary 
##                                             444 
##                                            Soap 
##                                             393 
## Animation, Action & Adventure, Sci-Fi & Fantasy 
##                                             385 
##                                   Family, Drama 
##                                             377 
##                       Drama, Action & Adventure 
##                                             374 
##                     Animation, Sci-Fi & Fantasy 
##                                             367 
##                                  Mystery, Drama 
##                                             363 
##                                 Kids, Animation 
##                                             358 
##                   Animation, Action & Adventure 
##                                             356 
##                     Documentary, War & Politics 
##                                             351 
##                           Crime, Drama, Mystery 
##                                             312 
##                                  Family, Comedy 
##                                             283 
##                         Sci-Fi & Fantasy, Drama 
##                                             281 
##                               Comedy, Animation 
##                                             269 
##                                    Comedy, Talk 
##                                             264 
##                                Animation, Drama 
##                                             263 
##                                  Crime, Mystery 
##                                             253 
##                                 Reality, Comedy 
##                                             233 
##            Action & Adventure, Sci-Fi & Fantasy 
##                                             225 
##                                 Comedy, Reality 
##                                             223 
##                              Crime, Documentary 
##                                             222 
##                        Comedy, Sci-Fi & Fantasy 
##                                             212 
##                                  Mystery, Crime 
##                                             203 
##             Animation, Comedy, Sci-Fi & Fantasy 
##                                             199 
##                              Documentary, Drama 
##                                             194 
##                           Drama, Crime, Mystery 
##                                             170 
##                        Animation, Comedy, Drama 
##                                             165 
##                             Documentary, Comedy 
##                                             164 
##                                  War & Politics 
##                                             164 
##                                   Reality, Talk 
##                                             159 
##                           War & Politics, Drama 
##                                             158 
##                                    Talk, Comedy 
##                                             151 
## Animation, Sci-Fi & Fantasy, Action & Adventure 
##                                             146 
##                   Action & Adventure, Animation 
##                                             137 
##                             Comedy, Documentary 
##                                             136 
## Action & Adventure, Animation, Sci-Fi & Fantasy 
##                                             133 
##                                         Western 
##                                             133 
##                                 Reality, Family 
##                                             130 
##                                    Family, Kids 
##                                             129 
##                                    Kids, Family 
##                                             126 
##                               Documentary, News 
##                                             123 
##                         Animation, Family, Kids 
##                                             118 
##                           Comedy, Drama, Family 
##                                             117 
##                               Documentary, Talk 
##                                             115 
##                               Animation, Family 
##                                             112 
##                                   Comedy, Crime 
##                                             111 
##                                 Family, Reality 
##                                             111 
##            Sci-Fi & Fantasy, Action & Adventure 
##                                             111 
##                         Animation, Comedy, Kids 
##                                             107 
##                           Drama, Mystery, Crime 
##                                             106 
##                                   Talk, Reality 
##                                             104 
##                        Sci-Fi & Fantasy, Comedy 
##                                             100 
##                                 Comedy, Mystery 
##                                              96 
##                           Crime, Mystery, Drama 
##                                              95 
##                       Action & Adventure, Crime 
##                                              93 
##                      Action & Adventure, Comedy 
##                                              91 
##           Animation, Action & Adventure, Comedy 
##                                              90 
##                         Animation, Kids, Family 
##                                              90 
##                      Comedy, Action & Adventure 
##                                              89 
##           Animation, Comedy, Action & Adventure 
##                                              85 
##                           Drama, Comedy, Family 
##                                              83 
##                                    Soap, Comedy 
##                                              83 
##     Action & Adventure, Drama, Sci-Fi & Fantasy 
##                                              81 
##                       Mystery, Sci-Fi & Fantasy 
##                                              79 
##                       Animation, Comedy, Family 
##                                              78 
##                Drama, Mystery, Sci-Fi & Fantasy 
##                                              78 
##                             Documentary, Family 
##                                              77 
##                              Drama, Documentary 
##                                              76 
##                           Mystery, Drama, Crime 
##                                              74 
##                Action & Adventure, Crime, Drama 
##                                              71 
##                     Sci-Fi & Fantasy, Animation 
##                                              71
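Because genres is stored as one comma-separated string per show, the table above counts genre combinations rather than individual genres. For example, "Comedy, Drama" (1,932) and "Drama, Comedy" (1,878) appear as separate categories. The sketch below shows one possible way to split the strings and count each genre on its own; the `genres` vector is made-up toy data, and the approach assumes that a comma never appears inside a genre name.

```r
# Toy genre strings standing in for tv_data$genres; values are illustrative
genres <- c("Comedy, Drama", "Drama, Comedy", "Drama", NA, "Animation, Kids")

# Split each string on the comma separator, then trim surrounding whitespace
genre_list <- strsplit(genres, ",", fixed = TRUE)
individual <- trimws(unlist(genre_list))

# Count how many shows carry each individual genre;
# table() drops the NA entries by default
genre_counts <- sort(table(individual), decreasing = TRUE)
genre_counts
```

With this approach a show tagged "Comedy, Drama" contributes once to Comedy and once to Drama, so the counts sum to more than the number of shows.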

5. Aggregate Function

Question: Does the average rating of a show differ based on its language?

# Aggregating vote_average by original_language
aggregate(tv_data$vote_average, by=list(Language=tv_data$original_language), FUN=mean)

##     Language           x
## 1         aa  0.00000000
## 2         ab  3.43592308
## 3         af  6.47646591
## 4         am  2.35833333
## 5         ar  2.12332563
## 6         as  7.20000000
## 7         av  0.00000000
## 8         az  2.81818182
## 9         be  2.66666667
## 10        bg  4.39077419
## 11        bn  2.97334934
## 12        bs  1.93103448
## 13        ca  2.19060256
## 14        cn  2.15544956
## 15        cs  2.13012084
## 16        cy  2.77633333
## 17        da  2.54437692
## 18        de  1.73996590
## 19        dv  0.00000000
## 20        el  3.35153599
## 21        en  2.12675084
## 22        eo  0.00000000
## 23        es  3.08211871
## 24        et  1.22500000
## 25        eu  3.20833333
## 26        fa  3.19272131
## 27        fi  3.17171302
## 28        fr  2.26773155
## 29        fy  0.00000000
## 30        ga  1.44000000
## 31        gd  0.00000000
## 32        gl  3.03072222
## 33        gu  0.80000000
## 34        he  2.41592948
## 35        hi  3.14637508
## 36        ho  0.00000000
## 37        hr  1.77797521
## 38        ht  0.00000000
## 39        hu  2.35064356
## 40        hy  0.00000000
## 41        hz  5.50000000
## 42        id  1.93763181
## 43        is  3.42360000
## 44        it  3.06983087
## 45        ja  3.69261817
## 46        jv  2.33333333
## 47        ka  3.78024074
## 48        kk  0.32258065
## 49        km  2.00000000
## 50        kn  2.04750000
## 51        ko  2.17177225
## 52        ku  8.00000000
## 53        kv  0.00000000
## 54        la  2.79560000
## 55        lb  3.38600000
## 56        ln  0.00000000
## 57        lo  0.00000000
## 58        lt  1.41538462
## 59        lv  3.25510204
## 60        mi  4.25000000
## 61        mk  3.40000000
## 62        ml  2.52631579
## 63        mn 10.00000000
## 64        mo  5.57142857
## 65        mr  0.97297297
## 66        ms  1.21360404
## 67        mt  0.75000000
## 68        my  0.08928571
## 69        nb  0.89583750
## 70        ne  4.66666667
## 71        nl  1.91611290
## 72        no  2.10793554
## 73        or  0.64285714
## 74        pa  2.42307692
## 75        pl  2.21658072
## 76        ps  0.00000000
## 77        pt  2.32868206
## 78        rm  0.00000000
## 79        ro  4.88791150
## 80        ru  2.97008437
## 81        se 10.00000000
## 82        sh  2.63590769
## 83        si  6.13333333
## 84        sk  1.25779455
## 85        sl  1.28000000
## 86        so  5.00000000
## 87        sq  5.05000000
## 88        sr  3.07360526
## 89        st 10.00000000
## 90        sv  2.80850447
## 91        sw  9.00000000
## 92        ta  3.10204386
## 93        te  3.17380952
## 94        th  2.59456512
## 95        ti 10.00000000
## 96        tl  1.54707201
## 97        tr  3.94363272
## 98        ug  0.00000000
## 99        uk  2.73947458
## 100       ur  6.54834835
## 101       uz  0.00000000
## 102       vi  1.82086829
## 103       xx  2.43632143
## 104       za  0.00000000
## 105       zh  1.70325135
## 106       zu  2.52000000
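The group means above should be read together with the group sizes. The languages with a perfect mean of 10 (mn, se, st, ti) each have only one show in the language table from Section 4, so those means reflect a single rating rather than a trend. The sketch below shows one way to report the mean and the number of shows side by side and to keep only groups above a minimum size. The `toy` data frame and the `min_shows` threshold are illustrative assumptions, not values from the analysis.

```r
# Toy data standing in for tv_data; values are illustrative only
toy <- data.frame(
  original_language = c("en", "en", "en", "mn", "ja", "ja"),
  vote_average      = c(7.0, 6.0, 8.0, 10.0, 7.5, 6.5)
)

# Mean rating and number of shows per language in one pass
by_lang <- aggregate(
  vote_average ~ original_language,
  data = toy,
  FUN = function(x) c(mean = mean(x), n = length(x))
)

# aggregate() returns a matrix column here; flatten it into ordinary columns
by_lang <- do.call(data.frame, by_lang)
names(by_lang) <- c("language", "mean_rating", "n_shows")

# Keep languages with at least min_shows shows (the threshold is arbitrary)
min_shows <- 2
reliable <- by_lang[by_lang$n_shows >= min_shows, ]
reliable[order(-reliable$mean_rating), ]
```

On the full dataset a larger threshold would likely be more appropriate, and it may also be worth excluding shows with no votes before averaging.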

6. Visual Summaries

Distribution of Vote Average

install.packages("ggplot2")

library(ggplot2)
ggplot(tv_data, aes(x = vote_average)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Vote Average", x = "Vote Average", y = "Count") +
  theme_minimal()

Scatter Plot of Number of Episodes vs. Vote Average, Colored by Original Language

# Scatter plot to show correlation between number_of_episodes and vote_average
# Color is used to differentiate between different original languages
ggplot(tv_data, aes(x = number_of_episodes, y = vote_average, color = original_language)) +
  geom_point(alpha = 0.7) +
  labs(title = "Number of Episodes vs. Vote Average", 
       x = "Number of Episodes", 
       y = "Vote Average") +
  theme_minimal() +
  scale_color_viridis_d()
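The scatter plot gives a visual impression, but the second analysis question asks for a correlation, which can also be computed directly. The sketch below is one possible approach under stated assumptions: it uses a made-up `toy` data frame, restricts the data to shows with at least one vote, and uses Spearman's rank correlation because the episode counts are heavily skewed (the summary in Section 3 shows a maximum of 20,839 against a median of 6).

```r
# Toy data standing in for tv_data; values are illustrative only
toy <- data.frame(
  number_of_episodes = c(1, 6, 10, 24, 73, 177, 500, NA),
  vote_count         = c(5, 0, 12, 40, 21857, 15432, 8, 3),
  vote_average       = c(6.0, 0, 7.1, 7.4, 8.4, 8.1, 5.5, 7.0)
)

# Restrict to shows with at least one vote so unrated zeros do not distort the result
rated <- toy[which(toy$vote_count > 0), ]

# Spearman's rank correlation is less sensitive to extreme episode counts than Pearson's;
# use = "complete.obs" drops rows where either variable is missing
rho <- cor(rated$number_of_episodes, rated$vote_average,
           method = "spearman", use = "complete.obs")
rho
```

A value of `rho` near 0 would suggest little monotonic relationship between episode count and rating, while values near 1 or -1 would suggest a strong one.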