Week 2 - Saransh Gupta

1. Introduction

This data dive explores a dataset of TV shows from TMDB (The Movie Database), an online community that collects information on TV shows, movies, and more. The dataset contains detailed information on various TV shows, including metadata like the number of seasons, episodes, languages, networks, ratings, and genres.

Objective of the Analysis

The purpose of this analysis is to understand the characteristics of the TV shows in the dataset and investigate patterns and trends in the data. Specifically:

1. Perform summary statistics to get an overview of the numeric and categorical data in the dataset.

2. Investigate relationships between certain key variables to answer some analytical questions about TV shows, such as:

- Do certain languages or genres tend to have higher ratings?

- Is there any correlation between the number of episodes and the average rating of a TV show?

- What are the trends in the distribution of genres across different networks?

3. Visualize the data to better understand its distribution and identify trends or patterns in the relationships between key variables, such as ratings and episode counts.

Dataset Overview

The dataset has 29 columns, including the following (names are written as they appear in the file):

- id: Unique identifier for the TV show.

- name: The name of the TV show.

- number_of_seasons: The number of seasons the TV show has aired.

- number_of_episodes: The total number of episodes across all seasons.

- original_language: The primary language in which the show is produced.

- vote_count: The number of votes the show has received.

- vote_average: The average rating of the show, calculated based on user votes.

- overview: A brief description of the TV show.

- adult: A boolean flag indicating whether the show is for adult audiences.

- tagline: A short slogan or phrase associated with the show.

- genres: The genre(s) associated with the show (e.g., Drama, Crime, Sci-Fi).

- networks: The networks where the show is broadcast.

- spoken_languages: The languages spoken in the show.

- production_companies: The companies involved in producing the show.

- production_countries: The countries where the show was produced.

- episode_run_time: The typical run time of episodes for the show, if available.

Goals for This Data Dive

The analysis will focus on understanding numeric and categorical features of the TV shows, followed by exploring relationships between some variables. The following questions will be investigated:

1. Does the average rating of a show differ based on its primary language?

2. Is there a correlation between the number of episodes a show has and its average rating?

3. How are different genres distributed across TV networks, and what are the most popular genres for each network?

2. Loading and Inspecting the Dataset

Let’s load the dataset into R and inspect its structure to become familiar with the available columns.

options(repos = c(CRAN = "https://cran.rstudio.com/"))

# Install readr only if it is not already available
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")

# Load necessary libraries
library(readr)

# Load the dataset
tv_data <- read_csv("/Users/saransh/Downloads/TMDB_tv_dataset_v3.csv")

## Rows: 168639 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (18): name, original_language, overview, backdrop_path, homepage, origi...
## dbl   (7): id, number_of_seasons, number_of_episodes, vote_count, vote_avera...
## lgl   (2): adult, in_production
## date  (2): first_air_date, last_air_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows of the dataset
head(tv_data)

## # A tibble: 6 × 29
##      id name   number_of_seasons number_of_episodes original_language vote_count
##   <dbl> <chr>              <dbl>              <dbl> <chr>                  <dbl>
## 1  1399 Game …                 8                 73 en                     21857
## 2 71446 Money…                 3                 41 es                     17836
## 3 66732 Stran…                 4                 34 en                     16161
## 4  1402 The W…                11                177 en                     15432
## 5 63174 Lucif…                 6                 93 en                     13870
## 6 69050 River…                 7                137 en                     13180
## # ℹ 23 more variables: vote_average <dbl>, overview <chr>, adult <lgl>,
## #   backdrop_path <chr>, first_air_date <date>, last_air_date <date>,
## #   homepage <chr>, in_production <lgl>, original_name <chr>, popularity <dbl>,
## #   poster_path <chr>, type <chr>, status <chr>, tagline <chr>, genres <chr>,
## #   created_by <chr>, languages <chr>, networks <chr>, origin_country <chr>,
## #   spoken_languages <chr>, production_companies <chr>,
## #   production_countries <chr>, episode_run_time <dbl>

3. Numeric Summary

Summarizing number_of_episodes and vote_average will help in understanding:

How varied the TV shows are in terms of the number of episodes.

# Summary for number_of_episodes
summary(tv_data$number_of_episodes)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     1.00     6.00    24.46    20.00 20839.00

The summary of number_of_episodes shows a wide range in the number of episodes among the TV shows in the dataset. The minimum value of 0 indicates that there are shows with no episodes, which could either be incomplete data or shows that were produced but never aired.

The median of 6 means that half of the shows have six or fewer episodes, and the first quartile of 1 means that at least a quarter have at most one episode.
The mean of 24.46, however, is much higher, indicating that there are some shows with significantly more episodes that skew the average upward.
The maximum value of 20,839 episodes is an extreme outlier and could reflect long-running shows or a data anomaly, which should be further investigated to understand its validity.

The difference between the median and the mean, along with the high maximum value, suggests a right-skewed distribution, meaning a few shows with many episodes are pulling the average up.
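The two extremes flagged above (zero-episode shows and the very large maximum) can be checked directly. The following is a minimal sketch rather than part of the original analysis: `episode_extremes` is a hypothetical helper, and the `shows` data frame is made-up toy data so that the code runs without the CSV. Only the column names `name` and `number_of_episodes` come from the dataset.

```r
# Hypothetical helper: count zero-episode shows and list the n longest shows.
episode_extremes <- function(df, n = 3) {
  zero_count <- sum(df$number_of_episodes == 0, na.rm = TRUE)
  ordered <- df[order(df$number_of_episodes, decreasing = TRUE), ]
  list(
    zero_episode_shows = zero_count,
    longest = head(ordered[, c("name", "number_of_episodes")], n)
  )
}

# Toy data, made up for illustration (not taken from the TMDB file)
shows <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  number_of_episodes = c(0, 6, 20, 0, 20839),
  stringsAsFactors = FALSE
)

result <- episode_extremes(shows, n = 2)
print(result$zero_episode_shows)
print(result$longest)
```

On the real data, the same call would be `episode_extremes(tv_data)`. Looking at the names of the longest shows is a quick way to judge whether the maximum is a genuine long-running program or a data-entry error.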

The summary of vote_average shows the central tendency and distribution of ratings across TV shows, which could help identify popular or highly rated shows.

# Summary for vote_average
summary(tv_data$vote_average)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.334   6.000  10.000

The summary statistics for vote_average reveal some interesting patterns:

The minimum and first quartile values are both 0, indicating that a substantial portion of shows have a vote_average of 0. Many of these are probably shows with no votes at all rather than shows that were rated 0, which can be checked against vote_count.
The median is also 0, meaning that at least half of the shows in the dataset have a vote_average of 0.
The mean vote average is 2.334, but it is pulled down by the zero entries. Since at least half of the values are 0, the mean among shows with a non-zero rating must be at least about 4.7 (2.334 / 0.5).
The third quartile (6.0) and maximum (10.0) show that roughly a quarter of the shows have a rating of 6 or higher.

This distribution suggests that at least half of the shows in the dataset are unrated or have a rating of 0, and that the overall mean understates the typical rating of shows that have actually been rated. The large share of zeros could indicate that the dataset includes many obscure or less popular shows.
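One way to check how much of this pattern comes from unrated shows is to separate shows with no votes from the rest before summarizing. The sketch below is illustrative only: `summarise_rated` is a hypothetical helper, the toy values are made up, and it assumes that `vote_count == 0` marks a show whose `vote_average` of 0 is a placeholder rather than a real score.

```r
# Hypothetical helper: summarise ratings only for shows with at least one vote.
# Assumption: vote_count == 0 means the show is unrated.
summarise_rated <- function(df) {
  rated <- df[!is.na(df$vote_count) & df$vote_count > 0, ]
  list(
    share_unrated = 1 - nrow(rated) / nrow(df),
    n_rated       = nrow(rated),
    mean_rated    = mean(rated$vote_average),
    median_rated  = median(rated$vote_average)
  )
}

# Toy data, made up for illustration (not taken from the TMDB file)
shows <- data.frame(
  vote_count   = c(0, 0, 0, 12, 250, 3),
  vote_average = c(0, 0, 0, 6.5, 8.1, 7.0)
)

res <- summarise_rated(shows)
print(res)
```

On the real data, `summarise_rated(tv_data)` would report the share of unrated shows alongside the mean and median among rated shows, which is a fairer basis for statements about how shows are received.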

4. Categorical Summary

By summarizing original language and genres, we want to:

Understand the distribution of TV shows by language. Are certain languages more common than others?

# Categorical summary for original_language
table(tv_data$original_language)

## 
##    aa    ab    af    am    ar    as    av    az    be    bg    bn    bs    ca 
##     1    13    88    48  2638     5     2    11     3    93   229    29   156 
##    cn    cs    cy    da    de    dv    el    en    eo    es    et    eu    fa 
##  1824  1382    30  2019  7712     2   903 76304     1  5602    52    12   244 
##    fi    fr    fy    ga    gd    gl    gu    he    hi    ho    hr    ht    hu 
##   453  7290     3    25     4    18    20   553  1565     1   121     1   404 
##    hy    hz    id    is    it    ja    jv    ka    kk    km    kn    ko    ku 
##     7     2   459    55  1691 14048     3    54    31     5    20  7820    10 
##    kv    la    lb    ln    lo    lt    lv    mi    mk    ml    mn    mo    mr 
##     1     5     2     1     1    65    49     4     5    38     1     7    37 
##    ms    mt    my    nb    ne    nl    no    or    pa    pl    ps    pt    rm 
##   495    20   112    80     6  2923  1210    14    26   923     1  3551     1 
##    ro    ru    se    sh    si    sk    sl    so    sq    sr    st    sv    sw 
##   113  2963     1    65     6   404    50     2    20   152     1  1455     1 
##    ta    te    th    ti    tl    tr    ug    uk    ur    uz    vi    xx    za 
##   114    84  1835     1   861  1748     1   118   333     1   205    28     1 
##    zh    zu 
## 14422    10

The distribution of TV shows by original_language reveals that the dataset is heavily dominated by English-language shows, with 76,304 of the 168,639 entries (about 45%). This is not surprising, given that English content often has a broader global reach and is widely produced.

Some other languages with significant representation include:

Chinese (zh): 14,422 shows
Japanese (ja): 14,048 shows
Korean (ko): 7,820 shows
German (de): 7,712 shows
French (fr): 7,290 shows
Spanish (es): 5,602 shows

Smaller languages such as Russian (ru, 2,963 shows) and Arabic (ar, 2,638 shows) have moderate representation. However, there are numerous languages with only a handful of shows, such as Uzbek (uz), Zhuang (za), and Avaric (av), each having just one or two entries.

The dominance of certain languages like English, Chinese (zh, 14,422 shows), and Japanese could reflect the production volume and international popularity of shows from these regions, while the underrepresentation of other languages might suggest limited global production or distribution of content in those languages.
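Raw counts are easier to compare when expressed as shares. The sketch below is a minimal illustration, not part of the original analysis: `language_shares` is a hypothetical helper, and the input is rebuilt from the counts of six languages in the table above. Because the remaining languages are omitted, the printed percentages are relative to this subset, not to the full dataset.

```r
# Hypothetical helper: percentage of shows per language, largest first.
language_shares <- function(languages, top = 5) {
  counts <- sort(table(languages), decreasing = TRUE)
  head(round(100 * prop.table(counts), 1), top)
}

# Counts copied from the table above for six languages only;
# all other languages are left out of this illustration.
langs <- rep(c("en", "zh", "ja", "ko", "de", "fr"),
             times = c(76304, 14422, 14048, 7820, 7712, 7290))

shares <- language_shares(langs, top = 3)
print(shares)
```

On the real data, `language_shares(tv_data$original_language, top = 10)` would give shares over all rows.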

Analyze the diversity of genres in the dataset. Which genres are most popular, and do they have any relationship to other variables like ratings?

# Categorical summary for genres sorted in descending order
genres_summary <- table(tv_data$genres)
sorted_genres_summary <- sort(genres_summary, decreasing = TRUE)

# Display the top 100 categories
top_100_genres <- head(sorted_genres_summary, 100)

# Show the result
top_100_genres

## 
##                                     Documentary 
##                                           17596 
##                                           Drama 
##                                           16282 
##                                          Comedy 
##                                           10304 
##                                         Reality 
##                                            8009 
##                                       Animation 
##                                            3326 
##                                   Comedy, Drama 
##                                            1932 
##                                   Drama, Comedy 
##                                            1878 
##                                            Talk 
##                                            1872 
##                               Animation, Comedy 
##                                            1254 
##                                            Kids 
##                                            1112 
##                                           Crime 
##                                            1037 
##                                            News 
##                                            1022 
##                                    Crime, Drama 
##                                             985 
##                                          Family 
##                                             980 
##                                    Drama, Crime 
##                                             921 
##                                 Animation, Kids 
##                                             908 
##                                   Drama, Family 
##                                             886 
##                                  Drama, Mystery 
##                                             778 
##                            Documentary, Reality 
##                                             758 
##                              Action & Adventure 
##                                             727 
##                              Documentary, Crime 
##                                             681 
##                                         Mystery 
##                                             635 
##                                  Comedy, Family 
##                                             558 
##                                Sci-Fi & Fantasy 
##                                             516 
##                                     Soap, Drama 
##                                             512 
##                                     Drama, Soap 
##                                             490 
##                       Action & Adventure, Drama 
##                                             479 
##                           Drama, War & Politics 
##                                             464 
##                         Drama, Sci-Fi & Fantasy 
##                                             458 
##                            Reality, Documentary 
##                                             444 
##                                            Soap 
##                                             393 
## Animation, Action & Adventure, Sci-Fi & Fantasy 
##                                             385 
##                                   Family, Drama 
##                                             377 
##                       Drama, Action & Adventure 
##                                             374 
##                     Animation, Sci-Fi & Fantasy 
##                                             367 
##                                  Mystery, Drama 
##                                             363 
##                                 Kids, Animation 
##                                             358 
##                   Animation, Action & Adventure 
##                                             356 
##                     Documentary, War & Politics 
##                                             351 
##                           Crime, Drama, Mystery 
##                                             312 
##                                  Family, Comedy 
##                                             283 
##                         Sci-Fi & Fantasy, Drama 
##                                             281 
##                               Comedy, Animation 
##                                             269 
##                                    Comedy, Talk 
##                                             264 
##                                Animation, Drama 
##                                             263 
##                                  Crime, Mystery 
##                                             253 
##                                 Reality, Comedy 
##                                             233 
##            Action & Adventure, Sci-Fi & Fantasy 
##                                             225 
##                                 Comedy, Reality 
##                                             223 
##                              Crime, Documentary 
##                                             222 
##                        Comedy, Sci-Fi & Fantasy 
##                                             212 
##                                  Mystery, Crime 
##                                             203 
##             Animation, Comedy, Sci-Fi & Fantasy 
##                                             199 
##                              Documentary, Drama 
##                                             194 
##                           Drama, Crime, Mystery 
##                                             170 
##                        Animation, Comedy, Drama 
##                                             165 
##                             Documentary, Comedy 
##                                             164 
##                                  War & Politics 
##                                             164 
##                                   Reality, Talk 
##                                             159 
##                           War & Politics, Drama 
##                                             158 
##                                    Talk, Comedy 
##                                             151 
## Animation, Sci-Fi & Fantasy, Action & Adventure 
##                                             146 
##                   Action & Adventure, Animation 
##                                             137 
##                             Comedy, Documentary 
##                                             136 
## Action & Adventure, Animation, Sci-Fi & Fantasy 
##                                             133 
##                                         Western 
##                                             133 
##                                 Reality, Family 
##                                             130 
##                                    Family, Kids 
##                                             129 
##                                    Kids, Family 
##                                             126 
##                               Documentary, News 
##                                             123 
##                         Animation, Family, Kids 
##                                             118 
##                           Comedy, Drama, Family 
##                                             117 
##                               Documentary, Talk 
##                                             115 
##                               Animation, Family 
##                                             112 
##                                   Comedy, Crime 
##                                             111 
##                                 Family, Reality 
##                                             111 
##            Sci-Fi & Fantasy, Action & Adventure 
##                                             111 
##                         Animation, Comedy, Kids 
##                                             107 
##                           Drama, Mystery, Crime 
##                                             106 
##                                   Talk, Reality 
##                                             104 
##                        Sci-Fi & Fantasy, Comedy 
##                                             100 
##                                 Comedy, Mystery 
##                                              96 
##                           Crime, Mystery, Drama 
##                                              95 
##                       Action & Adventure, Crime 
##                                              93 
##                      Action & Adventure, Comedy 
##                                              91 
##           Animation, Action & Adventure, Comedy 
##                                              90 
##                         Animation, Kids, Family 
##                                              90 
##                      Comedy, Action & Adventure 
##                                              89 
##           Animation, Comedy, Action & Adventure 
##                                              85 
##                           Drama, Comedy, Family 
##                                              83 
##                                    Soap, Comedy 
##                                              83 
##     Action & Adventure, Drama, Sci-Fi & Fantasy 
##                                              81 
##                       Mystery, Sci-Fi & Fantasy 
##                                              79 
##                       Animation, Comedy, Family 
##                                              78 
##                Drama, Mystery, Sci-Fi & Fantasy 
##                                              78 
##                             Documentary, Family 
##                                              77 
##                              Drama, Documentary 
##                                              76 
##                           Mystery, Drama, Crime 
##                                              74 
##                Action & Adventure, Crime, Drama 
##                                              71 
##                     Sci-Fi & Fantasy, Animation 
##                                              71

The genre distribution is dominated by a few key genres:

Documentary: 17,596 shows
Drama: 16,282 shows
Comedy: 10,304 shows

These three single-genre labels alone cover about 44,000 shows, roughly a quarter of the 168,639 rows, suggesting that factual content, character-driven stories, and humor are the most common types of TV shows.

Other genres like Reality (8,009 shows), Animation (3,326 shows), and combinations such as Comedy, Drama (1,932 shows) are also well-represented.

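Additionally, genre combinations like Animation, Comedy and Crime, Drama suggest that multi-genre shows are fairly common, particularly in the case of dramatic and comedic content. Some niche combinations like Animation, Sci-Fi & Fantasy are less frequent but still present. Note that table() counts each distinct string, so Comedy, Drama (1,932) and Drama, Comedy (1,878) are tallied separately even though they describe the same pair of genres, and the single-genre counts above cover only shows tagged with exactly one genre.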

The over-representation of genres like Documentary and Drama could reflect the global trend toward factual content and serialized storytelling, while more niche genres like Mystery and War & Politics appear less frequently, indicating that they cater to smaller, more specialized audiences.
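Because the genres column stores several genres in one comma-separated string, counting individual genres requires splitting the strings first. The following is a minimal sketch under that assumption: `count_single_genres` is a hypothetical helper and the input strings are toy examples in the same format as the output above.

```r
# Hypothetical helper: count individual genres from comma-separated strings.
count_single_genres <- function(genres) {
  genres <- genres[!is.na(genres) & genres != ""]       # drop missing entries
  parts <- trimws(unlist(strsplit(genres, ",", fixed = TRUE)))
  sort(table(parts), decreasing = TRUE)
}

# Toy strings in the same format as the genres column
genre_strings <- c("Drama", "Comedy, Drama", "Drama, Comedy",
                   "Animation, Comedy", NA, "Documentary")

genre_counts <- count_single_genres(genre_strings)
print(genre_counts)
```

With this approach, a show tagged Comedy, Drama contributes once to Comedy and once to Drama, so the order of the genres in the string no longer matters. On the real data the call would be `count_single_genres(tv_data$genres)`.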

5. Aggregate Function

Question: Does the average rating of a show differ based on its language?

# Aggregating vote_average by original_language
aggregate(tv_data$vote_average, by=list(Language=tv_data$original_language), FUN=mean)

##     Language           x
## 1         aa  0.00000000
## 2         ab  3.43592308
## 3         af  6.47646591
## 4         am  2.35833333
## 5         ar  2.12332563
## 6         as  7.20000000
## 7         av  0.00000000
## 8         az  2.81818182
## 9         be  2.66666667
## 10        bg  4.39077419
## 11        bn  2.97334934
## 12        bs  1.93103448
## 13        ca  2.19060256
## 14        cn  2.15544956
## 15        cs  2.13012084
## 16        cy  2.77633333
## 17        da  2.54437692
## 18        de  1.73996590
## 19        dv  0.00000000
## 20        el  3.35153599
## 21        en  2.12675084
## 22        eo  0.00000000
## 23        es  3.08211871
## 24        et  1.22500000
## 25        eu  3.20833333
## 26        fa  3.19272131
## 27        fi  3.17171302
## 28        fr  2.26773155
## 29        fy  0.00000000
## 30        ga  1.44000000
## 31        gd  0.00000000
## 32        gl  3.03072222
## 33        gu  0.80000000
## 34        he  2.41592948
## 35        hi  3.14637508
## 36        ho  0.00000000
## 37        hr  1.77797521
## 38        ht  0.00000000
## 39        hu  2.35064356
## 40        hy  0.00000000
## 41        hz  5.50000000
## 42        id  1.93763181
## 43        is  3.42360000
## 44        it  3.06983087
## 45        ja  3.69261817
## 46        jv  2.33333333
## 47        ka  3.78024074
## 48        kk  0.32258065
## 49        km  2.00000000
## 50        kn  2.04750000
## 51        ko  2.17177225
## 52        ku  8.00000000
## 53        kv  0.00000000
## 54        la  2.79560000
## 55        lb  3.38600000
## 56        ln  0.00000000
## 57        lo  0.00000000
## 58        lt  1.41538462
## 59        lv  3.25510204
## 60        mi  4.25000000
## 61        mk  3.40000000
## 62        ml  2.52631579
## 63        mn 10.00000000
## 64        mo  5.57142857
## 65        mr  0.97297297
## 66        ms  1.21360404
## 67        mt  0.75000000
## 68        my  0.08928571
## 69        nb  0.89583750
## 70        ne  4.66666667
## 71        nl  1.91611290
## 72        no  2.10793554
## 73        or  0.64285714
## 74        pa  2.42307692
## 75        pl  2.21658072
## 76        ps  0.00000000
## 77        pt  2.32868206
## 78        rm  0.00000000
## 79        ro  4.88791150
## 80        ru  2.97008437
## 81        se 10.00000000
## 82        sh  2.63590769
## 83        si  6.13333333
## 84        sk  1.25779455
## 85        sl  1.28000000
## 86        so  5.00000000
## 87        sq  5.05000000
## 88        sr  3.07360526
## 89        st 10.00000000
## 90        sv  2.80850447
## 91        sw  9.00000000
## 92        ta  3.10204386
## 93        te  3.17380952
## 94        th  2.59456512
## 95        ti 10.00000000
## 96        tl  1.54707201
## 97        tr  3.94363272
## 98        ug  0.00000000
## 99        uk  2.73947458
## 100       ur  6.54834835
## 101       uz  0.00000000
## 102       vi  1.82086829
## 103       xx  2.43632143
## 104       za  0.00000000
## 105       zh  1.70325135
## 106       zu  2.52000000

The average ratings by original_language show notable differences across languages:

Afrikaans (af) has one of the highest average ratings among languages with a sizeable number of shows, at 6.48 across 88 shows, suggesting that TV shows in this language are generally well-received.
Assamese (as) also has a high average rating of 7.20, but it is based on only 5 shows.
Arabic (ar) shows a much lower average rating of 2.12, and Amharic (am) is similarly low at 2.36. Because shows with a vote_average of 0 are included in the mean, low averages may reflect a large share of unrated shows rather than poor reception.
The most dominant language, English (en), averages 2.13, slightly below the overall mean of 2.334. Spanish (es), with 5,602 shows, averages 3.08, above the overall mean, while French (fr) averages 2.27.
The perfect averages of 10.00 for mn, se, st, and ti each come from a single show, so they say little about the language as a whole.

These differences in ratings could reflect varying production standards, audience preferences, or voting behaviors based on region. Higher-rated languages might be producing content with better production values or more popular genres, while lower-rated languages could either have fewer resources or be catering to more niche, less mainstream audiences. Differences in the share of unrated shows and in the number of shows per language are likely to matter at least as much, so these averages should be read with caution.
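A per-language comparison is easier to interpret when unrated shows and very small groups are excluded. The sketch below illustrates one way to do that; it is not the method used above. `rating_by_language` and its `min_shows` threshold are hypothetical, the toy values are made up, and `vote_count == 0` is assumed to mean "unrated".

```r
# Hypothetical helper: mean rating per language, using only rated shows
# and dropping languages with fewer than min_shows rated shows.
rating_by_language <- function(df, min_shows = 2) {
  rated  <- df[!is.na(df$vote_count) & df$vote_count > 0, ]
  means  <- tapply(rated$vote_average, rated$original_language, mean)
  counts <- tapply(rated$vote_average, rated$original_language, length)
  out <- data.frame(language    = names(means),
                    mean_rating = as.numeric(means),
                    rated_shows = as.integer(counts),
                    stringsAsFactors = FALSE)
  out <- out[out$rated_shows >= min_shows, ]
  out[order(out$mean_rating, decreasing = TRUE), ]
}

# Toy data, made up for illustration (not taken from the TMDB file)
shows <- data.frame(
  original_language = c("en", "en", "en", "ja", "ja", "mn", "de"),
  vote_count        = c(100, 0, 40, 10, 25, 1, 0),
  vote_average      = c(7.0, 0, 8.0, 6.0, 7.0, 10.0, 0),
  stringsAsFactors  = FALSE
)

by_lang <- rating_by_language(shows, min_shows = 2)
print(by_lang)
```

In the toy data, the single "mn" show with a rating of 10 is filtered out by `min_shows`, which mirrors the single-show languages that reach an average of 10 in the real output. A suitable threshold for the real data is a judgment call.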

6. Visual Summaries

Distribution of Vote Average

# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")

library(ggplot2)
ggplot(tv_data, aes(x = vote_average)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Vote Average", x = "Vote Average", y = "Count") +
  theme_minimal()

The histogram visualizes the distribution of the vote_average for TV shows in the dataset. The following insights can be drawn from the plot:

Skewed Distribution:
- The largest group of TV shows has a vote average of 0, which is consistent with the summary statistics where the median and first quartile of vote_average were both 0. This suggests that many shows in the dataset either haven’t received any votes or have very low ratings.
- Away from the spike at 0, the counts per bin are much smaller, indicating that rated shows are spread thinly across the rest of the scale.
Smaller Cluster of Higher Ratings:
- There is a smaller cluster of TV shows with ratings between 6 and 10, consistent with the third quartile of 6.0. These shows are a minority of the dataset, but they are the ones that have been rated and reasonably well received.
Outliers and Data Quality Considerations:
- The sharp drop-off after the initial bins suggests that the dataset is heavily concentrated with low-rated or unrated shows. This could be due to the inclusion of obscure or less popular shows that are not widely reviewed.
- The distribution’s shape may also imply a data quality issue, where shows with few or no user ratings are included in the dataset, leading to the dominance of 0-rated entries.
Further Investigation:
- The plot raises questions about the distribution of votes across languages and genres. For example, it might be worth investigating whether certain genres or languages tend to have higher average ratings and if those higher-rated shows follow different trends.

Scatter Plot of Number of Episodes vs. Vote Average, Colored by Original Language

# Scatter plot to show correlation between number_of_episodes and vote_average
# Color is used to differentiate between different original languages
ggplot(tv_data, aes(x = number_of_episodes, y = vote_average, color = original_language)) +
  geom_point(alpha = 0.7) +
  labs(title = "Number of Episodes vs. Vote Average", 
       x = "Number of Episodes", 
       y = "Vote Average") +
  theme_minimal() +
  scale_color_viridis_d()

The scatter plot illustrates the relationship between the number of episodes and the vote average for TV shows, with points colored by their original language. Several important insights can be drawn from the plot:

No Strong Correlation:
- From the scatter plot, there appears to be no strong positive or negative correlation between the number of episodes a show has and its average rating, although the plot alone does not quantify this. Shows with a low number of episodes can have both high and low ratings, and the same is true for shows with many episodes.
- This indicates that episode count does not necessarily determine a show’s success in terms of ratings. Both short-running and long-running shows have the potential to be well or poorly received.
Cluster of Low-Episode, Low-Rated Shows:
- There is a dense cluster of points in the lower-left corner, representing shows with few episodes and low ratings. Many of these points are likely unrated shows with a vote_average of 0, so the cluster does not by itself show that short-run or pilot shows are poorly received.
- A large number of these low-episode shows might also be in languages that are less globally popular, which could explain their low vote counts and lower ratings.
Variety of Languages:
- The use of color to represent different languages reveals some interesting patterns. Some languages, like English, appear across the entire range of episode counts and ratings, suggesting a wide variety of content in terms of both quantity and quality.
- Other languages may show more limited variability, indicating that some languages are more frequently associated with certain types of shows (e.g., shorter or longer running shows).
- Investigating the colors can help identify if certain languages tend to dominate the higher-rated or longer-running shows.
Outliers:
- There are several outliers where shows have an extremely high number of episodes. These points represent long-running shows, possibly from regions where serialized content (e.g., telenovelas or soap operas) is more common. Some of these shows still have relatively low ratings, which indicates that episode count does not guarantee quality or popularity.
- The outliers might warrant further investigation to see if they belong to specific genres or countries known for producing long-running content.
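The visual impression of "no strong correlation" can be backed by a number. The sketch below uses Spearman's rank correlation, which is a choice made here rather than in the original analysis: it is less sensitive to the extreme episode counts than Pearson's coefficient. `episode_rating_correlation` is a hypothetical helper, the toy values are made up, and unrated shows (assumed to be those with `vote_count == 0`) are excluded.

```r
# Hypothetical helper: rank correlation between episode count and rating,
# restricted to shows with at least one vote.
episode_rating_correlation <- function(df) {
  rated <- df[!is.na(df$vote_count) & df$vote_count > 0, ]
  cor(rated$number_of_episodes, rated$vote_average,
      method = "spearman", use = "complete.obs")
}

# Toy data, made up for illustration (not taken from the TMDB file)
shows <- data.frame(
  number_of_episodes = c(1, 6, 10, 24, 73, 5000, 3),
  vote_count         = c(5, 12, 0, 40, 900, 8, 0),
  vote_average       = c(6.0, 7.5, 0, 5.5, 8.4, 4.0, 0)
)

rho <- episode_rating_correlation(shows)
print(rho)
```

On the real data, `episode_rating_correlation(tv_data)` returns a value between -1 and 1. A value close to 0 would support the reading of the scatter plot, while a clearly non-zero value would suggest a monotonic trend that the dense plot hides.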