HW05.1

Author

Xiangzhe Li

Question 1

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Read in the data 
tv_ratings <- read_csv("https://raw.githubusercontent.com/vaiseys/dav-course/main/Data/tv_ratings.csv")

Rows: 2266 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): titleId, title, genres
dbl  (3): seasonNumber, av_rating, share
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Glimpse the data 
glimpse(tv_ratings)

Rows: 2,266
Columns: 7
$ titleId      <chr> "tt2879552", "tt3148266", "tt3148266", "tt3148266", "tt31…
$ seasonNumber <dbl> 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 1, 1, …
$ title        <chr> "11.22.63", "12 Monkeys", "12 Monkeys", "12 Monkeys", "12…
$ date         <date> 2016-03-10, 2015-02-27, 2016-05-30, 2017-05-19, 2018-06-…
$ av_rating    <dbl> 8.4890, 8.3407, 8.8196, 9.0369, 9.1363, 8.4370, 7.5089, 8…
$ share        <dbl> 0.51, 0.46, 0.25, 0.19, 0.38, 2.38, 2.19, 6.67, 7.13, 5.8…
$ genres       <chr> "Drama,Mystery,Sci-Fi", "Adventure,Drama,Mystery", "Adven…

tv_long <- tv_ratings |> 
  group_by(title) |> 
  summarise(num_seasons = n()) |> 
  ungroup() |> 
  left_join(tv_ratings, by = "title") 

tv_long <- tv_long |> 
  filter(num_seasons >= 5)

ggplot(tv_long, aes(x = seasonNumber, y = av_rating, group = title)) +
  geom_line() +
  labs(
    title = "Average TV Ratings Across Seasons",
    x = "Season Number",
    y = "Average Rating"
  )

Conclusion: From visual inspection alone, the plot shows no obvious relationship between season number and average rating.
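The visual impression could also be checked numerically. The sketch below is one possible approach and is not part of the original analysis: it computes a Spearman rank correlation between season number and average rating. The helper name season_rating_cor is hypothetical, and the toy data reuses the Are You Afraid of the Dark? ratings printed under Question 2.

```r
library(tibble)

# Toy data: the seven "Are You Afraid of the Dark?" rows printed under Question 2
toy <- tibble(
  title        = "Are You Afraid of the Dark?",
  seasonNumber = 1:7,
  av_rating    = c(9.17, 9.24, 9.43, 9.31, 8.95, 6.02, 6.83)
)

# Hypothetical helper: rank correlation between season number and rating
season_rating_cor <- function(data) {
  cor(data$seasonNumber, data$av_rating, method = "spearman")
}

season_rating_cor(toy)
```

Calling season_rating_cor(tv_long) on the full data would give one overall value; a per-show version would need a group_by(title) step first.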

Question 2

ggplot(tv_long, aes(x = seasonNumber, y = av_rating, group = title)) +
  geom_line() +
  facet_wrap(~ genres) +
  labs(
    title = "Average TV Ratings Across Seasons by Genre",
    x = "Season Number",
    y = "Average Rating"
  ) +
  theme(
    axis.text.y = element_text(size = 5),
    strip.text = element_text(size = 5)
  )

What shows tend to last longer?

Genre combinations such as Crime, Drama, Mystery; Drama, Romance; and Drama, Fantasy, Horror include series that run for more than ten seasons, while in many other genres most shows conclude within five to ten seasons.

Do ratings change much across seasons?

For many shows, ratings remain fairly steady, fluctuating within a narrow range of one to two points. Some series show clearer trends, either gradually improving or steadily declining as additional seasons are released.

Can you identify that show on Drama, Family, Fantasy whose ratings just plummeted?

tv_ratings |>
  filter(genres == "Drama,Family,Fantasy") |>
  select(title, seasonNumber, av_rating)

# A tibble: 9 × 3
  title                          seasonNumber av_rating
  <chr>                                 <dbl>     <dbl>
1 Are You Afraid of the Dark?               1      9.17
2 Are You Afraid of the Dark?               2      9.24
3 Are You Afraid of the Dark?               3      9.43
4 Are You Afraid of the Dark?               4      9.31
5 Are You Afraid of the Dark?               5      8.95
6 Are You Afraid of the Dark?               6      6.02
7 Are You Afraid of the Dark?               7      6.83
8 R.L. Stine's The Haunting Hour            1      7.53
9 Touched by an Angel                       5      9.6

The rating of Are You Afraid of the Dark? plummeted in season 6, dropping by about 3 points (from 8.95 to 6.02).
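The size of the drop can be computed rather than read off the table. The following is a minimal sketch, assuming the seven ratings are entered exactly as printed above; the object names ayaotd, drops, and biggest_drop are hypothetical.

```r
library(dplyr)
library(tibble)

# Ratings for "Are You Afraid of the Dark?" as printed in the tibble above
ayaotd <- tibble(
  seasonNumber = 1:7,
  av_rating    = c(9.17, 9.24, 9.43, 9.31, 8.95, 6.02, 6.83)
)

# Season-over-season change: current rating minus the previous season's rating
drops <- ayaotd |>
  arrange(seasonNumber) |>
  mutate(change = av_rating - lag(av_rating))

# Season with the largest decline (most negative change)
biggest_drop <- drops |>
  slice_min(change, n = 1)

biggest_drop
```

The same mutate() step could be applied to tv_ratings after group_by(title) to look for similar drops in other shows.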

Question 3

top_rated <- tv_ratings |> 
  filter(av_rating >= 9)

ggplot(top_rated, aes(x = genres)) +
  geom_bar() +
  labs(
    title = "Genres in Top-Rated Shows (Rating >= 9)",
    x = "Genre",
    y = "Count"
  ) +
  scale_y_continuous(breaks = scales::pretty_breaks()) +
  coord_flip()

What coord_flip() does:

The coord_flip() function swaps the x and y axes, so the genre labels are listed along the vertical axis and read horizontally instead of being crowded along the horizontal axis, which greatly improves readability.
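As a small illustration of the axis swap, the sketch below builds the same bar chart two ways: once with coord_flip(), and once by mapping genres to y directly, which geom_bar() also accepts. The genre strings come from the glimpse() output in Question 1; the counts in this toy data frame are illustrative only.

```r
library(ggplot2)

# Toy data: genre strings seen in the glimpse() output; counts are illustrative
toy <- data.frame(
  genres = c("Drama,Mystery,Sci-Fi",
             "Adventure,Drama,Mystery",
             "Adventure,Drama,Mystery")
)

# Version 1: map genres to x, then swap the axes
p_flip <- ggplot(toy, aes(x = genres)) +
  geom_bar() +
  coord_flip()

# Version 2: map genres to y directly, no coord_flip() needed
p_y <- ggplot(toy, aes(y = genres)) +
  geom_bar()
```

Both versions should draw horizontal bars with the genre labels on the vertical axis; which one to use is mostly a matter of preference.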

Genre with the most top-rated shows: Drama.

Question 4

comedies_dramas <- tv_ratings |> 
  mutate(is_comedy = if_else(str_detect(genres, "Comedy"), 
                             1, 
                             0)) |> # If it contains the word comedy then 1, else 0
  filter(is_comedy == 1 | genres == "Drama") |> # Keep comedies and dramas
  mutate(genres = if_else(genres == "Drama", # Make it so that we only have those two genres
                          "Drama", 
                          "Comedy"))

glimpse(comedies_dramas)

Rows: 684
Columns: 8
$ titleId      <chr> "tt0312081", "tt0312081", "tt0312081", "tt1225901", "tt12…
$ seasonNumber <dbl> 1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 1, 25, 1, 1, 2, 3, 4, 5, 1,…
$ title        <chr> "8 Simple Rules", "8 Simple Rules", "8 Simple Rules", "90…
$ date         <date> 2002-09-17, 2003-11-04, 2004-11-12, 2009-01-03, 2009-11-…
$ av_rating    <dbl> 7.5000, 8.6000, 8.4043, 7.1735, 7.4686, 7.6858, 6.8344, 7…
$ share        <dbl> 0.03, 0.10, 0.06, 0.40, 0.14, 0.10, 0.04, 0.01, 0.48, 0.4…
$ genres       <chr> "Comedy", "Comedy", "Comedy", "Comedy", "Comedy", "Comedy…
$ is_comedy    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …

ggplot(comedies_dramas, aes(x = av_rating, fill = genres)) +
  geom_density(alpha = 0.6) +
  labs(
    title = "Distribution of Average Ratings: Comedies vs Dramas",
    x = "Average Rating",
    y = "Density",
    fill = "Genre"
  )

How does my prediction above hold? Are dramas rated higher?

Yes, dramas are rated slightly higher overall: their density curve sits slightly to the right of the comedy curve, which suggests a higher mean and median.
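A reading taken from the density plot could be confirmed with summary statistics. The sketch below is one way to do that, under the assumption that a per-genre mean and median are the quantities of interest; rating_summary is a hypothetical helper, and the toy data contains only four rows printed elsewhere in this document, so its output says nothing about the full dataset.

```r
library(dplyr)
library(tibble)

# Four rows as printed in this document: the first three comedies_dramas rows
# (8 Simple Rules, recoded as Comedy) and the Dekalog row (Drama)
toy <- tibble(
  genres    = c("Comedy", "Comedy", "Comedy", "Drama"),
  av_rating = c(7.5000, 8.6000, 8.4043, 8.22)
)

# Hypothetical helper: row count, mean, and median rating per genre
rating_summary <- function(data) {
  data |>
    group_by(genres) |>
    summarise(
      n             = n(),
      mean_rating   = mean(av_rating),
      median_rating = median(av_rating)
    )
}

rating_summary(toy)
```

Running rating_summary(comedies_dramas) on the real data would give the figures needed to back up the comparison, and the n column also shows how many rows each genre contributes.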

Question 5

ggplot(comedies_dramas, aes(x = av_rating, fill = genres)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  labs(
    title = "Distribution of Average Ratings: Comedies vs Dramas (Histogram)",
    x = "Average Rating",
    y = "Count",
    fill = "Genre"
  )

ggplot(comedies_dramas, aes(x = av_rating, color = genres)) +
  geom_freqpoly(bins = 30) +
  labs(
    title = "Distribution of Average Ratings: Comedies vs Dramas (Frequency Polygon)",
    x = "Average Rating",
    y = "Count",
    color = "Genre"
  )

Using the histogram, I observe that the counts for dramas are much smaller than those for comedies, which shows that the dataset contains more comedies than dramas.

Which plot is most informative: I think the frequency polygon is the most informative, as it shows both the difference in sample size and the shape of each distribution, combining the advantages of the histogram and the density plot.
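If the goal is to compare the shapes of the two distributions regardless of group size, a frequency polygon can also be drawn on a density scale. The sketch below is an optional variant, not something the assignment requires; it uses after_stat(density) and the same four toy rows printed elsewhere in this document.

```r
library(ggplot2)

# Toy data: three Comedy rows (8 Simple Rules) and one Drama row (Dekalog),
# with ratings as printed in this document
toy <- data.frame(
  genres    = c("Comedy", "Comedy", "Comedy", "Drama"),
  av_rating = c(7.5000, 8.6000, 8.4043, 8.22)
)

# Frequency polygon on a density scale: each genre's curve integrates to 1,
# so the curves are comparable even when the group sizes differ
p_density <- ggplot(toy, aes(x = av_rating, color = genres)) +
  geom_freqpoly(aes(y = after_stat(density)), bins = 30) +
  labs(x = "Average Rating", y = "Density", color = "Genre")
```

The trade-off is that the density scale hides the difference in sample size, which the count-based version keeps visible.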

Question 6

ggplot(comedies_dramas, aes(x = av_rating, y = share)) +
  geom_bin_2d() +
  labs(
    title = "Relationship between Average Rating and Viewership Share",
    x = "Average Rating",
    y = "Viewership Share",
    fill = "Count"
  )

`stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.

The plot shows that most shows have an average rating between 7 and 9 and a share between 0% and 10%, with a few outliers. Unlike a scatterplot, the 2-D binned heatmap encodes how many observations fall in each bin through the fill color.
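The stat_bin2d() message above asks for an explicit bin size. A minimal sketch of how that could be set follows; the widths of 0.25 rating points and 1 share point are arbitrary assumptions chosen for illustration, and the toy rows are taken from values printed in this document.

```r
library(ggplot2)

# Toy data: first three comedies_dramas rows plus the Dekalog row, as printed
toy <- data.frame(
  av_rating = c(7.5000, 8.6000, 8.4043, 8.22),
  share     = c(0.03, 0.10, 0.06, 27.2)
)

# binwidth takes one width per axis: c(x width, y width).
# The values here are illustrative assumptions, not tuned choices.
p_bins <- ggplot(toy, aes(x = av_rating, y = share)) +
  geom_bin_2d(binwidth = c(0.25, 1)) +
  labs(
    x = "Average Rating",
    y = "Viewership Share",
    fill = "Count"
  )
```

Setting binwidth explicitly should also silence the message, since the default of bins = 30 is no longer used.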

ggplot(comedies_dramas, aes(x = av_rating, y = share, fill = genres)) +
  geom_bin_2d() +
  labs(
    title = "Relationship between Average Rating and Viewership Share by Genre",
    x = "Average Rating",
    y = "Viewership Share",
    fill = "Genre"
  )

`stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.

Comedies overall appear to have slightly higher ratings and share, as can be seen from the four orange bins above 10% share, while dramas sit mostly near the bottom of the graph, suggesting lower share.

Outlier:

comedies_dramas |>
  filter(share >= 20)

# A tibble: 1 × 8
  titleId   seasonNumber title   date       av_rating share genres is_comedy
  <chr>            <dbl> <chr>   <date>         <dbl> <dbl> <chr>      <dbl>
1 tt0092337            1 Dekalog 1990-04-13      8.22  27.2 Drama          0