Assignment #7 - Ethical Web Scraping

IMDb Top 250 TV Shows

Television has continued to evolve with the rise of streaming platforms and on-demand viewing. Streaming services like Netflix, Hulu, Disney+, and HBO Max have reshaped how people watch shows, which has led to both long-running series and shorter limited mini-series becoming more common. IMDb is a useful way to look at these shifts because its Top 250 TV Shows ranking is based on user ratings, so it reflects audience opinions rather than critics' reviews or awards. On a personal level, I also really enjoy starting new TV shows, so it was interesting to look through the highest-rated ones and see what kinds of series tend to stand out to viewers over time.

This project explores the question: How have audience preferences and television trends shaped IMDb’s top-ranked TV shows?

Introduction to Dataset

The first step is to load the required packages and import the scraped IMDb dataset.

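library(readr)      # read_csv() for importing the scraped data
library(tidyverse)  # dplyr and ggplot2 for wrangling and plotting
library(lubridate)  # date helpers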

imdb_tvshows <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/clorem_xavier_edu/IQCjYtSNRuKMSYOUTggXp-CYAZguOyIcd3RkUrkZE71Fx2g?download=1")

Now that the data is loaded, we can begin exploring and visualizing it. The dataset contains 250 distinct television shows from IMDb’s Top 250 TV Shows list. Each row represents a single TV show, and the original scraped variables include:

  • tvshow_title: name of the TV show

  • rating_of_show: content rating (e.g., TV-MA, TV-14)

  • series_type: type of show (TV series or mini-series)

  • stars_of_tvshow: IMDb user rating (out of 10)

  • start_year: year the show first aired

  • end_year: year the show ended (if applicable)

Additional variables were created during the data wrangling process to support the analysis. These include:

  • show_age: number of years since release

  • show_length: number of years the show ran

  • rating_group: grouped content rating category (Kids, Family, Teen, Mature)

Data Wrangling

After scraping, the dataset was cleaned and transformed for analysis. IMDb ratings and release years were converted to numeric types so they could be used in calculations and plots. Two new variables were created: show age, which measures how many years have passed since a show’s release, and show length, which captures the number of years a show ran, accounting for ongoing series.
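A minimal sketch of how these two variables could be derived, assuming the column names listed earlier, a missing end_year for shows that are still running, and the current calendar year as the reference point. The toy rows and the reference_year name are illustrative only and are not taken from the scraped data.

```r
library(dplyr)
library(lubridate)

# Toy rows for illustration only (not the scraped data)
toy_shows <- tibble(
  tvshow_title = c("Show A", "Show B", "Show C"),
  start_year   = c("2008", "2019", "2021"),
  end_year     = c("2013", "2019", NA)  # NA marks a show assumed to be ongoing
)

# Assumption: ages are measured from the current calendar year
reference_year <- year(today())

toy_shows <- toy_shows %>%
  mutate(
    # Scraped values arrive as text, so convert them before doing arithmetic
    start_year  = as.numeric(start_year),
    end_year    = as.numeric(end_year),
    show_age    = reference_year - start_year,
    # Ongoing shows are treated as running through the reference year
    show_length = coalesce(end_year, reference_year) - start_year
  )

print(toy_shows)
```

Under this definition, a show that starts and ends in the same year has a show_length of 0. Adding 1 would count calendar years instead, which is an equally reasonable choice.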

Additionally, content ratings were grouped into broader categories (Kids, Family, Teen, and Mature) to simplify comparisons across audience maturity levels. Missing values and irrelevant categories such as “Not Rated” and “Approved” were removed to improve interpretability.
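The grouping step might look something like the sketch below. The mapping from individual ratings to the four groups is an assumption made for illustration, since the exact assignment is not spelled out here, and the toy rows are not from the scraped data.

```r
library(dplyr)

# Toy rows for illustration only (not the scraped data)
toy_ratings <- tibble(
  tvshow_title   = c("Show A", "Show B", "Show C", "Show D", "Show E", "Show F"),
  rating_of_show = c("TV-MA", "TV-14", "TV-PG", "TV-Y7", "Not Rated", NA)
)

toy_grouped <- toy_ratings %>%
  # Drop missing ratings and the categories treated as irrelevant
  filter(
    !is.na(rating_of_show),
    !rating_of_show %in% c("Not Rated", "Approved")
  ) %>%
  mutate(
    # Assumed mapping from content rating to audience group
    rating_group = case_when(
      rating_of_show %in% c("TV-Y", "TV-Y7", "TV-G") ~ "Kids",
      rating_of_show == "TV-PG"                      ~ "Family",
      rating_of_show == "TV-14"                      ~ "Teen",
      rating_of_show == "TV-MA"                      ~ "Mature"
    )
  )

print(toy_grouped)
```

Any rating that matches none of the conditions would receive NA from case_when(), which makes unmapped categories easy to spot before plotting.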

Analysis & Visualizations

1. Distribution of TV Show Ages

This visualization shows the distribution of TV show ages within IMDb’s Top 250 rankings. It helps identify whether the dataset is dominated by newer or older television shows.

This distribution is positively skewed, with the vast majority of IMDb’s Top 250 TV shows having been released within the last 15 years. It peaks around 5-10 years since release, which likely reflects the rise and dominance of streaming platforms and the increased production of higher-quality content in the streaming era. There is also a smaller secondary cluster around 20-22 years, suggesting that early-2000s series have maintained lasting popularity and cultural relevance. In contrast, very few shows older than 40 years appear in the rankings. Overall, this histogram indicates that newer TV content tends to dominate IMDb’s top-rated list, while a smaller number of older series still appear among the highest-ranked shows.
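A histogram like the one described above could be drawn along these lines. This is a sketch on made-up ages, and the 5-year bin width is an assumption rather than the setting used for the original plot.

```r
library(ggplot2)

# Made-up ages for illustration only (not the scraped data)
toy_ages <- data.frame(
  show_age = c(2, 4, 5, 6, 7, 8, 9, 10, 12, 15, 21, 22, 45)
)

age_hist <- ggplot(toy_ages, aes(x = show_age)) +
  # Assumed 5-year bins starting at 0
  geom_histogram(binwidth = 5, boundary = 0, color = "white") +
  labs(
    title = "Distribution of TV Show Ages",
    x = "Years since release",
    y = "Number of shows"
  )

print(age_hist)

# Quick numeric check of positive skew: the mean sits above the median
is_right_skewed <- mean(toy_ages$show_age) > median(toy_ages$show_age)
print(is_right_skewed)
```

Comparing the mean with the median is only a rough indicator of skew, but it is a simple way to back up what the shape of the histogram suggests.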

2. Distribution of TV Show Lengths

This plot examines how long television shows in the Top 250 tend to run.

The histogram shows that most of IMDb’s Top 250 TV shows are relatively short-lived, with the overwhelming majority running for fewer than 5 years. The largest group consists of shows that lasted around 1-2 years, which aligns with the rise of limited and mini-series formats. From there, the count declines fairly steadily as show length increases. However, shows in the 5-10 year range still appear in notable numbers, suggesting that some longer-running series have also secured a place among the highest-rated TV shows. Beyond 10 years, the count drops off dramatically, with only a small handful of shows making it past the 20-year mark.

3. Total Number of TV Shows by Type

This visualization compares the frequency of different types of TV shows, such as TV series and mini-series, within the Top 250 rankings.

The bar chart shows a clear difference between the number of shows classified as series versus mini-series. Approximately 210 of IMDb’s Top 250 shows are full series, while around 40 are mini-series. This means that traditional multi-season series make up about 84% of the rankings (roughly 210 of 250), indicating that longer-form television continues to dominate the list even as limited series have become more common in recent years.
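The counts and shares behind a chart like this can be computed directly. The sketch below uses the approximate counts quoted above (210 and 40) rather than the scraped data, and the series_type labels are assumed.

```r
library(dplyr)
library(ggplot2)

# Approximate counts from the text, with assumed labels (not the scraped data)
toy_types <- tibble(
  series_type = c(rep("TV Series", 210), rep("TV Mini Series", 40))
)

type_share <- toy_types %>%
  count(series_type) %>%                       # one row per type with its count n
  mutate(share = round(100 * n / sum(n), 1))   # percentage of all shows

print(type_share)

type_bar <- ggplot(type_share, aes(x = series_type, y = n)) +
  geom_col() +
  labs(
    title = "Total Number of TV Shows by Type",
    x = "Type of show",
    y = "Number of shows"
  )

print(type_bar)
```

With these counts, full series account for 210 / 250 = 84% of the list and mini-series for the remaining 16%.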

4. Relationship between IMDb Ratings & Show Length

This scatterplot examines the relationship between how long a show runs and its IMDb rating.

The scatterplot suggests that the highest-rated shows on IMDb tend to be shorter in duration, as nearly all shows rated 9.25 and above ran for 7 years or fewer. This pattern may indicate that more tightly structured or limited series are associated with higher user ratings. In contrast, shows that run for 15 or more years tend to cluster in the 8.5-9.0 range, with no long-running series reaching the very highest ratings.
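As a rough sketch, the plot and the 9.25 cutoff mentioned above could be checked as follows. The data points are invented to mimic the described pattern and are not the scraped values.

```r
library(ggplot2)

# Invented points for illustration only (not the scraped data)
toy_scatter <- data.frame(
  show_length     = c(1, 2, 5, 7, 10, 16, 25),
  stars_of_tvshow = c(9.3, 9.4, 9.5, 9.3, 8.9, 8.7, 8.6)
)

length_vs_rating <- ggplot(toy_scatter, aes(x = show_length, y = stars_of_tvshow)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "IMDb Rating vs. Show Length",
    x = "Years the show ran",
    y = "IMDb rating (out of 10)"
  )

print(length_vs_rating)

# Longest run among the toy shows rated 9.25 or higher
top_rated <- toy_scatter[toy_scatter$stars_of_tvshow >= 9.25, ]
max_length_top <- max(top_rated$show_length)
print(max_length_top)
```

Filtering on the rating threshold and then taking the maximum length turns the visual impression from the scatterplot into a number that can be reported.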

5. Distribution of IMDb Ratings by Content Rating Group

This boxplot compares IMDb ratings across different content rating categories.

This boxplot indicates that the Kids content group has the highest median IMDb rating of the four categories, sitting just under 9.0. It also shows a relatively wide interquartile range, which suggests that while many kids’ shows are rated highly, there is also some variation in scores within this group. In contrast, Mature content, despite being the most represented category in the Top 250, has a lower median rating and a comparatively tighter spread. This suggests that while adult dramas and thrillers make up a large portion of the list, their ratings tend to cluster more consistently within a narrower range. Family and Teen content have similar median ratings of around 8.7.
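The medians and interquartile ranges read off the boxplot can also be computed per group. The sketch below uses invented ratings, so its numbers are illustrative and do not reproduce the results above.

```r
library(dplyr)
library(ggplot2)

# Invented ratings for illustration only (not the scraped data)
toy_box <- tibble(
  rating_group    = rep(c("Kids", "Family", "Teen", "Mature"), each = 3),
  stars_of_tvshow = c(8.6, 8.9, 9.2,
                      8.6, 8.7, 8.8,
                      8.6, 8.7, 8.9,
                      8.5, 8.6, 8.7)
)

group_summary <- toy_box %>%
  group_by(rating_group) %>%
  summarise(
    n_shows       = n(),                     # how many shows fall in the group
    median_rating = median(stars_of_tvshow), # center line of the box
    iqr_rating    = IQR(stars_of_tvshow)     # height of the box
  )

print(group_summary)

rating_boxplot <- ggplot(toy_box, aes(x = rating_group, y = stars_of_tvshow)) +
  geom_boxplot() +
  labs(
    title = "IMDb Ratings by Content Rating Group",
    x = "Content rating group",
    y = "IMDb rating (out of 10)"
  )

print(rating_boxplot)
```

Reporting n_shows alongside the median is useful here because the groups differ in size, and a wide box in a small group is less informative than the same spread in a large one.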

Conclusion

Overall, this analysis suggests that audience preferences and modern TV trends have strongly influenced IMDb’s top-ranked TV shows. Newer content released during the streaming era dominates the rankings, while shorter and more tightly structured series appear to receive the highest ratings from viewers. Although traditional multi-season series still make up the majority of the Top 250, limited and mini-series formats have become increasingly prominent. In addition, the differences across content rating groups indicate that audience reception varies depending on the intended maturity level of the show. Ultimately, IMDb’s highest-ranked TV shows reflect a shift toward modern, highly produced storytelling that resonates with today’s streaming-focused audiences.