Midterm Project

2025-10-30

The Dataset

The dataset used in this analysis contains 10,000 TV show records with rich metadata including titles, original names and languages, genre IDs, countries of origin, popularity scores, voting metrics, overviews, poster and backdrop paths, and first air dates.

This project explores patterns among the highest-rated shows, examining what characteristics they share, and analyzes them in several different ways to help understand the data better.

The dataset was obtained from Kaggle: 10000 Popular TV Shows Data set (https://www.kaggle.com/datasets/riteshswami08/10000-popular-tv-shows-dataset-tmdb).

Research Questions

Which shows get the highest ratings (vote_average)?
Do highly rated shows also have high popularity?
Does language / country relate to rating?
Can we predict a show’s rating using popularity and vote_count?

Summary Statistics

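# Mean and standard deviation of rating, popularity, and vote count
library(dplyr)  # provides %>% and summarize()

desc_table <- tv_clean %>%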
  summarize(
    avg_rating_mean = mean(vote_average, na.rm = TRUE),
    avg_rating_sd   = sd(vote_average, na.rm = TRUE),
    popularity_mean = mean(popularity, na.rm = TRUE),
    popularity_sd   = sd(popularity, na.rm = TRUE),
    votes_mean      = mean(vote_count, na.rm = TRUE),
    votes_sd        = sd(vote_count, na.rm = TRUE)
  )

knitr::kable(desc_table, digits = 3)

avg_rating_mean	avg_rating_sd	popularity_mean	popularity_sd	votes_mean	votes_sd
6.55	2.315	7.826	10.551	230.099	872.623

Ratings vs Popularity (ggplot scatter)
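
A minimal sketch of how a scatter plot like this one could be drawn with ggplot2. The toy_shows data frame and its values are made up for illustration; in the project the plot would presumably be built from tv_clean, and the exact styling used on the slide is not known.

```r
library(ggplot2)

# Toy stand-in for tv_clean (made-up values, for illustration only)
toy_shows <- data.frame(
  popularity   = c(2.1, 5.4, 8.0, 12.3, 30.5, 45.2),
  vote_average = c(6.1, 7.4, 5.9, 8.2, 7.0, 7.8)
)

# One point per show: popularity on the x-axis, rating on the y-axis
rating_vs_popularity <- ggplot(toy_shows, aes(x = popularity, y = vote_average)) +
  geom_point(alpha = 0.5) +
  labs(
    x     = "Popularity",
    y     = "Average rating (vote_average)",
    title = "Ratings vs Popularity"
  )

print(rating_vs_popularity)
```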

Ratings by Language

Popularity, Votes, and Rating (plotly 3D)

Average Rating by Language (plotly bar)
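
The per-language averages behind a bar chart like this could be computed roughly as follows. This is a sketch under assumptions: toy_shows and its values are hypothetical stand-ins for tv_clean, and the minimum of 2 shows per language is an illustrative threshold, not one taken from the project.

```r
library(dplyr)

# Toy stand-in for tv_clean (made-up values, for illustration only)
toy_shows <- data.frame(
  original_language = c("en", "en", "ja", "ja", "ko"),
  vote_average      = c(7.0, 8.0, 8.5, 7.5, 8.2)
)

rating_by_language <- toy_shows %>%
  group_by(original_language) %>%
  summarize(
    n_shows    = n(),                              # shows per language
    avg_rating = mean(vote_average, na.rm = TRUE), # mean rating per language
    .groups    = "drop"
  ) %>%
  filter(n_shows >= 2) %>%    # drop languages with too few shows (assumed cutoff)
  arrange(desc(avg_rating))   # highest average first

print(rating_by_language)
```

Filtering out languages with very few shows keeps a single unusually rated title from dominating the comparison.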

Statistical Model: Can we predict rating?

We regressed vote_average (rating) on popularity and vote_count.
Both predictors have positive coefficients, meaning higher popularity and more votes are generally associated with higher ratings, although the estimates are small (about 0.012 rating points per unit of popularity and 0.00034 per vote).
However, the model's R² indicates that these two predictors explain only a small portion of the variation in ratings.

Linear Regression Results: vote_average ~ popularity + vote_count
Term	Estimate	Std. Error	t value
(Intercept)	6.38136	0.02859	223.22087
popularity	0.01165	0.00250	4.66970
vote_count	0.00034	0.00003	11.20575
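
A table of this shape can be produced with base R's lm() and summary(). The sketch below uses simulated data, so its numbers will not match the results above; toy_shows, the seed, and the simulation parameters are all assumptions chosen only to make the example run on its own.

```r
# Simulated stand-in for tv_clean (made-up data, for illustration only)
set.seed(1)
n <- 200
toy_shows <- data.frame(
  popularity = rexp(n, rate = 1 / 8),
  vote_count = rpois(n, lambda = 200)
)
toy_shows$vote_average <- 6.4 +
  0.01 * toy_shows$popularity +
  0.0003 * toy_shows$vote_count +
  rnorm(n, sd = 2)

# Same model formula as in the results table
fit <- lm(vote_average ~ popularity + vote_count, data = toy_shows)

# Columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
# Share of the variance in vote_average explained by the model
r_squared <- summary(fit)$r.squared

print(round(coef_table, 5))
print(r_squared)
```

Each t value is the estimate divided by its standard error, which is why a small coefficient can still have a large t value when it is estimated precisely.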

Takeaways

Popular shows with a high number of votes are not always the highest rated, but many highly rated shows also attract large audiences.
Some languages show slightly higher typical ratings, suggesting regional or cultural taste patterns.
Popularity and vote_count explain only part of the variation in rating: quality and hype aren’t identical.

Appendix: Data Notes / Source

Rows: 10,000 TV shows.

Columns include: name, original_language, origin_country, first_air_date, popularity, vote_average, vote_count, plus text fields such as overview and image paths.

Source: Kaggle (10000 Popular TV Shows Data set) (https://www.kaggle.com/datasets/riteshswami08/10000-popular-tv-shows-dataset-tmdb).

Thank You!

Thank you for viewing my presentation!
I appreciate your time and attention.

Questions or Feedback?
vbhatia8@asu.edu