Link to data set

Question & Background

TV shows can be rated by several criteria, including audience size, critical reviews, and awards. The higher the ratings, the more successful the show is said to be. Ratings are often based on the number of viewers and their engagement with the show, as well as the quality of the writing, acting, and production. The most-watched shows tend to be those that feature strong characters and engaging storylines and that attract a wide demographic of viewers. Popular genres such as comedy, drama, and reality television all have their own rating systems, and shows that successfully combine multiple genres can often stand out and achieve even higher ratings. In addition to ratings, awards are another way of measuring the success of a TV show. Awards such as the Emmy, Golden Globe, and Peabody recognize excellence in television, making a show more popular and successful. Overall, TV shows that feature strong characters and engaging storylines and are well produced tend to be rated highly.

In this project, the question we are trying to answer is whether we can predict a TV show's rating from the other variables that describe the show. Our data set includes the show's first air date, origin country, original language, name, average rating, number of votes, a brief description, and a popularity metric based on consumer views.

Exploratory Data Analysis

Data Dictionary

  1. first_air_date: The date when the show was first aired on television
  2. origin_country: The country where the show was created / originates from
  3. original_language: The original language of the show
  4. name: Name of the show in English. Note that names in the original language are not included in this data set
  5. popularity: A metric that measures how popular a TV show is based on consumer views
  6. vote_average: The show's average rating, i.e., the mean of the user votes it received
  7. vote_count: The number of votes the show received
  8. overview: A brief description of the show

Summaries

tvshows = read_csv("/Users/nhinguyen/Desktop/School/Fall 2022/DS 3001/DS-3001/FINAL PROJECT/data_TV.csv")

sum1 = summary(tvshows$popularity)
sum2 = summary(tvshows$vote_average)
sum3 = summary(tvshows$vote_count)

sum_df = rbind(sum1, sum2, sum3)
sum_row = c("Popularity","Average Rating","Vote Count")
rownames(sum_df) = sum_row
knitr::kable(sum_df)
                 Min.     1st Qu.   Median    Mean         3rd Qu.   Max.
Popularity       0.866    16.567    27.489    59.805976    49.765    6684.611
Average Rating   0.600    7.300     7.700     7.692434     8.100     9.000
Vote Count       99.000   150.000   257.000   604.823462   569.000   19459.000

Distribution of Average Rating

ggplot(data=tvshows, aes(x=vote_average)) +
  geom_histogram(fill="steelblue", color="black") +
  ggtitle("Histogram of Vote Averages")+
  theme_minimal()

Methods

Data Cleaning

# origin_country values that start with 'c' are list-like strings covering multiple countries; recode them as "Multiple"
weird = startsWith(tvshows$origin_country, 'c')
tvshows$origin_country[weird] = "Multiple"

# mice::md.pattern(tvshows, rotate.names = TRUE)

# Around 65-70 observations have missing data
# We weren't planning on dropping them, since that removes roughly 70 TV shows,
# but keeping them would have interfered with the sentiment analysis process
tvshows = tvshows[complete.cases(tvshows),]

# Separated original date column into `year`, `month`, and `day` for further analysis purposes
tvshows = tvshows %>%
  separate(first_air_date, c('year', 'month', 'day'), '-') %>%
  mutate_at(c('year', 'month', 'day'), as.numeric)

tvshows$origin_country = fct_collapse(tvshows$origin_country, 
                           N_America = c("CA"),
                           S_America = c("AR", "BR", "CL", "CO", "PE"),
                           Asia = c("CN", "IL", "IN", "JO", "PH", "TH", "TR"),
                           Europe = c("AT", "BE", "DE", "DK", "ES", "FR", "IE", "IS", "IT", "NO", "PL", "RU", "SE"),
                           Africa = c("EG", "ZA"),
                           Other = c("NZ", "PR", "AU"))

# original_language: keep en, es, ja, ko; collapse the rest into "other"

tvshows$original_language = fct_collapse(tvshows$original_language, 
                           other = c("ar","ca","da","de","fr","he","hi","is","it",
                           "nl","no","pl","pt","ru","sv","th","tl","tr","zh"))

paged_table(tvshows)

Besides the usual data cleaning steps, we decided to break the date variable into three new variables: year, month, and day. We did this to explore whether the release time period correlates with rating. Collapsing our factor variables was challenging because some categories had far more observations than others. For origin_country, we kept the most common countries (i.e., US, GB, MX, JP, and KR) as they were and sorted the remaining countries into their respective continents. We followed a similar process for original_language, where English, Spanish, Japanese, and Korean were kept as-is and the other languages were recategorized as "other".

Text Mining

To deal with our overview variable, we decided to do some simple text mining to turn its unstructured text format into data that was more suitable for data analysis. The method we utilized was tf-idf.

# Source: https://stackoverflow.com/questions/30900229/performing-text-analytics-on-a-text-column-in-dataframe-in-r
text1 = tvshows[,c("year","name","overview")]

text1$overview = tm::removeNumbers(text1$overview)
text1$overview = str_replace_all(text1$overview, pattern = "[[:punct:]]", " ")

text1$overview = tm::removeWords(x = text1$overview, stopwords(kind = "SMART"))

word_count = text1 %>%
  unnest_tokens(word, overview) %>%
  count(year, name, word, sort = TRUE)

total_words = word_count %>% 
  group_by(year, name) %>% 
  summarize(total = sum(n))

overview_words = inner_join(word_count, total_words)

overview_words = overview_words %>%
  bind_tf_idf(word, name, n)

word_im = overview_words %>% 
  group_by(year, name) %>%
  top_n(1, tf_idf)

word_im = word_im %>% 
  distinct(year, name, .keep_all = TRUE)

tv_shows = inner_join(tvshows, word_im)
tv_shows = tv_shows[-c(10:15)]

paged_table(word_im)

Term Frequency (tf): how frequently the word appears in a show's overview

Inverse Document Frequency (idf): how much information the word provides across the full set of overviews (i.e., is it common or rare?)

Term Frequency-Inverse Document Frequency (tf-idf): the product of tf and idf; how important the word is to its own overview relative to all the others
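
To make the relationship between these three quantities concrete, here is a minimal sketch of the underlying arithmetic, reusing the overview_words data frame built above (bind_tf_idf() performs the equivalent calculation internally, treating each show name as one "document"; the *_manual column names are illustrative):

# Manual illustration of the tf-idf arithmetic
n_overviews = dplyr::n_distinct(overview_words$name)

tfidf_check = overview_words %>%
  group_by(word) %>%
  mutate(n_overviews_with_word = dplyr::n_distinct(name)) %>%
  ungroup() %>%
  mutate(tf_manual     = n / total,                                # term frequency
         idf_manual    = log(n_overviews / n_overviews_with_word), # inverse document frequency
         tf_idf_manual = tf_manual * idf_manual)                   # their product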

paged_table(tv_shows)

Model

# Split the data into train, tune, and test sets (80/10/10)
tv_shows[sapply(tv_shows, is.character)] = lapply(tv_shows[sapply(tv_shows, is.character)], as.factor)

set.seed(1130)
part_index_1 = caret::createDataPartition(tv_shows$vote_average,  
                                           times=1,
                                           p = 0.80,
                                           groups=1,
                                           list=FALSE)

train = tv_shows[part_index_1, ]  # subset so it's only training data from part_index_1
tune_and_test = tv_shows[-part_index_1, ] # everything else that's not training data

set.seed(1130)
tune_and_test_index = createDataPartition(tune_and_test$vote_average,  # split the tune and test set 50-50
                                           p = .5,
                                           list = FALSE,
                                           times = 1)

tune = tune_and_test[tune_and_test_index, ]  # subset the 50% chosen into the tune set
test = tune_and_test[-tune_and_test_index, ]  # subset the remaining 50% into the test set

Results

features = train[,-c(8, 6)] # drop the target variable (vote_average) and the show name column, since the names would distort variable importance later on

target = train$vote_average

# Step 1: Cross Validation
# set how the training data will be resampled to build the initial model
fitControl = caret::trainControl(method = "repeatedcv",
                          number = 15,
                          repeats = 7) 

# Step 2: 
set.seed(1130)
tvshows_mdl = caret::train(x=features,
                y=target,
                method="rpart2",
                trControl=fitControl,
                metric="RMSE")

tvshows_mdl # maxdepth = 3
## CART 
## 
## 2039 samples
##    8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (15 fold, repeated 7 times) 
## Summary of sample sizes: 1904, 1903, 1903, 1903, 1903, 1904, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE       Rsquared   MAE      
##   1         0.5696701  0.1662966  0.4343963
##   2         0.5363837  0.2629091  0.4049771
##   3         0.5282397  0.2855980  0.3963330
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 3.

We used cross-validation to build our model on the train set, and the optimal model was the one that used a maxdepth value of 3. Looking at the variable importance and the tree shown below, we became concerned about the vote_count variable because of its high importance value and because it was one of only two variables used in the tree. To address this, we built a new model without that variable, but found it made things worse: the tree decreased in size and the evaluation metrics did not improve. We therefore decided to proceed with this model as our final model.
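
The code for that comparison model is not shown above, but a minimal sketch of how it could be refit, reusing the features, target, and fitControl objects defined earlier (the new object names are illustrative), would look like:

# Refit the same rpart2 model with vote_count removed from the feature set
features_no_votes = features[, names(features) != "vote_count"]

set.seed(1130)
tvshows_mdl_no_votes = caret::train(x = features_no_votes,
                                    y = target,
                                    method = "rpart2",
                                    trControl = fitControl,
                                    metric = "RMSE")

# tvshows_mdl_no_votes  # compare its RMSE, Rsquared, and MAE against tvshows_mdl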

Plot

plot(tvshows_mdl)  

# range(tv_shows$vote_average)
# 0.5282397/(9-0.6) # NRMSE = 0.06288568 based on model 1 with maxdepth = 3
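
The normalized RMSE (NRMSE) used in the comment above divides the RMSE by the range of the observed ratings, putting the error on a roughly 0-1 scale. As a small sketch, the same quantity can be computed without hardcoding the numbers, assuming the cross-validated results that caret stores in tvshows_mdl$results:

# NRMSE = RMSE / (max(observed) - min(observed))
cv_rmse = min(tvshows_mdl$results$RMSE)        # best cross-validated RMSE (0.5282397)
cv_rmse / diff(range(tv_shows$vote_average))   # ~0.0629, matching the comment above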


Variable Importance

varImp(tvshows_mdl) 
## rpart2 variable importance
## 
##                   Overall
## vote_count        100.000
## popularity         48.841
## original_language  34.001
## origin_country     33.605
## day                 8.088
## year                5.405
## tf_idf              1.212
## month               0.000


Tree

rpart.plot(tvshows_mdl$finalModel, type=5,extra=101)


Evaluations

Tune Set

tvshows_pred_tune = predict(tvshows_mdl,tune)

tune_eval = postResample(pred = tvshows_pred_tune, obs = tune$vote_average)

# range(tune$vote_average)
tune_nrmse = 0.5180544/(8.7-5.5) 

tune_eval1 = data.frame(tune_eval)
tune_eval1 = rbind(tune_eval1, tune_nrmse)
row1 = c("RMSE","Rsquared","MAE","NRMSE")
coln = c("Value")
rownames(tune_eval1) = row1
colnames(tune_eval1) = coln

knitr::kable(tune_eval1)
             Value
RMSE     0.5180544
Rsquared 0.2596263
MAE      0.4110412
NRMSE    0.1618920

Test Set

set.seed(1130)
tvshows_mdl_final = caret::train(x=features,
                y=target,
                method="rpart2",
                trControl=fitControl,
                metric="RMSE")

# tvshows_mdl_final

# rpart.plot(tvshows_mdl_final$finalModel, type = 5, extra=101)

pred_test_reg = predict(tvshows_mdl_final, test)
# View(as_tibble(pred_test_reg))

test_eval = postResample(pred = pred_test_reg, obs = test$vote_average)
# range(test$vote_average)
test_nrmse = 0.4948011/(8.7-5.5) 

test_eval1 = data.frame(test_eval)
test_eval1 = rbind(test_eval1, test_nrmse)

rownames(test_eval1) = row1
colnames(test_eval1) = coln

knitr::kable(test_eval1)
             Value
RMSE     0.4948011
Rsquared 0.3420358
MAE      0.3804108
NRMSE    0.1546253

Fairness Assessment

We did not identify any protected class in our data set. However, we recognized that text-mining work that relies on natural language processing (NLP) might pose some issues. For example, some of the summaries were in a language other than English or contained non-English words. We thought this was acceptable in our situation because of the method we selected, but it would be more of a concern if we had gone with our original plan of doing sentiment analysis.
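
As a rough illustration of how such cases could be surfaced, here is a base-R sketch that flags overviews containing non-ASCII characters (only a heuristic; it would miss non-English text written entirely in the Latin alphabet):

# Heuristic check: flag overviews with characters outside the printable ASCII range
non_ascii = grepl("[^ -~]", tvshows$overview)
sum(non_ascii)              # number of overviews containing non-ASCII characters
# tvshows$name[non_ascii]   # inspect which shows they belong to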

Conclusions

When we predicted on the test set with our final model, we got an RMSE of 0.49 and an R-squared of 0.34. An RMSE of 0.49 means our model was not especially accurate at predicting the observed ratings, and an R-squared of 0.34 means our predictors explained only about a third of the variation in ratings. However, this was probably the best we could do with our limited data set. Even though our decision tree was quite small, it showed that TV shows originating from Japan and South Korea performed better than those from other regions. This information could be helpful to streaming services, such as Netflix or Hulu, if they wanted to study what kind of content to add in the future.

Future Work

In the future, it would be preferable to improve the data collection step itself. This would help ensure that the data set includes variables the model can differentiate and learn from. It would also help if the data set included shows that were rated poorly, so that the model could learn what distinguishes them.