Link to data set
TV shows can be rated by several criteria, including audience size, critical reviews, and awards. The higher the ratings, the more successful the show is said to be. Ratings often reflect the number of viewers and their engagement with the show, as well as the quality of the writing, acting, and production. The most-watched shows tend to feature strong characters and engaging storylines and to attract a wide demographic of viewers. Popular genres such as comedy, drama, and reality television each have their own rating patterns, and shows that successfully combine multiple genres can stand out and achieve even higher ratings. Beyond ratings, awards are another way of measuring a show's success: honors such as the Emmy, Golden Globe, and Peabody recognize excellence in television and can make a show more popular. Overall, TV shows that feature strong characters and engaging storylines and are well produced tend to be rated highly.
In this project, the question we are trying to answer is whether we can predict a TV show's rating from other variables that describe the show. Our data set includes the following variables:
- `first_air_date`: the date when the show was first aired on television
- `origin_country`: the country where the show was created / originates from
- `original_language`: the original language of the show
- `name`: the name of the show in English (names in the original language are not included in this data set)
- `popularity`: a metric that measures how popular a TV show is based on consumer views
- `vote_average`: the show's average rating, computed over all the votes it received
- `vote_count`: the number of votes the show received
- `overview`: a brief description of the show

tvshows = read_csv("/Users/nhinguyen/Desktop/School/Fall 2022/DS 3001/DS-3001/FINAL PROJECT/data_TV.csv")
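The report does not show its setup chunk, so the package list below is an assumption inferred from the functions called throughout; a chunk along these lines would cover them:

```r
# Assumed setup: packages supplying the functions used in this analysis
library(tidyverse)   # read_csv, dplyr verbs, tidyr::separate, forcats, ggplot2
library(tidytext)    # unnest_tokens, bind_tf_idf
library(tm)          # removeNumbers, removeWords, stopwords
library(caret)       # createDataPartition, trainControl, train, postResample
library(rpart.plot)  # rpart.plot
library(rmarkdown)   # paged_table
```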
# Summary statistics for the three numeric variables of interest
sum1 = summary(tvshows$popularity)
sum2 = summary(tvshows$vote_average)
sum3 = summary(tvshows$vote_count)
sum_df = rbind(sum1, sum2, sum3)
sum_row = c("Popularity","Average Rating","Vote Count")
rownames(sum_df) = sum_row
knitr::kable(sum_df)
|  | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Popularity | 0.866 | 16.567 | 27.489 | 59.806 | 49.765 | 6684.611 |
| Average Rating | 0.600 | 7.300 | 7.700 | 7.692 | 8.100 | 9.000 |
| Vote Count | 99.000 | 150.000 | 257.000 | 604.823 | 569.000 | 19459.000 |
ggplot(data=tvshows, aes(x=vote_average)) +
geom_histogram(fill="steelblue", color="black") +
ggtitle("Histogram of Vote Averages")+
theme_minimal()
# Some origin_country entries start with a lowercase 'c'; these appear to be
# multi-country values serialized as R vectors, so recode them as "Multiple"
weird = startsWith(tvshows$origin_country, 'c')
tvshows$origin_country[weird] = "Multiple"
# mice::md.pattern(tvshows, rotate.names = TRUE)
# Around 65-70 missing pieces of data
# We initially planned to keep these rows (dropping them means losing ~70 shows),
# but the missing values would have interfered with the text-mining step below
tvshows = tvshows[complete.cases(tvshows),]
# Separated original date column into `year`, `month`, and `day` for further analysis purposes
tvshows = tvshows %>%
separate(first_air_date, c('year', 'month', 'day'), '-') %>%
mutate_at(c('year', 'month', 'day'), as.numeric)
tvshows$origin_country = fct_collapse(tvshows$origin_country,
N_America = c("CA"),
S_America = c("AR", "BR", "CL", "CO", "PE"),
Asia = c("CN", "IL", "IN", "JO", "PH", "TH", "TR"),
Europe = c("AT", "BE", "DE", "DK", "ES", "FR", "IE", "IS", "IT", "NO", "PL", "RU", "SE"),
Africa = c("EG", "ZA"),
Other = c("NZ", "PR", "AU"))
# original_language: keep en, es, ja, ko as-is; collapse the rest into "other"
tvshows$original_language = fct_collapse(tvshows$original_language,
other = c("ar","ca","da","de","fr","he","hi","is","it",
"nl","no","pl","pt","ru","sv","th","tl","tr","zh"))
paged_table(tvshows)
Besides the usual data cleaning steps, we decided to break the date variable into three new variables: `year`, `month`, and `day` (e.g., a first air date of 2008-01-20 becomes year 2008, month 1, day 20). We did this to explore whether the release time period correlates with rating. Collapsing our factor variables was quite challenging because some categories had far more observations than others. For `origin_country`, we kept the most common countries (i.e., US, GB, MX, JP, and KR) as they were and sorted the other countries into their respective continents. We followed a similar process for `original_language`, where English, Spanish, Japanese, and Korean were kept as-is and the other languages were collapsed into "other".
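As a quick check (a hypothetical snippet, not part of the original analysis), `forcats::fct_count` can confirm the collapsed levels and their frequencies:

```r
# Inspect the collapsed factor levels, largest groups first
fct_count(tvshows$origin_country, sort = TRUE)
fct_count(tvshows$original_language, sort = TRUE)
```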
To deal with our `overview` variable, we decided to do some simple text mining to turn its unstructured text format into data that was more suitable for analysis. The method we utilized was tf-idf.
# Source: https://stackoverflow.com/questions/30900229/performing-text-analytics-on-a-text-column-in-dataframe-in-r
text1 = tvshows[,c("year","name","overview")]
text1$overview = tm::removeNumbers(text1$overview)                               # strip digits
text1$overview = str_replace_all(text1$overview, pattern = "[[:punct:]]", " ")   # replace punctuation with spaces
text1$overview = tm::removeWords(x = text1$overview, stopwords(kind = "SMART"))  # drop SMART stop words
word_count = text1 %>%
unnest_tokens(word, overview) %>%
count(year, name, word, sort = TRUE)
total_words = word_count %>%
group_by(year, name) %>%
summarize(total = sum(n))
overview_words = inner_join(word_count, total_words)
overview_words = overview_words %>%
bind_tf_idf(word, name, n)
word_im = overview_words %>%
  group_by(year, name) %>%
  top_n(1, tf_idf)                         # keep each show's highest tf-idf word
word_im = word_im %>%
  distinct(year, name, .keep_all = TRUE)   # break ties so each show keeps exactly one word
tv_shows = inner_join(tvshows, word_im)    # attach each show's top word and its tf-idf score
tv_shows = tv_shows[-c(10:15)]             # drop the raw overview and intermediate columns, keeping tf_idf
paged_table(word_im)
Term Frequency (tf): how frequently the word appears in a show's overview
Inverse Document Frequency (idf): how much information the word provides across all of the overviews (i.e., is it common or rare?)
Term Frequency-Inverse Document Frequency (tf-idf): the product of tf and idf; how important the word is to its own overview
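Concretely, for a term $t$ in the overview $d$ of a given show (these are the standard definitions; tidytext's `bind_tf_idf` uses a natural-log idf):

$$\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad \mathrm{idf}(t) = \ln\frac{N}{N_t}, \qquad \text{tf-idf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)$$

where $n_{t,d}$ is how often $t$ appears in $d$, $N$ is the total number of overviews, and $N_t$ is the number of overviews containing $t$. A word that appears often in one overview but in few others gets a high tf-idf score.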
paged_table(tv_shows)
# Split the data into train, tune, and test sets (80/10/10)
# Convert character columns to factors so caret/rpart can handle them
tv_shows[sapply(tv_shows, is.character)] = lapply(tv_shows[sapply(tv_shows, is.character)], as.factor)
set.seed(1130)
part_index_1 = caret::createDataPartition(tv_shows$vote_average,
times=1,
p = 0.80,
groups=1,
list=FALSE)
train = tv_shows[part_index_1, ] # subset so it's only training data from part_index_1
tune_and_test = tv_shows[-part_index_1, ] # everything else that's not training data
set.seed(1130)
tune_and_test_index = createDataPartition(tune_and_test$vote_average, # split the tune and test set 50-50
p = .5,
list = FALSE,
times = 1)
tune = tune_and_test[tune_and_test_index, ] # subset the 50% chosen into the tune set
test = tune_and_test[-tune_and_test_index, ] # subset the remaining 50% into the test set
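As a sanity check (hypothetical snippet, not in the original analysis), the partition sizes should come out to roughly 80/10/10:

```r
# Row counts per partition; expect approximately 80%, 10%, and 10% of the data
c(train = nrow(train), tune = nrow(tune), test = nrow(test))
```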
features = train[,-c(8, 6)] # drop the target (vote_average) and the show name, which would distort variable importance later on
target = train$vote_average
# Step 1: Cross-validation
# Define how the training data will be resampled to build the initial model
fitControl = caret::trainControl(method = "repeatedcv",
number = 15,
repeats = 7)
# Step 2: Train a regression tree (rpart2 tunes over maxdepth)
set.seed(1130)
tvshows_mdl = caret::train(x=features,
y=target,
method="rpart2",
trControl=fitControl,
metric="RMSE")
tvshows_mdl # maxdepth = 3
## CART
##
## 2039 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (15 fold, repeated 7 times)
## Summary of sample sizes: 1904, 1903, 1903, 1903, 1903, 1904, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 0.5696701 0.1662966 0.4343963
## 2 0.5363837 0.2629091 0.4049771
## 3 0.5282397 0.2855980 0.3963330
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 3.
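The same resampling summary is available programmatically from the fitted object (standard caret output, shown here as an aside):

```r
# Per-maxdepth RMSE, Rsquared, and MAE averaged across the repeated folds
tvshows_mdl$results
```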
We used cross validation to build our model on the train set, and the most optimal model was the one that utilized a maxdepth value of 3. Looking at the variable importance and the tree, we became concerned about the `vote_count` variable because of its high importance value and because it was one of only two variables used in the tree. To combat this, we built a new model without that variable, but found it made things worse: the tree decreased in size and the evaluation metrics did not improve. We therefore decided to proceed with this model as our final model.
plot(tvshows_mdl)
# range(tv_shows$vote_average)
# NRMSE = RMSE / range(vote_average) = 0.5282397/(9-0.6) = 0.06288568 for model 1 with maxdepth = 3
varImp(tvshows_mdl)
## rpart2 variable importance
##
## Overall
## vote_count 100.000
## popularity 48.841
## original_language 34.001
## origin_country 33.605
## day 8.088
## year 5.405
## tf_idf 1.212
## month 0.000
rpart.plot(tvshows_mdl$finalModel, type=5,extra=101)
tvshows_pred_tune = predict(tvshows_mdl,tune)
tune_eval = postResample(pred = tvshows_pred_tune, obs = tune$vote_average)
# NRMSE = RMSE divided by the range of observed ratings in the tune set (8.7 - 5.5)
tune_nrmse = tune_eval["RMSE"] / diff(range(tune$vote_average))
tune_eval1 = data.frame(tune_eval)
tune_eval1 = rbind(tune_eval1, tune_nrmse)
row1 = c("RMSE","Rsquared","MAE","NRMSE")
coln = c("Value")
rownames(tune_eval1) = row1
colnames(tune_eval1) = coln
knitr::kable(tune_eval1)
|  | Value |
|---|---|
| RMSE | 0.5180544 |
| Rsquared | 0.2596263 |
| MAE | 0.4110412 |
| NRMSE | 0.1618920 |
# Refit the chosen specification (same seed and settings) as the final model
set.seed(1130)
tvshows_mdl_final = caret::train(x=features,
y=target,
method="rpart2",
trControl=fitControl,
metric="RMSE")
# tvshows_mdl_final
# rpart.plot(tvshows_mdl_final$finalModel, type = 5, extra=101)
pred_test_reg = predict(tvshows_mdl_final, test)
# View(as_tibble(pred_test_reg))
test_eval = postResample(pred = pred_test_reg, obs = test$vote_average)
# NRMSE = RMSE divided by the range of observed ratings in the test set (8.7 - 5.5)
test_nrmse = test_eval["RMSE"] / diff(range(test$vote_average))
test_eval1 = data.frame(test_eval)
test_eval1 = rbind(test_eval1, test_nrmse)
rownames(test_eval1) = row1
colnames(test_eval1) = coln
knitr::kable(test_eval1)
|  | Value |
|---|---|
| RMSE | 0.4948011 |
| Rsquared | 0.3420358 |
| MAE | 0.3804108 |
| NRMSE | 0.1546253 |
We did not identify any protected class in our data set. However, we recognized that text mining that relies on natural language processing (NLP) might pose some issues. For example, some of the summaries were in a language other than English or contained non-English words. We thought this was acceptable for our situation because of the method we selected, but it would have been more of a concern under our original plan of doing sentiment analysis.
When we predicted on the test set with our final model, we got an RMSE of 0.49 and an R-squared of 0.34. An RMSE of 0.49 means our predictions were off by about half a rating point on average, so the model wasn't the best at predicting the observed data points. An R-squared of 0.34 shows that our predictors explained only about a third of the variance in ratings. However, this was probably the best we could do with our limited data set. Even though our decision tree was quite small, it showed that TV shows originating from Japan and South Korea performed better than those from other regions. This information could be helpful to streaming services, such as Netflix or Hulu, if they wanted to study what kind of content to add in the future.
In the future, it would be preferable to improve the data collection step itself, to ensure that the data set includes variables the model can differentiate and learn from. It would also help if the data set included more poorly rated shows, so the model could learn what distinguishes them; as the summary statistics show, most of our shows cluster between ratings of about 7 and 8.