library(ggplot2)
library(dplyr)
library(statsr)
library(caret) # machine learning and pre processing functions
library(gridExtra) # combo plot for ggplot functions
load("movies.Rdata")
set.seed(123)
The movies dataset gathers data on movies along with a few attributes describing their structure, such as runtime in minutes, release day, month and year, MPAA rating, and so on. Besides these variables, there are a few columns of interest regarding movie evaluation, such as the Internet Movie Database (IMDB) rating, critics ratings and audience ratings. The goal of this project is to bring a perspective on one of these classifications, the IMDB rating, and its relationship with the other variables in the dataset, excluding the other evaluation variables. Thus, we will not try to predict the IMDB rating from the ratings given on other sites.
This dataset was composed via random sampling of the IMDB and Rotten Tomatoes databases and, as there was no random assignment of treatments to the variables, all findings can only be stated as correlational, not causal.
A few features of a movie can make it more or less popular, and we do not mean only its content of drama, humor, horror, etc. A few biases can make people unintentionally vote for high or low scores in online ratings. We will review and analyze the movies dataset with this in mind.
We will look at how people respond to different types of movies, their length, year of release and other characteristics, to see if we can predict how well a movie might be rated in the future. Online movie renters, such as Netflix, already do this kind of analysis to get the best return when deciding which movies to make available in their libraries.
The first step, before any exploration or processing of the data, is to split our dataset into training and test data. As is well explained here, most algorithms that learn from data with the objective of predicting new observations try to minimize the error they make during the first step (learning). Because of this, if we evaluated our model on the error the algorithm (a regression in this case) makes on the training data, we would be making an over-optimistic assessment. The objective of the test set is therefore to give our algorithm new data and thus obtain a more realistic error estimate of our modelling process.
# sample testing data
testingIndex <- sample(nrow(movies),floor(nrow(movies) * 0.2),replace = FALSE)
# get test and training data
testing <- movies[testingIndex,]
training <- movies[-testingIndex,]
# response variables of interest:
# imdb_rating, critics_rating, critics_score, audience_rating, audience_score
# remove response cols, title, studio, actors and url's columns
responseCols <- c(13,15,16,17,18)
noInterestVars <- c(1,6,25:32)
trainingPredictors <- training[,-c(responseCols,noInterestVars)]
trainingResponse <- training[, responseCols]
testingPredictors <- testing[,-c(responseCols,noInterestVars)]
testingResponse <- testing[, responseCols]
For our modelling process, we will follow a methodology that is nicely summarized in this infographic. It basically describes a general thinking process with which we can approach our data modelling:
Multiple Linear Regression Checklist by: http://www.data-mania.com/blog/a-5-step-checklist-for-multiple-linear-regression/
But, before starting with our analytical thinking, we must evaluate and deal with missing values, as linear models are not robust against them. We can handle NA's with a few approaches, and I evaluated 3 of them:
Row removal: Remove the rows that have at least one NA
Column Removal: Remove columns if they have more NA’s than data
Imputation: Use an algorithmic approach to estimate the missing values. This can be done with a few techniques, such as median imputation, kNN or bagged-tree imputation. I tested the three proposed methods here. In order not to make this report too long, I evaluated these approaches with this project's final model, and median imputation was the best one, i.e. it gave the smallest RMSE.
# model 1: remove rows with NA?
anyNA <- function(row){ any(is.na(row))} # returns TRUE if the row has at least one NA (note: masks base::anyNA)
rowsWithNA <- apply(training, 1, anyNA) # apply the function to each row
rowsWithNAIndexTRUE <- which(rowsWithNA == TRUE) # get which rows returned TRUE
length(rowsWithNAIndexTRUE)/nrow(training) # only a small fraction of rows have NA's, so removing them could be acceptable
## [1] 0.04990403
# In this report we will NOT use this model, as I have already compared it with the Median imputation model
naRMTrainingData <- training[-rowsWithNAIndexTRUE,] # create new model with the NA rows removed
############################################################################################################
# Model 2: remove columns that have a lot of NA's?
# check which columns have at least one NA
colsWithNA <- apply(trainingPredictors, 2, function(col) { any(is.na(col))})
colsWithNAIndexTRUE <- which(colsWithNA == TRUE) # get cols index that have at least one NA
# low percentage of NA's. No need to exclude columns; take other actions to impute them
apply(trainingPredictors[,colsWithNAIndexTRUE], 2, function(cols) {sum(is.na(cols))/length(cols)})
## runtime dvd_rel_year dvd_rel_month dvd_rel_day
## 0.001919386 0.009596929 0.009596929 0.009596929
############################################################################################################
# model 3, 4 and 5: IMPUTATIONS
#kNNImput <- preProcess(training, method='knnImpute')
#bagImput <- preProcess(training, method='bagImpute', na.remove = TRUE)
medianImput <- preProcess(training, method='medianImpute')
#knnImputTrainingData <- predict(kNNImput, training)
#bagImputTrainingData <- predict(bagImput, training)
medianImputTrainingData <- predict(medianImput, training)
# Already tested these models with the final model, and the best model was the Median Imputation
With our data cleared of NA's, let's see a first estimate for our multiple regression model. We took the response and no-interest variables out and added back just our response variable of interest, imdb_rating.
medianImputTrainingPredictors <- medianImputTrainingData[,-c(responseCols,noInterestVars)]
medianImputTrainingPredictors <- cbind(medianImputTrainingPredictors,medianImputTrainingData[,responseCols[1]])
### first raw model with what we got
model.raw <- lm(imdb_rating ~ .,
data = medianImputTrainingPredictors)
summary(model.raw)
##
## Call:
## lm(formula = imdb_rating ~ ., data = medianImputTrainingPredictors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8515 -0.3851 0.0398 0.5020 2.1519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.186e+01 1.787e+01 2.343 0.019546 *
## title_typeFeature Film -9.488e-01 3.561e-01 -2.665 0.007959 **
## title_typeTV Movie -1.678e+00 5.539e-01 -3.030 0.002578 **
## genreAnimation -2.011e-02 4.124e-01 -0.049 0.961128
## genreArt House & International 9.444e-01 2.786e-01 3.390 0.000755 ***
## genreComedy -1.122e-02 1.674e-01 -0.067 0.946574
## genreDocumentary 1.080e+00 3.781e-01 2.856 0.004470 **
## genreDrama 8.047e-01 1.396e-01 5.764 1.45e-08 ***
## genreHorror -4.519e-02 2.450e-01 -0.184 0.853703
## genreMusical & Performing Arts 1.106e+00 3.277e-01 3.376 0.000794 ***
## genreMystery & Suspense 5.710e-01 1.743e-01 3.276 0.001127 **
## genreOther 4.847e-01 2.497e-01 1.941 0.052840 .
## genreScience Fiction & Fantasy -6.344e-01 3.423e-01 -1.853 0.064441 .
## runtime 5.923e-03 2.356e-03 2.514 0.012243 *
## mpaa_ratingNC-17 -1.561e-01 6.602e-01 -0.236 0.813167
## mpaa_ratingPG -1.997e-01 2.933e-01 -0.681 0.496290
## mpaa_ratingPG-13 -4.619e-01 3.035e-01 -1.522 0.128684
## mpaa_ratingR -1.421e-01 2.948e-01 -0.482 0.630097
## mpaa_ratingUnrated 1.077e-01 3.242e-01 0.332 0.739884
## thtr_rel_year -9.027e-03 4.953e-03 -1.822 0.069006 .
## thtr_rel_month -1.124e-03 1.100e-02 -0.102 0.918713
## thtr_rel_day 1.644e-03 4.211e-03 0.390 0.696445
## dvd_rel_year -8.889e-03 1.107e-02 -0.803 0.422519
## dvd_rel_month 1.434e-02 1.139e-02 1.259 0.208800
## dvd_rel_day -6.382e-04 4.186e-03 -0.152 0.878884
## imdb_num_votes 3.474e-06 4.079e-07 8.518 < 2e-16 ***
## best_pic_nomyes 4.128e-01 2.530e-01 1.631 0.103438
## best_pic_winyes -6.520e-01 4.430e-01 -1.472 0.141762
## best_actor_winyes 6.562e-02 1.093e-01 0.600 0.548620
## best_actress_winyes 1.187e-02 1.204e-01 0.099 0.921551
## best_dir_winyes 1.428e-01 1.676e-01 0.852 0.394646
## top200_boxyes 6.074e-03 2.635e-01 0.023 0.981617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8327 on 489 degrees of freedom
## Multiple R-squared: 0.4556, Adjusted R-squared: 0.4211
## F-statistic: 13.2 on 31 and 489 DF, p-value: < 2.2e-16
With this first attempt, we see that our model gives a relatively poor fit, with an R2 score of just 45% and an adjusted R2 of 42%.
Feature engineering is a non-standard field of data modelling / machine learning that tries to transform raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data [2]. In this section, I present the insights that led to the final model. The movies training dataset was repeatedly sampled and transformed in order to get a high adjusted R2 score. The list below shows the feature engineering steps taken to get to the final model and their respective R2 scores. I will explain each item and why I included it in the final model. For each model created, the residuals and predicted values were plotted and analyzed using the code below. I have left it commented out because it generates too many plots, one for each attribute, but feel free to run it on your machine:
# recurrent evaluation using this section
# predicted values for the training dataset
#testePredict <- predict(final.model,medianImputTrainingPredictors)
# plot expected versus predicted values
#plot(abs(medianImputTrainingPredictors$imdb_rating - testePredict))
# plot attributes distribution versus predicted values
#for(i in 1:ncol(medianImputTrainingPredictors)){
# title <- sprintf('%s', names(medianImputTrainingPredictors)[i])
# plot(medianImputTrainingPredictors[,i],medianImputTrainingPredictors$imdb_rating, main = title)
#}
1 - Multiple R-squared: 0.4556, Adjusted R-squared: 0.4211 Pure Model
2 - Multiple R-squared: 0.5119, Adjusted R-squared: 0.4954 Transformation of imdb_ratings < 5 to 5
3 - Multiple R-squared: 0.5267, Adjusted R-squared: 0.5107 Theatre release year bin transformation
4 - Multiple R-squared: 0.5774, Adjusted R-squared: 0.5524 Weights to award wining/nominated movies
5 - Multiple R-squared: 0.593, Adjusted R-squared: 0.5698 Best Actress factor removal and horror movie category removed
6 - Multiple R-squared: 0.6887, Adjusted R-squared: 0.6426 Interaction terms between release year and movie genre, best actor, actress and director
7 - Multiple R-squared: 0.7381, Adjusted R-squared: 0.6960 Inclusion of the 5 biggest studios
8 - Multiple R-squared: 0.8762, Adjusted R-squared: 0.8563 Removal of the best actress weights

1 - Pure Model: The raw model built in section 3.1
2 - Transformation of imdb_ratings < 5 to 5: As we are trying to predict the best ratings, I removed the low-score ratings and merged them into the 5 score. In this way, we might reduce the noise and force the model to focus on predicting the different scores > 5.
3 - Theatre release year bin transformation: Another feature transformation to try to reduce the variance of the model. Firstly, grouping the years in bins created just 5 year categories, instead of more than 40 individual years; besides that, looking at the histogram of the release year attribute, we noticed that people are not used to rating a movie with a bad score, they usually rate a movie when they have a positive impression of it. Secondly, the idea is that maybe the impact of year on the final imdb ratings is not granular at the level of individual years, but over a range of years.
4 - Weights to award winning/nominated movies: As we verified in the NZV variables, the nominee indicators for best actor / actress / director showed a high discrimination capacity. But we also saw that the number of examples with nominated celebrities was really small. This sounded like a class imbalance problem, where the model simply learned to ignore these examples and tended to follow the majority of the other values, i.e. no nominated people. To force the model to pay more attention to these minority examples, we changed our linear model to a weighted linear regression, where we add a weight term to the classic least squares objective (a quick numerical check of this weighting idea is sketched right after this list):
\(J(\beta) = \sum_{i = 1}^{m} w_{i}(y_{i} - \hat{y}_{i})^2, \quad \hat{y}_{i} = \beta_{0} + \sum_{j=1}^{n} \beta_{j} x_{ij}\)
5 - Best Actress factor removal and horror movie category removal: After plotting best_actress_win versus the predicted imdb_rating, I saw that there was no significant difference between the predicted imdb_ratings for a movie with a best-actress prize winner and one without. Besides, I also realized that the horror movie genre did not bring any extra predictive power, and removing it increased the adjusted R2 score.
6 - Interaction terms between release year and movie genre, best actor, actress and director: This interaction between factor variables works as an adjustment to the \(\beta_{0}\) bias parameter. My thought was that maybe people rate old (i.e. 'classic', 'old-school', etc.) movies in a better spirit than new blockbuster movies. This was a gut feeling; there was no graphical or statistical evidence for the interaction between these variables, just the sense that people like to feel cultured. This step generated a lot of parameters, but they also provided useful information, because the adjusted R2 followed the overall R2 closely.
7 - Inclusion of the 5 biggest studios: At the beginning of the project, I did not include the studio variable because there was too much variability (a unique-value fraction of 90%, meaning that 90% of the studio values were unique). But maybe movies coming from the biggest (and most famous) studios could bring some more clues about their ratings. So I selected the 5 most frequent studios in the training dataset and transformed the other studios into a new category, 'Other' (a sketch of how this selection can be reproduced follows the feature engineering function below). The new variable was a factor with 6 levels: the 5 biggest studios and the 'Other' category.
8 - Removal of the best actress weights: This step follows from step 5, as I had forgotten to remove the weights for the best_actress_win variable. Since that variable was no longer in the model, there was no sense in keeping large importance weights for movies with an award-winning actress.
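Before moving to the code, here is the quick numerical check of the weighting idea from step 4. This sketch uses made-up toy data (not the movies dataset; x, y and w are illustrative names only) and confirms that lm() with a weights argument minimizes the weighted objective \(J(\beta)\) above, by matching its coefficients against the closed-form weighted least squares solution.

# minimal sketch on toy data: lm(weights = ...) reproduces closed-form weighted least squares
set.seed(42)
x <- rnorm(100)                      # toy predictor (illustrative only)
y <- 1 + 2 * x + rnorm(100)          # toy response
w <- ifelse(x > 1, 1000, 1)          # heavy weights on a minority of points, as in step 4
fit <- lm(y ~ x, weights = w)
# closed-form WLS: beta = (X' W X)^-1 X' W y
X <- cbind(1, x)
W <- diag(w)
beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
round(cbind(lm = coef(fit), closed_form = as.vector(beta)), 6)  # the two columns should match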
All the feature engineering steps presented above are coded below and, finally, the final model is presented. As I would use the same steps for the training and testing datasets, I created a function that takes a dataframe plus the function mode (training or testing) and returns the processed dataframe. Besides the feature engineering process, we also rescaled two significant predictors from the first model. "runtime" and "imdb_num_votes" were really significant predictors, but their parameter values were really small (runtime = 5.923e-03 and imdb_num_votes = 3.474e-06), so I transformed them to a bigger scale (runtime in hours and imdb_num_votes in thousands of votes). In this way, the parameters are on a scale that is easier to interpret.
moviesVarEngineer <- function(dataFrame, mode){
  ######### feature transformation
  dataFrame$imdb_num_votes <- dataFrame$imdb_num_votes/1000
  dataFrame$runtime <- dataFrame$runtime/60
  #########
  # step 2
  dataFrame$imdb_rating[dataFrame$imdb_rating < 5] <- 5
  # step 3
  dataFrame$thtr_rel_year_BIN <- cut(dataFrame$thtr_rel_year, breaks = c(1969, 1980, 1990,2000,2010, 2014))
  dataFrame$thtr_rel_year <- NULL
  # step 4 = weights - outside of the function
  # step 5 = best actress removal and horror genre category removal
  dataFrame$genre[dataFrame$genre == 'Horror'] <- 'Other'
  dataFrame$genre <- factor(dataFrame$genre, levels = levels(dataFrame$genre)[c(1:6,8:11)])
  # step 6 = added interactions in the final model
  # step 7 = selected the 5 biggest studios and transformed the others to 'Other'
  studios <- c('Paramount Pictures','Warner Bros. Pictures','Sony Pictures Home Entertainment','Universal Pictures','Warner Home Video')
  if(mode == 'training'){
    dataFrame$studio <- training$studio
  }else{
    dataFrame$studio <- testing$studio
  }
  dataFrame$studio[!dataFrame$studio %in% studios] <- NA
  dataFrame$studio <- factor(dataFrame$studio, levels = c(studios,'Other'))
  dataFrame$studio[is.na(dataFrame$studio)] <- 'Other'
  invisible(dataFrame)
}
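The five studio names hardcoded in the function above were picked by their frequency in the training data. A minimal sketch of how such a selection could be reproduced (the exact names and counts depend on the training sample drawn):

# sketch: pick the 5 most frequent studios in the training set
studioCounts <- sort(table(training$studio), decreasing = TRUE)
head(studioCounts, 5)               # inspect the top studios and their counts
topStudios <- names(studioCounts)[1:5]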
medianImputTrainingPredictors <- moviesVarEngineer(medianImputTrainingPredictors,'training')
weight <- rep(NA,nrow(medianImputTrainingPredictors))
weight[medianImputTrainingPredictors$best_pic_nom == 'yes'] <- 1000
weight[medianImputTrainingPredictors$best_pic_win == 'yes'] <- 1000
weight[medianImputTrainingPredictors$best_dir_win == 'yes'] <- 1000
weight[is.na(weight)] <- 1
final.model <- lm(imdb_rating ~ title_type + genre + runtime +
mpaa_rating + imdb_num_votes +
best_pic_nom + thtr_rel_year_BIN + best_pic_win +
best_dir_win + studio + genre:thtr_rel_year_BIN +
thtr_rel_year_BIN:best_pic_win + thtr_rel_year_BIN:best_actress_win +
thtr_rel_year_BIN:best_dir_win,
data = medianImputTrainingPredictors,
weights = weight)
summary(final.model)
##
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
## imdb_num_votes + best_pic_nom + thtr_rel_year_BIN + best_pic_win +
## best_dir_win + studio + genre:thtr_rel_year_BIN + thtr_rel_year_BIN:best_pic_win +
## thtr_rel_year_BIN:best_actress_win + thtr_rel_year_BIN:best_dir_win,
## data = medianImputTrainingPredictors, weights = weight)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -32.748 -0.904 -0.234 0.324 27.330
##
## Coefficients: (7 not defined because of singularities)
## Estimate
## (Intercept) 4.5733900
## title_typeFeature Film -0.4771147
## title_typeTV Movie -0.7272767
## genreAnimation 1.9017413
## genreArt House & International 0.8889418
## genreComedy -0.1568169
## genreDocumentary 0.9393243
## genreDrama 1.4754281
## genreMusical & Performing Arts -1.1554540
## genreMystery & Suspense 1.3790490
## genreOther 0.1897325
## genreScience Fiction & Fantasy -0.3996974
## runtime 0.6906618
## mpaa_ratingNC-17 -0.5771154
## mpaa_ratingPG 0.3032650
## mpaa_ratingPG-13 0.1847876
## mpaa_ratingR -0.2295623
## mpaa_ratingUnrated -0.0491466
## imdb_num_votes 0.0027573
## best_pic_nomyes -0.0034484
## thtr_rel_year_BIN(1980,1990] 0.0281348
## thtr_rel_year_BIN(1990,2000] -0.4949214
## thtr_rel_year_BIN(2000,2010] -2.0806625
## thtr_rel_year_BIN(2010,2014] -1.0796343
## best_pic_winyes -0.0178430
## best_dir_winyes -1.0490195
## studioWarner Bros. Pictures 0.5746303
## studioSony Pictures Home Entertainment 1.4204896
## studioUniversal Pictures 1.9739412
## studioWarner Home Video 0.9032382
## studioOther 1.1241572
## genreAnimation:thtr_rel_year_BIN(1980,1990] NA
## genreArt House & International:thtr_rel_year_BIN(1980,1990] -1.4125059
## genreComedy:thtr_rel_year_BIN(1980,1990] -0.4278370
## genreDocumentary:thtr_rel_year_BIN(1980,1990] -0.4912124
## genreDrama:thtr_rel_year_BIN(1980,1990] -1.2330254
## genreMusical & Performing Arts:thtr_rel_year_BIN(1980,1990] 0.6267459
## genreMystery & Suspense:thtr_rel_year_BIN(1980,1990] -1.4449559
## genreOther:thtr_rel_year_BIN(1980,1990] -0.2502388
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1980,1990] -1.2686210
## genreAnimation:thtr_rel_year_BIN(1990,2000] NA
## genreArt House & International:thtr_rel_year_BIN(1990,2000] 0.3516835
## genreComedy:thtr_rel_year_BIN(1990,2000] 2.3761955
## genreDocumentary:thtr_rel_year_BIN(1990,2000] 0.4500423
## genreDrama:thtr_rel_year_BIN(1990,2000] -0.3915023
## genreMusical & Performing Arts:thtr_rel_year_BIN(1990,2000] 2.4587418
## genreMystery & Suspense:thtr_rel_year_BIN(1990,2000] 0.1736703
## genreOther:thtr_rel_year_BIN(1990,2000] -0.0780751
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1990,2000] -0.1127124
## genreAnimation:thtr_rel_year_BIN(2000,2010] NA
## genreArt House & International:thtr_rel_year_BIN(2000,2010] 1.8034443
## genreComedy:thtr_rel_year_BIN(2000,2010] 1.3937065
## genreDocumentary:thtr_rel_year_BIN(2000,2010] 2.0461984
## genreDrama:thtr_rel_year_BIN(2000,2010] 1.3823521
## genreMusical & Performing Arts:thtr_rel_year_BIN(2000,2010] 3.6309217
## genreMystery & Suspense:thtr_rel_year_BIN(2000,2010] 1.2653928
## genreOther:thtr_rel_year_BIN(2000,2010] 3.8415854
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2000,2010] 2.0360166
## genreAnimation:thtr_rel_year_BIN(2010,2014] NA
## genreArt House & International:thtr_rel_year_BIN(2010,2014] 0.1318916
## genreComedy:thtr_rel_year_BIN(2010,2014] 0.6231210
## genreDocumentary:thtr_rel_year_BIN(2010,2014] 1.0455072
## genreDrama:thtr_rel_year_BIN(2010,2014] -0.2982281
## genreMusical & Performing Arts:thtr_rel_year_BIN(2010,2014] NA
## genreMystery & Suspense:thtr_rel_year_BIN(2010,2014] -0.6847781
## genreOther:thtr_rel_year_BIN(2010,2014] 0.0421118
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2010,2014] 0.1566839
## thtr_rel_year_BIN(1980,1990]:best_pic_winyes -0.0213499
## thtr_rel_year_BIN(1990,2000]:best_pic_winyes -0.1527893
## thtr_rel_year_BIN(2000,2010]:best_pic_winyes NA
## thtr_rel_year_BIN(2010,2014]:best_pic_winyes NA
## thtr_rel_year_BIN(1969,1980]:best_actress_winyes -1.0495287
## thtr_rel_year_BIN(1980,1990]:best_actress_winyes 0.3176271
## thtr_rel_year_BIN(1990,2000]:best_actress_winyes -0.4387228
## thtr_rel_year_BIN(2000,2010]:best_actress_winyes 0.0172925
## thtr_rel_year_BIN(2010,2014]:best_actress_winyes -0.1006795
## thtr_rel_year_BIN(1980,1990]:best_dir_winyes 1.5644755
## thtr_rel_year_BIN(1990,2000]:best_dir_winyes 0.4863802
## thtr_rel_year_BIN(2000,2010]:best_dir_winyes 0.4535078
## thtr_rel_year_BIN(2010,2014]:best_dir_winyes 1.4356526
## Std. Error
## (Intercept) 1.5018906
## title_typeFeature Film 1.1923086
## title_typeTV Movie 1.8331915
## genreAnimation 1.2976069
## genreArt House & International 1.9526031
## genreComedy 2.7499331
## genreDocumentary 2.0046678
## genreDrama 0.1080429
## genreMusical & Performing Arts 2.7504205
## genreMystery & Suspense 0.1751700
## genreOther 0.9969168
## genreScience Fiction & Fantasy 1.5991737
## runtime 0.1219331
## mpaa_ratingNC-17 2.1601855
## mpaa_ratingPG 0.9434460
## mpaa_ratingPG-13 0.9453747
## mpaa_ratingR 0.9387382
## mpaa_ratingUnrated 1.0424707
## imdb_num_votes 0.0001338
## best_pic_nomyes 0.0956836
## thtr_rel_year_BIN(1980,1990] 0.4827692
## thtr_rel_year_BIN(1990,2000] 0.2816363
## thtr_rel_year_BIN(2000,2010] 0.2966526
## thtr_rel_year_BIN(2010,2014] 1.6044279
## best_pic_winyes 0.1169867
## best_dir_winyes 0.2379733
## studioWarner Bros. Pictures 0.1782330
## studioSony Pictures Home Entertainment 0.1703437
## studioUniversal Pictures 0.1383241
## studioWarner Home Video 0.1497871
## studioOther 0.1043026
## genreAnimation:thtr_rel_year_BIN(1980,1990] NA
## genreArt House & International:thtr_rel_year_BIN(1980,1990] 2.7802652
## genreComedy:thtr_rel_year_BIN(1980,1990] 2.9119256
## genreDocumentary:thtr_rel_year_BIN(1980,1990] 2.1758397
## genreDrama:thtr_rel_year_BIN(1980,1990] 0.1874641
## genreMusical & Performing Arts:thtr_rel_year_BIN(1980,1990] 2.7551132
## genreMystery & Suspense:thtr_rel_year_BIN(1980,1990] 1.1849103
## genreOther:thtr_rel_year_BIN(1980,1990] 1.0085224
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1980,1990] 3.2003704
## genreAnimation:thtr_rel_year_BIN(1990,2000] NA
## genreArt House & International:thtr_rel_year_BIN(1990,2000] 2.7686645
## genreComedy:thtr_rel_year_BIN(1990,2000] 2.7576451
## genreDocumentary:thtr_rel_year_BIN(1990,2000] 2.2790450
## genreDrama:thtr_rel_year_BIN(1990,2000] 0.2012553
## genreMusical & Performing Arts:thtr_rel_year_BIN(1990,2000] 3.3662303
## genreMystery & Suspense:thtr_rel_year_BIN(1990,2000] 0.2866742
## genreOther:thtr_rel_year_BIN(1990,2000] 1.4625833
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1990,2000] 3.1759188
## genreAnimation:thtr_rel_year_BIN(2000,2010] NA
## genreArt House & International:thtr_rel_year_BIN(2000,2010] 2.3955812
## genreComedy:thtr_rel_year_BIN(2000,2010] 2.8113998
## genreDocumentary:thtr_rel_year_BIN(2000,2010] 1.7138877
## genreDrama:thtr_rel_year_BIN(2000,2010] 0.1899815
## genreMusical & Performing Arts:thtr_rel_year_BIN(2000,2010] 3.2043126
## genreMystery & Suspense:thtr_rel_year_BIN(2000,2010] 0.2484800
## genreOther:thtr_rel_year_BIN(2000,2010] 1.0337153
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2000,2010] 3.1887923
## genreAnimation:thtr_rel_year_BIN(2010,2014] NA
## genreArt House & International:thtr_rel_year_BIN(2010,2014] 3.1866282
## genreComedy:thtr_rel_year_BIN(2010,2014] 3.3648032
## genreDocumentary:thtr_rel_year_BIN(2010,2014] 2.4276554
## genreDrama:thtr_rel_year_BIN(2010,2014] 1.5905918
## genreMusical & Performing Arts:thtr_rel_year_BIN(2010,2014] NA
## genreMystery & Suspense:thtr_rel_year_BIN(2010,2014] 1.8144927
## genreOther:thtr_rel_year_BIN(2010,2014] 1.8826352
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2010,2014] 3.5438994
## thtr_rel_year_BIN(1980,1990]:best_pic_winyes 0.1553447
## thtr_rel_year_BIN(1990,2000]:best_pic_winyes 0.1916048
## thtr_rel_year_BIN(2000,2010]:best_pic_winyes NA
## thtr_rel_year_BIN(2010,2014]:best_pic_winyes NA
## thtr_rel_year_BIN(1969,1980]:best_actress_winyes 0.1421702
## thtr_rel_year_BIN(1980,1990]:best_actress_winyes 0.4185223
## thtr_rel_year_BIN(1990,2000]:best_actress_winyes 0.1086283
## thtr_rel_year_BIN(2000,2010]:best_actress_winyes 0.1172106
## thtr_rel_year_BIN(2010,2014]:best_actress_winyes 1.1352685
## thtr_rel_year_BIN(1980,1990]:best_dir_winyes 0.4608510
## thtr_rel_year_BIN(1990,2000]:best_dir_winyes 0.2367964
## thtr_rel_year_BIN(2000,2010]:best_dir_winyes 0.2363320
## thtr_rel_year_BIN(2010,2014]:best_dir_winyes 1.3605037
## t value
## (Intercept) 3.045
## title_typeFeature Film -0.400
## title_typeTV Movie -0.397
## genreAnimation 1.466
## genreArt House & International 0.455
## genreComedy -0.057
## genreDocumentary 0.469
## genreDrama 13.656
## genreMusical & Performing Arts -0.420
## genreMystery & Suspense 7.873
## genreOther 0.190
## genreScience Fiction & Fantasy -0.250
## runtime 5.664
## mpaa_ratingNC-17 -0.267
## mpaa_ratingPG 0.321
## mpaa_ratingPG-13 0.195
## mpaa_ratingR -0.245
## mpaa_ratingUnrated -0.047
## imdb_num_votes 20.606
## best_pic_nomyes -0.036
## thtr_rel_year_BIN(1980,1990] 0.058
## thtr_rel_year_BIN(1990,2000] -1.757
## thtr_rel_year_BIN(2000,2010] -7.014
## thtr_rel_year_BIN(2010,2014] -0.673
## best_pic_winyes -0.153
## best_dir_winyes -4.408
## studioWarner Bros. Pictures 3.224
## studioSony Pictures Home Entertainment 8.339
## studioUniversal Pictures 14.270
## studioWarner Home Video 6.030
## studioOther 10.778
## genreAnimation:thtr_rel_year_BIN(1980,1990] NA
## genreArt House & International:thtr_rel_year_BIN(1980,1990] -0.508
## genreComedy:thtr_rel_year_BIN(1980,1990] -0.147
## genreDocumentary:thtr_rel_year_BIN(1980,1990] -0.226
## genreDrama:thtr_rel_year_BIN(1980,1990] -6.577
## genreMusical & Performing Arts:thtr_rel_year_BIN(1980,1990] 0.227
## genreMystery & Suspense:thtr_rel_year_BIN(1980,1990] -1.219
## genreOther:thtr_rel_year_BIN(1980,1990] -0.248
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1980,1990] -0.396
## genreAnimation:thtr_rel_year_BIN(1990,2000] NA
## genreArt House & International:thtr_rel_year_BIN(1990,2000] 0.127
## genreComedy:thtr_rel_year_BIN(1990,2000] 0.862
## genreDocumentary:thtr_rel_year_BIN(1990,2000] 0.197
## genreDrama:thtr_rel_year_BIN(1990,2000] -1.945
## genreMusical & Performing Arts:thtr_rel_year_BIN(1990,2000] 0.730
## genreMystery & Suspense:thtr_rel_year_BIN(1990,2000] 0.606
## genreOther:thtr_rel_year_BIN(1990,2000] -0.053
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1990,2000] -0.035
## genreAnimation:thtr_rel_year_BIN(2000,2010] NA
## genreArt House & International:thtr_rel_year_BIN(2000,2010] 0.753
## genreComedy:thtr_rel_year_BIN(2000,2010] 0.496
## genreDocumentary:thtr_rel_year_BIN(2000,2010] 1.194
## genreDrama:thtr_rel_year_BIN(2000,2010] 7.276
## genreMusical & Performing Arts:thtr_rel_year_BIN(2000,2010] 1.133
## genreMystery & Suspense:thtr_rel_year_BIN(2000,2010] 5.093
## genreOther:thtr_rel_year_BIN(2000,2010] 3.716
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2000,2010] 0.638
## genreAnimation:thtr_rel_year_BIN(2010,2014] NA
## genreArt House & International:thtr_rel_year_BIN(2010,2014] 0.041
## genreComedy:thtr_rel_year_BIN(2010,2014] 0.185
## genreDocumentary:thtr_rel_year_BIN(2010,2014] 0.431
## genreDrama:thtr_rel_year_BIN(2010,2014] -0.187
## genreMusical & Performing Arts:thtr_rel_year_BIN(2010,2014] NA
## genreMystery & Suspense:thtr_rel_year_BIN(2010,2014] -0.377
## genreOther:thtr_rel_year_BIN(2010,2014] 0.022
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2010,2014] 0.044
## thtr_rel_year_BIN(1980,1990]:best_pic_winyes -0.137
## thtr_rel_year_BIN(1990,2000]:best_pic_winyes -0.797
## thtr_rel_year_BIN(2000,2010]:best_pic_winyes NA
## thtr_rel_year_BIN(2010,2014]:best_pic_winyes NA
## thtr_rel_year_BIN(1969,1980]:best_actress_winyes -7.382
## thtr_rel_year_BIN(1980,1990]:best_actress_winyes 0.759
## thtr_rel_year_BIN(1990,2000]:best_actress_winyes -4.039
## thtr_rel_year_BIN(2000,2010]:best_actress_winyes 0.148
## thtr_rel_year_BIN(2010,2014]:best_actress_winyes -0.089
## thtr_rel_year_BIN(1980,1990]:best_dir_winyes 3.395
## thtr_rel_year_BIN(1990,2000]:best_dir_winyes 2.054
## thtr_rel_year_BIN(2000,2010]:best_dir_winyes 1.919
## thtr_rel_year_BIN(2010,2014]:best_dir_winyes 1.055
## Pr(>|t|)
## (Intercept) 0.002463 **
## title_typeFeature Film 0.689229
## title_typeTV Movie 0.691758
## genreAnimation 0.143465
## genreArt House & International 0.649143
## genreComedy 0.954550
## genreDocumentary 0.639606
## genreDrama < 2e-16 ***
## genreMusical & Performing Arts 0.674613
## genreMystery & Suspense 2.64e-14 ***
## genreOther 0.849145
## genreScience Fiction & Fantasy 0.802748
## runtime 2.64e-08 ***
## mpaa_ratingNC-17 0.789469
## mpaa_ratingPG 0.748024
## mpaa_ratingPG-13 0.845118
## mpaa_ratingR 0.806922
## mpaa_ratingUnrated 0.962419
## imdb_num_votes < 2e-16 ***
## best_pic_nomyes 0.971267
## thtr_rel_year_BIN(1980,1990] 0.953553
## thtr_rel_year_BIN(1990,2000] 0.079548 .
## thtr_rel_year_BIN(2000,2010] 8.60e-12 ***
## thtr_rel_year_BIN(2010,2014] 0.501352
## best_pic_winyes 0.878844
## best_dir_winyes 1.31e-05 ***
## studioWarner Bros. Pictures 0.001356 **
## studioSony Pictures Home Entertainment 9.31e-16 ***
## studioUniversal Pictures < 2e-16 ***
## studioWarner Home Video 3.43e-09 ***
## studioOther < 2e-16 ***
## genreAnimation:thtr_rel_year_BIN(1980,1990] NA
## genreArt House & International:thtr_rel_year_BIN(1980,1990] 0.611670
## genreComedy:thtr_rel_year_BIN(1980,1990] 0.883257
## genreDocumentary:thtr_rel_year_BIN(1980,1990] 0.821493
## genreDrama:thtr_rel_year_BIN(1980,1990] 1.34e-10 ***
## genreMusical & Performing Arts:thtr_rel_year_BIN(1980,1990] 0.820151
## genreMystery & Suspense:thtr_rel_year_BIN(1980,1990] 0.223310
## genreOther:thtr_rel_year_BIN(1980,1990] 0.804152
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1980,1990] 0.692000
## genreAnimation:thtr_rel_year_BIN(1990,2000] NA
## genreArt House & International:thtr_rel_year_BIN(1990,2000] 0.898979
## genreComedy:thtr_rel_year_BIN(1990,2000] 0.389327
## genreDocumentary:thtr_rel_year_BIN(1990,2000] 0.843550
## genreDrama:thtr_rel_year_BIN(1990,2000] 0.052364 .
## genreMusical & Performing Arts:thtr_rel_year_BIN(1990,2000] 0.465519
## genreMystery & Suspense:thtr_rel_year_BIN(1990,2000] 0.544947
## genreOther:thtr_rel_year_BIN(1990,2000] 0.957452
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1990,2000] 0.971705
## genreAnimation:thtr_rel_year_BIN(2000,2010] NA
## genreArt House & International:thtr_rel_year_BIN(2000,2010] 0.451953
## genreComedy:thtr_rel_year_BIN(2000,2010] 0.620325
## genreDocumentary:thtr_rel_year_BIN(2000,2010] 0.233152
## genreDrama:thtr_rel_year_BIN(2000,2010] 1.55e-12 ***
## genreMusical & Performing Arts:thtr_rel_year_BIN(2000,2010] 0.257763
## genreMystery & Suspense:thtr_rel_year_BIN(2000,2010] 5.21e-07 ***
## genreOther:thtr_rel_year_BIN(2000,2010] 0.000228 ***
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2000,2010] 0.523480
## genreAnimation:thtr_rel_year_BIN(2010,2014] NA
## genreArt House & International:thtr_rel_year_BIN(2010,2014] 0.967004
## genreComedy:thtr_rel_year_BIN(2010,2014] 0.853165
## genreDocumentary:thtr_rel_year_BIN(2010,2014] 0.666919
## genreDrama:thtr_rel_year_BIN(2010,2014] 0.851357
## genreMusical & Performing Arts:thtr_rel_year_BIN(2010,2014] NA
## genreMystery & Suspense:thtr_rel_year_BIN(2010,2014] 0.706060
## genreOther:thtr_rel_year_BIN(2010,2014] 0.982164
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2010,2014] 0.964755
## thtr_rel_year_BIN(1980,1990]:best_pic_winyes 0.890748
## thtr_rel_year_BIN(1990,2000]:best_pic_winyes 0.425630
## thtr_rel_year_BIN(2000,2010]:best_pic_winyes NA
## thtr_rel_year_BIN(2010,2014]:best_pic_winyes NA
## thtr_rel_year_BIN(1969,1980]:best_actress_winyes 7.64e-13 ***
## thtr_rel_year_BIN(1980,1990]:best_actress_winyes 0.448296
## thtr_rel_year_BIN(1990,2000]:best_actress_winyes 6.32e-05 ***
## thtr_rel_year_BIN(2000,2010]:best_actress_winyes 0.882777
## thtr_rel_year_BIN(2010,2014]:best_actress_winyes 0.929373
## thtr_rel_year_BIN(1980,1990]:best_dir_winyes 0.000748 ***
## thtr_rel_year_BIN(1990,2000]:best_dir_winyes 0.040555 *
## thtr_rel_year_BIN(2000,2010]:best_dir_winyes 0.055627 .
## thtr_rel_year_BIN(2010,2014]:best_dir_winyes 0.291886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.739 on 448 degrees of freedom
## Multiple R-squared: 0.8762, Adjusted R-squared: 0.8563
## F-statistic: 44.05 on 72 and 448 DF, p-value: < 2.2e-16
Firstly, we notice a few NA's among the coefficients. That happened because there were no examples in the training set matching both factor levels at once, e.g. there were no animation movies released between 1980 and 1990 in the dataset.
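One way to confirm these empty factor combinations is to cross-tabulate genre against the release-year bins in the training data; cells with a zero count correspond to the NA coefficients above (a quick check, not part of the original modelling pipeline):

# sketch: count training examples per genre / year-bin combination
with(medianImputTrainingPredictors, table(genre, thtr_rel_year_BIN))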
Let's take a look at the statistically significant parameter estimates. By statistically significant we mean that, with 95% confidence, the parameter estimate is different from 0:
Variable | P.Value | Estimate |
---|---|---|
genreDrama | < 2e-16 | 1.4754281 |
genreMystery & Suspense | 2.64e-14 | 1.3790490 |
runtime | 2.64e-08 | 0.6906618 |
imdb_num_votes | < 2e-16 | 0.0027573 |
thtr_rel_year_BIN(2000,2010] | 8.60e-12 | -2.0806625 |
studioWarner Bros. Pictures | 0.001356 | 0.5746303 |
studioSony Pictures Home Entertainment | 9.31e-16 | 1.4204896 |
studioUniversal Pictures | < 2e-16 | 1.9739412 |
studioWarner Home Video | 3.43e-09 | 0.9032382 |
studioOther | < 2e-16 | 1.1241572 |
genreDrama:thtr_rel_year_BIN(1980,1990] | 1.34e-10 | -1.2330254 |
genreDrama:thtr_rel_year_BIN(2000,2010] | 1.55e-12 | 1.3823521 |
genreMystery & Suspense:thtr_rel_year_BIN(2000,2010] | 5.21e-07 | 1.2653928 |
genreOther:thtr_rel_year_BIN(2000,2010] | 0.000228 | 3.8415854 |
thtr_rel_year_BIN(1969,1980]:best_actress_winyes | 7.64e-13 | -1.0495287 |
thtr_rel_year_BIN(1990,2000]:best_actress_winyes | 6.32e-05 | -0.4387228 |
thtr_rel_year_BIN(1980,1990]:best_dir_winyes | 0.000748 | 1.5644755 |
thtr_rel_year_BIN(1990,2000]:best_dir_winyes | 0.040555 | 0.4863802 |
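The table above was assembled by hand from the summary output. A short sketch of how the same rows could be pulled out programmatically, using the standard coefficient matrix returned by summary.lm (note that this also keeps the intercept, which the hand-built table leaves out):

# sketch: keep the estimates whose p-value is below 0.05
coefs <- summary(final.model)$coefficients           # Estimate, Std. Error, t value, Pr(>|t|)
significant <- coefs[coefs[, "Pr(>|t|)"] < 0.05, c("Estimate", "Pr(>|t|)")]
round(significant, 4)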
All the following results can be interpreted as the association of each parameter with the imdb_rating response variable, holding all the other attributes constant. Most of the attributes were positively associated with the response variable.
The base factor level for the genre attribute is Action & Adventure [3]. Starting from there, we can see that drama movies tend to get higher ratings, on average 1.47 points more, and even higher scores for drama movies released in (2000,2010]. However, the same drama effect is much weaker for movies released in (1980,1990].
The runtime attribute (in hours) has a positive association with the imdb_rating. In theory, each additional hour of movie length would, on average, give an increase of 0.69 rating points. We have to be careful here, in the sense that we cannot extrapolate the fit too far: looking at the Interquartile Range (IQR) for the runtime variable, we see that the core values, between the 25% and 75% quartiles, span an interval of just 0.383 hours. So we cannot extrapolate too much on the number of hours, i.e. we cannot predict the average imdb_rating for a 13-hour movie and expect an incredibly high score.
IQR(medianImputTrainingPredictors$runtime)
## [1] 0.3833333
One theory that might make sense for a movie's popularity is that the more upvotes a movie receives, the more popular it becomes; more people will then want to watch it because it is popular, and so it will receive even more upvotes. This feedback loop can be represented in the model above by the imdb_num_votes parameter estimate. By its value, each additional thousand votes on IMDB is associated with an increase of about 0.0028 rating points. Considering that some movies receive hundreds of thousands of votes, encouraging people to review your movie can make it even more popular.
Maybe it was due to random factors, but the parameter value for movies released in (2000,2010] suggests it was a bad decade. By its value, movies released in this decade tended to receive 2.08 fewer points in imdb_rating. This was probably not the only determinant of such low scores, but a deeper analysis of movies from this decade could be carried out to identify more interesting factors specific to it.
The only factor variable whose levels all came out as significant estimates was the studio attribute. Using the 'Paramount Pictures' studio as the baseline, we see that a movie from any of the other big studios is associated with a higher imdb_rating than a Paramount movie, from 0.57 more for Warner Bros. Pictures up to almost 2 points more for Universal Pictures. Even the 'Other' category got a higher estimate, but this interpretation is riskier because we grouped together all the studios that did not belong to the 5 biggest.
Lastly, we obtained negative parameter estimates for movies from (1969,1980] and (1990,2000] that had a best-actress award winner. This was a surprise to me, as I expected that movies from older or newer decades would have a positive association with imdb_rating. More research could be done on this item to try to identify the reasons for these results. Regarding movies with an award-winning director, we see that they may have had a two-decade golden era, (1980,2000], where people really enjoyed and appreciated them, resulting in better imdb_rating scores, in the range of 0.48 to 1.56 extra points, compared to the baseline year bin (1969,1980].
As we discussed in section 3, judging the model by the estimated training error could be a little unrealistic because the model tries to minimize the training set error. Evaluating the training set by the adjusted R2 served to limit the number of parameters we would include in our model, preventing overfitting. As the original training dataset had to go through a preprocessing step, we apply the same procedure to the testing dataset, with all the parameters learned from the training data, including the median imputation and the steps in the 'moviesVarEngineer' function. We only do not create a weight vector because it was used only when estimating the \(\beta\) parameters of the weighted least squares fit. A good way to evaluate regression models is the Root Mean Square Error (RMSE), which estimates the typical error made by the model on unseen data.
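Concretely, for \(m\) test observations the reported RMSE is:

\(\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_{i} - \hat{y}_{i})^2}\)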
medianImputeTestingData <- predict(medianImput, testing)
medianImputTestingPredictors <- medianImputeTestingData[,-c(responseCols,noInterestVars)]
#medianImputTestingPredictors <- #cbind(medianImputTestingPredictors,medianImputeTestingData[,responseCols[1]])
medianImputTestingPredictors$imdb_rating <- medianImputeTestingData[,responseCols[1]]
medianImputTestingPredictors <- moviesVarEngineer(medianImputTestingPredictors,'testing')
testingPredictions <- predict(final.model,medianImputTestingPredictors)
## Warning in predict.lm(final.model, medianImputTestingPredictors):
## prediction from a rank-deficient fit may be misleading
RMSE(testingPredictions,medianImputTestingPredictors$imdb_rating,na.rm = T)
## [1] 1.158517
The feature engineering process was responsible for greatly improving the adjusted R2, and the model seemed to generalize well to new, unseen data, as we obtained a relatively low RMSE.
A possible solution for the NA coefficients in the model would be to remove those specific interactions from the model. Another way to approach this would be to use the dummyVars function in the caret package. This function splits a factor into n columns, one for each factor level, as dummy variables, i.e. 1 if the observation has that factor level and 0 otherwise.
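A minimal sketch of how dummyVars could be applied here (illustrative only; the resulting dummy columns would still need pruning before refitting the model):

# sketch: expand the factor predictors into 0/1 dummy columns with caret
dummyEncoder <- dummyVars(imdb_rating ~ ., data = medianImputTrainingPredictors)
dummyTraining <- as.data.frame(predict(dummyEncoder, newdata = medianImputTrainingPredictors))
dummyTraining$imdb_rating <- medianImputTrainingPredictors$imdb_rating  # add the response back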
Besides the NA's, there was one more point to improve in the model. As I created interaction terms between factors with many levels, some linear dependency was introduced with them. As we can see in the testing dataset prediction section, the model complained with a warning about predicting from a rank-deficient fit. That probably indicates a linear dependency between some of my variables, which could make the parameter estimates unstable and, consequently, less accurate for new data. The NZV and correlation analyses done with the final model could not identify and eliminate the likely dependent columns, so further investigation would be needed to improve the model's stability.
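One caret helper that could assist such an investigation is findLinearCombos, which works on a numeric design matrix and suggests columns to drop. A sketch of how it could be pointed at this model's design matrix (illustrative; this was not part of the original analysis):

# sketch: look for exact linear dependencies in the model's design matrix
designMatrix <- model.matrix(final.model)             # numeric matrix actually used by lm
linearCombos <- findLinearCombos(designMatrix)
linearCombos$remove                                   # column indices suggested for removal
colnames(designMatrix)[linearCombos$remove]           # and their names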
It is important to note that I also tried to build a model using the standard lm function together with an automated stepwise model selection algorithm, more specifically AIC-based stepwise selection. Only for the first raw model did the stepwise selection manage to extract a better model; after that, the careful feature selection, transformation and engineering process always gave a better fit on the training set.
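For reference, a minimal sketch of that stepwise attempt using base R's step function, which performs AIC-based selection (the exact call used originally may have differed):

# sketch: AIC-based stepwise selection starting from the raw model
step.model <- step(model.raw, direction = "both", trace = 0)
summary(step.model)$adj.r.squared    # compare against the manually engineered model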
The project shown here only used the 'imdb_rating' variable as a response variable. Similar analyses could be made using the other two continuous variables ('critics_score' and 'audience_score'), a logistic regression could be fit for the binary response variable 'audience_rating', and a multinomial regression model could be built for the multilevel response variable 'critics_rating'.
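As an illustration of one of these extensions, a logistic regression for the binary audience_rating variable could be sketched as below (a toy formula with a couple of predictors, not a tuned model):

# sketch: logistic regression for the binary audience_rating response
logit.model <- glm(audience_rating ~ genre + runtime + imdb_num_votes,
                   family = binomial, data = training)
summary(logit.model)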
There is a lot of research that can still be done, and the work presented in this project shows that we can predict reasonably well how people will rate (like) movies from their general characteristics.