Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(caret)      # machine learning and pre processing functions
library(gridExtra)  # combo plot for ggplot functions

Load data

load("movies.Rdata")
set.seed(123)

Part 1: Data

The movies dataset gathers data on movies and a few attributes about its structure, like movie length in minutes, release day, month and year, MPAA rating, and so on. Besides these variables, there are a few columns of interest regarding the movie evaluation, such as the Internet Movie Database - IMDB rating, critics rating and audience ratings. The goal of this project is to bring a perspective in one of these classifications, the IMDB ratings, and its relationship with the other variables in the dataset, excluding the other variables of interest. Thus, we will not try to predict the IMDB ratings over other places ratings.

This dataset was composed via random sampling the IMDB and Rotten Tomatoes database and, as there was no random assignment treatments to the variables, all findings can only be states as correlational, and not causational.


Part 2: Research question

A few features about the movie can make it more or less popular, and we do not mean only its content of drama, humor, horror, etc…A few biases can make make people unintentionally vote for high or low scores in online ratings. We will review and analyze the movies database and see if:

Is there any correlations between movie general feature(s) that make people like or dislike a movie?

We will see how people respond to different types of movie, length, year of release and other characteristics to see if we can make a prediction of how well could a movie be rated in the future. Online movie renters, such as netflix, already do this kind of analysis to have the best return when deciding which movie they are going to make available in their libraries.


Part 3: Exploratory data analysis

First step before actually making the first viewings and processings over the data, we split our dataset in training and test data. As well explained here, most of the algorithms that learn with data with the objetive to predict new incomes try to minimize the error they make with the first step (learning). Because of this, if we evaluate our data over the error the algorithm (regression in this case) makes with the training data, we would be making an over optimistic assumption. So the objetive of the test set is to give our algorithm new data and thus make a more real error estimate of our modelling process.

# sample testing data
testingIndex <- sample(nrow(movies),floor(nrow(movies) * 0.2),replace = FALSE)
# get test and training data
testing  <- movies[testingIndex,]
training <- movies[-testingIndex,]

# response variables of interest:
# imdb_rating, critics_rating, critics_score, audience_rating, audience_score
# remove response cols, title, studio, actors and url's columns
responseCols <- c(13,15,16,17,18)
noInterestVars <- c(1,6,25:32)

trainingPredictors <- training[,-c(responseCols,noInterestVars)]
trainingResponse   <- training[, responseCols]

testingPredictors <- testing[,-c(responseCols,noInterestVars)]
testingResponse   <- testing[, responseCols]

3.1. Data Imputation

For our modelling process, we will follow a methodology that is greatly highlighted in this infographic.It basically tells us a general thinking process with which we can work with our data modelling:

But, before starting with our analytical thinking, we must evaluate and act against missing values as linear models are no robust against it. We can deal with NA’s with a few approaches and I evaluated 3 of them:

  • Row removal: Remove the lines that have at least one NA

  • Column Removal: Remove columns if they have more NA’s than data

  • Imputation: Use an algorithmic approach to estimate the missing values. We can make this approach using a few techniques, such as median imputation, knn or bagged tree imputation. I have tested the three proposed methods here. In order not to make this report too long, I evaluated and tested these approaches with this project final model and the Median Imputation was the best one, i.e. the smallest RMSE.

# model 1: remove rows with NA?
anyNA <- function(row){ any(is.na(row))}          # function to return which row has at least a NA

rowsWithNA <- apply(training, 1, anyNA)           # apply the function to each row

rowsWithNAIndexTRUE <- which(rowsWithNA == TRUE)  # get which row return TRUE
length(rowsWithNAIndexTRUE)/nrow(training)        # few rows with NA's, might be a good way of removing NA
## [1] 0.04990403
# In this report we will NOT use this model, as I have already compared it with the Median imputation model
naRMTrainingData <- training[-rowsWithNAIndexTRUE,] # create new model with the NA rows removed
############################################################################################################
# Model 2: remove columns that have a lot of NA's?

# function to return which row has at least a NA
colsWithNA <- apply(trainingPredictors, 2, function(col) { any(is.na(col))}) #

colsWithNAIndexTRUE <- which(colsWithNA == TRUE) # get cols index that have at least one NA

# low percentage of NA's. No need to exclude columns, take other actions to imput then
apply(trainingPredictors[,colsWithNAIndexTRUE], 2, function(cols) {sum(is.na(cols))/length(cols)})
##       runtime  dvd_rel_year dvd_rel_month   dvd_rel_day 
##   0.001919386   0.009596929   0.009596929   0.009596929
############################################################################################################
# model 3, 4 and 5: IMPUTATIONS
#kNNImput <- preProcess(training, method='knnImpute')
#bagImput <- preProcess(training, method='bagImpute', na.remove = TRUE)
medianImput <- preProcess(training, method='medianImpute')

#knnImputTrainingData <- predict(kNNImput, training)
#bagImputTrainingData <- predict(bagImput, training)
medianImputTrainingData <- predict(medianImput, training)
# Already tested these models with the final model, and the best model was the Median Imputation

With our data cleared from NA’s, let see a first estimative for our multiple regression model. We took the responses and no interest variables out and just added our response variable of interest, the imdb_ratings back.

medianImputTrainingPredictors <- medianImputTrainingData[,-c(responseCols,noInterestVars)]
medianImputTrainingPredictors <- cbind(medianImputTrainingPredictors,medianImputTrainingData[,responseCols[1]])

### first raw model with what we got
model.raw <- lm(imdb_rating ~ .,
                data = medianImputTrainingPredictors)
summary(model.raw)
## 
## Call:
## lm(formula = imdb_rating ~ ., data = medianImputTrainingPredictors)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8515 -0.3851  0.0398  0.5020  2.1519 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.186e+01  1.787e+01   2.343 0.019546 *  
## title_typeFeature Film         -9.488e-01  3.561e-01  -2.665 0.007959 ** 
## title_typeTV Movie             -1.678e+00  5.539e-01  -3.030 0.002578 ** 
## genreAnimation                 -2.011e-02  4.124e-01  -0.049 0.961128    
## genreArt House & International  9.444e-01  2.786e-01   3.390 0.000755 ***
## genreComedy                    -1.122e-02  1.674e-01  -0.067 0.946574    
## genreDocumentary                1.080e+00  3.781e-01   2.856 0.004470 ** 
## genreDrama                      8.047e-01  1.396e-01   5.764 1.45e-08 ***
## genreHorror                    -4.519e-02  2.450e-01  -0.184 0.853703    
## genreMusical & Performing Arts  1.106e+00  3.277e-01   3.376 0.000794 ***
## genreMystery & Suspense         5.710e-01  1.743e-01   3.276 0.001127 ** 
## genreOther                      4.847e-01  2.497e-01   1.941 0.052840 .  
## genreScience Fiction & Fantasy -6.344e-01  3.423e-01  -1.853 0.064441 .  
## runtime                         5.923e-03  2.356e-03   2.514 0.012243 *  
## mpaa_ratingNC-17               -1.561e-01  6.602e-01  -0.236 0.813167    
## mpaa_ratingPG                  -1.997e-01  2.933e-01  -0.681 0.496290    
## mpaa_ratingPG-13               -4.619e-01  3.035e-01  -1.522 0.128684    
## mpaa_ratingR                   -1.421e-01  2.948e-01  -0.482 0.630097    
## mpaa_ratingUnrated              1.077e-01  3.242e-01   0.332 0.739884    
## thtr_rel_year                  -9.027e-03  4.953e-03  -1.822 0.069006 .  
## thtr_rel_month                 -1.124e-03  1.100e-02  -0.102 0.918713    
## thtr_rel_day                    1.644e-03  4.211e-03   0.390 0.696445    
## dvd_rel_year                   -8.889e-03  1.107e-02  -0.803 0.422519    
## dvd_rel_month                   1.434e-02  1.139e-02   1.259 0.208800    
## dvd_rel_day                    -6.382e-04  4.186e-03  -0.152 0.878884    
## imdb_num_votes                  3.474e-06  4.079e-07   8.518  < 2e-16 ***
## best_pic_nomyes                 4.128e-01  2.530e-01   1.631 0.103438    
## best_pic_winyes                -6.520e-01  4.430e-01  -1.472 0.141762    
## best_actor_winyes               6.562e-02  1.093e-01   0.600 0.548620    
## best_actress_winyes             1.187e-02  1.204e-01   0.099 0.921551    
## best_dir_winyes                 1.428e-01  1.676e-01   0.852 0.394646    
## top200_boxyes                   6.074e-03  2.635e-01   0.023 0.981617    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8327 on 489 degrees of freedom
## Multiple R-squared:  0.4556, Adjusted R-squared:  0.4211 
## F-statistic:  13.2 on 31 and 489 DF,  p-value: < 2.2e-16

With this first attempt, we see that our model does a relative poor fit, with a R2 score of just 45%, and a adjusted R2 of 42%.

3.2. Near Zero Variance - NZV and Correlated Columns Verification

Non informative predictors and Multicollinearity are a few problems that can make or model worse or unstable (not relevant predictors, therefore, not generalizable for new data and predictions).

NZV variables are columns that have little variation among its values, i.e., there is one predominant value, making the values almost constant between the data examples, providing little discriminative power. There are discussions if we should remove the nzv predictors, considering that we are throwing out information but, as we are trying to get the most informative model to our data, we would stick with this approach to remove them. However, as it is going to be shown next,, maybe these rows with the rare column values could mean something special.

As for multicollinearity, we will analyze among the predictors if we can find a big correlation between then. Correlated predictors are known [1] to destabilize the model parameters, increase the prediction standard error and worse the fit to new data. Let see how our data is organized. We will use the Caret package to extract the nzv variables and the cor command to extract the correlation matrix:

nzv <- nearZeroVar(training, saveMetrics= TRUE) 
nzvIndices <- which(nzv$nzv == TRUE)
nzv[nzvIndices, ]
##              freqRatio percentUnique zeroVar  nzv
## best_pic_nom  31.56250     0.3838772   FALSE TRUE
## best_pic_win 103.20000     0.3838772   FALSE TRUE
## top200_box    42.41667     0.3838772   FALSE TRUE

As we can see {best_pic_nom, best_pic_win, top200_box} were selected as no discriminative variables. But lets start inputing some domain knowledge here, as we will do in the variable engineering section. Probably, if a movie is nominated to or wins an oscar or it is in the Top 200 Box Office, it has a bigger chance to receive higher scores. In the plots below, we analyzed the nzv variables distributions, viewing whether a movie is premiated or not and its possible correspondent scores, such as imdb_rating, critics_rating and critics_score. As we have seen, there is an imbalance in the nzv columns, but they might provide a great discrimination power over the imdb_rating, our response variable of interest in this report. Therefore, we will not remove the nzv variables.

for(i in 1:3){
  cat('Distribution of:', names(training)[nzvIndices[i]])
  print(table(movies[,nzvIndices[i]]))
  plot1 <- ggplot(training, aes_string(x = names(training)[nzvIndices[i]], y = names(training)[responseCols[1]], colour = names(training)[responseCols[1]])) + geom_point()
  plot2 <- ggplot(training, aes_string(x = names(training)[nzvIndices[i]], fill = names(training)[responseCols[2]])) + geom_bar(position = 'fill')
  plot3 <- ggplot(training, aes_string(x = names(training)[nzvIndices[i]], y = names(training)[responseCols[3]], colour = names(training)[responseCols[3]])) + geom_point()
  grid.arrange(plot1, plot2, plot3, ncol=3)
  cat('\n')
}
## Distribution of: best_pic_nom
##  no yes 
## 629  22

## 
## Distribution of: best_pic_win
##  no yes 
## 644   7

## 
## Distribution of: top200_box
##  no yes 
## 636  15

Next, we evaluate the columns correlation. As we perform the traditional Pearson Numerical Correlation, I first removed the factor variables. For a mixed correlation status, between numerical and factor variables, we could have used the Heterogeneous Correlation Matrix or the Point-Biserial Correlation, but as these are more advanced techniques, we will not consider it in our analysis. Inspecting our correlation matrix, we see that the most correlated columns are between the response columns, not of interest in our research. Therefore, in spite of our predictor variables, we will not remove them to decrease multicollinearity.

numericIndices <- sapply(medianImputTrainingData, is.numeric) # extract which columns are numeric
colsCorr <- cor(medianImputTrainingData[,numericIndices])     # filter out non numeric columns
colsCorr                                                      # evaluate results
##                    runtime thtr_rel_year thtr_rel_month thtr_rel_day
## runtime         1.00000000   -0.09391657     0.24038267  0.026681592
## thtr_rel_year  -0.09391657    1.00000000    -0.00123443  0.117305939
## thtr_rel_month  0.24038267   -0.00123443     1.00000000  0.113887159
## thtr_rel_day    0.02668159    0.11730594     0.11388716  1.000000000
## dvd_rel_year   -0.04948940    0.64442653    -0.03503367  0.038496425
## dvd_rel_month  -0.01782400    0.03067943    -0.15517808 -0.003990974
## dvd_rel_day     0.02066362   -0.01894207     0.02626493  0.005082023
## imdb_rating     0.25791998   -0.01152601     0.06983412  0.026170006
## imdb_num_votes  0.33880973    0.18433121     0.12073182  0.065972579
## critics_score   0.17345536   -0.07510111     0.03499788  0.021334195
## audience_score  0.16991493   -0.04306953     0.04001151  0.031657178
##                dvd_rel_year dvd_rel_month  dvd_rel_day  imdb_rating
## runtime        -0.049489403  -0.017824005  0.020663623  0.257919981
## thtr_rel_year   0.644426530   0.030679426 -0.018942070 -0.011526011
## thtr_rel_month -0.035033673  -0.155178079  0.026264930  0.069834118
## thtr_rel_day    0.038496425  -0.003990974  0.005082023  0.026170006
## dvd_rel_year    1.000000000  -0.018777008 -0.068948976  0.009437087
## dvd_rel_month  -0.018777008   1.000000000 -0.031338070  0.059184980
## dvd_rel_day    -0.068948976  -0.031338070  1.000000000  0.025777246
## imdb_rating     0.009437087   0.059184980  0.025777246  1.000000000
## imdb_num_votes  0.094486607  -0.002573342 -0.026367160  0.303974099
## critics_score   0.035104270   0.019224629 -0.010529204  0.764708960
## audience_score -0.044897280   0.060226703  0.007492514  0.870449971
##                imdb_num_votes critics_score audience_score
## runtime           0.338809732    0.17345536    0.169914934
## thtr_rel_year     0.184331205   -0.07510111   -0.043069529
## thtr_rel_month    0.120731821    0.03499788    0.040011505
## thtr_rel_day      0.065972579    0.02133419    0.031657178
## dvd_rel_year      0.094486607    0.03510427   -0.044897280
## dvd_rel_month    -0.002573342    0.01922463    0.060226703
## dvd_rel_day      -0.026367160   -0.01052920    0.007492514
## imdb_rating       0.303974099    0.76470896    0.870449971
## imdb_num_votes    1.000000000    0.19759395    0.267938386
## critics_score     0.197593955    1.00000000    0.706728922
## audience_score    0.267938386    0.70672892    1.000000000

Part 4: Feature Engineering and Modeling

Feature engineering is a non standard field of data modelling / machine learning that tries to transform raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data[2]. In this section, I present the insights that I have made to get to the final model. The movies training dataset was repeatedly sampled and transformed in order to get a high adjusted R2 score. The list below shows the feature engineering process taken to get to the final model and their respective R2 scores. I will explain each item and why I have included them in the final model. For each model created, the residuals and predicted values plot was analyzed, using the code below. I have left it commented because it generated too many plots, one for each attribute, but feel free to run it in your machine:

# recurrent evaluation using this section

                  # predicted values for the training dataset
#testePredict <- predict(final.model,medianImputTrainingPredictors)
                  # plot expected versus predicted values
#plot(abs(medianImputTrainingPredictors$imdb_rating - testePredict))
                  # plot attributes distribution versus predicted values
#for(i in 1:ncol(medianImputTrainingPredictors)){
#  title <- sprintf('%s', names(medianImputTrainingPredictors)[i])
#  plot(medianImputTrainingPredictors[,i],medianImputTrainingPredictors$imdb_rating,main = #title)
#}

1 - Multiple R-squared: 0.4556, Adjusted R-squared: 0.4211 Pure Model

2 - Multiple R-squared: 0.5119, Adjusted R-squared: 0.4954 Transformation of imdb_ratings < 5 to 5

3 - Multiple R-squared: 0.5267, Adjusted R-squared: 0.5107 Theatre release year bin transformation

4 - Multiple R-squared: 0.5774, Adjusted R-squared: 0.5524 Weights to award wining/nominated movies

5 - Multiple R-squared: 0.593, Adjusted R-squared: 0.5698 Best Actress factor removal and horror movie category removed

6 - Multiple R-squared: 0.6887, Adjusted R-squared: 0.6426 Interaction terms between release year and movie genre, best actor, actress and director

7 - Multiple R-squared: 0.7381, Adjusted R-squared: 0.6960 Inclusion of the 5 biggest studios

8 - Multiple R-squared: 0.8762, Adjusted R-squared: 0.8563 Removal of the best actress weights

1 - Pure Model: The raw model built in section 3.1

2 - Transformation of imdb_ratings < 5 to 5: As we are trying to predict the best ratings, I removed the low score ratings and merged then into the 5 score. In this way, maybe we would reduce the noise and force the model to focus to predict the different scores > 5.

3 - Theatre release year bin transformation: Another feature transformation to try to reduce the variance of the model. Firstly, grouping the years in bins created just 5 years categories, instead of more than 40 years of variations and, besides that, looking at the histogram of the release year attribute, we noticed that people is not used to rating a movie with a bad score, they usually rate a movie if there is a positive impression about it. Secondly, the idea is that maybe the impact of year in the final imdb ratings were not granulated as individual years, but in a range of years.

4 - Weights to award wining/nominated movies: As we have verified in the NZV variables, the nominated people indicator for best actor / actress / director showed a high discrimination capacity. But we have also seen that the number of examples with nominated celebrities was really small. This sounded like a class imbalance problem, where the model just learned to ignore this examples and tended to use the majority of the other values, i.e. no people nominate. To force the model to pay a higher attention to these minor examples, we altered our linear model to a weighted linear regression, where we add a weight term to the classic least squares algorithm:

\(J(\theta ) = \sum_{i = 1}^{m} w_{i}(y_{i} - \hat{y})^2, \hat{y} = \beta_{0} + \sum_{i=1}^{n} \beta_{n} x_{mn}\)

5 - Best Actress factor removal and horror movie category removed: After ploting the best_actress_win versus the predicted imdb_rating, I saw that there was no significant difference between the predicted imbd_ratings for a movie with a best actress prize winner and a no winner. Besides, I also realized that the horror movie genre could not bring a better prediction power, and removing it increased the adjusted R2 score.

6 - Interaction terms between release year and movie genre, best actor, actress and director: This interaction between factor variables worked as an increase in the \(\beta_{0}\) bias parameter. My thought was that maybe people used to rate old (i.e. ‘classic’, ‘old-school’, etc…) movies with a better spirit than when with new blockbuster movies. This was a gut feeling, there was no graphical or statistical evidence for the variables interaction but the feeling that people like to feel cult. This step generated a lot of parameters, but they also provided good useful information also, because the adjusted R2 followed the general R2 closely.

7 - Inclusion of the 5 biggest studios: At the beggining of the project, I did not include the studios variable, because there was too much variability (an unique variable factor of 90%, meaning that 90% of the total studios values were unique). But maybe, movies coming from the biggests (and more famous) studios could bring some more clues about its ratings. So I selected the 5 more frequent studios in the training dataset and transformed the other studios to a new category, ‘other’. The new added variable was a factor with 6 levels, the 5 biggest studios and the ‘other’ category.

8 - Removal of the best actress weights: This process followed step 5, as I had forgotten to remove the weights of the best_actress_win variable. They were not even being included in the model, so there was no sense in maintain big importance weights for the movies coming with an award wining actress.

All the feature engineer steps presented above are coded below and, finally, the final model is presented. As I would use the same steps for the training and testing dataset, I created a function that took a dataframe plus the function mode (training or testing) and it returned the processed dataframe. Besides the feature engineer process, we also transformed two significant parameters from the first model. “runtime” and “imdb_num_votes” were really significant predictors, but their parameter value were really small (runtime = 5.923e-03 and imdb_num_votes), so I transformed then to a bigger scale (runtime in hours and imdb_num_votes x 1000). In this way, our parameters would be on a scale easier to understand.

moviesVarEngineer <- function(dataFrame, mode){
  
  ######### feature transformation 
  dataFrame$imdb_num_votes <- dataFrame$imdb_num_votes/1000
  dataFrame$runtime <- dataFrame$runtime/60
  ######### 
  
  # step 2
  dataFrame$imdb_rating[dataFrame$imdb_rating < 5] <- 5
  
  # step 3
  dataFrame$thtr_rel_year_BIN <- cut(dataFrame$thtr_rel_year, breaks = c(1969, 1980, 1990,2000,2010, 2014))
  dataFrame$thtr_rel_year <- NULL
  
  # step 4 = weights - outside of the function
  
  # step 5 = best actress removal and horror genre category removal
  dataFrame$genre[dataFrame$genre == 'Horror'] <- 'Other'
  dataFrame$genre <- factor(dataFrame$genre, levels = levels(dataFrame$genre)[c(1:6,8:11)])
  
  # step 6 = added interactions in the final model
  
  # step 7 = selected the 5 biggest studios and transformed the others to 'other'
  studios <- c('Paramount Pictures','Warner Bros. Pictures','Sony Pictures Home Entertainment','Universal Pictures','Warner Home Video')
  if(mode == 'training'){
    dataFrame$studio <- training$studio  
  }else{
    dataFrame$studio <- testing$studio
  }
  dataFrame$studio[!dataFrame$studio %in% studios] <- NA
  dataFrame$studio <- factor(dataFrame$studio, levels = c(studios,'Other'))
  dataFrame$studio[is.na(dataFrame$studio)] <- 'Other'
  
  invisible(dataFrame)
}
medianImputTrainingPredictors <- moviesVarEngineer(medianImputTrainingPredictors,'training')

weight <- rep(NA,nrow(medianImputTrainingPredictors))
weight[medianImputTrainingPredictors$best_pic_nom == 'yes'] <- 1000
weight[medianImputTrainingPredictors$best_pic_win == 'yes'] <- 1000
weight[medianImputTrainingPredictors$best_dir_win == 'yes'] <- 1000
weight[is.na(weight)] <- 1

final.model <- lm(imdb_rating ~ title_type + genre + runtime + 
              mpaa_rating  + imdb_num_votes + 
              best_pic_nom + thtr_rel_year_BIN + best_pic_win + 
              best_dir_win + studio + genre:thtr_rel_year_BIN +
              thtr_rel_year_BIN:best_pic_win + thtr_rel_year_BIN:best_actress_win +
              thtr_rel_year_BIN:best_dir_win,
                data = medianImputTrainingPredictors,
                weights = weight)
summary(final.model)
## 
## Call:
## lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating + 
##     imdb_num_votes + best_pic_nom + thtr_rel_year_BIN + best_pic_win + 
##     best_dir_win + studio + genre:thtr_rel_year_BIN + thtr_rel_year_BIN:best_pic_win + 
##     thtr_rel_year_BIN:best_actress_win + thtr_rel_year_BIN:best_dir_win, 
##     data = medianImputTrainingPredictors, weights = weight)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.748  -0.904  -0.234   0.324  27.330 
## 
## Coefficients: (7 not defined because of singularities)
##                                                               Estimate
## (Intercept)                                                  4.5733900
## title_typeFeature Film                                      -0.4771147
## title_typeTV Movie                                          -0.7272767
## genreAnimation                                               1.9017413
## genreArt House & International                               0.8889418
## genreComedy                                                 -0.1568169
## genreDocumentary                                             0.9393243
## genreDrama                                                   1.4754281
## genreMusical & Performing Arts                              -1.1554540
## genreMystery & Suspense                                      1.3790490
## genreOther                                                   0.1897325
## genreScience Fiction & Fantasy                              -0.3996974
## runtime                                                      0.6906618
## mpaa_ratingNC-17                                            -0.5771154
## mpaa_ratingPG                                                0.3032650
## mpaa_ratingPG-13                                             0.1847876
## mpaa_ratingR                                                -0.2295623
## mpaa_ratingUnrated                                          -0.0491466
## imdb_num_votes                                               0.0027573
## best_pic_nomyes                                             -0.0034484
## thtr_rel_year_BIN(1980,1990]                                 0.0281348
## thtr_rel_year_BIN(1990,2000]                                -0.4949214
## thtr_rel_year_BIN(2000,2010]                                -2.0806625
## thtr_rel_year_BIN(2010,2014]                                -1.0796343
## best_pic_winyes                                             -0.0178430
## best_dir_winyes                                             -1.0490195
## studioWarner Bros. Pictures                                  0.5746303
## studioSony Pictures Home Entertainment                       1.4204896
## studioUniversal Pictures                                     1.9739412
## studioWarner Home Video                                      0.9032382
## studioOther                                                  1.1241572
## genreAnimation:thtr_rel_year_BIN(1980,1990]                         NA
## genreArt House & International:thtr_rel_year_BIN(1980,1990] -1.4125059
## genreComedy:thtr_rel_year_BIN(1980,1990]                    -0.4278370
## genreDocumentary:thtr_rel_year_BIN(1980,1990]               -0.4912124
## genreDrama:thtr_rel_year_BIN(1980,1990]                     -1.2330254
## genreMusical & Performing Arts:thtr_rel_year_BIN(1980,1990]  0.6267459
## genreMystery & Suspense:thtr_rel_year_BIN(1980,1990]        -1.4449559
## genreOther:thtr_rel_year_BIN(1980,1990]                     -0.2502388
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1980,1990] -1.2686210
## genreAnimation:thtr_rel_year_BIN(1990,2000]                         NA
## genreArt House & International:thtr_rel_year_BIN(1990,2000]  0.3516835
## genreComedy:thtr_rel_year_BIN(1990,2000]                     2.3761955
## genreDocumentary:thtr_rel_year_BIN(1990,2000]                0.4500423
## genreDrama:thtr_rel_year_BIN(1990,2000]                     -0.3915023
## genreMusical & Performing Arts:thtr_rel_year_BIN(1990,2000]  2.4587418
## genreMystery & Suspense:thtr_rel_year_BIN(1990,2000]         0.1736703
## genreOther:thtr_rel_year_BIN(1990,2000]                     -0.0780751
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1990,2000] -0.1127124
## genreAnimation:thtr_rel_year_BIN(2000,2010]                         NA
## genreArt House & International:thtr_rel_year_BIN(2000,2010]  1.8034443
## genreComedy:thtr_rel_year_BIN(2000,2010]                     1.3937065
## genreDocumentary:thtr_rel_year_BIN(2000,2010]                2.0461984
## genreDrama:thtr_rel_year_BIN(2000,2010]                      1.3823521
## genreMusical & Performing Arts:thtr_rel_year_BIN(2000,2010]  3.6309217
## genreMystery & Suspense:thtr_rel_year_BIN(2000,2010]         1.2653928
## genreOther:thtr_rel_year_BIN(2000,2010]                      3.8415854
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2000,2010]  2.0360166
## genreAnimation:thtr_rel_year_BIN(2010,2014]                         NA
## genreArt House & International:thtr_rel_year_BIN(2010,2014]  0.1318916
## genreComedy:thtr_rel_year_BIN(2010,2014]                     0.6231210
## genreDocumentary:thtr_rel_year_BIN(2010,2014]                1.0455072
## genreDrama:thtr_rel_year_BIN(2010,2014]                     -0.2982281
## genreMusical & Performing Arts:thtr_rel_year_BIN(2010,2014]         NA
## genreMystery & Suspense:thtr_rel_year_BIN(2010,2014]        -0.6847781
## genreOther:thtr_rel_year_BIN(2010,2014]                      0.0421118
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2010,2014]  0.1566839
## thtr_rel_year_BIN(1980,1990]:best_pic_winyes                -0.0213499
## thtr_rel_year_BIN(1990,2000]:best_pic_winyes                -0.1527893
## thtr_rel_year_BIN(2000,2010]:best_pic_winyes                        NA
## thtr_rel_year_BIN(2010,2014]:best_pic_winyes                        NA
## thtr_rel_year_BIN(1969,1980]:best_actress_winyes            -1.0495287
## thtr_rel_year_BIN(1980,1990]:best_actress_winyes             0.3176271
## thtr_rel_year_BIN(1990,2000]:best_actress_winyes            -0.4387228
## thtr_rel_year_BIN(2000,2010]:best_actress_winyes             0.0172925
## thtr_rel_year_BIN(2010,2014]:best_actress_winyes            -0.1006795
## thtr_rel_year_BIN(1980,1990]:best_dir_winyes                 1.5644755
## thtr_rel_year_BIN(1990,2000]:best_dir_winyes                 0.4863802
## thtr_rel_year_BIN(2000,2010]:best_dir_winyes                 0.4535078
## thtr_rel_year_BIN(2010,2014]:best_dir_winyes                 1.4356526
##                                                             Std. Error
## (Intercept)                                                  1.5018906
## title_typeFeature Film                                       1.1923086
## title_typeTV Movie                                           1.8331915
## genreAnimation                                               1.2976069
## genreArt House & International                               1.9526031
## genreComedy                                                  2.7499331
## genreDocumentary                                             2.0046678
## genreDrama                                                   0.1080429
## genreMusical & Performing Arts                               2.7504205
## genreMystery & Suspense                                      0.1751700
## genreOther                                                   0.9969168
## genreScience Fiction & Fantasy                               1.5991737
## runtime                                                      0.1219331
## mpaa_ratingNC-17                                             2.1601855
## mpaa_ratingPG                                                0.9434460
## mpaa_ratingPG-13                                             0.9453747
## mpaa_ratingR                                                 0.9387382
## mpaa_ratingUnrated                                           1.0424707
## imdb_num_votes                                               0.0001338
## best_pic_nomyes                                              0.0956836
## thtr_rel_year_BIN(1980,1990]                                 0.4827692
## thtr_rel_year_BIN(1990,2000]                                 0.2816363
## thtr_rel_year_BIN(2000,2010]                                 0.2966526
## thtr_rel_year_BIN(2010,2014]                                 1.6044279
## best_pic_winyes                                              0.1169867
## best_dir_winyes                                              0.2379733
## studioWarner Bros. Pictures                                  0.1782330
## studioSony Pictures Home Entertainment                       0.1703437
## studioUniversal Pictures                                     0.1383241
## studioWarner Home Video                                      0.1497871
## studioOther                                                  0.1043026
## genreAnimation:thtr_rel_year_BIN(1980,1990]                         NA
## genreArt House & International:thtr_rel_year_BIN(1980,1990]  2.7802652
## genreComedy:thtr_rel_year_BIN(1980,1990]                     2.9119256
## genreDocumentary:thtr_rel_year_BIN(1980,1990]                2.1758397
## genreDrama:thtr_rel_year_BIN(1980,1990]                      0.1874641
## genreMusical & Performing Arts:thtr_rel_year_BIN(1980,1990]  2.7551132
## genreMystery & Suspense:thtr_rel_year_BIN(1980,1990]         1.1849103
## genreOther:thtr_rel_year_BIN(1980,1990]                      1.0085224
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1980,1990]  3.2003704
## genreAnimation:thtr_rel_year_BIN(1990,2000]                         NA
## genreArt House & International:thtr_rel_year_BIN(1990,2000]  2.7686645
## genreComedy:thtr_rel_year_BIN(1990,2000]                     2.7576451
## genreDocumentary:thtr_rel_year_BIN(1990,2000]                2.2790450
## genreDrama:thtr_rel_year_BIN(1990,2000]                      0.2012553
## genreMusical & Performing Arts:thtr_rel_year_BIN(1990,2000]  3.3662303
## genreMystery & Suspense:thtr_rel_year_BIN(1990,2000]         0.2866742
## genreOther:thtr_rel_year_BIN(1990,2000]                      1.4625833
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1990,2000]  3.1759188
## genreAnimation:thtr_rel_year_BIN(2000,2010]                         NA
## genreArt House & International:thtr_rel_year_BIN(2000,2010]  2.3955812
## genreComedy:thtr_rel_year_BIN(2000,2010]                     2.8113998
## genreDocumentary:thtr_rel_year_BIN(2000,2010]                1.7138877
## genreDrama:thtr_rel_year_BIN(2000,2010]                      0.1899815
## genreMusical & Performing Arts:thtr_rel_year_BIN(2000,2010]  3.2043126
## genreMystery & Suspense:thtr_rel_year_BIN(2000,2010]         0.2484800
## genreOther:thtr_rel_year_BIN(2000,2010]                      1.0337153
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2000,2010]  3.1887923
## genreAnimation:thtr_rel_year_BIN(2010,2014]                         NA
## genreArt House & International:thtr_rel_year_BIN(2010,2014]  3.1866282
## genreComedy:thtr_rel_year_BIN(2010,2014]                     3.3648032
## genreDocumentary:thtr_rel_year_BIN(2010,2014]                2.4276554
## genreDrama:thtr_rel_year_BIN(2010,2014]                      1.5905918
## genreMusical & Performing Arts:thtr_rel_year_BIN(2010,2014]         NA
## genreMystery & Suspense:thtr_rel_year_BIN(2010,2014]         1.8144927
## genreOther:thtr_rel_year_BIN(2010,2014]                      1.8826352
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2010,2014]  3.5438994
## thtr_rel_year_BIN(1980,1990]:best_pic_winyes                 0.1553447
## thtr_rel_year_BIN(1990,2000]:best_pic_winyes                 0.1916048
## thtr_rel_year_BIN(2000,2010]:best_pic_winyes                        NA
## thtr_rel_year_BIN(2010,2014]:best_pic_winyes                        NA
## thtr_rel_year_BIN(1969,1980]:best_actress_winyes             0.1421702
## thtr_rel_year_BIN(1980,1990]:best_actress_winyes             0.4185223
## thtr_rel_year_BIN(1990,2000]:best_actress_winyes             0.1086283
## thtr_rel_year_BIN(2000,2010]:best_actress_winyes             0.1172106
## thtr_rel_year_BIN(2010,2014]:best_actress_winyes             1.1352685
## thtr_rel_year_BIN(1980,1990]:best_dir_winyes                 0.4608510
## thtr_rel_year_BIN(1990,2000]:best_dir_winyes                 0.2367964
## thtr_rel_year_BIN(2000,2010]:best_dir_winyes                 0.2363320
## thtr_rel_year_BIN(2010,2014]:best_dir_winyes                 1.3605037
##                                                             t value
## (Intercept)                                                   3.045
## title_typeFeature Film                                       -0.400
## title_typeTV Movie                                           -0.397
## genreAnimation                                                1.466
## genreArt House & International                                0.455
## genreComedy                                                  -0.057
## genreDocumentary                                              0.469
## genreDrama                                                   13.656
## genreMusical & Performing Arts                               -0.420
## genreMystery & Suspense                                       7.873
## genreOther                                                    0.190
## genreScience Fiction & Fantasy                               -0.250
## runtime                                                       5.664
## mpaa_ratingNC-17                                             -0.267
## mpaa_ratingPG                                                 0.321
## mpaa_ratingPG-13                                              0.195
## mpaa_ratingR                                                 -0.245
## mpaa_ratingUnrated                                           -0.047
## imdb_num_votes                                               20.606
## best_pic_nomyes                                              -0.036
## thtr_rel_year_BIN(1980,1990]                                  0.058
## thtr_rel_year_BIN(1990,2000]                                 -1.757
## thtr_rel_year_BIN(2000,2010]                                 -7.014
## thtr_rel_year_BIN(2010,2014]                                 -0.673
## best_pic_winyes                                              -0.153
## best_dir_winyes                                              -4.408
## studioWarner Bros. Pictures                                   3.224
## studioSony Pictures Home Entertainment                        8.339
## studioUniversal Pictures                                     14.270
## studioWarner Home Video                                       6.030
## studioOther                                                  10.778
## genreAnimation:thtr_rel_year_BIN(1980,1990]                      NA
## genreArt House & International:thtr_rel_year_BIN(1980,1990]  -0.508
## genreComedy:thtr_rel_year_BIN(1980,1990]                     -0.147
## genreDocumentary:thtr_rel_year_BIN(1980,1990]                -0.226
## genreDrama:thtr_rel_year_BIN(1980,1990]                      -6.577
## genreMusical & Performing Arts:thtr_rel_year_BIN(1980,1990]   0.227
## genreMystery & Suspense:thtr_rel_year_BIN(1980,1990]         -1.219
## genreOther:thtr_rel_year_BIN(1980,1990]                      -0.248
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1980,1990]  -0.396
## genreAnimation:thtr_rel_year_BIN(1990,2000]                      NA
## genreArt House & International:thtr_rel_year_BIN(1990,2000]   0.127
## genreComedy:thtr_rel_year_BIN(1990,2000]                      0.862
## genreDocumentary:thtr_rel_year_BIN(1990,2000]                 0.197
## genreDrama:thtr_rel_year_BIN(1990,2000]                      -1.945
## genreMusical & Performing Arts:thtr_rel_year_BIN(1990,2000]   0.730
## genreMystery & Suspense:thtr_rel_year_BIN(1990,2000]          0.606
## genreOther:thtr_rel_year_BIN(1990,2000]                      -0.053
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1990,2000]  -0.035
## genreAnimation:thtr_rel_year_BIN(2000,2010]                      NA
## genreArt House & International:thtr_rel_year_BIN(2000,2010]   0.753
## genreComedy:thtr_rel_year_BIN(2000,2010]                      0.496
## genreDocumentary:thtr_rel_year_BIN(2000,2010]                 1.194
## genreDrama:thtr_rel_year_BIN(2000,2010]                       7.276
## genreMusical & Performing Arts:thtr_rel_year_BIN(2000,2010]   1.133
## genreMystery & Suspense:thtr_rel_year_BIN(2000,2010]          5.093
## genreOther:thtr_rel_year_BIN(2000,2010]                       3.716
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2000,2010]   0.638
## genreAnimation:thtr_rel_year_BIN(2010,2014]                      NA
## genreArt House & International:thtr_rel_year_BIN(2010,2014]   0.041
## genreComedy:thtr_rel_year_BIN(2010,2014]                      0.185
## genreDocumentary:thtr_rel_year_BIN(2010,2014]                 0.431
## genreDrama:thtr_rel_year_BIN(2010,2014]                      -0.187
## genreMusical & Performing Arts:thtr_rel_year_BIN(2010,2014]      NA
## genreMystery & Suspense:thtr_rel_year_BIN(2010,2014]         -0.377
## genreOther:thtr_rel_year_BIN(2010,2014]                       0.022
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2010,2014]   0.044
## thtr_rel_year_BIN(1980,1990]:best_pic_winyes                 -0.137
## thtr_rel_year_BIN(1990,2000]:best_pic_winyes                 -0.797
## thtr_rel_year_BIN(2000,2010]:best_pic_winyes                     NA
## thtr_rel_year_BIN(2010,2014]:best_pic_winyes                     NA
## thtr_rel_year_BIN(1969,1980]:best_actress_winyes             -7.382
## thtr_rel_year_BIN(1980,1990]:best_actress_winyes              0.759
## thtr_rel_year_BIN(1990,2000]:best_actress_winyes             -4.039
## thtr_rel_year_BIN(2000,2010]:best_actress_winyes              0.148
## thtr_rel_year_BIN(2010,2014]:best_actress_winyes             -0.089
## thtr_rel_year_BIN(1980,1990]:best_dir_winyes                  3.395
## thtr_rel_year_BIN(1990,2000]:best_dir_winyes                  2.054
## thtr_rel_year_BIN(2000,2010]:best_dir_winyes                  1.919
## thtr_rel_year_BIN(2010,2014]:best_dir_winyes                  1.055
##                                                             Pr(>|t|)    
## (Intercept)                                                 0.002463 ** 
## title_typeFeature Film                                      0.689229    
## title_typeTV Movie                                          0.691758    
## genreAnimation                                              0.143465    
## genreArt House & International                              0.649143    
## genreComedy                                                 0.954550    
## genreDocumentary                                            0.639606    
## genreDrama                                                   < 2e-16 ***
## genreMusical & Performing Arts                              0.674613    
## genreMystery & Suspense                                     2.64e-14 ***
## genreOther                                                  0.849145    
## genreScience Fiction & Fantasy                              0.802748    
## runtime                                                     2.64e-08 ***
## mpaa_ratingNC-17                                            0.789469    
## mpaa_ratingPG                                               0.748024    
## mpaa_ratingPG-13                                            0.845118    
## mpaa_ratingR                                                0.806922    
## mpaa_ratingUnrated                                          0.962419    
## imdb_num_votes                                               < 2e-16 ***
## best_pic_nomyes                                             0.971267    
## thtr_rel_year_BIN(1980,1990]                                0.953553    
## thtr_rel_year_BIN(1990,2000]                                0.079548 .  
## thtr_rel_year_BIN(2000,2010]                                8.60e-12 ***
## thtr_rel_year_BIN(2010,2014]                                0.501352    
## best_pic_winyes                                             0.878844    
## best_dir_winyes                                             1.31e-05 ***
## studioWarner Bros. Pictures                                 0.001356 ** 
## studioSony Pictures Home Entertainment                      9.31e-16 ***
## studioUniversal Pictures                                     < 2e-16 ***
## studioWarner Home Video                                     3.43e-09 ***
## studioOther                                                  < 2e-16 ***
## genreAnimation:thtr_rel_year_BIN(1980,1990]                       NA    
## genreArt House & International:thtr_rel_year_BIN(1980,1990] 0.611670    
## genreComedy:thtr_rel_year_BIN(1980,1990]                    0.883257    
## genreDocumentary:thtr_rel_year_BIN(1980,1990]               0.821493    
## genreDrama:thtr_rel_year_BIN(1980,1990]                     1.34e-10 ***
## genreMusical & Performing Arts:thtr_rel_year_BIN(1980,1990] 0.820151    
## genreMystery & Suspense:thtr_rel_year_BIN(1980,1990]        0.223310    
## genreOther:thtr_rel_year_BIN(1980,1990]                     0.804152    
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1980,1990] 0.692000    
## genreAnimation:thtr_rel_year_BIN(1990,2000]                       NA    
## genreArt House & International:thtr_rel_year_BIN(1990,2000] 0.898979    
## genreComedy:thtr_rel_year_BIN(1990,2000]                    0.389327    
## genreDocumentary:thtr_rel_year_BIN(1990,2000]               0.843550    
## genreDrama:thtr_rel_year_BIN(1990,2000]                     0.052364 .  
## genreMusical & Performing Arts:thtr_rel_year_BIN(1990,2000] 0.465519    
## genreMystery & Suspense:thtr_rel_year_BIN(1990,2000]        0.544947    
## genreOther:thtr_rel_year_BIN(1990,2000]                     0.957452    
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(1990,2000] 0.971705    
## genreAnimation:thtr_rel_year_BIN(2000,2010]                       NA    
## genreArt House & International:thtr_rel_year_BIN(2000,2010] 0.451953    
## genreComedy:thtr_rel_year_BIN(2000,2010]                    0.620325    
## genreDocumentary:thtr_rel_year_BIN(2000,2010]               0.233152    
## genreDrama:thtr_rel_year_BIN(2000,2010]                     1.55e-12 ***
## genreMusical & Performing Arts:thtr_rel_year_BIN(2000,2010] 0.257763    
## genreMystery & Suspense:thtr_rel_year_BIN(2000,2010]        5.21e-07 ***
## genreOther:thtr_rel_year_BIN(2000,2010]                     0.000228 ***
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2000,2010] 0.523480    
## genreAnimation:thtr_rel_year_BIN(2010,2014]                       NA    
## genreArt House & International:thtr_rel_year_BIN(2010,2014] 0.967004    
## genreComedy:thtr_rel_year_BIN(2010,2014]                    0.853165    
## genreDocumentary:thtr_rel_year_BIN(2010,2014]               0.666919    
## genreDrama:thtr_rel_year_BIN(2010,2014]                     0.851357    
## genreMusical & Performing Arts:thtr_rel_year_BIN(2010,2014]       NA    
## genreMystery & Suspense:thtr_rel_year_BIN(2010,2014]        0.706060    
## genreOther:thtr_rel_year_BIN(2010,2014]                     0.982164    
## genreScience Fiction & Fantasy:thtr_rel_year_BIN(2010,2014] 0.964755    
## thtr_rel_year_BIN(1980,1990]:best_pic_winyes                0.890748    
## thtr_rel_year_BIN(1990,2000]:best_pic_winyes                0.425630    
## thtr_rel_year_BIN(2000,2010]:best_pic_winyes                      NA    
## thtr_rel_year_BIN(2010,2014]:best_pic_winyes                      NA    
## thtr_rel_year_BIN(1969,1980]:best_actress_winyes            7.64e-13 ***
## thtr_rel_year_BIN(1980,1990]:best_actress_winyes            0.448296    
## thtr_rel_year_BIN(1990,2000]:best_actress_winyes            6.32e-05 ***
## thtr_rel_year_BIN(2000,2010]:best_actress_winyes            0.882777    
## thtr_rel_year_BIN(2010,2014]:best_actress_winyes            0.929373    
## thtr_rel_year_BIN(1980,1990]:best_dir_winyes                0.000748 ***
## thtr_rel_year_BIN(1990,2000]:best_dir_winyes                0.040555 *  
## thtr_rel_year_BIN(2000,2010]:best_dir_winyes                0.055627 .  
## thtr_rel_year_BIN(2010,2014]:best_dir_winyes                0.291886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.739 on 448 degrees of freedom
## Multiple R-squared:  0.8762, Adjusted R-squared:  0.8563 
## F-statistic: 44.05 on 72 and 448 DF,  p-value: < 2.2e-16

Part 5: Model Interpretation

Firstly, we identify a few NA’s in the model. That happened because there was no examples in the training set the agreed with the two factor variables, e.g. there was no animation movie released between years 1980 and 1990 in the dataset.

Lets take a look at the statistical significant parameter estimation: Statistical significant we mean that, with a 95% confidence, that parameter estimator is different than 0:

Variable P.Value Estimate
genreDrama < 2e-16 1.4754281
genreMystery & Suspense 2.64e-14 1.3790490
runtime 2.64e-08 0.6906618
imdb_num_votes 2e-16 0.0027573
thtr_rel_year_BIN(2000,2010] 8.60e-12 -2.0806625
studioWarner Bros. Pictures 0.001356 0.5746303
studioSony Pictures Home Entertainment 9.31e-16 1.4204896
studioUniversal Pictures < 2e-16 1.9739412
studioWarner Home Video 3.43e-09 0.9032382
studioOther < 2e-16 1.1241572
genreDrama:thtr_rel_year_BIN(1980,1990] 1.34e-10 -1.2330254
genreDrama:thtr_rel_year_BIN(2000,2010] 1.55e-12 1.3823521
genreMystery & Suspense:thtr_rel_year_BIN(2000,2010] 5.21e-07 1.2653928
genreOther:thtr_rel_year_BIN(2000,2010] 0.000228 3.8415854
thtr_rel_year_BIN(1969,1980]:best_actress_winyes 7.64e-13 -1.0495287
thtr_rel_year_BIN(1990,2000]:best_actress_winyes 6.32e-05 -0.4387228
thtr_rel_year_BIN(1980,1990]:best_dir_winyes 0.000748 1.5644755
thtr_rel_year_BIN(1990,2000]:best_dir_winyes 0.040555 0.4863802

All the following results can be interpreted as the correlation of each paramenter with the imdb_rating response variable, when all the other attributes remain constant. Most of the attributes were positive correlated with the response variable.

IQR(medianImputTrainingPredictors$runtime)
## [1] 0.3833333

Part 6: Prediction

As we have discussed in section 3, declaring the model efficiency by the estimated training error could be a little unrealistic because the model tries to minimize the training set error. The need to evaluate the training set by the adjusted R2 was to delimit the number of parameters we would include in our model, preventing overfitting. As the original training dataset had to go throught a preprocess step, we did the same procedure with the testing dataset, with all the parameters learned with the training data, including the median imputation and the steps in the ‘moviesVarEngineer’ function. We only do not create a weight array because it was only used when estimating the \(w_{i}\) parameters of the weighted least squares. A good way to evaluate regression models is the Root Mean Square Error - RMSE, which makes an estimate of the standard error made by the model in unseen data.

medianImputeTestingData <- predict(medianImput, testing)
medianImputTestingPredictors <- medianImputeTestingData[,-c(responseCols,noInterestVars)]
#medianImputTestingPredictors <- #cbind(medianImputTestingPredictors,medianImputeTestingData[,responseCols[1]])
medianImputTestingPredictors$imdb_rating <- medianImputeTestingData[,responseCols[1]]

medianImputTestingPredictors <- moviesVarEngineer(medianImputTestingPredictors,'testing')

testingPredictions <- predict(final.model,medianImputTestingPredictors)
## Warning in predict.lm(final.model, medianImputTestingPredictors):
## prediction from a rank-deficient fit may be misleading
RMSE(testingPredictions,medianImputTestingPredictors$imdb_rating,na.rm = T)
## [1] 1.158517

Part 7: Conclusion and Next Steps

The feature engineering process was the responsible in greatly improving the adjusted R2 and the model seemed to generalize well in new unseen data, as we obtained a relatively low RMSE.

A probable solution for the NA’s presented in the model was to remove the specific interactions from the model. A way to accomplish this would be to use the dummyVars function in the caret package. This function splits the factor in n columns, one for each factor level, as a dummy variable, i.e. 1 if the variable has that factor level and 0 otherwise.

Besides the NA’s, there was one more point to improve in the model. As I have created interaction terms with factor with many levels, some linear dependency was introduced with them. As we can see in the testing dataset prediction section, the model complained with a few warnings about predicting in a rank-deficient matrix (dataset). That probably is a linear dependency between some of my variables and that could make the parameter estimates unstable and, consequently, not accurate for new data. The NZV and Correlation analysis taken with the final model could not identify and eliminate the probable dependent columns, so further investigation would have to be done to improve the model stability.

It is important to notice that I have tried to build a model using the standard lm function and the stepwise automated model selection algorithm, more specifically the AIC stepwise model selection. Only before the first first raw model, I managed to extract a better model with the stepwise selection, after that, the criterious feature selection, transformation and engineering processes always gave a better fit in the training set.

The project show here only used the ‘imdb_rating’ variable as a response variable. Similar analysis could be made using the other two continuous variables (‘critics_score’ and ‘audience_score’), a logistic regression could be modelled and fit for the binary response variable ‘audience_rating’ and a multinomial regression model could be built for the multilevel response variable ‘critics_rating’.

There is a lot of research that can be done and the one explained in this project shows that we can accurately predict how people will rate (like) the movies from its general characteristics.

.