Data Mining Final Project - IMDB Database

Maria Paracha and Catherine Williams
June 5, 2019

PROBLEM

  • IMDB has immense amounts of data regarding movies and wants to understand which combination of feature values contribute to bringing a film to success and consequently winning awards
  • We will help predict number of awards (oscar, golden globes etc.) a movie will win

DATA SET DETAILS

  • 36 predictor variables

    • 28 discrete
    • 8 continuous
  • Label is continuous = number of wins

  • Ground Truth

    • Number of Wins = Number of Prizes/Awards won by the movie

Source: https://www.kaggle.com/gabrielegalimberti/movies-example-for-machine-learning-activities/

INSTALLING LIBRARIES AND PACKAGES

#checks if package is installed, if not, installs it. Then loads all packages
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
      install.packages(new.pkg, dependencies = TRUE)
      sapply(pkg, require, character.only = TRUE)
}
packages <- c("tidyverse","caret","performance","Amelia",
              "rpart","rpart.plot","randomForest",
              "fastDummies","corrplot")
ipak(packages)

theme_set(theme_classic()) #applies classic theme to all charts

IMPORTING & VIEWING DATA

movies <- read.csv2("MACHINE_LEARNING_FINAL.csv")
 [1] "ï..title"         "year"             "lifetime_gross"  
 [4] "ratingInteger"    "ratingCount"      "duration"        
 [7] "nrOfWins"         "nrOfNominations"  "nrOfPhotos"      
[10] "nrOfNewsArticles" "nrOfUserReviews"  "nrOfGenre"       
[13] "Action"           "Adult"            "Adventure"       
[16] "Animation"        "Biography"        "Comedy"          
[19] "Crime"            "Documentary"      "Drama"           
[22] "Family"           "Fantasy"          "Horror"          
[25] "Music"            "Musical"          "Mystery"         
[28] "News"             "RealityTV"        "Romance"         
[31] "SciFi"            "Short"            "Sport"           
[34] "TalkShow"         "Thriller"         "War"             
[37] "Western"         

DATA EXPLORATION

  • No missing values or NAs
  • Excluded few columns from the data set that were not needed (Title, Adult, RealityTV, TalkShow)
  • Changed data to appropriate data types where needed
  • Removed outliers from the year, rating count, and number of photos columns
  • Some variables are not normally distributed therefore used log/sqrt to make them more normally distributed

DATA EXPLORATION

  • No missing values

plot of chunk unnamed-chunk-4

DATA EXPLORATION

  • Removed outliers
  • Deleted unimportant columns
#remove title and columns that do not have more than 1 unique value
movies <- movies[, sapply(movies, function(col) length(unique(col))) > 1] %>% select(-1)

#Exclude outliers for year
outliers <- boxplot(movies$year, main="Boxplot for year")$out
movies <- movies[-which(movies$year %in% outliers),]

#Exclude outliers for ratingCount
outliers <- boxplot(movies$ratingCount, main="Boxplot for ratingCount")$out
movies <- movies[-which(movies$ratingCount %in% outliers),]

#Exclude outliers for nrOfPhotos
outliers <- boxplot(movies$nrOfPhotos, main="Boxplot for nrOfPhotos")$out
movies <- movies[-which(movies$nrOfPhotos %in% outliers),]

DATA EXPLORATION

plot of chunk unnamed-chunk-6plot of chunk unnamed-chunk-6plot of chunk unnamed-chunk-6plot of chunk unnamed-chunk-6plot of chunk unnamed-chunk-6plot of chunk unnamed-chunk-6

DATA EXPLORATION

Function to create charts comparing all variables against the label

gg_plot <- function(x_col, y_col=movies$nrOfWins, data=movies){
  if(is.numeric(data[[x_col]])){
      p1 <- data %>% ggplot(mapping=aes_string(x_col, y_col))+
        geom_jitter(alpha=0.5)+
        geom_smooth(method="loess", se=FALSE)+
        labs(title=str_c("Awards vs ", x_col), y="label")
      p1 %>% print()
      h <- hist(data[[x_col]], breaks = "FD", plot = FALSE) #Freedman-Diaconis rule
      p2 <- ggplot(data, aes_string(x_col))+
        geom_histogram(aes(y = ..density..), breaks = h$breaks, alpha = 0.3, col = "white")+
        geom_density(size = 1) +
        labs(title=str_c("Histogram and density for ", x_col))
      p2 %>% print()    }
  else{
    p3 <- ggplot(data, aes_string(x_col, y_col))+
      geom_boxplot()+
      geom_hline(yintercept=mean(y_col), color="red")+
      geom_hline(yintercept=median(y_col), color="blue", linetype="dashed")+
      labs(title=str_c("Awards: Number of Wins by ", x_col), subtitle="Showing mean(red), median(blue)")
    p3 %>% print()    }
}

DATA EXPLORATION

Histogram to show distribution of label

plot of chunk unnamed-chunk-8

DATA EXPLORATION (Conclusions)

  • Recent years have more data points. This could be due to increased user adoption on IMDB.
  • there are positive relationships between the nrOfWins and lifetime_gross, ratingCount, nrOfPhotos, nrOfNewsArticles, and nrOfUserReviews.
  • The higher the rating, the higher the number of awards won with a few outliers. There is a large increase with ratings of 8 and 9.
  • Movie duration doesn't necessarily relate to number of awards.
  • Unsurprisingly, the number of nominations has a very strong positive relationship with the number of awards.
  • Movie genres do not appear to play a big role in number of awards except for Biography and War films.

CHECKING CORRELATIONS

  • Few variables with high correlations (>|0.5|)
  • There are also some moderate correlations (|0.3| to |0.5|)
#view correlations, drop the insignificant relationships, sort by highest to lowest, and visualize results graphically
corr_simple <- function(data=movies,drop="nrOfWins"){
  df_cor <- data %>% mutate_if(is.factor, as.numeric) %>% select(-drop)
  corr <- cor(df_cor)
  corr[lower.tri(corr,diag=TRUE)] <- NA  #Prepare to drop duplicates and correlations of 1
  corr[corr == 1] <- NA #drop perfect correlations
  corr <- as.data.frame(as.table(corr)) #Turn into a 3-column table
  corr <- na.omit(corr) #remove the NA values from above
  corr <- subset(corr, abs(Freq) > 0.3) #select significant values
  corr <- corr[order(-abs(corr$Freq)),] #Sort by highest correlation
  print(corr)
  #turn corr back into matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  #plot correlations visually
  corrplot::corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ")
}

CHECKING CORRELATIONS

                Var1             Var2       Freq
260      ratingCount  nrOfUserReviews  0.6775490
98    lifetime_gross      ratingCount  0.5786108
196      ratingCount       nrOfPhotos  0.5484434
228      ratingCount nrOfNewsArticles  0.5386525
194   lifetime_gross       nrOfPhotos  0.4447820
263       nrOfPhotos  nrOfUserReviews  0.4181773
264 nrOfNewsArticles  nrOfUserReviews  0.3741623
230  nrOfNominations nrOfNewsArticles  0.3701706
164      ratingCount  nrOfNominations  0.3560082
258   lifetime_gross  nrOfUserReviews  0.3557660
163    ratingInteger  nrOfNominations  0.3520855
549         duration            Drama  0.3501868
231       nrOfPhotos nrOfNewsArticles  0.3480180
396        Adventure        Animation  0.3443654
225             year nrOfNewsArticles  0.3356237
943           Comedy         Thriller -0.3125063
547    ratingInteger            Drama  0.3100243
226   lifetime_gross nrOfNewsArticles  0.3072152

CHECKING CORRELATIONS

plot of chunk unnamed-chunk-11

FEATURE ENGINEERING

  • Created new variable “Popularity” to remove correlations
  • Performed Log/sqrt transformations to improve distributions
  • Normalize data so model is not skewed by different types of measurements
#feature engineering
movies <- movies %>% mutate(Popularity = ratingCount+nrOfUserReviews+nrOfNewsArticles+nrOfPhotos) %>% select(-ratingCount,-nrOfUserReviews,-nrOfNewsArticles,-nrOfPhotos)

#transformations
movies <- movies %>% 
   mutate(year.log=log(year),
          duration.log=log(duration),
          Popularity.log=log(Popularity),
          lifetime_gross.log=log(lifetime_gross),
          nrOfWins.sqr = sqrt(nrOfWins),
          nrOfNominations.sqr = sqrt(nrOfNominations)) %>% 
  select(-year,-duration,-Popularity,-nrOfNominations,-lifetime_gross)

#dropped column
movies <- movies %>% select(-lifetime_gross.log)

#normalize dataframe
normalize <- function(x)(x - mean(x, na.rm=T))/sd(x, na.rm=T)
movies <- movies %>% mutate_at(vars(year.log,duration.log,Popularity.log,nrOfNominations.sqr), normalize)

DATA EXPLORATION

Histogram to show distribution of label after transformation

plot of chunk unnamed-chunk-13

TRAINING & TESTING DATA

  • Training set = 70%
  • Testing set = 30%

Created dummy variables to convert factors into numbers for linear regression

#convert to numeric dummy variables
movies_num <- dummy_cols(movies) %>% select(-ratingInteger,-nrOfGenre)
movies_num <- movies_num %>% mutate_if(is.logical,as.integer)

#for use with linear regression
set.seed(123)
train_num <- movies_num %>% sample_frac(0.7)
test_num <- movies_num %>% setdiff(train_num)

#for use with regression trees
set.seed(123)
train <- movies %>% sample_frac(0.7)
test <- movies %>% setdiff(train)

LINEAR REGRESSION MODELS

Different Linear Regression Models

  • By AIC
  • By P-Value
  • By Important Features

R-Squared Results

  • 76.16%
  • 76.08%
  • 76.08%

LINEAR REGRESSION MODEL (AIC)

       AIC        R2 R2_adjusted      RMSE
1 5017.302 0.7647038   0.7630284 0.7270008
                 names   overall
13 nrOfNominations.sqr 65.142762
14     ratingInteger_8 12.249629
12            year.log  4.522397
15     ratingInteger_9  3.835766
7               Family  3.509310
11            Thriller  3.267216
1               Action  2.912037
2            Adventure  2.832014
4            Biography  2.586796
6          Documentary  2.542071
10               Sport  2.078357
3            Animation  1.965759
8              Fantasy  1.898556
9              Mystery  1.897670
5               Comedy  1.704279
16     ratingInteger_6  1.526456

LINEAR REGRESSION MODEL (P-Value)

       AIC        R2 R2_adjusted      RMSE
1 5027.757 0.7631967   0.7617226 0.7293254
                 names    overall
10 nrOfNominations.sqr 64.1876690
11     ratingInteger_8 12.6982590
9             year.log  4.4751804
12     ratingInteger_9  4.0276171
5               Family  3.2606036
8             Thriller  3.2054662
4          Documentary  2.9272752
2            Adventure  2.5750520
1               Action  2.4912836
13     ratingInteger_7  2.2800969
3            Animation  2.1772591
6              Fantasy  1.8125349
14     ratingInteger_3  1.0098226
7              Musical  0.6886044

LINEAR REGRESSION MODEL (Important)

       AIC        R2 R2_adjusted      RMSE
1 5027.757 0.7631967   0.7617226 0.7293254
                 names    overall
12 nrOfNominations.sqr 63.1644253
13     ratingInteger_8 12.5246257
11            year.log  4.7560935
14     ratingInteger_9  3.9357266
7               Family  3.3057406
2            Biography  3.1713532
6          Documentary  3.0497574
10            Thriller  3.0424179
4            Adventure  2.6022861
1               Action  2.5970273
5            Animation  1.8491190
8              Fantasy  1.6776378
3                  War  0.9333965
15      Popularity.log  0.8682510
9              Musical  0.3862927

REGRESSION TREE (Results)

  • R-squared: 82.64%

plot of chunk unnamed-chunk-17plot of chunk unnamed-chunk-17

REGRESSION TREE (Visualization)

plot of chunk unnamed-chunk-18

RANDOM FOREST (Results)

plot of chunk unnamed-chunk-19plot of chunk unnamed-chunk-19

MODEL PERFORMANCE (With Test Set)

  • AIC Model = Better model with a lower AIC number, and residual standard error with better R-squared value
  • Models by p-value and important features = Predicted ratings greater than 9 for few movies (scale is from 2-9)
lmodel_AIC R-squared:  67.56 % 
lmodel_p R-squared:  67.5 % 
lmodel_i R-squared:  67.35 % 
dtModel R-squared:  58.47 % 
rfModel R-squared:  60.6 % 

MODEL IMPROVEMENTS?

  • To improve accuracy results, nrOfWins could have been turned into a binary variable for whether an award was earned or not. Doing a really quick generalized linear model of this (not shown since it is out of scope of this project) resulted in ~95% accuracy against the test set. This project is predicting number of awards, however, and not whether an award was received or not.
  • More improvements can be achieved in the data set if we also collect other important information that movie review committee takes into consideration for instance actors/actresses, producer,song tracks. These basic movie stats are likely not enough to explain all the variance with number of movie awards.

CONCLUSIONS

  • lmodel_AIC is the best model with an R-squared of 67.56% against the test data.
  • It shows that nrOfWins.sqr is modeled by nrOfNominations.sqr, ratingInteger_8, year.log, ratingInteger_9, Family, Thriller, Action, Adventure, Biography, Documentary, Sport, Animation, Fantasy, Mystery, Comedy, and ratingInteger_6.
  • The residuals compared to fitted values are in a fairly straight line and close to 0 except at the higher end. The residuals plot is a fairly normal distribution. The quantile quantile plot is reasonably straight, except at the ends where there is more noise.
  • All of the other models have very similar results with the decision tree being the worst at 58.47%.