Introduction:

The project details about the detailed analysis of the movie dataset that contains information from IMDB for a random sample of movies. The data represents 651 randomly sampled movies released between 1972 to 2014 in the Unites States. This data frame contains 651 observations (rows), each representing a movie, and 33 variables (columns).

Objective:

The purpose of this project is to evaluate different regression models and develop an accurate model to understand what attributes make a movie popular. Is a movie’s popularity, as measured by audience score, related to the type of movie, genre, runtime, imdb rating, imdb number of votes, critics rating, critics score, audience rating, Oscar awards obtained (actor, actress, director and picture)? Similarly, if a movie will win an Oscar based on audience rating, critics rating, audience score, critics score, and genre? Therefore, I am going to figure out to answer these types of questions. In the meantime, I will be doing Exploratory Data Analysis to clean and wrangle our dataset for performing the regression analysis, henceforth approaching to learn something peculiar about movies.

Analysis:

Through the EDA and regression modeling, I will ensure the accurate model which produces movies that are more likely to be liked by the target audience.

Getting familiar with the Dataset:

The movies data frame contains 651 observations (rows), each representing a movie, and 33 variables (columns).

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.1
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2) # visualization
library(ggthemes) # visualization
library(scales) # visualization
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(dplyr) # data manipulation
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(assertive)
## 
## Attaching package: 'assertive'
## The following objects are masked from 'package:purrr':
## 
##     is_atomic, is_character, is_double, is_empty, is_formula,
##     is_function, is_integer, is_list, is_logical, is_null, is_numeric,
##     is_vector
## The following object is masked from 'package:tibble':
## 
##     has_rownames
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(rpart) # decision tree
library(rpart.plot) # decision tree plot
library(Metrics) # gini and information
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
library(randomForest) # random forest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
load("/Users/arunasingh/Downloads/Projects/UC/Data Science/DAM/moviedataset.RData")
write.csv(movies,file="/Users/arunasingh/Downloads/Projects/UC/Data Science/DAM/moviedataset.csv")
movie <- read.csv("/Users/arunasingh/Downloads/Projects/UC/Data Science/DAM/moviedataset.csv")
str(movie)
## 'data.frame':    651 obs. of  33 variables:
##  $ X               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : chr  "Feature Film" "Feature Film" "Feature Film" "Feature Film" ...
##  $ genre           : chr  "Drama" "Drama" "Comedy" "Drama" ...
##  $ runtime         : int  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : chr  "R" "PG-13" "R" "PG" ...
##  $ studio          : chr  "Indomina Media Inc." "Warner Bros. Pictures" "Sony Pictures Classics" "Columbia Pictures" ...
##  $ thtr_rel_year   : int  2013 2001 1996 1993 2004 2009 1986 1996 2012 2012 ...
##  $ thtr_rel_month  : int  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : int  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : int  2013 2001 2001 2001 2005 2010 2003 2004 2013 2012 ...
##  $ dvd_rel_month   : int  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : int  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : chr  "Rotten" "Certified Fresh" "Certified Fresh" "Certified Fresh" ...
##  $ critics_score   : int  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : chr  "Upright" "Upright" "Upright" "Upright" ...
##  $ audience_score  : int  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : chr  "no" "no" "no" "no" ...
##  $ best_pic_win    : chr  "no" "no" "no" "no" ...
##  $ best_actor_win  : chr  "no" "no" "no" "yes" ...
##  $ best_actress_win: chr  "no" "no" "no" "no" ...
##  $ best_dir_win    : chr  "no" "no" "no" "yes" ...
##  $ top200_box      : chr  "no" "no" "no" "no" ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
summary(movie)
##        X            title            title_type           genre          
##  Min.   :  1.0   Length:651         Length:651         Length:651        
##  1st Qu.:163.5   Class :character   Class :character   Class :character  
##  Median :326.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :326.0                                                           
##  3rd Qu.:488.5                                                           
##  Max.   :651.0                                                           
##                                                                          
##     runtime      mpaa_rating           studio          thtr_rel_year 
##  Min.   : 39.0   Length:651         Length:651         Min.   :1970  
##  1st Qu.: 92.0   Class :character   Class :character   1st Qu.:1990  
##  Median :103.0   Mode  :character   Mode  :character   Median :2000  
##  Mean   :105.8                                         Mean   :1998  
##  3rd Qu.:115.8                                         3rd Qu.:2007  
##  Max.   :267.0                                         Max.   :2014  
##  NA's   :1                                                           
##  thtr_rel_month   thtr_rel_day    dvd_rel_year  dvd_rel_month   
##  Min.   : 1.00   Min.   : 1.00   Min.   :1991   Min.   : 1.000  
##  1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000  
##  Median : 7.00   Median :15.00   Median :2004   Median : 6.000  
##  Mean   : 6.74   Mean   :14.42   Mean   :2004   Mean   : 6.333  
##  3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008   3rd Qu.: 9.000  
##  Max.   :12.00   Max.   :31.00   Max.   :2015   Max.   :12.000  
##                                  NA's   :8      NA's   :8       
##   dvd_rel_day     imdb_rating    imdb_num_votes   critics_rating    
##  Min.   : 1.00   Min.   :1.900   Min.   :   180   Length:651        
##  1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546   Class :character  
##  Median :15.00   Median :6.600   Median : 15116   Mode  :character  
##  Mean   :15.01   Mean   :6.493   Mean   : 57533                     
##  3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58300                     
##  Max.   :31.00   Max.   :9.000   Max.   :893008                     
##  NA's   :8                                                          
##  critics_score    audience_rating    audience_score  best_pic_nom      
##  Min.   :  1.00   Length:651         Min.   :11.00   Length:651        
##  1st Qu.: 33.00   Class :character   1st Qu.:46.00   Class :character  
##  Median : 61.00   Mode  :character   Median :65.00   Mode  :character  
##  Mean   : 57.69                      Mean   :62.36                     
##  3rd Qu.: 83.00                      3rd Qu.:80.00                     
##  Max.   :100.00                      Max.   :97.00                     
##                                                                        
##  best_pic_win       best_actor_win     best_actress_win   best_dir_win      
##  Length:651         Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   top200_box          director            actor1             actor2         
##  Length:651         Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     actor3             actor4             actor5            imdb_url        
##  Length:651         Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     rt_url         
##  Length:651        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Oscar winning movies that were also Box office hits

top_200_boxoffice = movie %>% 
  dplyr::select(title, genre,critics_rating, audience_rating, imdb_rating, audience_score, top200_box, best_actor_win, best_actress_win,best_dir_win, best_pic_win)%>%
  filter((best_pic_win =="yes" | best_actor_win == "yes" | best_actress_win == "yes" |  best_dir_win == "yes") & top200_box == "yes")

#head(top_200_boxoffice)
ggplot(top_200_boxoffice, aes(x = audience_score , y = reorder(title, audience_score))) +
  geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Audience Score") + ylab("Movie Title") + ggtitle("Oscar winning box office hits")

Number of movies based on critics rating

df <- movie %>%
 count(critics_rating) %>%
  mutate(percent = n*100/sum(n)) %>%
  arrange(desc(n))

ggplot(df, aes(x=percent , y= critics_rating, fill = critics_rating)) +
  geom_bar(stat="identity") + xlab("Percentage of the movies") +
  ylab("Critics Rating") 

Top movies at every year based on imdb_rating

moviepoint <- distinct_all(movie %>%
  dplyr::select(title, imdb_rating, thtr_rel_year) %>%
  filter(thtr_rel_year>= 1970 & thtr_rel_year<= 2000) %>%
  slice_max(imdb_rating, n = 10) %>%
  arrange(thtr_rel_year)) 

moviepoint
##                        title imdb_rating thtr_rel_year
## 1     The Godfather, Part II         9.0          1974
## 2                 Annie Hall         8.1          1977
## 3                Sans Soleil         8.2          1983
## 4                     Aliens         8.4          1986
## 5         The Thin Blue Line         8.1          1988
## 6              Groundhog Day         8.1          1993
## 7  In the Name of the Father         8.1          1993
## 8              Trainspotting         8.2          1996
## 9          Good Will Hunting         8.3          1997
## 10                   Memento         8.5          2000
moviepoint %>%
  ggplot(mapping= aes(
  thtr_rel_year, reorder(title,thtr_rel_year) , fill = imdb_rating)) +
  geom_bar(stat="identity") + xlab("Year of the movie release") +
  ylab("Movie Name") + coord_cartesian(xlim = c(1970, 2005))

Best of rotten

movierotten <- distinct_all(movie %>%
filter(critics_rating == 'Rotten') %>%
  dplyr::select(title, imdb_rating, thtr_rel_year) %>%
    arrange(desc(imdb_rating))) 

## Total Number of Rotten Movies are 307

movierotten_10 <- movierotten%>%
  slice_max(imdb_rating, n = 10)

movierotten_10 %>%
  ggplot(mapping= aes(
  imdb_rating, reorder(title,imdb_rating) , fill = imdb_rating)) +
  geom_bar(stat="identity") + xlab("IMDB Rating") +
  ylab("Movie Name") + coord_cartesian(xlim = c(7, 8)) +
  scale_fill_continuous(type = "viridis") 

Variation of Audience Score for Top movies

# Top 200 movies by critics rating then find no of movies with critic score 100 
#and plot movie wrt audience score for CR 100 

criticmov <- distinct_all(movie %>%
               dplyr::select(title, critics_score, audience_score) %>%
               slice_max(critics_score, n = 200) %>%
               arrange(desc(critics_score)))

## to filter the data of the critic movie with value as 100
criticmov_100 <- criticmov %>%
                subset(critics_score == 100) %>%
                arrange(desc(audience_score))

## Total 19 movies are there with critic score as 100

criticmov_100 %>%
  ggplot(mapping= aes(reorder(title,-audience_score) ,
  audience_score, fill = factor(audience_score))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
   geom_bar(stat="identity") + xlab("Movie Name") + ylab("Audience Score") 

Top 20 Studio

top_studio = movie %>%
  dplyr::select(studio, audience_score) %>%
  group_by(studio) %>%
  summarise(avg_audience_score = sum(audience_score)/length(audience_score), .groups = 'drop') %>%
  arrange(desc(avg_audience_score)) %>% top_n(n=20)
## Selecting by avg_audience_score
ggplot(top_studio, aes(x = avg_audience_score , y = reorder(studio,-avg_audience_score))) +
  geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Average Audience Score") + ylab("Studio") + ggtitle("Favourite Studios")

Top rated movie of every genre

top_movie_genre = movie %>%
 group_by(genre) %>%
 slice_max(audience_score, n = 1, with_ties = FALSE) %>%
 dplyr::select(title, title_type, imdb_rating, audience_rating, audience_score, genre, title_type)

ggplot(top_movie_genre, aes(x = genre , y = title, size = audience_score, color = title_type )) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  geom_point() + xlab("Genre") + ylab("Movie") + ggtitle("Top movie of each Genre by audience score")

Bottom 5 Studio

bottom_studio = movie %>%
  dplyr::select(studio, audience_score) %>%
  group_by(studio) %>%
  summarise(avg_audience_score = sum(audience_score)/length(audience_score), .groups = 'drop') %>%
  arrange(desc(avg_audience_score)) %>% top_n(n=5)
## Selecting by avg_audience_score
ggplot(bottom_studio, aes(x = avg_audience_score , y = reorder(studio,avg_audience_score))) +
  geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Average Audience Score") + ylab("Studio") + ggtitle("Least Favorite Studios")

Categories by average audience score

top_category = movie %>%
  dplyr::select(title, title_type, audience_score) %>%
  group_by(title_type) %>%
  summarise(avg_audience_score = sum(audience_score)/length(audience_score), .groups = 'drop') %>%
  arrange(desc(avg_audience_score)) 

ggplot(top_category, aes(x = avg_audience_score , y = reorder(title_type,-avg_audience_score))) +
  geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Average Audience Score") + ylab("Category") + ggtitle("Favourite Category")

Top rated movie of each category

top_movie_category = movie %>%
 group_by(title_type) %>%
 slice_max(audience_score, n = 1, with_ties = FALSE) %>%
 dplyr::select(title, title_type, imdb_rating, audience_rating, audience_score, genre, title_type)

ggplot(top_movie_category, aes(x = title_type , y = title, size = audience_score, color = genre )) +
  geom_point() + xlab("Genre") + ylab("Movie") + ggtitle("Top movie of each Category by audience score")

Genre by average audience score

top_genre = movie %>%
  dplyr::select(genre, audience_score) %>%
  group_by(genre) %>%
  summarise(avg_audience_score = sum(audience_score)/length(audience_score), .groups = 'drop') %>%
  arrange(desc(avg_audience_score)) %>% 
  top_n(n=20)
## Selecting by avg_audience_score
ggplot(top_genre, aes(x = avg_audience_score , y = reorder(genre,-avg_audience_score))) +
  geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Average Audience Score") + ylab("Genre") + ggtitle("Favourite Genre")

Movies with high critics rating but low audience score lies in which MPAA Rating

critic_aud_mov <-distinct_all(movie %>%
                                dplyr::select(title, critics_score, audience_score, imdb_rating, mpaa_rating)%>%
                                slice_max(critics_score, n = 200)%>%arrange(desc(audience_score)))

criticmov_100 <- critic_aud_mov%>%
  filter(critics_score==100)%>%slice_min(audience_score, n = 10)

## Found that the total 19 movies are there with critic score as 100

criticmov_100%>%ggplot(mapping=aes(x = audience_score, y = title, color=factor(mpaa_rating), shape =factor(mpaa_rating),size = 100)) +
  geom_point(stat = "identity")+ xlab("Audience Score")+ylab("Movie Name")

##Split Train and Test Data

set.seed(123)
split <- sample(sample(nrow(movie), .80*nrow(movie)))
train <- movie[split, ]
test <- movie[-split, ]
dim(train)
## [1] 520  33
dim(test)
## [1] 131  33

Statistics of imdb_rating, imdb_num_votes, critics_score, audience_score

library(knitr)
minv <- train %>% dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(min) %>% sapply(round,2)
maxv <- train %>% dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(max) %>% sapply(round,2)
meanv <- train %>% dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(mean) %>% sapply(round,2)
medianv <- train %>% dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(median) %>% sapply(round,2)
df <- rbind(minv, maxv, meanv, medianv)
rownames(df) <- c("min", "max", "mean", "median")
kable(df)
imdb_rating imdb_num_votes critics_score audience_score
min 2.30 180.00 2.00 13.00
max 9.00 893008.00 100.00 97.00
mean 6.48 56821.32 57.88 61.76
median 6.60 15470.00 63.00 64.00
p1 <- ggplot(data = train, aes(x = imdb_rating)) + geom_histogram(colour = "black", fill = "skyblue", binwidth = .3)
p2 <- ggplot(data = train, aes(x = imdb_num_votes)) + geom_histogram(colour = "black", fill = "salmon", binwidth = 40000, alpha = 0.5)
p3 <- ggplot(data = train, aes(x = critics_score)) + geom_histogram(colour = "black", fill = "cyan", binwidth = 5, alpha = 0.5)
p4 <- ggplot(data = train, aes(x = audience_score)) + geom_histogram(colour = "black", fill = "yellow", binwidth = 5, alpha = 0.7)
grid.arrange(p1, p2, p3, p4,  nrow = 2, ncol = 2)

quantile(train$imdb_rating, c(0, 0.25, 0.5, 0.75, 0.9, 1))
##   0%  25%  50%  75%  90% 100% 
##  2.3  5.9  6.6  7.3  7.7  9.0

Split Train and Test Data

library(relaimpo)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
## 
##     aml
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
## Loading required package: mitools
## This is the global version of package relaimpo.
## If you are a non-US user, a version with the interesting additional metric pmvd is available
## from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
movie_analysis <- movie %>%
  summarise(director = as.numeric(as.factor(as.character(director))),
         genre = as.numeric(as.factor(as.character(genre))),
         studio = as.numeric(as.factor(as.character(studio))),
         top200_box = as.numeric(as.factor(top200_box)),
         best_pic_win = as.numeric(as.factor(best_pic_win)),
         imdb_rating = imdb_rating,
         audience_score = audience_score,
         critics_score = critics_score,
         .groups = 'drop')

set.seed(123)
split <- sample(sample(nrow(movie_analysis), .80*nrow(movie_analysis)))
train <- movie_analysis[split, ]
test <- movie_analysis[-split, ]
dim(train)
## [1] 520   8
dim(test)
## [1] 131   8
train <- train[complete.cases(train),]
test <- test[complete.cases(test),]

##Stepwise Feature Selection

#Stepwise
fullmod <- lm(imdb_rating ~ 1, data = train)
nullmod <- lm(imdb_rating ~ ., data=train)
reg1C <- step(nullmod, scope = list(lower = fullmod, upper = nullmod),
              direction="both")
## Start:  AIC=-789.61
## imdb_rating ~ director + genre + studio + top200_box + best_pic_win + 
##     audience_score + critics_score
## 
##                  Df Sum of Sq    RSS     AIC
## - director        1     0.001 105.09 -791.60
## - top200_box      1     0.033 105.12 -791.45
## - best_pic_win    1     0.290 105.38 -790.20
## <none>                        105.09 -789.61
## - genre           1     0.898 105.98 -787.27
## - studio          1     1.153 106.24 -786.05
## - critics_score   1    19.180 124.27 -706.11
## - audience_score  1   130.063 235.15 -380.84
## 
## Step:  AIC=-791.6
## imdb_rating ~ genre + studio + top200_box + best_pic_win + audience_score + 
##     critics_score
## 
##                  Df Sum of Sq    RSS     AIC
## - top200_box      1     0.033 105.12 -793.44
## - best_pic_win    1     0.289 105.38 -792.20
## <none>                        105.09 -791.60
## + director        1     0.001 105.09 -789.61
## - genre           1     0.897 105.98 -789.27
## - studio          1     1.152 106.24 -788.04
## - critics_score   1    19.529 124.62 -706.68
## - audience_score  1   132.255 237.34 -378.10
## 
## Step:  AIC=-793.44
## imdb_rating ~ genre + studio + best_pic_win + audience_score + 
##     critics_score
## 
##                  Df Sum of Sq    RSS     AIC
## - best_pic_win    1     0.310 105.43 -793.94
## <none>                        105.12 -793.44
## + top200_box      1     0.033 105.09 -791.60
## + director        1     0.002 105.12 -791.45
## - genre           1     0.878 106.00 -791.20
## - studio          1     1.148 106.27 -789.90
## - critics_score   1    19.724 124.84 -707.75
## - audience_score  1   132.407 237.53 -379.71
## 
## Step:  AIC=-793.94
## imdb_rating ~ genre + studio + audience_score + critics_score
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        105.43 -793.94
## + best_pic_win    1     0.310 105.12 -793.44
## + top200_box      1     0.055 105.38 -792.20
## + director        1     0.000 105.43 -791.94
## - genre           1     0.911 106.34 -791.55
## - studio          1     1.148 106.58 -790.42
## - critics_score   1    20.018 125.45 -707.28
## - audience_score  1   133.073 238.50 -379.61
reg1C
## 
## Call:
## lm(formula = imdb_rating ~ genre + studio + audience_score + 
##     critics_score, data = train)
## 
## Coefficients:
##    (Intercept)           genre          studio  audience_score   critics_score  
##      3.5006839       0.0187028       0.0007946       0.0356436       0.0100028
str(summary(reg1C))
## List of 11
##  $ call         : language lm(formula = imdb_rating ~ genre + studio + audience_score + critics_score,      data = train)
##  $ terms        :Classes 'terms', 'formula'  language imdb_rating ~ genre + studio + audience_score + critics_score
##   .. ..- attr(*, "variables")= language list(imdb_rating, genre, studio, audience_score, critics_score)
##   .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:5] "imdb_rating" "genre" "studio" "audience_score" ...
##   .. .. .. ..$ : chr [1:4] "genre" "studio" "audience_score" "critics_score"
##   .. ..- attr(*, "term.labels")= chr [1:4] "genre" "studio" "audience_score" "critics_score"
##   .. ..- attr(*, "order")= int [1:4] 1 1 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(imdb_rating, genre, studio, audience_score, critics_score)
##   .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
##   .. .. ..- attr(*, "names")= chr [1:5] "imdb_rating" "genre" "studio" "audience_score" ...
##  $ residuals    : Named num [1:510] -0.3429 0.0341 -0.892 -0.0623 -0.0544 ...
##   ..- attr(*, "names")= chr [1:510] "354" "577" "152" "398" ...
##  $ coefficients : num [1:5, 1:4] 3.500684 0.018703 0.000795 0.035644 0.010003 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "(Intercept)" "genre" "studio" "audience_score" ...
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:5] FALSE FALSE FALSE FALSE FALSE
##   ..- attr(*, "names")= chr [1:5] "(Intercept)" "genre" "studio" "audience_score" ...
##  $ sigma        : num 0.457
##  $ df           : int [1:3] 5 505 5
##  $ r.squared    : num 0.812
##  $ adj.r.squared: num 0.811
##  $ fstatistic   : Named num [1:3] 546 4 505
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:5, 1:5] 3.75e-02 -1.90e-03 -6.19e-05 -3.20e-04 4.46e-05 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "(Intercept)" "genre" "studio" "audience_score" ...
##   .. ..$ : chr [1:5] "(Intercept)" "genre" "studio" "audience_score" ...
##  - attr(*, "class")= chr "summary.lm"
# calculate relative importance
relImportance1C <- calc.relimp(nullmod, type = "lmg", rela = F)  

#print string and execute command 
cat('Relative Importances: \n')
## Relative Importances:
# Sort
sort(round(relImportance1C$lmg, 3), decreasing=TRUE)
## audience_score  critics_score          genre   best_pic_win         studio 
##          0.495          0.296          0.008          0.006          0.004 
##     top200_box       director 
##          0.004          0.001

Linear Regression

lm_model <- lm(formula = imdb_rating ~ genre + studio + audience_score + 
    critics_score, data = train)

summary(lm_model)
## 
## Call:
## lm(formula = imdb_rating ~ genre + studio + audience_score + 
##     critics_score, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.27385 -0.19929  0.01029  0.27695  1.01185 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.5006839  0.0884779  39.566   <2e-16 ***
## genre          0.0187028  0.0089555   2.088   0.0373 *  
## studio         0.0007946  0.0003389   2.345   0.0194 *  
## audience_score 0.0356436  0.0014118  25.247   <2e-16 ***
## critics_score  0.0100028  0.0010215   9.792   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4569 on 505 degrees of freedom
## Multiple R-squared:  0.8122, Adjusted R-squared:  0.8107 
## F-statistic: 545.9 on 4 and 505 DF,  p-value: < 2.2e-16

Decision Tree

# Train the model (to predict 'default')
dm_model <- rpart(imdb_rating ~ genre + audience_score + studio +
    critics_score, data = train,
                      method = "class")

# Look at the model output                      
rpart.plot(x = dm_model, yesno = 2, type = 0,  box.palette = "green", extra = 0)

Find Inpurity of the Decision Model

# Train a gini-based model, splitting the tree based on gini index.
dm_model_gini <- rpart(formula = imdb_rating ~ ., 
                       data = train, 
                       method = "class",
                       parms = list(split = "gini"))
# Train an information-based model, splitting the tree based on information index.
dm_model_info <- rpart(formula = imdb_rating ~ ., 
                       data = train, 
                       method = "class",
                       parms = list(split = "information"))
# Generate predictions on the validation set using the gini model
pred1 <- predict(object = dm_model_gini, 
             newdata = test,
             type = "class")

# Generate predictions on the validation set using the information model
pred2 <- predict(object = dm_model_info, 
             newdata = test,
             type = "class")
# Compare classification error
ce(actual = test$imdb_rating, 
   predicted = pred1)
## [1] 0.870229
ce(actual = test$imdb_rating, 
   predicted = pred2)
## [1] 0.8625954

Mutation of Dataset for oscar prediction with different tree based models

movie_rf_analysis <- movie %>%
  summarise(director = as.numeric(as.factor(as.character(director))),
            genre = as.factor(as.character(genre)),
            studio = as.numeric(as.factor(as.character(studio))),
            top200_box = as.factor(top200_box),
            best_pic_win = as.factor(best_pic_win),
            imdb_rating = imdb_rating,
            audience_score = audience_score,
            critics_score = critics_score,
            imdb_num_votes = imdb_num_votes,
            oscar = as.factor(ifelse(best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes" | best_pic_nom == "yes", "yes", "no")))

str(movie_rf_analysis)
## 'data.frame':    651 obs. of  10 variables:
##  $ director      : num  320 407 90 309 463 95 508 280 344 267 ...
##  $ genre         : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ studio        : num  91 202 167 34 13 163 147 118 88 84 ...
##  $ top200_box    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ imdb_rating   : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ audience_score: int  73 81 91 76 27 86 76 47 89 66 ...
##  $ critics_score : int  45 96 91 80 33 91 57 17 90 83 ...
##  $ imdb_num_votes: int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ oscar         : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
set.seed(123)
split <- sample(sample(nrow(movie_rf_analysis), .75*nrow(movie_rf_analysis)))
train <- movie_rf_analysis[split, ]
test <- movie_rf_analysis[-split, ]
dim(train)
## [1] 488  10
dim(test)
## [1] 163  10
train <- train[complete.cases(train),]
test <- test[complete.cases(test),]

Logistic Regression to predict oscar labels

# Make training and test set
glm_model <- glm(oscar ~ ., data = train, family = "binomial")
summary(glm_model)
## 
## Call:
## glm(formula = oscar ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7772  -0.7963  -0.5028   0.5868   2.8132  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -5.726e+00  1.462e+00  -3.917 8.96e-05 ***
## director                       -1.650e-04  7.596e-04  -0.217 0.828063    
## genreAnimation                 -1.539e+01  1.581e+03  -0.010 0.992234    
## genreArt House & International -1.558e+01  1.347e+03  -0.012 0.990772    
## genreComedy                     2.851e-01  5.220e-01   0.546 0.584994    
## genreDocumentary               -1.879e+00  1.134e+00  -1.657 0.097526 .  
## genreDrama                      9.664e-01  4.254e-01   2.272 0.023114 *  
## genreHorror                    -1.556e+01  9.882e+02  -0.016 0.987437    
## genreMusical & Performing Arts -1.606e-01  1.187e+00  -0.135 0.892380    
## genreMystery & Suspense         1.039e+00  5.100e-01   2.038 0.041590 *  
## genreOther                      9.524e-01  7.378e-01   1.291 0.196768    
## genreScience Fiction & Fantasy -6.382e-01  1.236e+00  -0.516 0.605596    
## studio                          7.124e-03  2.152e-03   3.311 0.000929 ***
## top200_boxyes                   4.599e-01  6.979e-01   0.659 0.509943    
## best_pic_winyes                 1.755e+01  1.702e+03   0.010 0.991770    
## imdb_rating                     7.329e-01  3.277e-01   2.236 0.025325 *  
## audience_score                 -2.885e-02  1.356e-02  -2.127 0.033456 *  
## critics_score                  -1.000e-04  6.503e-03  -0.015 0.987730    
## imdb_num_votes                  2.972e-06  1.339e-06   2.220 0.026396 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 549.34  on 477  degrees of freedom
## Residual deviance: 456.43  on 459  degrees of freedom
## AIC: 494.43
## 
## Number of Fisher Scoring iterations: 16
pred_oscar <- predict(glm_model,
                    newdata = test,
                    type = "response")

pred_oscar = ifelse(pred_oscar > 0.5, "yes", "no")

table(pred_oscar, test$oscar)
##           
## pred_oscar  no yes
##        no  108  39
##        yes   6  10
mean(pred_oscar == test$oscar)
## [1] 0.7239264

Remove overfitting issues out of movie analysis

# remove overfitting by choosing significant independent variables
glm.overfit = glm(formula = oscar ~ genre + director + studio + critics_score + imdb_rating, family = "binomial", 
              data = train)

glm.overfit.oscar = predict(glm.overfit, newdata = test, type = "response")
glm.overfit.oscar = ifelse(glm.overfit.oscar > 0.5, "yes", "no")
table(glm.overfit.oscar, test$oscar)
##                  
## glm.overfit.oscar  no yes
##               no  111  40
##               yes   3   9
mean(glm.overfit.oscar == test$oscar)
## [1] 0.7361963

Comparison of Random Forest with Logistics Regression for the oscar prediction

# Create a Random Forest model with default parameters
rm_model <- randomForest(oscar ~ genre + director + studio + critics_score + imdb_rating, data = train, importance = TRUE)
rm_model
## 
## Call:
##  randomForest(formula = oscar ~ genre + director + studio + critics_score +      imdb_rating, data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 28.45%
## Confusion matrix:
##      no yes class.error
## no  315  38   0.1076487
## yes  98  27   0.7840000
summary(model1)
## 
## Call:
## lm(formula = RBI ~ H + X2B, data = Batting)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.017  -1.401   0.255   0.924  94.396 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.254759   0.032719  -7.786 6.97e-15 ***
## H            0.331172   0.001591 208.095  < 2e-16 ***
## X2B          0.799292   0.008624  92.681  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.745 on 106670 degrees of freedom
##   (756 observations deleted due to missingness)
## Multiple R-squared:  0.8897, Adjusted R-squared:  0.8897 
## F-statistic: 4.304e+05 on 2 and 106670 DF,  p-value: < 2.2e-16
# Generate predicted classes using the model object
oscar_prediction <- predict(object = rm_model,  
                            newdata = test,   
                            type = "class")  

# Calculate the confusion matrix for the test set
confusionMatrix(data = oscar_prediction,       
                reference = test$oscar)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  106  31
##        yes   8  18
##                                          
##                Accuracy : 0.7607         
##                  95% CI : (0.6878, 0.824)
##     No Information Rate : 0.6994         
##     P-Value [Acc > NIR] : 0.050074       
##                                          
##                   Kappa : 0.3431         
##                                          
##  Mcnemar's Test P-Value : 0.000427       
##                                          
##             Sensitivity : 0.9298         
##             Specificity : 0.3673         
##          Pos Pred Value : 0.7737         
##          Neg Pred Value : 0.6923         
##              Prevalence : 0.6994         
##          Detection Rate : 0.6503         
##    Detection Prevalence : 0.8405         
##       Balanced Accuracy : 0.6486         
##                                          
##        'Positive' Class : no             
## 

APPENDICES

Appendix A:

Getting Acquainted with the data through some EDAs:

Determined genre categorization by average audience score - The genre category has been determined by finding the movies based on Audience score for every genre from the movies Dataset.

Determined top Movies Every Year - This is to determine the top 10 movies based on the ratings given by IMDB in the years 1970 to 2000 in the movies dataset

Determined Top 20 Favorite Studios - The top 20 studios has been determined by calculating the average Audience score for each studio category from the movies Dataset.

Determined top 200 Oscar winning box office hits - The top 200 Oscar winning Box Office hits has been determined by taking into account films that has won an award in at least one of these categories – Best Actor, Best Actress, Best Director, Best Picture, and has also been a part of the Top 200 Box office hits.

Determined the best of Rotten - This is to determine the top 10 movies in the Rotten category of the Critics ratings based on the ratings given by IMDB in the movies dataset.

Appendix B:

Spliting Train and Test for regression analysis:

• Using knitr, derived minimum, maximum, mean and median of imdb rating, imdb num votes, critics score and audience score respectively. • Calculated the quantile of imdb rating to get the number of values present in the range of percentages.

Implementation of Linear Regression:

Determined correlation among imdb rating, imdb num votes, critics score, and audience score and found there is a positive correlation between the 4 parameters and positively skewed.

Feature selection using stepwise:

• For feature selection, we used the forward, backward and stepwise methods and finalized on stepwise for extracting AIC and Relative Importance(RI). Parameters considered to determine the best predicter for IMDB rating were director, genre, studio, top200_box, best_pic_win, audience_score, critics_score. The stepwise selection results with the audience score as the most significant with Adjusted R Squared 83.4% and AIC=-775.84. • Detailed result of the analysis was done by reading the linear regression and its relative importance thereby concluding 81.10 as the Adjusted R-squared. Parameters chosen to predict IMDB Rating were genre, studio, audience_score, critics_score as they came out significant using the stepwise feature selection.

Decision Tree Model:

Taking audience_score and critics_score as the most significant parameters (highest Relative Importance in our previous analysis), we perform the Decision Tre based model to predict IMDB Rating.

Gini and Information model using metrics:

• Using metrics, created two models to determine the difference in classification error of the gini and information based model which basically give a measure of impurity in the results • Found detailed result of the analysis; interpreting the low classification error for information based model as compared to gini based model.

Appendix C:

Implementation of Logistic Regression:

Consider a scenario where we need to classify whether a movie wins an oscar or not. If we use linear regression for this problem, there is a need for setting up a threshold based on which classification can be done. Say if the actual class is malignant, predicted continuous value 0.4 and the threshold value is 0.5, the data point will be classified as not malignant which can lead to serious consequence in real time. From this example, it can be inferred that linear regression is not suitable for classification problem. Linear regression is unbounded, and this brings logistic regression into picture. Their value strictly ranges from 0 to 1.

• The logistic regression has an analysis of 72.39264% • The logistic regression analysis after removing overfitting has an accuracy of 73.61963%

Implementation of Random Forest:

Using randomForest to dive deeper in the analysis, interpreted the performance of predicting movies with Oscar winning title after considering the parameters of best actor, best actress, best director, best pic nomination, imdb rating, audience score, studio, genre, and critics score.

• The accuracy is 77.3% which clearly indicates that the performance is good enough to get the right predicted value. • The confusion matrices which represent that 18 movies that win Oscar and 102 movies that did not win Oscar predicted right set of labels.