This project presents a detailed analysis of a movie dataset containing information from IMDB for a random sample of movies: 651 movies released in the United States between 1970 and 2014. The data frame contains 651 observations (rows), one per movie, and 33 variables (columns).
The purpose of this project is to evaluate different regression models and develop an accurate model of what attributes make a movie popular. Is a movie's popularity, as measured by audience score, related to the type of movie, genre, runtime, IMDB rating, IMDB number of votes, critics rating, critics score, audience rating, or Oscar awards obtained (actor, actress, director, and picture)? Similarly, can we predict whether a movie will win an Oscar from its audience rating, critics rating, audience score, critics score, and genre? These are the questions I set out to answer. Along the way, I perform exploratory data analysis (EDA) to clean and wrangle the dataset for the regression analysis, and to learn something interesting about movies.
Through the EDA and regression modeling, the goal is to arrive at a model that accurately identifies the movies most likely to be liked by the target audience.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.1
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2) # visualization
library(ggthemes) # visualization
library(scales) # visualization
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(dplyr) # data manipulation
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(assertive)
##
## Attaching package: 'assertive'
## The following objects are masked from 'package:purrr':
##
## is_atomic, is_character, is_double, is_empty, is_formula,
## is_function, is_integer, is_list, is_logical, is_null, is_numeric,
## is_vector
## The following object is masked from 'package:tibble':
##
## has_rownames
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(rpart) # decision tree
library(rpart.plot) # decision tree plot
library(Metrics) # gini and information
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(randomForest) # random forest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
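# Load the movies data frame from the RData file, export it to CSV, and read it
# back in; the CSV round trip is why the X row-index column appears in str() below.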
load("/Users/arunasingh/Downloads/Projects/UC/Data Science/DAM/moviedataset.RData")
write.csv(movies,file="/Users/arunasingh/Downloads/Projects/UC/Data Science/DAM/moviedataset.csv")
movie <- read.csv("/Users/arunasingh/Downloads/Projects/UC/Data Science/DAM/moviedataset.csv")
str(movie)
## 'data.frame': 651 obs. of 33 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : chr "Feature Film" "Feature Film" "Feature Film" "Feature Film" ...
## $ genre : chr "Drama" "Drama" "Comedy" "Drama" ...
## $ runtime : int 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : chr "R" "PG-13" "R" "PG" ...
## $ studio : chr "Indomina Media Inc." "Warner Bros. Pictures" "Sony Pictures Classics" "Columbia Pictures" ...
## $ thtr_rel_year : int 2013 2001 1996 1993 2004 2009 1986 1996 2012 2012 ...
## $ thtr_rel_month : int 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : int 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : int 2013 2001 2001 2001 2005 2010 2003 2004 2013 2012 ...
## $ dvd_rel_month : int 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : int 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : chr "Rotten" "Certified Fresh" "Certified Fresh" "Certified Fresh" ...
## $ critics_score : int 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : chr "Upright" "Upright" "Upright" "Upright" ...
## $ audience_score : int 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : chr "no" "no" "no" "no" ...
## $ best_pic_win : chr "no" "no" "no" "no" ...
## $ best_actor_win : chr "no" "no" "no" "yes" ...
## $ best_actress_win: chr "no" "no" "no" "no" ...
## $ best_dir_win : chr "no" "no" "no" "yes" ...
## $ top200_box : chr "no" "no" "no" "no" ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
summary(movie)
## X title title_type genre
## Min. : 1.0 Length:651 Length:651 Length:651
## 1st Qu.:163.5 Class :character Class :character Class :character
## Median :326.0 Mode :character Mode :character Mode :character
## Mean :326.0
## 3rd Qu.:488.5
## Max. :651.0
##
## runtime mpaa_rating studio thtr_rel_year
## Min. : 39.0 Length:651 Length:651 Min. :1970
## 1st Qu.: 92.0 Class :character Class :character 1st Qu.:1990
## Median :103.0 Mode :character Mode :character Median :2000
## Mean :105.8 Mean :1998
## 3rd Qu.:115.8 3rd Qu.:2007
## Max. :267.0 Max. :2014
## NA's :1
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## Min. : 1.00 Min. : 1.00 Min. :1991 Min. : 1.000
## 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001 1st Qu.: 3.000
## Median : 7.00 Median :15.00 Median :2004 Median : 6.000
## Mean : 6.74 Mean :14.42 Mean :2004 Mean : 6.333
## 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008 3rd Qu.: 9.000
## Max. :12.00 Max. :31.00 Max. :2015 Max. :12.000
## NA's :8 NA's :8
## dvd_rel_day imdb_rating imdb_num_votes critics_rating
## Min. : 1.00 Min. :1.900 Min. : 180 Length:651
## 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546 Class :character
## Median :15.00 Median :6.600 Median : 15116 Mode :character
## Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58300
## Max. :31.00 Max. :9.000 Max. :893008
## NA's :8
## critics_score audience_rating audience_score best_pic_nom
## Min. : 1.00 Length:651 Min. :11.00 Length:651
## 1st Qu.: 33.00 Class :character 1st Qu.:46.00 Class :character
## Median : 61.00 Mode :character Median :65.00 Mode :character
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_win best_actor_win best_actress_win best_dir_win
## Length:651 Length:651 Length:651 Length:651
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## top200_box director actor1 actor2
## Length:651 Length:651 Length:651 Length:651
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5 imdb_url
## Length:651 Length:651 Length:651 Length:651
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## rt_url
## Length:651
## Class :character
## Mode :character
##
##
##
##
top_200_boxoffice = movie %>%
dplyr::select(title, genre,critics_rating, audience_rating, imdb_rating, audience_score, top200_box, best_actor_win, best_actress_win,best_dir_win, best_pic_win)%>%
filter((best_pic_win =="yes" | best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes") & top200_box == "yes")
#head(top_200_boxoffice)
ggplot(top_200_boxoffice, aes(x = audience_score , y = reorder(title, audience_score))) +
geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Audience Score") + ylab("Movie Title") + ggtitle("Oscar winning box office hits")
df <- movie %>%
count(critics_rating) %>%
mutate(percent = n*100/sum(n)) %>%
arrange(desc(n))
ggplot(df, aes(x=percent , y= critics_rating, fill = critics_rating)) +
geom_bar(stat="identity") + xlab("Percentage of the movies") +
ylab("Critics Rating")
moviepoint <- distinct_all(movie %>%
dplyr::select(title, imdb_rating, thtr_rel_year) %>%
filter(thtr_rel_year>= 1970 & thtr_rel_year<= 2000) %>%
slice_max(imdb_rating, n = 10) %>%
arrange(thtr_rel_year))
moviepoint
## title imdb_rating thtr_rel_year
## 1 The Godfather, Part II 9.0 1974
## 2 Annie Hall 8.1 1977
## 3 Sans Soleil 8.2 1983
## 4 Aliens 8.4 1986
## 5 The Thin Blue Line 8.1 1988
## 6 Groundhog Day 8.1 1993
## 7 In the Name of the Father 8.1 1993
## 8 Trainspotting 8.2 1996
## 9 Good Will Hunting 8.3 1997
## 10 Memento 8.5 2000
moviepoint %>%
ggplot(mapping= aes(
thtr_rel_year, reorder(title,thtr_rel_year) , fill = imdb_rating)) +
geom_bar(stat="identity") + xlab("Year of the movie release") +
ylab("Movie Name") + coord_cartesian(xlim = c(1970, 2005))
movierotten <- distinct_all(movie %>%
filter(critics_rating == 'Rotten') %>%
dplyr::select(title, imdb_rating, thtr_rel_year) %>%
arrange(desc(imdb_rating)))
## In total there are 307 movies with a 'Rotten' critics rating
movierotten_10 <- movierotten%>%
slice_max(imdb_rating, n = 10)
movierotten_10 %>%
ggplot(mapping= aes(
imdb_rating, reorder(title,imdb_rating) , fill = imdb_rating)) +
geom_bar(stat="identity") + xlab("IMDB Rating") +
ylab("Movie Name") + coord_cartesian(xlim = c(7, 8)) +
scale_fill_continuous(type = "viridis")
# Take the top 200 movies by critics score, find how many have a critics score of 100,
# and plot those movies against audience score
criticmov <- distinct_all(movie %>%
dplyr::select(title, critics_score, audience_score) %>%
slice_max(critics_score, n = 200) %>%
arrange(desc(critics_score)))
## Filter to the movies with a critics score of 100
criticmov_100 <- criticmov %>%
subset(critics_score == 100) %>%
arrange(desc(audience_score))
## There are 19 movies with a critics score of 100
criticmov_100 %>%
ggplot(mapping= aes(reorder(title,-audience_score) ,
audience_score, fill = factor(audience_score))) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_bar(stat="identity") + xlab("Movie Name") + ylab("Audience Score")
top_studio = movie %>%
dplyr::select(studio, audience_score) %>%
group_by(studio) %>%
summarise(avg_audience_score = sum(audience_score)/length(audience_score), .groups = 'drop') %>%
arrange(desc(avg_audience_score)) %>% top_n(n=20)
## Selecting by avg_audience_score
ggplot(top_studio, aes(x = avg_audience_score , y = reorder(studio,-avg_audience_score))) +
geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Average Audience Score") + ylab("Studio") + ggtitle("Favourite Studios")
top_movie_genre = movie %>%
group_by(genre) %>%
slice_max(audience_score, n = 1, with_ties = FALSE) %>%
  dplyr::select(title, title_type, imdb_rating, audience_rating, audience_score, genre)
ggplot(top_movie_genre, aes(x = genre , y = title, size = audience_score, color = title_type )) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
geom_point() + xlab("Genre") + ylab("Movie") + ggtitle("Top movie of each Genre by audience score")
bottom_studio = movie %>%
dplyr::select(studio, audience_score) %>%
group_by(studio) %>%
summarise(avg_audience_score = sum(audience_score)/length(audience_score), .groups = 'drop') %>%
  arrange(avg_audience_score) %>% top_n(n = -5) # negative n keeps the 5 lowest-scoring studios
## Selecting by avg_audience_score
ggplot(bottom_studio, aes(x = avg_audience_score , y = reorder(studio,avg_audience_score))) +
geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Average Audience Score") + ylab("Studio") + ggtitle("Least Favorite Studios")
top_category = movie %>%
dplyr::select(title, title_type, audience_score) %>%
group_by(title_type) %>%
summarise(avg_audience_score = sum(audience_score)/length(audience_score), .groups = 'drop') %>%
arrange(desc(avg_audience_score))
ggplot(top_category, aes(x = avg_audience_score , y = reorder(title_type,-avg_audience_score))) +
geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Average Audience Score") + ylab("Category") + ggtitle("Favourite Category")
top_movie_category = movie %>%
group_by(title_type) %>%
slice_max(audience_score, n = 1, with_ties = FALSE) %>%
  dplyr::select(title, title_type, imdb_rating, audience_rating, audience_score, genre)
ggplot(top_movie_category, aes(x = title_type , y = title, size = audience_score, color = genre )) +
geom_point() + xlab("Category") + ylab("Movie") + ggtitle("Top movie of each Category by audience score")
top_genre = movie %>%
dplyr::select(genre, audience_score) %>%
group_by(genre) %>%
summarise(avg_audience_score = sum(audience_score)/length(audience_score), .groups = 'drop') %>%
arrange(desc(avg_audience_score)) %>%
top_n(n=20)
## Selecting by avg_audience_score
ggplot(top_genre, aes(x = avg_audience_score , y = reorder(genre,-avg_audience_score))) +
geom_bar(stat="identity",color='skyblue',fill='steelblue') + xlab("Average Audience Score") + ylab("Genre") + ggtitle("Favourite Genre")
critic_aud_mov <-distinct_all(movie %>%
dplyr::select(title, critics_score, audience_score, imdb_rating, mpaa_rating)%>%
slice_max(critics_score, n = 200)%>%arrange(desc(audience_score)))
criticmov_100 <- critic_aud_mov%>%
filter(critics_score==100)%>%slice_min(audience_score, n = 10)
## In total there are 19 movies with a critics score of 100
criticmov_100%>%ggplot(mapping=aes(x = audience_score, y = title, color=factor(mpaa_rating), shape =factor(mpaa_rating),size = 100)) +
geom_point(stat = "identity")+ xlab("Audience Score")+ylab("Movie Name")
## Split Train and Test Data
set.seed(123)
split <- sample(sample(nrow(movie), .80*nrow(movie)))
train <- movie[split, ]
test <- movie[-split, ]
dim(train)
## [1] 520 33
dim(test)
## [1] 131 33
library(knitr)
minv <- train %>% dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(min) %>% sapply(round,2)
maxv <- train %>% dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(max) %>% sapply(round,2)
meanv <- train %>% dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(mean) %>% sapply(round,2)
medianv <- train %>% dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(median) %>% sapply(round,2)
df <- rbind(minv, maxv, meanv, medianv)
rownames(df) <- c("min", "max", "mean", "median")
kable(df)
| | imdb_rating | imdb_num_votes | critics_score | audience_score |
|---|---|---|---|---|
| min | 2.30 | 180.00 | 2.00 | 13.00 |
| max | 9.00 | 893008.00 | 100.00 | 97.00 |
| mean | 6.48 | 56821.32 | 57.88 | 61.76 |
| median | 6.60 | 15470.00 | 63.00 | 64.00 |
p1 <- ggplot(data = train, aes(x = imdb_rating)) + geom_histogram(colour = "black", fill = "skyblue", binwidth = .3)
p2 <- ggplot(data = train, aes(x = imdb_num_votes)) + geom_histogram(colour = "black", fill = "salmon", binwidth = 40000, alpha = 0.5)
p3 <- ggplot(data = train, aes(x = critics_score)) + geom_histogram(colour = "black", fill = "cyan", binwidth = 5, alpha = 0.5)
p4 <- ggplot(data = train, aes(x = audience_score)) + geom_histogram(colour = "black", fill = "yellow", binwidth = 5, alpha = 0.7)
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
quantile(train$imdb_rating, c(0, 0.25, 0.5, 0.75, 0.9, 1))
## 0% 25% 50% 75% 90% 100%
## 2.3 5.9 6.6 7.3 7.7 9.0
library(relaimpo)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
##
## aml
## The following object is masked from 'package:caret':
##
## cluster
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
## Loading required package: mitools
## This is the global version of package relaimpo.
## If you are a non-US user, a version with the interesting additional metric pmvd is available
## from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
movie_analysis <- movie %>%
summarise(director = as.numeric(as.factor(as.character(director))),
genre = as.numeric(as.factor(as.character(genre))),
studio = as.numeric(as.factor(as.character(studio))),
top200_box = as.numeric(as.factor(top200_box)),
best_pic_win = as.numeric(as.factor(best_pic_win)),
imdb_rating = imdb_rating,
audience_score = audience_score,
critics_score = critics_score,
.groups = 'drop')
set.seed(123)
split <- sample(sample(nrow(movie_analysis), .80*nrow(movie_analysis)))
train <- movie_analysis[split, ]
test <- movie_analysis[-split, ]
dim(train)
## [1] 520 8
dim(test)
## [1] 131 8
train <- train[complete.cases(train),]
test <- test[complete.cases(test),]
## Stepwise Feature Selection
# Stepwise
fullmod <- lm(imdb_rating ~ ., data = train)  # full model with all candidate predictors
nullmod <- lm(imdb_rating ~ 1, data = train)  # intercept-only (null) model
reg1C <- step(fullmod, scope = list(lower = nullmod, upper = fullmod),
direction="both")
## Start: AIC=-789.61
## imdb_rating ~ director + genre + studio + top200_box + best_pic_win +
## audience_score + critics_score
##
## Df Sum of Sq RSS AIC
## - director 1 0.001 105.09 -791.60
## - top200_box 1 0.033 105.12 -791.45
## - best_pic_win 1 0.290 105.38 -790.20
## <none> 105.09 -789.61
## - genre 1 0.898 105.98 -787.27
## - studio 1 1.153 106.24 -786.05
## - critics_score 1 19.180 124.27 -706.11
## - audience_score 1 130.063 235.15 -380.84
##
## Step: AIC=-791.6
## imdb_rating ~ genre + studio + top200_box + best_pic_win + audience_score +
## critics_score
##
## Df Sum of Sq RSS AIC
## - top200_box 1 0.033 105.12 -793.44
## - best_pic_win 1 0.289 105.38 -792.20
## <none> 105.09 -791.60
## + director 1 0.001 105.09 -789.61
## - genre 1 0.897 105.98 -789.27
## - studio 1 1.152 106.24 -788.04
## - critics_score 1 19.529 124.62 -706.68
## - audience_score 1 132.255 237.34 -378.10
##
## Step: AIC=-793.44
## imdb_rating ~ genre + studio + best_pic_win + audience_score +
## critics_score
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 0.310 105.43 -793.94
## <none> 105.12 -793.44
## + top200_box 1 0.033 105.09 -791.60
## + director 1 0.002 105.12 -791.45
## - genre 1 0.878 106.00 -791.20
## - studio 1 1.148 106.27 -789.90
## - critics_score 1 19.724 124.84 -707.75
## - audience_score 1 132.407 237.53 -379.71
##
## Step: AIC=-793.94
## imdb_rating ~ genre + studio + audience_score + critics_score
##
## Df Sum of Sq RSS AIC
## <none> 105.43 -793.94
## + best_pic_win 1 0.310 105.12 -793.44
## + top200_box 1 0.055 105.38 -792.20
## + director 1 0.000 105.43 -791.94
## - genre 1 0.911 106.34 -791.55
## - studio 1 1.148 106.58 -790.42
## - critics_score 1 20.018 125.45 -707.28
## - audience_score 1 133.073 238.50 -379.61
reg1C
##
## Call:
## lm(formula = imdb_rating ~ genre + studio + audience_score +
## critics_score, data = train)
##
## Coefficients:
## (Intercept) genre studio audience_score critics_score
## 3.5006839 0.0187028 0.0007946 0.0356436 0.0100028
str(summary(reg1C))
## List of 11
## $ call : language lm(formula = imdb_rating ~ genre + studio + audience_score + critics_score, data = train)
## $ terms :Classes 'terms', 'formula' language imdb_rating ~ genre + studio + audience_score + critics_score
## .. ..- attr(*, "variables")= language list(imdb_rating, genre, studio, audience_score, critics_score)
## .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:5] "imdb_rating" "genre" "studio" "audience_score" ...
## .. .. .. ..$ : chr [1:4] "genre" "studio" "audience_score" "critics_score"
## .. ..- attr(*, "term.labels")= chr [1:4] "genre" "studio" "audience_score" "critics_score"
## .. ..- attr(*, "order")= int [1:4] 1 1 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(imdb_rating, genre, studio, audience_score, critics_score)
## .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
## .. .. ..- attr(*, "names")= chr [1:5] "imdb_rating" "genre" "studio" "audience_score" ...
## $ residuals : Named num [1:510] -0.3429 0.0341 -0.892 -0.0623 -0.0544 ...
## ..- attr(*, "names")= chr [1:510] "354" "577" "152" "398" ...
## $ coefficients : num [1:5, 1:4] 3.500684 0.018703 0.000795 0.035644 0.010003 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "(Intercept)" "genre" "studio" "audience_score" ...
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:5] FALSE FALSE FALSE FALSE FALSE
## ..- attr(*, "names")= chr [1:5] "(Intercept)" "genre" "studio" "audience_score" ...
## $ sigma : num 0.457
## $ df : int [1:3] 5 505 5
## $ r.squared : num 0.812
## $ adj.r.squared: num 0.811
## $ fstatistic : Named num [1:3] 546 4 505
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:5, 1:5] 3.75e-02 -1.90e-03 -6.19e-05 -3.20e-04 4.46e-05 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "(Intercept)" "genre" "studio" "audience_score" ...
## .. ..$ : chr [1:5] "(Intercept)" "genre" "studio" "audience_score" ...
## - attr(*, "class")= chr "summary.lm"
# Calculate the relative importance (LMG) of each predictor in the full model
relImportance1C <- calc.relimp(fullmod, type = "lmg", rela = F)
#print string and execute command
cat('Relative Importances: \n')
## Relative Importances:
# Sort
sort(round(relImportance1C$lmg, 3), decreasing=TRUE)
## audience_score critics_score genre best_pic_win studio
## 0.495 0.296 0.008 0.006 0.004
## top200_box director
## 0.004 0.001
lm_model <- lm(formula = imdb_rating ~ genre + studio + audience_score +
critics_score, data = train)
summary(lm_model)
##
## Call:
## lm(formula = imdb_rating ~ genre + studio + audience_score +
## critics_score, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27385 -0.19929 0.01029 0.27695 1.01185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5006839 0.0884779 39.566 <2e-16 ***
## genre 0.0187028 0.0089555 2.088 0.0373 *
## studio 0.0007946 0.0003389 2.345 0.0194 *
## audience_score 0.0356436 0.0014118 25.247 <2e-16 ***
## critics_score 0.0100028 0.0010215 9.792 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4569 on 505 degrees of freedom
## Multiple R-squared: 0.8122, Adjusted R-squared: 0.8107
## F-statistic: 545.9 on 4 and 505 DF, p-value: < 2.2e-16
# Train a decision tree to predict imdb_rating (each rating value treated as a class)
dm_model <- rpart(imdb_rating ~ genre + audience_score + studio +
critics_score, data = train,
method = "class")
# Look at the model output
rpart.plot(x = dm_model, yesno = 2, type = 0, box.palette = "green", extra = 0)
# Train a gini-based model, splitting the tree based on gini index.
dm_model_gini <- rpart(formula = imdb_rating ~ .,
data = train,
method = "class",
parms = list(split = "gini"))
# Train an information-based model, splitting the tree based on information index.
dm_model_info <- rpart(formula = imdb_rating ~ .,
data = train,
method = "class",
parms = list(split = "information"))
# Generate predictions on the test set using the gini-based model
pred1 <- predict(object = dm_model_gini,
newdata = test,
type = "class")
# Generate predictions on the test set using the information-based model
pred2 <- predict(object = dm_model_info,
newdata = test,
type = "class")
# Compare classification error
ce(actual = test$imdb_rating,
predicted = pred1)
## [1] 0.870229
ce(actual = test$imdb_rating,
predicted = pred2)
## [1] 0.8625954
movie_rf_analysis <- movie %>%
summarise(director = as.numeric(as.factor(as.character(director))),
genre = as.factor(as.character(genre)),
studio = as.numeric(as.factor(as.character(studio))),
top200_box = as.factor(top200_box),
best_pic_win = as.factor(best_pic_win),
imdb_rating = imdb_rating,
audience_score = audience_score,
critics_score = critics_score,
imdb_num_votes = imdb_num_votes,
oscar = as.factor(ifelse(best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes" | best_pic_nom == "yes", "yes", "no")))
str(movie_rf_analysis)
## 'data.frame': 651 obs. of 10 variables:
## $ director : num 320 407 90 309 463 95 508 280 344 267 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ studio : num 91 202 167 34 13 163 147 118 88 84 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ audience_score: int 73 81 91 76 27 86 76 47 89 66 ...
## $ critics_score : int 45 96 91 80 33 91 57 17 90 83 ...
## $ imdb_num_votes: int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ oscar : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
set.seed(123)
split <- sample(sample(nrow(movie_rf_analysis), .75*nrow(movie_rf_analysis)))
train <- movie_rf_analysis[split, ]
test <- movie_rf_analysis[-split, ]
dim(train)
## [1] 488 10
dim(test)
## [1] 163 10
train <- train[complete.cases(train),]
test <- test[complete.cases(test),]
# Fit a logistic regression model predicting the Oscar outcome from all other variables
glm_model <- glm(oscar ~ ., data = train, family = "binomial")
summary(glm_model)
##
## Call:
## glm(formula = oscar ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7772 -0.7963 -0.5028 0.5868 2.8132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.726e+00 1.462e+00 -3.917 8.96e-05 ***
## director -1.650e-04 7.596e-04 -0.217 0.828063
## genreAnimation -1.539e+01 1.581e+03 -0.010 0.992234
## genreArt House & International -1.558e+01 1.347e+03 -0.012 0.990772
## genreComedy 2.851e-01 5.220e-01 0.546 0.584994
## genreDocumentary -1.879e+00 1.134e+00 -1.657 0.097526 .
## genreDrama 9.664e-01 4.254e-01 2.272 0.023114 *
## genreHorror -1.556e+01 9.882e+02 -0.016 0.987437
## genreMusical & Performing Arts -1.606e-01 1.187e+00 -0.135 0.892380
## genreMystery & Suspense 1.039e+00 5.100e-01 2.038 0.041590 *
## genreOther 9.524e-01 7.378e-01 1.291 0.196768
## genreScience Fiction & Fantasy -6.382e-01 1.236e+00 -0.516 0.605596
## studio 7.124e-03 2.152e-03 3.311 0.000929 ***
## top200_boxyes 4.599e-01 6.979e-01 0.659 0.509943
## best_pic_winyes 1.755e+01 1.702e+03 0.010 0.991770
## imdb_rating 7.329e-01 3.277e-01 2.236 0.025325 *
## audience_score -2.885e-02 1.356e-02 -2.127 0.033456 *
## critics_score -1.000e-04 6.503e-03 -0.015 0.987730
## imdb_num_votes 2.972e-06 1.339e-06 2.220 0.026396 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 549.34 on 477 degrees of freedom
## Residual deviance: 456.43 on 459 degrees of freedom
## AIC: 494.43
##
## Number of Fisher Scoring iterations: 16
pred_oscar <- predict(glm_model,
newdata = test,
type = "response")
pred_oscar = ifelse(pred_oscar > 0.5, "yes", "no")
table(pred_oscar, test$oscar)
##
## pred_oscar no yes
## no 108 39
## yes 6 10
mean(pred_oscar == test$oscar)
## [1] 0.7239264
# Reduce overfitting by keeping only the more significant independent variables
glm.overfit = glm(formula = oscar ~ genre + director + studio + critics_score + imdb_rating, family = "binomial",
data = train)
glm.overfit.oscar = predict(glm.overfit, newdata = test, type = "response")
glm.overfit.oscar = ifelse(glm.overfit.oscar > 0.5, "yes", "no")
table(glm.overfit.oscar, test$oscar)
##
## glm.overfit.oscar no yes
## no 111 40
## yes 3 9
mean(glm.overfit.oscar == test$oscar)
## [1] 0.7361963
# Create a Random Forest model with default parameters
rm_model <- randomForest(oscar ~ genre + director + studio + critics_score + imdb_rating, data = train, importance = TRUE)
rm_model
##
## Call:
## randomForest(formula = oscar ~ genre + director + studio + critics_score + imdb_rating, data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 28.45%
## Confusion matrix:
## no yes class.error
## no 315 38 0.1076487
## yes 98 27 0.7840000
# Generate predicted classes using the model object
oscar_prediction <- predict(object = rm_model,
newdata = test,
type = "class")
# Calculate the confusion matrix for the test set
confusionMatrix(data = oscar_prediction,
reference = test$oscar)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 106 31
## yes 8 18
##
## Accuracy : 0.7607
## 95% CI : (0.6878, 0.824)
## No Information Rate : 0.6994
## P-Value [Acc > NIR] : 0.050074
##
## Kappa : 0.3431
##
## Mcnemar's Test P-Value : 0.000427
##
## Sensitivity : 0.9298
## Specificity : 0.3673
## Pos Pred Value : 0.7737
## Neg Pred Value : 0.6923
## Prevalence : 0.6994
## Detection Rate : 0.6503
## Detection Prevalence : 0.8405
## Balanced Accuracy : 0.6486
##
## 'Positive' Class : no
##
Getting acquainted with the data through EDA:
Determined genre rankings by average audience score - Genres were ranked by computing the average audience score of the movies in each genre, and the top movie of each genre was also identified.
Determined top movies by year - The ten highest-rated movies (by IMDB rating) released between 1970 and 2000 were identified.
Determined the top 20 favorite studios - The top 20 studios were determined by computing the average audience score for each studio in the movies dataset.
Determined Oscar-winning box office hits - These are the films that won an award in at least one of the categories Best Actor, Best Actress, Best Director, or Best Picture and were also part of the top 200 box office hits.
Determined the best of the Rotten - The ten movies with the highest IMDB ratings among those rated 'Rotten' by critics were identified.
Splitting train and test data for regression analysis:
• Using knitr, derived the minimum, maximum, mean, and median of IMDB rating, IMDB number of votes, critics score, and audience score on the training set.
• Calculated quantiles of IMDB rating to see how its values are distributed across percentiles.
Implementation of Linear Regression:
Examined the relationships among IMDB rating, IMDB number of votes, critics score, and audience score; the four variables are positively correlated with one another, and their distributions (most notably IMDB number of votes) are positively skewed.
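The pairwise correlations themselves are not computed in the code above; a minimal sketch of how they could be checked on the full movies data frame (standard base R and dplyr calls only) is:
# Pairwise correlations among the four numeric variables, ignoring rows with NAs.
movie %>%
  dplyr::select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>%
  cor(use = "complete.obs") %>%
  round(2)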
Feature selection using stepwise:
• For feature selection, I compared forward, backward, and stepwise methods and settled on stepwise selection, judged by AIC and relative importance (RI); only the stepwise run is shown above, and a sketch of the forward- and backward-only variants appears below. The parameters considered for predicting IMDB rating were director, genre, studio, top200_box, best_pic_win, audience_score, and critics_score; audience score came out as the most significant predictor, and the final stepwise model has AIC = -793.94.
• Reading the fitted linear regression and its relative importances in detail, the selected predictors of IMDB rating are genre, studio, audience_score, and critics_score, with an adjusted R-squared of 81.1%.
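For reference, the forward- and backward-only variants mentioned above could be run the same way; their results are not shown here. A minimal sketch, assuming the fullmod (full) and nullmod (intercept-only) objects defined in the stepwise section, with hypothetical names reg_back and reg_fwd:
# Backward elimination: start from the full model and only drop terms.
reg_back <- step(fullmod, direction = "backward", trace = FALSE)
# Forward selection: start from the intercept-only model and only add terms.
reg_fwd <- step(nullmod, scope = list(lower = nullmod, upper = fullmod),
                direction = "forward", trace = FALSE)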
Decision Tree Model:
Taking audience_score and critics_score as the most important parameters (highest relative importance in the previous analysis), I fit a decision-tree-based model to predict IMDB rating.
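Note that rpart() above is called with method = "class" even though imdb_rating is continuous, so each distinct rating value is treated as its own class. A regression tree is the more natural fit for a numeric response; the following is only an alternative sketch (not part of the original analysis), assuming the same train/test split used in the decision tree section and a hypothetical model name dm_model_reg:
# Regression tree on the continuous imdb_rating (method = "anova"),
# using the same predictors as the classification tree above.
dm_model_reg <- rpart(imdb_rating ~ genre + audience_score + studio + critics_score,
                      data = train, method = "anova")
rpart.plot(dm_model_reg)
# Hold-out RMSE via the Metrics package loaded earlier.
rmse(actual = test$imdb_rating, predicted = predict(dm_model_reg, newdata = test))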
Gini and Information model using metrics:
• Using the Metrics package, built two decision trees to compare classification error, one splitting on the gini index and one on information gain; both criteria measure node impurity.
• The information-based model has a slightly lower classification error than the gini-based model (0.863 vs 0.870).
Implementation of Logistic Regression:
Consider the scenario where we need to classify whether a movie wins an Oscar or not. If we used linear regression for this problem, we would need to choose a threshold at which to split the predictions into classes. Say a movie actually won an Oscar, the predicted continuous value is 0.4, and the threshold is 0.5: the movie would be misclassified as not winning. From this example, it can be inferred that linear regression is not well suited to classification problems. Linear regression is also unbounded, which is what brings logistic regression into the picture: its predicted values range strictly from 0 to 1.
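To make the bounded-output point concrete, here is a small sketch using the glm_model fitted above (log_odds and probs are hypothetical names): the response-scale predictions are simply the inverse-logit of the linear predictor, so they always fall between 0 and 1.
# type = "link" returns the unbounded log-odds; type = "response" applies the
# inverse-logit 1 / (1 + exp(-x)), giving probabilities strictly in (0, 1).
log_odds <- predict(glm_model, newdata = test, type = "link")
probs <- 1 / (1 + exp(-log_odds))
range(probs)
all.equal(unname(probs), unname(predict(glm_model, newdata = test, type = "response")))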
• The logistic regression model has an accuracy of 72.39%.
• After reducing overfitting by dropping less significant variables, the accuracy improves to 73.62%.
Implementation of Random Forest:
Using randomForest to dig deeper into the analysis, I evaluated how well a movie's Oscar status (a win for Best Actor, Best Actress, or Best Director, or a Best Picture nomination) can be predicted from genre, director, studio, critics score, and IMDB rating.
• The accuracy on the test set is 76.1%, which indicates reasonably good overall performance, although the model is much better at identifying non-winners than winners.
• The confusion matrix shows that 106 movies without an Oscar and 18 movies with an Oscar were assigned the correct labels.
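Since the random forest above was fitted with importance = TRUE, the contribution of each predictor can also be inspected directly; a short sketch using the rm_model object:
# Mean decrease in accuracy and in Gini impurity for each predictor.
importance(rm_model)
varImpPlot(rm_model, main = "Variable importance for the Oscar random forest")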