Analysis of Internet Movie Database
Team members: Pallavi Saitu, Chenggang Tu, Yash Rajiv Pillai, Bhaskara Satya Devarapalli, Saeed Shokri
# Including libraries
library(readr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 3.5.2
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
#importing the data set
movies_data = read.csv("/Users/pallavisaitu/Downloads/IMDBMovieData.csv")
summary(movies_data)
## color director_name num_critic_for_reviews
## : 19 : 104 Min. : 1.0
## Black and White: 209 Steven Spielberg: 26 1st Qu.: 50.0
## Color :4815 Woody Allen : 22 Median :110.0
## Clint Eastwood : 20 Mean :140.2
## Martin Scorsese : 20 3rd Qu.:195.0
## Ridley Scott : 17 Max. :813.0
## (Other) :4834 NA's :50
## duration director_facebook_likes actor_3_facebook_likes
## Min. : 7.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 93.0 1st Qu.: 7.0 1st Qu.: 133.0
## Median :103.0 Median : 49.0 Median : 371.5
## Mean :107.2 Mean : 686.5 Mean : 645.0
## 3rd Qu.:118.0 3rd Qu.: 194.5 3rd Qu.: 636.0
## Max. :511.0 Max. :23000.0 Max. :23000.0
## NA's :15 NA's :104 NA's :23
## actor_2_name actor_1_facebook_likes gross
## Morgan Freeman : 20 Min. : 0 Min. : 162
## Charlize Theron: 15 1st Qu.: 614 1st Qu.: 5340988
## Brad Pitt : 14 Median : 988 Median : 25517500
## : 13 Mean : 6560 Mean : 48468408
## James Franco : 11 3rd Qu.: 11000 3rd Qu.: 62309438
## Meryl Streep : 11 Max. :640000 Max. :760505847
## (Other) :4959 NA's :7 NA's :884
## genres actor_1_name
## Drama : 236 Robert De Niro : 49
## Comedy : 209 Johnny Depp : 41
## Comedy|Drama : 191 Nicolas Cage : 33
## Comedy|Drama|Romance: 187 J.K. Simmons : 31
## Comedy|Romance : 158 Bruce Willis : 30
## Drama|Romance : 152 Denzel Washington: 30
## (Other) :3910 (Other) :4829
## movie_title num_voted_users
## Ben-Hur\xe5\xca : 3 Min. : 5
## Halloween\xe5\xca : 3 1st Qu.: 8594
## Home\xe5\xca : 3 Median : 34359
## King Kong\xe5\xca : 3 Mean : 83668
## Pan\xe5\xca : 3 3rd Qu.: 96309
## The Fast and the Furious\xe5\xca: 3 Max. :1689764
## (Other) :5025
## cast_total_facebook_likes actor_3_name facenumber_in_poster
## Min. : 0 : 23 Min. : 0.000
## 1st Qu.: 1411 Ben Mendelsohn: 8 1st Qu.: 0.000
## Median : 3090 John Heard : 8 Median : 1.000
## Mean : 9699 Steve Coogan : 8 Mean : 1.371
## 3rd Qu.: 13756 Anne Hathaway : 7 3rd Qu.: 2.000
## Max. :656730 Jon Gries : 7 Max. :43.000
## (Other) :4982 NA's :13
## plot_keywords
## : 153
## based on novel : 4
## 1940s|child hero|fantasy world|orphan|reference to peter pan : 3
## alien friendship|alien invasion|australia|flying car|mother daughter relationship: 3
## animal name in title|ape abducts a woman|gorilla|island|king kong : 3
## assistant|experiment|frankenstein|medical student|scientist : 3
## (Other) :4874
## movie_imdb_link
## http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1: 3
## http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1: 3
## (Other) :5025
## num_user_for_reviews language country content_rating
## Min. : 1.0 English :4704 USA :3807 R :2118
## 1st Qu.: 65.0 French : 73 UK : 448 PG-13 :1461
## Median : 156.0 Spanish : 40 France : 154 PG : 701
## Mean : 272.8 Hindi : 28 Canada : 126 : 303
## 3rd Qu.: 326.0 Mandarin: 26 Germany : 97 Not Rated: 116
## Max. :5060.0 German : 19 Australia: 55 G : 112
## NA's :21 (Other) : 153 (Other) : 356 (Other) : 232
## budget title_year actor_2_facebook_likes imdb_score
## Min. :2.180e+02 Min. :1916 Min. : 0 Min. :1.600
## 1st Qu.:6.000e+06 1st Qu.:1999 1st Qu.: 281 1st Qu.:5.800
## Median :2.000e+07 Median :2005 Median : 595 Median :6.600
## Mean :3.975e+07 Mean :2002 Mean : 1652 Mean :6.442
## 3rd Qu.:4.500e+07 3rd Qu.:2011 3rd Qu.: 918 3rd Qu.:7.200
## Max. :1.222e+10 Max. :2016 Max. :137000 Max. :9.500
## NA's :492 NA's :108 NA's :13
## aspect_ratio movie_facebook_likes
## Min. : 1.18 Min. : 0
## 1st Qu.: 1.85 1st Qu.: 0
## Median : 2.35 Median : 166
## Mean : 2.22 Mean : 7526
## 3rd Qu.: 2.35 3rd Qu.: 3000
## Max. :16.00 Max. :349000
## NA's :329
str(movies_data)
## 'data.frame': 5043 obs. of 28 variables:
## $ color : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
## $ director_name : Factor w/ 2399 levels "","\xcc\xe4mile Gaudreault",..: 926 799 2027 379 605 106 2030 1652 1225 553 ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
## $ actor_1_name : Factor w/ 2098 levels "","\xcc\xd2lafur Darri \xcc\xd2lafsson",..: 303 982 354 1968 527 441 786 221 337 34 ...
## $ movie_title : Factor w/ 4917 levels "[Rec] 2\xe5\xca",..: 397 2731 3279 3707 3332 1960 3289 3459 398 1630 ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : Factor w/ 3522 levels "","\xcc\xd2scar Jaenada",..: 3442 1393 3134 1769 1 2714 1969 2162 3018 2941 ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
## $ movie_imdb_link : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
## $ country : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
## $ content_rating : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
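The `summary()` and `str()` output above shows movie titles ending in the stray bytes `\xe5\xca`, which looks like an encoding problem in the CSV. Below is a minimal sketch of one way to strip them, assuming the stray bytes are exactly the pair printed in the output. The `titles` vector is hypothetical toy data, and it is a character vector rather than a factor.

```r
# Hypothetical titles carrying the same trailing bytes seen in the output
titles <- c("Ben-Hur\xe5\xca", "Home\xe5\xca", "King Kong\xe5\xca")

# Remove the byte pair literally; useBytes = TRUE compares raw bytes
# instead of interpreting them as characters in the current locale
clean_titles <- gsub("\xe5\xca", "", titles, fixed = TRUE, useBytes = TRUE)

# Trim any whitespace left behind
clean_titles <- trimws(clean_titles)
print(clean_titles)
```

Because `movie_title` is stored as a factor here, the same call could presumably be applied to `levels(movies_data$movie_title)` rather than to the column itself.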
#check which columns have NAs
colSums(sapply(movies_data, is.na))
## color director_name
## 0 0
## num_critic_for_reviews duration
## 50 15
## director_facebook_likes actor_3_facebook_likes
## 104 23
## actor_2_name actor_1_facebook_likes
## 0 7
## gross genres
## 884 0
## actor_1_name movie_title
## 0 0
## num_voted_users cast_total_facebook_likes
## 0 0
## actor_3_name facenumber_in_poster
## 0 13
## plot_keywords movie_imdb_link
## 0 0
## num_user_for_reviews language
## 21 0
## country content_rating
## 0 0
## budget title_year
## 492 108
## actor_2_facebook_likes imdb_score
## 13 0
## aspect_ratio movie_facebook_likes
## 329 0
#Drop rows containing any NA, then drop exact duplicate rows
movies_data <- na.omit(movies_data)
movies_data <- movies_data[!duplicated(movies_data), ]
dim(movies_data)
## [1] 3768 28
sum(complete.cases(movies_data))
## [1] 3768
#All 3768 remaining rows are complete cases (no NAs left)
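One caveat about the NA counts: the `summary()` output lists blank entries for `color` (19) and `director_name` (104), yet `is.na()` reports 0 for both, because an empty string is not `NA`. The sketch below shows one way to recode blanks before calling `na.omit()`. It uses a hypothetical toy data frame with character columns, not the real data.

```r
# Toy data frame (hypothetical values) mimicking the blank-string issue
toy <- data.frame(
  color    = c("Color", "", "Color", "Black and White"),
  duration = c(120, 95, NA, 101),
  stringsAsFactors = FALSE
)

# is.na() does not flag empty strings, so only `duration` reports a gap
na_before <- colSums(is.na(toy))

# Recode blanks as NA in the character columns, then count again
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(x) {
  x[!is.na(x) & x == ""] <- NA
  x
})
na_after <- colSums(is.na(toy))

# Rows that survive na.omit() once blanks count as missing
toy_clean <- na.omit(toy)
print(na_before)
print(na_after)
print(nrow(toy_clean))
```

When reading the file, passing `na.strings = c("", "NA")` to `read.csv()` should have a similar effect. Whether blank categorical fields ought to be dropped is a judgment call for the analysis.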
#Visualization of ratings by year
ggplot(movies_data, aes(x=title_year,y=imdb_score))+ geom_smooth()+ labs(x = "Year movie was released", y = "IMDB score", title = "Figure 1: Ratings by Year") +theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#Ratings and release year have a negative relationship: the smoothed IMDB score falls as the year increases. The data appear to be skewed, so let's examine the number of movies by year.
#Checking distribution of movies by year
ggplot(movies_data, aes(title_year))+ geom_bar(col="red", fill="sky blue")+ scale_x_continuous(breaks=seq(min(movies_data$title_year), max(movies_data$title_year), by = 5)) + labs(x = "Year movie was released", y = "Movie Count", title = "Figure 2: Barplot of Movies Released") + theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1))
#Our data is skewed towards more recent years. We see steady levels of fewer than 50 movies per year up until the 1990s, when the count starts to increase to 100+ films per year.
#Top 20 directors by average IMDB score
freqcat <-movies_data %>%
group_by(director_name) %>%
summarise(avg_imdb = mean(imdb_score)) %>%
arrange(desc(avg_imdb)) %>%
top_n(20, avg_imdb)
plot1<- ggplot(freqcat, aes(x=reorder(director_name,-avg_imdb), avg_imdb, fill=avg_imdb)) + geom_point() +
ggtitle("Figure 3: Top 20 best rated Directors") + xlab("Director Name") + ylab("Imdb Rating") +
theme(axis.title=element_text(size=12, face="bold"),
axis.text.x=element_text(size=12, angle=90), legend.position="none")
plot(plot1)
#Top 20 lead actors by average IMDB score
freqcat2 <-movies_data %>%
group_by(actor_1_name) %>%
summarise(avg_imdb1 = mean(imdb_score)) %>%
arrange(desc(avg_imdb1)) %>%
top_n(20, avg_imdb1)
plot2<- ggplot(freqcat2, aes(x=reorder(actor_1_name,-avg_imdb1), avg_imdb1, fill=avg_imdb1)) + geom_point() +
ggtitle("Figure 4: Top 20 best rated Actors") + xlab("Actor Name") + ylab("Imdb Rating") +
theme(axis.title=element_text(size=12, face="bold"),
axis.text.x=element_text(size=12, angle=90), legend.position="none")
plot(plot2)
movies_data2<- movies_data%>%
mutate(profitpct = ((gross - budget)/budget)*100)
movies_data2<-unique(movies_data2)
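The `profitpct` column expresses profit as a percentage of budget. A small worked example of the same formula follows. The helper name `profit_pct` and the input figures are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Same formula as the mutate() call: profit relative to budget, in percent
profit_pct <- function(gross, budget) {
  ((gross - budget) / budget) * 100
}

# A film that grosses 150 on a budget of 100 returns 50% profit
p_gain <- profit_pct(150, 100)

# A film that grosses 50 on a budget of 100 loses half its budget: -50%
p_loss <- profit_pct(50, 100)

# A very small budget inflates the percentage: 3,000,000 on 15,000 is 19900%
p_tiny <- profit_pct(3e6, 15000)

print(c(gain = p_gain, loss = p_loss, tiny_budget = p_tiny))
```

The last case suggests why low-budget films are likely to dominate a ranking by profit percentage: the denominator is small, so even modest earnings produce very large values.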
#Highest Grossing Movies
freqcat3<- movies_data2 %>%
arrange(desc(gross)) %>%
top_n(10, gross)
plot3<- ggplot(freqcat3, aes(x=reorder(movie_title,-gross), y=gross/1000000, fill=gross)) + geom_bar(stat="identity") +
ggtitle("Figure 6: Top 10 Highest Grossing Movies") + ylab("Gross Earnings in Millions (USD) ") +geom_text(label=freqcat3$movie_title, vjust=1, angle= 45)+
theme(axis.title=element_text(size=12, face="bold"),
axis.text.x=element_blank(),legend.position="none")+scale_fill_gradient(low="orange", high="red")
plot(plot3)
#Most Profitable movies
freqcat4<- movies_data2 %>%
arrange(desc(profitpct)) %>%
top_n(12, profitpct)
#Manually drop two rows by position: the 5th, then the 10th of those that remain
freqcat4<- freqcat4[-5,]
freqcat4<- freqcat4[-10,]
plot4<- ggplot(freqcat4, aes(x=budget/1000, y=profitpct/1000)) + geom_point() +
ggtitle("Figure 7: Most Profitable Movies") + ylab("Profit % (in thousands)") + xlab("Budget in Thousands (USD)") +geom_text_repel(aes(label=movie_title))+
theme(axis.title=element_text(size=12, face="bold"),
axis.text.x=element_text(size=12, face="bold"),legend.position="none")
plot(plot4)
#Of the highest-grossing movies, which is the most profitable?
freqcat5<- movies_data2 %>%
arrange(desc(gross)) %>%
top_n(10, gross)%>%
top_n(10,profitpct)
plot5<- ggplot(freqcat5, aes(x=budget/1000000, y=profitpct/1000)) + geom_point()+
ggtitle("Figure 8: Highest Grossing and Most Profitable Movies") + ylab("Profit % (in thousands)") + xlab("Budget in Millions (USD)") +geom_text_repel(aes(label=movie_title))+
theme(axis.title=element_text(size=12, face="bold"),
axis.text.x=element_text(size=12, face="bold"),legend.position="none")
plot(plot5)
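In the chunk above, `top_n(10, gross)` already narrows the data to ten rows, so the following `top_n(10, profitpct)` keeps all of them (barring ties in `gross`). If the goal is to rank the top grossers by profitability, sorting the subset is one option. The sketch below does this in base R on a hypothetical toy data frame, taking the top 3 instead of the top 10 to keep it short.

```r
# Toy data (hypothetical figures): five movies with gross and budget
toy <- data.frame(
  movie_title = c("A", "B", "C", "D", "E"),
  gross       = c(500, 400, 300, 200, 100),
  budget      = c(250, 100, 300, 20, 10),
  stringsAsFactors = FALSE
)
toy$profitpct <- ((toy$gross - toy$budget) / toy$budget) * 100

# Step 1: keep the three highest-grossing movies
top_gross <- head(toy[order(-toy$gross), ], 3)

# Step 2: rank that subset by profit percentage, highest first
ranked <- top_gross[order(-top_gross$profitpct), ]

# The most profitable movie among the top grossers
best <- ranked$movie_title[1]
print(ranked[, c("movie_title", "gross", "profitpct")])
print(best)
```

In this toy data, movies D and E have the highest profit percentages overall but are excluded in step 1, so the answer depends on where the gross cutoff is drawn.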
## critical acclaim vs box office
plot6<- freqcat3%>%
ggplot(aes(x = imdb_score, y = gross/1000000 )) +
geom_point()+geom_text_repel(aes(label=movie_title))+ ylab("Gross Earnings in Millions (USD)")+ xlab("IMDB Score")+ggtitle("Figure 9: Commercial Success Vs Critical acclaim" )
plot(plot6)
## best rated
freqcat7<- movies_data2 %>%
arrange(desc(num_voted_users)) %>%
top_n(10, imdb_score)%>%
top_n(10,num_voted_users)
plot7<- freqcat7%>%
ggplot(aes(x = imdb_score, y = num_voted_users)) +
geom_point()+geom_text_repel(aes(label=movie_title))+ ylab("Number of Votes")+ xlab("IMDB Score")+ggtitle("Figure 5: Best Rated Movies on IMDB" )
plot(plot7)
#Correlation heat map
ggcorr(movies_data, label = TRUE, label_round = 2, label_size = 2, size = 2, hjust = .85) +
ggtitle("Figure 10: Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
## Warning in ggcorr(movies_data, label = TRUE, label_round = 2, label_size
## = 2, : data in column(s) 'color', 'director_name', 'actor_2_name',
## 'genres', 'actor_1_name', 'movie_title', 'actor_3_name', 'plot_keywords',
## 'movie_imdb_link', 'language', 'country', 'content_rating' are not numeric
## and were ignored
#There doesn't seem to be any strong correlation between the ratings and any of the continuous variables.
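To check the heatmap reading numerically, the correlations with `imdb_score` can be pulled out and sorted by absolute size. The following is a sketch on a hypothetical toy data frame. The values are contrived so the correlations are exactly 1 and -1, which the real data will not show.

```r
# Toy data (contrived values): one text column, three numeric columns
toy <- data.frame(
  movie_title = c("A", "B", "C", "D", "E"),
  imdb_score  = c(6, 7, 8, 5, 9),
  duration    = c(100, 110, 120, 90, 130),
  gross       = c(50, 40, 30, 60, 20),
  stringsAsFactors = FALSE
)

# Keep numeric columns only, as ggcorr() does when it ignores the rest
num_cols <- vapply(toy, is.numeric, logical(1))
cors <- cor(toy[, num_cols], use = "complete.obs")

# Correlation of every other numeric variable with imdb_score
score_cors <- cors[, "imdb_score"]
score_cors <- score_cors[names(score_cors) != "imdb_score"]

# Sort by strength, regardless of sign
score_cors <- score_cors[order(-abs(score_cors))]
print(round(score_cors, 2))
```

Replacing `toy` with `movies_data` should give the same numbers that appear in the `imdb_score` row of the heatmap.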
# Multiple Linear Regression, model 1: Facebook likes, gross, and critic reviews
model1_fit <- lm(imdb_score ~ movie_facebook_likes + gross + num_critic_for_reviews, data=movies_data)
summary(model1_fit)
##
## Call:
## lm(formula = imdb_score ~ movie_facebook_likes + gross + num_critic_for_reviews,
## data = movies_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8950 -0.5680 0.0709 0.6805 2.5482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.009e+00 2.939e-02 204.468 < 2e-16 ***
## movie_facebook_likes 3.649e-06 1.061e-06 3.439 0.00059 ***
## gross 9.579e-10 2.613e-10 3.665 0.00025 ***
## num_critic_for_reviews 2.261e-03 1.931e-04 11.710 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9869 on 3764 degrees of freedom
## Multiple R-squared: 0.1257, Adjusted R-squared: 0.1251
## F-statistic: 180.5 on 3 and 3764 DF, p-value: < 2.2e-16
# Multiple Linear Regression, model 2: user votes, duration, and critic reviews
model2_fit <- lm(imdb_score ~ num_voted_users +
duration +
num_critic_for_reviews,
data=movies_data)
summary(model2_fit)
##
## Call:
## lm(formula = imdb_score ~ num_voted_users + duration + num_critic_for_reviews,
## data = movies_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7842 -0.5063 0.0913 0.6281 2.5347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.913e+00 7.508e-02 65.44 < 2e-16 ***
## num_voted_users 2.440e-06 1.242e-07 19.64 < 2e-16 ***
## duration 1.068e-02 6.821e-04 15.65 < 2e-16 ***
## num_critic_for_reviews 7.360e-04 1.469e-04 5.01 5.7e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8949 on 3764 degrees of freedom
## Multiple R-squared: 0.2812, Adjusted R-squared: 0.2806
## F-statistic: 490.8 on 3 and 3764 DF, p-value: < 2.2e-16
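The coefficients of the second model are small because they are per-unit effects (per vote, per minute, per review). Rescaling them to larger steps makes them easier to read. The estimates below are copied from the summary above. The step sizes (100,000 votes, 10 minutes, 100 reviews) are arbitrary choices made for illustration.

```r
# Estimates copied from summary(model2_fit)
coefs <- c(
  num_voted_users        = 2.440e-06,
  duration               = 1.068e-02,
  num_critic_for_reviews = 7.360e-04
)

# Illustrative step sizes (an assumption, not part of the model)
steps <- c(
  num_voted_users        = 1e5,  # 100,000 additional votes
  duration               = 10,   # 10 additional minutes
  num_critic_for_reviews = 100   # 100 additional critic reviews
)

# Expected change in IMDB score per step, other predictors held fixed
effect <- coefs * steps
print(round(effect, 4))
```

On this scale, 100,000 extra votes is associated with about 0.24 points of IMDB score, 10 extra minutes with about 0.11, and 100 extra critic reviews with about 0.07. The second model also explains more variance than the first (adjusted R-squared 0.2806 versus 0.1251), although most of the variation in score is still unexplained.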
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(model2_fit)