R Notebook

               Analysis of Internet Movie Database

Team members: Pallavi Saitu, Chenggang Tu, Yash Rajiv Pillai, Bhaskara Satya Devarapalli, Saeed Shokri

# Including libraries
library(readr)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 3.5.2

library(GGally)

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

#importing the data set

movies_data = read.csv("/Users/pallavisaitu/Downloads/IMDBMovieData.csv")
summary(movies_data)

##               color               director_name  num_critic_for_reviews
##                  :  19                   : 104   Min.   :  1.0         
##   Black and White: 209   Steven Spielberg:  26   1st Qu.: 50.0         
##  Color           :4815   Woody Allen     :  22   Median :110.0         
##                          Clint Eastwood  :  20   Mean   :140.2         
##                          Martin Scorsese :  20   3rd Qu.:195.0         
##                          Ridley Scott    :  17   Max.   :813.0         
##                          (Other)         :4834   NA's   :50            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  133.0       
##  Median :103.0   Median :   49.0         Median :  371.5       
##  Mean   :107.2   Mean   :  686.5         Mean   :  645.0       
##  3rd Qu.:118.0   3rd Qu.:  194.5         3rd Qu.:  636.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :104             NA's   :23            
##           actor_2_name  actor_1_facebook_likes     gross          
##  Morgan Freeman :  20   Min.   :     0         Min.   :      162  
##  Charlize Theron:  15   1st Qu.:   614         1st Qu.:  5340988  
##  Brad Pitt      :  14   Median :   988         Median : 25517500  
##                 :  13   Mean   :  6560         Mean   : 48468408  
##  James Franco   :  11   3rd Qu.: 11000         3rd Qu.: 62309438  
##  Meryl Streep   :  11   Max.   :640000         Max.   :760505847  
##  (Other)        :4959   NA's   :7              NA's   :884        
##                   genres                actor_1_name 
##  Drama               : 236   Robert De Niro   :  49  
##  Comedy              : 209   Johnny Depp      :  41  
##  Comedy|Drama        : 191   Nicolas Cage     :  33  
##  Comedy|Drama|Romance: 187   J.K. Simmons     :  31  
##  Comedy|Romance      : 158   Bruce Willis     :  30  
##  Drama|Romance       : 152   Denzel Washington:  30  
##  (Other)             :3910   (Other)          :4829  
##  movie_title                             num_voted_users  
##  Ben-Hur\xe5\xca                 :   3   Min.   :      5  
##  Halloween\xe5\xca               :   3   1st Qu.:   8594  
##  Home\xe5\xca                    :   3   Median :  34359  
##  King Kong\xe5\xca               :   3   Mean   :  83668  
##  Pan\xe5\xca                     :   3   3rd Qu.:  96309  
##  The Fast and the Furious\xe5\xca:   3   Max.   :1689764  
##  (Other)                         :5025                    
##  cast_total_facebook_likes         actor_3_name  facenumber_in_poster
##  Min.   :     0                          :  23   Min.   : 0.000      
##  1st Qu.:  1411            Ben Mendelsohn:   8   1st Qu.: 0.000      
##  Median :  3090            John Heard    :   8   Median : 1.000      
##  Mean   :  9699            Steve Coogan  :   8   Mean   : 1.371      
##  3rd Qu.: 13756            Anne Hathaway :   7   3rd Qu.: 2.000      
##  Max.   :656730            Jon Gries     :   7   Max.   :43.000      
##                            (Other)       :4982   NA's   :13          
##                                                                            plot_keywords 
##                                                                                   : 153  
##  based on novel                                                                   :   4  
##  1940s|child hero|fantasy world|orphan|reference to peter pan                     :   3  
##  alien friendship|alien invasion|australia|flying car|mother daughter relationship:   3  
##  animal name in title|ape abducts a woman|gorilla|island|king kong                :   3  
##  assistant|experiment|frankenstein|medical student|scientist                      :   3  
##  (Other)                                                                          :4874  
##                                              movie_imdb_link
##  http://www.imdb.com/title/tt0077651/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0232500/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt0360717/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt1976009/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2224026/?ref_=fn_tt_tt_1:   3  
##  http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1:   3  
##  (Other)                                             :5025  
##  num_user_for_reviews     language         country       content_rating
##  Min.   :   1.0       English :4704   USA      :3807   R        :2118  
##  1st Qu.:  65.0       French  :  73   UK       : 448   PG-13    :1461  
##  Median : 156.0       Spanish :  40   France   : 154   PG       : 701  
##  Mean   : 272.8       Hindi   :  28   Canada   : 126            : 303  
##  3rd Qu.: 326.0       Mandarin:  26   Germany  :  97   Not Rated: 116  
##  Max.   :5060.0       German  :  19   Australia:  55   G        : 112  
##  NA's   :21           (Other) : 153   (Other)  : 356   (Other)  : 232  
##      budget            title_year   actor_2_facebook_likes   imdb_score   
##  Min.   :2.180e+02   Min.   :1916   Min.   :     0         Min.   :1.600  
##  1st Qu.:6.000e+06   1st Qu.:1999   1st Qu.:   281         1st Qu.:5.800  
##  Median :2.000e+07   Median :2005   Median :   595         Median :6.600  
##  Mean   :3.975e+07   Mean   :2002   Mean   :  1652         Mean   :6.442  
##  3rd Qu.:4.500e+07   3rd Qu.:2011   3rd Qu.:   918         3rd Qu.:7.200  
##  Max.   :1.222e+10   Max.   :2016   Max.   :137000         Max.   :9.500  
##  NA's   :492         NA's   :108    NA's   :13                            
##   aspect_ratio   movie_facebook_likes
##  Min.   : 1.18   Min.   :     0      
##  1st Qu.: 1.85   1st Qu.:     0      
##  Median : 2.35   Median :   166      
##  Mean   : 2.22   Mean   :  7526      
##  3rd Qu.: 2.35   3rd Qu.:  3000      
##  Max.   :16.00   Max.   :349000      
##  NA's   :329

str(movies_data)

## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","\xcc\xe4mile Gaudreault",..: 926 799 2027 379 605 106 2030 1652 1225 553 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","\xcc\xd2lafur Darri \xcc\xd2lafsson",..: 303 982 354 1968 527 441 786 221 337 34 ...
##  $ movie_title              : Factor w/ 4917 levels "[Rec] 2\xe5\xca",..: 397 2731 3279 3707 3332 1960 3289 3459 398 1630 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","\xcc\xd2scar Jaenada",..: 3442 1393 3134 1769 1 2714 1969 2162 3018 2941 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

#check which columns have NAs
colSums(sapply(movies_data, is.na))

##                     color             director_name 
##                         0                         0 
##    num_critic_for_reviews                  duration 
##                        50                        15 
##   director_facebook_likes    actor_3_facebook_likes 
##                       104                        23 
##              actor_2_name    actor_1_facebook_likes 
##                         0                         7 
##                     gross                    genres 
##                       884                         0 
##              actor_1_name               movie_title 
##                         0                         0 
##           num_voted_users cast_total_facebook_likes 
##                         0                         0 
##              actor_3_name      facenumber_in_poster 
##                         0                        13 
##             plot_keywords           movie_imdb_link 
##                         0                         0 
##      num_user_for_reviews                  language 
##                        21                         0 
##                   country            content_rating 
##                         0                         0 
##                    budget                title_year 
##                       492                       108 
##    actor_2_facebook_likes                imdb_score 
##                        13                         0 
##              aspect_ratio      movie_facebook_likes 
##                       329                         0

# Drop rows with any NA, then drop exact duplicate rows
movies_data <- na.omit(movies_data)
movies_data <- movies_data[!duplicated(movies_data), ]
dim(movies_data)

## [1] 3768   28

sum(complete.cases(movies_data))

## [1] 3768

#Every remaining row is now a complete case: 3768 rows and 3768 complete cases
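
The cleaning steps above (count NAs per column, drop incomplete rows, drop duplicate rows) can be tried on a small data frame first. The sketch below uses a made-up `toy` data frame whose values are hypothetical and not taken from the IMDB file; it only illustrates what each base R call does.

```r
# Toy data frame (hypothetical values, not from IMDBMovieData.csv)
toy <- data.frame(
  title  = c("A", "B", "B", "C", "D"),
  gross  = c(100, 200, 200, NA, 50),
  budget = c(80, 150, 150, 40, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Drop every row that has at least one NA, then drop exact duplicate rows
toy_clean <- na.omit(toy)
toy_clean <- toy_clean[!duplicated(toy_clean), ]

print(na_counts)
print(dim(toy_clean))
```

After both steps the number of rows equals the number of complete cases, which is the check that `sum(complete.cases(...))` performs on the real data.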

#Visualization of ratings by year

ggplot(movies_data, aes(x=title_year,y=imdb_score))+ geom_smooth()+ labs(x = "Year movie was released", y = "IMDB score", title = "Figure 1: Ratings by Year")  +theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1))

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

#Ratings and release year have a negative relationship. The data appears to be skewed, so let's examine the number of movies by year.

#Checking distribution of movies by year
ggplot(movies_data, aes(title_year))+ geom_bar(col="red", fill="sky blue")+ scale_x_continuous(breaks=seq(min(movies_data$title_year), max(movies_data$title_year), by = 5)) +   labs(x = "Year movie was released", y = "Movie Count", title = "Figure 2: Barplot of Movie released") +   theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1))

#Our data is skewed towards more recent years. We see steady levels of fewer than 50 movies per year up until the 1990s, when the count starts to increase to 100+ films per year.

# Top 20 directors by average IMDB score
freqcat <-movies_data %>%
  group_by(director_name) %>%
  summarise(avg_imdb = mean(imdb_score)) %>%
  arrange(desc(avg_imdb)) %>%
  top_n(20, avg_imdb) 




plot1<- ggplot(freqcat, aes(x=reorder(director_name,-avg_imdb), avg_imdb, fill=avg_imdb)) + geom_point() + 
  ggtitle("Figure 3: Top 20 best rated Directors") + xlab("Director Name") + ylab("Imdb Rating")  + 
  theme(axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="none") 
plot(plot1)

# Top 20 lead actors (actor_1_name) by average IMDB score
freqcat2 <-movies_data %>%
  group_by(actor_1_name) %>%
  summarise(avg_imdb1 = mean(imdb_score)) %>%
  arrange(desc(avg_imdb1)) %>%
  top_n(20, avg_imdb1) 

plot2<- ggplot(freqcat2, aes(x=reorder(actor_1_name,-avg_imdb1), avg_imdb1, fill=avg_imdb1)) + geom_point() + 
  ggtitle("Figure 4: Top 20 best rated Actors") + xlab("Actor Name") + ylab("Imdb Rating")  + 
  theme(axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, angle=90), legend.position="none") 
plot(plot2)

# Profit as a percentage of budget
movies_data2<- movies_data%>% 
  mutate(profitpct = ((gross - budget)/budget)*100)
movies_data2<-unique(movies_data2)
#Highest Grossing Movies
freqcat3<- movies_data2 %>%
  arrange(desc(gross)) %>%
  top_n(10, gross)

plot3<- ggplot(freqcat3, aes(x=reorder(movie_title,-gross), y=gross/1000000, fill=gross)) + geom_bar(stat="identity") + 
  ggtitle("Figure 6: Top 10 Highest Grossing Movies")  + ylab("Gross Earnings in Millions (USD) ")  +geom_text(label=freqcat3$movie_title, vjust=1, angle= 45)+ 
  theme(axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_blank(),legend.position="none")+scale_fill_gradient(low="orange", high="red") 
plot(plot3)

#Most Profitable movies
freqcat4<- movies_data2 %>%
  arrange(desc(profitpct)) %>%
  top_n(12, profitpct)

# Remove two rows by position: the 5th row, then the 10th of the rows that remain
freqcat4<- freqcat4[-5,]

freqcat4<- freqcat4[-10,]


plot4<- ggplot(freqcat4, aes(x=budget/1000, y=profitpct/1000)) + geom_point() + 
  ggtitle("Figure 7: Most Profitable Movies")  + ylab("Profit % (in thousands)") + xlab("Budget in Thousands (USD)") +geom_text_repel(aes(label=movie_title))+ 
  theme(axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, face="bold"),legend.position="none")
plot(plot4)
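
The profit measure plotted here is `((gross - budget)/budget)*100`, so a value of 200 means the movie grossed three times its budget. A minimal worked sketch follows; the helper name `profit_pct` and the `examples` figures are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Same formula as the profitpct column; the function name is illustrative only
profit_pct <- function(gross, budget) {
  ((gross - budget) / budget) * 100
}

# Hypothetical movies: a hit, a loss, and a very low-budget success
examples <- data.frame(
  gross  = c(300e6, 50e6, 1e6),
  budget = c(100e6, 100e6, 10e3)
)
examples$profitpct <- profit_pct(examples$gross, examples$budget)

print(examples)
```

Very small budgets produce very large percentages, which is why low-budget films tend to dominate a ranking by profit percentage rather than by gross.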

# Of the highest-grossing movies, which is the most profitable?

freqcat5<- movies_data2 %>%
  arrange(desc(gross)) %>%
  top_n(10, gross)%>%
  top_n(10,profitpct)



plot5<- ggplot(freqcat5, aes(x=budget/1000000, y=profitpct/1000)) + geom_point()+
  ggtitle("Figure 8: Highest Grossing and Most Profitable Movies")  + ylab("Profit % (in thousands)") + xlab("Budget in Millions (USD)") +geom_text_repel(aes(label=movie_title))+ 
  theme(axis.title=element_text(size=12, face="bold"), 
        axis.text.x=element_text(size=12, face="bold"),legend.position="none")
plot(plot5)

## critical acclaim vs box office

plot6<- freqcat3%>%
  ggplot(aes(x = imdb_score, y = gross/1000000 )) + 
  geom_point()+geom_text_repel(aes(label=movie_title))+ ylab("Gross Earnings in Millions (USD)")+ xlab("IMDB Score")+ggtitle("Figure 9: Commercial Success vs. Critical Acclaim" )
plot(plot6)

## best rated
freqcat7<- movies_data2 %>%
  arrange(desc(num_voted_users)) %>%
  top_n(10, imdb_score)%>%
  top_n(10,num_voted_users)

plot7<- freqcat7%>%
  ggplot(aes(x = imdb_score, y = num_voted_users)) + 
  geom_point()+geom_text_repel(aes(label=movie_title))+ ylab("Number of Votes")+ xlab("IMDB Score")+ggtitle("Figure 5: Best Rated Movies on IMDB" )
plot(plot7)

#Correlation heat map

ggcorr(movies_data, label = TRUE, label_round = 2, label_size = 2, size = 2, hjust = .85) +
  ggtitle("Figure 10: Correlation Heatmap") +
  theme(plot.title = element_text(hjust = 0.5))

## Warning in ggcorr(movies_data, label = TRUE, label_round = 2, label_size
## = 2, : data in column(s) 'color', 'director_name', 'actor_2_name',
## 'genres', 'actor_1_name', 'movie_title', 'actor_3_name', 'plot_keywords',
## 'movie_imdb_link', 'language', 'country', 'content_rating' are not numeric
## and were ignored

#There doesn't seem to be any strong correlation between the ratings and any of the continuous variables.
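
As the warning above shows, the heatmap is built from the numeric columns only. The same idea can be sketched in base R: keep the numeric columns and compute pairwise Pearson correlations with `cor()`. The `toy` data frame below is hypothetical, and the assumption that this mirrors the defaults of `ggcorr()` (pairwise observations, Pearson) should be checked against the GGally documentation.

```r
# Hypothetical data: two numeric columns and one non-numeric column
toy <- data.frame(
  imdb_score = c(6.1, 7.4, 5.2, 8.0, 6.8),
  duration   = c(95, 130, 88, 150, 110),
  language   = c("English", "English", "French", "English", "Hindi"),
  stringsAsFactors = FALSE
)

# Keep only numeric columns; non-numeric ones cannot enter a correlation matrix
numeric_cols <- sapply(toy, is.numeric)

# Pairwise Pearson correlations between the remaining columns
corr_matrix <- cor(toy[, numeric_cols],
                   use = "pairwise.complete.obs",
                   method = "pearson")

print(round(corr_matrix, 2))
```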

# Multiple Linear Regression  
model1_fit <- lm(imdb_score ~ movie_facebook_likes + gross + num_critic_for_reviews, data=movies_data)
summary(model1_fit)

## 
## Call:
## lm(formula = imdb_score ~ movie_facebook_likes + gross + num_critic_for_reviews, 
##     data = movies_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8950 -0.5680  0.0709  0.6805  2.5482 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.009e+00  2.939e-02 204.468  < 2e-16 ***
## movie_facebook_likes   3.649e-06  1.061e-06   3.439  0.00059 ***
## gross                  9.579e-10  2.613e-10   3.665  0.00025 ***
## num_critic_for_reviews 2.261e-03  1.931e-04  11.710  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9869 on 3764 degrees of freedom
## Multiple R-squared:  0.1257, Adjusted R-squared:  0.1251 
## F-statistic: 180.5 on 3 and 3764 DF,  p-value: < 2.2e-16

# Multiple Linear Regression  
model2_fit <- lm(imdb_score ~ num_voted_users + 
                   duration +
                   num_critic_for_reviews, 
                 data=movies_data)
summary(model2_fit)

## 
## Call:
## lm(formula = imdb_score ~ num_voted_users + duration + num_critic_for_reviews, 
##     data = movies_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7842 -0.5063  0.0913  0.6281  2.5347 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.913e+00  7.508e-02   65.44  < 2e-16 ***
## num_voted_users        2.440e-06  1.242e-07   19.64  < 2e-16 ***
## duration               1.068e-02  6.821e-04   15.65  < 2e-16 ***
## num_critic_for_reviews 7.360e-04  1.469e-04    5.01  5.7e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8949 on 3764 degrees of freedom
## Multiple R-squared:  0.2812, Adjusted R-squared:  0.2806 
## F-statistic: 490.8 on 3 and 3764 DF,  p-value: < 2.2e-16
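
To read the second model's coefficients in practical terms, the estimates printed above can be plugged into the fitted equation by hand. In the sketch below the coefficients are copied from the `summary(model2_fit)` output, while the `movie` values are hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the summary(model2_fit) output
coefs <- c(
  intercept              = 4.913,
  num_voted_users        = 2.440e-06,
  duration               = 1.068e-02,
  num_critic_for_reviews = 7.360e-04
)

# Hypothetical movie (illustrative values, not a row from the data set)
movie <- c(
  num_voted_users        = 100000,
  duration               = 120,
  num_critic_for_reviews = 200
)

# Contribution of each predictor, then the predicted IMDB score
contributions <- coefs[names(movie)] * movie
predicted <- coefs[["intercept"]] + sum(contributions)

print(contributions)
print(predicted)
```

Under this model, 100,000 additional votes are associated with about 0.24 more rating points, and 10 extra minutes of running time with about 0.11 more. With an adjusted R-squared of 0.2806, most of the variation in ratings remains unexplained.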

# diagnostic plots 
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page 
plot(model2_fit)
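
Calling `plot()` on an `lm` object draws four diagnostic plots by default (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage), which is why a 2x2 layout is set first. A minimal sketch of the same pattern using `par(mfrow = ...)` and restoring the previous layout afterwards; it uses R's built-in `cars` data set as a stand-in, not the movie data.

```r
# Stand-in model on the built-in cars data set (not the IMDB data)
fit <- lm(dist ~ speed, data = cars)

# Switch to a 2x2 grid, remembering the previous graphics settings
old_par <- par(mfrow = c(2, 2))

# Draws the four default diagnostic plots, one per panel
plot(fit)

# Restore the previous layout so later plots use a single panel again
par(old_par)
```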
