Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(grid)
library(gridExtra)
library(corrplot)
library(polycor)

Load data


load("C:/Users/hlo/Desktop/movies.Rdata")

Part 1: Data

The data were obtained from IMDB and Rotten Tomatoes. The data set comprises 651 randomly sampled movies produced and released before 2016. It contains 32 variables describing the movies, including the genre of the movie (genre), the year the movie was released (year), the audience score on Rotten Tomatoes (audience_score), and the critics score on Rotten Tomatoes (critics_score), among others.

Rotten Tomatoes: A website launched in 1998 and dedicated to film reviews; it is widely known as a film review aggregator. Its Tomatometer rating is the percentage of professional critic reviews that are positive for a given film or television show, and millions of people use it every day to help with their viewing decisions.

The Tomatometer rating is based on the published opinions of hundreds of film and television critics. A title is Certified Fresh at 75% or higher, provided it has the required number of reviews. Fresh titles score 60% to 74%, and Rotten titles score 59% or less.
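The thresholds above can be written down as a small rule. The sketch below is only an illustration: `tomatometer_band` is a hypothetical helper, and because the "required number of reviews" is not specified here, it is passed in as a plain logical flag (an assumption).

```r
# Hypothetical helper: map a Tomatometer percentage to the bands quoted above.
# enough_reviews stands in for the unspecified review-count requirement
# (an assumption made only for this illustration).
tomatometer_band <- function(score, enough_reviews = FALSE) {
  ifelse(score >= 75 & enough_reviews, "Certified Fresh",
         ifelse(score >= 60, "Fresh", "Rotten"))
}

bands <- tomatometer_band(c(96, 80, 61, 59),
                          enough_reviews = c(TRUE, FALSE, FALSE, FALSE))
print(bands)
# "Certified Fresh" "Fresh" "Fresh" "Rotten"
```

Treating a score of 75% or more without enough reviews as Fresh is a choice made for this sketch, not something stated in the description above.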

IMDB: An online database of information related to films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews.

Scope of Inference: This is an observational study that uses random sampling to select movies from the population of U.S. movies. Because of the random sampling, the results are generalizable to U.S. movies released between 1970 and 2014. Because the study is observational, it can only show associations; association does not imply causation.

Part 2: Research question


My friend is a big supporter of IMDB and wants to watch movies with high IMDB ratings. How can I use statistical modeling in recommending movies to him that have high IMDB ratings?

I am curious about what goes into IMDB ratings. Is there a way to predict how movies are scored? Figuring out what goes into their algorithm is what makes this question interesting.

Part 3: Exploratory data analysis

# First drop the variables that are irrelevant to the research question.
# Audience rating and audience score are also dropped because the response of interest is the IMDB rating.

movies$title<-NULL
movies$studio<-NULL
movies$thtr_rel_year<-NULL
movies$thtr_rel_month<-NULL
movies$thtr_rel_day<-NULL
movies$dvd_rel_year<-NULL
movies$dvd_rel_month<-NULL
movies$dvd_rel_day<-NULL


movies$critics_rating<-NULL
movies$audience_rating<-NULL
movies$audience_score<-NULL
movies$actor1<-NULL
movies$actor2<-NULL
movies$actor3<-NULL
movies$actor4<-NULL
movies$actor5<-NULL
movies$imdb_url<-NULL
movies$rt_url<-NULL
movies$director<-NULL
movies$title_type <-NULL
movies$title_type.Documentary<-NULL
movies$title_type.Feature.Film<-NULL
movies$title_type.TV.Movie<-NULL

#Calculate Summary of each variable
summary(movies)

##                 genre        runtime       mpaa_rating   imdb_rating   
##  Drama             :305   Min.   : 39.0   G      : 19   Min.   :1.900  
##  Comedy            : 87   1st Qu.: 92.0   NC-17  :  2   1st Qu.:5.900  
##  Action & Adventure: 65   Median :103.0   PG     :118   Median :6.600  
##  Mystery & Suspense: 59   Mean   :105.8   PG-13  :133   Mean   :6.493  
##  Documentary       : 52   3rd Qu.:115.8   R      :329   3rd Qu.:7.300  
##  Horror            : 23   Max.   :267.0   Unrated: 50   Max.   :9.000  
##  (Other)           : 60   NA's   :1                                    
##  imdb_num_votes   critics_score    best_pic_nom best_pic_win
##  Min.   :   180   Min.   :  1.00   no :629      no :644     
##  1st Qu.:  4546   1st Qu.: 33.00   yes: 22      yes:  7     
##  Median : 15116   Median : 61.00                            
##  Mean   : 57533   Mean   : 57.69                            
##  3rd Qu.: 58301   3rd Qu.: 83.00                            
##  Max.   :893008   Max.   :100.00                            
##                                                             
##  best_actor_win best_actress_win best_dir_win top200_box
##  no :558        no :579          no :608      no :636   
##  yes: 93        yes: 72          yes: 43      yes: 15   
##                                                         
##                                                         
##                                                         
##                                                         
##

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  12 variables:
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
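The `str()` output above lists the 12 columns that remain. As an alternative to setting columns to `NULL` one at a time, the same result could be reached by keeping a named set of columns. The sketch below runs on a toy data frame (`toy` is made up for illustration); only the names in `keep` come from the output above.

```r
# Columns to keep, taken from the str() output above.
keep <- c("genre", "runtime", "mpaa_rating", "imdb_rating", "imdb_num_votes",
          "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win",
          "best_actress_win", "best_dir_win", "top200_box")

# Toy stand-in for the movies data (values are illustrative only).
toy <- data.frame(title = c("A", "B"), genre = c("Drama", "Comedy"),
                  runtime = c(101, 84), imdb_rating = c(7.3, 7.6),
                  rt_url = c("u1", "u2"))

# Report requested columns that are absent instead of failing silently.
missing_cols <- setdiff(keep, names(toy))

# Keep only the requested columns that exist, in the order given by keep.
toy_kept <- toy[, intersect(keep, names(toy)), drop = FALSE]
print(names(toy_kept))
# "genre" "runtime" "imdb_rating"
```

On the real data the same indexing would be applied to `movies`; a keep-list makes the retained set explicit and avoids repeated or misspelled removals.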

# Plot of the first four graphs #
# Genre, Runtime, MPAA ratings, and IMDB ratings #

ggplot(data=movies , aes(x=genre) ) + geom_bar(color = "blue") + ggtitle("Genre of Movies") + theme(axis.title.x=element_blank()) + theme(axis.text.x = element_text(angle=90))

p2<-ggplot(data = movies, aes(x=runtime, color = (runtime)))+ geom_bar() + ggtitle("Runtime of Movies") + xlim(0,250) + theme(axis.title.x=element_blank())
p3<-ggplot(data=movies , aes(x=mpaa_rating) ) + geom_bar(color = "blue") + ggtitle("MPAA ratings of movies") + theme(axis.title.x=element_blank())
p4<-ggplot(data=movies , aes(x=imdb_rating)) + geom_histogram(fill ="#c0392b") + ggtitle("IMDB Ratings of Movies") + theme(axis.title.x=element_blank())
p5<-ggplot(data=movies , aes(x=imdb_num_votes)) + geom_histogram(fill ="#c0392b") + ggtitle("Number of Votes on IMDB") + theme(axis.title.x=element_blank())
grid.arrange(p2, p3, p4, p5, nrow=2)

## Warning: Removed 2 rows containing non-finite values (stat_count).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=movies , aes(x=critics_score) ) + geom_bar(color = "blue") + ggtitle("Critics Score on Rotten Tomatoes") + theme(axis.title.x=element_blank())

p7<-ggplot(data=movies , aes(x=best_pic_nom )) + geom_bar(color = "blue") + ggtitle("movie was nominated for a best picture?") +
theme(axis.title.x=element_blank())

p8<-ggplot(data=movies , aes(x=best_pic_win )) + geom_bar(color = "blue") + ggtitle("movie won a best picture Oscar?") +
theme(axis.title.x=element_blank())

p9<-ggplot(data=movies , aes(x=best_actor_win) ) + geom_bar(color = "blue") + ggtitle("one of the main actors ever won an oscar?")+ theme(axis.title.x=element_blank())

grid.arrange(p7,p8, p9, nrow=2)

p10<-ggplot(data=movies , aes(x=best_actress_win) ) + geom_bar(color = "blue") + ggtitle("Main actress ever won an oscar?") + theme(plot.title =element_text(size =10))+ theme(axis.title.x=element_blank())

p11<-ggplot(data=movies , aes(x=best_dir_win) ) + geom_bar(color = "blue") + ggtitle("Director ever won an Oscar?")+ theme(axis.title.x=element_blank())+theme(plot.title =element_text(size =10))

p12<-ggplot(data=movies , aes(x=top200_box)) + geom_bar(color = "blue") + ggtitle(" movie in the Top 200 on BoxOfficeMojo?") + theme(axis.title.x=element_blank())+theme(plot.title =element_text(size =10))

grid.arrange(p10,p11,p12, nrow=2)

Findings: The research question asks how best to predict the IMDB rating of a particular movie from the remaining variables, that is, all variables except the ones dropped above as irrelevant. The plots give a preliminary look at each variable of interest.

Graph 1: Drama is the most common genre in the dataset, followed by Comedy and Action & Adventure.

Graph 2: Most runtimes cluster around 90 to 115 minutes (the median is 103 minutes).

Graph 3: Most movies are rated R, followed by PG-13 and PG.

Graph 4: IMDB rating is our dependent variable. Most ratings fall between 6 and 8.

Graph 5: Most movies have relatively few votes, fewer than 125,000.

Graph 6: Critics scores on Rotten Tomatoes are spread fairly evenly across the whole range, so no clear pattern stands out.

Graph 7: Most movies were not nominated for a best picture Oscar, which makes sense because only a small selection of movies is ever nominated.

Graph 8: Most movies did not win the best picture Oscar.

Graph 9: In most movies, none of the main actors has ever won an Oscar (558 of 651), which also makes sense because only a small number of actors win Oscars.

Graph 10: In most movies, none of the main actresses has ever won an Oscar (579 of 651), for the same reason.

Graph 11: In most movies, the director has never won an Oscar.

Graph 12: Most movies are not in the Top 200 box office list on BoxOfficeMojo.

From these single-variable plots alone, it is still hard to tell what makes a movie score well on IMDB.
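One possible next step is to look at the response against one predictor at a time instead of each variable on its own. The sketch below does this on simulated data, not on the movies data: the column names mirror the data set, but every value in `toy` is made up for illustration.

```r
set.seed(42)

# Simulated stand-in for the movies data (illustrative values only).
toy <- data.frame(
  genre = factor(rep(c("Drama", "Comedy", "Documentary"), each = 20)),
  critics_score = round(runif(60, 1, 100))
)
toy$imdb_rating <- 4.5 + 0.025 * toy$critics_score + rnorm(60, sd = 0.5)

# Median response within each level of a categorical predictor.
med_by_genre <- tapply(toy$imdb_rating, toy$genre, median)

# Linear association between the response and a numeric predictor.
r_cs <- cor(toy$imdb_rating, toy$critics_score)

print(med_by_genre)
print(r_cs)
```

Group medians and a simple correlation are only a first screen; they do not account for the other predictors, which is what the regression model is for.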

Part 4: Modeling


The response variable, and the variable of interest, is imdb_rating.

# Checking multicollinearity is harder here because the predictors mix numeric and categorical variables.
# Compute the correlation for every pair of variables to see whether any pair is strongly related.

# hetcor() reports the type of correlation used for each pair (Pearson, polyserial or polychoric).
# as.data.frame() yields the same 12 columns as copying them over one at a time.
movies_df <- as.data.frame(movies)
hetcor(movies_df)

## Warning in polyserial(y, x, ML = ML, std.err = std.err, bins = bins):
## initial correlation inadmissible, 1.15492467826029, set to 0.9999

## Warning in hetcor.data.frame(movies_df): could not compute polyserial correlation between variables 8 and 5
##     Message: Error in optim(rho, f, control = control, hessian = TRUE, method = "BFGS") : 
##   initial value in 'vmmin' is not finite

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                     genre    runtime mpaa_rating imdb_rating
## genre                   1 Polyserial  Polychoric  Polyserial
## runtime            0.1663          1  Polyserial     Pearson
## mpaa_rating        0.1524    0.06333           1  Polyserial
## imdb_rating        0.1261     0.2682      0.1922           1
## imdb_num_votes    0.06667     0.3472    -0.02687      0.3322
## critics_score      0.1521     0.1725      0.1731      0.7648
## best_pic_nom       0.2396      0.385    -0.08863      0.7025
## best_pic_win       0.1083     0.3285    -0.06131      0.6577
## best_actor_win     0.1575     0.3432     -0.1127      0.1063
## best_actress_win   0.1749     0.2893    -0.07744      0.1341
## best_dir_win       0.1351     0.3417    -0.02947      0.3186
## top200_box       -0.06551     0.2473     -0.3889      0.3296
##                  imdb_num_votes critics_score best_pic_nom best_pic_win
## genre                Polyserial    Polyserial   Polychoric   Polychoric
## runtime                 Pearson       Pearson   Polyserial   Polyserial
## mpaa_rating          Polyserial    Polyserial   Polychoric   Polychoric
## imdb_rating             Pearson       Pearson   Polyserial   Polyserial
## imdb_num_votes                1       Pearson   Polyserial   Polyserial
## critics_score              0.21             1   Polyserial   Polyserial
## best_pic_nom             0.3243        0.6211            1   Polychoric
## best_pic_win               <NA>        0.6383       0.8988            1
## best_actor_win           0.1095       0.07362       0.4337       0.1933
## best_actress_win          0.148        0.1055        0.506       0.5228
## best_dir_win             0.2197        0.2871       0.4286       0.8213
## top200_box               0.3083        0.3107       0.3249       0.3778
##                  best_actor_win best_actress_win best_dir_win top200_box
## genre                Polychoric       Polychoric   Polychoric Polychoric
## runtime              Polyserial       Polyserial   Polyserial Polyserial
## mpaa_rating          Polychoric       Polychoric   Polychoric Polychoric
## imdb_rating          Polyserial       Polyserial   Polyserial Polyserial
## imdb_num_votes       Polyserial       Polyserial   Polyserial Polyserial
## critics_score        Polyserial       Polyserial   Polyserial Polyserial
## best_pic_nom         Polychoric       Polychoric   Polychoric Polychoric
## best_pic_win         Polychoric       Polychoric   Polychoric Polychoric
## best_actor_win                1       Polychoric   Polychoric Polychoric
## best_actress_win         0.2923                1   Polychoric Polychoric
## best_dir_win             0.2561           0.2217            1 Polychoric
## top200_box               0.1917           0.2613       0.1728          1
## 
## Standard Errors:
##                    genre runtime mpaa_rating imdb_rating imdb_num_votes
## genre                                                                  
## runtime          0.03935                                               
## mpaa_rating       0.0422  0.0421                                       
## imdb_rating      0.03988 0.03643     0.03992                           
## imdb_num_votes   0.04103 0.03453     0.04225     0.03493               
## critics_score    0.03943 0.03809     0.04045     0.01632        0.03752
## best_pic_nom      0.1004 0.05998       0.104      0.0606        0.05169
## best_pic_win      0.1547 0.07915      0.1567     0.09643              0
## best_actor_win   0.06322 0.04775     0.06604      0.0627        0.05344
## best_actress_win 0.06923 0.05265     0.07118      0.0698        0.05241
## best_dir_win     0.07899 0.05454     0.08473      0.0817        0.05176
## top200_box        0.1063 0.07607      0.1038      0.1203        0.05407
##                  critics_score best_pic_nom best_pic_win best_actor_win
## genre                                                                  
## runtime                                                                
## mpaa_rating                                                            
## imdb_rating                                                            
## imdb_num_votes                                                         
## critics_score                                                          
## best_pic_nom           0.08296                                         
## best_pic_win            0.1136      0.06216                            
## best_actor_win         0.06147        0.109       0.1928               
## best_actress_win        0.0661       0.1047       0.1493        0.08818
## best_dir_win           0.07649       0.1256      0.08863         0.1032
## top200_box              0.1151       0.1831       0.2362         0.1485
##                  best_actress_win best_dir_win
## genre                                         
## runtime                                       
## mpaa_rating                                   
## imdb_rating                                   
## imdb_num_votes                                
## critics_score                                 
## best_pic_nom                                  
## best_pic_win                                  
## best_actor_win                                
## best_actress_win                              
## best_dir_win               0.1114             
## top200_box                 0.1484       0.1792
## 
## n = 650 
## 
## P-values for Tests of Bivariate Normality:
##                       genre    runtime mpaa_rating imdb_rating
## genre                                                         
## runtime           8.997e-33                                   
## mpaa_rating       5.245e-31  1.175e-07                        
## imdb_rating       7.244e-44  9.352e-10   2.027e-09            
## imdb_num_votes   5.219e-202 3.517e-174  9.025e-178  1.235e-180
## critics_score      7.67e-44  3.492e-13   8.408e-17    3.74e-11
## best_pic_nom        0.08457   0.003172       0.307    0.009211
## best_pic_win         0.8521  0.0001519       0.749     0.01493
## best_actor_win     0.001096    0.01492      0.1162     0.02571
## best_actress_win  9.125e-05   0.002484        0.26    0.006493
## best_dir_win         0.3004    0.03626     0.03195      0.0445
## top200_box           0.1519    0.03164      0.9536      0.0435
##                  imdb_num_votes critics_score best_pic_nom best_pic_win
## genre                                                                  
## runtime                                                                
## mpaa_rating                                                            
## imdb_rating                                                            
## imdb_num_votes                                                         
## critics_score        4.917e-186                                        
## best_pic_nom         5.486e-188     4.717e-11                          
## best_pic_win                  0     1.715e-10         <NA>             
## best_actor_win       1.575e-180     2.323e-10         <NA>         <NA>
## best_actress_win     7.927e-180     1.964e-10         <NA>         <NA>
## best_dir_win           3.4e-180     4.608e-10         <NA>         <NA>
## top200_box           2.434e-185     2.254e-10         <NA>         <NA>
##                  best_actor_win best_actress_win best_dir_win
## genre                                                        
## runtime                                                      
## mpaa_rating                                                  
## imdb_rating                                                  
## imdb_num_votes                                               
## critics_score                                                
## best_pic_nom                                                 
## best_pic_win                                                 
## best_actor_win                                               
## best_actress_win           <NA>                              
## best_dir_win               <NA>             <NA>             
## top200_box                 <NA>             <NA>         <NA>

# No variables are dropped at this point. best_pic_win is strongly correlated with best_pic_nom (0.90)
# and with best_dir_win (0.82), and moderately with best_actress_win (0.52).
# These variables stay in for now because backward elimination will remove some of them later on.
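Scanning a 12 by 12 table by eye is error-prone, so the strongly correlated pairs can also be pulled out programmatically. The sketch below works on a 3 by 3 excerpt typed in from the output above; the 0.8 cutoff is an arbitrary choice for illustration. With the polycor package, the full matrix should be available as the `correlations` element of the object returned by `hetcor()`.

```r
# Excerpt of the correlation table above (three award variables).
vars <- c("best_pic_nom", "best_pic_win", "best_dir_win")
r <- matrix(c(1,      0.8988, 0.4286,
              0.8988, 1,      0.8213,
              0.4286, 0.8213, 1),
            nrow = 3, dimnames = list(vars, vars))

# Look only above the diagonal so each pair is reported once.
cutoff <- 0.8  # arbitrary threshold, chosen for illustration
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)

high <- data.frame(var1 = rownames(r)[idx[, "row"]],
                   var2 = colnames(r)[idx[, "col"]],
                   r    = r[idx])
print(high)
```

For this excerpt the pairs flagged are best_pic_nom with best_pic_win (0.8988) and best_pic_win with best_dir_win (0.8213).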


#Start from the model with all remaining variables (the irrelevant ones were dropped earlier)
#Backward elimination is then used to find the best model

#Full model, nothing dropped yet
movies_model1 = lm(imdb_rating ~ genre + runtime+ mpaa_rating +imdb_num_votes+ critics_score + best_actor_win + best_actress_win + top200_box+ best_dir_win+ best_pic_nom+ best_pic_win + best_dir_win , data = movies)
                     
summary(movies_model1)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + mpaa_rating + imdb_num_votes + 
##     critics_score + best_actor_win + best_actress_win + top200_box + 
##     best_dir_win + best_pic_nom + best_pic_win + best_dir_win, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.85771 -0.32706  0.04526  0.39610  1.82697 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.577e+00  2.307e-01  19.836  < 2e-16 ***
## genreAnimation                 -3.124e-01  2.498e-01  -1.251  0.21148    
## genreArt House & International  5.566e-01  1.934e-01   2.878  0.00414 ** 
## genreComedy                    -9.472e-02  1.070e-01  -0.885  0.37652    
## genreDocumentary                8.226e-01  1.460e-01   5.633 2.68e-08 ***
## genreDrama                      2.150e-01  9.421e-02   2.283  0.02278 *  
## genreHorror                    -1.310e-01  1.592e-01  -0.823  0.41082    
## genreMusical & Performing Arts  5.592e-01  2.064e-01   2.709  0.00693 ** 
## genreMystery & Suspense         1.444e-01  1.200e-01   1.203  0.22934    
## genreOther                     -2.183e-02  1.810e-01  -0.121  0.90405    
## genreScience Fiction & Fantasy -4.337e-01  2.269e-01  -1.911  0.05646 .  
## runtime                         4.836e-03  1.556e-03   3.108  0.00197 ** 
## mpaa_ratingNC-17               -5.458e-01  4.834e-01  -1.129  0.25929    
## mpaa_ratingPG                  -2.354e-01  1.762e-01  -1.336  0.18199    
## mpaa_ratingPG-13               -3.187e-01  1.821e-01  -1.750  0.08061 .  
## mpaa_ratingR                   -2.081e-01  1.757e-01  -1.184  0.23672    
## mpaa_ratingUnrated             -2.865e-01  2.000e-01  -1.433  0.15247    
## imdb_num_votes                  2.047e-06  2.725e-07   7.510 2.04e-13 ***
## critics_score                   2.343e-02  1.069e-03  21.905  < 2e-16 ***
## best_actor_winyes               3.051e-03  7.593e-02   0.040  0.96797    
## best_actress_winyes            -1.158e-02  8.400e-02  -0.138  0.89044    
## top200_boxyes                  -1.876e-01  1.786e-01  -1.050  0.29403    
## best_dir_winyes                 3.634e-02  1.100e-01   0.330  0.74137    
## best_pic_nomyes                 1.443e-01  1.670e-01   0.864  0.38771    
## best_pic_winyes                -3.772e-01  2.949e-01  -1.279  0.20137    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6362 on 625 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6688, Adjusted R-squared:  0.6561 
## F-statistic: 52.58 on 24 and 625 DF,  p-value: < 2.2e-16
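A factor such as mpaa_rating enters the model as several dummy coefficients, so its individual p-values do not say whether the variable as a whole can be removed. One common way to check this is a nested-model F-test. The sketch below uses simulated data (`sim`, `score` and `rating` are made-up names, not the movies data) and is not necessarily the criterion used in this analysis.

```r
set.seed(1)
n <- 200

# Simulated data: the response depends on score but not on rating.
sim <- data.frame(
  score  = runif(n, 0, 100),
  rating = factor(sample(c("G", "PG", "R"), n, replace = TRUE))
)
sim$y <- 4 + 0.02 * sim$score + rnorm(n, sd = 0.6)

full    <- lm(y ~ score + rating, data = sim)
reduced <- update(full, . ~ . - rating)   # same model without the factor

# F-test of the reduced model against the full model (all rating levels at once).
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared of both models, for comparison.
adj <- c(full    = summary(full)$adj.r.squared,
         reduced = summary(reduced)$adj.r.squared)
print(adj)
```

A large p-value in the comparison suggests the factor adds little once the other predictors are in the model.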

#mpaa_rating dropped

movies_model2 = lm(imdb_rating ~ genre + runtime+ imdb_num_votes+ critics_score + best_actor_win + best_actress_win + top200_box+ best_dir_win+ best_pic_nom+ best_pic_win + best_dir_win , data = movies)

summary(movies_model2)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_score + best_actor_win + best_actress_win + top200_box + 
##     best_dir_win + best_pic_nom + best_pic_win + best_dir_win, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91689 -0.32569  0.03866  0.38061  1.80392 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.386e+00  1.751e-01  25.047  < 2e-16 ***
## genreAnimation                 -1.699e-01  2.286e-01  -0.743  0.45766    
## genreArt House & International  5.424e-01  1.896e-01   2.861  0.00436 ** 
## genreComedy                    -1.193e-01  1.060e-01  -1.126  0.26068    
## genreDocumentary                7.834e-01  1.318e-01   5.945 4.59e-09 ***
## genreDrama                      2.037e-01  9.198e-02   2.215  0.02713 *  
## genreHorror                    -1.249e-01  1.561e-01  -0.800  0.42382    
## genreMusical & Performing Arts  5.463e-01  2.054e-01   2.660  0.00802 ** 
## genreMystery & Suspense         1.502e-01  1.173e-01   1.281  0.20065    
## genreOther                     -3.164e-02  1.802e-01  -0.176  0.86070    
## genreScience Fiction & Fantasy -4.153e-01  2.268e-01  -1.831  0.06755 .  
## runtime                         4.348e-03  1.533e-03   2.836  0.00471 ** 
## imdb_num_votes                  2.025e-06  2.673e-07   7.577 1.27e-13 ***
## critics_score                   2.372e-02  1.035e-03  22.910  < 2e-16 ***
## best_actor_winyes              -1.043e-03  7.559e-02  -0.014  0.98899    
## best_actress_winyes            -1.547e-02  8.388e-02  -0.184  0.85375    
## top200_boxyes                  -1.681e-01  1.763e-01  -0.954  0.34061    
## best_dir_winyes                 4.190e-02  1.098e-01   0.382  0.70293    
## best_pic_nomyes                 1.406e-01  1.667e-01   0.844  0.39925    
## best_pic_winyes                -3.566e-01  2.941e-01  -1.212  0.22589    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6364 on 630 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.6559 
## F-statistic:  66.1 on 19 and 630 DF,  p-value: < 2.2e-16
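The adjusted R-squared values in the two summaries above can be reproduced from the reported R-squared, the number of observations used (n = 650, after one row was deleted for missingness) and the number of estimated slopes p. The helper name `adj_r2` is made up for this sketch.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Full model: R^2 = 0.6688, 24 slopes, 625 residual degrees of freedom.
m1 <- adj_r2(0.6688, n = 650, p = 24)

# Model without mpaa_rating: R^2 = 0.666, 19 slopes, 630 residual degrees of freedom.
m2 <- adj_r2(0.666, n = 650, p = 19)

print(round(c(model1 = m1, model2 = m2), 4))
# model1 0.6561, model2 0.6559
```

Because the penalty grows with p, adjusted R-squared can stay almost unchanged (0.6561 versus 0.6559 here) even though five coefficients were removed.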

#best_actor_win dropped

movies_model3 = lm(imdb_rating ~ genre + runtime+ imdb_num_votes+ critics_score +  best_actress_win + top200_box+ best_dir_win+ best_pic_nom+ best_pic_win + best_dir_win , data = movies)

summary(movies_model3)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_score + best_actress_win + top200_box + best_dir_win + 
##     best_pic_nom + best_pic_win + best_dir_win, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91685 -0.32568  0.03846  0.38072  1.80402 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.386e+00  1.737e-01  25.254  < 2e-16 ***
## genreAnimation                 -1.700e-01  2.284e-01  -0.744  0.45702    
## genreArt House & International  5.425e-01  1.893e-01   2.867  0.00429 ** 
## genreComedy                    -1.194e-01  1.059e-01  -1.127  0.26028    
## genreDocumentary                7.834e-01  1.316e-01   5.951 4.41e-09 ***
## genreDrama                      2.037e-01  9.188e-02   2.217  0.02699 *  
## genreHorror                    -1.248e-01  1.559e-01  -0.801  0.42348    
## genreMusical & Performing Arts  5.464e-01  2.052e-01   2.663  0.00794 ** 
## genreMystery & Suspense         1.501e-01  1.168e-01   1.285  0.19918    
## genreOther                     -3.166e-02  1.801e-01  -0.176  0.86050    
## genreScience Fiction & Fantasy -4.152e-01  2.265e-01  -1.833  0.06724 .  
## runtime                         4.344e-03  1.505e-03   2.887  0.00402 ** 
## imdb_num_votes                  2.025e-06  2.668e-07   7.591 1.14e-13 ***
## critics_score                   2.372e-02  1.035e-03  22.929  < 2e-16 ***
## best_actress_winyes            -1.553e-02  8.368e-02  -0.186  0.85279    
## top200_boxyes                  -1.682e-01  1.761e-01  -0.955  0.33978    
## best_dir_winyes                 4.182e-02  1.096e-01   0.382  0.70286    
## best_pic_nomyes                 1.404e-01  1.654e-01   0.849  0.39647    
## best_pic_winyes                -3.563e-01  2.932e-01  -1.215  0.22471    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6359 on 631 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.6564 
## F-statistic: 69.89 on 18 and 631 DF,  p-value: < 2.2e-16

#best_actress_win dropped
movies_model4 = lm(imdb_rating ~ genre + runtime+ imdb_num_votes+ critics_score+ top200_box+ best_pic_nom+ best_pic_win + best_dir_win , data = movies)

summary(movies_model4)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_score + top200_box + best_pic_nom + best_pic_win + 
##     best_dir_win, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91566 -0.33007  0.03968  0.38063  1.80605 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.390e+00  1.724e-01  25.459  < 2e-16 ***
## genreAnimation                 -1.724e-01  2.278e-01  -0.757  0.44948    
## genreArt House & International  5.412e-01  1.890e-01   2.864  0.00432 ** 
## genreComedy                    -1.212e-01  1.054e-01  -1.151  0.25027    
## genreDocumentary                7.827e-01  1.315e-01   5.953 4.37e-09 ***
## genreDrama                      2.016e-01  9.112e-02   2.212  0.02729 *  
## genreHorror                    -1.254e-01  1.557e-01  -0.805  0.42108    
## genreMusical & Performing Arts  5.466e-01  2.050e-01   2.666  0.00787 ** 
## genreMystery & Suspense         1.477e-01  1.160e-01   1.274  0.20330    
## genreOther                     -3.299e-02  1.798e-01  -0.183  0.85451    
## genreScience Fiction & Fantasy -4.153e-01  2.263e-01  -1.835  0.06701 .  
## runtime                         4.310e-03  1.492e-03   2.888  0.00401 ** 
## imdb_num_votes                  2.026e-06  2.666e-07   7.598 1.09e-13 ***
## critics_score                   2.372e-02  1.034e-03  22.947  < 2e-16 ***
## top200_boxyes                  -1.701e-01  1.756e-01  -0.968  0.33321    
## best_pic_nomyes                 1.370e-01  1.643e-01   0.834  0.40470    
## best_pic_winyes                -3.588e-01  2.926e-01  -1.226  0.22067    
## best_dir_winyes                 4.174e-02  1.095e-01   0.381  0.70319    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6354 on 632 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6659, Adjusted R-squared:  0.6569 
## F-statistic: 74.11 on 17 and 632 DF,  p-value: < 2.2e-16

#best_dir_win and best_pic_nom dropped
movies_model5 = lm(imdb_rating ~ genre + runtime+ imdb_num_votes+ critics_score+ top200_box+ best_pic_win , data = movies)

summary(movies_model5)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_score + top200_box + best_pic_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9113 -0.3260  0.0391  0.3932  1.8061 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.361e+00  1.688e-01  25.830  < 2e-16 ***
## genreAnimation                 -1.718e-01  2.276e-01  -0.755  0.45054    
## genreArt House & International  5.397e-01  1.887e-01   2.861  0.00437 ** 
## genreComedy                    -1.192e-01  1.052e-01  -1.133  0.25758    
## genreDocumentary                7.781e-01  1.310e-01   5.940 4.69e-09 ***
## genreDrama                      2.041e-01  9.089e-02   2.245  0.02509 *  
## genreHorror                    -1.224e-01  1.555e-01  -0.787  0.43171    
## genreMusical & Performing Arts  5.421e-01  2.048e-01   2.647  0.00831 ** 
## genreMystery & Suspense         1.485e-01  1.159e-01   1.281  0.20063    
## genreOther                     -2.192e-02  1.790e-01  -0.122  0.90255    
## genreScience Fiction & Fantasy -4.137e-01  2.261e-01  -1.830  0.06772 .  
## runtime                         4.544e-03  1.460e-03   3.113  0.00194 ** 
## imdb_num_votes                  2.057e-06  2.639e-07   7.796 2.63e-14 ***
## critics_score                   2.385e-02  1.022e-03  23.343  < 2e-16 ***
## top200_boxyes                  -1.741e-01  1.754e-01  -0.993  0.32117    
## best_pic_winyes                -2.356e-01  2.570e-01  -0.917  0.35963    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6348 on 634 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6655, Adjusted R-squared:  0.6576 
## F-statistic: 84.09 on 15 and 634 DF,  p-value: < 2.2e-16

#best_pic_win dropped 
movies_model6 = lm(imdb_rating ~ genre + runtime+ imdb_num_votes+ critics_score+ top200_box, data = movies)
summary(movies_model6)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_score + top200_box, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.90983 -0.32739  0.03798  0.38616  1.80740 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.381e+00  1.673e-01  26.182  < 2e-16 ***
## genreAnimation                 -1.749e-01  2.275e-01  -0.769  0.44230    
## genreArt House & International  5.356e-01  1.886e-01   2.840  0.00465 ** 
## genreComedy                    -1.249e-01  1.050e-01  -1.190  0.23467    
## genreDocumentary                7.747e-01  1.309e-01   5.917 5.36e-09 ***
## genreDrama                      2.011e-01  9.082e-02   2.214  0.02718 *  
## genreHorror                    -1.269e-01  1.554e-01  -0.817  0.41445    
## genreMusical & Performing Arts  5.417e-01  2.047e-01   2.646  0.00835 ** 
## genreMystery & Suspense         1.461e-01  1.159e-01   1.261  0.20776    
## genreOther                     -1.729e-02  1.789e-01  -0.097  0.92303    
## genreScience Fiction & Fantasy -4.134e-01  2.260e-01  -1.829  0.06790 .  
## runtime                         4.412e-03  1.453e-03   3.037  0.00249 ** 
## imdb_num_votes                  1.995e-06  2.551e-07   7.821 2.18e-14 ***
## critics_score                   2.381e-02  1.020e-03  23.329  < 2e-16 ***
## top200_boxyes                  -1.726e-01  1.754e-01  -0.984  0.32541    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6347 on 635 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6651, Adjusted R-squared:  0.6577 
## F-statistic: 90.06 on 14 and 635 DF,  p-value: < 2.2e-16

#top200_box dropped 
movies_model7 = lm(imdb_rating ~ genre + runtime+ imdb_num_votes+ critics_score, data = movies)
summary(movies_model7)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9094 -0.3305  0.0380  0.3873  1.8059 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.381e+00  1.673e-01  26.185  < 2e-16 ***
## genreAnimation                 -1.634e-01  2.272e-01  -0.719  0.47231    
## genreArt House & International  5.454e-01  1.883e-01   2.896  0.00391 ** 
## genreComedy                    -1.161e-01  1.046e-01  -1.110  0.26758    
## genreDocumentary                7.861e-01  1.304e-01   6.029 2.80e-09 ***
## genreDrama                      2.118e-01  9.017e-02   2.349  0.01915 *  
## genreHorror                    -1.172e-01  1.551e-01  -0.756  0.45006    
## genreMusical & Performing Arts  5.548e-01  2.043e-01   2.716  0.00680 ** 
## genreMystery & Suspense         1.578e-01  1.152e-01   1.369  0.17137    
## genreOther                     -1.039e-02  1.787e-01  -0.058  0.95369    
## genreScience Fiction & Fantasy -4.185e-01  2.260e-01  -1.852  0.06446 .  
## runtime                         4.350e-03  1.451e-03   2.997  0.00283 ** 
## imdb_num_votes                  1.938e-06  2.483e-07   7.804 2.48e-14 ***
## critics_score                   2.374e-02  1.018e-03  23.315  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6347 on 636 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6646, Adjusted R-squared:  0.6577 
## F-statistic: 96.92 on 13 and 636 DF,  p-value: < 2.2e-16

#Final Model (same specification as movies_model7)#
movies_model8 = lm(imdb_rating ~ genre + runtime+ imdb_num_votes+ critics_score,data = movies)
summary(movies_model8)

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + imdb_num_votes + 
##     critics_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9094 -0.3305  0.0380  0.3873  1.8059 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.381e+00  1.673e-01  26.185  < 2e-16 ***
## genreAnimation                 -1.634e-01  2.272e-01  -0.719  0.47231    
## genreArt House & International  5.454e-01  1.883e-01   2.896  0.00391 ** 
## genreComedy                    -1.161e-01  1.046e-01  -1.110  0.26758    
## genreDocumentary                7.861e-01  1.304e-01   6.029 2.80e-09 ***
## genreDrama                      2.118e-01  9.017e-02   2.349  0.01915 *  
## genreHorror                    -1.172e-01  1.551e-01  -0.756  0.45006    
## genreMusical & Performing Arts  5.548e-01  2.043e-01   2.716  0.00680 ** 
## genreMystery & Suspense         1.578e-01  1.152e-01   1.369  0.17137    
## genreOther                     -1.039e-02  1.787e-01  -0.058  0.95369    
## genreScience Fiction & Fantasy -4.185e-01  2.260e-01  -1.852  0.06446 .  
## runtime                         4.350e-03  1.451e-03   2.997  0.00283 ** 
## imdb_num_votes                  1.938e-06  2.483e-07   7.804 2.48e-14 ***
## critics_score                   2.374e-02  1.018e-03  23.315  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6347 on 636 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6646, Adjusted R-squared:  0.6577 
## F-statistic: 96.92 on 13 and 636 DF,  p-value: < 2.2e-16
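
The manual sequence of models above drops variables step by step based on their p-values. A minimal sketch of automating that loop is shown here. It assumes the built-in mtcars data as a stand-in for movies, and backward_by_p is a hypothetical helper, not part of any package. It uses drop1() with an F-test, so a multi-level factor such as genre would be tested as one whole term rather than level by level.

```r
# Hypothetical helper: backward elimination by p-value.
# Repeatedly drops the term with the largest F-test p-value
# until every remaining term has p <= alpha.
backward_by_p <- function(fit, alpha = 0.05) {
  repeat {
    tab <- drop1(fit, test = "F")
    p_values <- tab[["Pr(>F)"]][-1]   # first row is <none>
    term_names <- rownames(tab)[-1]
    if (length(p_values) == 0 || max(p_values) <= alpha) break
    worst <- term_names[which.max(p_values)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

# Stand-in data (assumption): mtcars instead of movies
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- backward_by_p(full)
formula(reduced)
```

Eliminating by p-value and eliminating by AIC do not have to agree in general, so it is worth comparing the result with the step() output.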

#Using backward elimination and the step function to find the model with the lowest AIC
backward.model<- step(movies_model1, data=movies, direction = "backward")

## Start:  AIC=-563.34
## imdb_rating ~ genre + runtime + mpaa_rating + imdb_num_votes + 
##     critics_score + best_actor_win + best_actress_win + top200_box + 
##     best_dir_win + best_pic_nom + best_pic_win + best_dir_win
## 
##                    Df Sum of Sq    RSS     AIC
## - mpaa_rating       5     2.161 255.15 -567.82
## - best_actor_win    1     0.001 252.99 -565.34
## - best_actress_win  1     0.008 253.00 -565.32
## - best_dir_win      1     0.044 253.04 -565.23
## - best_pic_nom      1     0.302 253.30 -564.57
## - top200_box        1     0.446 253.44 -564.20
## - best_pic_win      1     0.662 253.66 -563.65
## <none>                          252.99 -563.34
## - runtime           1     3.911 256.90 -555.37
## - genre            10    27.832 280.83 -515.50
## - imdb_num_votes    1    22.833 275.83 -509.18
## - critics_score     1   194.236 447.23 -195.04
## 
## Step:  AIC=-567.82
## imdb_rating ~ genre + runtime + imdb_num_votes + critics_score + 
##     best_actor_win + best_actress_win + top200_box + best_dir_win + 
##     best_pic_nom + best_pic_win
## 
##                    Df Sum of Sq    RSS     AIC
## - best_actor_win    1     0.000 255.15 -569.82
## - best_actress_win  1     0.014 255.17 -569.78
## - best_dir_win      1     0.059 255.21 -569.67
## - best_pic_nom      1     0.288 255.44 -569.08
## - top200_box        1     0.368 255.52 -568.88
## - best_pic_win      1     0.595 255.75 -568.30
## <none>                          255.15 -567.82
## - runtime           1     3.259 258.41 -561.57
## - imdb_num_votes    1    23.251 278.41 -513.13
## - genre            10    31.208 286.36 -512.82
## - critics_score     1   212.580 467.73 -175.90
## 
## Step:  AIC=-569.82
## imdb_rating ~ genre + runtime + imdb_num_votes + critics_score + 
##     best_actress_win + top200_box + best_dir_win + best_pic_nom + 
##     best_pic_win
## 
##                    Df Sum of Sq    RSS     AIC
## - best_actress_win  1     0.014 255.17 -571.78
## - best_dir_win      1     0.059 255.21 -571.67
## - best_pic_nom      1     0.291 255.45 -571.08
## - top200_box        1     0.369 255.52 -570.88
## - best_pic_win      1     0.597 255.75 -570.30
## <none>                          255.15 -569.82
## - runtime           1     3.370 258.52 -563.29
## - imdb_num_votes    1    23.302 278.46 -515.01
## - genre            10    31.225 286.38 -514.77
## - critics_score     1   212.585 467.74 -177.89
## 
## Step:  AIC=-571.78
## imdb_rating ~ genre + runtime + imdb_num_votes + critics_score + 
##     top200_box + best_dir_win + best_pic_nom + best_pic_win
## 
##                  Df Sum of Sq    RSS     AIC
## - best_dir_win    1     0.059 255.23 -573.63
## - best_pic_nom    1     0.281 255.45 -573.07
## - top200_box      1     0.379 255.55 -572.82
## - best_pic_win    1     0.607 255.78 -572.24
## <none>                        255.17 -571.78
## - runtime         1     3.368 258.54 -565.26
## - imdb_num_votes  1    23.311 278.48 -516.96
## - genre          10    31.257 286.43 -516.67
## - critics_score   1   212.597 467.77 -179.85
## 
## Step:  AIC=-573.63
## imdb_rating ~ genre + runtime + imdb_num_votes + critics_score + 
##     top200_box + best_pic_nom + best_pic_win
## 
##                  Df Sum of Sq    RSS     AIC
## - best_pic_nom    1     0.267 255.49 -574.95
## - top200_box      1     0.388 255.61 -574.65
## - best_pic_win    1     0.549 255.78 -574.24
## <none>                        255.23 -573.63
## - runtime         1     3.617 258.84 -566.48
## - imdb_num_votes  1    23.371 278.60 -518.68
## - genre          10    31.256 286.48 -518.54
## - critics_score   1   215.687 470.91 -177.49
## 
## Step:  AIC=-574.95
## imdb_rating ~ genre + runtime + imdb_num_votes + critics_score + 
##     top200_box + best_pic_win
## 
##                  Df Sum of Sq    RSS     AIC
## - best_pic_win    1     0.339 255.83 -576.09
## - top200_box      1     0.397 255.89 -575.94
## <none>                        255.49 -574.95
## - runtime         1     3.904 259.40 -567.10
## - genre          10    31.091 286.58 -520.31
## - imdb_num_votes  1    24.492 279.99 -517.45
## - critics_score   1   219.594 475.09 -173.76
## 
## Step:  AIC=-576.09
## imdb_rating ~ genre + runtime + imdb_num_votes + critics_score + 
##     top200_box
## 
##                  Df Sum of Sq    RSS     AIC
## - top200_box      1     0.390 256.22 -577.10
## <none>                        255.83 -576.09
## - runtime         1     3.716 259.55 -568.72
## - genre          10    31.089 286.92 -521.55
## - imdb_num_votes  1    24.646 280.48 -518.31
## - critics_score   1   219.266 475.10 -175.74
## 
## Step:  AIC=-577.1
## imdb_rating ~ genre + runtime + imdb_num_votes + critics_score
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        256.22 -577.10
## - runtime         1     3.619 259.84 -569.99
## - genre          10    31.632 287.85 -521.44
## - imdb_num_votes  1    24.532 280.76 -519.67
## - critics_score   1   218.987 475.21 -177.59
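
The AIC values printed by step() can be reproduced from the RSS column. A minimal sketch follows, assuming the formula that extractAIC() uses for linear models, n * log(RSS / n) + 2 * edf. Here n = 650 (651 movies minus the one observation deleted for missingness) and edf is the number of estimated coefficients including the intercept. aic_from_rss is a hypothetical helper name.

```r
# Hypothetical helper: AIC as reported by step() for an lm fit
aic_from_rss <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

n <- 650  # 651 movies, 1 deleted due to missingness

# Full model: 24 term degrees of freedom (sum of the Df column) + intercept
aic_full <- aic_from_rss(rss = 252.99, n = n, edf = 25)

# Final model: genre (10) + runtime + imdb_num_votes + critics_score + intercept
aic_final <- aic_from_rss(rss = 256.22, n = n, edf = 14)

round(c(full = aic_full, final = aic_final), 2)
```

Both values agree with the printed Start and final Step AIC up to the rounding of RSS. A lower (more negative) AIC is better, which is why step() stops once no removal lowers it further.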

#Regression Diagnostics#

results = lm(imdb_rating ~ genre + runtime+ imdb_num_votes+ critics_score, data= movies)
plot(results)

#Constant variability of residuals #
plot(results$residuals~results$fitted)

#normality of residuals #
qqnorm(results$residuals)

hist(results$residuals)

Linearity: The residuals-versus-fitted plot appears acceptable, although more fitted values are clustered near the middle of the range. Normality: The residuals appear approximately normal, with slight curvature at the ends of the Q-Q plot. Equal variance: This is the biggest concern, as the scale-location plot suggests the residuals are more variable at lower fitted values. One approach would be to transform the data; however, I don't think the issue is serious enough, and a transformation would make the model more challenging to interpret in this scenario.
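
The equal-variance concern can also be checked numerically rather than only by eye. A minimal sketch follows, on simulated data (an assumption, not the movies data) where the noise is deliberately larger at low fitted values. It uses the Spearman correlation between the absolute residuals and the fitted values as a rough signal: a value near 0 is consistent with constant variance, while a clearly negative value means the spread shrinks as the fitted value grows.

```r
set.seed(1)

# Simulated data (assumption): noise is larger when x, and so the fit, is small
x <- runif(500, min = 1, max = 10)
y <- 2 + 0.5 * x + rnorm(500, sd = 1.5 / x)

fit <- lm(y ~ x)

# Rough heteroscedasticity signal: do absolute residuals trend with fitted values?
rho <- cor(abs(resid(fit)), fitted(fit), method = "spearman")
rho

# The same four diagnostic plots as plot(results), arranged in one 2 x 2 panel
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```

This is only a quick screen, not a formal test of constant variance.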

Adjusted R-squared in this model is 0.6561. Let's start using the backward elimination method, removing the variables with high p-values one at a time until all remaining variables are significant.

Part 5: Prediction

Calculate the 95% confidence interval for the slopes: point estimate ± margin of error. Is there a way to create a range of predicted values?
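
As a worked example of "point estimate ± margin of error", the sketch below rebuilds the 95% interval for the critics_score slope by hand. It takes the estimate (2.374e-02), standard error (1.018e-03), and residual degrees of freedom (636) from the final model summary. The variable names are only illustrative.

```r
# Values copied from the final model summary (critics_score row)
estimate <- 2.374e-02
std_error <- 1.018e-03
df_resid <- 636

# Critical t value for a two-sided 95% interval
t_star <- qt(0.975, df = df_resid)

margin <- t_star * std_error
ci <- c(lower = estimate - margin, upper = estimate + margin)
round(ci, 5)
```

The result should match the critics_score row of confint(movies_model8) up to rounding of the printed coefficients.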

imdb_rating ~ genre + runtime + imdb_num_votes + critics_score

I use my model to predict the rating of a new movie from 2016. The movie I picked is one of my favorites this year, “Deadpool”. From a quick Google search I was able to find its genre, runtime, imdb_num_votes, and critics_score.

genre = “Action & Adventure”, runtime = 108, imdb_num_votes = 517,886, critics_score = 84 (from Rotten Tomatoes)

https://www.rottentomatoes.com/m/deadpool http://www.imdb.com/title/tt1431045/?ref_=adv_li_tt
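
Before calling predict(), the point estimate can be worked out by hand from the final model coefficients. A minimal sketch follows, using the rounded coefficients printed in the summary. “Action & Adventure” is the reference level of genre (it has no coefficient row), so no genre term is added.

```r
# Rounded coefficients copied from the final model summary
coefs <- c(intercept = 4.381,
           runtime = 4.350e-03,
           imdb_num_votes = 1.938e-06,
           critics_score = 2.374e-02)

# Deadpool inputs; genre is the baseline level, so it contributes 0
inputs <- c(intercept = 1,
            runtime = 108,
            imdb_num_votes = 517886,
            critics_score = 84)

point_estimate <- sum(coefs * inputs)
round(point_estimate, 2)
```

Because the coefficients are rounded, the value can differ from the fit column of predict() in the last digit or so.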


#Predicting Deadpool IMDB rating. 


predict(movies_model8, data.frame(genre="Action & Adventure",runtime = 108, imdb_num_votes= 517886,
critics_score = 84), interval = "confidence")

confint(movies_model8)

par("mar")
par(mar=c(.2,.2,.2,.2))
pairs(~imdb_rating+imdb_num_votes+critics_score+runtime +genre, data = movies,
main = "Simple Scatterplot Matrix")

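From the prediction, the fitted IMDB rating for Deadpool is about 7.849, and the upper limit of the 95% confidence interval is 8.106. Because interval = "confidence" was used, this interval describes the mean rating of movies with these characteristics; a prediction interval for a single movie would be wider. Deadpool is currently rated 8.1 on the site, which is within the range.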
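
The difference between the two interval types in predict() can be seen on any linear model. A minimal sketch follows, assuming the built-in mtcars data as a stand-in for movies. The model and the new observation are only illustrative.

```r
# Stand-in model (assumption): mtcars instead of movies
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)

# Interval for the MEAN response at these predictor values
conf_int <- predict(fit, new_car, interval = "confidence", level = 0.95)

# Interval for a SINGLE new observation at these predictor values
pred_int <- predict(fit, new_car, interval = "prediction", level = 0.95)

# Both share the same point estimate (fit); only lwr and upr differ
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

The columns are always fit, lwr, upr in that order, and the prediction interval is wider because it also accounts for the scatter of individual observations around the mean.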

Part 6: Conclusion


In summary, my research question was which high-rated IMDB movies to recommend to my friend. The analysis finds that the genre of the movie, its runtime, its number of votes on IMDB, and its Rotten Tomatoes critics score are all significantly associated with the IMDB rating. Movies with a longer runtime, more IMDB votes, a higher Rotten Tomatoes critics score, and a genre of “Art House & International”, “Documentary”, “Drama”, or “Musical & Performing Arts” tend to have higher IMDB ratings than other selections. “Science Fiction & Fantasy” has a negative coefficient (-0.4185, p = 0.064), so it is associated with lower rather than higher ratings relative to the “Action & Adventure” baseline. Movies in the higher-rated categories are the selection I would recommend to my friend.

I learned that many variables go into creating a good prediction with linear regression. Moreover, a model becomes more complex and difficult to build as more variables are added, because there are many things to consider, such as multicollinearity and interactions between the variables.

Some shortcomings are easy to foresee. I am currently also taking a Machine Learning course from the University of Washington on Coursera, which covers how to go beyond regression models to predict even more accurately. One shortcoming is that this model is still not that strong: the adjusted R-squared is only 0.6577, which means that only roughly 2/3 of the variability in IMDB ratings is explained by the model. Ideas for possible future research include building a machine learning model, for example with a k-means or k-nearest-neighbors technique, to recommend specific movies.
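
The adjusted R-squared quoted here can be reproduced from the multiple R-squared in the final model summary. A minimal sketch follows, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 650 observations and p = 13 estimated slopes (the F-statistic is on 13 and 636 DF). adjusted_r2 is a hypothetical helper name.

```r
# Hypothetical helper: adjusted R-squared from multiple R-squared
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values from the final model summary
adj <- adjusted_r2(r2 = 0.6646, n = 650, p = 13)
round(adj, 4)
```

The penalty for the 13 slopes is small here because n is large, so the adjusted value stays close to the multiple R-squared.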

In this scenario, we are only able to understand which characteristics are associated with a high IMDB rating. With a k-means model, we could recommend a few specific movies. For example, if my friend likes the movie “Deadpool”, the ML model could recommend other action movies from Marvel Entertainment, such as Captain America, Ant-Man, and Iron Man, just to name a few.