library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

load("movies.Rdata")

Generalizability: The data set comprises 651 randomly sampled movies produced and released before 2016, so the sample was drawn at random rather than by targeting specific groups. Random sampling matters because a non-random sample would be biased and unreliable; we should also avoid variables that introduce collinearity and aim for a parsimonious model.
Causality: This dataset supports conclusions about association, not causation, since these variables are not the actual causes of how audiences feel when they score movies on IMDB. I'll analyze the correlations between these variables in my EDA below.
“I want to make a model to predict the IMDB ratings of 2016 movies. The variables that survive selection will be the ones used for the prediction.”
Thus, I will try to find 4-5 variables that can be used for the prediction. These variables should be independent of each other and highly correlated with ‘imdb_rating’. I’ll use the methods learned in this class: ‘ggpairs’ from the ‘GGally’ package to get an overview of the associations within the dataset, backward selection to choose the most appropriate variables, and linear modeling to predict 2016 IMDB ratings from the related variables.
*** Remove unrelated variables
From knowledge and experience, some of these variables exist only for informational purposes. # Remove: imdb_url, rt_url
We will not discuss the type of movie here, and will focus on the other variables in this dataset. # Remove: title_type, genre, runtime
movies_re1 <- movies %>%
select(-imdb_url,-rt_url,-title_type, -genre, -runtime)
colnames(movies_re1)

## [1] "title" "mpaa_rating" "studio"
## [4] "thtr_rel_year" "thtr_rel_month" "thtr_rel_day"
## [7] "dvd_rel_year" "dvd_rel_month" "dvd_rel_day"
## [10] "imdb_rating" "imdb_num_votes" "critics_rating"
## [13] "critics_score" "audience_rating" "audience_score"
## [16] "best_pic_nom" "best_pic_win" "best_actor_win"
## [19] "best_actress_win" "best_dir_win" "top200_box"
## [22] "director" "actor1" "actor2"
## [25] "actor3" "actor4" "actor5"
*** Discussion of time information

Since I want to make predictions for 2016 movies, I need to make sure that time is not a main driver of the rating.
movies$time_theater <- paste( movies$thtr_rel_year,movies$thtr_rel_month, movies$thtr_rel_day, sep="-" )
movies$time_DVD <- paste( movies$dvd_rel_year,movies$dvd_rel_month, movies$dvd_rel_day, sep="-" )

I’ll also split this into two views, by theater release date and by DVD release date, so there will be two graphs showing the relationship between release time and rating.
movies$time_theater <- as.Date( movies$time_theater)
ggplot(movies, aes(time_theater, imdb_rating)) + geom_line(size=0.3,color="black") + xlab("Date released in Theater") + ylab("IMDB Rating")

movies$time_DVD <- as.Date(movies$time_DVD)
ggplot(movies, aes(time_DVD, imdb_rating)) + geom_line(size=0.3,color="black") + xlab("Date released on DVD") + ylab("IMDB Rating")

From these graphs, we can see there is no large year-to-year difference that would influence the IMDB rating.
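As a numeric check on the year effect, the correlation between theater release year and rating can be computed directly (a minimal sketch):

# Correlation between theater release year and IMDB rating
movies %>%
summarise(cor_year = cor(thtr_rel_year, imdb_rating, use = "complete.obs"))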
movies_thtr_monthly <- movies%>%
group_by(thtr_rel_month)%>%
summarise(imdb_rating_mean = mean(imdb_rating))%>%
mutate(month = factor(month.abb[thtr_rel_month], levels = month.abb))  # factor levels keep the bars in calendar order
ggplot(movies_thtr_monthly,aes(x=month,y=imdb_rating_mean))+geom_bar(stat= "identity")+xlab("Month")+ylab("Mean of Rating")+ theme_classic()

I thought that July might have more highly rated releases, but I was wrong: there is no large difference between months.
Based on this exploration, I'll remove ‘thtr_rel_year, thtr_rel_month, thtr_rel_day, dvd_rel_year, dvd_rel_month, dvd_rel_day’. # Remove: time-related variables
movies_re2 <- movies_re1 %>%
select(-thtr_rel_year, -thtr_rel_month, -thtr_rel_day, -dvd_rel_year, -dvd_rel_month, -dvd_rel_day)
colnames(movies_re2)

## [1] "title" "mpaa_rating" "studio"
## [4] "imdb_rating" "imdb_num_votes" "critics_rating"
## [7] "critics_score" "audience_rating" "audience_score"
## [10] "best_pic_nom" "best_pic_win" "best_actor_win"
## [13] "best_actress_win" "best_dir_win" "top200_box"
## [16] "director" "actor1" "actor2"
## [19] "actor3" "actor4" "actor5"
*** Discussion of actors and directors

These variables have hundreds of distinct levels (nearly one per movie), so they cannot usefully enter a linear model; I'll remove them, but still explore them briefly below.
movies_re3 <- movies_re2 %>%
select(-director, -actor1, -actor2, -actor3, -actor4, -actor5,-studio)
colnames(movies_re3)

## [1] "title" "mpaa_rating" "imdb_rating"
## [4] "imdb_num_votes" "critics_rating" "critics_score"
## [7] "audience_rating" "audience_score" "best_pic_nom"
## [10] "best_pic_win" "best_actor_win" "best_actress_win"
## [13] "best_dir_win" "top200_box"
Which directors have the highest average IMDB rating? Let's display the top twenty (the top names are B.Z. Goldberg / Christopher Nolan / Dereck Joubert / Jim Brown / Jonathan Karsh).
temp <- movies%>% select(director,imdb_rating)
temp <- temp %>% group_by(director) %>% summarise(avg=mean(imdb_rating))
temp <- temp %>% arrange(desc(avg))
temp <- temp[1:20,]
ggplot(data=temp,aes(x=director,y=avg))+geom_bar(stat="identity")+coord_flip()+theme_grey()
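Note that ggplot orders the bars alphabetically by director name; to rank them by average rating instead, reorder() can be used (a minimal sketch over the temp data frame built above):

# Sort bars by average rating rather than alphabetically
ggplot(temp, aes(x = reorder(director, avg), y = avg)) +
geom_bar(stat = "identity") + coord_flip() + theme_grey() +
xlab("Director") + ylab("Average IMDB rating")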
Which leading actors in recent movies (released in 2000 or later) have the highest average IMDB rating? Let's display the top twenty actors (the top names are Yarko / Jeremy Irons / Pete Seeger / Susan Tom / Bruce Balden / Joel Edgerton).

temp1 <- movies %>%
filter(thtr_rel_year >= 2000 )%>%
select(actor1,imdb_rating)
temp1 <- temp1 %>% group_by(actor1) %>% summarise(avg=mean(imdb_rating))
temp1 <- temp1 %>% arrange(desc(avg))
temp1 <- temp1[1:20,]
ggplot(data=temp1,aes(x=actor1,y=avg))+geom_bar(stat="identity")+coord_flip()+theme_grey()

*** Remove outliers and check whether they influence the prediction
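Before filtering, a quick look at the rating distribution shows where the extreme values sit (a minimal sketch):

# Distribution of IMDB ratings; points beyond the whiskers are outliers
summary(movies$imdb_rating)
ggplot(movies, aes(x = "", y = imdb_rating)) + geom_boxplot() + xlab("") + ylab("IMDB Rating")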
movies%>%
summarise(cor(imdb_rating,audience_score))

## Source: local data frame [1 x 1]
##
## cor(imdb_rating, audience_score)
## (dbl)
## 1 0.8648652
movies_test <- movies_re3 %>%
filter(imdb_rating >= 3.3 & imdb_rating < 8.5)

movies_test %>%
summarise(cor(imdb_rating,audience_score))

## Source: local data frame [1 x 1]
##
## cor(imdb_rating, audience_score)
## (dbl)
## 1 0.8629269
The correlation barely changes (0.8648652 vs. 0.8629269), so these outliers have little influence on the association.

While exploring, I noticed that the "Unrated" tag in ‘mpaa_rating’ may change the outcome of my prediction, since "Unrated" doesn't convey anything meaningful about a movie. Thus, I'll remove the movies of unrated type to see whether this makes a big difference to the association between imdb_rating and audience_score. (0.8546815)
movies_test2 <- movies_re3 %>%
filter(mpaa_rating != "Unrated")
movies_test2 %>%
summarise(cor(imdb_rating,audience_score))## Source: local data frame [1 x 1]
##
## cor(imdb_rating, audience_score)
## (dbl)
## 1 0.8546815
0.8546815 is slightly lower, but I'll remove this variable anyway: keeping it would limit my predictions, since I would have to exclude all "Unrated" movies from any new dataset.
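For reference, a quick tally shows how many movies carry each MPAA rating, including "Unrated" (a minimal sketch):

# How many movies fall in each mpaa_rating category?
movies_re3 %>% count(mpaa_rating)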
movies_re4 <- movies_re3 %>%
select(-mpaa_rating)

Finally, after removing all unrelated variables, I am ready to move on to modeling: dropping low-correlation variables and building the best prediction model.
*** Check the remaining variables

After dealing with the unrelated variables, 13 variables remain; setting aside the movie title and imdb_rating itself, the rest are candidate predictors.
Numerical variables: imdb_rating, imdb_num_votes, critics_score, audience_score
Factor variables: critics_rating, audience_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box
str(movies_re4)

## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 13 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
movies_re4 %>%
summarise(cor(imdb_rating,imdb_num_votes),
cor(imdb_rating,critics_score),
cor(imdb_rating,audience_score))

## Source: local data frame [1 x 3]
##
## cor(imdb_rating, imdb_num_votes) cor(imdb_rating, critics_score)
## (dbl) (dbl)
## 1 0.3311525 0.7650355
## Variables not shown: cor(imdb_rating, audience_score) (dbl)
movies %>%
select(imdb_rating,imdb_num_votes,
critics_score, audience_score) %>%
ggpairs()

Next, test whether the mean critics_score differs across the critics_rating levels (an ANOVA), to check how much information the two variables share.

inference(y=critics_score, x=critics_rating, data=movies_re4, statistic = "mean", type = "ht", alternative = "greater", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (3 levels)
## n_Certified Fresh = 135, y_bar_Certified Fresh = 86.8593, s_Certified Fresh = 7.2505
## n_Fresh = 209, y_bar_Fresh = 77.3445, s_Fresh = 12.6649
## n_Rotten = 307, y_bar_Rotten = 31.4788, s_Rotten = 15.9091
##
## ANOVA:
## df Sum_Sq Mean_Sq F p_value
## critics_rating 2 406517.5644 203258.7822 1117.5633 < 0.0001
## Residuals 648 117856.1345 181.8768
## Total 650 524373.6989
##
## Pairwise tests - t tests with pooled SD:
## group1 group2 p.value
## 1 Fresh Certified Fresh 3.179e-10
## 2 Rotten Certified Fresh 5.256e-176
## 4 Rotten Fresh 1.128e-166
We reject the null hypothesis: mean critics_score clearly differs across the critics_rating levels, so I don't need to remove either of these variables. Repeat the same test for audience_score and audience_rating.
inference(y=audience_score, x=audience_rating, data=movies_re4, statistic = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Spilled = 275, y_bar_Spilled = 41.9345, s_Spilled = 11.217
## n_Upright = 376, y_bar_Upright = 77.3032, s_Upright = 9.3317
## H0: mu_Spilled = mu_Upright
## HA: mu_Spilled > mu_Upright
## t = -42.6058, df = 274
## p_value = 1
Here the one-sided alternative points the wrong way (Spilled is the lower-scoring group), which is why the p-value is 1; the huge t statistic (-42.6) shows the two group means clearly differ. As before, I don't need to remove either of these variables.
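A direct two-sided check confirms that the two groups differ (a minimal sketch using base R's t.test):

# Two-sided test: do mean audience scores differ between Spilled and Upright?
t.test(audience_score ~ audience_rating, data = movies_re4)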
movies_re5 <- movies_re4 %>%
select(-imdb_num_votes)

I drop imdb_num_votes here because its correlation with imdb_rating is only about 0.33, far weaker than critics_score (about 0.77) and audience_score.

*** Backward Selection

Make sure the remaining variables produce the best prediction model.
movie_lm1 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+ best_dir_win+top200_box,data=movies_re5)
summary(movie_lm1)

##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + critics_rating +
## audience_rating + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box, data = movies_re5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30802 -0.21403 0.03354 0.27988 1.18664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.948464 0.142129 20.745 < 2e-16 ***
## critics_score 0.015858 0.001561 10.156 < 2e-16 ***
## audience_score 0.042576 0.002119 20.094 < 2e-16 ***
## critics_ratingFresh 0.032948 0.056289 0.585 0.558523
## critics_ratingRotten 0.310520 0.092135 3.370 0.000796 ***
## audience_ratingUpright -0.367011 0.075800 -4.842 1.62e-06 ***
## best_pic_nomyes -0.008349 0.123027 -0.068 0.945919
## best_pic_winyes 0.111602 0.216812 0.515 0.606913
## best_actor_winyes 0.083216 0.054792 1.519 0.129322
## best_actress_winyes 0.094051 0.061471 1.530 0.126511
## best_dir_winyes 0.094908 0.080461 1.180 0.238617
## top200_boxyes 0.005712 0.126594 0.045 0.964027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4752 on 639 degrees of freedom
## Multiple R-squared: 0.8113, Adjusted R-squared: 0.8081
## F-statistic: 249.8 on 11 and 639 DF, p-value: < 2.2e-16
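As an aside, R's built-in step() automates backward elimination using AIC; a minimal sketch (AIC-based selection, so it may keep a different subset than the manual p-value approach used below):

# Automated backward selection by AIC, for comparison
movie_step <- step(movie_lm1, direction = "backward", trace = FALSE)
summary(movie_step)$adj.r.squared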
Remove top200_box, best_pic_nom, and best_pic_win, which have the highest p-values, and see whether the adjusted R-squared can exceed 0.8081.
movie_lm1 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actor_win+best_actress_win+ best_dir_win,data=movies_re5)
summary(movie_lm1)

##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + critics_rating +
## audience_rating + best_actor_win + best_actress_win + best_dir_win,
## data = movies_re5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30746 -0.21062 0.03402 0.28047 1.18703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.950315 0.141570 20.840 < 2e-16 ***
## critics_score 0.015859 0.001558 10.178 < 2e-16 ***
## audience_score 0.042593 0.002104 20.242 < 2e-16 ***
## critics_ratingFresh 0.028458 0.054618 0.521 0.602520
## critics_ratingRotten 0.307161 0.091250 3.366 0.000808 ***
## audience_ratingUpright -0.367188 0.075550 -4.860 1.48e-06 ***
## best_actor_winyes 0.082581 0.053978 1.530 0.126537
## best_actress_winyes 0.097534 0.060349 1.616 0.106549
## best_dir_winyes 0.108698 0.076189 1.427 0.154153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4742 on 642 degrees of freedom
## Multiple R-squared: 0.8112, Adjusted R-squared: 0.8089
## F-statistic: 344.8 on 8 and 642 DF, p-value: < 2.2e-16
Adjusted R-squared: 0.8089, which means I've built a slightly better model. But I still want to see whether it can be improved further, so I'll try removing "critics_rating", "audience_rating", "best_actor_win", "best_actress_win", and "best_dir_win" individually to find the highest adjusted R-squared. (Baseline: 0.8089)
Remove: critics_rating
movie_lm2 <- lm(imdb_rating~critics_score+audience_score+ audience_rating+best_actor_win+best_actress_win+ best_dir_win,data=movies_re5)

Adjusted R-squared: 0.8054
Remove: audience_rating
movie_lm3 <- lm(imdb_rating~critics_score+audience_score+critics_rating+best_actor_win+best_actress_win+ best_dir_win,data=movies_re5)

Adjusted R-squared: 0.8021
Remove: best_actor_win
movie_lm4 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actress_win+ best_dir_win,data=movies_re5)

Adjusted R-squared: 0.8085
Remove: best_actress_win
movie_lm5 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actor_win+ best_dir_win,data=movies_re5)

Adjusted R-squared: 0.8084
Remove: best_dir_win
movie_lm6 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actor_win+best_actress_win,data=movies_re5)

Adjusted R-squared: 0.8086
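These one-at-a-time refits can be automated; a minimal sketch that drops each term from the current model in turn and reports the adjusted R-squared:

# Adjusted R-squared after dropping each term from the current model
drop_terms <- attr(terms(movie_lm1), "term.labels")
adj_r2 <- sapply(drop_terms, function(t)
summary(update(movie_lm1, as.formula(paste(". ~ . -", t))))$adj.r.squared)
sort(adj_r2, decreasing = TRUE)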
Each removal lowers the adjusted R-squared, so none of these variables should be dropped, and I can build the final MODEL to predict the 2016 data as below:
movie_final_lm <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actor_win+best_actress_win+ best_dir_win,data=movies_re5)

*** Model Diagnostics

a. Linear relationship between X and Y -> Checked. # Residuals plot
ggplot(data=movie_final_lm, aes(x=movie_final_lm$fitted, y=movies$imdb_rating)) +
geom_point() +
xlab("Fitted values") +
ylab("Actual IMDB rating")

hist(movie_final_lm$residuals)

plot(movie_final_lm$residuals ~ movie_final_lm$fitted)

plot(movie_final_lm$residuals)

\[
\begin{aligned}
imdb\_rating = \; & 2.950315 + 0.015859 \times critics\_score + 0.042593 \times audience\_score \\
& + 0.028458 \times critics\_rating(Fresh) + 0.307161 \times critics\_rating(Rotten) \\
& - 0.367188 \times audience\_rating(Upright) + 0.082581 \times best\_actor\_win(yes) \\
& + 0.097534 \times best\_actress\_win(yes) + 0.108698 \times best\_dir\_win(yes)
\end{aligned}
\]
Pick movies from 2016 (new movies that are not in the sample) and predict their ratings using my model. (The data comes from the Rotten Tomatoes and IMDB APIs.)
Information courtesy of IMDb (http://www.imdb.com). Used with permission.
setwd("/Users/weichien/Desktop")
movies_imdb <- read.csv("/Users/weichien/Desktop/moviedata2016.csv")

Suicide Squad 6.2 / Hacksaw Ridge 8.2 / La La Land 8.2 / Fantastic Beasts and Where to Find Them 7.4 / Hidden Figures 7.8 / Moonlight 7.5 / Deadpool 8 / Jack Reacher: Never Go Back 6.1 / Keeping Up with the Joneses 5.8 / Ghostbusters 5.3 / Colossal 6.3
C. Prediction -> predicted-to-actual agreement above 85%
Predict_result <- predict(movie_final_lm,movies_imdb)
Predict <-
data.frame (Movie_name=movies_imdb$X...Name,IMBD_rating=movies_imdb$imdb_rating,Predicted_rating=Predict_result) %>%
mutate(Predict_Actual_Rate = 1 - (abs(IMBD_rating - Predicted_rating) / IMBD_rating))

Predict

## Movie_name IMBD_rating Predicted_rating
## 1 Suicide Squad 6.2 5.927540
## 2 Hacksaw Ridge 8.2 7.881438
## 3 La La Land 8.2 7.756894
## 4 Fantastic Beasts and Where to Find Them 7.4 7.148293
## 5 Hidden Figures 7.8 8.019186
## 6 Moonlight 7.5 7.627351
## 7 Deadpool 8.0 7.748674
## 8 Jack Reacher 6.1 5.633175
## 9 Keeping Up with the Joneses 5.8 5.177338
## 10 Ghostbusters 5.3 6.196854
## 11 Colossal 6.3 6.547822
## Predict_Actual_Rate
## 1 0.9560549
## 2 0.9611510
## 3 0.9459626
## 4 0.9659855
## 5 0.9718993
## 6 0.9830198
## 7 0.9685843
## 8 0.9234713
## 9 0.8926445
## 10 0.8307823
## 11 0.9606632
ggplot(Predict, aes(x=Movie_name ,y=Predict_Actual_Rate)) +
geom_bar(stat="identity")+coord_flip()+theme_grey()+
xlab("Movie")+
ylab("Actual Rate")D. Predcition (Correct quantification of uncertainty around this prediction with a prediction interval)
Predict_result95 <- predict(movie_final_lm,movies_imdb,interval = "prediction",level = 0.95)
Predict95 <-
data.frame(Movie_name=movies_imdb$X...Name, IMBD_rating=movies_imdb$imdb_rating, Predicted_rating=Predict_result95)

Predict95

## Movie_name IMBD_rating
## 1 Suicide Squad 6.2
## 2 Hacksaw Ridge 8.2
## 3 La La Land 8.2
## 4 Fantastic Beasts and Where to Find Them 7.4
## 5 Hidden Figures 7.8
## 6 Moonlight 7.5
## 7 Deadpool 8.0
## 8 Jack Reacher 6.1
## 9 Keeping Up with the Joneses 5.8
## 10 Ghostbusters 5.3
## 11 Colossal 6.3
## Predicted_rating.fit Predicted_rating.lwr Predicted_rating.upr
## 1 5.927540 4.991732 6.863348
## 2 7.881438 6.945140 8.817736
## 3 7.756894 6.808046 8.705741
## 4 7.148293 6.212308 8.084278
## 5 8.019186 7.082819 8.955553
## 6 7.627351 6.686968 8.567734
## 7 7.748674 6.812591 8.684758
## 8 5.633175 4.699472 6.566878
## 9 5.177338 4.243657 6.111020
## 10 6.196854 5.261745 7.131964
## 11 6.547822 5.611754 7.483890
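As a sanity check on these intervals, we can verify whether each actual rating falls inside its 95% bounds (a minimal sketch; the column names follow the Predict95 data frame built above):

# Fraction of 2016 movies whose actual rating lies inside its 95% prediction interval
Predict95 %>%
mutate(covered = IMBD_rating >= Predicted_rating.lwr & IMBD_rating <= Predicted_rating.upr) %>%
summarise(coverage = mean(covered))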
This conclusion is separated into four parts covering what I learned and discovered in this project. During the different stages of modeling I faced various problems that needed solving, and there is still room to make this model better. I'll discuss, in turn: preparing for modeling, handling ambiguous variables, optimizing the model, and interpreting the prediction results.
We can't handle a dataset through personal experience alone; we need to be objective and examine the whole dataset. Some informational variables can be removed easily, but other variables that you might assume are unrelated to the response can surprise you. For example, I expected a relationship between release month and rating, because I assumed movies released during summer vacation are more popular than those in other seasons, but the correlation analysis showed there is not much difference between months.
We should be careful when dealing with outliers, and some variables aren't appropriate for the model. That is the main weakness of my prediction model. Variables such as ‘best_actor_win’, which has only one winner per year and takes only yes/no values, contribute little to the outcome and make collecting data for prediction more costly. We should keep the purpose of the model in mind when deciding which variables to remove or include.
We need to double-check whether the assumptions we set at first are reasonable. At first, I considered putting the type of movie into my model since it appeared to have a high correlation with the IMDB rating, but the resulting R-squared was extremely low, so I wanted to know why. I then realized I had ignored the large amount of "Unrated" information in that variable. So I removed all the unrated data, built a new model, and compared it with the original to check whether removing those data made the model better. To tell the truth, I still think the type of movie could change the outcome of my prediction, but the data collected here can only tell us which kinds of movies may earn a higher rating, not predict which type of movie should earn a higher rating.
Back to the research question: some variables can predict the IMDB rating, such as the scores given by audiences and critics on Rotten Tomatoes. On the other hand, critics' scores did not matter as much as audience scores. In my opinion, critics give more uniform scores than audiences, since they need to be fair to all filmmakers, while audiences readily show their preferences. In addition, critics' scores are collected from only around 50 critics, whereas audience scores come from thousands of people. That may be why ‘audience_score’ and ‘audience_rating’ carry more influence in my model.
To sum up, this project shows that even though IMDB and Rotten Tomatoes collect their data through different channels and methods, they still produce highly consistent ratings for each movie, which means their movie records have great credibility.