library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

load("movies.Rdata")

Generalizability: The data set comprises 651 randomly sampled movies produced and released before 2016, so the sample was drawn at random rather than by targeting specific groups. Random sampling matters because a non-random sample would be biased and unreliable; we should also avoid variables that introduce collinearity and aim for a parsimonious model.
Causality: This dataset supports conclusions about association, not causation, since these variables are not the actual causes of how audiences feel when they score movies on IMDB. I'll analyze the correlations between these variables in my EDA below.
“I want to make a model to predict the IMDB ratings of 2016 movies. The variables that survive selection will be the ones used for the prediction.”
Thus, I will try to find 4-5 variables that can be used for the prediction. These variables should be independent of each other and highly correlated with ‘imdb_rating’. I’ll use the methods learned in this class: ‘ggpairs’ from the ‘GGally’ package to get an overview of the associations within the dataset, backward selection to choose the most appropriate variables, and linear modeling to predict 2016 IMDB ratings from the related variables.
*** Remove unrelated variables
From knowledge and experience, some of these variables exist only for informational purposes. # Remove: imdb_url, rt_url
We will not discuss the type of movie here, and will focus on the other variables in this dataset. # Remove: title_type, genre, runtime
movies_re1 <- movies %>%
select(-imdb_url,-rt_url,-title_type, -genre, -runtime)
colnames(movies_re1)

## [1] "title" "mpaa_rating" "studio"
## [4] "thtr_rel_year" "thtr_rel_month" "thtr_rel_day"
## [7] "dvd_rel_year" "dvd_rel_month" "dvd_rel_day"
## [10] "imdb_rating" "imdb_num_votes" "critics_rating"
## [13] "critics_score" "audience_rating" "audience_score"
## [16] "best_pic_nom" "best_pic_win" "best_actor_win"
## [19] "best_actress_win" "best_dir_win" "top200_box"
## [22] "director" "actor1" "actor2"
## [25] "actor3" "actor4" "actor5"
*** Discussion of time information

Since I want to make predictions for 2016 movies, I need to make sure that time is not a main driver of the rating.
movies$time_theater <- paste( movies$thtr_rel_year,movies$thtr_rel_month, movies$thtr_rel_day, sep="-" )
movies$time_DVD <- paste( movies$dvd_rel_year,movies$dvd_rel_month, movies$dvd_rel_day, sep="-" )

I’ll also split this into two views, by theater release date and by DVD release date, so there will be two graphs showing the relationship between release time and rating.
movies$time_theater <- as.Date( movies$time_theater)
ggplot(movies, aes(time_theater, imdb_rating)) + geom_line(size=0.3,color="black") + xlab("Date released in Theater") + ylab("IMDB Rating")

movies$time_DVD <- as.Date(movies$time_DVD)
ggplot(movies, aes(time_DVD, imdb_rating)) + geom_line(size=0.3,color="black") + xlab("Date released on DVD") + ylab("IMDB Rating")

From these graphs, we can see there is no large year-to-year difference that would influence the IMDB rating.
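As a numeric check on the year effect, the correlation between theater release year and rating can be computed directly (a minimal sketch):

# Correlation between theater release year and IMDB rating
movies %>%
summarise(cor_year = cor(thtr_rel_year, imdb_rating, use = "complete.obs"))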
movies_thtr_monthly <- movies%>%
group_by(thtr_rel_month)%>%
summarise(imdb_rating_mean = mean(imdb_rating))%>%
mutate(month = factor(month.abb[thtr_rel_month], levels = month.abb))  # factor levels keep the bars in calendar order
ggplot(movies_thtr_monthly,aes(x=month,y=imdb_rating_mean))+geom_bar(stat= "identity")+xlab("Month")+ylab("Mean of Rating")+ theme_classic()

I thought that July might have more highly rated releases, but I was wrong: there is no large difference between months.
Based on this exploration, I'll remove ‘thtr_rel_year, thtr_rel_month, thtr_rel_day, dvd_rel_year, dvd_rel_month, dvd_rel_day’. # Remove: time-related variables
movies_re2 <- movies_re1 %>%
select(-thtr_rel_year, -thtr_rel_month, -thtr_rel_day, -dvd_rel_year, -dvd_rel_month, -dvd_rel_day)
colnames(movies_re2)

## [1] "title" "mpaa_rating" "studio"
## [4] "imdb_rating" "imdb_num_votes" "critics_rating"
## [7] "critics_score" "audience_rating" "audience_score"
## [10] "best_pic_nom" "best_pic_win" "best_actor_win"
## [13] "best_actress_win" "best_dir_win" "top200_box"
## [16] "director" "actor1" "actor2"
## [19] "actor3" "actor4" "actor5"
*** Discussion of actors and directors

These variables have hundreds of distinct levels (nearly one per movie), so they cannot usefully enter a linear model; I'll remove them, but still explore them briefly below.
movies_re3 <- movies_re2 %>%
select(-director, -actor1, -actor2, -actor3, -actor4, -actor5,-studio)
colnames(movies_re3)

## [1] "title" "mpaa_rating" "imdb_rating"
## [4] "imdb_num_votes" "critics_rating" "critics_score"
## [7] "audience_rating" "audience_score" "best_pic_nom"
## [10] "best_pic_win" "best_actor_win" "best_actress_win"
## [13] "best_dir_win" "top200_box"
Which directors have the highest average IMDB rating? Let's display the top twenty (the top names are B.Z. Goldberg / Christopher Nolan / Dereck Joubert / Jim Brown / Jonathan Karsh).
temp <- movies%>% select(director,imdb_rating)
temp <- temp %>% group_by(director) %>% summarise(avg=mean(imdb_rating))
temp <- temp %>% arrange(desc(avg))
temp <- temp[1:20,]
ggplot(data=temp,aes(x=director,y=avg))+geom_bar(stat="identity")+coord_flip()+theme_grey()
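Note that ggplot orders the bars alphabetically by director name; to rank them by average rating instead, reorder() can be used (a minimal sketch over the temp data frame built above):

# Sort bars by average rating rather than alphabetically
ggplot(temp, aes(x = reorder(director, avg), y = avg)) +
geom_bar(stat = "identity") + coord_flip() + theme_grey() +
xlab("Director") + ylab("Average IMDB rating")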
Which leading actors in recent movies (released in 2000 or later) have the highest average IMDB rating? Let's display the top twenty actors (the top names are Yarko / Jeremy Irons / Pete Seeger / Susan Tom / Bruce Balden / Joel Edgerton).

temp1 <- movies %>%
filter(thtr_rel_year >= 2000 )%>%
select(actor1,imdb_rating)
temp1 <- temp1 %>% group_by(actor1) %>% summarise(avg=mean(imdb_rating))
temp1 <- temp1 %>% arrange(desc(avg))
temp1 <- temp1[1:20,]
ggplot(data=temp1,aes(x=actor1,y=avg))+geom_bar(stat="identity")+coord_flip()+theme_grey()

*** Remove outliers and check whether they influence the prediction
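Before filtering, a quick look at the rating distribution shows where the extreme values sit (a minimal sketch):

# Distribution of IMDB ratings; points beyond the whiskers are outliers
summary(movies$imdb_rating)
ggplot(movies, aes(x = "", y = imdb_rating)) + geom_boxplot() + xlab("") + ylab("IMDB Rating")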
movies%>%
summarise(cor(imdb_rating,audience_score))

## Source: local data frame [1 x 1]
##
## cor(imdb_rating, audience_score)
## (dbl)
## 1 0.8648652
movies_test <- movies_re3 %>%
filter(imdb_rating >= 3.3 & imdb_rating < 8.5)

movies_test %>%
summarise(cor(imdb_rating,audience_score))

## Source: local data frame [1 x 1]
##
## cor(imdb_rating, audience_score)
## (dbl)
## 1 0.8629269
The correlation barely changes (0.8648652 vs. 0.8629269), so these outliers have little influence on the association.

While exploring, I noticed that the "Unrated" tag in ‘mpaa_rating’ may change the outcome of my prediction, since "Unrated" doesn't convey anything meaningful about a movie. Thus, I'll remove the movies of unrated type to see whether this makes a big difference to the association between imdb_rating and audience_score. (0.8546815)
movies_test2 <- movies_re3 %>%
filter(mpaa_rating != "Unrated")
movies_test2 %>%
summarise(cor(imdb_rating,audience_score))## Source: local data frame [1 x 1]
##
## cor(imdb_rating, audience_score)
## (dbl)
## 1 0.8546815
0.8546815 is slightly lower, but I'll remove this variable anyway: keeping it would limit my predictions, since I would have to exclude all "Unrated" movies from any new dataset.
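For reference, a quick tally shows how many movies carry each MPAA rating, including "Unrated" (a minimal sketch):

# How many movies fall in each mpaa_rating category?
movies_re3 %>% count(mpaa_rating)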
movies_re4 <- movies_re3 %>%
select(-mpaa_rating)

Finally, after removing all unrelated variables, I am ready to move on to modeling: dropping low-correlation variables and building the best prediction model.
*** Check the remaining variables

After dealing with the unrelated variables, 13 variables remain; setting aside the movie title and imdb_rating itself, the rest are candidate predictors.
Numerical variables: imdb_rating, imdb_num_votes, critics_score, audience_score
Factor variables: critics_rating, audience_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box
str(movies_re4)

## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 13 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
movies_re4 %>%
summarise(cor(imdb_rating,imdb_num_votes),
cor(imdb_rating,critics_score),
cor(imdb_rating,audience_score))

## Source: local data frame [1 x 3]
##
## cor(imdb_rating, imdb_num_votes) cor(imdb_rating, critics_score)
## (dbl) (dbl)
## 1 0.3311525 0.7650355
## Variables not shown: cor(imdb_rating, audience_score) (dbl)
movies %>%
select(imdb_rating,imdb_num_votes,
critics_score, audience_score) %>%
ggpairs()

Next, test whether the mean critics_score differs across the critics_rating levels (an ANOVA), to check how much information the two variables share.

inference(y=critics_score, x=critics_rating, data=movies_re4, statistic = "mean", type = "ht", alternative = "greater", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (3 levels)
## n_Certified Fresh = 135, y_bar_Certified Fresh = 86.8593, s_Certified Fresh = 7.2505
## n_Fresh = 209, y_bar_Fresh = 77.3445, s_Fresh = 12.6649
## n_Rotten = 307, y_bar_Rotten = 31.4788, s_Rotten = 15.9091
##
## ANOVA:
## df Sum_Sq Mean_Sq F p_value
## critics_rating 2 406517.5644 203258.7822 1117.5633 < 0.0001
## Residuals 648 117856.1345 181.8768
## Total 650 524373.6989
##
## Pairwise tests - t tests with pooled SD:
## group1 group2 p.value
## 1 Fresh Certified Fresh 3.179e-10
## 2 Rotten Certified Fresh 5.256e-176
## 4 Rotten Fresh 1.128e-166
We reject the null hypothesis: mean critics_score clearly differs across the critics_rating levels, so I don't need to remove either of these variables. Repeat the same test for audience_score and audience_rating.
inference(y=audience_score, x=audience_rating, data=movies_re4, statistic = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Spilled = 275, y_bar_Spilled = 41.9345, s_Spilled = 11.217
## n_Upright = 376, y_bar_Upright = 77.3032, s_Upright = 9.3317
## H0: mu_Spilled = mu_Upright
## HA: mu_Spilled > mu_Upright
## t = -42.6058, df = 274
## p_value = 1
Here the one-sided alternative points the wrong way (Spilled is the lower-scoring group), which is why the p-value is 1; the huge t statistic (-42.6) shows the two group means clearly differ. As before, I don't need to remove either of these variables.
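A direct two-sided check confirms that the two groups differ (a minimal sketch using base R's t.test):

# Two-sided test: do mean audience scores differ between Spilled and Upright?
t.test(audience_score ~ audience_rating, data = movies_re4)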
movies_re5 <- movies_re4 %>%
select(-imdb_num_votes)

I drop imdb_num_votes here because its correlation with imdb_rating is only about 0.33, far weaker than critics_score (about 0.77) and audience_score.

*** Backward Selection

Make sure the remaining variables produce the best prediction model.
movie_lm1 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_pic_nom+best_pic_win+best_actor_win+best_actress_win+ best_dir_win+top200_box,data=movies_re5)
summary(movie_lm1)

##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + critics_rating +
## audience_rating + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box, data = movies_re5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30802 -0.21403 0.03354 0.27988 1.18664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.948464 0.142129 20.745 < 2e-16 ***
## critics_score 0.015858 0.001561 10.156 < 2e-16 ***
## audience_score 0.042576 0.002119 20.094 < 2e-16 ***
## critics_ratingFresh 0.032948 0.056289 0.585 0.558523
## critics_ratingRotten 0.310520 0.092135 3.370 0.000796 ***
## audience_ratingUpright -0.367011 0.075800 -4.842 1.62e-06 ***
## best_pic_nomyes -0.008349 0.123027 -0.068 0.945919
## best_pic_winyes 0.111602 0.216812 0.515 0.606913
## best_actor_winyes 0.083216 0.054792 1.519 0.129322
## best_actress_winyes 0.094051 0.061471 1.530 0.126511
## best_dir_winyes 0.094908 0.080461 1.180 0.238617
## top200_boxyes 0.005712 0.126594 0.045 0.964027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4752 on 639 degrees of freedom
## Multiple R-squared: 0.8113, Adjusted R-squared: 0.8081
## F-statistic: 249.8 on 11 and 639 DF, p-value: < 2.2e-16
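As an aside, R's built-in step() automates backward elimination using AIC; a minimal sketch (AIC-based selection, so it may keep a different subset than the manual p-value approach used below):

# Automated backward selection by AIC, for comparison
movie_step <- step(movie_lm1, direction = "backward", trace = FALSE)
summary(movie_step)$adj.r.squared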
Remove top200_box, best_pic_nom, and best_pic_win, which have the highest p-values, and see whether the adjusted R-squared can exceed 0.8081.
movie_lm1 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actor_win+best_actress_win+ best_dir_win,data=movies_re5)
summary(movie_lm1)

##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + critics_rating +
## audience_rating + best_actor_win + best_actress_win + best_dir_win,
## data = movies_re5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.30746 -0.21062 0.03402 0.28047 1.18703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.950315 0.141570 20.840 < 2e-16 ***
## critics_score 0.015859 0.001558 10.178 < 2e-16 ***
## audience_score 0.042593 0.002104 20.242 < 2e-16 ***
## critics_ratingFresh 0.028458 0.054618 0.521 0.602520
## critics_ratingRotten 0.307161 0.091250 3.366 0.000808 ***
## audience_ratingUpright -0.367188 0.075550 -4.860 1.48e-06 ***
## best_actor_winyes 0.082581 0.053978 1.530 0.126537
## best_actress_winyes 0.097534 0.060349 1.616 0.106549
## best_dir_winyes 0.108698 0.076189 1.427 0.154153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4742 on 642 degrees of freedom
## Multiple R-squared: 0.8112, Adjusted R-squared: 0.8089
## F-statistic: 344.8 on 8 and 642 DF, p-value: < 2.2e-16
Adjusted R-squared: 0.8089, which means I've built a slightly better model. But I still want to see whether it can be improved further, so I'll try removing "critics_rating", "audience_rating", "best_actor_win", "best_actress_win", and "best_dir_win" individually to find the highest adjusted R-squared. (Baseline: 0.8089)
Remove: critics_rating
movie_lm2 <- lm(imdb_rating~critics_score+audience_score+ audience_rating+best_actor_win+best_actress_win+ best_dir_win,data=movies_re5)

Adjusted R-squared: 0.8054
Remove: audience_rating
movie_lm3 <- lm(imdb_rating~critics_score+audience_score+critics_rating+best_actor_win+best_actress_win+ best_dir_win,data=movies_re5)

Adjusted R-squared: 0.8021
Remove: best_actor_win
movie_lm4 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actress_win+ best_dir_win,data=movies_re5)

Adjusted R-squared: 0.8085
Remove: best_actress_win
movie_lm5 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actor_win+ best_dir_win,data=movies_re5)

Adjusted R-squared: 0.8084
Remove: best_dir_win
movie_lm6 <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actor_win+best_actress_win,data=movies_re5)

Adjusted R-squared: 0.8086
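These one-at-a-time refits can be automated; a minimal sketch that drops each term from the current model in turn and reports the adjusted R-squared:

# Adjusted R-squared after dropping each term from the current model
drop_terms <- attr(terms(movie_lm1), "term.labels")
adj_r2 <- sapply(drop_terms, function(t)
summary(update(movie_lm1, as.formula(paste(". ~ . -", t))))$adj.r.squared)
sort(adj_r2, decreasing = TRUE)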
Each removal lowers the adjusted R-squared, so none of these variables should be dropped, and I can build the final MODEL to predict the 2016 data as below:
movie_final_lm <- lm(imdb_rating~critics_score+audience_score+critics_rating+ audience_rating+best_actor_win+best_actress_win+ best_dir_win,data=movies_re5)

*** Model Diagnostics

a. Linear relationship between X and Y -> Checked. # Residuals plot
ggplot(data=movie_final_lm, aes(x=movie_final_lm$fitted, y=movies$imdb_rating)) +
geom_point() +
xlab("Fitted values") +
ylab("Actual IMDB rating")

hist(movie_final_lm$residuals)

plot(movie_final_lm$residuals ~ movie_final_lm$fitted)

plot(movie_final_lm$residuals)

\[
\begin{aligned}
imdb\_rating = \; & 2.950315 + 0.015859 \times critics\_score + 0.042593 \times audience\_score \\
& + 0.028458 \times critics\_rating(Fresh) + 0.307161 \times critics\_rating(Rotten) \\
& - 0.367188 \times audience\_rating(Upright) + 0.082581 \times best\_actor\_win(yes) \\
& + 0.097534 \times best\_actress\_win(yes) + 0.108698 \times best\_dir\_win(yes)
\end{aligned}
\]
Pick movies from 2016 (new movies that are not in the sample) and predict their ratings using my model. (The data comes from the Rotten Tomatoes and IMDB APIs.)
Information courtesy of IMDb (http://www.imdb.com). Used with permission.
setwd("/Users/weichien/Desktop")
movies_imdb <- read.csv("/Users/weichien/Desktop/moviedata2016.csv")

Suicide Squad 6.2 / Hacksaw Ridge 8.2 / La La Land 8.2 / Fantastic Beasts and Where to Find Them 7.4 / Hidden Figures 7.8 / Moonlight 7.5 / Deadpool 8 / Jack Reacher: Never Go Back 6.1 / Keeping Up with the Joneses 5.8 / Ghostbusters 5.3 / Colossal 6.3
C. Prediction -> predicted-to-actual agreement above 85%
Predict_result <- predict(movie_final_lm,movies_imdb)
Predict <-
data.frame (Movie_name=movies_imdb$X...Name,IMBD_rating=movies_imdb$imdb_rating,Predicted_rating=Predict_result) %>%
mutate(Predict_Actual_Rate = 1 - (abs(IMBD_rating - Predicted_rating) / IMBD_rating))

Predict

## Movie_name IMBD_rating Predicted_rating
## 1 Suicide Squad 6.2 5.927540
## 2 Hacksaw Ridge 8.2 7.881438
## 3 La La Land 8.2 7.756894
## 4 Fantastic Beasts and Where to Find Them 7.4 7.148293
## 5 Hidden Figures 7.8 8.019186
## 6 Moonlight 7.5 7.627351
## 7 Deadpool 8.0 7.748674
## 8 Jack Reacher 6.1 5.633175
## 9 Keeping Up with the Joneses 5.8 5.177338
## 10 Ghostbusters 5.3 6.196854
## 11 Colossal 6.3 6.547822
## Predict_Actual_Rate
## 1 0.9560549
## 2 0.9611510
## 3 0.9459626
## 4 0.9659855
## 5 0.9718993
## 6 0.9830198
## 7 0.9685843
## 8 0.9234713
## 9 0.8926445
## 10 0.8307823
## 11 0.9606632
ggplot(Predict, aes(x=Movie_name ,y=Predict_Actual_Rate)) +
geom_bar(stat="identity")+coord_flip()+theme_grey()+
xlab("Movie")+
ylab("Actual Rate")D. Predcition (Correct quantification of uncertainty around this prediction with a prediction interval)
Predict_result95 <- predict(movie_final_lm,movies_imdb,interval = "prediction",level = 0.95)
Predict95 <-
data.frame(Movie_name=movies_imdb$X...Name, IMBD_rating=movies_imdb$imdb_rating, Predicted_rating=Predict_result95)

Predict95

## Movie_name IMBD_rating
## 1 Suicide Squad 6.2
## 2 Hacksaw Ridge 8.2
## 3 La La Land 8.2
## 4 Fantastic Beasts and Where to Find Them 7.4
## 5 Hidden Figures 7.8
## 6 Moonlight 7.5
## 7 Deadpool 8.0
## 8 Jack Reacher 6.1
## 9 Keeping Up with the Joneses 5.8
## 10 Ghostbusters 5.3
## 11 Colossal 6.3
## Predicted_rating.fit Predicted_rating.lwr Predicted_rating.upr
## 1 5.927540 4.991732 6.863348
## 2 7.881438 6.945140 8.817736
## 3 7.756894 6.808046 8.705741
## 4 7.148293 6.212308 8.084278
## 5 8.019186 7.082819 8.955553
## 6 7.627351 6.686968 8.567734
## 7 7.748674 6.812591 8.684758
## 8 5.633175 4.699472 6.566878
## 9 5.177338 4.243657 6.111020
## 10 6.196854 5.261745 7.131964
## 11 6.547822 5.611754 7.483890
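As a sanity check on these intervals, we can verify whether each actual rating falls inside its 95% bounds (a minimal sketch; the column names follow the Predict95 data frame built above):

# Fraction of 2016 movies whose actual rating lies inside its 95% prediction interval
Predict95 %>%
mutate(covered = IMBD_rating >= Predicted_rating.lwr & IMBD_rating <= Predicted_rating.upr) %>%
summarise(coverage = mean(covered))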
This conclusion is separated into four parts covering what I learned and discovered in this project. During the different stages of modeling I faced various problems that needed solving, and there is still room to make this model better. I'll discuss, in turn: preparing for modeling, handling ambiguous variables, optimizing the model, and interpreting the prediction results.
We can't handle a dataset through personal experience alone; we need to be objective and examine the whole dataset. Some informational variables can be removed easily, but other variables that you might assume are unrelated to the response can surprise you. For example, I expected a relationship between release month and rating, because I assumed movies released during summer vacation are more popular than those in other seasons, but the correlation analysis showed there is not much difference between months.
We should be careful when dealing with outliers, and some variables aren't appropriate for the model. That is the main weakness of my prediction model. Variables such as ‘best_actor_win’, which has only one winner per year and takes only yes/no values, contribute little to the outcome and make collecting data for prediction more costly. We should keep the purpose of the model in mind when deciding which variables to remove or include.
We need to double-check whether the assumptions we set at first are reasonable. At first, I considered putting the type of movie into my model since it appeared to have a high correlation with the IMDB rating, but the resulting R-squared was extremely low, so I wanted to know why. I then realized I had ignored the large amount of "Unrated" information in that variable. So I removed all the unrated data, built a new model, and compared it with the original to check whether removing those data made the model better. To tell the truth, I still think the type of movie could change the outcome of my prediction, but the data collected here can only tell us which kinds of movies may earn a higher rating, not predict which type of movie should earn a higher rating.
Back to the research question: some variables can predict the IMDB rating, such as the scores given by audiences and critics on Rotten Tomatoes. On the other hand, critics' scores did not matter as much as audience scores. In my opinion, critics give more uniform scores than audiences, since they need to be fair to all filmmakers, while audiences readily show their preferences. In addition, critics' scores are collected from only around 50 critics, whereas audience scores come from thousands of people. That may be why ‘audience_score’ and ‘audience_rating’ carry more influence in my model.
To sum up, this project shows that even though IMDB and Rotten Tomatoes collect their data through different channels and methods, they still produce highly consistent ratings for each movie, which means their movie records have great credibility.