library(ggplot2)
library(dplyr)
library(statsr)setwd("C:/Users/rahul/Desktop/Duke University/Linear regression and Modeling/Week4")
load("movies.Rdata")
dim(movies)## [1] 651 32
The dataset contains information about 651 movies released before 2016 and information about these are stored across 32 variables.The data is randomly selected not randomly assigned. Consequently any conclusion can only be generlised to population; causality requires random assignment. it is only possible to do an observational study & no causal analysis can be done.
str(movies)## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
As we can see dataset cantains 32 variables
also veriable like audience score,actor1,2,3,4,5, URL based, and studio won’t not taken in consderation. for further analysis.
movies_new <- na.omit(movies) %>%
select (title_type, genre, runtime,
mpaa_rating, thtr_rel_year,imdb_rating, imdb_num_votes, critics_score,
critics_rating, audience_rating, audience_score)
dim(movies_new)## [1] 619 11
head(movies_new)## # A tibble: 6 x 11
## title_type genre runtime mpaa_rating thtr_rel_year imdb_rating
## <fct> <fct> <dbl> <fct> <dbl> <dbl>
## 1 Feature Film Drama 80.0 R 2013 5.50
## 2 Feature Film Drama 101 PG-13 2001 7.30
## 3 Feature Film Comedy 84.0 R 1996 7.60
## 4 Feature Film Drama 139 PG 1993 7.20
## 5 Feature Film Horror 90.0 R 2004 5.10
## 6 Feature Film Drama 142 PG-13 1986 7.20
## # ... with 5 more variables: imdb_num_votes <int>, critics_score <dbl>,
## # critics_rating <fct>, audience_rating <fct>, audience_score <dbl>
Investigate what parameters are major influences on the audience score (audience_score).
ggplot(data= movies_new,aes(x=genre))+geom_bar( fill='skyblue',color='black')+
theme(axis.text.x=element_text(angle = 60, hjust = 1))+
ggtitle("movie count by genre")ggplot(data=movies_new,aes(x=critics_score,y=audience_score))+geom_point()+
geom_smooth(method=lm,se=F)+ggtitle("relationship between critic score Vs audience score")ggplot(data=movies_new, aes(x=imdb_rating, y=audience_score)) + geom_point() + geom_smooth(method=lm,se=F)+ggtitle("relationship between imdb Vs audience score")ggplot(data=movies_new,aes(x=audience_rating, y=audience_score))+geom_point()+theme(axis.text.x=element_text(angle = 60, hjust = 1))+ geom_smooth(method=lm)+ggtitle("relationship between genre Vs audience score") There seems to be a strong positive linear relationship between critic score Vs audience score
The average runtime of a movie is around 105 and its distribution is slightly right skewed with some outliers movies with runtime around 250 minutes. The critics scores are nearly uniformly distributed, but slightly left skewed.The score given by the audience has a similar behavior has the critics scores, with a nearly uniform left skewed distribution. * * *
In this project we will use linear regression and start by fitting a model with 8 variables(described in the previous section). Backward elimination will help us to define if better results can be obtained by using a smaller set of attributes. The advantage of backward elimination is that it allows to start with all the variables, deleting one variable at a time until there are no improvements in the model.
m_full<-lm(audience_score~critics_score+genre+critics_rating+imdb_rating+audience_rating+mpaa_rating+runtime
+imdb_num_votes+title_type+thtr_rel_year,data=movies_new)
m_full<-summary(m_full)
m_full##
## Call:
## lm(formula = audience_score ~ critics_score + genre + critics_rating +
## imdb_rating + audience_rating + mpaa_rating + runtime + imdb_num_votes +
## title_type + thtr_rel_year, data = movies_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.7520 -4.4252 0.5607 4.1959 24.3954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.334e+01 6.145e+01 1.356 0.1755
## critics_score 1.046e-02 2.571e-02 0.407 0.6842
## genreAnimation 4.025e+00 2.855e+00 1.410 0.1591
## genreArt House & International -1.723e+00 2.275e+00 -0.758 0.4490
## genreComedy 1.641e+00 1.168e+00 1.405 0.1607
## genreDocumentary 4.479e+00 2.929e+00 1.529 0.1267
## genreDrama -1.170e-01 1.032e+00 -0.113 0.9097
## genreHorror -1.476e+00 1.765e+00 -0.836 0.4032
## genreMusical & Performing Arts 4.636e+00 2.404e+00 1.928 0.0543 .
## genreMystery & Suspense -2.761e+00 1.319e+00 -2.093 0.0368 *
## genreOther 3.546e-01 2.021e+00 0.175 0.8608
## genreScience Fiction & Fantasy -1.491e-01 2.601e+00 -0.057 0.9543
## critics_ratingFresh -2.292e-01 9.075e-01 -0.253 0.8007
## critics_ratingRotten -8.477e-01 1.439e+00 -0.589 0.5562
## imdb_rating 9.321e+00 5.000e-01 18.642 <2e-16 ***
## audience_ratingUpright 2.016e+01 8.081e-01 24.943 <2e-16 ***
## mpaa_ratingNC-17 -8.167e+00 7.193e+00 -1.135 0.2567
## mpaa_ratingPG -1.213e+00 2.037e+00 -0.595 0.5518
## mpaa_ratingPG-13 -1.590e+00 2.121e+00 -0.750 0.4536
## mpaa_ratingR -2.010e+00 2.039e+00 -0.986 0.3247
## mpaa_ratingUnrated -6.683e-01 2.442e+00 -0.274 0.7845
## runtime -4.014e-02 1.711e-02 -2.346 0.0193 *
## imdb_num_votes 5.209e-06 3.175e-06 1.641 0.1014
## title_typeFeature Film 3.378e+00 2.783e+00 1.214 0.2253
## title_typeTV Movie 2.290e+00 4.433e+00 0.517 0.6057
## thtr_rel_year -4.558e-02 3.044e-02 -1.497 0.1349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.876 on 593 degrees of freedom
## Multiple R-squared: 0.8886, Adjusted R-squared: 0.8839
## F-statistic: 189.2 on 25 and 593 DF, p-value: < 2.2e-16
The linear model and anova output summary reveals an adjusted \(R^2\) value of 88.39%. Ths means that only 11.7 of the variability is accounted for in the residuals and the remainder is accounted for in the predictor varibles. This extremely high value may be suspect to error, that would invalidate the reliability of the study.
The multi-collinearity must be evaluated to remove any interdependecies amoung the predictors. This will be left for the model building section
#remove veriable thtr_rel_year
m1<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+
imdb_rating+audience_rating+imdb_num_votes+title_type,data=movies_new)
a<-summary(m1)$adj.r.squared
#menove veriable title_type
m2<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+
imdb_rating+audience_rating+imdb_num_votes,data=movies_new)
b<-summary(m2)$adj.r.squared
#remove veriable imdb_num_votes
m3<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+
imdb_rating+audience_rating,data=movies_new)
c<-summary(m3)$adj.r.squared
#remove veraible runtime
m4<-lm(audience_score~critics_score+mpaa_rating+genre+
imdb_rating+audience_rating+critics_rating,data=movies_new)
d<-summary(m4)$adj.r.squared
#Removing mpaa_rating
m5<-lm(audience_score~critics_score+genre+
imdb_rating+audience_rating+critics_rating,data=movies_new)
e<-summary(m5)$adj.r.squaredAnalysising Adjusted \(R^2\)
cbind(c("m_full","-thtr_rel_year","-title_type","-imdb_num_votes","-runtime","mpaa_rating"),
round(c(m_full$adj.r.squared,a,b,c,d,e),4))## [,1] [,2]
## [1,] "m_full" "0.8839"
## [2,] "-thtr_rel_year" "0.8836"
## [3,] "-title_type" "0.8837"
## [4,] "-imdb_num_votes" "0.8835"
## [5,] "-runtime" "0.8831"
## [6,] "mpaa_rating" "0.8834"
from the output of above table it seems veriable thtr_rel_year,“title_type”,“imdb_num_votes”,“mpaa_rating,imdb_rating and”runtime" are not significant predictors as adjusted \(R^2\)= 0.8889 does not increase infact it remains same indicating that there is weak relationship between them.
Now we have left with 4 veriables critics_score,genre,imdb_rating and audience_rating lets check them step by step
#removing audiance_rating
m6<-lm(audience_score~critics_score+genre+critics_rating+
imdb_rating,data=movies_new)
f<-summary(m6)$adj.r.squared
f## [1] 0.7610629
It seems audience rating is having strong posive correlation with audience score so keep it in model and continue further
#removing imdb_rating
m7<-lm(audience_score~critics_score+genre+audience_rating+critics_rating,data=movies_new)
g<-summary(m7)$adj.r.squared
#removing critic_score
m8<-lm(audience_score~critics_rating+audience_rating+imdb_rating+genre,data=movies_new)
h<-summary(m8)$adj.r.squared
#removing genre
m9<-lm(audience_score~critics_rating+audience_rating+imdb_rating,data=movies_new)
i<-summary(m9)$adj.r.squared
#removing critics rating
m10<-lm(audience_score~audience_rating+imdb_rating+genre,data=movies_new)
j<-summary(m10)$adj.r.squaredAnalysising Adjusted \(R^2\)
cbind(c("m_full","audiance_rating","imdb_rating","critic_score","genre","critics_rating"),
round(c(m_full$adj.r.squared,f,g,h,i,j),4))## [,1] [,2]
## [1,] "m_full" "0.8839"
## [2,] "audiance_rating" "0.7611"
## [3,] "imdb_rating" "0.8078"
## [4,] "critic_score" "0.8836"
## [5,] "genre" "0.8799"
## [6,] "critics_rating" "0.8834"
The final model depicts a Parsimonius Model: the simplest model with the highest predictive power. Only four predictors are used : imdb_rating, audience_rating,critics_rating and genre.from the output of above table it seems veriable critic_score is not significant predictors as adjusted \(R^2\)= 0.885 does not increase infact it remains same indicating that there is weak relationship between them on the other hand audiance_rating,imdb_rating and “critics_rating is having strong relationship between them.
so our final model will be
m8<-lm(audience_score~audience_rating+imdb_rating+genre,data=movies_new)
summary(m8)##
## Call:
## lm(formula = audience_score ~ audience_rating + imdb_rating +
## genre, data = movies_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.620 -4.502 0.627 4.364 24.975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.2663 2.2718 -5.399 9.61e-08 ***
## audience_ratingUpright 20.5042 0.7937 25.834 < 2e-16 ***
## imdb_rating 9.7107 0.3802 25.543 < 2e-16 ***
## genreAnimation 5.5041 2.6000 2.117 0.0347 *
## genreArt House & International -2.2434 2.1928 -1.023 0.3067
## genreComedy 1.6229 1.1509 1.410 0.1590
## genreDocumentary 1.3614 1.4789 0.921 0.3577
## genreDrama -0.6963 0.9828 -0.708 0.4790
## genreHorror -1.3892 1.7145 -0.810 0.4181
## genreMusical & Performing Arts 2.7493 2.2083 1.245 0.2136
## genreMystery & Suspense -3.2959 1.2818 -2.571 0.0104 *
## genreOther 0.7362 1.9985 0.368 0.7127
## genreScience Fiction & Fantasy 0.5883 2.5888 0.227 0.8203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.89 on 606 degrees of freedom
## Multiple R-squared: 0.8856, Adjusted R-squared: 0.8834
## F-statistic: 391.1 on 12 and 606 DF, p-value: < 2.2e-16
As we started with full model with adjusted \(R^2\)= 88.3 reached to adjusted \(R^2\) = 88.5 with backword elimination with increase in adjusted \(R^2\).
Variables that were excluded from the table are listed below-
runtime - weak linear relationship and not significant thtr_rel_year no linear relationship and not significant rel_year - no linear relationship and not significant rel_month - no linear relationship and not significant mpaa_rating - not significant imdb_num_votes - not significant critics_score - collinearity and not significant
pairs(data=movies_new,~audience_score+critics_rating+audience_rating+imdb_rating+genre) * * * ## Part 5 Model Diagnostics
Checking for the linear relationship between numerical predictor (s) and residual (s) using the code written below.
plot(m8$residuals~movies_new$imdb_rating,pch=20,main="Linearity Condition")
abline(h=0) by looking at plot I can say the plot depicts a complete random scatter around zero; no descernable pattern.
par(mfrow=c(1,2),mar=c(3,2,3,2))
hist(m8$residuals,main="histogram of residuals",col="skyblue")
qqnorm(m8$residuals,pch=20,main="Normal Probability Plot of Residuals")
qqline(m8$residuals) Condition met - the majority of the points lie on the line, but because of skeness, a few points do not. Also note that there are no apparent outliners the histogram confirms the skewness (right skewness) but the distribution still appear to be nearly normal.
par(mfrow=c(1,2))
plot(m8$residuals~m8$fitted,main='Residuals vs. Predicted (fitted) ')
plot(abs(m8$residuals)~m8$fitted,main='Absolute Residuals vs. Predicted')The plot of predicted values shows that residuals are equally variable for low and heigh values and there is no visible fan pattern.
plot(m8$residuals,pch=20 ,main="Independany Conditions")Above plot depicts residuals being randomly scatterd around zero.
Build test data cases for the movie “The jungle book (2016)” using the data gathered from IMDB and rotten tomatoes website and storing the data in the variable named “new_data” using the following code.
imdb_rating<-7.5
genre <- "Drama"
critics_rating <-"Fresh"
audience_score= 86
audience_rating<- "Upright"
new_data<- data.frame(critics_rating,audience_rating,imdb_rating,genre)predict(m8, new_data)## 1
## 80.3719
predict(m8,new_data,interval="confidence")## fit lwr upr
## 1 80.3719 79.44515 81.29865
Note: The actual audience_score for movie The jungle book (2016) is 86 and model predicted audience sccore is 80.37
With this information we can conclude that we are 95% confident that the actual audience_score for The jungle book (2016) is between 79.44 and 81.29 The model returns an interval that does not includes our predicted value of 86.
Predicting model is not easy task with few veriable to deal with I think model will be more accurate if we add more veriables model as mention in imdb website http://www.imdb.com/title/tt3040964 genre of movie is Adventure, Drama, Family while in my final model I dont have such genre. so I left with no option to mention “Drama” as gente. beacuse of this there is variation in predicted model.
The data is biased toward drama movies, consequently the model was trained primarily by drame movie dataset thus it would have been better to predict the audience score about drama movies.