Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

load("movies.Rdata")

Part 1: Data

This dataset consists of 651 randomly sampled movies. Because this is an observational study, causation cannot be inferred; the analysis can only indicate associations between variables.

str(movies)

## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

summary(movies)

##     title                  title_type                 genre        runtime     
##  Length:651         Documentary : 55   Drama             :305   Min.   : 39.0  
##  Class :character   Feature Film:591   Comedy            : 87   1st Qu.: 92.0  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65   Median :103.0  
##                                        Mystery & Suspense: 59   Mean   :105.8  
##                                        Documentary       : 52   3rd Qu.:115.8  
##                                        Horror            : 23   Max.   :267.0  
##                                        (Other)           : 60   NA's   :1      
##   mpaa_rating                               studio    thtr_rel_year 
##  G      : 19   Paramount Pictures              : 37   Min.   :1970  
##  NC-17  :  2   Warner Bros. Pictures           : 30   1st Qu.:1990  
##  PG     :118   Sony Pictures Home Entertainment: 27   Median :2000  
##  PG-13  :133   Universal Pictures              : 23   Mean   :1998  
##  R      :329   Warner Home Video               : 19   3rd Qu.:2007  
##  Unrated: 50   (Other)                         :507   Max.   :2014  
##                NA's                            :  8                 
##  thtr_rel_month   thtr_rel_day    dvd_rel_year  dvd_rel_month   
##  Min.   : 1.00   Min.   : 1.00   Min.   :1991   Min.   : 1.000  
##  1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000  
##  Median : 7.00   Median :15.00   Median :2004   Median : 6.000  
##  Mean   : 6.74   Mean   :14.42   Mean   :2004   Mean   : 6.333  
##  3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008   3rd Qu.: 9.000  
##  Max.   :12.00   Max.   :31.00   Max.   :2015   Max.   :12.000  
##                                  NA's   :8      NA's   :8       
##   dvd_rel_day     imdb_rating    imdb_num_votes           critics_rating
##  Min.   : 1.00   Min.   :1.900   Min.   :   180   Certified Fresh:135   
##  1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546   Fresh          :209   
##  Median :15.00   Median :6.600   Median : 15116   Rotten         :307   
##  Mean   :15.01   Mean   :6.493   Mean   : 57533                         
##  3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301                         
##  Max.   :31.00   Max.   :9.000   Max.   :893008                         
##  NA's   :8                                                              
##  critics_score    audience_rating audience_score  best_pic_nom best_pic_win
##  Min.   :  1.00   Spilled:275     Min.   :11.00   no :629      no :644     
##  1st Qu.: 33.00   Upright:376     1st Qu.:46.00   yes: 22      yes:  7     
##  Median : 61.00                   Median :65.00                            
##  Mean   : 57.69                   Mean   :62.36                            
##  3rd Qu.: 83.00                   3rd Qu.:80.00                            
##  Max.   :100.00                   Max.   :97.00                            
##                                                                            
##  best_actor_win best_actress_win best_dir_win top200_box   director        
##  no :558        no :579          no :608      no :636    Length:651        
##  yes: 93        yes: 72          yes: 43      yes: 15    Class :character  
##                                                          Mode  :character  
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##     actor1             actor2             actor3             actor4         
##  Length:651         Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     actor5            imdb_url            rt_url         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##

ggplot(movies, aes(x=imdb_num_votes, y=imdb_rating))+geom_point()

Since the IMDb rating appears to rise with the number of votes, I chose imdb_rating as the response variable, using it as a measure of movie popularity.

Part 2: Research Question

Do genre, critics score, title type, MPAA rating, audience score, runtime, and award wins or nominations have a relationship with movie popularity (imdb_rating)?

Part 3: Exploratory data analysis

expdata<-movies%>%
  select(title,genre,imdb_rating,audience_score,critics_score,runtime,best_pic_nom,best_dir_win,top200_box)

expdata%>%
  group_by(genre)%>%
  summarize(Mean=mean(imdb_rating),Median=median(imdb_rating),sd=sd(imdb_rating),IQR=IQR(imdb_rating),n())%>%
  arrange(desc(Mean))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 11 x 6
##    genre                      Mean Median    sd   IQR `n()`
##    <fct>                     <dbl>  <dbl> <dbl> <dbl> <int>
##  1 Documentary                7.65   7.6  0.410 0.5      52
##  2 Musical & Performing Arts  7.3    7.55 0.674 0.775    12
##  3 Drama                      6.67   6.8  0.880 1.2     305
##  4 Other                      6.63   6.8  1.14  0.925    16
##  5 Art House & International  6.61   6.5  0.918 1.17     14
##  6 Mystery & Suspense         6.48   6.5  0.826 1.00     59
##  7 Action & Adventure         5.97   6    1.21  1.1      65
##  8 Animation                  5.9    6.4  1.49  1.4       9
##  9 Horror                     5.76   5.9  0.874 0.7      23
## 10 Science Fiction & Fantasy  5.76   5.9  1.73  2.4       9
## 11 Comedy                     5.74   5.7  1.18  1.4      87

In this sample, Documentary and Musical & Performing Arts have the highest mean IMDb ratings (7.65 and 7.3). Documentaries also have the smallest spread (sd 0.41, IQR 0.5), which means a movie producer can more confidently expect a high IMDb rating for a documentary.

We can visualize the same result with a box plot as well.

ggplot(expdata, aes(x=genre,y=imdb_rating))+geom_boxplot()

To decide whether other variables should be added to expdata, I explored them as well.

movies %>%
  summarise(cor(thtr_rel_year, dvd_rel_year,use = "na.or.complete"))

## # A tibble: 1 x 1
##   `cor(thtr_rel_year, dvd_rel_year, use = "na.or.complete")`
##                                                        <dbl>
## 1                                                      0.660

ggplot(movies, aes(x=thtr_rel_year,y=imdb_rating))+geom_point()

ggplot(movies, aes(x=dvd_rel_year,y=imdb_rating))+geom_point()

## Warning: Removed 8 rows containing missing values (geom_point).

ggplot(movies, aes(x=director,y=imdb_rating))+geom_point()

No apparent pattern was found, so it is reasonable to exclude these variables.

ggplot(movies, aes(x=title_type,y=imdb_rating))+geom_boxplot()

This box plot again shows that documentaries have a higher median rating than the other title types. Feature films also have many low outliers, well below the first quartile, so movie producers should allow for the possibility of an IMDb rating under 4. The summary statistics show this numerically:

movies%>%
  group_by(title_type)%>%
  summarize(Mean=mean(imdb_rating),Median=median(imdb_rating),sd=sd(imdb_rating),IQR=IQR(imdb_rating),n())%>%
  arrange(desc(Mean))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 6
##   title_type    Mean Median    sd   IQR `n()`
##   <fct>        <dbl>  <dbl> <dbl> <dbl> <int>
## 1 Documentary   7.67    7.7 0.410  0.4     55
## 2 Feature Film  6.39    6.5 1.06   1.25   591
## 3 TV Movie      6.04    7.3 1.92   3.2      5
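
The low outliers mentioned for feature films are the points a box plot draws individually. A minimal sketch of the usual 1.5 × IQR rule behind that, using a hypothetical vector of ratings rather than the movies data:

```r
# Hypothetical ratings, for illustration only (not taken from the movies data)
ratings <- c(1.9, 3.4, 5.9, 6.2, 6.5, 6.6, 6.8, 7.0, 7.3, 7.6)

# First quartile and the lower fence of the 1.5 * IQR rule
q1 <- unname(quantile(ratings, 0.25))
lower_fence <- q1 - 1.5 * IQR(ratings)

# Ratings below the fence are the ones a box plot would flag as low outliers
low_outliers <- ratings[ratings < lower_fence]
print(lower_fence)
print(low_outliers)
```

Counting such points per title type would give a rough idea of how often a very low rating occurs.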

Because there are only 5 TV movies in the sample, I did not interpret the TV movie box plot.

Let’s visualize other variables too.

ggplot(movies,aes(x=critics_rating,y=critics_score))+geom_boxplot()

cor(expdata$imdb_rating, expdata$critics_score)  # IMDb rating and critics score are strongly correlated

## [1] 0.7650355

cor(expdata$imdb_rating,expdata$audience_score)

## [1] 0.8648652

# IMDb rating and audience score are strongly correlated

ggplot(data=movies,aes(x=factor(imdb_rating),fill=top200_box))+geom_bar()

I expected higher IMDb ratings for movies in the top 200 box office list, but the chart does not show that, so it seems I should exclude this variable too.



ggplot(data = movies, aes(x = runtime, y = imdb_rating)) + geom_point() + stat_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

There are outliers, and some of them look influential, so we should also consider the minimum, maximum, and average runtime of the movies.

expdata%>%
  group_by(genre)%>%
  summarize(Mean=mean(runtime),Median=median(runtime),sd=sd(runtime),n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 11 x 5
##    genre                      Mean Median    sd `n()`
##    <fct>                     <dbl>  <dbl> <dbl> <int>
##  1 Action & Adventure        104.    103  16.6     65
##  2 Animation                  87.2    87  12.3      9
##  3 Art House & International 102.    105  13.5     14
##  4 Comedy                     96.9    94  11.7     87
##  5 Documentary                NA      NA  NA       52
##  6 Drama                     111.    107  17.7    305
##  7 Horror                     92.1    91   8.04    23
##  8 Musical & Performing Arts 114.    116. 15.1     12
##  9 Mystery & Suspense        110.    110  18.2     59
## 10 Other                     111.    109  20.7     16
## 11 Science Fiction & Fantasy 101      95  24.1      9
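
The NA entries in the Documentary row arise because one documentary has a missing runtime, and mean(), median() and sd() return NA when any input is NA. A small sketch of how na.rm = TRUE avoids this, using a toy data frame with made-up values (the n_missing column is my own addition for illustration):

```r
library(dplyr)

# Toy data (hypothetical values): one documentary has a missing runtime
toy <- data.frame(
  genre   = c("Documentary", "Documentary", "Drama", "Drama"),
  runtime = c(78, NA, 139, 101)
)

toy_summary <- toy %>%
  group_by(genre) %>%
  summarise(
    mean_runtime   = mean(runtime, na.rm = TRUE),    # ignore missing values
    median_runtime = median(runtime, na.rm = TRUE),
    n_missing      = sum(is.na(runtime)),            # how many values were skipped
    n              = n()
  )

print(toy_summary)
```

Reporting the number of skipped values next to the statistics keeps the missing data visible instead of silently dropping it.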

summary(expdata$runtime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    39.0    92.0   103.0   105.8   115.8   267.0       1

runtimelm<- lm(imdb_rating~+runtime,data=expdata)
summary(runtimelm)

## 
## Call:
## lm(formula = imdb_rating ~ +runtime, data = expdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3099 -0.5791  0.0714  0.7572  2.2500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.907873   0.227163  21.605  < 2e-16 ***
## runtime     0.014965   0.002111   7.088 3.56e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.046 on 648 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07195,    Adjusted R-squared:  0.07052 
## F-statistic: 50.24 on 1 and 648 DF,  p-value: 3.564e-12

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

linreg <- lm(imdb_rating ~+genre+runtime,
             data = movies)
plot(linreg$residuals)

hist(linreg$fitted.values)

Anova(linreg)

## Anova Table (Type II tests)
## 
## Response: imdb_rating
##           Sum Sq  Df F value    Pr(>F)    
## genre     156.41  10  18.063 < 2.2e-16 ***
## runtime    37.90   1  43.772  7.82e-11 ***
## Residuals 552.45 638                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although the p-values are small, the histogram of fitted values does not look normally distributed, so I did not keep this model.
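
A note on this check: the normality condition in linear regression refers to the residuals, not the fitted values, whose shape mostly mirrors the predictors. The sketch below uses synthetic data (not the movies data) to show how the two histograms can differ; the skewness helper is my own and only there to put a number on the shapes.

```r
set.seed(1)

# Synthetic data: a right-skewed predictor with normally distributed errors
x <- rexp(200, rate = 1 / 100)
y <- 5 + 0.01 * x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)

# Simple sample skewness (hypothetical helper, for illustration)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Fitted values inherit the skewed shape of the predictor
hist(fitted(fit), main = "Fitted values")

# Residuals are what the normality condition is about
hist(resid(fit), main = "Residuals")
qqnorm(resid(fit))
qqline(resid(fit))

print(c(fitted = skew(fitted(fit)), residuals = skew(resid(fit))))
```

Applied to the models here, the same idea would be hist(linreg$residuals) together with a normal Q-Q plot.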

ggplot(movies, aes(x=audience_score,y=imdb_rating))+geom_point()

a <- lm(imdb_rating ~ +genre + runtime + best_pic_nom + best_dir_win + top200_box, data = movies)
plot(a$residuals)

hist(a$fitted.values)

Anova(a)

## Anova Table (Type II tests)
## 
## Response: imdb_rating
##              Sum Sq  Df F value    Pr(>F)    
## genre        161.87  10 19.5674 < 2.2e-16 ***
## runtime       16.95   1 20.4900 7.158e-06 ***
## best_pic_nom  14.23   1 17.2039 3.814e-05 ***
## best_dir_win   4.79   1  5.7872   0.01643 *  
## top200_box     5.24   1  6.3341   0.01209 *  
## Residuals    525.31 635                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(a)#Multiple R-squared:  0.3123, Adjusted R-squared:  0.2971

## 
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + best_pic_nom + 
##     best_dir_win + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8389 -0.4875  0.0627  0.5664  2.2026 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.926778   0.240412  20.493  < 2e-16 ***
## genreAnimation                  0.154097   0.325551   0.473  0.63613    
## genreArt House & International  0.728259   0.268766   2.710  0.00692 ** 
## genreComedy                    -0.125789   0.150518  -0.836  0.40364    
## genreDocumentary                1.818039   0.171725  10.587  < 2e-16 ***
## genreDrama                      0.616690   0.126321   4.882 1.33e-06 ***
## genreHorror                    -0.046741   0.222500  -0.210  0.83368    
## genreMusical & Performing Arts  1.275043   0.287253   4.439 1.07e-05 ***
## genreMystery & Suspense         0.442219   0.164919   2.681  0.00752 ** 
## genreOther                      0.492119   0.255351   1.927  0.05440 .  
## genreScience Fiction & Fantasy -0.227645   0.323775  -0.703  0.48225    
## runtime                         0.009391   0.002075   4.527 7.16e-06 ***
## best_pic_nomyes                 0.859731   0.207276   4.148 3.81e-05 ***
## best_dir_winyes                 0.359096   0.149271   2.406  0.01643 *  
## top200_boxyes                   0.612064   0.243195   2.517  0.01209 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9095 on 635 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3123, Adjusted R-squared:  0.2971 
## F-statistic: 20.59 on 14 and 635 DF,  p-value: < 2.2e-16

Again, the histogram does not look normal.

linreg <- lm(imdb_rating ~ +genre + critics_score + runtime,
             data = movies)
plot(linreg$residuals)

hist(linreg$fitted.values)

Anova(linreg)

## Anova Table (Type II tests)
## 
## Response: imdb_rating
##                Sum Sq  Df  F value    Pr(>F)    
## genre          21.684  10   4.9199 7.079e-07 ***
## critics_score 271.699   1 616.4520 < 2.2e-16 ***
## runtime        12.928   1  29.3320 8.652e-08 ***
## Residuals     280.755 637                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(linreg)

## 
## Call:
## lm(formula = imdb_rating ~ +genre + critics_score + runtime, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77547 -0.34271  0.03323  0.40747  1.83491 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.095442   0.170768  23.982  < 2e-16 ***
## genreAnimation                 -0.166659   0.237643  -0.701   0.4834    
## genreArt House & International  0.394428   0.195927   2.013   0.0445 *  
## genreComedy                    -0.159168   0.109287  -1.456   0.1458    
## genreDocumentary                0.581629   0.133603   4.353 1.56e-05 ***
## genreDrama                      0.114409   0.093403   1.225   0.2211    
## genreHorror                    -0.183407   0.162012  -1.132   0.2580    
## genreMusical & Performing Arts  0.347183   0.211859   1.639   0.1018    
## genreMystery & Suspense         0.112592   0.120381   0.935   0.3500    
## genreOther                      0.001071   0.186950   0.006   0.9954    
## genreScience Fiction & Fantasy -0.413202   0.236343  -1.748   0.0809 .  
## critics_score                   0.025661   0.001034  24.828  < 2e-16 ***
## runtime                         0.007824   0.001445   5.416 8.65e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6639 on 637 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6324, Adjusted R-squared:  0.6255 
## F-statistic: 91.34 on 12 and 637 DF,  p-value: < 2.2e-16

b <- lm(imdb_rating ~ +genre + critics_score + audience_score + runtime,
        data = movies)
plot(b$residuals)

hist(b$fitted.values)

Anova(b)

## Anova Table (Type II tests)
## 
## Response: imdb_rating
##                 Sum Sq  Df  F value    Pr(>F)    
## genre           10.169  10   4.6892 1.747e-06 ***
## critics_score   26.702   1 123.1272 < 2.2e-16 ***
## audience_score 142.830   1 658.6135 < 2.2e-16 ***
## runtime          5.849   1  26.9706 2.784e-07 ***
## Residuals      137.925 636                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(b)

## 
## Call:
## lm(formula = imdb_rating ~ +genre + critics_score + audience_score + 
##     runtime, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.34430 -0.20090  0.03524  0.27085  1.17364 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.1675348  0.1251241  25.315  < 2e-16 ***
## genreAnimation                 -0.3681453  0.1668808  -2.206   0.0277 *  
## genreArt House & International  0.1997289  0.1376430   1.451   0.1473    
## genreComedy                    -0.1410076  0.0766630  -1.839   0.0663 .  
## genreDocumentary                0.2611971  0.0945446   2.763   0.0059 ** 
## genreDrama                      0.0573713  0.0655556   0.875   0.3818    
## genreHorror                     0.0953283  0.1141619   0.835   0.4040    
## genreMusical & Performing Arts  0.0156689  0.1491699   0.105   0.9164    
## genreMystery & Suspense         0.2613679  0.0846405   3.088   0.0021 ** 
## genreOther                     -0.0599035  0.1311583  -0.457   0.6480    
## genreScience Fiction & Fantasy -0.1913924  0.1660092  -1.153   0.2494    
## critics_score                   0.0104037  0.0009376  11.096  < 2e-16 ***
## audience_score                  0.0339006  0.0013210  25.663  < 2e-16 ***
## runtime                         0.0052878  0.0010182   5.193 2.78e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4657 on 636 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8194, Adjusted R-squared:  0.8157 
## F-statistic:   222 on 13 and 636 DF,  p-value: < 2.2e-16

Part 4: Modeling

model <- expdata %>% 
  select(genre, critics_score, audience_score, runtime)

ggpairs(model, columns = 1:4)

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing non-finite values (stat_density).

Because audience score and critics score are strongly related (collinear), I will keep only audience_score.
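
One way to put a number on collinearity is the correlation between the two predictors, or the variance inflation factor, VIF = 1 / (1 - R^2). A minimal sketch with synthetic stand-ins for the two scores (these are simulated values, not the movies data):

```r
set.seed(42)

# Synthetic stand-ins for two correlated scores
critics  <- runif(300, 0, 100)
audience <- 20 + 0.6 * critics + rnorm(300, sd = 12)

# Pairwise correlation between the two candidate predictors
r <- cor(critics, audience)

# VIF for audience given critics: regress one predictor on the other
r2  <- summary(lm(audience ~ critics))$r.squared
vif <- 1 / (1 - r2)

print(c(correlation = r, vif = vif))
```

With only two predictors the R^2 in the VIF is just the squared correlation; with more predictors, car::vif() on the fitted model would report one value per term.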

Fit the model

c<-lm(imdb_rating~+genre+runtime,data=expdata)
summary(c)

## 
## Call:
## lm(formula = imdb_rating ~ +genre + runtime, data = expdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9098 -0.5221  0.0525  0.5921  2.4074 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.596099   0.237685  19.337  < 2e-16 ***
## genreAnimation                  0.148778   0.332619   0.447   0.6548    
## genreArt House & International  0.665463   0.274197   2.427   0.0155 *  
## genreComedy                    -0.135125   0.153177  -0.882   0.3780    
## genreDocumentary                1.777018   0.174684  10.173  < 2e-16 ***
## genreDrama                      0.609926   0.127896   4.769 2.30e-06 ***
## genreHorror                    -0.055354   0.226971  -0.244   0.8074    
## genreMusical & Performing Arts  1.197458   0.293050   4.086 4.94e-05 ***
## genreMystery & Suspense         0.424538   0.167812   2.530   0.0117 *  
## genreOther                      0.562645   0.260116   2.163   0.0309 *  
## genreScience Fiction & Fantasy -0.178132   0.331007  -0.538   0.5907    
## runtime                         0.013243   0.002002   6.616 7.82e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9305 on 638 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2767, Adjusted R-squared:  0.2643 
## F-statistic: 22.19 on 11 and 638 DF,  p-value: < 2.2e-16

d<-lm(imdb_rating~+genre+runtime+audience_score,data=expdata)
summary(d)

## 
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + audience_score, 
##     data = expdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8358 -0.1802  0.0599  0.3101  1.0552 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.034492   0.135964  22.318  < 2e-16 ***
## genreAnimation                 -0.346923   0.182165  -1.904   0.0573 .  
## genreArt House & International  0.212049   0.150255   1.411   0.1587    
## genreComedy                    -0.130200   0.083683  -1.556   0.1202    
## genreDocumentary                0.463115   0.101281   4.573 5.79e-06 ***
## genreDrama                      0.161850   0.070822   2.285   0.0226 *  
## genreHorror                     0.202791   0.124177   1.633   0.1029    
## genreMusical & Performing Arts  0.130890   0.162448   0.806   0.4207    
## genreMystery & Suspense         0.377776   0.091686   4.120 4.28e-05 ***
## genreOther                      0.059508   0.142698   0.417   0.6768    
## genreScience Fiction & Fantasy -0.073596   0.180855  -0.407   0.6842    
## runtime                         0.005906   0.001110   5.321 1.43e-07 ***
## audience_score                  0.043195   0.001115  38.738  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5084 on 637 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7804 
## F-statistic: 193.2 on 12 and 637 DF,  p-value: < 2.2e-16

The adjusted R-squared is higher (0.7804 versus 0.2643), so next I check the residuals against the fitted values.

ggplot(data=d, aes(x=.fitted,y=.resid))+
  geom_point()+
  geom_hline(yintercept=0,linetype='dashed')

Check that the residuals are nearly normal

ggplot(aes(x=.resid),data=d)+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The residuals follow a slightly left-skewed distribution, so a better model may exist.

Part 5: Prediction

Movie: Old Partner (Documentary). IMDb rating 7.8, audience score 86, critics score 91, runtime 78.

Predict the IMDb rating for a film with these features.

predmovie <- data.frame(genre = "Documentary", audience_score = 86, runtime=78)
predict(d, predmovie)

##        1 
## 7.673056

This is close to the actual value of 7.8.

predict(d, predmovie, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 7.673056 6.664157 8.681955

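With 95% confidence, a documentary with an audience score of 86 and a runtime of 78 minutes is predicted to have an IMDb rating between 6.7 and 8.7. Note that this is a prediction interval for a single movie, not a confidence interval for the average rating. A narrower interval can be obtained by lowering the level, here to 66%.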

predict(d, predmovie, interval = "prediction", level = 0.66)

##        fit      lwr      upr
## 1 7.673056 7.182462 8.163651
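
The difference between a confidence interval (for the average rating of all such movies) and a prediction interval (for one movie) can be seen by asking predict() for both. A short sketch on synthetic data, with made-up variable names score and rating:

```r
set.seed(7)

# Synthetic data (not the movies data)
score  <- runif(150, 10, 100)
rating <- 3 + 0.04 * score + rnorm(150, sd = 0.5)
fit <- lm(rating ~ score)

new_obs <- data.frame(score = 86)

# Interval for the mean rating of all films with score = 86
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)

# Interval for the rating of one new film with score = 86
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

The prediction interval is wider because it adds the scatter of individual movies around the regression line to the uncertainty in the line itself.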

Part 6: Conclusion

I conclude that the important variables for explaining popularity, chosen for their low p-values, are runtime, audience score, critics score, and genre. Critics score was left out of the final model because it is collinear with audience score.

I evaluated my linear regression models by checking linearity, constant variance (homoscedasticity), and the normality of the residuals.

I improved the model by adding only variables that increase the adjusted R-squared, not merely the R-squared.
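
For reference, the adjusted R-squared penalizes R-squared for the number of predictors: 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below applies the formula to the values printed by summary(d), taking n = 650 complete rows and p = 12 slope coefficients from the F-statistic line; the function name adj_r2 is my own.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values read from the summary(d) output above
result <- adj_r2(r2 = 0.7845, n = 650, p = 12)
print(round(result, 4))  # about 0.7804, matching the reported value
```

Because the penalty grows with p, a variable that adds almost nothing to R-squared can lower the adjusted value, which is why it is the better guide when comparing models of different sizes.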