Part 1: Data
This dataset consists of 651 randomly sampled movies. Because this is an observational study, causation cannot be inferred, but strong correlations between variables can be identified.
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
## title title_type genre runtime
## Length:651 Documentary : 55 Drama :305 Min. : 39.0
## Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:115.8
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## mpaa_rating studio thtr_rel_year
## G : 19 Paramount Pictures : 37 Min. :1970
## NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
## PG :118 Sony Pictures Home Entertainment: 27 Median :2000
## PG-13 :133 Universal Pictures : 23 Mean :1998
## R :329 Warner Home Video : 19 3rd Qu.:2007
## Unrated: 50 (Other) :507 Max. :2014
## NA's : 8
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## Min. : 1.00 Min. : 1.00 Min. :1991 Min. : 1.000
## 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001 1st Qu.: 3.000
## Median : 7.00 Median :15.00 Median :2004 Median : 6.000
## Mean : 6.74 Mean :14.42 Mean :2004 Mean : 6.333
## 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008 3rd Qu.: 9.000
## Max. :12.00 Max. :31.00 Max. :2015 Max. :12.000
## NA's :8 NA's :8
## dvd_rel_day imdb_rating imdb_num_votes critics_rating
## Min. : 1.00 Min. :1.900 Min. : 180 Certified Fresh:135
## 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546 Fresh :209
## Median :15.00 Median :6.600 Median : 15116 Rotten :307
## Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :31.00 Max. :9.000 Max. :893008
## NA's :8
## critics_score audience_rating audience_score best_pic_nom best_pic_win
## Min. : 1.00 Spilled:275 Min. :11.00 no :629 no :644
## 1st Qu.: 33.00 Upright:376 1st Qu.:46.00 yes: 22 yes: 7
## Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_actor_win best_actress_win best_dir_win top200_box director
## no :558 no :579 no :608 no :636 Length:651
## yes: 93 yes: 72 yes: 43 yes: 15 Class :character
## Mode :character
##
##
##
##
## actor1 actor2 actor3 actor4
## Length:651 Length:651 Length:651 Length:651
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## actor5 imdb_url rt_url
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
Since the IMDb rating moves together with the number of votes, I chose imdb_rating as my response variable to measure movie popularity.
Part 2: Research Question
Do genre, critics score, title type, MPAA rating, audience score, runtime, and winning or being nominated for an award have a relationship with movie popularity (imdb_rating)?
Part 3: Exploratory data analysis
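library(dplyr)  # provides %>%, select(), group_by(), summarize(), arrange()
# Keep only the variables explored in this section
expdata <- movies %>%
  select(title, genre, imdb_rating, audience_score, critics_score, runtime, best_pic_nom, best_dir_win, top200_box)
# Summary statistics of imdb_rating per genre, highest mean first
expdata %>%
  group_by(genre) %>%
  summarize(Mean = mean(imdb_rating), Median = median(imdb_rating), sd = sd(imdb_rating), IQR = IQR(imdb_rating), n()) %>%
  arrange(desc(Mean))
## `summarise()` ungrouping output (override with `.groups` argument)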
## # A tibble: 11 x 6
## genre Mean Median sd IQR `n()`
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Documentary 7.65 7.6 0.410 0.5 52
## 2 Musical & Performing Arts 7.3 7.55 0.674 0.775 12
## 3 Drama 6.67 6.8 0.880 1.2 305
## 4 Other 6.63 6.8 1.14 0.925 16
## 5 Art House & International 6.61 6.5 0.918 1.17 14
## 6 Mystery & Suspense 6.48 6.5 0.826 1.00 59
## 7 Action & Adventure 5.97 6 1.21 1.1 65
## 8 Animation 5.9 6.4 1.49 1.4 9
## 9 Horror 5.76 5.9 0.874 0.7 23
## 10 Science Fiction & Fantasy 5.76 5.9 1.73 2.4 9
## 11 Comedy 5.74 5.7 1.18 1.4 87
We can infer that Documentaries and Musical & Performing Arts have higher IMDb ratings than the other genres. Documentaries also have the lowest spread (sd 0.41, IQR 0.5), which means a movie producer can more confidently expect a high median IMDb rating.
The same result can also be visualized with a box plot.
To decide whether other variables should be added to expdata, I explored them as well.
## # A tibble: 1 x 1
## `cor(thtr_rel_year, dvd_rel_year, use = "na.or.complete")`
## <dbl>
## 1 0.660
## Warning: Removed 8 rows containing missing values (geom_point).
No apparent pattern was found, so it is fine to exclude these variables.
This box plot once again shows that documentaries have a higher median than the other title types. Feature films also have many outliers below the first quartile, which means movie producers should consider the possibility of an IMDb rating under 4. To show this statistically, we can write
movies %>%
  group_by(title_type) %>%
  summarize(Mean = mean(imdb_rating), Median = median(imdb_rating), sd = sd(imdb_rating), IQR = IQR(imdb_rating), n()) %>%
  arrange(desc(Mean))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 6
## title_type Mean Median sd IQR `n()`
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Documentary 7.67 7.7 0.410 0.4 55
## 2 Feature Film 6.39 6.5 1.06 1.25 591
## 3 TV Movie 6.04 7.3 1.92 3.2 5
Given the number of TV movies in the sample, I did not interpret the TV movie box plot, because the sample size is too small (5).
Let’s visualize other variables too
## [1] 0.7650355
## [1] 0.8648652
I was actually expecting higher IMDb ratings for top-200 box office movies, but it seems I should exclude this variable too.
```r
# Scatter plot of rating against runtime with a fitted least-squares line
ggplot(data = movies, aes(x = runtime, y = imdb_rating)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
```
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
There are outliers, some of them influential, so the minimum, maximum and average runtime of the movies should also be considered.
```r
# Runtime statistics per genre; runtime has one missing value, so the
# Documentary row is NA unless na.rm = TRUE is passed to mean/median/sd
expdata %>%
  group_by(genre) %>%
  summarize(Mean = mean(runtime), Median = median(runtime), sd = sd(runtime), n())
```
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 11 x 5
## genre Mean Median sd `n()`
## <fct> <dbl> <dbl> <dbl> <int>
## 1 Action & Adventure 104. 103 16.6 65
## 2 Animation 87.2 87 12.3 9
## 3 Art House & International 102. 105 13.5 14
## 4 Comedy 96.9 94 11.7 87
## 5 Documentary NA NA NA 52
## 6 Drama 111. 107 17.7 305
## 7 Horror 92.1 91 8.04 23
## 8 Musical & Performing Arts 114. 116. 15.1 12
## 9 Mystery & Suspense 110. 110 18.2 59
## 10 Other 111. 109 20.7 16
## 11 Science Fiction & Fantasy 101 95 24.1 9
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 39.0 92.0 103.0 105.8 115.8 267.0 1
##
## Call:
## lm(formula = imdb_rating ~ +runtime, data = expdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3099 -0.5791 0.0714 0.7572 2.2500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.907873 0.227163 21.605 < 2e-16 ***
## runtime 0.014965 0.002111 7.088 3.56e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.046 on 648 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.07195, Adjusted R-squared: 0.07052
## F-statistic: 50.24 on 1 and 648 DF, p-value: 3.564e-12
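To make the runtime coefficient concrete, the fitted line above can be evaluated by hand. This is only a sketch: the coefficients are copied from the summary output, and the helper name `predict_rating` is hypothetical.

```r
# Coefficients copied from the lm(imdb_rating ~ runtime) summary above
intercept <- 4.907873
slope     <- 0.014965

# Hypothetical helper: fitted IMDb rating for a runtime given in minutes
predict_rating <- function(runtime) intercept + slope * runtime

# Fitted ratings for a 90-minute and a 120-minute movie
fitted_90  <- predict_rating(90)
fitted_120 <- predict_rating(120)

# 30 extra minutes shift the fitted rating by 30 * slope
difference <- fitted_120 - fitted_90
print(c(fitted_90 = fitted_90, fitted_120 = fitted_120, difference = difference))
```

The slope is statistically significant, but with a multiple R-squared of about 0.07 runtime alone explains only around 7% of the variation in imdb_rating.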
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Anova Table (Type II tests)
##
## Response: imdb_rating
## Sum Sq Df F value Pr(>F)
## genre 156.41 10 18.063 < 2.2e-16 ***
## runtime 37.90 1 43.772 7.82e-11 ***
## Residuals 552.45 638
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although the p-values look good, the histogram of the residuals does not look normally distributed, so this model was not kept.
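The residual check described here is not shown in the report, so the following is a minimal sketch of how such a check could look. It uses simulated data (an assumption, not the movies dataset), and the variable names are illustrative only.

```r
set.seed(1)

# Simulated stand-in data (assumption): runtimes and ratings with normal errors
n <- 200
runtime <- runif(n, min = 60, max = 180)
rating  <- 5 + 0.01 * runtime + rnorm(n, sd = 0.5)

fit <- lm(rating ~ runtime)
res <- resid(fit)

# Histogram: should look roughly bell-shaped if the errors are normal
hist(res, breaks = 30, main = "Residuals", xlab = "Residual")

# Normal Q-Q plot: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(res)
print(sw$p.value)
```

A histogram of the residuals is usually easier to read for this purpose than `plot(model$residuals)`, which shows the residuals in observation order rather than their distribution.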
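library(car)  # provides Anova()
a <- lm(imdb_rating ~ +genre + runtime + best_pic_nom + best_dir_win + top200_box, data = movies)
plot(a$residuals)  # residuals in observation order
Anova(a)           # Type II tests, printed below
summary(a)         # coefficient table, printed below
## Anova Table (Type II tests)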
##
## Response: imdb_rating
## Sum Sq Df F value Pr(>F)
## genre 161.87 10 19.5674 < 2.2e-16 ***
## runtime 16.95 1 20.4900 7.158e-06 ***
## best_pic_nom 14.23 1 17.2039 3.814e-05 ***
## best_dir_win 4.79 1 5.7872 0.01643 *
## top200_box 5.24 1 6.3341 0.01209 *
## Residuals 525.31 635
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + best_pic_nom +
## best_dir_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8389 -0.4875 0.0627 0.5664 2.2026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.926778 0.240412 20.493 < 2e-16 ***
## genreAnimation 0.154097 0.325551 0.473 0.63613
## genreArt House & International 0.728259 0.268766 2.710 0.00692 **
## genreComedy -0.125789 0.150518 -0.836 0.40364
## genreDocumentary 1.818039 0.171725 10.587 < 2e-16 ***
## genreDrama 0.616690 0.126321 4.882 1.33e-06 ***
## genreHorror -0.046741 0.222500 -0.210 0.83368
## genreMusical & Performing Arts 1.275043 0.287253 4.439 1.07e-05 ***
## genreMystery & Suspense 0.442219 0.164919 2.681 0.00752 **
## genreOther 0.492119 0.255351 1.927 0.05440 .
## genreScience Fiction & Fantasy -0.227645 0.323775 -0.703 0.48225
## runtime 0.009391 0.002075 4.527 7.16e-06 ***
## best_pic_nomyes 0.859731 0.207276 4.148 3.81e-05 ***
## best_dir_winyes 0.359096 0.149271 2.406 0.01643 *
## top200_boxyes 0.612064 0.243195 2.517 0.01209 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9095 on 635 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3123, Adjusted R-squared: 0.2971
## F-statistic: 20.59 on 14 and 635 DF, p-value: < 2.2e-16
Again, the histogram of the residuals does not look normal.
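library(car)  # provides Anova()
linreg <- lm(imdb_rating ~ +genre + critics_score + runtime, data = movies)
plot(linreg$residuals)  # residuals in observation order
Anova(linreg)           # Type II tests, printed below
summary(linreg)         # coefficient table, printed below
## Anova Table (Type II tests)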
##
## Response: imdb_rating
## Sum Sq Df F value Pr(>F)
## genre 21.684 10 4.9199 7.079e-07 ***
## critics_score 271.699 1 616.4520 < 2.2e-16 ***
## runtime 12.928 1 29.3320 8.652e-08 ***
## Residuals 280.755 637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = imdb_rating ~ +genre + critics_score + runtime,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77547 -0.34271 0.03323 0.40747 1.83491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.095442 0.170768 23.982 < 2e-16 ***
## genreAnimation -0.166659 0.237643 -0.701 0.4834
## genreArt House & International 0.394428 0.195927 2.013 0.0445 *
## genreComedy -0.159168 0.109287 -1.456 0.1458
## genreDocumentary 0.581629 0.133603 4.353 1.56e-05 ***
## genreDrama 0.114409 0.093403 1.225 0.2211
## genreHorror -0.183407 0.162012 -1.132 0.2580
## genreMusical & Performing Arts 0.347183 0.211859 1.639 0.1018
## genreMystery & Suspense 0.112592 0.120381 0.935 0.3500
## genreOther 0.001071 0.186950 0.006 0.9954
## genreScience Fiction & Fantasy -0.413202 0.236343 -1.748 0.0809 .
## critics_score 0.025661 0.001034 24.828 < 2e-16 ***
## runtime 0.007824 0.001445 5.416 8.65e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6639 on 637 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6324, Adjusted R-squared: 0.6255
## F-statistic: 91.34 on 12 and 637 DF, p-value: < 2.2e-16
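library(car)  # provides Anova()
b <- lm(imdb_rating ~ +genre + critics_score + audience_score + runtime, data = movies)
plot(b$residuals)  # residuals in observation order
Anova(b)           # Type II tests, printed below
summary(b)         # coefficient table, printed below
## Anova Table (Type II tests)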
##
## Response: imdb_rating
## Sum Sq Df F value Pr(>F)
## genre 10.169 10 4.6892 1.747e-06 ***
## critics_score 26.702 1 123.1272 < 2.2e-16 ***
## audience_score 142.830 1 658.6135 < 2.2e-16 ***
## runtime 5.849 1 26.9706 2.784e-07 ***
## Residuals 137.925 636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = imdb_rating ~ +genre + critics_score + audience_score +
## runtime, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.34430 -0.20090 0.03524 0.27085 1.17364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1675348 0.1251241 25.315 < 2e-16 ***
## genreAnimation -0.3681453 0.1668808 -2.206 0.0277 *
## genreArt House & International 0.1997289 0.1376430 1.451 0.1473
## genreComedy -0.1410076 0.0766630 -1.839 0.0663 .
## genreDocumentary 0.2611971 0.0945446 2.763 0.0059 **
## genreDrama 0.0573713 0.0655556 0.875 0.3818
## genreHorror 0.0953283 0.1141619 0.835 0.4040
## genreMusical & Performing Arts 0.0156689 0.1491699 0.105 0.9164
## genreMystery & Suspense 0.2613679 0.0846405 3.088 0.0021 **
## genreOther -0.0599035 0.1311583 -0.457 0.6480
## genreScience Fiction & Fantasy -0.1913924 0.1660092 -1.153 0.2494
## critics_score 0.0104037 0.0009376 11.096 < 2e-16 ***
## audience_score 0.0339006 0.0013210 25.663 < 2e-16 ***
## runtime 0.0052878 0.0010182 5.193 2.78e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4657 on 636 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8194, Adjusted R-squared: 0.8157
## F-statistic: 222 on 13 and 636 DF, p-value: < 2.2e-16
Part 4: Modeling
library(GGally)  # provides ggpairs()
# Candidate predictors (genre was listed twice in the original call)
model <- expdata %>%
  select(genre, critics_score, audience_score, runtime)
ggpairs(model, columns = 1:4)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_density).
Since audience score and critics score are correlated (collinear), I will only include audience score.
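One common way to quantify collinearity is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below computes it by hand on simulated data that is correlated by construction (an assumption, not the movies dataset); `vif_manual` is a hypothetical helper.

```r
set.seed(42)

# Simulated predictors (assumption): critics_score is built to track audience_score
n <- 500
audience_score <- rnorm(n, mean = 62, sd = 20)
critics_score  <- 0.9 * audience_score + rnorm(n, sd = 15)
runtime        <- rnorm(n, mean = 105, sd = 20)
sim <- data.frame(audience_score, critics_score, runtime)

# Hypothetical helper: VIF of one predictor = 1 / (1 - R^2),
# with R^2 from regressing that predictor on all the others
vif_manual <- function(data, target) {
  others <- setdiff(names(data), target)
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(sim), function(v) vif_manual(sim, v))
print(round(cor(sim), 2))
print(round(vifs, 2))
```

A VIF close to 1 means a predictor is nearly unrelated to the others, while larger values indicate that its coefficient is estimated less precisely because of overlap with other predictors.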
Fit the model
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime, data = expdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9098 -0.5221 0.0525 0.5921 2.4074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.596099 0.237685 19.337 < 2e-16 ***
## genreAnimation 0.148778 0.332619 0.447 0.6548
## genreArt House & International 0.665463 0.274197 2.427 0.0155 *
## genreComedy -0.135125 0.153177 -0.882 0.3780
## genreDocumentary 1.777018 0.174684 10.173 < 2e-16 ***
## genreDrama 0.609926 0.127896 4.769 2.30e-06 ***
## genreHorror -0.055354 0.226971 -0.244 0.8074
## genreMusical & Performing Arts 1.197458 0.293050 4.086 4.94e-05 ***
## genreMystery & Suspense 0.424538 0.167812 2.530 0.0117 *
## genreOther 0.562645 0.260116 2.163 0.0309 *
## genreScience Fiction & Fantasy -0.178132 0.331007 -0.538 0.5907
## runtime 0.013243 0.002002 6.616 7.82e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9305 on 638 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2767, Adjusted R-squared: 0.2643
## F-statistic: 22.19 on 11 and 638 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = imdb_rating ~ +genre + runtime + audience_score,
## data = expdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8358 -0.1802 0.0599 0.3101 1.0552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.034492 0.135964 22.318 < 2e-16 ***
## genreAnimation -0.346923 0.182165 -1.904 0.0573 .
## genreArt House & International 0.212049 0.150255 1.411 0.1587
## genreComedy -0.130200 0.083683 -1.556 0.1202
## genreDocumentary 0.463115 0.101281 4.573 5.79e-06 ***
## genreDrama 0.161850 0.070822 2.285 0.0226 *
## genreHorror 0.202791 0.124177 1.633 0.1029
## genreMusical & Performing Arts 0.130890 0.162448 0.806 0.4207
## genreMystery & Suspense 0.377776 0.091686 4.120 4.28e-05 ***
## genreOther 0.059508 0.142698 0.417 0.6768
## genreScience Fiction & Fantasy -0.073596 0.180855 -0.407 0.6842
## runtime 0.005906 0.001110 5.321 1.43e-07 ***
## audience_score 0.043195 0.001115 38.738 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5084 on 637 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7845, Adjusted R-squared: 0.7804
## F-statistic: 193.2 on 12 and 637 DF, p-value: < 2.2e-16
The adjusted R-squared is higher (0.7804 versus 0.2643), so the next step is to check the residuals.
Check for linearity.
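The adjusted R-squared values in the two summaries can be reproduced from the reported R-squared with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1). In this sketch n = 650 (651 movies minus one with a missing runtime) and p is the number of estimated slopes; the helper name `adj_r2` is hypothetical.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n and number of slope coefficients p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# imdb_rating ~ genre + runtime: 10 genre dummies + runtime = 11 slopes
adj_small <- adj_r2(0.2767, n = 650, p = 11)

# imdb_rating ~ genre + runtime + audience_score: 12 slopes
adj_large <- adj_r2(0.7845, n = 650, p = 12)

print(round(c(adj_small = adj_small, adj_large = adj_large), 4))
```

Because the penalty term grows with p, the adjusted value only rises when a new variable improves the fit by more than would be expected from adding an arbitrary predictor.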
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The residuals follow a slightly left-skewed distribution, so a better model could probably be found.
Part 5: Prediction
Movie: Old Partner. Genre: Documentary; IMDb rating: 7.8; audience score: 86; critics score: 91; runtime: 78.
Predict the IMDb rating for a movie with these features.
# d is assumed to be the model fitted above: imdb_rating ~ genre + runtime + audience_score
predmovie <- data.frame(genre = "Documentary", audience_score = 86, runtime = 78)
predict(d, predmovie)
##        1
## 7.673056
This is pretty close to the actual value of 7.8.
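The same prediction can be checked by hand from the coefficient table of the model with genre, runtime and audience_score. This is a sketch with coefficients copied (and therefore rounded) from the summary output, so it agrees with `predict()` only to about four decimal places.

```r
# Coefficients copied from the summary of
# imdb_rating ~ genre + runtime + audience_score
intercept        <- 3.034492
coef_documentary <- 0.463115   # genreDocumentary dummy (1 for a documentary)
coef_runtime     <- 0.005906
coef_audience    <- 0.043195

# Old Partner: Documentary, runtime 78 minutes, audience score 86
manual_prediction <- intercept +
  coef_documentary * 1 +
  coef_runtime * 78 +
  coef_audience * 86

print(manual_prediction)
```

The result is about 7.673, which matches the value returned by `predict()` above.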
## fit lwr upr
## 1 7.673056 6.664157 8.681955
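With 95% confidence, the IMDb rating of a movie with audience score 86, genre Documentary and runtime 78 is predicted to lie between 6.7 and 8.7. The half-width of about 1.0 is roughly twice the residual standard error (0.51), which is what a prediction interval for a single movie gives, rather than a confidence interval for the average rating. A narrower interval can be obtained by lowering the level to 80%.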
## fit lwr upr
## 1 7.673056 7.182462 8.163651
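The calls that produced these intervals are not shown, so here is a hedged sketch of how `predict()` returns confidence and prediction intervals at different levels. It runs on simulated data (an assumption, not the movies dataset) and its numbers will not match the output above.

```r
set.seed(7)

# Simulated stand-in data (assumption): rating driven by audience score
n <- 300
audience_score <- runif(n, min = 10, max = 100)
rating <- 3 + 0.045 * audience_score + rnorm(n, sd = 0.5)

fit <- lm(rating ~ audience_score)
new_movie <- data.frame(audience_score = 86)

# Confidence interval: uncertainty about the MEAN rating at this audience score
conf_95 <- predict(fit, new_movie, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about ONE new movie's rating
pred_95 <- predict(fit, new_movie, interval = "prediction", level = 0.95)
pred_80 <- predict(fit, new_movie, interval = "prediction", level = 0.80)

print(rbind(conf_95, pred_95, pred_80))
```

The prediction interval is wider than the confidence interval because it also includes the scatter of individual movies around the fitted mean, and lowering the level from 95% to 80% narrows either kind of interval.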
Part 6: Conclusion
I conclude that the important variables for explaining popularity, chosen for their low p-values, include runtime, critics score, audience score and genre.
I evaluated my linear regression models by checking linearity, heteroscedasticity and the normality of the residuals.
I improved the credibility of the model by adding only variables that increase the adjusted R-squared, not merely the R-squared.