CourserA : Linear Regression and Modelling Project

Details

Congratulations on getting a job as a data scientist at Paramount Pictures! Please see the Data Analysis Project for your assignment. Below you will find the files that you will need.

Your boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies. This dataset is provided below, and it includes information from Rotten Tomatoes and IMDB for a random sample of movies.She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants you team to figure it all out.

As part of this project you will complete exploratory data analysis (EDA), modeling, and prediction.

More details about project can be found at coursera website https://www.coursera.org/learn/linear-regression-model/supplement/LDoCh/project-files-and-rubric.

load("D:/working on r/datasciencecoursera/movies.Rdata")
library(statsr)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Part 1: Data

The dataset information is about movies contains 651 randomly sampled movies produced and released before 2016. This data set comes from Rotten Tomatoes and IMDB.There are 32 available variables.

The data is randomly sampled and it’s big enough (651 observations) so the generalization princple is achieved on movies before 2016 and there is no bais in the data.

causality

The data is an observational and hence we can not conclude causality from it so we will be able to show association between variables and experimental data is needed for further causal analysis.

str(movies)
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

As we can see from above table that there are 32 variables, among them 9 are character 12 are factorand 10 are numerical variables Moreover like audience score, actor (1,2,…,5), URL and studio are not suitable for inclusion in analysis.

mymovies<- na.omit(movies) %>% 
  select (title_type, genre, runtime, 
         mpaa_rating, thtr_rel_year,imdb_rating, imdb_num_votes, critics_score,
         critics_rating, audience_rating, audience_score)
head(mymovies)
## # A tibble: 6 x 11
##   title_type genre runtime mpaa_rating thtr_rel_year imdb_rating imdb_num_votes
##   <fct>      <fct>   <dbl> <fct>               <dbl>       <dbl>          <int>
## 1 Feature F~ Drama      80 R                    2013         5.5            899
## 2 Feature F~ Drama     101 PG-13                2001         7.3          12285
## 3 Feature F~ Come~      84 R                    1996         7.6          22381
## 4 Feature F~ Drama     139 PG                   1993         7.2          35096
## 5 Feature F~ Horr~      90 R                    2004         5.1           2386
## 6 Feature F~ Drama     142 PG-13                1986         7.2           5016
## # ... with 4 more variables: critics_score <dbl>, critics_rating <fct>,
## #   audience_rating <fct>, audience_score <dbl>

Part 2: Research Question

What are the variables that have a good association with movies rating ? This might sound interesting to deduce whether model predict the rating accurately, if it do so then it is asserted that people like most of movies.Moroever if there weak association then we might need to explore more data.

Part 3: Exploratory Data Analysis

ggplot(data= mymovies,aes(x=genre))+geom_bar( fill='skyblue',color='black')+
        theme(axis.text.x=element_text(angle = 60, hjust = 1))+
         ggtitle("movie count by genre")

#Relation between critic score and audience score
ggplot(data=mymovies,aes(x=critics_score,y=audience_score))+geom_point()+
        geom_smooth(method=lm,se=F)+ggtitle("relationship between critic score Vs audience score")
## `geom_smooth()` using formula 'y ~ x'

#Relation between imdb vs audience score 
ggplot(data=mymovies, aes(x=imdb_rating, y=audience_score)) + geom_point() + geom_smooth(method=lm,se=F)+ggtitle("relationship between IMDB Vs Audience score")
## `geom_smooth()` using formula 'y ~ x'

# Genre vs audience score
ggplot(data=mymovies,aes(x=audience_rating, y=audience_score))+geom_point()+theme(axis.text.x=element_text(angle = 60, hjust = 1))+ geom_smooth(method=lm)+ggtitle("relationship between genre Vs audience score")
## `geom_smooth()` using formula 'y ~ x'

The average runtime of a movie is around 105 and its distribution is slightly right skewed with some outliers movies with runtime around 250 minutes. The critics scores are nearly uniformly distributed, but slightly left skewed.The score given by the audience has a similar behavior has the critics scores, with a nearly uniform left skewed distribution.

Furthermore we assert that ## Other method Firstly analyse whether three types of rating 1) imdb rating 2) rotten tomatos audience 3) rotten tomatos are corolated or not.

#here again we employ movie data as
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
moviesa <- na.omit(movies)
ggpairs(moviesa , columns = c(13,16,18))

There is a strong relation found between variables.Hence average rating is prominent column.Furthermore ID column and group are added to this column to extract average rating for each row.Rotten tomatos rating will be scaled by dividing on 10 to be the same as imdb rating.

moviesa$ID <- seq.int(nrow(moviesa))

moviesa <- moviesa %>% group_by(ID) %>% mutate(average_rating = sum(imdb_rating + (critics_score/10) + (audience_score/10) )/3 ) 
moviesa$average_rating <- round(moviesa$average_rating,2)
head( moviesa[ , c(13,16,18,length(movies))] )
## # A tibble: 6 x 4
##   imdb_rating critics_score audience_score rt_url                               
##         <dbl>         <dbl>          <dbl> <chr>                                
## 1         5.5            45             73 //www.rottentomatoes.com/m/filly_bro~
## 2         7.3            96             81 //www.rottentomatoes.com/m/dish/     
## 3         7.6            91             91 //www.rottentomatoes.com/m/waiting_f~
## 4         7.2            80             76 //www.rottentomatoes.com/m/age_of_in~
## 5         5.1            33             27 //www.rottentomatoes.com/m/10004684-~
## 6         7.2            57             76 //www.rottentomatoes.com/m/lady_jane/
moviesa <- moviesa %>% select( average_rating , genre , runtime , mpaa_rating , thtr_rel_month , best_pic_nom , best_actor_win , best_actress_win , best_dir_win , director , top200_box )
## Adding missing grouping variables: `ID`
#Now will perform exploratory data analysis
ggplot(data = moviesa , aes(x = runtime , y = average_rating, color = runtime)  ) + geom_col()

moviesa %>% group_by(genre ) %>% summarise(average_rating_by_genre = mean(average_rating)) %>% arrange(desc(average_rating_by_genre))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 11 x 2
##    genre                     average_rating_by_genre
##    <fct>                                       <dbl>
##  1 Documentary                                  8.21
##  2 Musical & Performing Arts                    7.66
##  3 Other                                        6.85
##  4 Art House & International                    6.51
##  5 Drama                                        6.47
##  6 Mystery & Suspense                           5.82
##  7 Animation                                    5.72
##  8 Science Fiction & Fantasy                    5.71
##  9 Action & Adventure                           5.20
## 10 Comedy                                       5.02
## 11 Horror                                       4.85
ggplot(data = moviesa , aes(x = thtr_rel_month , y = average_rating,fill =thtr_rel_month)  ) + geom_col( ) + scale_x_discrete(limits = c(1:12))
## Warning: Continuous limits supplied to discrete scale.
## Did you mean `limits = factor(...)` or `scale_*_continuous()`?

ggplot(data = moviesa , aes(x = mpaa_rating , y = average_rating , fill = mpaa_rating)  ) + geom_col() 

par(mfrow = c(2,3))
plot(moviesa$average_rating ~ moviesa$best_pic_nom , col = "yellow" , main = "best picture nomination")

plot(moviesa$average_rating ~ moviesa$best_actor_win , col = "orange" , main = "best actor win ")

plot(moviesa$average_rating ~ moviesa$best_actress_win , col = "steelblue" , main = "best actress win ")

plot(moviesa$average_rating ~ moviesa$best_dir_win , col = "purple" , main = "best director win ")

plot(moviesa$average_rating ~ moviesa$top200_box , col = "green" , main = "Box office top 200")

Part 4: Modelling

In this project we will use linear regression and start by fitting a model with 8 variables(described in the previous section).

Regfull<-lm(audience_score~critics_score+genre+critics_rating+imdb_rating+audience_rating+mpaa_rating+runtime
              +imdb_num_votes+title_type+thtr_rel_year,data=mymovies)
Regfull<-summary(Regfull)
Regfull
## 
## Call:
## lm(formula = audience_score ~ critics_score + genre + critics_rating + 
##     imdb_rating + audience_rating + mpaa_rating + runtime + imdb_num_votes + 
##     title_type + thtr_rel_year, data = mymovies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.7520  -4.4252   0.5607   4.1959  24.3954 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     8.334e+01  6.145e+01   1.356   0.1755    
## critics_score                   1.046e-02  2.571e-02   0.407   0.6842    
## genreAnimation                  4.025e+00  2.855e+00   1.410   0.1591    
## genreArt House & International -1.723e+00  2.275e+00  -0.758   0.4490    
## genreComedy                     1.641e+00  1.168e+00   1.405   0.1607    
## genreDocumentary                4.479e+00  2.929e+00   1.529   0.1267    
## genreDrama                     -1.170e-01  1.032e+00  -0.113   0.9097    
## genreHorror                    -1.476e+00  1.765e+00  -0.836   0.4032    
## genreMusical & Performing Arts  4.636e+00  2.404e+00   1.928   0.0543 .  
## genreMystery & Suspense        -2.761e+00  1.319e+00  -2.093   0.0368 *  
## genreOther                      3.546e-01  2.021e+00   0.175   0.8608    
## genreScience Fiction & Fantasy -1.491e-01  2.601e+00  -0.057   0.9543    
## critics_ratingFresh            -2.292e-01  9.075e-01  -0.253   0.8007    
## critics_ratingRotten           -8.477e-01  1.439e+00  -0.589   0.5562    
## imdb_rating                     9.321e+00  5.000e-01  18.642   <2e-16 ***
## audience_ratingUpright          2.016e+01  8.081e-01  24.943   <2e-16 ***
## mpaa_ratingNC-17               -8.167e+00  7.193e+00  -1.135   0.2567    
## mpaa_ratingPG                  -1.213e+00  2.037e+00  -0.595   0.5518    
## mpaa_ratingPG-13               -1.590e+00  2.121e+00  -0.750   0.4536    
## mpaa_ratingR                   -2.010e+00  2.039e+00  -0.986   0.3247    
## mpaa_ratingUnrated             -6.683e-01  2.442e+00  -0.274   0.7845    
## runtime                        -4.014e-02  1.711e-02  -2.346   0.0193 *  
## imdb_num_votes                  5.209e-06  3.175e-06   1.641   0.1014    
## title_typeFeature Film          3.378e+00  2.783e+00   1.214   0.2253    
## title_typeTV Movie              2.290e+00  4.433e+00   0.517   0.6057    
## thtr_rel_year                  -4.558e-02  3.044e-02  -1.497   0.1349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.876 on 593 degrees of freedom
## Multiple R-squared:  0.8886, Adjusted R-squared:  0.8839 
## F-statistic: 189.2 on 25 and 593 DF,  p-value: < 2.2e-16

The linear model and anova output summary reveals an adjusted R^2 value of 88.39%. Ths means that only 11.7 of the variability is accounted for in the residuals and the remainder is accounted for in the predictor varibles. This extremely high value may be suspect to error, that would invalidate the reliability of the study.

#Final model
Finalmodel<-lm(audience_score~audience_rating+imdb_rating+genre,data=mymovies)

summary(Finalmodel)
## 
## Call:
## lm(formula = audience_score ~ audience_rating + imdb_rating + 
##     genre, data = mymovies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.620  -4.502   0.627   4.364  24.975 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -12.2663     2.2718  -5.399 9.61e-08 ***
## audience_ratingUpright          20.5042     0.7937  25.834  < 2e-16 ***
## imdb_rating                      9.7107     0.3802  25.543  < 2e-16 ***
## genreAnimation                   5.5041     2.6000   2.117   0.0347 *  
## genreArt House & International  -2.2434     2.1928  -1.023   0.3067    
## genreComedy                      1.6229     1.1509   1.410   0.1590    
## genreDocumentary                 1.3614     1.4789   0.921   0.3577    
## genreDrama                      -0.6963     0.9828  -0.708   0.4790    
## genreHorror                     -1.3892     1.7145  -0.810   0.4181    
## genreMusical & Performing Arts   2.7493     2.2083   1.245   0.2136    
## genreMystery & Suspense         -3.2959     1.2818  -2.571   0.0104 *  
## genreOther                       0.7362     1.9985   0.368   0.7127    
## genreScience Fiction & Fantasy   0.5883     2.5888   0.227   0.8203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.89 on 606 degrees of freedom
## Multiple R-squared:  0.8856, Adjusted R-squared:  0.8834 
## F-statistic: 391.1 on 12 and 606 DF,  p-value: < 2.2e-16

#Multicollinearity

pairs(data=mymovies,~audience_score+critics_rating+audience_rating+imdb_rating+genre)

plot(Finalmodel$residuals~mymovies$imdb_rating,pch=20,main="Linearity Condition")
abline(h=0)

par(mfrow=c(1,2),mar=c(3,2,3,2))
hist(Finalmodel$residuals,main="histogram of residuals",col="skyblue")
qqnorm(Finalmodel$residuals,pch=20,main="Normal Probability Plot of Residuals")
qqline(Finalmodel$residuals)

par(mfrow=c(1,2))

plot(Finalmodel$residuals~Finalmodel$fitted,main='Residuals vs. Predicted (fitted) ')
plot(abs(Finalmodel$residuals)~Finalmodel$fitted,main='Absolute Residuals vs. Predicted')

plot(Finalmodel$residuals,pch=20 ,main="Independany Conditions")

Part 5 Is Prediction

Case study of Shawshank redemption

imdb_rating<-9.2
genre <- "Drama"
critics_rating <-"Fresh"
audience_score= 95
audience_rating<- "Upright"

Newestdata<- data.frame(critics_rating,audience_rating,imdb_rating,genre)

predict(Finalmodel, Newestdata)
##        1 
## 96.88009
predict(Finalmodel,Newestdata,interval="confidence")
##        fit      lwr      upr
## 1 96.88009 95.09837 98.66181
##Prediction number 2
predicted_movie <- data.frame(runtime = 120 ,
                              mpaa_rating = "PG-17" ,
                              genre = "Science Fiction & Fantasy" ,
                              best_pic_nom = "yes" ,
                              best_dir_win = "yes",
                              top200_box = "yes"  )
predict(Finalmodel , predicted_movie , interval = "prediction" , level = 0.95 )
##        fit      lwr      upr
## 1 98.16467 83.69245 112.6369