Overview

Regression diagnostics are used to evaluate the model assumptions and to investigate whether there are observations with a large, undue influence on the analysis [1]. In this vignette, I will show how to generate and interpret the residual plots in R. As an example, I will use the movie data I found on Kaggle to explore whether there is any relationship between movie budget and popularity (does a bigger movie budget result in a more popular movie?). I will then generate the residual plots and explain in detail how to interpret them.

Data

Here is the movie dataset that will be used in this Vignette. It contains information on 45,000 movies featured in the Full MovieLens dataset.

https://www.kaggle.com/rounakbanik/the-movies-dataset#movies_metadata.csv

I first load the tidyverse library:

library(tidyverse)

and then load the movie dataset:

moviesData <- read_csv("movies_metadata.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   budget = col_integer(),
##   id = col_integer(),
##   popularity = col_double(),
##   release_date = col_date(format = ""),
##   revenue = col_integer(),
##   runtime = col_double(),
##   vote_average = col_double(),
##   vote_count = col_integer()
## )

## See spec(...) for full column specifications.

I then check the dataset:

summary(moviesData)

##     adult           belongs_to_collection     budget         
##  Length:45466       Length:45466          Min.   :        0  
##  Class :character   Class :character      1st Qu.:        0  
##  Mode  :character   Mode  :character      Median :        0  
##                                           Mean   :  4224579  
##                                           3rd Qu.:        0  
##                                           Max.   :380000000  
##                                           NA's   :3          
##     genres            homepage               id           imdb_id         
##  Length:45466       Length:45466       Min.   :     2   Length:45466      
##  Class :character   Class :character   1st Qu.: 26450   Class :character  
##  Mode  :character   Mode  :character   Median : 60003   Mode  :character  
##                                        Mean   :108360                     
##                                        3rd Qu.:157328                     
##                                        Max.   :469172                     
##                                        NA's   :3                          
##  original_language  original_title       overview        
##  Length:45466       Length:45466       Length:45466      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    popularity       poster_path        production_companies
##  Min.   :  0.0000   Length:45466       Length:45466        
##  1st Qu.:  0.3859   Class :character   Class :character    
##  Median :  1.1277   Mode  :character   Mode  :character    
##  Mean   :  2.9215                                          
##  3rd Qu.:  3.6789                                          
##  Max.   :547.4883                                          
##  NA's   :6                                                 
##  production_countries  release_date           revenue         
##  Length:45466         Min.   :1874-12-09   Min.   :0.000e+00  
##  Class :character     1st Qu.:1978-10-06   1st Qu.:0.000e+00  
##  Mode  :character     Median :2001-08-30   Median :0.000e+00  
##                       Mean   :1992-05-15   Mean   :1.115e+07  
##                       3rd Qu.:2010-12-17   3rd Qu.:0.000e+00  
##                       Max.   :2020-12-16   Max.   :2.068e+09  
##                       NA's   :90           NA's   :7          
##     runtime        spoken_languages      status         
##  Min.   :   0.00   Length:45466       Length:45466      
##  1st Qu.:  85.00   Class :character   Class :character  
##  Median :  95.00   Mode  :character   Mode  :character  
##  Mean   :  94.13                                        
##  3rd Qu.: 107.00                                        
##  Max.   :1256.00                                        
##  NA's   :263                                            
##    tagline             title              video            vote_average   
##  Length:45466       Length:45466       Length:45466       Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.618  
##                                                           3rd Qu.: 6.800  
##                                                           Max.   :10.000  
##                                                           NA's   :6       
##    vote_count     
##  Min.   :    0.0  
##  1st Qu.:    3.0  
##  Median :   10.0  
##  Mean   :  109.9  
##  3rd Qu.:   34.0  
##  Max.   :14075.0  
##  NA's   :6

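It looks like some of the observations are missing budget and popularity, so let’s filter them out. Note that because the conditions below are joined with | (OR), a row is dropped only when none of them is TRUE. This filter therefore removes only the three rows in which both budget and popularity are missing; movies with a budget of 0 stay in the data, as the summary that follows shows:

moviesData <- filter(moviesData, budget != 0 | !is.na(budget) | popularity != 0 | !is.na(popularity))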
summary(moviesData)

##     adult           belongs_to_collection     budget         
##  Length:45463       Length:45463          Min.   :        0  
##  Class :character   Class :character      1st Qu.:        0  
##  Mode  :character   Mode  :character      Median :        0  
##                                           Mean   :  4224579  
##                                           3rd Qu.:        0  
##                                           Max.   :380000000  
##                                                              
##     genres            homepage               id           imdb_id         
##  Length:45463       Length:45463       Min.   :     2   Length:45463      
##  Class :character   Class :character   1st Qu.: 26450   Class :character  
##  Mode  :character   Mode  :character   Median : 60003   Mode  :character  
##                                        Mean   :108360                     
##                                        3rd Qu.:157328                     
##                                        Max.   :469172                     
##                                                                           
##  original_language  original_title       overview        
##  Length:45463       Length:45463       Length:45463      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    popularity       poster_path        production_companies
##  Min.   :  0.0000   Length:45463       Length:45463        
##  1st Qu.:  0.3859   Class :character   Class :character    
##  Median :  1.1277   Mode  :character   Mode  :character    
##  Mean   :  2.9215                                          
##  3rd Qu.:  3.6789                                          
##  Max.   :547.4883                                          
##  NA's   :3                                                 
##  production_countries  release_date           revenue         
##  Length:45463         Min.   :1874-12-09   Min.   :0.000e+00  
##  Class :character     1st Qu.:1978-10-06   1st Qu.:0.000e+00  
##  Mode  :character     Median :2001-08-30   Median :0.000e+00  
##                       Mean   :1992-05-15   Mean   :1.115e+07  
##                       3rd Qu.:2010-12-17   3rd Qu.:0.000e+00  
##                       Max.   :2020-12-16   Max.   :2.068e+09  
##                       NA's   :87           NA's   :4          
##     runtime        spoken_languages      status         
##  Min.   :   0.00   Length:45463       Length:45463      
##  1st Qu.:  85.00   Class :character   Class :character  
##  Median :  95.00   Mode  :character   Mode  :character  
##  Mean   :  94.13                                        
##  3rd Qu.: 107.00                                        
##  Max.   :1256.00                                        
##  NA's   :260                                            
##    tagline             title              video            vote_average   
##  Length:45463       Length:45463       Length:45463       Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.618  
##                                                           3rd Qu.: 6.800  
##                                                           Max.   :10.000  
##                                                           NA's   :3       
##    vote_count     
##  Min.   :    0.0  
##  1st Qu.:    3.0  
##  Median :   10.0  
##  Mean   :  109.9  
##  3rd Qu.:   34.0  
##  Max.   :14075.0  
##  NA's   :3
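
If the intent is to drop every movie with a missing or zero budget or popularity, the conditions would have to hold together (AND) rather than as alternatives (OR). The following is a minimal sketch on a small made-up tibble; toy, loose and strict are hypothetical names, and the values are invented for illustration only.

```r
library(dplyr)

# Made-up data: zeros and NAs stand in for unknown budget/popularity
toy <- tibble(
  budget     = c(0, NA, 100, 50, NA),
  popularity = c(1.5, NA, 0, 2.0, 3.0)
)

# OR-ed conditions: a row is kept as soon as any single condition is TRUE
loose <- filter(toy, budget != 0 | !is.na(budget) |
                     popularity != 0 | !is.na(popularity))

# Comma-separated conditions are AND-ed: every condition must be TRUE
strict <- filter(toy, !is.na(budget), budget != 0,
                      !is.na(popularity), popularity != 0)

nrow(loose)   # 4: only the row where both values are NA is dropped
nrow(strict)  # 1: only the row with a known, non-zero budget and popularity
```

Applying the stricter filter to moviesData would change every number reported below, so the rest of this vignette keeps the data as filtered above.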

Linear Regression Analysis

I use the lm() function to fit a linear regression model with budget as the predictor and popularity as the dependent variable:

slm <- lm(popularity~budget,data=moviesData)
slm

## 
## Call:
## lm(formula = popularity ~ budget, data = moviesData)
## 
## Coefficients:
## (Intercept)       budget  
##   2.267e+00    1.550e-07

and so the fitted linear regression model is y = 2.267 + 0.000000155x, where x is the budget and y is the popularity.
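
To make the equation concrete, the sketch below plugs a budget into the fitted coefficients. The coefficients are copied from the output above; the budget of 100 million and the helper predict_popularity() are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the lm() output above
intercept <- 2.267
slope     <- 1.550e-07

# Hypothetical helper: predicted popularity for a given budget
predict_popularity <- function(budget) {
  intercept + slope * budget
}

# Predicted popularity for an illustrative budget of 100 million
predict_popularity(1e8)   # about 17.8

# The slope is easier to read per 10 million of budget
slope * 1e7               # about 1.55 popularity points per 10 million
```

With the fitted model object, predict(slm, newdata = data.frame(budget = 1e8)) would give the same kind of prediction using the unrounded coefficients.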

ggplot(data = moviesData, aes(x = budget, y = popularity)) +
  geom_point() +
  geom_smooth(method = "lm")

summary(slm)

## 
## Call:
## lm(formula = popularity ~ budget, data = moviesData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33.94  -1.95  -1.27   0.64 533.75 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.267e+00  2.589e-02   87.56   <2e-16 ***
## budget      1.550e-07  1.444e-09  107.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.364 on 45458 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.2022, Adjusted R-squared:  0.2022 
## F-statistic: 1.152e+04 on 1 and 45458 DF,  p-value: < 2.2e-16

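The R-squared is 0.2022 and the p-value is very small. In other words, budget explains about 20% of the variance in popularity in this model, and the slope is statistically significant. Note that this describes an association and does not by itself show that a bigger budget causes higher popularity.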
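
R-squared is the share of the variance in the dependent variable that the model accounts for. As a rough sketch on simulated data (the variables x and y and the model fit below are made up, not the movie data), it can be computed by hand and compared with the value lm() reports:

```r
set.seed(1)

# Simulated data with a linear signal plus noise (illustrative only)
x <- runif(200, 0, 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 2)
fit <- lm(y ~ x)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Same value as reported by summary(), and in simple linear regression
# it also equals the squared correlation between x and y
r2_summary <- summary(fit)$r.squared
all.equal(r2_manual, r2_summary)
all.equal(r2_manual, cor(x, y)^2)
```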

Regression Diagnostics

But how do we know that the regression model we just fitted describes our movie data adequately?

Linear regression makes several assumptions about the data. The model assumes that [2]:

The relationship between the predictor (x) and the dependent variable (y) is linear.
The residuals have constant variance.
The residual errors are normally distributed.
The error terms are independent and have zero mean.

The plot() function in R can be used to generate the diagnostic plots and check the assumptions above.

par(mfrow = c(2, 2))
plot(slm)
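
For comparison, it can help to see what the four plots look like when the assumptions hold. The sketch below fits a model to simulated data that is linear with normal, constant-variance errors by construction (sim and ok_fit are made-up names); its plots give a baseline to compare against the movie model.

```r
set.seed(42)

# Simulated data that satisfies the assumptions by construction
n   <- 500
sim <- data.frame(x = runif(n, 0, 100))
sim$y <- 5 + 0.3 * sim$x + rnorm(n, mean = 0, sd = 4)

ok_fit <- lm(y ~ x, data = sim)

# Same four diagnostic plots as above, arranged in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(ok_fit)
par(old_par)  # restore the previous plotting layout
```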

We will look at each plot in more detail.

Residuals vs Fitted Plot

plot(slm, which = 1)

This plot is used to check the linear relationship assumption. It displays the residuals against the fitted values. The dotted line (y=0) corresponds to the fitted line, where the residual is zero, and the red line is a smoothed trend through the residuals.

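From the graph above, the red line is close to the dotted line, so we could reasonably assume there is a linear relationship between the budget and the popularity. Note, however, that the residuals are not spread evenly around the dotted line: the residual summary above has a median of -1.27, so more than half of the dots lie below it, while the largest residuals (up to 533.75) lie far above it.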
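
To see what this plot is made of, the sketch below rebuilds an approximation of it by hand on simulated data with a deliberately curved relationship (x, y and curved_fit are made-up names). The red trend line bends away from zero, which is the pattern that would signal a violated linearity assumption.

```r
set.seed(7)

# Simulated data where the true relationship is quadratic, not linear
x <- runif(300, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(300, sd = 3)
curved_fit <- lm(y ~ x)

fitted_vals <- fitted(curved_fit)
residuals_  <- resid(curved_fit)

# Residuals against fitted values, with a zero line and a lowess smoother
plot(fitted_vals, residuals_,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted (by hand)")
abline(h = 0, lty = 3)
lines(lowess(fitted_vals, residuals_), col = "red")

# Residuals are positive at both ends and negative in the middle
ends <- x < 2 | x > 8
mean(residuals_[ends])    # positive
mean(residuals_[!ends])   # negative
```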

Normal Q-Q

plot(slm, which = 2)

This plot is used to check whether the residuals are normally distributed. Points that fall along a straight line indicate normally distributed residuals.

From the graph above, the residual points follow the dotted line closely until the theoretical quantile exceeds about 2.5. Beyond that, the points curve upward away from the line (positively skewed), which means the residuals have more extreme values than would be expected if they truly came from a normal distribution. In this case, the residuals do not seem to be normally distributed.
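
The upward curve can be reproduced with simulated data. In the sketch below the errors are drawn from a right-skewed (exponential) distribution, so the standardized residuals have a long upper tail; all names are made up for illustration.

```r
set.seed(123)

# Simulated data with right-skewed errors (exponential, centred at zero)
n <- 500
x <- runif(n, 0, 10)
y <- 1 + 2 * x + (rexp(n, rate = 1) - 1)
skew_fit <- lm(y ~ x)

# Standardized residuals against theoretical normal quantiles
std_res <- rstandard(skew_fit)
qqnorm(std_res, main = "Normal Q-Q (by hand)")
qqline(std_res, lty = 3)

# Sample skewness of the residuals: positive for a long upper tail
skewness <- mean((std_res - mean(std_res))^3) / sd(std_res)^3
skewness

# The upper tail reaches further from zero than the lower tail
max(std_res) > abs(min(std_res))
```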

Scale-Location

plot(slm, which = 3)

This plot shows the spread of the residuals across the range of predicted values and is used to check the homogeneity of variance of the residuals. Ideally, the red line should be horizontal with equally spread points, which would indicate that the residuals have uniform variance across the range [2]. In our case, however, the red line slopes slightly upward, which suggests that the variance of the residuals increases with the fitted values.
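
The Scale-Location plot shows the square root of the absolute standardized residuals against the fitted values. A sketch on simulated data whose noise grows with x (made-up names throughout) shows the upward trend that indicates non-constant variance:

```r
set.seed(99)

# Simulated data where the error standard deviation grows with x
n <- 400
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, mean = 0, sd = 0.5 * x)
het_fit <- lm(y ~ x)

# Quantities drawn in the Scale-Location plot
fitted_vals <- fitted(het_fit)
sqrt_abs_sr <- sqrt(abs(rstandard(het_fit)))

plot(fitted_vals, sqrt_abs_sr,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)",
     main = "Scale-Location (by hand)")
lines(lowess(fitted_vals, sqrt_abs_sr), col = "red")

# A positive correlation means the spread increases with the fitted values
cor(fitted_vals, sqrt_abs_sr)
```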

Residuals vs Leverage

Cook’s distance is used to estimate the influence of a data point [3].

plot(slm, which = 4)

This plot indicates that there are three points (30699, 33355 and 42220) that could potentially have a large influence on our model.
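
The same information is available numerically through cooks.distance(). The sketch below uses simulated data with one deliberately planted outlier (all names are made up); on the movie model the equivalent call would be cooks.distance(slm).

```r
set.seed(2018)

# 50 well-behaved points plus one planted point far from the trend
x <- c(rnorm(50), 10)
y <- c(2 * x[1:50] + rnorm(50), -20)
infl_fit <- lm(y ~ x)

# Cook's distance for every observation
cooks_d <- cooks.distance(infl_fit)

# The three largest values, like the three labelled points in the plot
head(sort(cooks_d, decreasing = TRUE), 3)

# Index of the most influential observation (the planted point, 51)
which.max(cooks_d)

# Observations above the common rule-of-thumb threshold of 1
which(cooks_d > 1)
```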

Let’s look at these points more closely using the Residuals vs Leverage plot, which also draws the Cook’s distance contours.

plot(slm, which = 5)

This plot is used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis.

Looking at the diagram, most of the dots are inside the Cook’s distance lines, except for the three observations (30699, 33355 and 42220). These observations have a Cook’s distance greater than 1.0, which means they have a large influence on the model. We should look into these observations further and consider excluding them if they are errors.

moviesData %>%
  filter(row_number() %in% c(30699, 33355, 42220)) %>%
  select(original_title, release_date, budget, popularity)

## # A tibble: 3 x 4
##   original_title       release_date    budget popularity
##   <chr>                <date>           <int>      <dbl>
## 1 Minions              2015-06-17    74000000       547.
## 2 Wonder Woman         2017-05-30   149000000       294.
## 3 Beauty and the Beast 2017-03-16   160000000       287.
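
One way to judge how much such observations matter is to refit the model without them and compare the coefficients. This is a sketch on simulated data with a planted influential point (sim, full_fit and trimmed_fit are made-up names), not a result for the movie data:

```r
set.seed(10)

# 50 points following y = 2x plus noise, and one planted influential point
sim <- data.frame(x = c(rnorm(50), 10))
sim$y <- c(2 * sim$x[1:50] + rnorm(50), -20)

full_fit <- lm(y ~ x, data = sim)

# Keep only observations whose Cook's distance does not exceed 1
keep <- cooks.distance(full_fit) <= 1

# Refit on the retained rows
trimmed_fit <- lm(y ~ x, data = sim[keep, ])

# Compare the coefficients side by side
rbind(full = coef(full_fit), trimmed = coef(trimmed_fit))
```

In this simulated case the slope moves back toward the true value of 2 once the planted point is removed. Whether dropping the three movies is justified is a separate question, since they would only be excluded if they turned out to be data errors.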

Have you seen those movies? I watched all three and Minions is one of my favourite movies of all time!

Conclusion

So, does a bigger movie budget result in a more popular movie? The analysis above does indicate a positive linear relationship between movie budget and popularity, although it does not seem to be a strong one.

Regression diagnostics help us test the model’s assumptions and are a useful technique for determining how well the model works with the data. In our case, it looks like we should investigate further, as some of the assumptions are not met.

References

[1] Regression Diagnostics, http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression7.html
[2] Linear Regression Assumptions and Diagnostics in R: Essentials, http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
[3] Bommae Kim, Understanding Diagnostic Plots for Linear Regression Analysis, https://data.library.virginia.edu/diagnostic-plots/
[4] Wikipedia, Regression diagnostic, https://en.wikipedia.org/wiki/Regression_diagnostic
[5] Peter Bruce and Andrew Bruce, Practical Statistics for Data Scientists, O’Reilly Media, 2017.
[6] Wikipedia, Cook’s distance, https://en.wikipedia.org/wiki/Cook%27s_distance
[7] Phil Mike Jones, Regression Diagnostics with R, https://philmikejones.wordpress.com/2014/05/12/regression-diagnostics-r/

Regression Diagnostic for Simple Linear Regression using R

Benny Lee

17/08/2018