Regression diagnostics are used to evaluate the model assumptions and to investigate whether there are observations with a large, undue influence on the analysis [1]. In this vignette, I will show you how to generate and interpret the residual plots in R. As an example, I will use the movie data I found on Kaggle and explore whether there is any relationship between movie budget and popularity (does a bigger movie budget result in a more popular movie?). I will then generate the residual plots and explain in detail how to interpret them.
Here is the movie dataset that will be used in this Vignette. It contains information on 45,000 movies featured in the Full MovieLens dataset.
https://www.kaggle.com/rounakbanik/the-movies-dataset#movies_metadata.csv
I first load the tidyverse library:
library(tidyverse)
and then load the movie dataset:
moviesData <- read_csv("movies_metadata.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## budget = col_integer(),
## id = col_integer(),
## popularity = col_double(),
## release_date = col_date(format = ""),
## revenue = col_integer(),
## runtime = col_double(),
## vote_average = col_double(),
## vote_count = col_integer()
## )
## See spec(...) for full column specifications.
I then check the dataset:
summary(moviesData)
## adult belongs_to_collection budget
## Length:45466 Length:45466 Min. : 0
## Class :character Class :character 1st Qu.: 0
## Mode :character Mode :character Median : 0
## Mean : 4224579
## 3rd Qu.: 0
## Max. :380000000
## NA's :3
## genres homepage id imdb_id
## Length:45466 Length:45466 Min. : 2 Length:45466
## Class :character Class :character 1st Qu.: 26450 Class :character
## Mode :character Mode :character Median : 60003 Mode :character
## Mean :108360
## 3rd Qu.:157328
## Max. :469172
## NA's :3
## original_language original_title overview
## Length:45466 Length:45466 Length:45466
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## popularity poster_path production_companies
## Min. : 0.0000 Length:45466 Length:45466
## 1st Qu.: 0.3859 Class :character Class :character
## Median : 1.1277 Mode :character Mode :character
## Mean : 2.9215
## 3rd Qu.: 3.6789
## Max. :547.4883
## NA's :6
## production_countries release_date revenue
## Length:45466 Min. :1874-12-09 Min. :0.000e+00
## Class :character 1st Qu.:1978-10-06 1st Qu.:0.000e+00
## Mode :character Median :2001-08-30 Median :0.000e+00
## Mean :1992-05-15 Mean :1.115e+07
## 3rd Qu.:2010-12-17 3rd Qu.:0.000e+00
## Max. :2020-12-16 Max. :2.068e+09
## NA's :90 NA's :7
## runtime spoken_languages status
## Min. : 0.00 Length:45466 Length:45466
## 1st Qu.: 85.00 Class :character Class :character
## Median : 95.00 Mode :character Mode :character
## Mean : 94.13
## 3rd Qu.: 107.00
## Max. :1256.00
## NA's :263
## tagline title video vote_average
## Length:45466 Length:45466 Length:45466 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 5.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.618
## 3rd Qu.: 6.800
## Max. :10.000
## NA's :6
## vote_count
## Min. : 0.0
## 1st Qu.: 3.0
## Median : 10.0
## Mean : 109.9
## 3rd Qu.: 34.0
## Max. :14075.0
## NA's :6
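It looks like some of the observations do not have a budget or a popularity value. The filter below joins its conditions with | (OR), so it only drops the three rows in which both values are missing; rows with a zero budget remain, as the second summary shows: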
moviesData <- filter(moviesData,budget!=0 | !is.na(budget) | popularity!=0 | !is.na(popularity))
summary(moviesData)
## adult belongs_to_collection budget
## Length:45463 Length:45463 Min. : 0
## Class :character Class :character 1st Qu.: 0
## Mode :character Mode :character Median : 0
## Mean : 4224579
## 3rd Qu.: 0
## Max. :380000000
##
## genres homepage id imdb_id
## Length:45463 Length:45463 Min. : 2 Length:45463
## Class :character Class :character 1st Qu.: 26450 Class :character
## Mode :character Mode :character Median : 60003 Mode :character
## Mean :108360
## 3rd Qu.:157328
## Max. :469172
##
## original_language original_title overview
## Length:45463 Length:45463 Length:45463
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## popularity poster_path production_companies
## Min. : 0.0000 Length:45463 Length:45463
## 1st Qu.: 0.3859 Class :character Class :character
## Median : 1.1277 Mode :character Mode :character
## Mean : 2.9215
## 3rd Qu.: 3.6789
## Max. :547.4883
## NA's :3
## production_countries release_date revenue
## Length:45463 Min. :1874-12-09 Min. :0.000e+00
## Class :character 1st Qu.:1978-10-06 1st Qu.:0.000e+00
## Mode :character Median :2001-08-30 Median :0.000e+00
## Mean :1992-05-15 Mean :1.115e+07
## 3rd Qu.:2010-12-17 3rd Qu.:0.000e+00
## Max. :2020-12-16 Max. :2.068e+09
## NA's :87 NA's :4
## runtime spoken_languages status
## Min. : 0.00 Length:45463 Length:45463
## 1st Qu.: 85.00 Class :character Class :character
## Median : 95.00 Mode :character Mode :character
## Mean : 94.13
## 3rd Qu.: 107.00
## Max. :1256.00
## NA's :260
## tagline title video vote_average
## Length:45463 Length:45463 Length:45463 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 5.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.618
## 3rd Qu.: 6.800
## Max. :10.000
## NA's :3
## vote_count
## Min. : 0.0
## 1st Qu.: 3.0
## Median : 10.0
## Mean : 109.9
## 3rd Qu.: 34.0
## Max. :14075.0
## NA's :3
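If the intent were to keep only movies with a known, non-zero budget and popularity, the conditions would need to be combined with AND instead of OR. The following is a minimal sketch on a small made-up data frame (the `toy` tibble and its values are assumptions for illustration, not part of the movie data). Applying a filter like this to moviesData would change every result that follows in this vignette, so it is shown separately here.

```r
library(dplyr)

# Toy data standing in for moviesData (values are made up for illustration)
toy <- tibble(
  title      = c("A", "B", "C", "D", "E"),
  budget     = c(0, 5e6, NA, 2e7, 1e6),
  popularity = c(1.2, 3.4, 2.0, NA, 0)
)

# Conditions separated by commas in filter() are combined with AND,
# and rows where a condition evaluates to NA are dropped.
strict <- filter(toy, budget > 0, popularity > 0)
strict
```

Only movie "B" survives, because it is the only row with both a positive budget and a positive popularity.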
I use the lm() function to fit a linear regression model with budget as the predictor and popularity as the dependent variable:
slm <- lm(popularity~budget,data=moviesData)
slm
##
## Call:
## lm(formula = popularity ~ budget, data = moviesData)
##
## Coefficients:
## (Intercept) budget
## 2.267e+00 1.550e-07
and so the fitted regression line is popularity = 2.267 + 0.0000001550 × budget.
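To get a feel for the size of the slope, here is a small worked example that plugs a budget into the fitted line by hand. The budget of 100 million is a hypothetical value chosen for illustration, and the coefficients are the rounded ones printed above.

```r
# Coefficients as printed by lm() above (rounded)
intercept <- 2.267
slope     <- 1.550e-07

# Hypothetical budget of 100 million (illustrative value only)
budget <- 1e8

# Predicted popularity from the fitted line: intercept + slope * budget
predicted_popularity <- intercept + slope * budget
predicted_popularity

# With the fitted model object available, the same prediction could be
# obtained with predict(slm, newdata = data.frame(budget = 1e8)).
```

Each additional 10 million in budget is associated with an increase of about 1.55 in predicted popularity.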
ggplot(data = moviesData, aes(x = budget, y = popularity)) +
  geom_point() +
  geom_smooth(method = "lm")
summary(slm)
##
## Call:
## lm(formula = popularity ~ budget, data = moviesData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.94 -1.95 -1.27 0.64 533.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.267e+00 2.589e-02 87.56 <2e-16 ***
## budget 1.550e-07 1.444e-09 107.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.364 on 45458 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.2022, Adjusted R-squared: 0.2022
## F-statistic: 1.152e+04 on 1 and 45458 DF, p-value: < 2.2e-16
The R-squared is 0.2022 and the p-value is very small. In other words, budget explains about 20% of the variance in popularity. This describes an association and does not by itself show that a bigger budget causes higher popularity.
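As a reminder of what R-squared measures, the sketch below computes it by hand on simulated data (the data are made up for illustration, not the movie data). It compares the residual sum of squares with the total sum of squares and checks the result against the value reported by summary().

```r
set.seed(1)

# Simulated data (illustrative only)
x <- runif(200, 0, 10)
y <- 1 + 0.5 * x + rnorm(200, sd = 3)
fit <- lm(y ~ x)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# For a simple regression with an intercept, this also equals the
# squared correlation between x and y.
c(manual = r2_manual,
  summary = summary(fit)$r.squared,
  cor_squared = cor(x, y)^2)
```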
But how do we know whether the regression model we just fitted describes our movie data adequately?
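Linear regression makes several assumptions about the data. Among other things, the model assumes that the relationship between the predictor and the outcome is linear, that the residuals are normally distributed, and that the residuals have constant variance [2].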
The plot() function in R can be used to generate the diagnostic plots and check the assumptions above.
par(mfrow = c(2, 2))
plot(slm)
We will look at each plot in more detail.
plot(slm, which = 1)
This plot is used to check the linear relationship assumption. It displays the residuals against the fitted values. The dotted horizontal line (y = 0) marks where a residual is zero, that is, where the observed value equals the fitted value.
From the graph above, most of the points lie close to the dotted line, and the red line (a smoothed trend of the residuals) stays close to it as well. We could then reasonably assume there is a linear relationship between the budget and the popularity.
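To see what this plot looks like when the linearity assumption does and does not hold, here is a minimal sketch on simulated data (the data and the helper `resid_vs_fitted()` are made up for illustration). It draws the residuals against the fitted values by hand, with a zero line and a lowess trend similar to the red line above.

```r
set.seed(42)

# Simulated data (illustrative only, not the movie data)
x <- seq(0, 10, length.out = 200)
y_linear <- 2 + 3 * x + rnorm(200)       # truly linear relationship
y_curved <- 2 + 0.5 * x^2 + rnorm(200)   # curved relationship

fit_linear <- lm(y_linear ~ x)
fit_curved <- lm(y_curved ~ x)

# Hypothetical helper: residuals-vs-fitted plot drawn by hand
resid_vs_fitted <- function(fit, title) {
  plot(fitted(fit), resid(fit), main = title,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 3)                               # zero-residual line
  lines(lowess(fitted(fit), resid(fit)), col = "red")  # smoothed trend
}

par(mfrow = c(1, 2))
resid_vs_fitted(fit_linear, "Linear data")
resid_vs_fitted(fit_curved, "Curved data")
```

For the linear data the red trend stays near zero, while for the curved data it bends into a U shape, which is the pattern that would signal a violated linearity assumption.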
plot(slm, which = 2)
This plot is used to check whether the residuals are normally distributed. If the points fall along the straight reference line, the residuals are approximately normally distributed.
From the graph above, the residual points follow the dotted line closely until the theoretical quantile exceeds about 2.5. Beyond that point they curve upward away from the line, so the residuals do not seem to be normally distributed. They are positively skewed, which means there are more extreme large values than would be expected if they truly came from a normal distribution.
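The upward curve described above can be reproduced on simulated data with right-skewed errors. The sketch below is an illustration under that assumption (the data are made up); it draws the Q-Q plot of the standardized residuals by hand and adds the sample skewness as a simple numeric companion to the plot.

```r
set.seed(7)

# Simulated data with right-skewed errors (illustrative only)
n <- 2000
x <- runif(n, 0, 10)
errors <- rexp(n, rate = 1) - 1   # mean zero, long right tail
y <- 1 + 2 * x + errors

fit <- lm(y ~ x)
std_res <- rstandard(fit)         # standardized residuals

# Normal Q-Q plot of the standardized residuals with a reference line
qqnorm(std_res, main = "Normal Q-Q (simulated skewed errors)")
qqline(std_res, lty = 3)

# Sample skewness: clearly positive values indicate a long right tail
skewness <- mean((std_res - mean(std_res))^3) / sd(std_res)^3
skewness
```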
plot(slm, which = 3)
This plot shows the spread of the residuals across the range of fitted values and is used to check the homogeneity of variance of the residuals. Ideally, the red line should be horizontal with equally spread points, which would indicate that the residuals have uniform variance across the range [2]. However, in our case the red line slopes slightly upward, which suggests that the variance of the residuals increases somewhat with the fitted values.
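The quantity on the vertical axis of this plot is the square root of the absolute standardized residuals. As a rough sketch, the example below builds the same plot by hand on simulated data whose noise grows with the predictor (the data are made up for illustration), so the upward trend is there by construction.

```r
set.seed(3)

# Simulated data with non-constant variance (illustrative only)
n <- 300
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # noise grows with x

fit <- lm(y ~ x)

# Scale-location quantity: sqrt(|standardized residuals|)
sqrt_abs_res <- sqrt(abs(rstandard(fit)))

plot(fitted(fit), sqrt_abs_res,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(fit), sqrt_abs_res), col = "red")  # smoothed trend

# A positive correlation is a simple numeric hint of an upward trend
trend <- cor(fitted(fit), sqrt_abs_res)
trend
```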
Cook’s distance is used to estimate the influence of a data point [3].
plot(slm, which = 4)
This plot indicates that there are three points (30699, 33355 and 42220) that could potentially have a large influence on our model.
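The values behind this plot can also be extracted directly with cooks.distance(). The sketch below uses simulated data with one deliberately planted outlier (the `toy` data frame is an assumption for illustration) and lists the three observations with the largest Cook's distance.

```r
set.seed(11)

# Simulated data (illustrative only): 50 regular points plus one
# planted observation with an extreme x and a y far from the trend
x <- c(runif(50, 0, 10), 30)
y <- c(1 + 2 * x[1:50] + rnorm(50), 0)
toy <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = toy)

# Cook's distance for every observation, named by row
cd <- cooks.distance(fit)

# The three observations with the largest Cook's distance
top3 <- head(sort(cd, decreasing = TRUE), 3)
top3

# Same kind of display as plot(fit, which = 4), drawn by hand
plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
```

The planted observation (row 51) stands out clearly from the rest.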
Let’s look at the Cook’s distance more closely.
plot(slm, which = 5)
This plot is used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis.
Looking at the diagram, most of the dots are inside the Cook’s distance lines except for three observations (30699, 33355 and 42220). These observations have a Cook’s distance greater than 1.0, which means they have a large influence on the model. We should look into these observations further and consider excluding them if they are errors.
moviesData %>% slice(c(30699, 33355, 42220)) %>% select(original_title, release_date, budget, popularity)
## # A tibble: 3 x 4
## original_title release_date budget popularity
## <chr> <date> <int> <dbl>
## 1 Minions 2015-06-17 74000000 547.
## 2 Wonder Woman 2017-05-30 149000000 294.
## 3 Beauty and the Beast 2017-03-16 160000000 287.
Have you seen those movies? I watched all three and Minions is one of my favourite movies of all time!
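One way to gauge how much influential points matter is to refit the model without them and compare the coefficients. The sketch below does this on simulated data with one planted outlier (the `toy` data frame is made up for illustration, and the cutoff of 1 follows the threshold used above). It is only a sensitivity check; genuine observations such as these movies should not be dropped just because they are influential.

```r
set.seed(11)

# Simulated data (illustrative only): true slope is 2, and
# observation 51 is a planted point far from the trend
x <- c(runif(50, 0, 10), 30)
y <- c(1 + 2 * x[1:50] + rnorm(50), 0)
toy <- data.frame(x = x, y = y)

fit_all <- lm(y ~ x, data = toy)

# Observations whose Cook's distance exceeds 1
influential <- which(cooks.distance(fit_all) > 1)

# Refit without them (setdiff() keeps all rows if none are flagged)
keep <- setdiff(seq_len(nrow(toy)), influential)
fit_trimmed <- lm(y ~ x, data = toy[keep, ])

# Compare the coefficients of the two fits side by side
rbind(all = coef(fit_all), trimmed = coef(fit_trimmed))
```

If the coefficients change substantially between the two fits, the conclusions depend heavily on a few observations and deserve a closer look.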
So, does a bigger movie budget result in a more popular movie? The analysis above does indicate a positive linear relationship between movie budget and popularity, although it does not seem to be a strong one (the R-squared is only about 0.20).
Regression diagnostics help us test the model’s assumptions and are a useful technique for determining how well the model fits the data. In our case, it looks like we should investigate further, as some of the assumptions (normality and constant variance of the residuals) are not met.