Adjustment for multivariate regression

According to Caffo (2015), “Adjustment, is the idea of putting regressors into a linear model to investigate the role of a third variable on the relationship between another two ... It is often the case that a third variable can distort, or confound if you will, the relationship between two others.”

In this post, I will try to explain how a third variable affects a linear regression model, using the movie dataset from Assignment 2.

Dataset

Using Google’s YouTube API, we extracted the following statistics for the YouTube trailers of movies shown between 2015 and 2018:

Number of ‘like’ votes
Number of views
Number of comments
Number of ‘dislike’ votes

Besides that, we also obtained the IMDB user ratings for these movies.

summary(ratings.youtube.genre)

##   YTLikeCount     YTViewCount       YTCommentCount   YTDislikeCount  
##  Min.   :    0   Min.   :       0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:    0   1st Qu.:       0   1st Qu.:   0.0   1st Qu.:   0.0  
##  Median :  135   Median :   60778   Median :  11.0   Median :   9.0  
##  Mean   : 2313   Mean   :  887520   Mean   : 238.2   Mean   : 166.4  
##  3rd Qu.: 1341   3rd Qu.:  631042   3rd Qu.: 156.0   3rd Qu.: 100.0  
##  Max.   :29770   Max.   :10332940   Max.   :3048.0   Max.   :2149.0  
##    IMDBRating   
##  Min.   : 0.00  
##  1st Qu.:59.00  
##  Median :66.00  
##  Mean   :64.18  
##  3rd Qu.:72.00  
##  Max.   :98.00

Relationship between variables

pairs(ratings.youtube.genre)

Model and Evaluation

For the purpose of this illustration, we want to see if we can predict IMDB’s user ratings using the dataset above.

First, we want to see if there is any relationship between the ratings and the number of ‘dislike’ votes:

cor(ratings.youtube.genre$IMDBRating, ratings.youtube.genre$YTDislikeCount)

## [1] -0.03014

There appears to be a very weak negative linear relationship between the ratings and the number of ‘dislike’ votes.
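To put that number in perspective, squaring a correlation gives the share of variance in one variable that a straight-line fit on the other would account for. Here is a small sketch using the value printed above (the names `r` and `r_squared` are only for illustration):

```r
# Correlation between IMDBRating and YTDislikeCount, copied from the output above
r <- -0.03014

# Squared correlation: share of variance a simple linear fit would explain
r_squared <- r^2

# Expressed as a percentage; this comes out below 0.1%
round(100 * r_squared, 3)
```

On its own, then, the number of ‘dislike’ votes explains almost none of the variation in ratings.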

Now, let’s generate the linear regression model:

lm1 <- lm(IMDBRating~YTDislikeCount, data=ratings.youtube.genre)
summary(lm1)$coef

##                    Estimate   Std. Error    t value   Pr(>|t|)
## (Intercept)    65.212631904 0.2627027283 248.237361 0.00000000
## YTDislikeCount -0.001711292 0.0005204629  -3.288018 0.00102893

The model above indicates that each additional ‘dislike’ vote on a movie trailer is associated with a decrease of about 0.0017 in the expected rating of the movie.
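A per-vote effect this small is easier to read on a larger scale. The sketch below rescales the estimate to 1,000 ‘dislike’ votes and adds a rough 95% interval. It uses only the numbers printed in the coefficient table, and it assumes the sample is large enough for a normal quantile to stand in for the t quantile. The variable names are illustrative.

```r
# Values copied from the summary(lm1)$coef output above
estimate  <- -0.001711292
std_error <-  0.0005204629

# Change in expected rating per 1,000 additional dislikes (about -1.7)
per_1000 <- estimate * 1000

# Approximate 95% interval: estimate +/- normal quantile * standard error
# (assumption: large sample, so qnorm is close to the exact t quantile)
ci <- estimate + c(-1, 1) * qnorm(0.975) * std_error

per_1000
ci * 1000
```

With the original model object available, `confint(lm1)` would give the exact t-based interval instead of this approximation.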

Now, let’s add a third variable and see how it affects the relationship between the two variables. In this case, we add the number of comments as an additional variable in the model.

lm2 <- lm(IMDBRating ~ YTDislikeCount + YTCommentCount, data=ratings.youtube.genre)
summary(lm2)$coef

##                    Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)    65.030567808 0.2666554656 243.874873 0.000000e+00
## YTDislikeCount -0.008468466 0.0010424265  -8.123802 8.644938e-16
## YTCommentCount  0.004821137 0.0007010086   6.877429 8.558334e-12

As you can see, adding a third variable has changed the estimated relationship between rating and the number of ‘dislike’ votes: holding the number of comments fixed, each additional ‘dislike’ vote on the movie trailer is now associated with a decrease of about 0.0085 in the rating. Compared with the first model, the coefficient for ‘dislike’ votes changes by about -0.0068 (from -0.0017 to -0.0085) once the number of comments is included in the model. The p-value for this coefficient also goes down, from 0.00103 to 8.64e-16, so the estimate is more strongly significant.

Here is the correlation between number of ‘dislike’ vote and number of ‘comment’:

cor(ratings.youtube.genre$YTDislikeCount, ratings.youtube.genre$YTCommentCount)

## [1] 0.6135742

There appears to be a moderate positive relationship between the number of ‘dislike’ votes and the number of comments.
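This correlation is what lets the third variable shift the ‘dislike’ coefficient. For ordinary least squares, the unadjusted slope equals the adjusted slope plus the comment slope times the slope from regressing comments on dislikes. The sketch below checks that identity on simulated data, not on the real dataset. All names (`popularity`, `dislikes`, `comments`, `rating`) and the coefficients used to generate the data are assumptions, chosen only to loosely echo the models above.

```r
set.seed(1)
n <- 5000

# Simulated stand-ins (not the real data): comments and dislikes share a
# common driver, so they are positively correlated
popularity <- rexp(n)
comments   <- 200 * popularity + rnorm(n, sd = 100)
dislikes   <- 150 * popularity + rnorm(n, sd = 100)

# Rating falls with dislikes and rises with comments, plus noise
rating <- 65 - 0.008 * dislikes + 0.005 * comments + rnorm(n, sd = 5)

fit_simple   <- lm(rating ~ dislikes)             # like lm1
fit_adjusted <- lm(rating ~ dislikes + comments)  # like lm2
fit_aux      <- lm(comments ~ dislikes)           # how comments track dislikes

b_simple   <- coef(fit_simple)[["dislikes"]]
b_adjusted <- coef(fit_adjusted)[["dislikes"]]
b_comments <- coef(fit_adjusted)[["comments"]]
delta      <- coef(fit_aux)[["dislikes"]]

# OLS identity: unadjusted slope = adjusted slope + comment slope * delta
all.equal(b_simple, b_adjusted + b_comments * delta)
```

Because the comment slope and `delta` are both positive in this setup, leaving comments out pulls the ‘dislike’ slope toward zero, which matches the direction of the change between the two models above.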

Here is the 3D plot to view the relationship for all three variables:

Conclusion

Does the number of comments have a confounding effect on the relationship between movie ratings and the number of ‘dislike’ votes on the movie trailer? The output of the models indicates that the relationship changes when a third variable (the number of comments) is introduced: the coefficient for ‘dislike’ votes grows in magnitude, its p-value becomes smaller, and its t value remains significant. However, there is a moderate correlation between the two independent variables, so more investigation is needed before treating this change as a genuine confounding effect rather than a chance result.
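As one starting point for that further investigation, the variance inflation factor (VIF) gives a quick read on how much the correlation between the two predictors inflates the variance of their coefficients. With exactly two predictors, it can be computed directly from the correlation reported earlier. This is a rough check under that two-predictor assumption, not a full diagnosis.

```r
# Correlation between YTDislikeCount and YTCommentCount, copied from above
r_xz <- 0.6135742

# With exactly two predictors, each one's variance inflation factor
# reduces to 1 / (1 - r^2)
vif <- 1 / (1 - r_xz^2)

# Comes out at roughly 1.6
vif
```

Commonly quoted rules of thumb only flag VIF values above 5 or 10, so by this measure the collinearity here looks mild. That still says nothing about whether the number of comments is a true confounder.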

References

Caffo, Brian 2015, Adjustment, in Regression Models for Data Science in R, Leanpub, viewed 29 September 2018, https://leanpub.com/regmods/read#leanpub-auto-adjustment
