According to Caffo (2015), “Adjustment, is the idea of putting regressors into a linear model to investigate the role of a third variable on the relationship between another two … It is often the case that a third variable can distort, or confound if you will, the relationship between two others.”
In this post, I will try to explain how a third variable affects a linear regression model, using the movie dataset from Assignment 2.
Using Google’s YouTube API, we extracted the following statistics for the YouTube trailers of movies shown between 2015 and 2018: like count, view count, comment count and dislike count.
We also obtained the IMDB user ratings for these movies.
summary(ratings.youtube.genre)
##   YTLikeCount     YTViewCount        YTCommentCount   YTDislikeCount
##  Min.   :    0   Min.   :       0   Min.   :   0.0   Min.   :   0.0
##  1st Qu.:    0   1st Qu.:       0   1st Qu.:   0.0   1st Qu.:   0.0
##  Median :  135   Median :   60778   Median :  11.0   Median :   9.0
##  Mean   : 2313   Mean   :  887520   Mean   : 238.2   Mean   : 166.4
##  3rd Qu.: 1341   3rd Qu.:  631042   3rd Qu.: 156.0   3rd Qu.: 100.0
##  Max.   :29770   Max.   :10332940   Max.   :3048.0   Max.   :2149.0
##    IMDBRating
##  Min.   : 0.00
##  1st Qu.:59.00
##  Median :66.00
##  Mean   :64.18
##  3rd Qu.:72.00
##  Max.   :98.00
pairs(ratings.youtube.genre)
For the purpose of this illustration, we want to see whether we can predict IMDB user ratings using the dataset above.
First, we check whether there is any relationship between ratings and the number of ‘dislike’ votes:
cor(ratings.youtube.genre$IMDBRating, ratings.youtube.genre$YTDislikeCount)
## [1] -0.03014
It seems there is a very weak negative relationship between ratings and the number of ‘dislike’ votes.
Now, let’s generate the linear regression model:
lm1 <- lm(IMDBRating~YTDislikeCount, data=ratings.youtube.genre)
summary(lm1)$coef
##                    Estimate   Std. Error    t value   Pr(>|t|)
## (Intercept)    65.212631904 0.2627027283 248.237361 0.00000000
## YTDislikeCount -0.001711292 0.0005204629  -3.288018 0.00102893
The model above estimates that each additional ‘dislike’ vote on a movie trailer is associated with a decrease of about 0.0017 in the movie’s rating.
Now, let’s add a third variable and see how it affects the relationship between the two variables. In this case, we add the number of comments as an additional variable in the model.
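To make the size of this coefficient more concrete, here is a small sketch that plugs a few dislike counts into the fitted line. The intercept and slope are copied from the lm1 output above; `predict_rating` is a hypothetical helper made up for illustration and is not part of the original analysis.

```r
# Coefficients copied from the summary(lm1)$coef output above
intercept <- 65.212631904
slope     <- -0.001711292

# Hypothetical helper: predicted rating for a given number of dislikes
predict_rating <- function(dislikes) intercept + slope * dislikes

# Predicted ratings for trailers with 0, 100 and 1000 dislikes
predictions <- predict_rating(c(0, 100, 1000))
print(predictions)

# Difference in predicted rating between 1000 dislikes and none
drop_per_1000 <- predict_rating(1000) - predict_rating(0)
print(drop_per_1000)
```

Under this model, a trailer with 1,000 dislikes is predicted to rate only about 1.7 points lower than one with none, which is in line with the very weak correlation seen earlier.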
lm2 <- lm(IMDBRating ~ YTDislikeCount + YTCommentCount, data=ratings.youtube.genre)
summary(lm2)$coef
##                    Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)    65.030567808 0.2666554656 243.874873 0.000000e+00
## YTDislikeCount -0.008468466 0.0010424265  -8.123802 8.644938e-16
## YTCommentCount  0.004821137 0.0007010086   6.877429 8.558334e-12
As you can see, adding a third variable changes the relationship between rating and the number of ‘dislike’ votes: each additional ‘dislike’ vote on the movie trailer is now associated with a decrease of about 0.0085 in the rating. Compared with the first model, the ‘dislike’ coefficient moves from -0.0017 to -0.0085, a change of about -0.0068, once the third variable (number of comments) is included. The p-value also goes down, from 0.00103 to 8.64e-16.
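The reason the coefficient shifts can be sketched with a standard least-squares identity: the unadjusted slope equals the adjusted slope plus the coefficient of the third variable multiplied by the slope from regressing the third variable on the first. The example below checks this on simulated data, since the original dataset is not reproduced here. All variable names and numbers in it are assumptions chosen for illustration, not values from the movie data.

```r
set.seed(42)
n <- 500

# Simulated stand-ins (assumed values, not the real YouTube data)
x2 <- rpois(n, 200)                   # plays the role of comment count
x1 <- 0.7 * x2 + rnorm(n, sd = 30)    # plays the role of dislike count, correlated with x2
y  <- 65 - 0.008 * x1 + 0.005 * x2 + rnorm(n, sd = 5)  # plays the role of rating

# Slope of x1 without and with adjustment for x2
unadjusted <- coef(lm(y ~ x1))["x1"]
adjusted   <- coef(lm(y ~ x1 + x2))

# Slope from regressing the third variable on the first
delta <- coef(lm(x2 ~ x1))["x1"]

# Identity: unadjusted slope = adjusted slope + (x2 coefficient) * delta
reconstructed <- adjusted["x1"] + adjusted["x2"] * delta
identity_holds <- isTRUE(all.equal(unname(unadjusted), unname(reconstructed)))
print(identity_holds)
```

In this simulation, as in the movie models, the third variable has a positive coefficient and is positively related to the first variable, so leaving it out pulls the unadjusted slope toward zero.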
Here is the correlation between number of ‘dislike’ vote and number of ‘comment’:
cor(ratings.youtube.genre$YTDislikeCount, ratings.youtube.genre$YTCommentCount)
## [1] 0.6135742
It seems there is a moderate positive relationship (a correlation of about 0.61) between the number of ‘dislike’ votes and the number of comments.
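One rough way to judge whether a correlation of this size is a concern for the two-predictor model is the variance inflation factor (VIF), which for two predictors is 1 / (1 - r^2). The sketch below is an addition of mine rather than part of the original analysis; it only uses the correlation reported above.

```r
# Correlation between YTDislikeCount and YTCommentCount, copied from the output above
r <- 0.6135742

# Variance inflation factor for a model with exactly two predictors
vif <- 1 / (1 - r^2)
print(vif)
```

This gives a VIF of roughly 1.6, well below the rule-of-thumb thresholds of 5 or 10 that are often used to flag problematic collinearity, though such cut-offs are only conventions.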
Here is the 3D plot to view the relationship for all three variables:
Does the number of comments have a confounding effect on the relationship between movie ratings and the number of ‘dislike’ votes on the movie trailer? The output of the models indicates that the relationship changes when the third variable (number of comments) is introduced: the ‘dislike’ coefficient becomes larger in magnitude, its p-value is smaller and its t value is still significant. However, there is a moderate correlation between the two independent variables, so more investigation should be conducted to make sure the observed change is not coincidental.
Caffo, Brian 2015, Adjustments - Regression Models for Data Science in R, Leanpub, viewed 29 September 2018, https://leanpub.com/regmods/read#leanpub-auto-adjustment