Polynomial Regression

In class today we learned about quadratic regression and, more generally, polynomial regression. The basic idea is that some data are predicted better by a polynomial model than by a straight line. In class we used the “women” data in R, because we had found earlier that the shortest and tallest heights were being systematically over-estimated. Another way to tell whether polynomial regression is needed is to look for a curved trend in a scatterplot: imagine fitting a straight line to a curved trend, and the gap between them is the error you make by using linear regression when polynomial regression is called for.
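
As a quick illustration of the idea, here is a minimal sketch using R's built-in women data (the same data we used in class); the quadratic term lets the fitted curve bend instead of forcing a straight line:

data(women)                                   # built-in data set: height and weight of 15 women
lin  <- lm(weight ~ height, data = women)     # straight-line fit
quad <- lm(weight ~ height + I(height^2), data = women)  # quadratic fit
plot(weight ~ height, data = women)
abline(lin)                                   # linear fit
lines(women$height, fitted(quad), lty = 2)    # quadratic fit, drawn dashed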

I will be using the fivethirtyeight data on Rotten Tomatoes ratings because I am interested to know whether this data will be too similar to my group’s project idea.

#install.packages("fivethirtyeight")
library(fivethirtyeight)
data("fandango")
attach(fandango)
names(fandango)
##  [1] "film"                       "year"                      
##  [3] "rottentomatoes"             "rottentomatoes_user"       
##  [5] "metacritic"                 "metacritic_user"           
##  [7] "imdb"                       "fandango_stars"            
##  [9] "fandango_ratingvalue"       "rt_norm"                   
## [11] "rt_user_norm"               "metacritic_norm"           
## [13] "metacritic_user_nom"        "imdb_norm"                 
## [15] "rt_norm_round"              "rt_user_norm_round"        
## [17] "metacritic_norm_round"      "metacritic_user_norm_round"
## [19] "imdb_norm_round"            "metacritic_user_vote_count"
## [21] "imdb_user_vote_count"       "fandango_votes"            
## [23] "fandango_difference"
plot(imdb,rottentomatoes)

RTIMDBlm <- lm(rottentomatoes~imdb)
plot(RTIMDBlm$residuals~RTIMDBlm$fitted.values)

The scatterplot looks fairly linear, so by this check we may not need quadratic regression, but the residual plot shows some uneven spread, so let’s try quadratic regression anyway.
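
As a side note, R's built-in diagnostic plot shows the same residuals-versus-fitted picture but adds a smoothed trend line, which can make curvature or uneven spread easier to spot; a minimal sketch:

plot(RTIMDBlm, which = 1)  # residuals vs. fitted values, with a smoother overlaid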

RTIMDBlmQUAD <- lm(rottentomatoes ~ imdb + I(imdb^2))
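
As an aside, an equivalent way to write this model is with poly(); the name RTIMDBlmQUAD2 below is only for comparison, and the rest of this post keeps using RTIMDBlmQUAD:

RTIMDBlmQUAD2 <- lm(rottentomatoes ~ poly(imdb, 2))  # orthogonal polynomial terms; same fitted values as the I(imdb^2) version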

Let’s check if we needed to include that extra term.

anova(RTIMDBlm,RTIMDBlmQUAD)
## Analysis of Variance Table
## 
## Model 1: rottentomatoes ~ imdb
## Model 2: rottentomatoes ~ imdb + I(imdb^2)
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    144 51748                           
## 2    143 51601  1    147.26 0.4081  0.524

The high p-value (0.524) means the extra quadratic term is not worth adding: we fail to reject the null hypothesis that its coefficient is zero.
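
Because only one term was added, the same conclusion can be read straight from the quadratic model's own summary: with a single extra coefficient, the t-test on I(imdb^2) is equivalent to this partial F-test (t^2 = F), so its p-value should match the 0.524 above.

summary(RTIMDBlmQUAD)$coefficients["I(imdb^2)", ]  # estimate, SE, t value, and p-value for the quadratic term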

Another thing I would like to try from class is this method of drawing a fitted regression curve by hand. Because the anova test deemed the x^2 term unimportant, the curve produced by the following code should be essentially a straight line.

x <- seq(from = 4, to = 8, by = 0.1)
y <- coef(RTIMDBlmQUAD)[1] + coef(RTIMDBlmQUAD)[2]*x + coef(RTIMDBlmQUAD)[3]*x^2
plot(rottentomatoes ~ imdb)
lines(x, y, lty = 2)
abline(RTIMDBlm)

The curve fit by the quadratic regression is the dashed line in the figure above. It is essentially straight, so there is no need for the quadratic term with this data.
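
The same curve can be drawn without typing out the coefficients by using predict() on a grid of IMDB values; a sketch of that alternative:

x <- seq(from = 4, to = 8, by = 0.1)
plot(rottentomatoes ~ imdb)
lines(x, predict(RTIMDBlmQUAD, newdata = data.frame(imdb = x)), lty = 2)  # quadratic fit
abline(RTIMDBlm)                                                          # linear fit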

Multicollinearity

We also learned in class how to check for multicollinearity. Multicollinearity occurs when predictors are correlated with each other, which inflates the standard errors of the estimated coefficients.

Let’s use the same data to inspect for collinearity.

cor(imdb,metacritic)
## [1] 0.7272978
cor(rottentomatoes,imdb)
## [1] 0.7796709
cor(rottentomatoes,metacritic)
## [1] 0.9573596

IMDB, Metacritic, and Rotten Tomatoes ratings are all predictors for the Fandango scores in this data. They are all highly correlated with one another because they all measure essentially the same thing: how well a movie is rated overall. We are most concerned with correlation coefficients above 0.9, so the Rotten Tomatoes and Metacritic ratings are an issue; they are telling us nearly the same thing, and the standard errors are inflated as a result.
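
A more direct diagnostic than pairwise correlations is the variance inflation factor (VIF), which measures how much each coefficient's variance is inflated by the other predictors. A sketch, assuming the car package is installed (this is the same full model fit as complete below):

# install.packages("car")
library(car)
vif(lm(fandango_stars ~ imdb + rottentomatoes + metacritic))  # a common rule of thumb flags values above 5 or 10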

Which one should we drop?

complete <- lm(fandango_stars~imdb+rottentomatoes+metacritic)
reduced1 <- lm(fandango_stars~imdb+rottentomatoes)
reduced2 <- lm(fandango_stars~imdb+metacritic)
summary(complete)
## 
## Call:
## lm(formula = fandango_stars ~ imdb + rottentomatoes + metacritic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86200 -0.25947 -0.05442  0.22057  1.16405 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.808442   0.326464   5.539 1.43e-07 ***
## imdb            0.487120   0.053894   9.038 1.06e-15 ***
## rottentomatoes  0.010414   0.004069   2.559   0.0115 *  
## metacritic     -0.027798   0.005738  -4.844 3.28e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3874 on 142 degrees of freedom
## Multiple R-squared:  0.4966, Adjusted R-squared:  0.486 
## F-statistic:  46.7 on 3 and 142 DF,  p-value: < 2.2e-16
summary(reduced1)
## 
## Call:
## lm(formula = fandango_stars ~ imdb + rottentomatoes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97902 -0.27303 -0.04911  0.25038  1.27830 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.076954   0.311356   3.459 0.000715 ***
## imdb            0.514725   0.057649   8.929 1.92e-15 ***
## rottentomatoes -0.007488   0.001832  -4.087 7.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4168 on 143 degrees of freedom
## Multiple R-squared:  0.4134, Adjusted R-squared:  0.4052 
## F-statistic:  50.4 on 2 and 143 DF,  p-value: < 2.2e-16
summary(reduced2)
## 
## Call:
## lm(formula = fandango_stars ~ imdb + metacritic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82808 -0.26525 -0.05599  0.25252  1.10873 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.266922   0.253399   5.000 1.66e-06 ***
## imdb         0.545129   0.049837  10.938  < 2e-16 ***
## metacritic  -0.014461   0.002448  -5.907 2.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3949 on 143 degrees of freedom
## Multiple R-squared:  0.4734, Adjusted R-squared:  0.466 
## F-statistic: 64.28 on 2 and 143 DF,  p-value: < 2.2e-16

Looking at the model summaries and the partial F-tests below, we aren’t better off dropping a variable: the residual standard error is smallest and the adjusted R^2 is largest when all three predictors are included.
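
Those two quantities can be pulled out of each summary directly for a side-by-side comparison; a small sketch:

sapply(list(complete = complete, reduced1 = reduced1, reduced2 = reduced2),
       function(m) c(resid_SE = summary(m)$sigma, adj_R2 = summary(m)$adj.r.squared))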

anova(complete,reduced1)
## Analysis of Variance Table
## 
## Model 1: fandango_stars ~ imdb + rottentomatoes + metacritic
## Model 2: fandango_stars ~ imdb + rottentomatoes
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    142 21.314                                  
## 2    143 24.837 -1   -3.5226 23.468 3.282e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(complete,reduced2)
## Analysis of Variance Table
## 
## Model 1: fandango_stars ~ imdb + rottentomatoes + metacritic
## Model 2: fandango_stars ~ imdb + metacritic
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    142 21.314                              
## 2    143 22.297 -1  -0.98321 6.5504 0.01153 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, if we had to drop a predictor, it should be the Rotten Tomatoes rating, based on the comparative significance of each variable. Dropping Metacritic gave a very small p-value in the anova test, while dropping Rotten Tomatoes gave a p-value of about 0.01. That would still reject the null hypothesis at the standard 5% significance level, so neither drop is really supported, but it does show which predictors matter most to the regression.
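
The same ordering can be seen directly in the complete model's coefficient table, where rottentomatoes has the largest p-value of the three predictors; a quick sketch:

summary(complete)$coefficients[, "Pr(>|t|)"]  # p-value for each term in the full model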