Correlation as Prediction

Probability and Statistics

FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RT_norm RT_user_norm Metacritic_norm Metacritic_user_nom IMDB_norm RT_norm_round RT_user_norm_round Metacritic_norm_round Metacritic_user_norm_round IMDB_norm_round Metacritic_user_vote_count IMDB_user_vote_count Fandango_votes Fandango_Difference
Avengers: Age of Ultron (2015) 74 86 66 7.1 7.8 5.0 4.5 3.70 4.3 3.30 3.55 3.90 3.5 4.5 3.5 3.5 4.0 1330 271107 14846 0.5
Cinderella (2015) 85 80 67 7.5 7.1 5.0 4.5 4.25 4.0 3.35 3.75 3.55 4.5 4.0 3.5 4.0 3.5 249 65709 12640 0.5
Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5 4.00 4.5 3.20 4.05 3.90 4.0 4.5 3.0 4.0 4.0 627 103660 12055 0.5
Do You Believe? (2015) 18 84 22 4.7 5.4 5.0 4.5 0.90 4.2 1.10 2.35 2.70 1.0 4.0 1.0 2.5 2.5 31 3136 1793 0.5
Hot Tub Time Machine 2 (2015) 14 28 29 3.4 5.1 3.5 3.0 0.70 1.4 1.45 1.70 2.55 0.5 1.5 1.5 1.5 2.5 88 19560 1021 0.5
The Water Diviner (2015) 63 62 50 6.8 7.2 4.5 4.0 3.15 3.1 2.50 3.40 3.60 3.0 3.0 2.5 3.5 3.5 34 39373 397 0.5
 [1] "FILM"                       "RottenTomatoes"            
 [3] "RottenTomatoes_User"        "Metacritic"                
 [5] "Metacritic_User"            "IMDB"                      
 [7] "Fandango_Stars"             "Fandango_Ratingvalue"      
 [9] "RT_norm"                    "RT_user_norm"              
[11] "Metacritic_norm"            "Metacritic_user_nom"       
[13] "IMDB_norm"                  "RT_norm_round"             
[15] "RT_user_norm_round"         "Metacritic_norm_round"     
[17] "Metacritic_user_norm_round" "IMDB_norm_round"           
[19] "Metacritic_user_vote_count" "IMDB_user_vote_count"      
[21] "Fandango_votes"             "Fandango_Difference"       
movie_ratings %>% filter(IMDB_user_vote_count>60000) %>% select(FILM, IMDB) %>% top_n(10, IMDB) %>% arrange(desc(IMDB))
                                               FILM IMDB
1                                 Inside Out (2015)  8.6
2                         Mad Max: Fury Road (2015)  8.3
3                         The Imitation Game (2014)  8.1
4                                    Birdman (2014)  7.9
5                    Avengers: Age of Ultron (2015)  7.8
6                                    Ant-Man (2015)  7.8
7               Kingsman: The Secret Service (2015)  7.8
8       Mission: Impossible – Rogue Nation (2015)  7.8
9                                 Ex Machina (2015)  7.7
10 The Hobbit: The Battle of the Five Armies (2014)  7.5

plot of chunk unnamed-chunk-5

plot of chunk unnamed-chunk-6

summary(movie_ratings[,2:3])
 RottenTomatoes   RottenTomatoes_User
 Min.   :  5.00   Min.   :20.00      
 1st Qu.: 31.25   1st Qu.:50.00      
 Median : 63.50   Median :66.50      
 Mean   : 60.85   Mean   :63.88      
 3rd Qu.: 89.00   3rd Qu.:81.00      
 Max.   :100.00   Max.   :94.00      
movie_ratings <- movie_ratings %>% mutate(
          z_RT = 
  (RottenTomatoes - mean(RottenTomatoes))/sd(RottenTomatoes),
          z_RT_users = 
    (RottenTomatoes_User - mean(RottenTomatoes_User))/sd(RottenTomatoes_User)
                         )
movie_ratings %>% ggplot(aes(x=z_RT , y=z_RT_users))+geom_point()

plot of chunk unnamed-chunk-9

with(movie_ratings, cor(RottenTomatoes, RottenTomatoes_User))
[1] 0.7814396
with(movie_ratings, cor(z_RT, z_RT_users))
[1] 0.7814396

A "best-fit"/"prediction"/"least squares regression" line

lm(z_RT_users~z_RT, data=movie_ratings)

Call:
lm(formula = z_RT_users ~ z_RT, data = movie_ratings)

Coefficients:
(Intercept)         z_RT  
 -2.982e-16    7.814e-01  

For every 1 standard deviation increase in Rotten Tomato Rating we expect a 0.78 increase in Rotten Tomato User Rating

movie_ratings %>% ggplot(aes(x=z_RT , y=z_RT_users))+geom_point()+geom_smooth(method="lm", se=FALSE)

plot of chunk unnamed-chunk-12

Deriving an equation for any least squares best-fit line

Recalling our dear friend, \( y = m \cdot x + b, \) we can write:

\[ zscore(y) = r \cdot zscore(x) \]

and with some plugging in and rearranging…

\[ \frac{y - \mu_y}{\sigma_y} = r \cdot \frac{x - \mu_x}{\sigma_x} \]

\[ y - \mu_y = r \cdot (x - \mu_x)\cdot \frac{\sigma_y}{\sigma_x} \]

\[ y = (r \cdot \frac{\sigma_y}{\sigma_x}) x + (\mu_y - r \cdot \mu_x \frac{\sigma_y}{\sigma_x} + ) \]

Writing a function in R!

\[ y = (r \cdot \frac{\sigma_y}{\sigma_x}) x + (\mu_y - r \cdot \mu_x \frac{\sigma_y}{\sigma_x} + ) \]

BestFit <- function(x, y){
  m = cor(x,y)*sd(y)/sd(x)
  b = mean(y) - cor(x,y)*mean(x)*(sd(y)/sd(x))
  return(list(m=m, b=b))
}
with(movie_ratings, BestFit(RottenTomatoes, RottenTomatoes_User))
$m
[1] 0.5186777

$b
[1] 32.31553
lm(RottenTomatoes_User~RottenTomatoes, data=movie_ratings)

Call:
lm(formula = RottenTomatoes_User ~ RottenTomatoes, data = movie_ratings)

Coefficients:
   (Intercept)  RottenTomatoes  
       32.3155          0.5187  

Making Predictions

If RottenTomatoes gave a more a score of 70, what score do we predict from RottenTomatoes_User?

fitline <- with(movie_ratings, BestFit(RottenTomatoes, RottenTomatoes_User))

fitline$m * 70 + fitline$b
[1] 68.62297

What if RottenTomatoes gave a more a score of 90? Or a score of 50?

fitline$m * 90 + fitline$b; fitline$m * 50 + fitline$b
[1] 78.99652
[1] 58.24941