Correlation as Prediction

Probability and Statistics

FILM	RottenTomatoes	RottenTomatoes_User	Metacritic	Metacritic_User	IMDB	Fandango_Stars	Fandango_Ratingvalue	RT_norm	RT_user_norm	Metacritic_norm	Metacritic_user_nom	IMDB_norm	RT_norm_round	RT_user_norm_round	Metacritic_norm_round	Metacritic_user_norm_round	IMDB_norm_round	Metacritic_user_vote_count	IMDB_user_vote_count	Fandango_votes	Fandango_Difference
Avengers: Age of Ultron (2015)	74	86	66	7.1	7.8	5.0	4.5	3.70	4.3	3.30	3.55	3.90	3.5	4.5	3.5	3.5	4.0	1330	271107	14846	0.5
Cinderella (2015)	85	80	67	7.5	7.1	5.0	4.5	4.25	4.0	3.35	3.75	3.55	4.5	4.0	3.5	4.0	3.5	249	65709	12640	0.5
Ant-Man (2015)	80	90	64	8.1	7.8	5.0	4.5	4.00	4.5	3.20	4.05	3.90	4.0	4.5	3.0	4.0	4.0	627	103660	12055	0.5
Do You Believe? (2015)	18	84	22	4.7	5.4	5.0	4.5	0.90	4.2	1.10	2.35	2.70	1.0	4.0	1.0	2.5	2.5	31	3136	1793	0.5
Hot Tub Time Machine 2 (2015)	14	28	29	3.4	5.1	3.5	3.0	0.70	1.4	1.45	1.70	2.55	0.5	1.5	1.5	1.5	2.5	88	19560	1021	0.5
The Water Diviner (2015)	63	62	50	6.8	7.2	4.5	4.0	3.15	3.1	2.50	3.40	3.60	3.0	3.0	2.5	3.5	3.5	34	39373	397	0.5

 [1] "FILM"                       "RottenTomatoes"            
 [3] "RottenTomatoes_User"        "Metacritic"                
 [5] "Metacritic_User"            "IMDB"                      
 [7] "Fandango_Stars"             "Fandango_Ratingvalue"      
 [9] "RT_norm"                    "RT_user_norm"              
[11] "Metacritic_norm"            "Metacritic_user_nom"       
[13] "IMDB_norm"                  "RT_norm_round"             
[15] "RT_user_norm_round"         "Metacritic_norm_round"     
[17] "Metacritic_user_norm_round" "IMDB_norm_round"           
[19] "Metacritic_user_vote_count" "IMDB_user_vote_count"      
[21] "Fandango_votes"             "Fandango_Difference"

movie_ratings %>% filter(IMDB_user_vote_count>60000) %>% select(FILM, IMDB) %>% top_n(10, IMDB) %>% arrange(desc(IMDB))

                                               FILM IMDB
1                                 Inside Out (2015)  8.6
2                         Mad Max: Fury Road (2015)  8.3
3                         The Imitation Game (2014)  8.1
4                                    Birdman (2014)  7.9
5                    Avengers: Age of Ultron (2015)  7.8
6                                    Ant-Man (2015)  7.8
7               Kingsman: The Secret Service (2015)  7.8
8       Mission: Impossible â€“ Rogue Nation (2015)  7.8
9                                 Ex Machina (2015)  7.7
10 The Hobbit: The Battle of the Five Armies (2014)  7.5

plot of chunk unnamed-chunk-5

plot of chunk unnamed-chunk-6

summary(movie_ratings[,2:3])

 RottenTomatoes   RottenTomatoes_User
 Min.   :  5.00   Min.   :20.00      
 1st Qu.: 31.25   1st Qu.:50.00      
 Median : 63.50   Median :66.50      
 Mean   : 60.85   Mean   :63.88      
 3rd Qu.: 89.00   3rd Qu.:81.00      
 Max.   :100.00   Max.   :94.00

movie_ratings <- movie_ratings %>% mutate(
          z_RT = 
  (RottenTomatoes - mean(RottenTomatoes))/sd(RottenTomatoes),
          z_RT_users = 
    (RottenTomatoes_User - mean(RottenTomatoes_User))/sd(RottenTomatoes_User)
                         )

movie_ratings %>% ggplot(aes(x=z_RT , y=z_RT_users))+geom_point()

plot of chunk unnamed-chunk-9

with(movie_ratings, cor(RottenTomatoes, RottenTomatoes_User))

[1] 0.7814396

with(movie_ratings, cor(z_RT, z_RT_users))

[1] 0.7814396

A "best-fit"/"prediction"/"least squares regression" line

lm(z_RT_users~z_RT, data=movie_ratings)


Call:
lm(formula = z_RT_users ~ z_RT, data = movie_ratings)

Coefficients:
(Intercept)         z_RT  
 -2.982e-16    7.814e-01

For every 1 standard deviation increase in Rotten Tomato Rating we expect a 0.78 increase in Rotten Tomato User Rating

movie_ratings %>% ggplot(aes(x=z_RT , y=z_RT_users))+geom_point()+geom_smooth(method="lm", se=FALSE)

plot of chunk unnamed-chunk-12

Deriving an equation for any least squares best-fit line

Recalling our dear friend, \( y = m \cdot x + b, \) we can write:

\[ zscore(y) = r \cdot zscore(x) \]

and with some plugging in and rearranging…

\[ \frac{y - \mu_y}{\sigma_y} = r \cdot \frac{x - \mu_x}{\sigma_x} \]

\[ y - \mu_y = r \cdot (x - \mu_x)\cdot \frac{\sigma_y}{\sigma_x} \]

\[ y = (r \cdot \frac{\sigma_y}{\sigma_x}) x + (\mu_y - r \cdot \mu_x \frac{\sigma_y}{\sigma_x} + ) \]

Writing a function in R!

\[ y = (r \cdot \frac{\sigma_y}{\sigma_x}) x + (\mu_y - r \cdot \mu_x \frac{\sigma_y}{\sigma_x} + ) \]

BestFit <- function(x, y){
  m = cor(x,y)*sd(y)/sd(x)
  b = mean(y) - cor(x,y)*mean(x)*(sd(y)/sd(x))
  return(list(m=m, b=b))
}

with(movie_ratings, BestFit(RottenTomatoes, RottenTomatoes_User))

$m
[1] 0.5186777

$b
[1] 32.31553

lm(RottenTomatoes_User~RottenTomatoes, data=movie_ratings)


Call:
lm(formula = RottenTomatoes_User ~ RottenTomatoes, data = movie_ratings)

Coefficients:
   (Intercept)  RottenTomatoes  
       32.3155          0.5187

Making Predictions

If RottenTomatoes gave a more a score of 70, what score do we predict from RottenTomatoes_User?

fitline <- with(movie_ratings, BestFit(RottenTomatoes, RottenTomatoes_User))

fitline$m * 70 + fitline$b

[1] 68.62297

What if RottenTomatoes gave a more a score of 90? Or a score of 50?

fitline$m * 90 + fitline$b; fitline$m * 50 + fitline$b

[1] 78.99652

[1] 58.24941