1 Setup

1.1 R packages

library(broom)
library(ggrepel)
library(janitor)
library(knitr)
library(magrittr)
library(patchwork)
library(tidyverse)

theme_set(theme_bw())

1.2 Read in the data

file_raw <- "https://raw.githubusercontent.com/THOMASELOVE/431-2020/master/classes/movies/data/movies_2020-09-10.csv"

movies <- read_csv(file_raw)
Parsed with column specification:
cols(
  film_id = col_double(),
  film = col_character(),
  mentions = col_double(),
  year = col_double(),
  imdb_categories = col_character(),
  imdb_ratings = col_double(),
  imdb_stars = col_double(),
  length = col_double()
)
movies
# A tibble: 66 x 8
   film_id film    mentions  year imdb_categories imdb_ratings imdb_stars length
     <dbl> <chr>      <dbl> <dbl> <chr>                  <dbl>      <dbl>  <dbl>
 1       1 8 1/2          1  1963 Drama                 106555        8      138
 2       2 About ~        2  2013 Comedy, Drama,~       290158        7.8    123
 3       3 Avatar         1  2009 Action, Advent~      1101874        7.8    162
 4       4 Avenge~        1  2019 Action, Advent~       757530        8.4    181
 5       5 Avenge~        1  2018 Action, Advent~       798058        8.4    149
 6       6 Back t~        1  1985 Adventure, Com~      1028636        8.5    116
 7       7 Beaches        1  1988 Comedy, Drama,~        22854        7      123
 8       8 Beetle~        1  1988 Comedy, Fantasy       252221        7.5     92
 9       9 Being ~        1  1999 Comedy, Drama,~       305588        7.7    113
10      10 The Bi~        1  1998 Comedy, Crime,~       714755        8.1    117
# ... with 56 more rows

1.3 Sanity Checks

Let’s take a quick look at the variables we’ll actually use in our work:

movies %>% select(film, year, imdb_ratings) %$%
  summary(.)
     film                year       imdb_ratings    
 Length:66          Min.   :1955   Min.   :      6  
 Class :character   1st Qu.:1996   1st Qu.: 129284  
 Mode  :character   Median :2004   Median : 289463  
                    Mean   :2002   Mean   : 532777  
                    3rd Qu.:2012   3rd Qu.: 708161  
                    Max.   :2020   Max.   :2281331  

Mostly, I’m checking the minimum and maximum for the quantitative variables. These seem plausible, although that minimum of 6 in imdb_ratings is strikingly small.

movies %>% select(film, year, imdb_ratings) %>% 
  slice_min(imdb_ratings) %>% kable()
| film                                     | year | imdb_ratings |
|:-----------------------------------------|-----:|-------------:|
| Farewell My Concubine: the Beijing Opera | 2014 |            6 |

OK. It’s plausible that this film could have very few ratings.

2 Visualizing the Association

2.1 Create a new age variable

The year information tells us about the age of a film. We can calculate a film’s age by subtracting year from 2020, as follows.

movies <- movies %>%
    mutate(age = 2020 - year)

2.2 Exploratory Data Analyses

2.2.1 For imdb_ratings

imdb_ratings will be the outcome in our regression model, so understanding whether or not it is well described by a Normal model is somewhat helpful.

p1 <- ggplot(movies, aes(sample = imdb_ratings)) +
  geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot of imdb_ratings")

p2 <- ggplot(movies, aes(x = imdb_ratings)) +
  geom_histogram(aes(y = stat(density)), 
                 bins = 10, fill = "dodgerblue", col = "white") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$imdb_ratings), 
                            sd = sd(movies$imdb_ratings)),
                col = "navy", lwd = 1.5) +
  labs(title = "Histogram with Normal Density")

p3 <- ggplot(movies, aes(x = imdb_ratings, y = "")) +
  geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") + 
  labs(title = "Boxplot of imdb_ratings", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

It would be better to label the axes of these plots on a more legible scale. One approach to this would be to use the following strategy:

p1 <- p1 +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1))

p2 <- p2 +
    scale_x_continuous(labels = scales::label_number_si(accuracy = 0.1))

p3 <- p3 +
    scale_x_continuous(labels = scales::label_number_si(accuracy = 0.1))


p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

If we decide that imdb_ratings doesn’t follow a Normal distribution, that’s not going to change our approach to linear regression modeling. As always, we’ll need to look for Normality in the residuals from the model, not the outcome.

Here are some numerical summaries, as well.

mosaic::favstats(~ imdb_ratings, data = movies) %>% kable(digits = 2)
| min |       Q1 | median |       Q3 |     max |     mean |       sd |  n | missing |
|----:|---------:|-------:|---------:|--------:|---------:|---------:|---:|--------:|
|   6 | 129284.2 | 289463 | 708160.5 | 2281331 | 532776.9 | 568185.5 | 66 |       0 |

2.2.2 For age

We’ll treat age as a predictor in our regression model. Whether age is Normally distributed or not will be of no consequence in our modeling. Even so, we will often want to understand the center, spread, outliers and shape of a predictor’s distribution, just so that we have a better sense of the data and, in particular, of whether the interesting ages are well represented.

p1 <- ggplot(movies, aes(sample = age)) +
  geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot of age")

p2 <- ggplot(movies, aes(x = age)) +
  geom_histogram(aes(y = stat(density)), 
                 bins = 10, fill = "dodgerblue", col = "white") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$age), 
                            sd = sd(movies$age)),
                col = "navy", lwd = 1.5) +
  labs(title = "Histogram with Normal Density")

p3 <- ggplot(movies, aes(x = age, y = "")) +
  geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") + 
  labs(title = "Boxplot of age", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

mosaic::favstats(~ age, data = movies) %>% kable(digits = 2)
| min |   Q1 | median |    Q3 | max |  mean |   sd |  n | missing |
|----:|-----:|-------:|------:|----:|------:|-----:|---:|--------:|
|   0 | 8.25 |   15.5 | 23.75 |  65 | 18.02 | 13.5 | 66 |       0 |

2.3 A First Scatterplot

Now, we want to see the association between two quantitative variables: the age of the film, which we’ll treat as the predictor, and the number of IMDB ratings (imdb_ratings), which we’ll treat as our outcome.

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

This initial picture suggests that age alone isn’t a strong predictor of imdb_ratings. Nothing we do in what follows is going to change that.

2.4 Changing the Y Axis

Can we change the Y axis tickmark labels to something more readable?

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

2.5 Annotating the plot with r

Let’s add some text to indicate r, the Pearson correlation.

movies %$% cor(age, imdb_ratings)
[1] -0.04779641

We could just type in this value…

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    annotate(geom = "label", x = 50, y = 1250000, 
             label = "Correlation = -0.048") +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

2.6 Pulling the label from the data

A better approach would be to pull it from the data:

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    annotate(geom = "label", x = 50, y = 1250000, 
        label = paste0("Correlation r = ", 
                        movies %$% cor(age, imdb_ratings))) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

2.7 Rounding!

Whoops - probably want to round that off…

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    annotate(geom = "label", x = 50, y = 1250000, 
        label = paste0("Correlation r = ", 
                        round_half_up(movies %$% cor(age, imdb_ratings),3))) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

Another option is to use signif_half_up to specify the number of significant figures you want to see in the result.
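
For instance, something like this (a quick sketch; it assumes your installed version of janitor includes signif_half_up):

# show r to 2 significant figures, with halves rounded up
signif_half_up(movies %$% cor(age, imdb_ratings), digits = 2)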

2.8 Can we label the films?

Now, can we label the films?

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    geom_text_repel(aes(label = film)) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

2.8.1 Labeling Some of the Films

Hmmm, maybe we don’t want to label them all. Let’s label the films that are at the top of the plot or on the far right. We’ll select those films that either have more than 1.7 million ratings or are more than 40 years old. Also, we’ll use geom_label_repel rather than geom_text_repel to see what that does.

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    geom_label_repel(aes(label = film),
        data = movies %>% filter(imdb_ratings > 1700000 | age > 40)) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

I would suggest looking at the R Graphics Cookbook Chapter on Scatter Plots for recipes that might improve this work.

2.9 Adding fitted smooths

Let’s add a couple of smooths to the plot.

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    geom_text_repel(aes(label = film),
        data = movies %>% filter(imdb_ratings > 1700000 | age > 40)) +
    geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
    geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

3 Fitting mod_A

3.1 The Linear Model?

Let’s look at that linear model.

mod_A <- lm(imdb_ratings ~ age, data = movies)

summary(mod_A)

Call:
lm(formula = imdb_ratings ~ age, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-556940 -364284 -236804  176939 1764616 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   569016     117975   4.823  9.1e-06 ***
age            -2012       5255  -0.383    0.703    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 572000 on 64 degrees of freedom
Multiple R-squared:  0.002284,  Adjusted R-squared:  -0.0133 
F-statistic: 0.1465 on 1 and 64 DF,  p-value: 0.7031

3.2 Tidy Model Coefficients and CIs

tidy(mod_A, conf.int = TRUE, conf.level = 0.90) %>% kable()
| term        |   estimate |  std.error |  statistic |   p.value |  conf.low |  conf.high |
|:------------|-----------:|-----------:|-----------:|----------:|----------:|-----------:|
| (Intercept) | 569015.901 | 117975.307 |  4.8231780 | 0.0000091 | 372113.58 | 765918.225 |
| age         |  -2011.584 |   5254.801 | -0.3828088 | 0.7031300 | -10781.92 |   6758.747 |

3.3 What does glance provide?

Here are the glance summaries we’ll use in the early part of the course.

glance(mod_A) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>% 
    kable()
| r.squared | adj.r.squared |    sigma |      AIC |      BIC | nobs |
|----------:|--------------:|---------:|---------:|---------:|-----:|
| 0.0022845 |    -0.0133048 | 571952.8 | 1941.168 | 1947.737 |   66 |

Here are the other summaries that glance provides for a linear model fit using lm.

glance(mod_A) %>% select(statistic, p.value, df, logLik, deviance, df.residual)
# A tibble: 1 x 6
  statistic p.value    df logLik deviance df.residual
      <dbl>   <dbl> <dbl>  <dbl>    <dbl>       <int>
1     0.147   0.703     1  -968.  2.09e13          64

3.4 mod_A: Predictions

movies_augA <- augment(mod_A, movies)

Let’s look at the predictions for the first few films in the data set.

movies_augA %>% 
  select(film_id, film, year, age, imdb_ratings, .fitted, .resid) %>% 
  head(5) %>% kable()
| film_id | film                   | year | age | imdb_ratings |  .fitted |    .resid |
|--------:|:-----------------------|-----:|----:|-------------:|---------:|----------:|
|       1 | 8 1/2                  | 1963 |  57 |       106555 | 454355.6 | -347800.6 |
|       2 | About Time             | 2013 |   7 |       290158 | 554934.8 | -264776.8 |
|       3 | Avatar                 | 2009 |  11 |      1101874 | 546888.5 |  554985.5 |
|       4 | Avengers: Endgame      | 2019 |   1 |       757530 | 567004.3 |  190525.7 |
|       5 | Avengers: Infinity War | 2018 |   2 |       798058 | 564992.7 |  233065.3 |

OK. Let’s look at the residual plots to see if our regression assumptions are reasonable now.

3.5 mod_A Residual Plots

p1 <- ggplot(movies_augA, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = F,
              lty = "dashed", col = "black") +
  geom_smooth(method = "loess", formula = y ~ x, se = F, 
              col = "blue") +
  geom_text_repel(data = movies_augA %>% 
                    slice_max(abs(.resid), n = 3), 
                  aes(label = film)) +
  labs(title = "mod_A Residuals vs. Fitted",
       x = "Fitted Values from mod_A",
       y = "Residuals from mod_A")

p2 <- ggplot(movies_augA, aes(sample = .resid)) +
  geom_qq() + geom_qq_line(col = "red") + 
  labs(title = "mod_A Residuals",
       y = "")

p3 <- ggplot(movies_augA, aes(y = .resid, x = "")) +
  geom_violin(fill = "tomato") +
  geom_boxplot(width = 0.5) + 
  labs(y = "", x = "")

p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))

I don’t see a lot of curve in the residuals vs. fitted plot, but we definitely have a problem with the Normality assumption for the residuals. The plots show some substantial right skew. It might be wise to consider transforming our outcome with, for instance, a logarithm.

4 Consider Transformation?

4.1 Visualizing on the Log Scale

It looks like the relationship is pretty weak, but I am a bit concerned about the few films with very high numbers of ratings.

Might we try a transformation? Suppose we place the imdb_ratings on a logarithmic scale? R has a tool to help us do this for base 10 logs, so let’s try that.

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_log10() +
    geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
    geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

4.2 Identifying a new outlier

Now, maybe we have a different outlier to worry about. What is that smallest value?

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_log10() +
    geom_text_repel(aes(label = film), col = "purple",
        data = movies %>% slice_min(imdb_ratings)) +
    geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
    geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

4.3 Three Least Often Rated Films

What are the three films least often rated?

movies %>% select(film_id, film, imdb_ratings) %>% 
    slice_min(imdb_ratings, n = 3)
# A tibble: 3 x 3
  film_id film                                     imdb_ratings
    <dbl> <chr>                                           <dbl>
1      21 Farewell My Concubine: the Beijing Opera            6
2      33 House Party 2                                    5921
3      59 Still Walking                                   13154

5 Fitting mod_B

Note that since we’ve transformed the outcome (from imdb_ratings to its logarithm), the summaries here (like \(R^2\)) are no longer comparable to what we saw in mod_A.

For example, the Pearson correlation of age with log10(imdb_ratings) is different from the Pearson correlation of age with the raw imdb_ratings.

movies %$% cor(log10(imdb_ratings), age)
[1] -0.04998179
movies %$% cor(imdb_ratings, age)
[1] -0.04779641

5.1 What is the resulting model?

mod_B <- lm(log10(imdb_ratings) ~ age, data = movies)

summary(mod_B)

Call:
lm(formula = log10(imdb_ratings) ~ age, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6457 -0.2279  0.0806  0.4367  0.9938 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.441649   0.166693   32.65   <2e-16 ***
age         -0.002973   0.007425   -0.40     0.69    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8081 on 64 degrees of freedom
Multiple R-squared:  0.002498,  Adjusted R-squared:  -0.01309 
F-statistic: 0.1603 on 1 and 64 DF,  p-value: 0.6902

5.2 Coefficients and Summaries

tidy(mod_B, conf.int = TRUE, conf.level = 0.90) %>% kable()
| term        |   estimate | std.error |  statistic |   p.value |   conf.low | conf.high |
|:------------|-----------:|----------:|-----------:|----------:|-----------:|----------:|
| (Intercept) |  5.4416493 | 0.1666925 | 32.6448316 | 0.0000000 |  5.1634373 | 5.7198613 |
| age         | -0.0029725 | 0.0074247 | -0.4003547 | 0.6902285 | -0.0153645 | 0.0094195 |
glance(mod_B) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>% 
    kable()
| r.squared | adj.r.squared |     sigma |      AIC |      BIC | nobs |
|----------:|--------------:|----------:|---------:|---------:|-----:|
| 0.0024982 |    -0.0130878 | 0.8081374 | 163.1499 | 169.7189 |   66 |

What conclusions can you draw here?

5.3 mod_B: Predictions

movies_augB <- augment(mod_B, movies)

Again, let’s look at the predictions for the first few films in the data set. Note that we are now predicting the log10 of imdb_ratings, so the .fitted and .resid values below are on the log10 scale.

movies_augB %>% 
  mutate(log10_ratings = log10(imdb_ratings)) %>%
  select(film_id, film, year, age, 
         imdb_ratings, log10_ratings, .fitted, .resid) %>% 
  head(5) %>% kable()
| film_id | film                   | year | age | imdb_ratings | log10_ratings |  .fitted |     .resid |
|--------:|:-----------------------|-----:|----:|-------------:|--------------:|---------:|-----------:|
|       1 | 8 1/2                  | 1963 |  57 |       106555 |      5.027574 | 5.272215 | -0.2446413 |
|       2 | About Time             | 2013 |   7 |       290158 |      5.462635 | 5.420842 |  0.0417929 |
|       3 | Avatar                 | 2009 |  11 |      1101874 |      6.042132 | 5.408951 |  0.6331805 |
|       4 | Avengers: Endgame      | 2019 |   1 |       757530 |      5.879400 | 5.438677 |  0.4407231 |
|       5 | Avengers: Infinity War | 2018 |   2 |       798058 |      5.902035 | 5.435704 |  0.4663302 |
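
Since the .fitted values are on the log10 scale, one way to see these predictions back on the original ratings scale is to back-transform them, for example like this (a sketch, not part of the original output; pred_ratings is just an illustrative name):

movies_augB %>% 
  mutate(pred_ratings = 10^.fitted) %>%     # back-transform log10 fits to the ratings scale
  select(film_id, film, imdb_ratings, .fitted, pred_ratings) %>% 
  head(5) %>% kable()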

OK. Let’s look at the residual plots to see if our regression assumptions are more reasonable now.

5.4 mod_B Residual Plots

p1 <- ggplot(movies_augB, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = F,
              lty = "dashed", col = "black") +
  geom_smooth(method = "loess", formula = y ~ x, se = F, 
              col = "blue") +
  geom_text_repel(data = movies_augB %>% 
                    slice_max(abs(.resid), n = 3), 
                  aes(label = film)) +
  labs(title = "mod_B Residuals vs. Fitted",
       x = "Fitted Values from mod_B",
       y = "Residuals from mod_B")

p2 <- ggplot(movies_augB, aes(sample = .resid)) +
  geom_qq() + geom_qq_line(col = "red") + 
  labs(title = "mod_B Residuals",
       y = "")

p3 <- ggplot(movies_augB, aes(y = .resid, x = "")) +
  geom_violin(fill = "tomato") +
  geom_boxplot(width = 0.5) + 
  labs(y = "", x = "")

p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))

That low outlier certainly stands out. Perhaps we should look at the data excluding that point to see if we can plausibly fit a model.

6 Fitting mod_C

6.1 Dropping One Film

Suppose we decided to look at how well we could predict the logarithm of imdb_ratings if we dropped “Farewell My Concubine” from the list. Let’s create a new tibble where we filter this film out.

movies_minus_one <- movies %>% filter(imdb_ratings > 10)

6.2 Redrawing the Association

ggplot(movies_minus_one, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_log10() +
    geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
    geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students",
         subtitle = "Excluding one film with only 6 IMDB Ratings")

6.3 Fitting model mod_C

mod_C <- lm(log10(imdb_ratings) ~ age, data = movies_minus_one)

summary(mod_C)

Call:
lm(formula = log10(imdb_ratings) ~ age, data = movies_minus_one)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.60221 -0.22154  0.04478  0.31639  0.97105 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.601282   0.116619  48.031   <2e-16 ***
age         -0.007817   0.005158  -1.516    0.135    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5579 on 63 degrees of freedom
Multiple R-squared:  0.03517,   Adjusted R-squared:  0.01986 
F-statistic: 2.297 on 1 and 63 DF,  p-value: 0.1346

6.4 Coefficients and Summaries

tidy(mod_C, conf.int = TRUE, conf.level = 0.90) %>% kable()
| term        |   estimate | std.error | statistic |   p.value |   conf.low | conf.high |
|:------------|-----------:|----------:|----------:|----------:|-----------:|----------:|
| (Intercept) |  5.6012824 | 0.1166190 | 48.030623 | 0.0000000 |  5.4065984 | 5.7959664 |
| age         | -0.0078166 | 0.0051577 | -1.515519 | 0.1346426 | -0.0164268 | 0.0007937 |
glance(mod_C) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>% 
    kable()
| r.squared | adj.r.squared |     sigma |      AIC |      BIC | nobs |
|----------:|--------------:|----------:|---------:|---------:|-----:|
| 0.0351747 |     0.0198601 | 0.5578977 | 112.5652 | 119.0884 |   65 |

Are the \(R^2\) values we obtain for Model C comparable to those we developed for Model A? For Model B?
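
One way to explore that question (a sketch of my own, not part of the original analysis) is to put the three models on a common footing: restrict attention to the 65 films kept for Model C, and compare observed imdb_ratings with each model’s predictions on the original ratings scale, back-transforming mod_B and mod_C from log10. The names common, fit_A, fit_B, and fit_C below are just illustrative.

# squared correlations between observed ratings and each model's predictions,
# computed on the original ratings scale for the 65 films used in mod_C
common <- movies %>% filter(imdb_ratings > 10)

tibble(observed = common$imdb_ratings,
       fit_A = predict(mod_A, newdata = common),
       fit_B = 10^predict(mod_B, newdata = common),
       fit_C = 10^predict(mod_C, newdata = common)) %>%
  summarize(across(starts_with("fit_"), ~ cor(observed, .x)^2))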

6.5 mod_C: Predictions

movies_augC <- augment(mod_C, movies_minus_one)

Again, let’s look at the predictions for the first few films in the data set. Note that we are again predicting the log10 of imdb_ratings, so the .fitted and .resid values below are on the log10 scale.

movies_augC %>% 
  mutate(log10_ratings = log10(imdb_ratings)) %>%
  select(film_id, film, year, age, 
         imdb_ratings, log10_ratings, .fitted, .resid) %>% 
  head(5) %>% kable()
| film_id | film                   | year | age | imdb_ratings | log10_ratings |  .fitted |     .resid |
|--------:|:-----------------------|-----:|----:|-------------:|--------------:|---------:|-----------:|
|       1 | 8 1/2                  | 1963 |  57 |       106555 |      5.027574 | 5.155739 | -0.1281646 |
|       2 | About Time             | 2013 |   7 |       290158 |      5.462635 | 5.546566 | -0.0839319 |
|       3 | Avatar                 | 2009 |  11 |      1101874 |      6.042132 | 5.515300 |  0.5268317 |
|       4 | Avengers: Endgame      | 2019 |   1 |       757530 |      5.879400 | 5.593466 |  0.2859340 |
|       5 | Avengers: Infinity War | 2018 |   2 |       798058 |      5.902035 | 5.585649 |  0.3163852 |

OK. Let’s look at the residual plots to see if our regression assumptions are more reasonable now.

6.6 mod_C Residual Plots

p1 <- ggplot(movies_augC, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = F,
              lty = "dashed", col = "black") +
  geom_smooth(method = "loess", formula = y ~ x, se = F, 
              col = "blue") +
  geom_text_repel(data = movies_augC %>% 
                    slice_max(abs(.resid), n = 3), 
                  aes(label = film)) +
  labs(title = "mod_C Residuals vs. Fitted",
       x = "Fitted Values from mod_C",
       y = "Residuals from mod_C")

p2 <- ggplot(movies_augC, aes(sample = .resid)) +
  geom_qq() + geom_qq_line(col = "red") + 
  labs(title = "mod_C Residuals",
       y = "")

p3 <- ggplot(movies_augC, aes(y = .resid, x = "")) +
  geom_violin(fill = "tomato") +
  geom_boxplot(width = 0.5) + 
  labs(y = "", x = "")

p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))

We still have some low outliers, but the residuals are closer to a Normal distribution, and I don’t see a strong curve in the plot against fitted values.

The main problem is that the model remains very weak. age alone isn’t a strong predictor of imdb_ratings.

7 Closing Materials

To view the HTML report generated by this R Markdown file, visit https://rpubs.com/TELOVE/movies-A-431-2020

7.1 Session Info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggridges_0.5.2    mosaicData_0.20.1 ggformula_0.9.4   ggstance_0.3.4   
 [5] Matrix_1.2-18     lattice_0.20-41   forcats_0.5.0     stringr_1.4.0    
 [9] dplyr_1.0.2       purrr_0.3.4       readr_1.3.1       tidyr_1.1.2      
[13] tibble_3.0.3      tidyverse_1.3.0   patchwork_1.0.1   magrittr_1.5     
[17] knitr_1.29        janitor_2.0.1     ggrepel_0.8.2     ggplot2_3.3.2    
[21] broom_0.7.0      

loaded via a namespace (and not attached):
 [1] nlme_3.1-148      fs_1.5.0          lubridate_1.7.9   httr_1.4.2       
 [5] tools_4.0.2       backports_1.1.10  utf8_1.1.4        R6_2.4.1         
 [9] DBI_1.1.0         mgcv_1.8-31       colorspace_1.4-1  withr_2.2.0      
[13] tidyselect_1.1.0  gridExtra_2.3     leaflet_2.0.3     curl_4.3         
[17] compiler_4.0.2    cli_2.0.2         rvest_0.3.6       xml2_1.3.2       
[21] ggdendro_0.1.22   labeling_0.3      mosaicCore_0.8.0  scales_1.1.1     
[25] digest_0.6.25     rmarkdown_2.3.3   pkgconfig_2.0.3   htmltools_0.5.0  
[29] dbplyr_1.4.4      highr_0.8         htmlwidgets_1.5.1 rlang_0.4.7      
[33] readxl_1.3.1      rstudioapi_0.11   farver_2.0.3      generics_0.0.2   
[37] jsonlite_1.7.1    crosstalk_1.1.0.1 Rcpp_1.0.5        munsell_0.5.0    
[41] fansi_0.4.1       lifecycle_0.2.0   stringi_1.5.3     yaml_2.2.1       
[45] snakecase_0.11.0  MASS_7.3-53       plyr_1.8.6        grid_4.0.2       
[49] blob_1.2.1        crayon_1.3.4      haven_2.3.1       splines_4.0.2    
[53] hms_0.5.3         pillar_1.4.6      reprex_0.3.0      glue_1.4.2       
[57] evaluate_0.14     modelr_0.1.8      vctrs_0.3.4       tweenr_1.0.1     
[61] cellranger_1.1.0  gtable_0.3.0      polyclip_1.10-0   assertthat_0.2.1 
[65] xfun_0.16         ggforce_0.3.2     mosaic_1.8.2      ellipsis_0.3.1