1 Setup

1.1 R packages

library(broom)
library(ggrepel)
library(janitor)
library(knitr)
library(magrittr)
library(patchwork)
library(tidyverse)

theme_set(theme_bw())

1.2 Read in the data

file_raw <- "https://raw.githubusercontent.com/THOMASELOVE/431-2020/master/classes/movies/data/movies_2020-09-10.csv"

movies <- read_csv(file_raw)
Parsed with column specification:
cols(
  film_id = col_double(),
  film = col_character(),
  mentions = col_double(),
  year = col_double(),
  imdb_categories = col_character(),
  imdb_ratings = col_double(),
  imdb_stars = col_double(),
  length = col_double()
)
movies
# A tibble: 66 x 8
   film_id film    mentions  year imdb_categories imdb_ratings imdb_stars length
     <dbl> <chr>      <dbl> <dbl> <chr>                  <dbl>      <dbl>  <dbl>
 1       1 8 1/2          1  1963 Drama                 106555        8      138
 2       2 About ~        2  2013 Comedy, Drama,~       290158        7.8    123
 3       3 Avatar         1  2009 Action, Advent~      1101874        7.8    162
 4       4 Avenge~        1  2019 Action, Advent~       757530        8.4    181
 5       5 Avenge~        1  2018 Action, Advent~       798058        8.4    149
 6       6 Back t~        1  1985 Adventure, Com~      1028636        8.5    116
 7       7 Beaches        1  1988 Comedy, Drama,~        22854        7      123
 8       8 Beetle~        1  1988 Comedy, Fantasy       252221        7.5     92
 9       9 Being ~        1  1999 Comedy, Drama,~       305588        7.7    113
10      10 The Bi~        1  1998 Comedy, Crime,~       714755        8.1    117
# ... with 56 more rows

1.3 Sanity Checks

Let’s take a quick look at the variables we’ll actually use in our work:

movies %>% select(film, year, imdb_ratings) %$%
  summary(.)
     film                year       imdb_ratings    
 Length:66          Min.   :1955   Min.   :      6  
 Class :character   1st Qu.:1996   1st Qu.: 129284  
 Mode  :character   Median :2004   Median : 289463  
                    Mean   :2002   Mean   : 532777  
                    3rd Qu.:2012   3rd Qu.: 708161  
                    Max.   :2020   Max.   :2281331  

Mostly, I’m checking the minimum and maximum for the quantitative variables. These seem plausible, although that minimum of 6 in imdb_ratings is strikingly small.

movies %>% select(film, year, imdb_ratings) %>% 
  slice_min(imdb_ratings) %>% kable()
| film                                     | year | imdb_ratings |
|:-----------------------------------------|-----:|-------------:|
| Farewell My Concubine: the Beijing Opera | 2014 |            6 |

OK. It’s plausible that this film could have very few ratings.

2 Visualizing the Association

2.1 Create a new age variable

The year information tells us about the age of a film. We can calculate a film’s age by subtracting year from 2020, as follows.

movies <- movies %>%
    mutate(age = 2020 - year)

2.2 Exploratory Data Analyses

2.2.1 For imdb_ratings

imdb_ratings will be the outcome in our regression model, so understanding whether or not it is well described by a Normal model is somewhat helpful.

p1 <- ggplot(movies, aes(sample = imdb_ratings)) +
  geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot of imdb_ratings")

p2 <- ggplot(movies, aes(x = imdb_ratings)) +
  geom_histogram(aes(y = stat(density)), 
                 bins = 10, fill = "dodgerblue", col = "white") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$imdb_ratings), 
                            sd = sd(movies$imdb_ratings)),
                col = "navy", lwd = 1.5) +
  labs(title = "Histogram with Normal Density")

p3 <- ggplot(movies, aes(x = imdb_ratings, y = "")) +
  geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") + 
  labs(title = "Boxplot of imdb_ratings", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

It would be better to label the axes of these plots on a more legible scale. One approach to this would be to use the following strategy:

p1 <- p1 +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1))

p2 <- p2 +
    scale_x_continuous(labels = scales::label_number_si(accuracy = 0.1))

p3 <- p3 +
    scale_x_continuous(labels = scales::label_number_si(accuracy = 0.1))


p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

If we decide that imdb_ratings doesn’t follow a Normal distribution, that’s not going to change our approach to linear regression modeling. As always, we’ll need to look for Normality in the residuals from the model, not the outcome.

Here are some numerical summaries, as well.

mosaic::favstats(~ imdb_ratings, data = movies) %>% kable(digits = 2)
| min |       Q1 | median |       Q3 |     max |     mean |       sd |  n | missing |
|----:|---------:|-------:|---------:|--------:|---------:|---------:|---:|--------:|
|   6 | 129284.2 | 289463 | 708160.5 | 2281331 | 532776.9 | 568185.5 | 66 |       0 |

2.2.2 For age

We’ll treat age as a predictor in our regression model. Whether age is Normally distributed or not will be of no consequence in our modeling. Even so, we will often want to understand the center, spread, outliers and shape of a predictor’s distribution, just so that we have a better sense of the data and, in particular, of whether the interesting ages are well represented.

p1 <- ggplot(movies, aes(sample = age)) +
  geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot of age")

p2 <- ggplot(movies, aes(x = age)) +
  geom_histogram(aes(y = stat(density)), 
                 bins = 10, fill = "dodgerblue", col = "white") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(movies$age), 
                            sd = sd(movies$age)),
                col = "navy", lwd = 1.5) +
  labs(title = "Histogram with Normal Density")

p3 <- ggplot(movies, aes(x = age, y = "")) +
  geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") + 
  labs(title = "Boxplot of age", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

mosaic::favstats(~ age, data = movies) %>% kable(digits = 2)
| min |   Q1 | median |    Q3 | max |  mean |   sd |  n | missing |
|----:|-----:|-------:|------:|----:|------:|-----:|---:|--------:|
|   0 | 8.25 |   15.5 | 23.75 |  65 | 18.02 | 13.5 | 66 |       0 |

2.3 A First Scatterplot

Now, we want to see the association between two quantitative variables: the age of the film, which we’ll treat as the predictor, and the number of IMDB ratings (imdb_ratings), which we’ll treat as our outcome.

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

This initial picture suggests that age alone isn’t a strong predictor of imdb_ratings. Nothing we do in what follows is going to change that.

2.4 Changing the Y Axis

Can we change the Y axis tickmark labels to something more readable?

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

2.5 Annotating the plot with r

Let’s add some text to indicate r, the Pearson correlation.

movies %$% cor(age, imdb_ratings)
[1] -0.04779641

We could just type in this value…

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    annotate(geom = "label", x = 50, y = 1250000, 
             label = "Correlation = -0.048") +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

2.6 Pulling the label from the data

A better approach would be to pull it from the data:

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    annotate(geom = "label", x = 50, y = 1250000, 
        label = paste0("Correlation r = ", 
                        movies %$% cor(age, imdb_ratings))) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

2.7 Rounding!

Whoops - probably want to round that off…

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    annotate(geom = "label", x = 50, y = 1250000, 
        label = paste0("Correlation r = ", 
                        round_half_up(movies %$% cor(age, imdb_ratings),3))) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

Another option is to use signif_half_up to specify the number of significant figures you want to see in the result.
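
For instance, something like this (a quick sketch; it assumes your installed version of janitor includes signif_half_up):

# show r to 2 significant figures, with halves rounded up
signif_half_up(movies %$% cor(age, imdb_ratings), digits = 2)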

2.8 Can we label the films?

Now, can we label the films?

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    geom_text_repel(aes(label = film)) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

2.8.1 Labeling Some of the Films

Hmmm, maybe we don’t want to label them all. Let’s label the films that are at the top of the plot or on the far right. We’ll select those films that either have more than 1.7 million ratings or are more than 40 years old. Also, we’ll use geom_label_repel rather than geom_text_repel to see what that does.

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    geom_label_repel(aes(label = film),
        data = movies %>% filter(imdb_ratings > 1700000 | age > 40)) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

I would suggest looking at the R Graphics Cookbook Chapter on Scatter Plots for recipes that might improve this work.

2.9 Adding fitted smooths

Let’s add a couple of smooths to the plot.

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
    geom_text_repel(aes(label = film),
        data = movies %>% filter(imdb_ratings > 1700000 | age > 40)) +
    geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
    geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

3 Fitting mod_A

3.1 The Linear Model?

Let’s look at that linear model.

mod_A <- lm(imdb_ratings ~ age, data = movies)

summary(mod_A)

Call:
lm(formula = imdb_ratings ~ age, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-556940 -364284 -236804  176939 1764616 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   569016     117975   4.823  9.1e-06 ***
age            -2012       5255  -0.383    0.703    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 572000 on 64 degrees of freedom
Multiple R-squared:  0.002284,  Adjusted R-squared:  -0.0133 
F-statistic: 0.1465 on 1 and 64 DF,  p-value: 0.7031

3.2 Tidy Model Coefficients and CIs

tidy(mod_A, conf.int = TRUE, conf.level = 0.90) %>% kable()
| term        |   estimate |  std.error |  statistic |   p.value |  conf.low |  conf.high |
|:------------|-----------:|-----------:|-----------:|----------:|----------:|-----------:|
| (Intercept) | 569015.901 | 117975.307 |  4.8231780 | 0.0000091 | 372113.58 | 765918.225 |
| age         |  -2011.584 |   5254.801 | -0.3828088 | 0.7031300 | -10781.92 |   6758.747 |

3.3 What does glance provide?

Here are the glance summaries we’ll use in the early part of the course.

glance(mod_A) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>% 
    kable()
| r.squared | adj.r.squared |    sigma |      AIC |      BIC | nobs |
|----------:|--------------:|---------:|---------:|---------:|-----:|
| 0.0022845 |    -0.0133048 | 571952.8 | 1941.168 | 1947.737 |   66 |

Here are the other summaries that glance provides for a linear model fit using lm.

glance(mod_A) %>% select(statistic, p.value, df, logLik, deviance, df.residual)
# A tibble: 1 x 6
  statistic p.value    df logLik deviance df.residual
      <dbl>   <dbl> <dbl>  <dbl>    <dbl>       <int>
1     0.147   0.703     1  -968.  2.09e13          64

3.4 mod_A: Predictions

movies_augA <- augment(mod_A, movies)

Let’s look at the predictions for the first few films in the data set.

movies_augA %>% 
  select(film_id, film, year, age, imdb_ratings, .fitted, .resid) %>% 
  head(5) %>% kable()
| film_id | film                   | year | age | imdb_ratings |  .fitted |    .resid |
|--------:|:-----------------------|-----:|----:|-------------:|---------:|----------:|
|       1 | 8 1/2                  | 1963 |  57 |       106555 | 454355.6 | -347800.6 |
|       2 | About Time             | 2013 |   7 |       290158 | 554934.8 | -264776.8 |
|       3 | Avatar                 | 2009 |  11 |      1101874 | 546888.5 |  554985.5 |
|       4 | Avengers: Endgame      | 2019 |   1 |       757530 | 567004.3 |  190525.7 |
|       5 | Avengers: Infinity War | 2018 |   2 |       798058 | 564992.7 |  233065.3 |

OK. Let’s look at the residual plots to see if our regression assumptions are reasonable now.

3.5 mod_A Residual Plots

p1 <- ggplot(movies_augA, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = F,
              lty = "dashed", col = "black") +
  geom_smooth(method = "loess", formula = y ~ x, se = F, 
              col = "blue") +
  geom_text_repel(data = movies_augA %>% 
                    slice_max(abs(.resid), n = 3), 
                  aes(label = film)) +
  labs(title = "mod_A Residuals vs. Fitted",
       x = "Fitted Values from mod_A",
       y = "Residuals from mod_A")

p2 <- ggplot(movies_augA, aes(sample = .resid)) +
  geom_qq() + geom_qq_line(col = "red") + 
  labs(title = "mod_A Residuals",
       y = "")

p3 <- ggplot(movies_augA, aes(y = .resid, x = "")) +
  geom_violin(fill = "tomato") +
  geom_boxplot(width = 0.5) + 
  labs(y = "", x = "")

p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))

I don’t see a lot of curve in the residuals vs. fitted plot, but we definitely have a problem with the Normality assumption for the residuals. The plots show some substantial right skew. It might be wise to consider transforming our outcome with, for instance, a logarithm.

4 Consider Transformation?

4.1 Visualizing on the Log Scale

It looks like the relationship is pretty weak, but I am a bit concerned about the few films with very high numbers of ratings.

Might we try a transformation? Suppose we place the imdb_ratings on a logarithmic scale? R has a tool to help us do this for base 10 logs, so let’s try that.

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_log10() +
    geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
    geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

4.2 Identifying a new outlier

Now, maybe we have a different outlier to worry about. What is that smallest value?

ggplot(movies, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_log10() +
    geom_text_repel(aes(label = film), col = "purple",
        data = movies %>% slice_min(imdb_ratings)) +
    geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
    geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students")

4.3 Three Least Often Rated Films

What are the three films least often rated?

movies %>% select(film_id, film, imdb_ratings) %>% 
    slice_min(imdb_ratings, n = 3)
# A tibble: 3 x 3
  film_id film                                     imdb_ratings
    <dbl> <chr>                                           <dbl>
1      21 Farewell My Concubine: the Beijing Opera            6
2      33 House Party 2                                    5921
3      59 Still Walking                                   13154

5 Fitting mod_B

Note that since we’ve transformed the outcome (from imdb_ratings to its logarithm), the summaries here (like \(R^2\)) are no longer comparable to what we saw in mod_A.

For example, the Pearson correlation of age with log10(imdb_ratings) is different from the Pearson correlation of age with the raw imdb_ratings.

movies %$% cor(log10(imdb_ratings), age)
[1] -0.04998179
movies %$% cor(imdb_ratings, age)
[1] -0.04779641

5.1 What is the resulting model?

mod_B <- lm(log10(imdb_ratings) ~ age, data = movies)

summary(mod_B)

Call:
lm(formula = log10(imdb_ratings) ~ age, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6457 -0.2279  0.0806  0.4367  0.9938 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.441649   0.166693   32.65   <2e-16 ***
age         -0.002973   0.007425   -0.40     0.69    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8081 on 64 degrees of freedom
Multiple R-squared:  0.002498,  Adjusted R-squared:  -0.01309 
F-statistic: 0.1603 on 1 and 64 DF,  p-value: 0.6902

5.2 Coefficients and Summaries

tidy(mod_B, conf.int = TRUE, conf.level = 0.90) %>% kable()
| term        |   estimate | std.error |  statistic |   p.value |   conf.low | conf.high |
|:------------|-----------:|----------:|-----------:|----------:|-----------:|----------:|
| (Intercept) |  5.4416493 | 0.1666925 | 32.6448316 | 0.0000000 |  5.1634373 | 5.7198613 |
| age         | -0.0029725 | 0.0074247 | -0.4003547 | 0.6902285 | -0.0153645 | 0.0094195 |
glance(mod_B) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>% 
    kable()
| r.squared | adj.r.squared |     sigma |      AIC |      BIC | nobs |
|----------:|--------------:|----------:|---------:|---------:|-----:|
| 0.0024982 |    -0.0130878 | 0.8081374 | 163.1499 | 169.7189 |   66 |

What conclusions can you draw here?

5.3 mod_B: Predictions

movies_augB <- augment(mod_B, movies)

Again, let’s look at the predictions for the first few films in the data set. Note that we are now predicting the log10 of imdb_ratings, so the .fitted and .resid values below are on the log10 scale.

movies_augB %>% 
  mutate(log10_ratings = log10(imdb_ratings)) %>%
  select(film_id, film, year, age, 
         imdb_ratings, log10_ratings, .fitted, .resid) %>% 
  head(5) %>% kable()
| film_id | film                   | year | age | imdb_ratings | log10_ratings |  .fitted |     .resid |
|--------:|:-----------------------|-----:|----:|-------------:|--------------:|---------:|-----------:|
|       1 | 8 1/2                  | 1963 |  57 |       106555 |      5.027574 | 5.272215 | -0.2446413 |
|       2 | About Time             | 2013 |   7 |       290158 |      5.462635 | 5.420842 |  0.0417929 |
|       3 | Avatar                 | 2009 |  11 |      1101874 |      6.042132 | 5.408951 |  0.6331805 |
|       4 | Avengers: Endgame      | 2019 |   1 |       757530 |      5.879400 | 5.438677 |  0.4407231 |
|       5 | Avengers: Infinity War | 2018 |   2 |       798058 |      5.902035 | 5.435704 |  0.4663302 |
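
Since the .fitted values are on the log10 scale, one way to see these predictions back on the original ratings scale is to back-transform them, for example like this (a sketch, not part of the original output; pred_ratings is just an illustrative name):

movies_augB %>% 
  mutate(pred_ratings = 10^.fitted) %>%     # back-transform log10 fits to the ratings scale
  select(film_id, film, imdb_ratings, .fitted, pred_ratings) %>% 
  head(5) %>% kable()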

OK. Let’s look at the residual plots to see if our regression assumptions are more reasonable now.

5.4 mod_B Residual Plots

p1 <- ggplot(movies_augB, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = F,
              lty = "dashed", col = "black") +
  geom_smooth(method = "loess", formula = y ~ x, se = F, 
              col = "blue") +
  geom_text_repel(data = movies_augB %>% 
                    slice_max(abs(.resid), n = 3), 
                  aes(label = film)) +
  labs(title = "mod_B Residuals vs. Fitted",
       x = "Fitted Values from mod_B",
       y = "Residuals from mod_B")

p2 <- ggplot(movies_augB, aes(sample = .resid)) +
  geom_qq() + geom_qq_line(col = "red") + 
  labs(title = "mod_B Residuals",
       y = "")

p3 <- ggplot(movies_augB, aes(y = .resid, x = "")) +
  geom_violin(fill = "tomato") +
  geom_boxplot(width = 0.5) + 
  labs(y = "", x = "")

p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))

That low outlier certainly stands out. Perhaps we should look at the data excluding that point to see if we can plausibly fit a model.

6 Fitting mod_C

6.1 Dropping One Film

Suppose we decided to look at how well we could predict the logarithm of imdb_ratings if we dropped “Farewell My Concubine” from the list. Let’s create a new tibble where we filter this film out.

movies_minus_one <- movies %>% filter(imdb_ratings > 10)

6.2 Redrawing the Association

ggplot(movies_minus_one, aes(x = age, y = imdb_ratings)) + 
    geom_point() +
    scale_y_log10() +
    geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
    geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
    labs(title = "Movies Mentioned as Favorites by 2020 431 Students",
         subtitle = "Excluding one film with only 6 IMDB Ratings")

6.3 Fitting model mod_C

mod_C <- lm(log10(imdb_ratings) ~ age, data = movies_minus_one)

summary(mod_C)

Call:
lm(formula = log10(imdb_ratings) ~ age, data = movies_minus_one)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.60221 -0.22154  0.04478  0.31639  0.97105 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.601282   0.116619  48.031   <2e-16 ***
age         -0.007817   0.005158  -1.516    0.135    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5579 on 63 degrees of freedom
Multiple R-squared:  0.03517,   Adjusted R-squared:  0.01986 
F-statistic: 2.297 on 1 and 63 DF,  p-value: 0.1346

6.4 Coefficients and Summaries

tidy(mod_C, conf.int = TRUE, conf.level = 0.90) %>% kable()
| term        |   estimate | std.error | statistic |   p.value |   conf.low | conf.high |
|:------------|-----------:|----------:|----------:|----------:|-----------:|----------:|
| (Intercept) |  5.6012824 | 0.1166190 | 48.030623 | 0.0000000 |  5.4065984 | 5.7959664 |
| age         | -0.0078166 | 0.0051577 | -1.515519 | 0.1346426 | -0.0164268 | 0.0007937 |
glance(mod_C) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>% 
    kable()
| r.squared | adj.r.squared |     sigma |      AIC |      BIC | nobs |
|----------:|--------------:|----------:|---------:|---------:|-----:|
| 0.0351747 |     0.0198601 | 0.5578977 | 112.5652 | 119.0884 |   65 |

Are the \(R^2\) values we obtain for Model C comparable to those we developed for Model A? For Model B?
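
One way to explore that question (a sketch of my own, not part of the original analysis) is to put the three models on a common footing: restrict attention to the 65 films kept for Model C, and compare observed imdb_ratings with each model’s predictions on the original ratings scale, back-transforming mod_B and mod_C from log10. The names common, fit_A, fit_B, and fit_C below are just illustrative.

# squared correlations between observed ratings and each model's predictions,
# computed on the original ratings scale for the 65 films used in mod_C
common <- movies %>% filter(imdb_ratings > 10)

tibble(observed = common$imdb_ratings,
       fit_A = predict(mod_A, newdata = common),
       fit_B = 10^predict(mod_B, newdata = common),
       fit_C = 10^predict(mod_C, newdata = common)) %>%
  summarize(across(starts_with("fit_"), ~ cor(observed, .x)^2))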

6.5 mod_C: Predictions

movies_augC <- augment(mod_C, movies_minus_one)

Again, let’s look at the predictions for the first few films in the data set. Note that we are again predicting the log10 of imdb_ratings, so the .fitted and .resid values below are on the log10 scale.

movies_augC %>% 
  mutate(log10_ratings = log10(imdb_ratings)) %>%
  select(film_id, film, year, age, 
         imdb_ratings, log10_ratings, .fitted, .resid) %>% 
  head(5) %>% kable()
| film_id | film                   | year | age | imdb_ratings | log10_ratings |  .fitted |     .resid |
|--------:|:-----------------------|-----:|----:|-------------:|--------------:|---------:|-----------:|
|       1 | 8 1/2                  | 1963 |  57 |       106555 |      5.027574 | 5.155739 | -0.1281646 |
|       2 | About Time             | 2013 |   7 |       290158 |      5.462635 | 5.546566 | -0.0839319 |
|       3 | Avatar                 | 2009 |  11 |      1101874 |      6.042132 | 5.515300 |  0.5268317 |
|       4 | Avengers: Endgame      | 2019 |   1 |       757530 |      5.879400 | 5.593466 |  0.2859340 |
|       5 | Avengers: Infinity War | 2018 |   2 |       798058 |      5.902035 | 5.585649 |  0.3163852 |

OK. Let’s look at the residual plots to see if our regression assumptions are more reasonable now.

6.6 mod_C Residual Plots

p1 <- ggplot(movies_augC, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = F,
              lty = "dashed", col = "black") +
  geom_smooth(method = "loess", formula = y ~ x, se = F, 
              col = "blue") +
  geom_text_repel(data = movies_augC %>% 
                    slice_max(abs(.resid), n = 3), 
                  aes(label = film)) +
  labs(title = "mod_C Residuals vs. Fitted",
       x = "Fitted Values from mod_C",
       y = "Residuals from mod_C")

p2 <- ggplot(movies_augC, aes(sample = .resid)) +
  geom_qq() + geom_qq_line(col = "red") + 
  labs(title = "mod_C Residuals",
       y = "")

p3 <- ggplot(movies_augC, aes(y = .resid, x = "")) +
  geom_violin(fill = "tomato") +
  geom_boxplot(width = 0.5) + 
  labs(y = "", x = "")

p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))

We still have some low outliers, but the residuals are closer to a Normal distribution, and I don’t see a strong curve in the plot against fitted values.

The main problem is that the model remains very weak. age alone isn’t a strong predictor of imdb_ratings.

7 Closing Materials

To view the HTML report generated by this R Markdown file, visit https://rpubs.com/TELOVE/movies-A-431-2020

7.1 Session Info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggridges_0.5.2    mosaicData_0.20.1 ggformula_0.9.4   ggstance_0.3.4   
 [5] Matrix_1.2-18     lattice_0.20-41   forcats_0.5.0     stringr_1.4.0    
 [9] dplyr_1.0.2       purrr_0.3.4       readr_1.3.1       tidyr_1.1.2      
[13] tibble_3.0.3      tidyverse_1.3.0   patchwork_1.0.1   magrittr_1.5     
[17] knitr_1.29        janitor_2.0.1     ggrepel_0.8.2     ggplot2_3.3.2    
[21] broom_0.7.0      

loaded via a namespace (and not attached):
 [1] nlme_3.1-148      fs_1.5.0          lubridate_1.7.9   httr_1.4.2       
 [5] tools_4.0.2       backports_1.1.10  utf8_1.1.4        R6_2.4.1         
 [9] DBI_1.1.0         mgcv_1.8-31       colorspace_1.4-1  withr_2.2.0      
[13] tidyselect_1.1.0  gridExtra_2.3     leaflet_2.0.3     curl_4.3         
[17] compiler_4.0.2    cli_2.0.2         rvest_0.3.6       xml2_1.3.2       
[21] ggdendro_0.1.22   labeling_0.3      mosaicCore_0.8.0  scales_1.1.1     
[25] digest_0.6.25     rmarkdown_2.3.3   pkgconfig_2.0.3   htmltools_0.5.0  
[29] dbplyr_1.4.4      highr_0.8         htmlwidgets_1.5.1 rlang_0.4.7      
[33] readxl_1.3.1      rstudioapi_0.11   farver_2.0.3      generics_0.0.2   
[37] jsonlite_1.7.1    crosstalk_1.1.0.1 Rcpp_1.0.5        munsell_0.5.0    
[41] fansi_0.4.1       lifecycle_0.2.0   stringi_1.5.3     yaml_2.2.1       
[45] snakecase_0.11.0  MASS_7.3-53       plyr_1.8.6        grid_4.0.2       
[49] blob_1.2.1        crayon_1.3.4      haven_2.3.1       splines_4.0.2    
[53] hms_0.5.3         pillar_1.4.6      reprex_0.3.0      glue_1.4.2       
[57] evaluate_0.14     modelr_0.1.8      vctrs_0.3.4       tweenr_1.0.1     
[61] cellranger_1.1.0  gtable_0.3.0      polyclip_1.10-0   assertthat_0.2.1 
[65] xfun_0.16         ggforce_0.3.2     mosaic_1.8.2      ellipsis_0.3.1