library(broom)
library(ggrepel)
library(janitor)
library(knitr)
library(magrittr)
library(patchwork)
library(tidyverse)
theme_set(theme_bw())
file_raw <- "https://raw.githubusercontent.com/THOMASELOVE/431-2020/master/classes/movies/data/movies_2020-09-10.csv"
movies <- read_csv(file_raw)
Parsed with column specification:
cols(
film_id = col_double(),
film = col_character(),
mentions = col_double(),
year = col_double(),
imdb_categories = col_character(),
imdb_ratings = col_double(),
imdb_stars = col_double(),
length = col_double()
)
movies
# A tibble: 66 x 8
film_id film mentions year imdb_categories imdb_ratings imdb_stars length
<dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 8 1/2 1 1963 Drama 106555 8 138
2 2 About ~ 2 2013 Comedy, Drama,~ 290158 7.8 123
3 3 Avatar 1 2009 Action, Advent~ 1101874 7.8 162
4 4 Avenge~ 1 2019 Action, Advent~ 757530 8.4 181
5 5 Avenge~ 1 2018 Action, Advent~ 798058 8.4 149
6 6 Back t~ 1 1985 Adventure, Com~ 1028636 8.5 116
7 7 Beaches 1 1988 Comedy, Drama,~ 22854 7 123
8 8 Beetle~ 1 1988 Comedy, Fantasy 252221 7.5 92
9 9 Being ~ 1 1999 Comedy, Drama,~ 305588 7.7 113
10 10 The Bi~ 1 1998 Comedy, Crime,~ 714755 8.1 117
# ... with 56 more rows
Let’s take a quick look at the variables we’ll actually use in our work:
movies %>% select(film, year, imdb_ratings) %$%
summary(.)
film year imdb_ratings
Length:66 Min. :1955 Min. : 6
Class :character 1st Qu.:1996 1st Qu.: 129284
Mode :character Median :2004 Median : 289463
Mean :2002 Mean : 532777
3rd Qu.:2012 3rd Qu.: 708161
Max. :2020 Max. :2281331
Mostly, I’m checking the minimum and maximum for quantities. These seem plausible, although that minimum in imdb_ratings of 6 is impressive.
movies %>% select(film, year, imdb_ratings) %>%
slice_min(imdb_ratings) %>% kable()
film | year | imdb_ratings |
---|---|---|
Farewell My Concubine: the Beijing Opera | 2014 | 6 |
OK. It’s plausible that film could have very few ratings.
age variable
The year information tells us about the age of a film. We could calculate a film’s age as 2020 - year, as follows.
movies <- movies %>%
mutate(age = 2020 - year)
imdb_ratings
imdb_ratings will be our outcome in our regression model, so understanding whether or not it is well described by a Normal model is somewhat helpful.
p1 <- ggplot(movies, aes(sample = imdb_ratings)) +
geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot of imdb_ratings")
p2 <- ggplot(movies, aes(x = imdb_ratings)) +
geom_histogram(aes(y = after_stat(density)),
bins = 10, fill = "dodgerblue", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(movies$imdb_ratings),
sd = sd(movies$imdb_ratings)),
col = "navy", lwd = 1.5) +
labs(title = "Histogram with Normal Density")
p3 <- ggplot(movies, aes(x = imdb_ratings, y = "")) +
geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") +
labs(title = "Boxplot of imdb_ratings", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
It would be better to label the axes of these plots on a more legible scale. One approach to this would be to use the following strategy:
p1 <- p1 +
scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1))
p2 <- p2 +
scale_x_continuous(labels = scales::label_number_si(accuracy = 0.1))
p3 <- p3 +
scale_x_continuous(labels = scales::label_number_si(accuracy = 0.1))
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
If we decide that imdb_ratings doesn’t follow a Normal distribution, that’s not going to change our approach to linear regression modeling. As always, we’ll need to look for Normality in the residuals from the model, not the outcome.
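A small simulation can make this point concrete. The sketch below uses made-up data rather than the movies file, and the names x, y, fit and the helper skew are illustrative assumptions. A right-skewed predictor produces a right-skewed outcome, yet the residuals from the fitted line stay close to symmetric.

```r
# Simulated data (not the movies data): a skewed outcome can still
# have well-behaved residuals.
set.seed(431)
n <- 5000
x <- rexp(n, rate = 1)              # right-skewed predictor
y <- 1 + 5 * x + rnorm(n, sd = 1)   # linear signal plus Normal noise

fit <- lm(y ~ x)

# Simple moment-based skewness (hypothetical helper, for illustration only)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_outcome   <- skew(y)            # clearly positive: y is right-skewed
skew_residuals <- skew(resid(fit))   # near zero: residuals are roughly symmetric
c(outcome = skew_outcome, residuals = skew_residuals)
```

This is why the diagnostic plots later in the report are drawn from the model residuals rather than from the raw outcome.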
Here are some numerical summaries, as well.
mosaic::favstats(~ imdb_ratings, data = movies) %>% kable(digits = 2)
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
6 | 129284.2 | 289463 | 708160.5 | 2281331 | 532776.9 | 568185.5 | 66 | 0 |
age
We’ll treat age as a predictor in our regression model. Whether age is Normally distributed or not will be of no consequence in our modeling. We will often be interested in understanding the center, spread, outliers and shape of a predictor’s distribution regardless, just so that we have a better sense of the data, and in particular, whether the interesting ages are well represented.
p1 <- ggplot(movies, aes(sample = age)) +
geom_qq(col = "dodgerblue") + geom_qq_line(col = "navy") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q plot of age")
p2 <- ggplot(movies, aes(x = age)) +
geom_histogram(aes(y = after_stat(density)),
bins = 10, fill = "dodgerblue", col = "white") +
stat_function(fun = dnorm,
args = list(mean = mean(movies$age),
sd = sd(movies$age)),
col = "navy", lwd = 1.5) +
labs(title = "Histogram with Normal Density")
p3 <- ggplot(movies, aes(x = age, y = "")) +
geom_boxplot(fill = "dodgerblue", outlier.color = "dodgerblue") +
labs(title = "Boxplot of age", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
mosaic::favstats(~ age, data = movies) %>% kable(digits = 2)
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
0 | 8.25 | 15.5 | 23.75 | 65 | 18.02 | 13.5 | 66 | 0 |
Now, we want to see the association between two quantitative variables: the age of the film, which we’ll treat as the predictor, and the number of IMDB ratings (imdb_ratings), which we’ll treat as our outcome.
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
This initial picture suggests that age alone isn’t a strong predictor of imdb_ratings. All of the work we do in what follows isn’t going to change that.
Can we change the Y axis tickmark labels to something more readable?
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
Let’s add some text to indicate r, the Pearson correlation.
movies %$% cor(age, imdb_ratings)
[1] -0.04779641
We could just type in this value…
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
annotate(geom = "label", x = 50, y = 1250000,
label = "Correlation = -0.048") +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
A better approach would be to pull it from the data:
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
annotate(geom = "label", x = 50, y = 1250000,
label = paste0("Correlation r = ",
movies %$% cor(age, imdb_ratings))) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
Whoops - probably want to round that off…
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
annotate(geom = "label", x = 50, y = 1250000,
label = paste0("Correlation r = ",
round_half_up(movies %$% cor(age, imdb_ratings),3))) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
Another option is to use signif_half_up to specify the number of significant figures you want to see in your result.
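To see how rounding to decimal places differs from rounding to significant figures, here is a sketch that uses base R’s round and signif as stand-ins. That substitution is an assumption made for illustration: the janitor functions differ from the base versions in how they treat exact halves, and no exact half arises for this value.

```r
r <- -0.04779641   # the correlation reported above

round(r, 3)        # three decimal places:       -0.048
signif(r, 3)       # three significant figures:  -0.0478
```

For a value this close to zero, asking for significant figures keeps one more informative digit than asking for the same number of decimal places.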
Now, can we label the films?
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
geom_text_repel(aes(label = film)) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
Hmmm, maybe we don’t want to label them all. Let’s label the films that are at the top of the plot or on the far right. We’ll select those films that either have more than 1.7 million ratings or that are more than 40 years old. Also, we’ll use geom_label_repel rather than geom_text_repel to see what that does.
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
geom_label_repel(aes(label = film),
data = movies %>% filter(imdb_ratings > 1700000 | age > 40)) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
I would suggest looking at the R Graphics Cookbook Chapter on Scatter Plots for recipes that might improve this work.
Let’s add a couple of smooths to the plot.
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_continuous(labels = scales::label_number_si(accuracy = 0.1)) +
geom_text_repel(aes(label = film),
data = movies %>% filter(imdb_ratings > 1700000 | age > 40)) +
geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
mod_A
Let’s look at that linear model.
mod_A <- lm(imdb_ratings ~ age, data = movies)
summary(mod_A)
Call:
lm(formula = imdb_ratings ~ age, data = movies)
Residuals:
Min 1Q Median 3Q Max
-556940 -364284 -236804 176939 1764616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 569016 117975 4.823 9.1e-06 ***
age -2012 5255 -0.383 0.703
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 572000 on 64 degrees of freedom
Multiple R-squared: 0.002284, Adjusted R-squared: -0.0133
F-statistic: 0.1465 on 1 and 64 DF, p-value: 0.7031
tidy(mod_A, conf.int = TRUE, conf.level = 0.90) %>% kable()
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 569015.901 | 117975.307 | 4.8231780 | 0.0000091 | 372113.58 | 765918.225 |
age | -2011.584 | 5254.801 | -0.3828088 | 0.7031300 | -10781.92 | 6758.747 |
What does glance provide?
Here are the glance summaries we’ll use in the early part of the course.
glance(mod_A) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>%
kable()
r.squared | adj.r.squared | sigma | AIC | BIC | nobs |
---|---|---|---|---|---|
0.0022845 | -0.0133048 | 571952.8 | 1941.168 | 1947.737 | 66 |
Here are the other summaries that glance provides for a linear model fit using lm.
glance(mod_A) %>% select(statistic, p.value, df, logLik, deviance, df.residual)
# A tibble: 1 x 6
statistic p.value df logLik deviance df.residual
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.147 0.703 1 -968. 2.09e13 64
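In a simple regression with a single predictor, r.squared is the square of the Pearson correlation between predictor and outcome. Here, (-0.04779641)^2 is about 0.0022845, which matches the r.squared value that glance reports for mod_A. The sketch below checks the identity on a small made-up data set (the x and y values are toy numbers, not taken from the movies file).

```r
# Toy data for illustration only
x <- c(1, 3, 4, 7, 9, 12, 15, 20)
y <- c(10, 8, 12, 9, 14, 11, 13, 12)

fit <- lm(y ~ x)

r  <- cor(x, y)                 # Pearson correlation
r2 <- summary(fit)$r.squared    # R-squared from the fitted model

all.equal(r^2, r2)              # TRUE: the two quantities agree

# The same arithmetic with the correlation reported for the movies data
r_movies_sq <- (-0.04779641)^2
r_movies_sq
```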
mod_A: Predictions
movies_augA <- augment(mod_A, movies)
Let’s look at the predictions for the first few films in the data set.
movies_augA %>%
select(film_id, film, year, age, imdb_ratings, .fitted, .resid) %>%
head(5) %>% kable()
film_id | film | year | age | imdb_ratings | .fitted | .resid |
---|---|---|---|---|---|---|
1 | 8 1/2 | 1963 | 57 | 106555 | 454355.6 | -347800.6 |
2 | About Time | 2013 | 7 | 290158 | 554934.8 | -264776.8 |
3 | Avatar | 2009 | 11 | 1101874 | 546888.5 | 554985.5 |
4 | Avengers: Endgame | 2019 | 1 | 757530 | 567004.3 | 190525.7 |
5 | Avengers: Infinity War | 2018 | 2 | 798058 | 564992.7 | 233065.3 |
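The .fitted and .resid columns can be reproduced by hand from the coefficient estimates. As a sketch, the code below copies the intercept and slope from the tidy(mod_A) table and recomputes the first row (8 1/2). The variable names are chosen here for illustration.

```r
# Coefficients copied from the tidy(mod_A) table above
b0 <- 569015.901   # intercept
b1 <- -2011.584    # slope for age

age      <- 2020 - 1963   # 8 1/2 was released in 1963, so age is 57
observed <- 106555        # its imdb_ratings

fitted_val <- b0 + b1 * age          # predicted number of ratings
residual   <- observed - fitted_val  # observed minus predicted

c(age = age, fitted = fitted_val, residual = residual)
```

The results agree with the first row of the table after rounding: a fitted value near 454355.6 and a residual near -347800.6.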
OK. Let’s look at the residual plots to see if our regression assumptions are reasonable now.
mod_A Residual Plots
p1 <- ggplot(movies_augA, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
lty = "dashed", col = "black") +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE,
col = "blue") +
geom_text_repel(data = movies_augA %>%
slice_max(abs(.resid), n = 3),
aes(label = film)) +
labs(title = "mod_A Residuals vs. Fitted",
x = "Fitted Values from mod_A",
y = "Residuals from mod_A")
p2 <- ggplot(movies_augA, aes(sample = .resid)) +
geom_qq() + geom_qq_line(col = "red") +
labs(title = "mod_A Residuals",
y = "")
p3 <- ggplot(movies_augA, aes(y = .resid, x = "")) +
geom_violin(fill = "tomato") +
geom_boxplot(width = 0.5) +
labs(y = "", x = "")
p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))
I don’t see a lot of curve in the residuals vs. fitted plot, but we definitely have a problem with the Normality assumption for the residuals. The plots show some substantial right skew. It might be wise to consider transforming our outcome with, for instance, a logarithm.
It looks like the relationship is pretty weak, but I am a bit concerned about the few films with very high numbers of ratings.
Might we try a transformation? Suppose we place imdb_ratings on a logarithmic scale. R has a tool to help us do this for base 10 logs, so let’s try that.
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_log10() +
geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
Now, maybe, we have a different outlier to worry about? What is that smallest value?
ggplot(movies, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_log10() +
geom_text_repel(aes(label = film), col = "purple",
data = movies %>% slice_min(imdb_ratings)) +
geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students")
What are the three films least often rated?
movies %>% select(film_id, film, imdb_ratings) %>%
slice_min(imdb_ratings, n = 3)
# A tibble: 3 x 3
film_id film imdb_ratings
<dbl> <chr> <dbl>
1 21 Farewell My Concubine: the Beijing Opera 6
2 33 House Party 2 5921
3 59 Still Walking 13154
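A quick calculation shows why the film with 6 ratings stands out so sharply once we move to the log scale. The sketch below takes the three counts from the listing above and converts them to base 10 logarithms. The names ratings, log_ratings and gap are illustrative.

```r
# imdb_ratings for the three least-rated films listed above
ratings <- c(6, 5921, 13154)

log_ratings <- log10(ratings)
round(log_ratings, 2)

# Gap between the lowest and second-lowest films on the log10 scale
gap <- log_ratings[2] - log_ratings[1]
gap
```

A gap of about 3 units on the log10 scale corresponds to a ratio of roughly 1000 between the two counts, so the lowest film sits far below every other point in the plot.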
mod_B
Note that since we’ve transformed the outcome (from imdb_ratings to its logarithm) the summaries here (like \(R^2\)) are no longer comparable to what we saw in mod_A.
For example, the Pearson correlation of age with log10(imdb_ratings) is different from the Pearson correlation of age with the raw imdb_ratings.
movies %$% cor(log10(imdb_ratings), age)
[1] -0.04998179
movies %$% cor(imdb_ratings, age)
[1] -0.04779641
mod_B <- lm(log10(imdb_ratings) ~ age, data = movies)
summary(mod_B)
Call:
lm(formula = log10(imdb_ratings) ~ age, data = movies)
Residuals:
Min 1Q Median 3Q Max
-4.6457 -0.2279 0.0806 0.4367 0.9938
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.441649 0.166693 32.65 <2e-16 ***
age -0.002973 0.007425 -0.40 0.69
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8081 on 64 degrees of freedom
Multiple R-squared: 0.002498, Adjusted R-squared: -0.01309
F-statistic: 0.1603 on 1 and 64 DF, p-value: 0.6902
tidy(mod_B, conf.int = TRUE, conf.level = 0.90) %>% kable()
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 5.4416493 | 0.1666925 | 32.6448316 | 0.0000000 | 5.1634373 | 5.7198613 |
age | -0.0029725 | 0.0074247 | -0.4003547 | 0.6902285 | -0.0153645 | 0.0094195 |
glance(mod_B) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>%
kable()
r.squared | adj.r.squared | sigma | AIC | BIC | nobs |
---|---|---|---|---|---|
0.0024982 | -0.0130878 | 0.8081374 | 163.1499 | 169.7189 | 66 |
What conclusions can you draw here?
mod_B: Predictions
movies_augB <- augment(mod_B, movies)
Again, let’s look at the predictions for the first few films in the data set. Note that we are now predicting the log10 of imdb_ratings, so we need to think about that.
movies_augB %>%
mutate(log10_ratings = log10(imdb_ratings)) %>%
select(film_id, film, year, age,
imdb_ratings, log10_ratings, .fitted, .resid) %>%
head(5) %>% kable()
film_id | film | year | age | imdb_ratings | log10_ratings | .fitted | .resid |
---|---|---|---|---|---|---|---|
1 | 8 1/2 | 1963 | 57 | 106555 | 5.027574 | 5.272215 | -0.2446413 |
2 | About Time | 2013 | 7 | 290158 | 5.462635 | 5.420842 | 0.0417929 |
3 | Avatar | 2009 | 11 | 1101874 | 6.042132 | 5.408951 | 0.6331805 |
4 | Avengers: Endgame | 2019 | 1 | 757530 | 5.879400 | 5.438677 | 0.4407231 |
5 | Avengers: Infinity War | 2018 | 2 | 798058 | 5.902035 | 5.435704 | 0.4663302 |
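Because mod_B predicts log10(imdb_ratings), a fitted value has to be raised to the power of 10 to get back to a count of ratings. As a sketch, the code below copies the coefficients from the tidy(mod_B) table and back-transforms the prediction for the first film (8 1/2, age 57). The variable names are chosen here for illustration.

```r
# Coefficients copied from the tidy(mod_B) table above
b0 <- 5.4416493
b1 <- -0.0029725

age <- 57                         # 8 1/2

fitted_log10 <- b0 + b1 * age     # prediction on the log10 scale
fitted_count <- 10^fitted_log10   # back-transformed to a number of ratings

c(log10 = fitted_log10, count = fitted_count)
```

The log-scale prediction matches the .fitted value in the table (about 5.272), and the back-transformed value can be compared directly with the observed imdb_ratings of 106555.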
OK. Let’s look at the residual plots to see if our regression assumptions are more reasonable now.
mod_B Residual Plots
p1 <- ggplot(movies_augB, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
lty = "dashed", col = "black") +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE,
col = "blue") +
geom_text_repel(data = movies_augB %>%
slice_max(abs(.resid), n = 3),
aes(label = film)) +
labs(title = "mod_B Residuals vs. Fitted",
x = "Fitted Values from mod_B",
y = "Residuals from mod_B")
p2 <- ggplot(movies_augB, aes(sample = .resid)) +
geom_qq() + geom_qq_line(col = "red") +
labs(title = "mod_B Residuals",
y = "")
p3 <- ggplot(movies_augB, aes(y = .resid, x = "")) +
geom_violin(fill = "tomato") +
geom_boxplot(width = 0.5) +
labs(y = "", x = "")
p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))
That low outlier certainly stands out. Perhaps we should look at the data excluding that point to see if we can plausibly fit a model.
mod_C
Suppose we decided to look at how well we could predict the logarithm of the imdb_ratings if we dropped “Farewell My Concubine” from the list. Let’s create a new tibble, where we filter this film out.
movies_minus_one <- movies %>% filter(imdb_ratings > 10)
ggplot(movies_minus_one, aes(x = age, y = imdb_ratings)) +
geom_point() +
scale_y_log10() +
geom_smooth(method = "loess", se = FALSE, col = "blue", formula = y ~ x) +
geom_smooth(method = "lm", se = TRUE, col = "red", formula = y ~ x) +
labs(title = "Movies Mentioned as Favorites by 2020 431 Students",
subtitle = "Excluding one film with only 6 IMDB Ratings")
mod_C
mod_C <- lm(log10(imdb_ratings) ~ age, data = movies_minus_one)
summary(mod_C)
Call:
lm(formula = log10(imdb_ratings) ~ age, data = movies_minus_one)
Residuals:
Min 1Q Median 3Q Max
-1.60221 -0.22154 0.04478 0.31639 0.97105
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.601282 0.116619 48.031 <2e-16 ***
age -0.007817 0.005158 -1.516 0.135
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5579 on 63 degrees of freedom
Multiple R-squared: 0.03517, Adjusted R-squared: 0.01986
F-statistic: 2.297 on 1 and 63 DF, p-value: 0.1346
tidy(mod_C, conf.int = TRUE, conf.level = 0.90) %>% kable()
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 5.6012824 | 0.1166190 | 48.030623 | 0.0000000 | 5.4065984 | 5.7959664 |
age | -0.0078166 | 0.0051577 | -1.515519 | 0.1346426 | -0.0164268 | 0.0007937 |
glance(mod_C) %>% select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>%
kable()
r.squared | adj.r.squared | sigma | AIC | BIC | nobs |
---|---|---|---|---|---|
0.0351747 | 0.0198601 | 0.5578977 | 112.5652 | 119.0884 | 65 |
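One way to read the age coefficient in a log10 model is as a multiplier: each additional year of age multiplies the predicted number of ratings by 10 raised to the slope. The sketch below applies that to the estimate and 90% confidence limits copied from the tidy(mod_C) table. The variable names are illustrative.

```r
# Slope and 90% confidence limits copied from the tidy(mod_C) table above
slope <- -0.0078166
ci    <- c(low = -0.0164268, high = 0.0007937)

multiplier    <- 10^slope   # multiplicative change in predicted ratings per year
ci_multiplier <- 10^ci      # same transformation applied to the interval

pct_change <- 100 * (multiplier - 1)   # percent change per additional year of age

round(c(multiplier = multiplier, pct_change = pct_change), 3)
round(ci_multiplier, 3)
```

The point estimate corresponds to a little under 2% fewer ratings per additional year of age, but the interval for the multiplier includes 1, which is consistent with the weak evidence for any age effect in the model summary.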
Are the \(R^2\) values we obtain for Model C comparable to those we developed for Model A? For Model B?
mod_C: Predictions
movies_augC <- augment(mod_C, movies_minus_one)
Again, let’s look at the predictions for the first few films in the data set. Note that we are now predicting the log10 of imdb_ratings, so we need to think about that.
movies_augC %>%
mutate(log10_ratings = log10(imdb_ratings)) %>%
select(film_id, film, year, age,
imdb_ratings, log10_ratings, .fitted, .resid) %>%
head(5) %>% kable()
film_id | film | year | age | imdb_ratings | log10_ratings | .fitted | .resid |
---|---|---|---|---|---|---|---|
1 | 8 1/2 | 1963 | 57 | 106555 | 5.027574 | 5.155739 | -0.1281646 |
2 | About Time | 2013 | 7 | 290158 | 5.462635 | 5.546566 | -0.0839319 |
3 | Avatar | 2009 | 11 | 1101874 | 6.042132 | 5.515300 | 0.5268317 |
4 | Avengers: Endgame | 2019 | 1 | 757530 | 5.879400 | 5.593466 | 0.2859340 |
5 | Avengers: Infinity War | 2018 | 2 | 798058 | 5.902035 | 5.585649 | 0.3163852 |
OK. Let’s look at the residual plots to see if our regression assumptions are more reasonable now.
mod_C Residual Plots
p1 <- ggplot(movies_augC, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
lty = "dashed", col = "black") +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE,
col = "blue") +
geom_text_repel(data = movies_augC %>%
slice_max(abs(.resid), n = 3),
aes(label = film)) +
labs(title = "mod_C Residuals vs. Fitted",
x = "Fitted Values from mod_C",
y = "Residuals from mod_C")
p2 <- ggplot(movies_augC, aes(sample = .resid)) +
geom_qq() + geom_qq_line(col = "red") +
labs(title = "mod_C Residuals",
y = "")
p3 <- ggplot(movies_augC, aes(y = .resid, x = "")) +
geom_violin(fill = "tomato") +
geom_boxplot(width = 0.5) +
labs(y = "", x = "")
p1 + p2 + p3 + plot_layout(widths = c(5, 4, 1))
We still have some low outliers, but the residuals are closer to a Normal distribution, and I don’t see a strong curve in the plot against fitted values.
The main problem is that the model remains very weak. age alone isn’t a strong predictor of imdb_ratings.
To view the HTML report generated by this R Markdown file, visit https://rpubs.com/TELOVE/movies-A-431-2020
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggridges_0.5.2 mosaicData_0.20.1 ggformula_0.9.4 ggstance_0.3.4
[5] Matrix_1.2-18 lattice_0.20-41 forcats_0.5.0 stringr_1.4.0
[9] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.2
[13] tibble_3.0.3 tidyverse_1.3.0 patchwork_1.0.1 magrittr_1.5
[17] knitr_1.29 janitor_2.0.1 ggrepel_0.8.2 ggplot2_3.3.2
[21] broom_0.7.0
loaded via a namespace (and not attached):
[1] nlme_3.1-148 fs_1.5.0 lubridate_1.7.9 httr_1.4.2
[5] tools_4.0.2 backports_1.1.10 utf8_1.1.4 R6_2.4.1
[9] DBI_1.1.0 mgcv_1.8-31 colorspace_1.4-1 withr_2.2.0
[13] tidyselect_1.1.0 gridExtra_2.3 leaflet_2.0.3 curl_4.3
[17] compiler_4.0.2 cli_2.0.2 rvest_0.3.6 xml2_1.3.2
[21] ggdendro_0.1.22 labeling_0.3 mosaicCore_0.8.0 scales_1.1.1
[25] digest_0.6.25 rmarkdown_2.3.3 pkgconfig_2.0.3 htmltools_0.5.0
[29] dbplyr_1.4.4 highr_0.8 htmlwidgets_1.5.1 rlang_0.4.7
[33] readxl_1.3.1 rstudioapi_0.11 farver_2.0.3 generics_0.0.2
[37] jsonlite_1.7.1 crosstalk_1.1.0.1 Rcpp_1.0.5 munsell_0.5.0
[41] fansi_0.4.1 lifecycle_0.2.0 stringi_1.5.3 yaml_2.2.1
[45] snakecase_0.11.0 MASS_7.3-53 plyr_1.8.6 grid_4.0.2
[49] blob_1.2.1 crayon_1.3.4 haven_2.3.1 splines_4.0.2
[53] hms_0.5.3 pillar_1.4.6 reprex_0.3.0 glue_1.4.2
[57] evaluate_0.14 modelr_0.1.8 vctrs_0.3.4 tweenr_1.0.1
[61] cellranger_1.1.0 gtable_0.3.0 polyclip_1.10-0 assertthat_0.2.1
[65] xfun_0.16 ggforce_0.3.2 mosaic_1.8.2 ellipsis_0.3.1