Exploratory data analysis of RottenTomatoes data about the actor Jake Gyllenhaal. The code used to collect the data analyzed here, along with instructions on how to use it, can be found in this report’s repository.
import_data("jake_gyllenhaal")
filmes <- read_imported_data()
filmes %>%
glimpse()
## Observations: 20
## Variables: 5
## $ avaliacao <int> 92, 67, 72, 52, 73, 59, 82, 85, 92, 49, 35, 64, 47,...
## $ filme <chr> "Stronger", "Life", "Nocturnal Animals", "Demolitio...
## $ papel <chr> "Jeff Bauman", "David Jordan", "Tony HastingsEdward...
## $ bilheteria <dbl> 4.2, 30.2, 10.7, 1.7, 46.6, 42.4, 61.0, 39.1, 54.7,...
## $ ano <int> 2017, 2017, 2016, 2016, 2015, 2015, 2013, 2012, 201...
p <- filmes %>%
  ggplot(aes(x = ano,
             y = bilheteria,
             text = paste("Movie:", filme,
                          "\nBox Office:",
                          bilheteria, "m",
                          "\nYear:", ano))) +
  geom_point(size = 4, color = paleta[1]) +
  labs(y = "Box Office (MM)", x = "Year of release")
ggplotly(p, tooltip = "text") %>%
  layout(autosize = FALSE)
Among the movies Jake acted in, one stands apart from the others in terms of revenue: “The Day After Tomorrow”, released in 2004.
There is also a noticeable downward trend in the box office of his movies after 2013.
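As a rough check of that trend, the sketch below fits a straight line to the releases from 2013 onwards. It uses only the first eight values printed by glimpse() above, so it illustrates the idea on a partial sample and is not a result for the full table. The names recentes, ajuste and inclinacao are introduced here for the example.

```r
# First eight (ano, bilheteria) pairs visible in the glimpse() output;
# the remaining rows are truncated there and are not reproduced here.
ano        <- c(2017, 2017, 2016, 2016, 2015, 2015, 2013, 2012)
bilheteria <- c(4.2, 30.2, 10.7, 1.7, 46.6, 42.4, 61.0, 39.1)

# Keep the releases from 2013 onwards and fit box office against year.
recentes <- data.frame(ano, bilheteria)[ano >= 2013, ]
ajuste   <- lm(bilheteria ~ ano, data = recentes)

# A negative slope means box office falls as the release year increases.
inclinacao <- coef(ajuste)[["ano"]]
inclinacao
```

With so few points the slope is only suggestive; it supports what the scatter plot shows rather than establishing it.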
filmes %>%
  ggplot(aes(x = bilheteria)) +
  geom_histogram(aes(y = (..count..) / sum(..count..)), binwidth = 10, boundary = 0,
                 fill = "grey", color = "black") +
  geom_rug(size = .5) +
  scale_x_continuous(breaks = seq(0, 200, 20)) +
  labs(y = "Relative Frequency", x = "Box Office (MM)")
There is a clear gap between “The Day After Tomorrow” and the rest of the movies.
No values fall outside the expected domain (there are no negative values, for example).
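The domain check can also be done numerically instead of by eye. This is a minimal sketch on the nine box office values visible in the glimpse() output, not on the full table; fora_do_dominio is a name made up for the example.

```r
# Box office values (in millions) visible in the glimpse() output above.
bilheteria <- c(4.2, 30.2, 10.7, 1.7, 46.6, 42.4, 61.0, 39.1, 54.7)

# Revenue should never be missing or negative, so count the rows that are.
fora_do_dominio <- sum(is.na(bilheteria) | bilheteria < 0)
fora_do_dominio

# The observed range gives a quick view of the extremes.
range(bilheteria)
```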
p <- filmes %>%
  ggplot(aes(x = "",
             y = bilheteria,
             label = filme,
             text = paste("Movie:", filme,
                          "\nBox Office:",
                          bilheteria, "m"))) +
  geom_jitter(width = .05, alpha = .3, size = 3) +
  labs(x = "", y = "Box Office (MM)")
ggplotly(p, tooltip = "text") %>%
  layout(autosize = FALSE)
Separating the movies into those whose box office is below 50 million and those above it seems a reasonable approach.
“The Day After Tomorrow” seems to form a group of its own, which would give us 3 groups.
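One way to express such a split in code is with cut(). The sketch below uses the nine box office values visible in the glimpse() output. The 50 million boundary comes from the text, while the 150 million boundary for the third band is an assumption made only for illustration, since the report does not state where the isolated group begins.

```r
# Box office values (in millions) visible in the glimpse() output above.
bilheteria <- c(4.2, 30.2, 10.7, 1.7, 46.6, 42.4, 61.0, 39.1, 54.7)

# 50 is the threshold suggested in the text; 150 is a hypothetical
# lower bound for the isolated high-revenue group.
faixa <- cut(bilheteria,
             breaks = c(0, 50, 150, Inf),
             labels = c("below 50", "50 to 150", "above 150"),
             right = FALSE)

# Number of movies that fall into each band.
table(faixa)
```

In this partial sample the top band is empty, because the high-revenue movie is not among the rows that glimpse() prints.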
p <- filmes %>%
  ggplot(aes(x = ano,
             y = avaliacao,
             text = paste("Movie:", filme,
                          "\nRating:",
                          avaliacao,
                          "\nYear:", ano))) +
  geom_point(size = 4, color = paleta[1]) +
  scale_y_continuous(limits = c(0, 100)) +
  labs(y = "Rating RT", x = "Year of Release")
ggplotly(p, tooltip = "text") %>%
  layout(autosize = FALSE)
filmes %>%
  ggplot(aes(x = avaliacao)) +
  geom_histogram(aes(y = (..count..) / sum(..count..)), binwidth = 10, boundary = 0,
                 fill = paleta[3], color = "black") +
  geom_rug(size = .5) +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  labs(y = "Relative Frequency", x = "Rating RT")
A considerable number of movies have ratings above 80.
No values fall outside the expected domain (there are no negative values, for example).
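Both observations can be checked with a couple of lines. The sketch below uses only the first nine ratings printed by glimpse(), so the counts describe that partial sample and not the full set of 20 movies; acima_de_80 and no_dominio are names made up for the example.

```r
# First nine ratings visible in the glimpse() output above.
avaliacao <- c(92, 67, 72, 52, 73, 59, 82, 85, 92)

# How many of these movies are rated above 80, and what share is that?
acima_de_80 <- avaliacao > 80
sum(acima_de_80)
mean(acima_de_80)

# RottenTomatoes ratings are percentages, so they should lie in [0, 100].
no_dominio <- all(avaliacao >= 0 & avaliacao <= 100)
no_dominio
```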
p <- filmes %>%
  ggplot(aes(x = "",
             y = avaliacao,
             text = paste(
               "Movie:", filme,
               "\nRating:", avaliacao))) +
  geom_jitter(width = .05, alpha = .3, size = 3) +
  labs(x = "", y = "Rating RT")
ggplotly(p, tooltip = "text") %>%
  layout(autosize = FALSE)
agrupamento_h = filmes %>%
  mutate(nome = paste0(filme, " (bil=", bilheteria, ")")) %>%
  as.data.frame() %>%
  column_to_rownames("filme") %>%
  select(bilheteria) %>%
  dist(method = "euclidean") %>%
  hclust(method = "centroid")
ggdendrogram(agrupamento_h, rotate = TRUE, size = 2, theme_dendro = FALSE) +
  labs(y = "Dissimilarity", x = "", title = "Dendrogram") +
  geom_hline(aes(yintercept = c(20, 30), color = c("4 groups", "3 groups"))) +
  scale_colour_manual(name = "#Groups",
                      values = c("#56B4E9", "#FF9999"))
atribuicoes = get_grupos(agrupamento_h, num_grupos = 1:6)
atribuicoes = atribuicoes %>%
  left_join(filmes, by = c("label" = "filme"))
atribuicoes %>%
  ggplot(aes(x = "Movies", y = bilheteria, colour = grupo)) +
  geom_jitter(width = .02, height = 0, size = 1.6, alpha = .6) +
  facet_wrap(~ paste(k, " groups")) +
  scale_color_brewer(palette = "Dark2") +
  labs(y = "Box Office (MM)", x = "", title = "Grouping by Box Office") +
  guides(color = guide_legend(title = "group"))
k_escolhido = 4
m <- list(l = 220)
p <- atribuicoes %>%
  filter(k == k_escolhido) %>%
  ggplot(aes(x = reorder(label, bilheteria),
             y = bilheteria,
             colour = grupo,
             text = paste(
               "Movie:", reorder(label, bilheteria),
               "\nBox Office:", bilheteria, "m",
               "\nGroup:", grupo))) +
  geom_jitter(width = .02, height = 0, size = 3, alpha = .6) +
  facet_wrap(~ paste(k, " groups")) +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "", y = "Box Office (MM)") +
  guides(color = guide_legend(title = "group")) +
  coord_flip()
ggplotly(p, tooltip = "text") %>%
  layout(autosize = FALSE, margin = m)
agrupamento_h = filmes %>%
  mutate(nome = paste0(filme, " (av=", avaliacao, ")")) %>%
  as.data.frame() %>%
  column_to_rownames("filme") %>%
  select(avaliacao) %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D")
ggdendrogram(agrupamento_h, rotate = TRUE, size = 2, theme_dendro = FALSE) +
  labs(y = "Dissimilarity", x = "", title = "Dendrogram") +
  geom_hline(aes(yintercept = 30), color = "red")
atribuicoes = get_grupos(agrupamento_h, num_grupos = 1:6)
atribuicoes = atribuicoes %>%
  left_join(filmes, by = c("label" = "filme"))
atribuicoes %>%
  ggplot(aes(x = "Movies", y = avaliacao, colour = grupo)) +
  geom_jitter(width = .02, height = 0, size = 1.6, alpha = .6) +
  facet_wrap(~ paste(k, " groups")) +
  scale_color_brewer(palette = "Dark2") +
  guides(color = guide_legend(title = "group")) +
  labs(y = "Rating RT", x = "", title = "Grouping by Rating")
k_escolhido = 3
m <- list(l = 220)
p <- atribuicoes %>%
  filter(k == k_escolhido) %>%
  ggplot(aes(x = reorder(label, avaliacao),
             y = avaliacao,
             colour = grupo,
             text = paste(
               "Movie:", reorder(label, avaliacao),
               "\nRating:", avaliacao,
               "\nGroup:", grupo))) +
  geom_jitter(width = .02, height = 0, size = 3, alpha = .6) +
  facet_wrap(~ paste(k, " groups")) +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "", y = "Rating RT") +
  guides(color = guide_legend(title = "group")) +
  coord_flip()
ggplotly(p, tooltip = "text") %>%
  layout(autosize = FALSE, margin = m)
agrupamento_h_2d = filmes %>%
  mutate(bilheteria = log10(bilheteria)) %>%
  # funs() is deprecated in dplyr; a formula does the same standardisation
  mutate_at(vars("avaliacao", "bilheteria"), ~ as.numeric(scale(.))) %>%
  column_to_rownames("filme") %>%
  select("avaliacao", "bilheteria") %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D")
ggdendrogram(agrupamento_h_2d, rotate = TRUE, theme_dendro = FALSE) +
  labs(y = "Dissimilarity", x = "", title = "Dendrogram") +
  geom_hline(aes(yintercept = 4), color = "red")
filmes2 <- filmes %>%
  mutate(bilheteria = log10(bilheteria))
plota_hclusts_2d(agrupamento_h_2d,
                 filmes2,
                 c("avaliacao", "bilheteria"),
                 linkage_method = "ward.D",
                 ks = 1:6,
                 palette = "Dark2") +
  facet_wrap(~ paste(k, " groups")) +
  scale_y_log10() +
  guides(color = guide_legend(title = "group")) +
  labs(y = "Box Office", x = "Rating", title = "Grouping with two dimensions")
atribuicoes = get_grupos(agrupamento_h_2d, num_grupos = 1:6)
atribuicoes = atribuicoes %>%
  filter(k == 5) %>%
  mutate(filme = label) %>%
  left_join(filmes, by = "filme")
p <- atribuicoes %>%
  ggplot(aes(x = avaliacao,
             y = bilheteria,
             colour = grupo,
             text = paste(
               "Movie:", filme,
               "\nBox Office:", bilheteria, "m\n",
               "Rating:", avaliacao))) +
  geom_jitter(width = .02, height = 0, size = 3, alpha = .6) +
  facet_wrap(~ paste(k, " groups")) +
  scale_color_brewer(palette = "Dark2") +
  scale_y_log10() +
  guides(color = guide_legend(title = "group")) +
  labs(y = "Box Office", x = "Rating RT")
ggplotly(p, tooltip = "text") %>%
  layout(autosize = FALSE)
\(\color{#16A085}{\text{Group 1 (Oddball):}}\) Movies that, overall, were not well received by the general public, which is reflected in their low revenue. The name Oddball comes from the later interest in these movies among people who consider themselves eccentric, not rarely as a way to revalidate their sense of exclusivity.
\(\color{#CF5300}{\text{Group 2 (Matinee):}}\) Movies overall not so well received by the critics and more formulaic. In terms of box office, most of them had low revenue but still paid for themselves. The name Matinee comes from the idea that movies that did not perform that well, or that have long stopped being a hot topic, end up being aired in that time slot.
\(\color{#7C3F7C}{\text{Group 3 (Demolition of a budget):}}\) Movies overall poorly received by both critics and the public, which is reflected in their small box office and low ratings. The name of the group is a play on the very small revenue of these movies, which “demolished” the investment of those who bet on them.
\(\color{magenta}{\text{Group 4 (Broke Records and Awards):}}\) Movies acclaimed by critics and whose box office was either successful or at least decent. The movies in this group have a more serious tone and deal with serious subjects that frequently create controversy (serial murders, non-heterosexuality, terrorism, ...). The name of the group is a play on the title of one of its movies and the sheer number of prizes that particular movie won.
\(\color{green}{\text{Group 5 (BlockBusters):}}\) Movies in which Jake acted that the critics did not like that much but that collected a huge box office, with revenue on the scale of hundreds of millions. The term blockbuster is usually given to movies that attract crowds to the movie theaters, which is the case for the movies in this group.
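To make the link between the dendrogram and group descriptions like these more concrete, the sketch below repeats the two-dimensional clustering on the nine rows visible in the glimpse() output and summarises each resulting group. It is an illustration only: get_grupos() is the report's own helper and its implementation is not shown here, so base R's cutree() stands in for it, and k = 3 is an arbitrary choice for this small subset, not the k = 5 used for the groups described above.

```r
# First nine (avaliacao, bilheteria) pairs visible in the glimpse() output.
avaliacao  <- c(92, 67, 72, 52, 73, 59, 82, 85, 92)
bilheteria <- c(4.2, 30.2, 10.7, 1.7, 46.6, 42.4, 61.0, 39.1, 54.7)

# Same preprocessing as the report: log10 of box office, then
# standardise both columns so neither dominates the distance.
dados <- scale(cbind(avaliacao = avaliacao, bilheteria = log10(bilheteria)))

# Hierarchical clustering with Euclidean distance and Ward linkage.
agrupamento <- hclust(dist(dados, method = "euclidean"), method = "ward.D")

# cutree() stands in for the report's get_grupos(); k = 3 is arbitrary here.
grupo <- cutree(agrupamento, k = 3)

# Mean rating and mean box office (original units) per group.
filmes_parciais <- data.frame(avaliacao, bilheteria, grupo)
resumo <- aggregate(cbind(avaliacao, bilheteria) ~ grupo,
                    data = filmes_parciais, FUN = mean)
resumo
```

Reading the per-group means side by side is what supports labels such as "high rating, low revenue"; with the full table and k = 5 the same summary would correspond to the five groups above.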