This line of code is quite magical. Change any
opts_chunk here and it will change setting for the entire
document. You can set any option you want: echo,
include, warning or messages.
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
df <- data %>% mutate(
genre = replace(genre, genre == "Suspense / Thriller", "Suspense"),
genre = replace(genre, genre == "Documentales", "Documental"),
genre = replace(genre, genre == "Ciencia ficci?n", "SyFy"),
genre = replace(genre, genre == "Infantil/Familiar", "Infantil")
)
The data set contains 1510 instances and 16 variables. From now on,
we won’t be using the variable Description and a bunch more
related to the creators of the film (directors, writers, actors,
etc).
Let’s see a few examples of the data
# Hay varios temas predeterminados en kableExtra, como kable_paper()
df_clean %>%
head(6) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
| Date | Channel | Spanish title | Original title | Year | Genre | Country | Length |
|---|---|---|---|---|---|---|---|
| 2022-09-13 00:19:00 | Paramount Network | Vanilla Sky | Vanilla Sky | 2001 | Drama | Estados Unidos | 130 |
| 2022-09-13 00:42:00 | Neox | Ruslan: la venganza del asesino | Driven to Kill | 2009 | Acción | Canadá | 98 |
| 2022-09-13 01:10:00 | laSexta | La mujer del pastor | The Pastor’s Wife | 2011 | Drama | Canadá | 87 |
| 2022-09-13 13:10:00 | La 2 | El sonido de un tambor | Cimarron: The Sound of a Drum | 1968 | Western | Estados Unidos | 70 |
| 2022-09-13 16:05:00 | TRECE | Comando secreto | The Secret War of Harry Frigg | 1968 | Comedia | Estados Unidos | 110 |
| 2022-09-13 16:20:00 | La 1 | La cuchara de Elli | Tessa Hennig - Elli gibt den Löffel ab | 2012 | Drama | Alemania | 90 |
library(ggplot2)
df_clean %>%
group_by(Channel) %>%
summarise(`Film count` = n()) %>%
ggplot(aes(x = `Film count`, y = reorder(Channel, `Film count`))) +
geom_bar(stat = 'identity') +
theme_classic() +
ylab("Channel")
library(purrr)
df_clean %>% map_dbl(
n_distinct) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
| x | |
|---|---|
| Date | 1375 |
| Channel | 11 |
| Spanish title | 994 |
| Original title | 986 |
| Year | 83 |
| Genre | 22 |
| Country | 30 |
| Length | 119 |
So, if there are 1510 instances, why do we only have 994 movies? I guess some of them were broadcasted more than once. This is the top 10 (I filtered a couple of instances that are not films, but regular TV programs):
df_clean %>%
filter(`Spanish title` != "Mejor llama a Kiko" & `Spanish title` != "El zapping de Surferos") %>%
group_by(`Spanish title`) %>%
summarise(Emisiones = n()) %>%
filter(Emisiones > 1) %>%
arrange(desc(Emisiones)) %>%
head(10) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
| Spanish title | Emisiones |
|---|---|
| Querido fotogramas | 16 |
| Spanish Movie | 8 |
| Diario de Greg 2: La Ley de Rodrick | 5 |
| El diario de Greg | 5 |
| Locos por las motos | 5 |
| Lord Jim | 5 |
| Superlópez | 5 |
| Zombies 3 | 5 |
| Amigos… | 4 |
| Apocalypse Now | 4 |
The discrepancy between Spanish title (994) and
Original title (986) count is striking. There must be one
more duplicated value in the latter.
df_clean %>%
group_by(`Original title`) %>%
distinct() %>%
filter(n()>1) %>%
arrange(`Original title`) %>%
select(3:4) %>%
distinct(`Spanish title`) %>%
mutate(Count = n()) %>%
filter(Count >= 2) %>%
select(-Count) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
| Spanish title | Original title |
|---|---|
| Asteroides, el nuevo dorado | Astéroïdes, le nouvel eldorado spatial? |
| Asteroides, el nuevo El Dorado | Astéroïdes, le nouvel eldorado spatial? |
| El cabo del miedo | Cape Fear |
| El Cabo del Terror | Cape Fear |
| El justiciero de la ciudad | Death Wish |
| El justiciero | Death Wish |
| Dragon: la vida de Bruce Lee | Dragon: The Bruce Lee Story |
| Dragón: la vida de Bruce Lee | Dragon: The Bruce Lee Story |
| Aventura en el centro de la Tierra | Journey to the Center of the Earth |
| Viaje al centro de la tierra | Journey to the Center of the Earth |
| Robin Hood, el magnÃfico | Robin Hood |
| Robin Hood | Robin Hood |
| El Padrino II | The Godfather: Part II |
| El Padrino parte II | The Godfather: Part II |
| El último patriota | The Patriot |
| El patriota | The Patriot |
| El estrafalario prisionero de Zenda | The Prisoner of Zenda |
| El prisionero de Zenda | The Prisoner of Zenda |
| The Score (Un golpe maestro) | The Score |
| Un golpe maestro | The Score |
As you can see, some different films share the same
Original title. If this happens more times from now on, we
will now. Above is the list of every film shown more than once.
This is the distribution of film length. The mean is 99 and the median 99.
library(ggplot2)
df_clean %>%
drop_na() %>%
ggplot(aes(x = Length)) +
geom_histogram(aes(y=..density..), binwidth = 5) +
geom_density(alpha = 0.2, fill = "#FF6666") +
theme_classic()
Let’s see how it is by channel.
df_clean %>%
drop_na() %>%
ggplot(aes(x = Length)) +
geom_histogram(aes(y=..density..), binwidth = 5) +
geom_density(alpha = 0.2, fill = "#FF6666") +
facet_wrap(~ Channel) +
theme_classic()
A violin plot (a boxplot would be useful too) may be easier to interpret. It is remarkable how similar are lengths in Antena 3 channel.
df_clean %>%
ggplot(aes(x = Length, y = reorder(Channel, Length, median),
color = Channel, fill = Channel)) +
geom_violin(size = 0.9, alpha = 0.2, show.legend = F) +
theme_classic()
four_shortest <- df_clean %>%
group_by(Channel) %>%
summarise(Median = median(Length)) %>%
arrange(Median) %>%
slice_head(n = 4) %>%
select(-Median) %>%
as.list() %>%
unname() %>%
unlist()
Why are Antena 3, laSexta, La 1 and La 2 broadcasting the shortest films? Let’s have a glimpse.
library(stringr)
library(kableExtra)
library(lubridate)
df_clean %>%
mutate("Weekday" = weekdays.Date(Date, abbreviate = F)) %>%
relocate("Weekday", .after = Date) %>%
transmute(
Date = Date,
Channel = Channel,
`Spanish title` = `Spanish title`,
"Weekday" = case_when(
str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("lunes|martes|miercoles|jueves", ignore_case = T)) ~ "Laboral",
str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("viernes|sabado|domingo", ignore_case = T)) ~ "Finde"
),
Length = Length
) %>%
filter(Channel == four_shortest) %>%
arrange(Length) %>%
head(15) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
| Date | Channel | Spanish title | Weekday | Length |
|---|---|---|---|---|
| 2022-09-14 22:56:00 | La 2 | Grace Kelly: Los millones perdidos | Laboral | 45 |
| 2022-10-21 04:30:00 | La 2 | La expulsión de los moriscos | Finde | 60 |
| 2022-09-13 13:10:00 | La 2 | El sonido de un tambor | Laboral | 70 |
| 2022-09-15 13:15:00 | La 2 | El último lobo | Laboral | 71 |
| 2022-10-15 17:45:00 | Antena 3 | Obsesión del pasado | Finde | 79 |
| 2022-10-02 05:30:00 | La 2 | Una esvástica sobre el Bidasoa | Finde | 80 |
| 2022-10-07 23:59:00 | La 2 | El elegido | Finde | 80 |
| 2022-10-10 02:50:00 | La 1 | Los médicos y yo | Laboral | 80 |
| 2022-10-12 19:30:00 | Antena 3 | Viviendo con una extraña | Laboral | 81 |
| 2022-10-09 02:05:00 | La 1 | Amor al sol | Finde | 82 |
| 2022-10-23 16:00:00 | Antena 3 | La acusación de Rachel | Finde | 82 |
| 2022-09-25 16:00:00 | Antena 3 | Un encuentro peligroso | Finde | 83 |
| 2022-10-06 00:40:00 | La 2 | Con la pata quebrada | Laboral | 83 |
| 2022-10-08 19:30:00 | Antena 3 | Muerte en familia | Finde | 83 |
| 2022-10-22 16:00:00 | Antena 3 | Belleza mortal | Finde | 83 |
It is obvious how short are these films; most of them are broadcasted during the afternoon, starting between 16 p.m. and 19 p.m. Most of these films are also shown on weekends (remember, including Fridays).
Now we transform Length in three categories being (1)
less than 90 minutes, (2) between 90 and 150 minutes and (3) more than
150 minutes.
df_clean <- df_clean %>%
mutate(
length_category = case_when(
Length <= 90 ~ "Less than 90 minutes",
Length > 90 & Length <= 150 ~ "Between 90 and 150 minutes",
Length > 150 ~ "More than 150 minutes",
TRUE ~ "Rest"
)
)
perc_data <- df_clean %>%
mutate(length_category = factor(length_category, levels = c("Less than 90 minutes",
"Between 90 and 150 minutes", "More than 150 minutes"))) %>%
group_by(Channel) %>%
mutate(Channel_shown = n()) %>%
group_by(Channel, length_category) %>%
summarise(perc = n()/max(Channel_shown))
library(scales) # For percent
perc_data %>%
ggplot(aes(x = length_category, y = perc, fill = length_category)) +
geom_bar(stat = 'identity', show.legend = "none") +
facet_wrap(~ Channel) +
geom_text(aes(label = percent(perc)), vjust = -0.2, size = 2.3) +
theme_classic() +
theme(axis.text.x = element_blank())
library(ggtext)
library(thematic)
# Color-blind safe colors
coloritos <- thematic::okabe_ito(3)
perc_data %>%
ggplot(aes(x = length_category, y = perc, fill = length_category)) +
geom_bar(stat = 'identity', show.legend = "none") +
scale_fill_manual(values = coloritos) +
facet_wrap(~ Channel) +
geom_text(aes(label = percent(perc)), vjust = -0.2, size = 2.3) +
theme_classic() +
theme(axis.text.x = element_blank())
This is a list of the number of films shown each day.
library(stringr)
library(kableExtra)
library(lubridate)
df_clean_film_day <- df_clean %>%
mutate("Weekday" = weekdays.Date(Date, abbreviate = F)) %>%
relocate("Weekday", .after = Date) %>%
transmute(
Date = Date,
"Weekday" = case_when(
str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("lunes|martes|miercoles|jueves", ignore_case = T)) ~ "Laboral",
str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("viernes|sabado|domingo", ignore_case = T)) ~ "Finde"
)
) %>%
group_by(
Date = as.Date(Date), Weekday) %>%
summarise(Films = n())
df_clean_film_day %>%
head(10) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
| Date | Weekday | Films |
|---|---|---|
| 2022-09-13 | Laboral | 11 |
| 2022-09-14 | Laboral | 13 |
| 2022-09-15 | Laboral | 11 |
| 2022-09-16 | Finde | 20 |
| 2022-09-17 | Finde | 38 |
| 2022-09-18 | Finde | 34 |
| 2022-09-19 | Laboral | 17 |
| 2022-09-20 | Laboral | 22 |
| 2022-09-21 | Laboral | 23 |
| 2022-09-22 | Laboral | 27 |
There is a surprising spike on Wednesday 12th October which is not that surpring when I tell you that it is Spanish national day.
df_clean_film_day %>%
ggplot(aes(x = Date, y = Films, fill = Weekday)) +
geom_col() +
theme_classic()
First, we have to prepare the data. Country names are in Spanish, let’s translate them.
unique(df_clean$Country)
## [1] "Estados Unidos" "Canadá" "Alemania"
## [4] "España" "Italia" "Francia"
## [7] "Reino Unido" "Dinamarca" "Sudáfrica"
## [10] "Australia" "Corea del Sur" "Tailandia"
## [13] "Suecia" "Rusia" "China"
## [16] "Hong Kong" "Islandia" "Argentina"
## [19] "República Checa" "Noruega" "PaÃses Bajos"
## [22] "Bélgica" "TurquÃa" "Taiwán"
## [25] "Alemania del Oeste" "Irlanda" "Ucrania"
## [28] NA "Japón" "México"
I will check for empty values in the Country variable.
If there isn’t we will continue.
df_clean %>%
filter(Country == "")
## # A tibble: 0 x 9
## # ... with 9 variables: Date <dttm>, Channel <chr>, Spanish title <chr>,
## # Original title <chr>, Year <dbl>, Genre <chr>, Country <chr>, Length <dbl>,
## # length_category <chr>
df_map <- df_clean %>%
group_by(`Country`) %>%
summarise(Movies = n())
df_map <- df_map %>%
mutate(
Country = case_when(
Country == "Alemania" ~ "Germany",
Country == "Alemania del Oeste" ~ "Germany",
Country == "Argentina" ~ "Argentina",
Country == "Australia" ~ "Australia",
Country == "Bélgica" ~ "Belgium",
Country == "Canadá" ~ "Canada",
Country == "China" ~ "China",
Country == "Corea del Sur" ~ "South Korea",
Country == "Dinamarca" ~ "Denmark",
Country == "España" ~ "Spain",
Country == "Estados Unidos" ~ "United States of America",
Country == "Francia" ~ "France",
Country == "Hong Kong" ~ "Hong Kong",
Country == "Irlanda" ~ "Ireland",
Country == "Islandia" ~ "Island",
Country == "Italia" ~ "Italy",
Country == "Noruega" ~ "Norway",
Country == "PaÃses Bajos" ~ "Netherlands",
Country == "Reino Unido" ~ "United Kingdom",
Country == "República Checa" ~ "Czech Republic",
Country == "Rusia" ~ "Russia",
Country == "Sudáfrica" ~ "South Africa",
Country == "Suecia" ~ "Sweden",
Country == "Tailandia" ~ "Tailand",
Country == "Taiwán" ~ "Taiwan",
Country == "TurquÃa" ~ "Turkey",
Country == "Ucrania" ~ "Ukraine",
is.na(Country) == TRUE ~ "Sin datos"
)
)
df_map %>%
arrange(desc(Movies)) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
| Country | Movies |
|---|---|
| United States of America | 924 |
| Spain | 248 |
| Sin datos | 99 |
| United Kingdom | 56 |
| Germany | 47 |
| Canada | 44 |
| France | 28 |
| Italy | 19 |
| Australia | 9 |
| South Korea | 6 |
| Norway | 5 |
| Argentina | 2 |
| China | 2 |
| Hong Kong | 2 |
| Ireland | 2 |
| Russia | 2 |
| Sweden | 2 |
| Germany | 1 |
| Belgium | 1 |
| Denmark | 1 |
| Island | 1 |
| NA | 1 |
| NA | 1 |
| Netherlands | 1 |
| Czech Republic | 1 |
| South Africa | 1 |
| Tailand | 1 |
| Taiwan | 1 |
| Turkey | 1 |
| Ukraine | 1 |
First, let’s use a bar plot. Mmm, we do seem to like USA films…
df_map %>%
ggplot(aes(x = reorder(Country, Movies), y = Movies)) +
geom_bar(stat = 'identity') +
coord_flip() +
theme_classic()
Then, we plot a map.
library(sf)
library(rnaturalearth)
world <- ne_countries(scale = "small", returnclass = "sf")
world %>%
ggplot() +
geom_sf() +
geom_hline(yintercept = 0, linetype = "dashed")
world <- world %>%
dplyr::rename("Country" = "sovereignt")
df_world <- left_join(world, df_map)
Let’s see how it looks filled.
library(ggplot2)
library(sf)
df_world %>%
ggplot() +
geom_sf(aes(fill = Movies)) +
theme_void() +
theme(legend.position = "top") +
labs(fill = "Number of movies:") +
guides(fill = guide_legend(nrow = 2, byrow = TRUE))
I will plot the same map but without United States (and maybe without Spain too). The scale is very confusing as it is.
df_world %>%
filter(Country != "United States of America" & Country != "Spain") %>%
ggplot() +
geom_sf(aes(fill = Movies)) +
theme_void() +
theme(legend.position = "top") +
labs(fill = "Number of movies:") +
guides(fill = guide_legend(nrow = 2, byrow = TRUE))
I want to see the Country distribution by channel; for example, to analize which channel promote Spanish cinema. Or how the public channels behave.
library(forcats)
df_clean %>%
group_by(Channel, Country) %>%
count(Country) %>%
ggplot(aes(x = fct_lump(Country, prop = 0.05), y = n)) +
geom_col() +
facet_wrap(~Channel) +
coord_flip() +
theme_classic()
This we we can see better how Countries are distributed within each channel.
perc_data_2 <- df_clean %>%
group_by(Channel) %>%
mutate(Channel_shown = n()) %>%
group_by(Channel, Country, Channel_shown) %>%
summarise(perc = n()/max(Channel_shown))
perc_data_2 %>%
ggplot(aes(x = fct_lump(Country, prop = 0.05), y = perc)) +
geom_col() +
facet_wrap(~Channel) +
coord_flip() +
theme_classic()
I want to do the same but removing films from United States.
perc_data_3 <- df_clean %>%
filter(Country != "Estados Unidos") %>%
group_by(Channel) %>%
mutate(Channel_shown = n()) %>%
group_by(Channel, Country, Channel_shown) %>%
summarise(perc = n()/max(Channel_shown))
perc_data_3 %>%
ggplot(aes(x = fct_lump(Country, prop = 0.05), y = perc)) +
geom_col() +
facet_wrap(~Channel) +
coord_flip() +
theme_classic()