Introduction

GitHub

This line of code is quite magical. Change any opts_chunk here and it will change setting for the entire document. You can set any option you want: echo, include, warning or messages.

knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Load and clean data

df <- data %>% mutate(
  genre = replace(genre, genre == "Suspense / Thriller", "Suspense"),
  genre = replace(genre, genre == "Documentales", "Documental"),
  genre = replace(genre, genre == "Ciencia ficci?n", "SyFy"),
  genre = replace(genre, genre == "Infantil/Familiar", "Infantil")
)

The data set contains 1510 instances and 16 variables. From now on, we won’t be using the variable Description and a bunch more related to the creators of the film (directors, writers, actors, etc).

Let’s see a few examples of the data

# Hay varios temas predeterminados en kableExtra, como kable_paper()
df_clean %>% 
  head(6) %>%
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
Date Channel Spanish title Original title Year Genre Country Length
2022-09-13 00:19:00 Paramount Network Vanilla Sky Vanilla Sky 2001 Drama Estados Unidos 130
2022-09-13 00:42:00 Neox Ruslan: la venganza del asesino Driven to Kill 2009 Acción Canadá 98
2022-09-13 01:10:00 laSexta La mujer del pastor The Pastor’s Wife 2011 Drama Canadá 87
2022-09-13 13:10:00 La 2 El sonido de un tambor Cimarron: The Sound of a Drum 1968 Western Estados Unidos 70
2022-09-13 16:05:00 TRECE Comando secreto The Secret War of Harry Frigg 1968 Comedia Estados Unidos 110
2022-09-13 16:20:00 La 1 La cuchara de Elli Tessa Hennig - Elli gibt den Löffel ab 2012 Drama Alemania 90

Some basic facts about the dataset

How many films have been shown in each channel?

library(ggplot2)

df_clean %>% 
  group_by(Channel) %>% 
  summarise(`Film count` = n()) %>%
  ggplot(aes(x = `Film count`, y = reorder(Channel, `Film count`))) +
  geom_bar(stat = 'identity') +
  theme_classic() +
  ylab("Channel")

How many unique values there are in each variable?

library(purrr)
df_clean %>% map_dbl(
  n_distinct) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
x
Date 1375
Channel 11
Spanish title 994
Original title 986
Year 83
Genre 22
Country 30
Length 119

So, if there are 1510 instances, why do we only have 994 movies? I guess some of them were broadcasted more than once. This is the top 10 (I filtered a couple of instances that are not films, but regular TV programs):

df_clean %>% 
  filter(`Spanish title` != "Mejor llama a Kiko" & `Spanish title` != "El zapping de Surferos") %>% 
  group_by(`Spanish title`) %>% 
  summarise(Emisiones = n()) %>% 
  filter(Emisiones > 1) %>% 
  arrange(desc(Emisiones)) %>% 
  head(10) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
Spanish title Emisiones
Querido fotogramas 16
Spanish Movie 8
Diario de Greg 2: La Ley de Rodrick 5
El diario de Greg 5
Locos por las motos 5
Lord Jim 5
Superlópez 5
Zombies 3 5
Amigos… 4
Apocalypse Now 4

The discrepancy between Spanish title (994) and Original title (986) count is striking. There must be one more duplicated value in the latter.

df_clean %>% 
  group_by(`Original title`) %>% 
  distinct() %>% 
  filter(n()>1) %>% 
  arrange(`Original title`) %>% 
  select(3:4) %>% 
  distinct(`Spanish title`) %>% 
  mutate(Count = n()) %>% 
  filter(Count >= 2) %>% 
  select(-Count) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
Spanish title Original title
Asteroides, el nuevo dorado Astéroïdes, le nouvel eldorado spatial?
Asteroides, el nuevo El Dorado Astéroïdes, le nouvel eldorado spatial?
El cabo del miedo Cape Fear
El Cabo del Terror Cape Fear
El justiciero de la ciudad Death Wish
El justiciero Death Wish
Dragon: la vida de Bruce Lee Dragon: The Bruce Lee Story
Dragón: la vida de Bruce Lee Dragon: The Bruce Lee Story
Aventura en el centro de la Tierra Journey to the Center of the Earth
Viaje al centro de la tierra Journey to the Center of the Earth
Robin Hood, el magnífico Robin Hood
Robin Hood Robin Hood
El Padrino II The Godfather: Part II
El Padrino parte II The Godfather: Part II
El último patriota The Patriot
El patriota The Patriot
El estrafalario prisionero de Zenda The Prisoner of Zenda
El prisionero de Zenda The Prisoner of Zenda
The Score (Un golpe maestro) The Score
Un golpe maestro The Score

As you can see, some different films share the same Original title. If this happens more times from now on, we will now. Above is the list of every film shown more than once.

Film length

How long are the films shown in TV?

This is the distribution of film length. The mean is 99 and the median 99.

library(ggplot2)
df_clean %>%
  drop_na() %>%
  ggplot(aes(x = Length)) +
    geom_histogram(aes(y=..density..), binwidth = 5) +
    geom_density(alpha = 0.2, fill = "#FF6666") +
    theme_classic()

Let’s see how it is by channel.

df_clean %>% 
  drop_na() %>% 
  ggplot(aes(x = Length)) +
    geom_histogram(aes(y=..density..), binwidth = 5) +
    geom_density(alpha = 0.2, fill = "#FF6666") +
    facet_wrap(~ Channel) +
    theme_classic() 

A violin plot (a boxplot would be useful too) may be easier to interpret. It is remarkable how similar are lengths in Antena 3 channel.

df_clean %>%  
  ggplot(aes(x = Length, y = reorder(Channel, Length, median),
             color = Channel, fill = Channel)) +
  geom_violin(size = 0.9, alpha = 0.2, show.legend = F) +
  theme_classic()

four_shortest <- df_clean %>% 
  group_by(Channel) %>% 
  summarise(Median = median(Length)) %>% 
  arrange(Median) %>% 
  slice_head(n = 4) %>% 
  select(-Median) %>% 
  as.list() %>% 
  unname() %>%
  unlist()

Why are Antena 3, laSexta, La 1 and La 2 broadcasting the shortest films? Let’s have a glimpse.

library(stringr)
library(kableExtra)
library(lubridate)

df_clean %>% 
  mutate("Weekday" = weekdays.Date(Date, abbreviate = F)) %>% 
  relocate("Weekday", .after = Date) %>% 
  transmute(
    Date = Date,
    Channel = Channel,
    `Spanish title` = `Spanish title`,
    "Weekday" = case_when(
      str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("lunes|martes|miercoles|jueves", ignore_case = T)) ~ "Laboral",
      str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("viernes|sabado|domingo", ignore_case = T)) ~ "Finde"
      ),
    Length = Length
    ) %>% 
  filter(Channel == four_shortest) %>% 
  arrange(Length) %>%
  head(15) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
Date Channel Spanish title Weekday Length
2022-09-14 22:56:00 La 2 Grace Kelly: Los millones perdidos Laboral 45
2022-10-21 04:30:00 La 2 La expulsión de los moriscos Finde 60
2022-09-13 13:10:00 La 2 El sonido de un tambor Laboral 70
2022-09-15 13:15:00 La 2 El último lobo Laboral 71
2022-10-15 17:45:00 Antena 3 Obsesión del pasado Finde 79
2022-10-02 05:30:00 La 2 Una esvástica sobre el Bidasoa Finde 80
2022-10-07 23:59:00 La 2 El elegido Finde 80
2022-10-10 02:50:00 La 1 Los médicos y yo Laboral 80
2022-10-12 19:30:00 Antena 3 Viviendo con una extraña Laboral 81
2022-10-09 02:05:00 La 1 Amor al sol Finde 82
2022-10-23 16:00:00 Antena 3 La acusación de Rachel Finde 82
2022-09-25 16:00:00 Antena 3 Un encuentro peligroso Finde 83
2022-10-06 00:40:00 La 2 Con la pata quebrada Laboral 83
2022-10-08 19:30:00 Antena 3 Muerte en familia Finde 83
2022-10-22 16:00:00 Antena 3 Belleza mortal Finde 83

It is obvious how short are these films; most of them are broadcasted during the afternoon, starting between 16 p.m. and 19 p.m. Most of these films are also shown on weekends (remember, including Fridays).

Now we transform Length in three categories being (1) less than 90 minutes, (2) between 90 and 150 minutes and (3) more than 150 minutes.

df_clean <- df_clean %>% 
  mutate(
    length_category = case_when(
      Length <= 90 ~ "Less than 90 minutes",
      Length > 90 & Length <= 150 ~ "Between 90 and 150 minutes",
      Length > 150 ~ "More than 150 minutes",
      TRUE ~ "Rest"
    )
  )

Not color-blind graphic

perc_data <- df_clean %>% 
    mutate(length_category = factor(length_category, levels = c("Less than 90 minutes", 
                                                      "Between 90 and 150 minutes", "More than 150 minutes"))) %>% 
  group_by(Channel) %>% 
  mutate(Channel_shown = n()) %>% 
  group_by(Channel, length_category) %>% 
  summarise(perc = n()/max(Channel_shown))

library(scales) # For percent

perc_data %>% 
  ggplot(aes(x = length_category, y = perc, fill = length_category)) +
  geom_bar(stat = 'identity', show.legend = "none") +
  facet_wrap(~ Channel) +
  geom_text(aes(label = percent(perc)), vjust = -0.2, size = 2.3) +
  theme_classic() +
  theme(axis.text.x = element_blank())

Color-blind graphic

library(ggtext)
library(thematic)
# Color-blind safe colors
coloritos <- thematic::okabe_ito(3)
perc_data %>% 
  ggplot(aes(x = length_category, y = perc, fill = length_category)) +
  geom_bar(stat = 'identity', show.legend = "none") +
  scale_fill_manual(values = coloritos) +
  facet_wrap(~ Channel) +
  geom_text(aes(label = percent(perc)), vjust = -0.2, size = 2.3) +
  theme_classic() +
  theme(axis.text.x = element_blank())

By date

This is a list of the number of films shown each day.

library(stringr)
library(kableExtra)
library(lubridate)

df_clean_film_day <- df_clean %>% 
  mutate("Weekday" = weekdays.Date(Date, abbreviate = F)) %>% 
  relocate("Weekday", .after = Date) %>% 
  transmute(
    Date = Date,
    "Weekday" = case_when(
      str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("lunes|martes|miercoles|jueves", ignore_case = T)) ~ "Laboral",
      str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("viernes|sabado|domingo", ignore_case = T)) ~ "Finde"
      )
    ) %>% 
  group_by(
    Date = as.Date(Date), Weekday) %>% 
  summarise(Films = n()) 
  

df_clean_film_day %>% 
  head(10) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
Date Weekday Films
2022-09-13 Laboral 11
2022-09-14 Laboral 13
2022-09-15 Laboral 11
2022-09-16 Finde 20
2022-09-17 Finde 38
2022-09-18 Finde 34
2022-09-19 Laboral 17
2022-09-20 Laboral 22
2022-09-21 Laboral 23
2022-09-22 Laboral 27

There is a surprising spike on Wednesday 12th October which is not that surpring when I tell you that it is Spanish national day.

df_clean_film_day %>% 
  ggplot(aes(x = Date, y = Films, fill = Weekday)) +
  geom_col() +
  theme_classic()

Mapping the data (or analyzing the country variable)

First, we have to prepare the data. Country names are in Spanish, let’s translate them.

unique(df_clean$Country)
##  [1] "Estados Unidos"     "Canadá"             "Alemania"          
##  [4] "España"             "Italia"             "Francia"           
##  [7] "Reino Unido"        "Dinamarca"          "Sudáfrica"         
## [10] "Australia"          "Corea del Sur"      "Tailandia"         
## [13] "Suecia"             "Rusia"              "China"             
## [16] "Hong Kong"          "Islandia"           "Argentina"         
## [19] "República Checa"    "Noruega"            "Países Bajos"      
## [22] "Bélgica"            "Turquía"            "Taiwán"            
## [25] "Alemania del Oeste" "Irlanda"            "Ucrania"           
## [28] NA                   "Japón"              "México"

I will check for empty values in the Country variable. If there isn’t we will continue.

df_clean %>% 
  filter(Country == "")
## # A tibble: 0 x 9
## # ... with 9 variables: Date <dttm>, Channel <chr>, Spanish title <chr>,
## #   Original title <chr>, Year <dbl>, Genre <chr>, Country <chr>, Length <dbl>,
## #   length_category <chr>
df_map <- df_clean %>% 
  group_by(`Country`) %>% 
  summarise(Movies = n())

df_map <- df_map %>% 
  mutate(
    Country = case_when(
      Country == "Alemania" ~ "Germany",
      Country == "Alemania del Oeste" ~ "Germany",
      Country == "Argentina" ~ "Argentina",
      Country == "Australia" ~ "Australia",
      Country == "Bélgica" ~ "Belgium",
      Country == "Canadá" ~ "Canada",
      Country == "China" ~ "China",
      Country == "Corea del Sur" ~ "South Korea",
      Country == "Dinamarca" ~ "Denmark",
      Country == "España" ~ "Spain",
      Country == "Estados Unidos" ~ "United States of America",
      Country == "Francia" ~ "France",
      Country == "Hong Kong" ~ "Hong Kong",
      Country == "Irlanda" ~ "Ireland", 
      Country == "Islandia" ~ "Island",
      Country == "Italia" ~ "Italy",
      Country == "Noruega" ~ "Norway",
      Country == "Países Bajos" ~ "Netherlands",
      Country == "Reino Unido" ~ "United Kingdom",
      Country == "República Checa" ~ "Czech Republic",
      Country == "Rusia" ~ "Russia",
      Country == "Sudáfrica" ~ "South Africa",
      Country == "Suecia" ~ "Sweden",
      Country == "Tailandia" ~ "Tailand",
      Country == "Taiwán" ~ "Taiwan",
      Country == "Turquía" ~ "Turkey",
      Country == "Ucrania" ~ "Ukraine",
      is.na(Country) == TRUE ~ "Sin datos"
    )
  )

df_map %>% 
  arrange(desc(Movies)) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)
Country Movies
United States of America 924
Spain 248
Sin datos 99
United Kingdom 56
Germany 47
Canada 44
France 28
Italy 19
Australia 9
South Korea 6
Norway 5
Argentina 2
China 2
Hong Kong 2
Ireland 2
Russia 2
Sweden 2
Germany 1
Belgium 1
Denmark 1
Island 1
NA 1
NA 1
Netherlands 1
Czech Republic 1
South Africa 1
Tailand 1
Taiwan 1
Turkey 1
Ukraine 1

First, let’s use a bar plot. Mmm, we do seem to like USA films…

df_map %>% 
  ggplot(aes(x = reorder(Country, Movies), y = Movies)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme_classic()

Then, we plot a map.

library(sf)
library(rnaturalearth)

world <- ne_countries(scale = "small", returnclass = "sf")

world %>% 
  ggplot() + 
  geom_sf() +
  geom_hline(yintercept = 0, linetype = "dashed") 

world <- world %>% 
  dplyr::rename("Country" = "sovereignt")

df_world <- left_join(world, df_map)

Let’s see how it looks filled.

library(ggplot2)
library(sf)
df_world %>% 
  ggplot() +
  geom_sf(aes(fill = Movies)) +
  theme_void() +
  theme(legend.position = "top") +
  labs(fill = "Number of movies:") +
  guides(fill = guide_legend(nrow = 2, byrow = TRUE))

I will plot the same map but without United States (and maybe without Spain too). The scale is very confusing as it is.

df_world %>% 
  filter(Country != "United States of America" & Country != "Spain") %>% 
  ggplot() +
  geom_sf(aes(fill = Movies)) +
  theme_void() +
  theme(legend.position = "top") +
  labs(fill = "Number of movies:") +
  guides(fill = guide_legend(nrow = 2, byrow = TRUE))

I want to see the Country distribution by channel; for example, to analize which channel promote Spanish cinema. Or how the public channels behave.

library(forcats)
df_clean %>%
  group_by(Channel, Country) %>% 
  count(Country) %>% 
  ggplot(aes(x = fct_lump(Country, prop = 0.05), y = n)) +
  geom_col() +
  facet_wrap(~Channel) +
  coord_flip() +
  theme_classic()

This we we can see better how Countries are distributed within each channel.

perc_data_2 <- df_clean %>% 
  group_by(Channel) %>% 
  mutate(Channel_shown = n()) %>% 
  group_by(Channel, Country, Channel_shown) %>% 
  summarise(perc = n()/max(Channel_shown))

perc_data_2 %>% 
  ggplot(aes(x = fct_lump(Country, prop = 0.05), y = perc)) +
  geom_col() +
  facet_wrap(~Channel) +
  coord_flip() +
  theme_classic()

I want to do the same but removing films from United States.

perc_data_3 <- df_clean %>% 
  filter(Country != "Estados Unidos") %>% 
  group_by(Channel) %>% 
  mutate(Channel_shown = n()) %>% 
  group_by(Channel, Country, Channel_shown) %>% 
  summarise(perc = n()/max(Channel_shown))
  
  
perc_data_3 %>%
  ggplot(aes(x = fct_lump(Country, prop = 0.05), y = perc)) +
  geom_col() +
  facet_wrap(~Channel) +
  coord_flip() +
  theme_classic()