Introduction

GitHub

This line of code is quite magical. Change any opts_chunk here and it will change setting for the entire document. You can set any option you want: echo, include, warning or messages.

knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Load and clean data

df <- data %>% mutate(
  genre = replace(genre, genre == "Suspense / Thriller", "Suspense"),
  genre = replace(genre, genre == "Documentales", "Documental"),
  genre = replace(genre, genre == "Ciencia ficci?n", "SyFy"),
  genre = replace(genre, genre == "Infantil/Familiar", "Infantil")
)

The data set contains 1510 instances and 16 variables. From now on, we won’t be using the variable Description and a bunch more related to the creators of the film (directors, writers, actors, etc).

Let’s see a few examples of the data

# Hay varios temas predeterminados en kableExtra, como kable_paper()
df_clean %>% 
  head(6) %>%
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)

Date	Channel	Spanish title	Original title	Year	Genre	Country	Length
2022-09-13 00:19:00	Paramount Network	Vanilla Sky	Vanilla Sky	2001	Drama	Estados Unidos	130
2022-09-13 00:42:00	Neox	Ruslan: la venganza del asesino	Driven to Kill	2009	Acción	Canadá	98
2022-09-13 01:10:00	laSexta	La mujer del pastor	The Pastor’s Wife	2011	Drama	Canadá	87
2022-09-13 13:10:00	La 2	El sonido de un tambor	Cimarron: The Sound of a Drum	1968	Western	Estados Unidos	70
2022-09-13 16:05:00	TRECE	Comando secreto	The Secret War of Harry Frigg	1968	Comedia	Estados Unidos	110
2022-09-13 16:20:00	La 1	La cuchara de Elli	Tessa Hennig - Elli gibt den Löffel ab	2012	Drama	Alemania	90

Some basic facts about the dataset

How many films have been shown in each channel?

library(ggplot2)

df_clean %>% 
  group_by(Channel) %>% 
  summarise(`Film count` = n()) %>%
  ggplot(aes(x = `Film count`, y = reorder(Channel, `Film count`))) +
  geom_bar(stat = 'identity') +
  theme_classic() +
  ylab("Channel")

How many unique values there are in each variable?

library(purrr)
df_clean %>% map_dbl(
  n_distinct) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)

	x
Date	1375
Channel	11
Spanish title	994
Original title	986
Year	83
Genre	22
Country	30
Length	119

So, if there are 1510 instances, why do we only have 994 movies? I guess some of them were broadcasted more than once. This is the top 10 (I filtered a couple of instances that are not films, but regular TV programs):

df_clean %>% 
  filter(`Spanish title` != "Mejor llama a Kiko" & `Spanish title` != "El zapping de Surferos") %>% 
  group_by(`Spanish title`) %>% 
  summarise(Emisiones = n()) %>% 
  filter(Emisiones > 1) %>% 
  arrange(desc(Emisiones)) %>% 
  head(10) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)

Spanish title	Emisiones
Querido fotogramas	16
Spanish Movie	8
Diario de Greg 2: La Ley de Rodrick	5
El diario de Greg	5
Locos por las motos	5
Lord Jim	5
Superlópez	5
Zombies 3	5
Amigos…	4
Apocalypse Now	4

The discrepancy between Spanish title (994) and Original title (986) count is striking. There must be one more duplicated value in the latter.

df_clean %>% 
  group_by(`Original title`) %>% 
  distinct() %>% 
  filter(n()>1) %>% 
  arrange(`Original title`) %>% 
  select(3:4) %>% 
  distinct(`Spanish title`) %>% 
  mutate(Count = n()) %>% 
  filter(Count >= 2) %>% 
  select(-Count) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)

Spanish title	Original title
Asteroides, el nuevo dorado	Astéroïdes, le nouvel eldorado spatial?
Asteroides, el nuevo El Dorado	Astéroïdes, le nouvel eldorado spatial?
El cabo del miedo	Cape Fear
El Cabo del Terror	Cape Fear
El justiciero de la ciudad	Death Wish
El justiciero	Death Wish
Dragon: la vida de Bruce Lee	Dragon: The Bruce Lee Story
Dragón: la vida de Bruce Lee	Dragon: The Bruce Lee Story
Aventura en el centro de la Tierra	Journey to the Center of the Earth
Viaje al centro de la tierra	Journey to the Center of the Earth
Robin Hood, el magnífico	Robin Hood
Robin Hood	Robin Hood
El Padrino II	The Godfather: Part II
El Padrino parte II	The Godfather: Part II
El último patriota	The Patriot
El patriota	The Patriot
El estrafalario prisionero de Zenda	The Prisoner of Zenda
El prisionero de Zenda	The Prisoner of Zenda
The Score (Un golpe maestro)	The Score
Un golpe maestro	The Score

As you can see, some different films share the same Original title. If this happens more times from now on, we will now. Above is the list of every film shown more than once.

Film length

How long are the films shown in TV?

This is the distribution of film length. The mean is 99 and the median 99.

library(ggplot2)
df_clean %>%
  drop_na() %>%
  ggplot(aes(x = Length)) +
    geom_histogram(aes(y=..density..), binwidth = 5) +
    geom_density(alpha = 0.2, fill = "#FF6666") +
    theme_classic()

Let’s see how it is by channel.

df_clean %>% 
  drop_na() %>% 
  ggplot(aes(x = Length)) +
    geom_histogram(aes(y=..density..), binwidth = 5) +
    geom_density(alpha = 0.2, fill = "#FF6666") +
    facet_wrap(~ Channel) +
    theme_classic()

A violin plot (a boxplot would be useful too) may be easier to interpret. It is remarkable how similar are lengths in Antena 3 channel.

df_clean %>%  
  ggplot(aes(x = Length, y = reorder(Channel, Length, median),
             color = Channel, fill = Channel)) +
  geom_violin(size = 0.9, alpha = 0.2, show.legend = F) +
  theme_classic()

four_shortest <- df_clean %>% 
  group_by(Channel) %>% 
  summarise(Median = median(Length)) %>% 
  arrange(Median) %>% 
  slice_head(n = 4) %>% 
  select(-Median) %>% 
  as.list() %>% 
  unname() %>%
  unlist()

Why are Antena 3, laSexta, La 1 and La 2 broadcasting the shortest films? Let’s have a glimpse.

library(stringr)
library(kableExtra)
library(lubridate)

df_clean %>% 
  mutate("Weekday" = weekdays.Date(Date, abbreviate = F)) %>% 
  relocate("Weekday", .after = Date) %>% 
  transmute(
    Date = Date,
    Channel = Channel,
    `Spanish title` = `Spanish title`,
    "Weekday" = case_when(
      str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("lunes|martes|miercoles|jueves", ignore_case = T)) ~ "Laboral",
      str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("viernes|sabado|domingo", ignore_case = T)) ~ "Finde"
      ),
    Length = Length
    ) %>% 
  filter(Channel == four_shortest) %>% 
  arrange(Length) %>%
  head(15) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)

Date	Channel	Spanish title	Weekday	Length
2022-09-14 22:56:00	La 2	Grace Kelly: Los millones perdidos	Laboral	45
2022-10-21 04:30:00	La 2	La expulsión de los moriscos	Finde	60
2022-09-13 13:10:00	La 2	El sonido de un tambor	Laboral	70
2022-09-15 13:15:00	La 2	El último lobo	Laboral	71
2022-10-15 17:45:00	Antena 3	Obsesión del pasado	Finde	79
2022-10-02 05:30:00	La 2	Una esvástica sobre el Bidasoa	Finde	80
2022-10-07 23:59:00	La 2	El elegido	Finde	80
2022-10-10 02:50:00	La 1	Los médicos y yo	Laboral	80
2022-10-12 19:30:00	Antena 3	Viviendo con una extraña	Laboral	81
2022-10-09 02:05:00	La 1	Amor al sol	Finde	82
2022-10-23 16:00:00	Antena 3	La acusación de Rachel	Finde	82
2022-09-25 16:00:00	Antena 3	Un encuentro peligroso	Finde	83
2022-10-06 00:40:00	La 2	Con la pata quebrada	Laboral	83
2022-10-08 19:30:00	Antena 3	Muerte en familia	Finde	83
2022-10-22 16:00:00	Antena 3	Belleza mortal	Finde	83

It is obvious how short are these films; most of them are broadcasted during the afternoon, starting between 16 p.m. and 19 p.m. Most of these films are also shown on weekends (remember, including Fridays).

Now we transform Length in three categories being (1) less than 90 minutes, (2) between 90 and 150 minutes and (3) more than 150 minutes.

df_clean <- df_clean %>% 
  mutate(
    length_category = case_when(
      Length <= 90 ~ "Less than 90 minutes",
      Length > 90 & Length <= 150 ~ "Between 90 and 150 minutes",
      Length > 150 ~ "More than 150 minutes",
      TRUE ~ "Rest"
    )
  )

Not color-blind graphic

perc_data <- df_clean %>% 
    mutate(length_category = factor(length_category, levels = c("Less than 90 minutes", 
                                                      "Between 90 and 150 minutes", "More than 150 minutes"))) %>% 
  group_by(Channel) %>% 
  mutate(Channel_shown = n()) %>% 
  group_by(Channel, length_category) %>% 
  summarise(perc = n()/max(Channel_shown))

library(scales) # For percent

perc_data %>% 
  ggplot(aes(x = length_category, y = perc, fill = length_category)) +
  geom_bar(stat = 'identity', show.legend = "none") +
  facet_wrap(~ Channel) +
  geom_text(aes(label = percent(perc)), vjust = -0.2, size = 2.3) +
  theme_classic() +
  theme(axis.text.x = element_blank())

Color-blind graphic

library(ggtext)
library(thematic)
# Color-blind safe colors
coloritos <- thematic::okabe_ito(3)

perc_data %>% 
  ggplot(aes(x = length_category, y = perc, fill = length_category)) +
  geom_bar(stat = 'identity', show.legend = "none") +
  scale_fill_manual(values = coloritos) +
  facet_wrap(~ Channel) +
  geom_text(aes(label = percent(perc)), vjust = -0.2, size = 2.3) +
  theme_classic() +
  theme(axis.text.x = element_blank())

By date

This is a list of the number of films shown each day.

library(stringr)
library(kableExtra)
library(lubridate)

df_clean_film_day <- df_clean %>% 
  mutate("Weekday" = weekdays.Date(Date, abbreviate = F)) %>% 
  relocate("Weekday", .after = Date) %>% 
  transmute(
    Date = Date,
    "Weekday" = case_when(
      str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("lunes|martes|miercoles|jueves", ignore_case = T)) ~ "Laboral",
      str_detect(iconv(Weekday, to='ASCII//TRANSLIT'), regex("viernes|sabado|domingo", ignore_case = T)) ~ "Finde"
      )
    ) %>% 
  group_by(
    Date = as.Date(Date), Weekday) %>% 
  summarise(Films = n()) 
  

df_clean_film_day %>% 
  head(10) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)

Date	Weekday	Films
2022-09-13	Laboral	11
2022-09-14	Laboral	13
2022-09-15	Laboral	11
2022-09-16	Finde	20
2022-09-17	Finde	38
2022-09-18	Finde	34
2022-09-19	Laboral	17
2022-09-20	Laboral	22
2022-09-21	Laboral	23
2022-09-22	Laboral	27

There is a surprising spike on Wednesday 12th October which is not that surpring when I tell you that it is Spanish national day.

df_clean_film_day %>% 
  ggplot(aes(x = Date, y = Films, fill = Weekday)) +
  geom_col() +
  theme_classic()

Mapping the data (or analyzing the country variable)

First, we have to prepare the data. Country names are in Spanish, let’s translate them.

unique(df_clean$Country)

##  [1] "Estados Unidos"     "Canadá"             "Alemania"          
##  [4] "España"             "Italia"             "Francia"           
##  [7] "Reino Unido"        "Dinamarca"          "Sudáfrica"         
## [10] "Australia"          "Corea del Sur"      "Tailandia"         
## [13] "Suecia"             "Rusia"              "China"             
## [16] "Hong Kong"          "Islandia"           "Argentina"         
## [19] "República Checa"    "Noruega"            "Países Bajos"      
## [22] "Bélgica"            "Turquía"            "Taiwán"            
## [25] "Alemania del Oeste" "Irlanda"            "Ucrania"           
## [28] NA                   "Japón"              "México"

I will check for empty values in the Country variable. If there isn’t we will continue.

df_clean %>% 
  filter(Country == "")

## # A tibble: 0 x 9
## # ... with 9 variables: Date <dttm>, Channel <chr>, Spanish title <chr>,
## #   Original title <chr>, Year <dbl>, Genre <chr>, Country <chr>, Length <dbl>,
## #   length_category <chr>

df_map <- df_clean %>% 
  group_by(`Country`) %>% 
  summarise(Movies = n())

df_map <- df_map %>% 
  mutate(
    Country = case_when(
      Country == "Alemania" ~ "Germany",
      Country == "Alemania del Oeste" ~ "Germany",
      Country == "Argentina" ~ "Argentina",
      Country == "Australia" ~ "Australia",
      Country == "Bélgica" ~ "Belgium",
      Country == "Canadá" ~ "Canada",
      Country == "China" ~ "China",
      Country == "Corea del Sur" ~ "South Korea",
      Country == "Dinamarca" ~ "Denmark",
      Country == "España" ~ "Spain",
      Country == "Estados Unidos" ~ "United States of America",
      Country == "Francia" ~ "France",
      Country == "Hong Kong" ~ "Hong Kong",
      Country == "Irlanda" ~ "Ireland", 
      Country == "Islandia" ~ "Island",
      Country == "Italia" ~ "Italy",
      Country == "Noruega" ~ "Norway",
      Country == "Países Bajos" ~ "Netherlands",
      Country == "Reino Unido" ~ "United Kingdom",
      Country == "República Checa" ~ "Czech Republic",
      Country == "Rusia" ~ "Russia",
      Country == "Sudáfrica" ~ "South Africa",
      Country == "Suecia" ~ "Sweden",
      Country == "Tailandia" ~ "Tailand",
      Country == "Taiwán" ~ "Taiwan",
      Country == "Turquía" ~ "Turkey",
      Country == "Ucrania" ~ "Ukraine",
      is.na(Country) == TRUE ~ "Sin datos"
    )
  )

df_map %>% 
  arrange(desc(Movies)) %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = T)

Country	Movies
United States of America	924
Spain	248
Sin datos	99
United Kingdom	56
Germany	47
Canada	44
France	28
Italy	19
Australia	9
South Korea	6
Norway	5
Argentina	2
China	2
Hong Kong	2
Ireland	2
Russia	2
Sweden	2
Germany	1
Belgium	1
Denmark	1
Island	1
NA	1
NA	1
Netherlands	1
Czech Republic	1
South Africa	1
Tailand	1
Taiwan	1
Turkey	1
Ukraine	1

First, let’s use a bar plot. Mmm, we do seem to like USA films…

df_map %>% 
  ggplot(aes(x = reorder(Country, Movies), y = Movies)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme_classic()

Then, we plot a map.

library(sf)
library(rnaturalearth)

world <- ne_countries(scale = "small", returnclass = "sf")

world %>% 
  ggplot() + 
  geom_sf() +
  geom_hline(yintercept = 0, linetype = "dashed")

world <- world %>% 
  dplyr::rename("Country" = "sovereignt")

df_world <- left_join(world, df_map)

Let’s see how it looks filled.

library(ggplot2)
library(sf)
df_world %>% 
  ggplot() +
  geom_sf(aes(fill = Movies)) +
  theme_void() +
  theme(legend.position = "top") +
  labs(fill = "Number of movies:") +
  guides(fill = guide_legend(nrow = 2, byrow = TRUE))

I will plot the same map but without United States (and maybe without Spain too). The scale is very confusing as it is.

df_world %>% 
  filter(Country != "United States of America" & Country != "Spain") %>% 
  ggplot() +
  geom_sf(aes(fill = Movies)) +
  theme_void() +
  theme(legend.position = "top") +
  labs(fill = "Number of movies:") +
  guides(fill = guide_legend(nrow = 2, byrow = TRUE))

I want to see the Country distribution by channel; for example, to analize which channel promote Spanish cinema. Or how the public channels behave.

library(forcats)
df_clean %>%
  group_by(Channel, Country) %>% 
  count(Country) %>% 
  ggplot(aes(x = fct_lump(Country, prop = 0.05), y = n)) +
  geom_col() +
  facet_wrap(~Channel) +
  coord_flip() +
  theme_classic()

This we we can see better how Countries are distributed within each channel.

perc_data_2 <- df_clean %>% 
  group_by(Channel) %>% 
  mutate(Channel_shown = n()) %>% 
  group_by(Channel, Country, Channel_shown) %>% 
  summarise(perc = n()/max(Channel_shown))

perc_data_2 %>% 
  ggplot(aes(x = fct_lump(Country, prop = 0.05), y = perc)) +
  geom_col() +
  facet_wrap(~Channel) +
  coord_flip() +
  theme_classic()

I want to do the same but removing films from United States.

perc_data_3 <- df_clean %>% 
  filter(Country != "Estados Unidos") %>% 
  group_by(Channel) %>% 
  mutate(Channel_shown = n()) %>% 
  group_by(Channel, Country, Channel_shown) %>% 
  summarise(perc = n()/max(Channel_shown))
  
  
perc_data_3 %>%
  ggplot(aes(x = fct_lump(Country, prop = 0.05), y = perc)) +
  geom_col() +
  facet_wrap(~Channel) +
  coord_flip() +
  theme_classic()

Spanish TDT movies report

Guillermo Diaz

2022-10-29