Olá a todos :) O intuito desse post é explorar as técnicas de agrupamento nos dados dos filmes em que uma determinada celebridade participou, por esse motivo decidi usar como exemplo o ator Nicolas Cage, pois acredito que ele trabalhou nos mais diversos dos mais variados e inusitados gêneros de filmes.

require(tidyverse, warn.conflicts = F)
require(here)
library(dplyr)
library(corrplot)
library(gridExtra)
library(GGally)
library(knitr)
library(plotly)

filmes = read_csv(
    here("data/nicolas_cage.csv"),
    col_types = cols(
      rating = col_double(),
      audience_score = col_double(),
      title = col_character(),
      credit = col_character(),
      box_office = col_double(),
      year = col_double()
    )
)

Vamos analisar quais colunas temos nos dados refentes aos filmes do Nicolas.

Oque queremos entender aqui é se podemos agrupar os filmes do Nicolas Cage em grupos que façam sentido, grupos que expressem quais filmes foram os mais bem avaliados ou com maior bilheteria.

filmes %>% 
    ggplot(aes(x = year, y = box_office)) + 
    geom_point(size = 4, color = '#C4961A') + 
    labs(
        x = ('Ano'),
        y = ('Bilheteria')
    ) +
    scale_y_log10()

Na visualização acima usamos a escala logarítmica pois as bilheterias possuem valores muito dispersos e dessa maneira podemos visualizar melhor os dados. Por exemplo, podemos observar que existe uma concentração maior de filmes que estão entre as escalas de multiplicador 10 a 100.

filmes %>% 
    ggplot(aes(x = box_office)) + 
    geom_histogram(binwidth = 15, fill = paleta[2], color = "black") + 
    geom_rug(size = .5) +   
    labs(
        x = ('Bilheteria')
    )

Neste próximo gráfico podemos ver melhor a informação dita acima, no eixo y, temos a quantidade de filmes que tiveram a participação do Nicolas Cage, onde, há uma concentração maior no filmes entre 0 a 100 milhões, mais precisamente entre 0 a 50 milhões, conseguimos aqui, aguçar um pouco nossa visão sobre os dados.

filmes %>% 
    ggplot(aes(x = rating)) + 
    geom_histogram(binwidth = 10, boundary = 0, fill = paleta[3], color = "black") + 
    geom_rug(size = .5) 

O gráfico acima é um histograma que nos ajuda a visualizar as notas dos filmes do Nico, oque é interessante, pois o mesmo já fez filmes de todos os tipos, jeitos e maneiras, oque pode ter influenciado na nota mediana dos seus filmes.

mean(filmes$rating)
[1] 54.98039

Como dito, é um cara bem na média.

Que tal, tentarmos inserir a bilheteria dos filmes e os ratings de cada um deles em um algorítimo de clustering e tentar entender quais grupos podemos identificar?

data <- filmes %>%
            select(rating, box_office)

# Normalization
dataNorm <- as.data.frame(scale(data))

# Original data
p1 <- ggplot(data, aes(x=box_office, y=rating)) +
  geom_point() +
  labs(title="Original data") +
  theme_bw()

# Normalized data 
p2 <- ggplot(dataNorm, aes(x=box_office, y=rating)) +
  geom_point() +
  labs(title="Normalized data") +
  theme_bw()

# Subplot
grid.arrange(p1, p2, ncol=2)

Primeiramente precisamos normalizar nossos dados numéricos, avaliação e bilheteria, para que eles pertençam as mesmas escalas. Acima podemos comparar graficamente os dados orignais e os dados normalizados.

# Execution of k-means with k=2
set.seed(1234)
filmes_k2 <- kmeans(dataNorm, centers=5)

Agora, vamos primeiramente instanciar nosso algorítimo de agrupamente tentando fazer com que ele identifique para nós, dois grupos!

# Clustering 
graph <- filmes %>% 
    ggplot(aes(label = title, x = rating, y = box_office, colour = filmes_k2$cluster)) + 
    geom_point(size = 2, colour = filmes_k2$cluster) + 
labs(
        x = "Avaliação",
        y = "Bilheteria"
    )
    
ggplotly(graph)

No gráfico acima podemos visualizar os 5 grupos escolhidos pelo algorítmo, algo interessante e que eu queria ver desde o início da análise era se os filmes da lenda do tesouro perdido foram agrupados no mesmo grupo, e por minha felicidade foram sim! Estão presentes no grupo azul escuro, sendo dois filmes com uma das maiores bilheterias, aliás! o grupo azul escuro são os de bilheteria maior/boa aceitação. Em contrapartida, o grupo preto teve as melhores notas mas com um baixa bilheteria =/, vou procurar assisti-los em outro momento rs.

LS0tDQp0aXRsZTogIkFuw6FsaXNlIGRhcyBwYXJ0aWNpcGHDp8O1ZXMgZW0gZmlsbWVzIGRvIE5pY29sYXMgQ2FnZSAoQWdydXBhbWVudG8pIg0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KT2zDoSBhIHRvZG9zIDopDQpPIGludHVpdG8gZGVzc2UgcG9zdCDDqSBleHBsb3JhciBhcyB0w6ljbmljYXMgZGUgYWdydXBhbWVudG8gbm9zIGRhZG9zIGRvcyBmaWxtZXMgZW0gcXVlIHVtYSBkZXRlcm1pbmFkYSBjZWxlYnJpZGFkZSBwYXJ0aWNpcG91LCBwb3IgZXNzZSBtb3Rpdm8gZGVjaWRpIHVzYXIgY29tbyBleGVtcGxvIG8gYXRvciBOaWNvbGFzIENhZ2UsIHBvaXMgYWNyZWRpdG8gcXVlIGVsZSB0cmFiYWxob3Ugbm9zIG1haXMgZGl2ZXJzb3MgZG9zIG1haXMgdmFyaWFkb3MgZSBpbnVzaXRhZG9zIGfDqm5lcm9zIGRlIGZpbG1lcy4NCg0KYGBge3Igd2FybmluZz1GQUxTRSwgbWVzc2FnZT1GQUxTRX0NCnJlcXVpcmUodGlkeXZlcnNlLCB3YXJuLmNvbmZsaWN0cyA9IEYpDQpyZXF1aXJlKGhlcmUpDQpsaWJyYXJ5KGRwbHlyKQ0KbGlicmFyeShjb3JycGxvdCkNCmxpYnJhcnkoZ3JpZEV4dHJhKQ0KbGlicmFyeShHR2FsbHkpDQpsaWJyYXJ5KGtuaXRyKQ0KbGlicmFyeShwbG90bHkpDQoNCmZpbG1lcyA9IHJlYWRfY3N2KA0KICAgIGhlcmUoImRhdGEvbmljb2xhc19jYWdlLmNzdiIpLA0KICAgIGNvbF90eXBlcyA9IGNvbHMoDQogICAgICByYXRpbmcgPSBjb2xfZG91YmxlKCksDQogICAgICBhdWRpZW5jZV9zY29yZSA9IGNvbF9kb3VibGUoKSwNCiAgICAgIHRpdGxlID0gY29sX2NoYXJhY3RlcigpLA0KICAgICAgY3JlZGl0ID0gY29sX2NoYXJhY3RlcigpLA0KICAgICAgYm94X29mZmljZSA9IGNvbF9kb3VibGUoKSwNCiAgICAgIHllYXIgPSBjb2xfZG91YmxlKCkNCiAgICApDQopDQpgYGANCg0KVmFtb3MgYW5hbGlzYXIgcXVhaXMgY29sdW5hcyB0ZW1vcyBub3MgZGFkb3MgcmVmZW50ZXMgYW9zIGZpbG1lcyBkbyBOaWNvbGFzLg0KDQotIFRvbWF0b21ldGVyOiBFc3NhIG3DqXRyaWNhIMOpIGJhc2VhZGEgbmFzIG9waW5pw7VlcyBkZSBjZW50ZW5hcyBkZSBjcsOtdGljb3MgZGUgY2luZW1hIGUgdGVsZXZpc8OjbyAtIMOpIHVtYSBtZWRpZGEgY29uZmnDoXZlbCBkZSByZWNvbWVuZGHDp8OjbyBjcsOtdGljYSBwYXJhIG1pbGjDtWVzIGRlIGbDo3MuDQotIEF1ZGllbmNlIFNjb3JlOiBQZXJjZW50dWFsIGRlIHVzdcOhcmlvcyBxdWUgdm90YXJhbSBwb3NpdGl2YW1lbnRlIHVtIGRldGVybWluYWRvIGZpbG1lLg0KLSBUaXRsZTogRGlzcGVuc2EgY29tZW50w6FyaW8sIMOpIG8gbm9tZSBkbyBmaWxtZS4NCi0gQ3JlZGl0OiBQZXJzb25hZ2VtIHF1ZSBvIGF0b3IgYnVzY2FkbyBwYXJ0aWNpcG91IG5vIGZpbG1lLiANCi0gQm94IE9mZmljZTogQmlsaGV0ZXJpYSBkbyBmaWxtZS4NCi0gWWVhcjogRGF0YSBkZSBsYW7Dp2FtZW50byBkbyBmaWxtZS4NCg0KT3F1ZSBxdWVyZW1vcyBlbnRlbmRlciBhcXVpIMOpIHNlIHBvZGVtb3MgYWdydXBhciBvcyBmaWxtZXMgZG8gTmljb2xhcyBDYWdlIGVtIGdydXBvcyBxdWUgZmHDp2FtIHNlbnRpZG8sIGdydXBvcyBxdWUgZXhwcmVzc2VtIHF1YWlzIGZpbG1lcyBmb3JhbSBvcyBtYWlzIGJlbSBhdmFsaWFkb3Mgb3UgY29tIG1haW9yIGJpbGhldGVyaWEuDQoNCmBgYHtyfQ0KZmlsbWVzICU+JSANCiAgICBnZ3Bsb3QoYWVzKHggPSB5ZWFyLCB5ID0gYm94X29mZmljZSkpICsgDQogICAgZ2VvbV9wb2ludChzaXplID0gNCwgY29sb3IgPSAnI0M0OTYxQScpICsgDQogICAgbGFicygNCiAgICAgICAgeCA9ICgnQW5vJyksDQogICAgICAgIHkgPSAoJ0JpbGhldGVyaWEnKQ0KICAgICkgKw0KICAgIHNjYWxlX3lfbG9nMTAoKQ0KYGBgDQpOYSB2aXN1YWxpemHDp8OjbyBhY2ltYSB1c2Ftb3MgYSBlc2NhbGEgbG9nYXLDrXRtaWNhIHBvaXMgYXMgYmlsaGV0ZXJpYXMgcG9zc3VlbSB2YWxvcmVzIG11aXRvIGRpc3BlcnNvcyBlIGRlc3NhIG1hbmVpcmEgcG9kZW1vcyB2aXN1YWxpemFyIG1lbGhvciBvcyBkYWRvcy4gUG9yIGV4ZW1wbG8sIHBvZGVtb3Mgb2JzZXJ2YXIgcXVlIGV4aXN0ZSB1bWEgY29uY2VudHJhw6fDo28gbWFpb3IgZGUgZmlsbWVzIHF1ZSBlc3TDo28gZW50cmUgYXMgZXNjYWxhcyBkZSBtdWx0aXBsaWNhZG9yIDEwIGEgMTAwLg0KDQpgYGB7cn0NCmZpbG1lcyAlPiUgDQogICAgZ2dwbG90KGFlcyh4ID0gYm94X29mZmljZSkpICsgDQogICAgZ2VvbV9oaXN0b2dyYW0oYmlud2lkdGggPSAxNSwgZmlsbCA9IHBhbGV0YVsyXSwgY29sb3IgPSAiYmxhY2siKSArIA0KICAgIGdlb21fcnVnKHNpemUgPSAuNSkgKyAgIA0KICAgIGxhYnMoDQogICAgICAgIHggPSAoJ0JpbGhldGVyaWEnKQ0KICAgICkNCmBgYA0KTmVzdGUgcHLDs3hpbW8gZ3LDoWZpY28gcG9kZW1vcyB2ZXIgbWVsaG9yIGEgaW5mb3JtYcOnw6NvIGRpdGEgYWNpbWEsIG5vIGVpeG8geSwgdGVtb3MgYSBxdWFudGlkYWRlIGRlIGZpbG1lcyBxdWUgdGl2ZXJhbSBhIHBhcnRpY2lwYcOnw6NvIGRvIE5pY29sYXMgQ2FnZSwgb25kZSwgaMOhIHVtYSBjb25jZW50cmHDp8OjbyBtYWlvciBubyBmaWxtZXMgZW50cmUgMCBhIDEwMCBtaWxow7VlcywgbWFpcyBwcmVjaXNhbWVudGUgZW50cmUgMCBhIDUwIG1pbGjDtWVzLCBjb25zZWd1aW1vcyBhcXVpLCBhZ3XDp2FyIHVtIHBvdWNvIG5vc3NhIHZpc8OjbyBzb2JyZSBvcyBkYWRvcy4NCg0KYGBge3J9DQpmaWxtZXMgJT4lIA0KICAgIGdncGxvdChhZXMoeCA9IHJhdGluZykpICsgDQogICAgZ2VvbV9oaXN0b2dyYW0oYmlud2lkdGggPSAxMCwgYm91bmRhcnkgPSAwLCBmaWxsID0gcGFsZXRhWzNdLCBjb2xvciA9ICJibGFjayIpICsgDQogICAgZ2VvbV9ydWcoc2l6ZSA9IC41KSANCmBgYA0KTyBncsOhZmljbyBhY2ltYSDDqSB1bSBoaXN0b2dyYW1hIHF1ZSBub3MgYWp1ZGEgYSB2aXN1YWxpemFyIGFzIG5vdGFzIGRvcyBmaWxtZXMgZG8gTmljbywgb3F1ZSDDqSBpbnRlcmVzc2FudGUsIHBvaXMgbyBtZXNtbyBqw6EgZmV6IGZpbG1lcyBkZSB0b2RvcyBvcyB0aXBvcywgamVpdG9zIGUgbWFuZWlyYXMsIG9xdWUgcG9kZSB0ZXIgaW5mbHVlbmNpYWRvIG5hIG5vdGEgbWVkaWFuYSBkb3Mgc2V1cyBmaWxtZXMuDQoNCmBgYHtyfQ0KbWVhbihmaWxtZXMkcmF0aW5nKQ0KYGBgDQpDb21vIGRpdG8sIMOpIHVtIGNhcmEgYmVtIG5hIG3DqWRpYS4NCg0KUXVlIHRhbCwgdGVudGFybW9zIGluc2VyaXIgYSBiaWxoZXRlcmlhIGRvcyBmaWxtZXMgZSBvcyByYXRpbmdzIGRlIGNhZGEgdW0gZGVsZXMgZW0gdW0gYWxnb3LDrXRpbW8gZGUgY2x1c3RlcmluZyBlIHRlbnRhciBlbnRlbmRlciBxdWFpcyBncnVwb3MgcG9kZW1vcyBpZGVudGlmaWNhcj8NCg0KYGBge3J9DQpkYXRhIDwtIGZpbG1lcyAlPiUNCiAgICAgICAgICAgIHNlbGVjdChyYXRpbmcsIGJveF9vZmZpY2UpDQoNCiMgTm9ybWFsaXphdGlvbg0KZGF0YU5vcm0gPC0gYXMuZGF0YS5mcmFtZShzY2FsZShkYXRhKSkNCg0KIyBPcmlnaW5hbCBkYXRhDQpwMSA8LSBnZ3Bsb3QoZGF0YSwgYWVzKHg9Ym94X29mZmljZSwgeT1yYXRpbmcpKSArDQogIGdlb21fcG9pbnQoKSArDQogIGxhYnModGl0bGU9Ik9yaWdpbmFsIGRhdGEiKSArDQogIHRoZW1lX2J3KCkNCg0KIyBOb3JtYWxpemVkIGRhdGEgDQpwMiA8LSBnZ3Bsb3QoZGF0YU5vcm0sIGFlcyh4PWJveF9vZmZpY2UsIHk9cmF0aW5nKSkgKw0KICBnZW9tX3BvaW50KCkgKw0KICBsYWJzKHRpdGxlPSJOb3JtYWxpemVkIGRhdGEiKSArDQogIHRoZW1lX2J3KCkNCg0KIyBTdWJwbG90DQpncmlkLmFycmFuZ2UocDEsIHAyLCBuY29sPTIpDQpgYGANClByaW1laXJhbWVudGUgcHJlY2lzYW1vcyBub3JtYWxpemFyIG5vc3NvcyBkYWRvcyBudW3DqXJpY29zLCBhdmFsaWHDp8OjbyBlIGJpbGhldGVyaWEsIHBhcmEgcXVlIGVsZXMgcGVydGVuw6dhbSBhcyBtZXNtYXMgZXNjYWxhcy4NCkFjaW1hIHBvZGVtb3MgY29tcGFyYXIgZ3JhZmljYW1lbnRlIG9zIGRhZG9zIG9yaWduYWlzIGUgb3MgZGFkb3Mgbm9ybWFsaXphZG9zLg0KDQpgYGB7cn0NCiMgRXhlY3V0aW9uIG9mIGstbWVhbnMgd2l0aCBrPTINCnNldC5zZWVkKDEyMzQpDQpmaWxtZXNfazIgPC0ga21lYW5zKGRhdGFOb3JtLCBjZW50ZXJzPTUpDQpgYGANCg0KQWdvcmEsIHZhbW9zIHByaW1laXJhbWVudGUgaW5zdGFuY2lhciBub3NzbyBhbGdvcsOtdGltbyBkZSBhZ3J1cGFtZW50ZSB0ZW50YW5kbyBmYXplciBjb20gcXVlIGVsZSBpZGVudGlmaXF1ZSBwYXJhIG7Ds3MsIGRvaXMgZ3J1cG9zIQ0KDQpgYGB7cn0NCiMgQ2x1c3RlcmluZyANCmdyYXBoIDwtIGZpbG1lcyAlPiUgDQogICAgZ2dwbG90KGFlcyhsYWJlbCA9IHRpdGxlLCB4ID0gcmF0aW5nLCB5ID0gYm94X29mZmljZSwgY29sb3VyID0gZmlsbWVzX2syJGNsdXN0ZXIpKSArIA0KICAgIGdlb21fcG9pbnQoc2l6ZSA9IDIsIGNvbG91ciA9IGZpbG1lc19rMiRjbHVzdGVyKSArIA0KbGFicygNCiAgICAgICAgeCA9ICJBdmFsaWHDp8OjbyIsDQogICAgICAgIHkgPSAiQmlsaGV0ZXJpYSINCiAgICApDQogICAgDQpnZ3Bsb3RseShncmFwaCkNCmBgYA0KDQpObyBncsOhZmljbyBhY2ltYSBwb2RlbW9zIHZpc3VhbGl6YXIgb3MgNSBncnVwb3MgZXNjb2xoaWRvcyBwZWxvIGFsZ29yw610bW8sIGFsZ28gaW50ZXJlc3NhbnRlIGUgcXVlIGV1IHF1ZXJpYSB2ZXIgZGVzZGUgbyBpbsOtY2lvIGRhIGFuw6FsaXNlIGVyYSBzZSBvcyBmaWxtZXMgZGEgbGVuZGEgZG8gdGVzb3VybyBwZXJkaWRvIGZvcmFtIGFncnVwYWRvcyBubyBtZXNtbyBncnVwbywgZSBwb3IgbWluaGEgZmVsaWNpZGFkZSBmb3JhbSBzaW0hIEVzdMOjbyBwcmVzZW50ZXMgbm8gZ3J1cG8gYXp1bCBlc2N1cm8sIHNlbmRvIGRvaXMgZmlsbWVzIGNvbSB1bWEgZGFzIG1haW9yZXMgYmlsaGV0ZXJpYXMsIGFsacOhcyEgbyBncnVwbyBhenVsIGVzY3VybyBzw6NvIG9zIGRlIGJpbGhldGVyaWEgbWFpb3IvYm9hIGFjZWl0YcOnw6NvLiBFbSBjb250cmFwYXJ0aWRhLCBvIGdydXBvIHByZXRvIHRldmUgYXMgbWVsaG9yZXMgbm90YXMgbWFzIGNvbSB1bSBiYWl4YSBiaWxoZXRlcmlhID0vLCB2b3UgcHJvY3VyYXIgYXNzaXN0aS1sb3MgZW0gb3V0cm8gbW9tZW50byBycy4NCg0K