The goal of this article is to analyze Oasis lyrics with a set of tools for web scraping and text analysis, all in R.² Both techniques are becoming very popular in the world of data analysis. Web scraping makes it possible to access data published on websites; in this case, thanks to the enormous help of Juan Bosco Mendoza Vega's great analysis, I had access to examples of very simple and efficient extraction functions. Along the same lines, text analysis techniques make it possible to systematize the study of the words used in a given context, with the aim of creating information from that data.
With that said, let's talk about Oasis. Formed in 1991 in Manchester by the Gallagher brothers, Oasis was one of the most important bands of the so-called Britpop movement. Their lyrics mix a variety of topics and feelings. Over the group's 18 years of activity they released 7 albums, which we will explore in detail.
First, we load the libraries we will work with.
# Remove all objects from memory
rm(list = ls())
# Set the working directory
setwd("C:/Users/Guille/Dropbox/R/Data/Oasis")
# Load the working libraries
library(rvest)
library(httr)
library(xml2)
library(jsonlite)
library(tidyverse)
library(tidytext)
library(lubridate)
library(scales)
library(textdata)
library(wordcloud)
library(RColorBrewer)
library(highcharter)
library(kableExtra)
library(knitr)
Now let's look at a brief summary of Oasis's albums and their sales.
To obtain the Oasis lyrics we first need to scrape the MusicBrainz site, from which we will take the album and song names. With our list of names, the Orion API will give us access to the lyrics of the songs it finds. In this way we will build a data frame containing the lyrics of each song by album.
musicbrainz_html <- read_html("https://musicbrainz.org/release/d7749b04-2bff-409f-a647-cd5c2c75432b")
ListaDiscos <-
c(
"https://musicbrainz.org/release/9822581d-98bf-3f97-a94c-4b1350d090aa",
"https://musicbrainz.org/release/ed286309-6e64-48e3-835b-61aaf86cdb86",
"https://musicbrainz.org/release/1e84c431-4c9f-438a-a7dc-4b429f449993",
"https://musicbrainz.org/release/a807034d-09ee-3d4c-9566-06d114c1fc6c",
"https://musicbrainz.org/release/22dacc34-e04f-4b9e-97a5-3dedd3b0a56e",
"https://musicbrainz.org/release/9717efc0-0436-3cd0-9e52-31d533a4c026",
"https://musicbrainz.org/release/e2120df2-9a4e-4ab9-a83a-0bb827670624"
)
musicbrainz_html %>%
html_nodes(css = "tbody tr") %>%
html_text() %>%
str_split(pattern = "\\n", simplify = TRUE) %>%
data.frame() %>%
as_tibble() %>%
slice(-1) %>%
select(song = X4) %>%
mutate_all(trimws)
musicbrainz_html %>%
html_nodes(css = ".releaseheader h1") %>%
html_text()
[1] "Definitely Maybe"
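The selector logic above can be tried offline on a small hand-written page. The sketch below is only an illustration: the HTML is invented to imitate a track listing (the real MusicBrainz markup may differ), and it picks the title cell by column position with `td:nth-child(2)` instead of splitting the row text on line breaks.

```r
library(rvest)

# Hypothetical HTML that imitates an album page; not copied from MusicBrainz
pagina <- read_html('
<html><body>
  <div class="releaseheader"><h1>Definitely Maybe</h1></div>
  <table><tbody>
    <tr><td>2</td><td> Shakermaker </td></tr>
    <tr><td>3</td><td> Live Forever </td></tr>
  </tbody></table>
</body></html>')

# Album title: same selector as in the article
album <- pagina %>%
  html_nodes(css = ".releaseheader h1") %>%
  html_text()

# Song titles: second cell of every row, with surrounding whitespace removed
titulos <- pagina %>%
  html_nodes(css = "tbody tr td:nth-child(2)") %>%
  html_text() %>%
  trimws()

canciones <- data.frame(cancion = titulos, album = album,
                        stringsAsFactors = FALSE)
canciones
```

Selecting a cell directly tends to be less fragile than relying on the fourth line of the row text, although which column holds the title on the real page would still need to be checked.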
obtener_canciones <- function(musicbrainz_url) {
mi_html <-
musicbrainz_url %>%
read_html()
nombre_album <-
mi_html %>%
html_nodes(css = ".releaseheader h1") %>%
html_text()
fecha_album <-
mi_html %>%
html_nodes(css = ".release-date") %>%
html_text()
canciones <-
mi_html %>%
html_nodes(css = "tbody tr") %>%
html_text() %>%
str_split(pattern = "\\n", simplify = TRUE) %>%
data.frame() %>%
as_tibble() %>%
slice(-1) %>%
select(cancion = X4) %>%
mutate_all(trimws)
canciones %>%
mutate(album = nombre_album, fecha = fecha_album)
}
obtener_canciones("https://musicbrainz.org/release/d7749b04-2bff-409f-a647-cd5c2c75432b")
We now have the song names, the album each one belongs to, and when it was released. Now let's fetch the lyrics:
Api <- "1xOnM8KcFZ2izYmDm5JrrJnKGxd2znSLhiClT5auONMTczIoOrppcA0AYB7HKrHc"
url_prueba <- paste0(
"https://orion.apiseeds.com/api/music/lyric/",
"Oasis/",
"Columbia",
"?apikey=",
Api
)
# Test request: the URL built above asks for the lyrics of "Columbia"
respuesta_prueba <- GET(url = url_prueba)
content(respuesta_prueba, as = "text", encoding = "UTF-8")
[1] "{\"result\":{\"artist\":{\"name\":\"Oasis\"},\"track\":{\"name\":\"Columbia\",\"text\":\"There we were, now here we are\\r\\nAll this confusion, nothing's the same to me\\r\\nThere we were, now here we are\\r\\nAll this confusion, nothing's the same to me\\r\\n\\r\\nI can't tell you the way I feel\\r\\nBecause the way I feel is oh so new to me\\r\\nI can't tell you the way I feel\\r\\nBecause the way I feel is oh so new to me\\r\\n\\r\\nWhat I heard is not what I hear\\r\\nI can see the signs but they're not very clear\\r\\nWhat I heard is not what I hear\\r\\nI can see the signs but they're not very clear\\r\\n\\r\\nSo I can't tell you the way I feel\\r\\nBecause the way I feel is oh so new to me\\r\\nI can't tell you the way I feel\\r\\nBecause the way I feel is oh so new to me\\r\\n\\r\\nThis is confusion, am I confusing you?\\r\\nThis is confusion, am I confusing you?\\r\\nThis is peculiar, we don't want to fool ya\\r\\nThis is peculiar, we don't want to fool ya\\r\\n\\r\\nYeah yeah yeah\\r\\nYeah yeah yeah\\r\\nYeah yeah yeah\\r\\nYeah yeah yeah\\r\\nYeah yeah yeah\\r\\nYeah yeah yeah\\r\\nYeah yeah yeah\",\"lang\":{\"code\":\"xx\",\"name\":\"????\"}},\"copyright\":{\"notice\":\"Columbia lyrics are property and copyright of their owners. Commercial use is not allowed.\",\"artist\":\"Copyright Oasis\",\"text\":\"All lyrics provided for educational purposes and personal use only.\"},\"probability\":100,\"similarity\":1}}"
Oasis <-
map(ListaDiscos, obtener_canciones) %>%
reduce(bind_rows)
Oasis_letras_lista <-
map(Oasis[["cancion"]], function(x){
ruta <- paste0(
"https://orion.apiseeds.com/api/music/lyric/Oasis/",
x,
"?apikey=",
Api
)
GET(url = ruta)
})
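Song titles contain spaces and sometimes punctuation, so it can be safer to percent-encode them before pasting them into the request path. The helper below is a hypothetical sketch (`construir_url` and the placeholder key `"MI_CLAVE"` are not part of the original code); whether the API or `httr` already handles unencoded titles has not been verified here.

```r
# Hypothetical helper: builds the request URL with the artist and song
# percent-encoded. The base URL is the one used in the article.
construir_url <- function(cancion, clave, artista = "Oasis") {
  paste0(
    "https://orion.apiseeds.com/api/music/lyric/",
    utils::URLencode(artista, reserved = TRUE), "/",
    utils::URLencode(cancion, reserved = TRUE),
    "?apikey=", clave
  )
}

# "MI_CLAVE" is a placeholder, not a real key
url_demo <- construir_url("Up In The Sky", "MI_CLAVE")
url_demo
```

With a helper like this, the anonymous function passed to `map()` would reduce to `GET(url = construir_url(x, Api))`.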
content(Oasis_letras_lista[[4]], as = "text", encoding = "UTF-8")
[1] "{\"result\":{\"artist\":{\"name\":\"Oasis\"},\"track\":{\"name\":\"Up In The Sky\",\"text\":\"Hey you! Up in the sky\\nLearning to fly\\nTell me how high\\nDo you think you'll go\\nBefore you start falling\\nHey you! Up in a tree\\nYou wanna be me\\nBut that couldn't be\\nCos the people here they don't hear you calling\\nHow does it feel\\nWhen you're inside me?\\n\\nHey you! wearing the crown\\nMaking no sound\\nI heard you feel down\\nWell that's too bad\\nWelcome to my world\\nHey you! Stealing the light\\nI heard that the shine's\\nGone out of your life.\\nWell that's just too bad\\nWelcome to my world\\nHow does it feel\\nWhen you're inside me?\\n\\nYou'll need assistance with the things that you have never ever seen\\nIt's just a case of never breathing out\\nBefore you've breathed it in\\nHow does it feel\\nWhen you're inside me?\\n\\nHey you! Up in the sky\\nLearning to fly\\nTell me how high\\nDo you think you'll go\\nBefore you start falling\\nHey you! Up in a tree\\nYou wanna be me\\nBut that couldn't be\\nCos the people here they don't hear you calling\\nHow does it feel\\nWhen you're inside me?\\n\\nYou'll need assistance with the things that you have never ever seen\\nIt's just a case of never breathing out\\nBefore you've breathed it in\\nHow does it feel\\nWhen you're inside me?\",\"lang\":{\"code\":\"xx\",\"name\":\"????\"}},\"copyright\":{\"notice\":\"Up In The Sky lyrics are property and copyright of their owners. Commercial use is not allowed.\",\"artist\":\"Copyright Oasis\",\"text\":\"All lyrics provided for educational purposes and personal use only.\"},\"probability\":100,\"similarity\":1}}"
content(Oasis_letras_lista[[4]], as = "text", encoding = "UTF-8") %>%
fromJSON()
$result
$result$artist
$result$artist$name
[1] "Oasis"
$result$track
$result$track$name
[1] "Up In The Sky"
$result$track$text
[1] "Hey you! Up in the sky\nLearning to fly\nTell me how high\nDo you think you'll go\nBefore you start falling\nHey you! Up in a tree\nYou wanna be me\nBut that couldn't be\nCos the people here they don't hear you calling\nHow does it feel\nWhen you're inside me?\n\nHey you! wearing the crown\nMaking no sound\nI heard you feel down\nWell that's too bad\nWelcome to my world\nHey you! Stealing the light\nI heard that the shine's\nGone out of your life.\nWell that's just too bad\nWelcome to my world\nHow does it feel\nWhen you're inside me?\n\nYou'll need assistance with the things that you have never ever seen\nIt's just a case of never breathing out\nBefore you've breathed it in\nHow does it feel\nWhen you're inside me?\n\nHey you! Up in the sky\nLearning to fly\nTell me how high\nDo you think you'll go\nBefore you start falling\nHey you! Up in a tree\nYou wanna be me\nBut that couldn't be\nCos the people here they don't hear you calling\nHow does it feel\nWhen you're inside me?\n\nYou'll need assistance with the things that you have never ever seen\nIt's just a case of never breathing out\nBefore you've breathed it in\nHow does it feel\nWhen you're inside me?"
$result$track$lang
$result$track$lang$code
[1] "xx"
$result$track$lang$name
[1] "????"
$result$copyright
$result$copyright$notice
[1] "Up In The Sky lyrics are property and copyright of their owners. Commercial use is not allowed."
$result$copyright$artist
[1] "Copyright Oasis"
$result$copyright$text
[1] "All lyrics provided for educational purposes and personal use only."
$result$probability
[1] 100
$result$similarity
[1] 1
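To make the structure of the parsed response easier to follow, here is a minimal offline sketch. The JSON string is hand-written to mimic the shape shown above (only the fields we need), so it is an assumption about the format rather than a real API response.

```r
library(jsonlite)

# Hand-written JSON with the same nesting as the response above
respuesta_demo <- '{"result":{"artist":{"name":"Oasis"},"track":{"name":"Up In The Sky","text":"Hey you! Up in the sky\\nLearning to fly"}}}'

contenido <- fromJSON(respuesta_demo)

# The song title and the lyrics live under result > track
titulo <- contenido$result$track$name
letra  <- contenido$result$track$text

# Split the lyrics into verses on the line breaks
versos <- strsplit(letra, "\n", fixed = TRUE)[[1]]
versos
```

Only `name` and `text` inside `track` are needed for the rest of the analysis; the copyright and language fields can be ignored.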
extraer_letra <- function(contenido){
if(!is.na(contenido)) {
cont_json <- fromJSON(contenido)
c(cancion = cont_json$result$track$name,
letra = cont_json$result$track$text) %>%
gsub("[[:cntrl:]]", " ", .) %>%
gsub("\\[.*?\\]", " ", .) %>%
trimws()
} else {
c(cancion = NA, letra = NA)
}
}
mis_letras_df <-
Oasis_letras_lista %>%
map(~content(., as = "text", encoding = "UTF-8")) %>%
map(~ifelse(grepl("error|Bad Request|html", .), NA, .)) %>%
map(extraer_letra) %>%
do.call(what = bind_rows)
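The cleaning step inside `extraer_letra` can be looked at in isolation. The sketch below applies the same two regular expressions to an invented string; the `[Chorus]` tag is a hypothetical example of the bracketed annotations that the second pattern is meant to remove, and `limpiar_letra` is a name introduced here only for illustration.

```r
# Hypothetical helper that isolates the cleaning done in extraer_letra
limpiar_letra <- function(texto) {
  # Replace control characters (\r, \n, tabs) with spaces
  texto <- gsub("[[:cntrl:]]", " ", texto)
  # Remove anything enclosed in square brackets
  texto <- gsub("\\[.*?\\]", " ", texto)
  # Drop leading and trailing whitespace
  trimws(texto)
}

limpia <- limpiar_letra("[Chorus]\r\nHey you!\nUp in the sky")
limpia
```

Note that this keeps runs of spaces inside the text (for example where `\r\n` appeared mid-lyric), which does not matter for tokenization later on.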
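# Equivalent version of the previous step, with the helper written inline
# as an anonymous function
mis_letras_df <-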
Oasis_letras_lista %>%
map(~content(., as = "text", encoding = "UTF-8")) %>%
map(~ifelse(grepl("error|Bad Request|html", .), NA, .)) %>%
map(function(x) {
if(!is.na(x)) {
y <- fromJSON(x)
c(cancion = y$result$track$name,
letra = y$result$track$text) %>%
gsub("[[:cntrl:]]", " ", .) %>%
gsub("\\[.*?\\]", " ", .) %>%
trimws()
} else {
c(cancion = NA, letra = NA)
}
}) %>%
do.call(what = bind_rows)
Oasis_df <-
Oasis %>%
left_join(., mis_letras_df, by = "cancion")
Oasis_df %>%
filter(!is.na(letra)) %>%
count(album)
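One thing to watch in this join is that it matches on the exact song title. The API response shown earlier returns "Up In The Sky", and if the scraped title is capitalized differently the lyrics will silently come back as `NA`. The toy sketch below assumes such a mismatch (the scraped spelling here is invented) and shows one possible workaround, joining on a lower-cased key.

```r
library(dplyr)

# Toy data: the capitalization of the scraped title is an assumption
discos <- tibble(cancion = c("Up in the Sky", "Columbia"),
                 album   = "Definitely Maybe")
letras <- tibble(cancion = c("Up In The Sky", "Columbia"),
                 letra   = c("Hey you! Up in the sky",
                             "There we were, now here we are"))

# Exact join: the first song finds no match
exacto <- left_join(discos, letras, by = "cancion")

# Join on a normalized key instead
normalizado <- discos %>%
  mutate(clave = tolower(cancion)) %>%
  left_join(transmute(letras, clave = tolower(cancion), letra),
            by = "clave") %>%
  select(-clave)

sum(is.na(exacto$letra))
sum(is.na(normalizado$letra))
```

Counting the `NA` values per album, as done above with `filter(!is.na(letra))`, is a quick way to see how many songs were lost in the join.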
Oasis_df <-
Oasis_df %>%
mutate(fecha = ymd(fecha),
album = reorder(as.factor(album), fecha))
Oasis_tokens2 <-
  Oasis_df %>%
  unnest_tokens(input = "letra", output = "word") %>%
  # The lyrics are in English, so we drop English stop words using the
  # stop_words table that ships with tidytext (stopwords() is not provided
  # by any of the packages loaded above)
  anti_join(stop_words, by = "word")
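The charts that follow rely on a table of tokens that carries a `sentiment` column. How that table was built is not shown, so the following is only a sketch under assumptions: it uses the `bing` lexicon bundled with tidytext (the original may well have used a different lexicon) and a one-line toy lyric taken from the output above.

```r
library(dplyr)
library(tidytext)

# Toy input: one lyric line from the article's output
letras_demo <- tibble(
  album = "Definitely Maybe",
  letra = "Well that's too bad Welcome to my world"
)

tokens_demo <- letras_demo %>%
  # One row per word, lower-cased
  unnest_tokens(output = word, input = letra) %>%
  # Keep only the words present in the lexicon, with their sentiment label
  inner_join(get_sentiments("bing"), by = "word")

tokens_demo
```

Because of the `inner_join`, words that the lexicon does not cover are dropped, so every later proportion is computed over lexicon words only, not over all the words in a song.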
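# Oasis_tokens (the tokens joined to a sentiment lexicon, with a `sentiment`
# column) is not defined in the code shown here; it must exist at this point
Oasis_tokens %>%
  group_by(sentiment) %>%
  count(word, sort = TRUE) %>%
  # 15 most frequent words within each sentiment
  top_n(15, n) %>%
  ggplot() +
  aes(word, n, fill = sentiment) +
  geom_col() +
  scale_y_continuous(expand = c(0, 0)) +
  coord_flip() +
  facet_wrap(~sentiment, scales = "free_y") +
  theme(legend.position = "none")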
Word clouds by album
Definitely Maybe
OasisWorldCloud <- Oasis_tokens2%>%
group_by(album) %>%
count(word, sort = T) %>%
arrange(album) %>%
nest()
NubeDM <- OasisWorldCloud %>%
unnest(cols = c(data)) %>%
filter(album == 'Definitely Maybe')
#PruebaDefinitelyMaybe
wordcloud(words = NubeDM$word, freq = NubeDM$n,
max.words = 1000, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
# wordcloud_custom <- function(grupo, df){
# print(grupo)
# wordcloud(words = df$word, freq = df$n,
# max.words = 400, random.order = FALSE, rot.per = 0.35,
# colors = brewer.pal(8, "Dark2"))
# }
#
#
#
#
# walk2(.x = OasisWorldCloud$album, .y = OasisWorldCloud$data, .f = wordcloud_custom)
(What’s the Story) Morning Glory?
NubeWS <- OasisWorldCloud %>%
unnest(cols = c(data)) %>%
filter(album == '(What’s the Story) Morning Glory?')
wordcloud(words = NubeWS$word, freq = NubeWS$n,
max.words = 1000, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
Be Here Now
NubeBH <- OasisWorldCloud %>%
unnest(cols = c(data)) %>%
filter(album == 'Be Here Now')
wordcloud(words = NubeBH$word, freq = NubeBH$n,
max.words = 1000, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
Standing on the Shoulder of Giants
NubeSS <- OasisWorldCloud %>%
unnest(cols = c(data)) %>%
filter(album == 'Standing on the Shoulder of Giants')
wordcloud(words = NubeSS$word, freq = NubeSS$n,
max.words = 400, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
Heathen Chemistry
NubeHC<- OasisWorldCloud %>%
unnest(cols = c(data)) %>%
filter(album == 'Heathen Chemistry')
wordcloud(words = NubeHC$word, freq = NubeHC$n,
max.words = 400, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
Don’t Believe the Truth
NubeDT<- OasisWorldCloud %>%
unnest(cols = c(data)) %>%
filter(album == 'Don’t Believe the Truth')
wordcloud(words = NubeDT$word, freq = NubeDT$n,
max.words = 400, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
Dig Out Your Soul
NubeDS<- OasisWorldCloud %>%
unnest(cols = c(data)) %>%
filter(album == 'Dig Out Your Soul')
wordcloud(words = NubeDS$word, freq = NubeDS$n,
max.words = 400, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
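All albums
NubeTOTAL <- OasisWorldCloud %>%
unnest(cols = c(data)) %>%
ungroup() %>%
# Add up the per-album counts so that each word appears only once
count(word, wt = n, name = "n", sort = TRUE)
wordcloud(words = NubeTOTAL$word, freq = NubeTOTAL$n,
max.words = 1500, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))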
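Since the same filter-and-plot steps are repeated for every album, they could be wrapped in a small function, which also avoids passing the frequencies of one album together with the words of another. The sketch below is one possible way to do it; `frecuencias_album` and `nube_album` are hypothetical names, and the token table is toy data.

```r
library(dplyr)
library(wordcloud)
library(RColorBrewer)

# Word frequencies for a single album
frecuencias_album <- function(tokens, nombre_album) {
  tokens %>%
    filter(album == nombre_album) %>%
    count(word, sort = TRUE)
}

# Draws the cloud; words and frequencies always come from the same table
nube_album <- function(tokens, nombre_album, max_palabras = 400) {
  frec <- frecuencias_album(tokens, nombre_album)
  wordcloud(words = frec$word, freq = frec$n,
            max.words = max_palabras, random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
}

# Toy tokens to check the counting step
tokens_demo <- tibble(album = c("A", "A", "A", "B"),
                      word  = c("sky", "sky", "fly", "soul"))
frec_demo <- frecuencias_album(tokens_demo, "A")
frec_demo
```

With the real data, a call such as `nube_album(Oasis_tokens2, "Be Here Now")` would replace each of the per-album blocks above.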
Now let's see what happens when we keep only the words that carry a pronounced sentiment according to the loaded sentiment lexicon.
Oasis_tokens <-
Oasis_tokens %>%
filter(!word %in% c("words", "boy", "mother", "god", "weight"))
Grafico2 <- Oasis_tokens %>%
group_by(fecha, album) %>%
count(sentiment) %>%
mutate(prop = round((n/sum(n)*100),2)) %>%
hchart('bar', hcaes(x = album, y = prop, group = sentiment)) %>%
hc_plotOptions(series=list(stacking='normal')) %>%
hc_title(text="Proportion of sentiments by album") %>%
hc_add_theme(hc_theme_flat())
Grafico2
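The proportion plotted here is the share of each sentiment within its album. A minimal sketch of that calculation on invented toy data (the album names and counts below are not taken from the lyrics) makes it easy to check that the shares add up to 100 per album.

```r
library(dplyr)

# Toy tokens: album A has 3 positive and 1 negative word, album B has 1 of each
tokens_demo <- tibble(
  album     = c("A", "A", "A", "A", "B", "B"),
  sentiment = c("positive", "positive", "positive", "negative",
                "positive", "negative")
)

proporciones <- tokens_demo %>%
  count(album, sentiment) %>%
  group_by(album) %>%
  # Share of each sentiment within its own album, as a percentage
  mutate(prop = round(n / sum(n) * 100, 2)) %>%
  ungroup()

proporciones
```

Grouping by album before dividing is what keeps albums with more lyrics from dominating the comparison.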
As can be seen, Be Here Now has the highest proportion of negative sentiments, a situation that is reversed in the later albums, where positive or cheerful sentiments account for a significant share. Let's now look at the same chart without stacking.
Grafico3 <- Oasis_tokens %>%
group_by(fecha, album) %>%
count(sentiment) %>%
mutate(prop = round((n/sum(n)*100),2)) %>%
hchart('column', hcaes(x = album, y = prop, group = sentiment)) %>%
hc_title(text="Proportion of sentiments by album") %>%
hc_add_theme(hc_theme_flat())
Grafico3
Now let's see what happens if we keep only the dominant sentiment, positive or negative, for each album.
Grafico4 <- Oasis_tokens %>%
group_by(fecha, album) %>%
count(sentiment) %>%
mutate(prop = round((n/sum(n)*100),2)) %>%
top_n(1, wt = prop) %>%
hchart('bar', hcaes(x = album, y = prop, group = sentiment)) %>%
hc_plotOptions(series=list(stacking='normal')) %>%
hc_title(text="Proportion of sentiments by album") %>%
hc_add_theme(hc_theme_flat())
Grafico4
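Keeping only the sentiment with the highest share in each album can also be written with `slice_max()`, which dplyr offers as the successor to `top_n()`. The numbers below are invented purely to illustrate the selection; they are not results from the lyrics.

```r
library(dplyr)

# Invented shares for two hypothetical albums
proporciones <- tibble(
  album     = c("A", "A", "B", "B"),
  sentiment = c("positive", "negative", "positive", "negative"),
  prop      = c(60, 40, 45, 55)
)

dominante <- proporciones %>%
  group_by(album) %>%
  # Keep the single row with the largest share in each album
  slice_max(prop, n = 1, with_ties = FALSE) %>%
  ungroup()

dominante
```

Setting `with_ties = FALSE` guarantees exactly one row per album even if two sentiments have the same share, which `top_n()` would not.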
We can see that, indeed, in Be Here Now negative sentiments outweigh positive ones, making it the only album where this happens.
# Oasis_tokens %>%
# group_by(fecha, album) %>%
# count(sentiment) %>%
# mutate(prop = n / sum(n)) %>%
# ungroup() %>%
# mutate(album = reorder(album, fecha)) %>%
# ggplot() +
# aes(album, prop, color = sentiment) +
# geom_point() +
# geom_line(aes(group = sentiment)) +
# theme_minimal() +
# theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4),
# text = element_text(family = "serif")) +
# labs(title = "Oasis\nSentiments over time",
# x = "Album", y = "Proportion", color = "Sentiment") +
# scale_y_continuous(labels = percent_format())
Grafico5 <- Oasis_tokens %>%
group_by(fecha, album) %>%
count(sentiment) %>%
mutate(prop = round((n/sum(n)*100),2)) %>%
hchart('line', hcaes(x = album, y = prop, group = sentiment)) %>%
hc_title(text="Positive and negative sentiments by album") %>%
hc_add_theme(hc_theme_flat())
Grafico5
Broadly, the trend we have been observing holds: Be Here Now is the harshest album in lyrical terms, and it marks a turning point after which positive sentiments start gaining ground, up to the last album, which contains the highest proportion of positive words in the group's history.
¹ I am a sociologist (FSOC-UBA). I work in data analysis on topics such as public opinion research, electoral behavior, text analysis, social networks and cultural consumption, all using R and Python. I direct the Observatorio de Opinión Pública at ACDES and I usually write on my blog about R and on my blog about general topics. Contact: guilleferchero@gmail.com
² The complete code and syntax for this report can be downloaded by clicking Code > Download Rmd.