About

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible.

Capabilities

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words and their parts of speech; identify whether they are names of companies, people, etc.; normalize dates, times, and numeric quantities; mark up the structure of sentences in terms of phrases and word dependencies; indicate which noun phrases refer to the same entities; indicate sentiment; and extract open-class relations between mentions.

Why use CoreNLP

Download

http://stanfordnlp.github.io/CoreNLP/download.html


R + CoreNLP 👽

First you need to install the Stanford CoreNLP application on your machine or connect to a server running the application. In our case the Analytics Lab has the CoreNLP application running on our server.

Setup

We are going to need the following R packages.

if(!require(xml2)) install.packages("xml2")
if(!require(XML)) install.packages("XML")
if(!require(httr)) install.packages("httr")
Loading required package: httr

To connect to CoreNLP and parse its output we need a couple of functions. ⚠️ These functions are part of the setup; you should not change anything unless you are familiar with the CoreNLP syntax.


Function to connect to the local/remote CoreNLP server.

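connNLP <- function(host="localhost", port="30200",
                    tokenize.whitespace="true", annotators="",
                    outputFormat="xml") {
  # Base URL of the CoreNLP server
  conn <- paste("http://",host,":",port,"/?",sep="")
  # Percent-encode the commas that separate the annotators
  annotators <- gsub(",","%2C",annotators)
  # Percent-encoded JSON object passed in the "properties" query parameter:
  # {"tokenize.whitespace":"...","annotators":"...","outputFormat":"..."}
  properties <- paste("properties=%7B",
                      "%22tokenize.whitespace%22%3A",
                      "%22",tokenize.whitespace,"%22%2C",
                      "%22annotators%22%3A",
                      "%22",annotators,"%22",
                      "%2C%22outputFormat%22%3A",
                      "%22",outputFormat,"%22%7D",
                      sep="")
  conn <- paste(conn,properties,sep="")
  # Keep the output format so getNLP knows how to parse the response
  request <- list("conn"=conn,"out"=outputFormat)
  return(request)
}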
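The hand-written percent-encoding in connNLP can be hard to read. As a minimal sketch, and assuming the server only needs these three properties, a URL of the same shape could be assembled from a readable JSON string with base R's URLencode. The name buildNLPUrl is hypothetical and not part of the original setup.

```r
# Hypothetical helper (not part of the original setup): builds the same kind of
# request URL as connNLP, but lets URLencode do the percent-encoding.
buildNLPUrl <- function(host = "localhost", port = "30200",
                        tokenize.whitespace = "true", annotators = "",
                        outputFormat = "xml") {
  # Readable JSON form of the properties object
  props <- sprintf(
    '{"tokenize.whitespace":"%s","annotators":"%s","outputFormat":"%s"}',
    tokenize.whitespace, annotators, outputFormat
  )
  # reserved = TRUE also encodes characters such as { } " : ,
  paste0("http://", host, ":", port, "/?properties=",
         utils::URLencode(props, reserved = TRUE))
}

url <- buildNLPUrl(annotators = "sentiment")
print(url)
```

Printing the URL is a quick way to check which properties are actually sent before making any request.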

Function to make requests to the CoreNLP server to analyse a text document.

getNLP <- function(text,conn){
  require("XML")
  require("httr")
  # Drop apostrophes before sending the text to the server
  text <- gsub("'","",text)
  out <- tolower(conn[["out"]])
  if(out=="xml"){
    result <- POST(conn[["conn"]],body = text,encode = "multipart")
    # Read the raw response text and parse it with the XML package
    doc <- xmlParse(content(result,"text",encoding = "UTF-8"),asText = TRUE)
  } else if(out=="json"){
    result <- POST(conn[["conn"]],body = text,encode = "json")
    doc <- content(result,"parsed","application/json",encoding = "UTF-8")
  } else if(out=="text"){
    result <- POST(conn[["conn"]],body = text,encode = "multipart")
    # Plain-text output is returned as a single string
    doc <- content(result,"text",encoding = "UTF-8")
  } else {
    stop("Unsupported outputFormat: ",conn[["out"]])
  }
  return(doc)
}

Function to extract the result of the CoreNLP sentiment annotator for the whole text document.

getSent <- function(docNLP){
  scr <- as.integer(xpathSApply(docNLP, "//sentences/sentence/@sentimentValue"))
  sent <- toString(xpathSApply(docNLP, "//sentences/sentence/@sentiment"))
  result <- list(sentiment = c(sent),score=c(scr))
  return(result)
}

Function to extract the result of the CoreNLP sentiment annotator for each word in the text document.

getWords <- function(docNLP){
  word <- xpathSApply(docNLP, "//token/word",xmlValue)
  word_sent <- xpathSApply(docNLP, "//token/sentiment",xmlValue)
  df <- data.frame(word,word_sent)
  return(df)
}
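To see what getSent and getWords return without a running server, the sketch below applies them to a small hand-written XML fragment. The fragment is an assumption: it only mimics the element and attribute names that the XPath expressions in the two functions rely on, and its values are made up rather than real CoreNLP output.

```r
library(XML)

# Same helpers as defined above, repeated so the example runs on its own
getSent <- function(docNLP){
  scr <- as.integer(xpathSApply(docNLP, "//sentences/sentence/@sentimentValue"))
  sent <- toString(xpathSApply(docNLP, "//sentences/sentence/@sentiment"))
  list(sentiment = c(sent), score = c(scr))
}
getWords <- function(docNLP){
  word <- xpathSApply(docNLP, "//token/word", xmlValue)
  word_sent <- xpathSApply(docNLP, "//token/sentiment", xmlValue)
  data.frame(word, word_sent)
}

# Hand-written fragment (assumed structure, made-up values)
sample_xml <- '<root><document><sentences>
  <sentence id="1" sentimentValue="3" sentiment="Positive">
    <tokens>
      <token id="1"><word>Great</word><sentiment>Positive</sentiment></token>
      <token id="2"><word>news</word><sentiment>Neutral</sentiment></token>
    </tokens>
  </sentence>
</sentences></document></root>'

docNLP <- xmlParse(sample_xml, asText = TRUE)
print(getSent(docNLP))   # one label and one score per sentence
print(getWords(docNLP))  # one row per token
```

Note that getSent collapses the labels of all sentences into one comma-separated string, while the scores stay a vector with one entry per sentence.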

Function to pull information from an RSS feed.

getRSS <- function(url=""){
  doc <- xmlParse(url)
  title <- xpathSApply(doc, "//item/title", xmlValue)
  desc <- xpathSApply(doc, "//item/description", xmlValue)
  date <- xpathSApply(doc, "//item/pubDate", xmlValue)
  date <- strptime(date,format ="%a, %d %b %Y %H:%M:%S",tz="GMT")
  rss <- list(title=title,description = desc,date = date)
  return(rss)
}
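The sketch below tries getRSS on a tiny feed written to a temporary file, so it can be run offline. The feed contents are made up for illustration and only contain the three fields the function reads. One caveat worth knowing: the %a and %b fields are matched against locale-specific day and month names, so the dates may come back as NA under a non-English LC_TIME setting.

```r
library(XML)

# Same function as defined above, repeated so the example runs on its own
getRSS <- function(url=""){
  doc <- xmlParse(url)
  title <- xpathSApply(doc, "//item/title", xmlValue)
  desc <- xpathSApply(doc, "//item/description", xmlValue)
  date <- xpathSApply(doc, "//item/pubDate", xmlValue)
  date <- strptime(date, format = "%a, %d %b %Y %H:%M:%S", tz = "GMT")
  list(title = title, description = desc, date = date)
}

# Made-up feed with two items (placeholder text, not real news)
feed <- c(
  '<rss version="2.0"><channel>',
  '<item><title>Example headline one</title>',
  '<description>First example description.</description>',
  '<pubDate>Thu, 13 Oct 2016 14:30:00 GMT</pubDate></item>',
  '<item><title>Example headline two</title>',
  '<description>Second example description.</description>',
  '<pubDate>Fri, 14 Oct 2016 09:00:00 GMT</pubDate></item>',
  '</channel></rss>'
)
feed_file <- tempfile(fileext = ".xml")
writeLines(feed, feed_file)

rss <- getRSS(feed_file)
print(rss$title)
print(rss$date)
```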

CoreNLP in Action 🚀

Now that we have the functions to connect to CoreNLP and to parse and analyze its output, we can use the connection function to connect to the CoreNLP server and set the parameters that we are going to use in the natural language analysis.

conn <- connNLP( host = "10.38.30.10",  port = "30200",
                 tokenize.whitespace = "true", annotators = "sentiment",
                 outputFormat = "xml" )

In order to perform natural language analysis we need to feed a text document to the application. To do that we are going to get some news from an RSS feed using the getRSS function and the link to the RSS page.


Let's analyze the first headline in the US News RSS feed.

  1. Get the news from the RSS feed.
  2. Get the first headline.
  3. Feed the text to the CoreNLP application.
text
[1] "Immigrants 'Missing Ingredient' to Economic Growth, Says Philadelphia Fed President Patrick Harker"

Now let's process the document that contains the result of the natural language analysis, using the getSent function.

text_sentiment
$sentiment
[1] "Negative"

$score
[1] 1

We can also analyze the sentiment of each word in the headline, using the getWords function.

We can do the same for the news description.

description
[1] "A senior official at the Fed indicated Thursday that immigrants could help jump-start the U.S. economy."
desc_sentiment
$sentiment
[1] "Negative"

$score
[1] 1

Now let's get the sentiment of each word for the description.


Sentiment analysis of news headlines 🌎

news_nlp <- list()

for(i in seq_along(news$title)){
  doc <- getNLP(news$title[i],conn)
  sent <- getSent(doc)
  df <- data.frame("title"=news$title[i],
                   "description"=news$description[i],
                   "sentiment"=sent$sentiment,
                   "score"=sent$score)
  news_nlp[[i]] <- df
  # Pause one second between requests
  Sys.sleep(1)
}

news_sentiment <- do.call("rbind",news_nlp)
#write.csv(news_sentiment,file = "news.csv")

counts <- table(news_sentiment$sentiment)
barplot(counts, main="Sentiment Distribution")
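The loop above builds one data frame per headline, and the results are then stacked and counted. The same pattern can be tried without a server on made-up rows; the titles and labels below are placeholders, not real results.

```r
# Placeholder rows standing in for the per-headline results (made-up values)
rows <- list(
  data.frame(title = "Headline A", sentiment = "Negative", score = 1L),
  data.frame(title = "Headline B", sentiment = "Neutral",  score = 2L),
  data.frame(title = "Headline C", sentiment = "Negative", score = 1L)
)

# Stack the one-row data frames into a single data frame
stacked <- do.call("rbind", rows)

# Count how many headlines fall under each sentiment label
counts <- table(stacked$sentiment)
print(counts)
```

Passing counts to barplot, as in the code above, draws one bar per sentiment label.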

