Josival Marques Leite Junior
This Shiny application takes a headline from any online news source as input and classifies it into one of three subjects.
The main idea is to enter a news headline and use text mining to classify it, with a score that rates how well the headline matches each subject. Below are the libraries used in this project:
library(SnowballC)
library(stringr)
library(tm)
library(RWeka)
#To count and sort
library(plyr)
To rate the text, we split it into bigrams using this part of the code…
headline <- Corpus(VectorSource(input$headline))
headline <- tm_map(headline, content_transformer(tolower))
headline <- tm_map(headline, removePunctuation)
headline <- tm_map(headline , stripWhitespace)
headline <- tm_map(headline, removeWords, stopwords("italian"))
headline <- tm_map(headline, stemDocument, language = "italian")
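# Bigrams: tokenize the cleaned text of the single headline document,
# not the corpus object itself
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
inputNgrams <- BigramTokenizer(content(headline[[1]]))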
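To make the tokenization step concrete, here is a minimal base R sketch of what a bigram tokenizer produces. It does not use tm or RWeka, and it skips stop-word removal and stemming. The helper `make_bigrams` and the sample headline are assumptions made up for illustration; they are not part of the app.

```r
# Hypothetical helper (not part of the app): build bigrams with base R only.
make_bigrams <- function(text) {
  # Lowercase, drop punctuation, collapse repeated whitespace
  clean <- tolower(text)
  clean <- gsub("[[:punct:]]", "", clean)
  clean <- trimws(gsub("\\s+", " ", clean))
  words <- strsplit(clean, " ", fixed = TRUE)[[1]]
  # A headline with fewer than two words has no bigrams
  if (length(words) < 2) return(character(0))
  # Pair every word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

# Made-up sample headline
example_bigrams <- make_bigrams("La borsa di Milano chiude in rialzo")
print(example_bigrams)
```

A headline of n words yields n - 1 bigrams, so very short headlines give the rating step little to work with.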
Last but not least, we created a rating function that counts how often each of the headline's bigrams occurs in a subject's database and divides the total by the number of bigrams in that database. The higher the score, the more likely the headline belongs to that subject.
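# Start the economy score at zero (assumed not to be initialised elsewhere)
ratingEco <- 0
# seq_along() avoids the 1:0 trap when the headline yields no bigrams
for (i in seq_along(inputNgrams))
{
  # sum() counts the matches directly; table(...)[2] is NA when nothing
  # matches and also when every entry matches, so it can miss counts
  aux <- sum(economiaHeads$bigrams == inputNgrams[i], na.rm = TRUE)
  ratingEco <- ratingEco + aux
}
ratingEco <- ratingEco/numberItemsEco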
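The rating step can be tried out on toy data with the sketch below. It assumes the divisor is the number of bigrams in the subject's database, as the description says. The helper `rate_headline`, the two toy databases, and the `other` subject label are hypothetical placeholders, not the app's real data or subject names.

```r
# Hypothetical helper: total occurrences of the headline bigrams in a
# subject database, divided by the size of that database.
rate_headline <- function(input_bigrams, db_bigrams) {
  if (length(db_bigrams) == 0) return(0)
  # For each headline bigram, count how many database entries equal it
  hits <- vapply(input_bigrams, function(b) sum(db_bigrams == b), numeric(1))
  sum(hits) / length(db_bigrams)
}

# Toy databases (made-up data, for illustration only)
toy_economia <- c("borsa milano", "borsa milano", "tassi interesse", "banca centrale")
toy_other    <- c("primo tempo", "calcio mercato", "borsa milano", "gol vittoria")

# Made-up bigrams standing in for the tokenized headline
input_bigrams <- c("borsa milano", "milano chiude")

ratings <- c(
  economia = rate_headline(input_bigrams, toy_economia),
  other    = rate_headline(input_bigrams, toy_other)
)
print(ratings)

# The subject with the highest rating is the predicted class
best_subject <- names(ratings)[which.max(ratings)]
print(best_subject)
```

Dividing by the database size keeps a subject with a larger database from winning only because it has more entries to match against.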