headlinesCategorization

Josival Marques Leite Junior

Headlines Categorization

This Shiny application takes a headline from any online news source as input and classifies it into one of three subjects:

  • Economy
  • Sports
  • Culture

The R packages behind it

The main idea is to enter a news headline and classify it with some text mining, using a score to rate how well the headline fits each subject. Below are the libraries used in this project:

    library(SnowballC)  # stemming
    library(stringr)    # string helpers
    library(tm)         # corpus handling and text cleaning
    library(RWeka)      # n-gram tokenizer
    library(plyr)       # counting and sorting

Text Mining with Ngrams

To rate the text, we first clean it (lowercasing, removing punctuation and Italian stopwords, stemming) and then split it into bigrams using this part of the code:

    # Build a one-document corpus from the user's headline
    headline <- Corpus(VectorSource(input$headline))
    # Normalize: lowercase, drop punctuation, collapse whitespace
    headline <- tm_map(headline, content_transformer(tolower))
    headline <- tm_map(headline, removePunctuation)
    headline <- tm_map(headline, stripWhitespace)
    # Remove Italian stopwords and stem the remaining words
    headline <- tm_map(headline, removeWords, stopwords("italian"))
    headline <- tm_map(headline, stemDocument, language = "italian")
    # Bigrams
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    inputNgrams <- BigramTokenizer(headline)
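
To make the notion of a bigram concrete, here is a minimal base-R sketch that does not depend on tm or RWeka. The helper `make_bigrams` and the sample headline are hypothetical and only for illustration; unlike the application code, this sketch skips stopword removal and stemming.

```r
# Minimal base-R sketch of bigram extraction (no tm/RWeka dependency).
# `make_bigrams` is a hypothetical helper, not part of the application.
make_bigrams <- function(text) {
  # Lowercase, drop punctuation, and collapse repeated whitespace
  clean <- tolower(text)
  clean <- gsub("[[:punct:]]", "", clean)
  clean <- trimws(gsub("\\s+", " ", clean))
  words <- strsplit(clean, " ", fixed = TRUE)[[1]]
  # Fewer than two words means there is no bigram to build
  if (length(words) < 2) return(character(0))
  # Pair each word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

bigrams <- make_bigrams("Borsa di Milano, chiusura in rialzo")
print(bigrams)
```

Each element of `bigrams` is a pair of consecutive words, which is the unit the rating step compares against the subject databases.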

Rating function

Last but not least, we created a rating function that counts how many times the headline's bigrams occur in a subject's database and divides this count by the number of bigrams in that database. The higher the score, the more likely the headline belongs to that subject.

    # Add up how often each input bigram occurs in the economy database
    for (i in seq_along(inputNgrams))
    {
      # sum() over the logical comparison counts the matches and yields 0 when there are none
      aux <- sum(economiaHeads$bigrams == inputNgrams[i], na.rm = TRUE)
      ratingEco <- ratingEco + aux
    }
    # Normalize by the number of bigrams in the economy database
    ratingEco <- ratingEco/numberItemsEco
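
The same rating can be sketched for all three subjects at once. This is only an illustration under stated assumptions: the helper `rate_headline`, the toy `databases` list, and the sample bigrams are hypothetical, and the normalizer is assumed to be the length of each subject's bigram vector.

```r
# Hypothetical helper: occurrences of the input bigrams in one subject's
# database, divided by the size of that database (assumed normalizer).
rate_headline <- function(input_bigrams, db_bigrams) {
  if (length(db_bigrams) == 0) return(0)
  # Count every database entry that equals one of the input bigrams
  matches <- sum(vapply(input_bigrams,
                        function(b) sum(db_bigrams == b, na.rm = TRUE),
                        integer(1)))
  matches / length(db_bigrams)
}

# Toy databases, invented for illustration only
databases <- list(
  economy = c("borsa milano", "tassi interesse", "borsa milano", "debito pubblico"),
  sports  = c("serie a", "calcio mercato", "gran premio"),
  culture = c("mostra arte", "festival cinema", "nuovo romanzo")
)

input_bigrams <- c("borsa milano", "calcio mercato")

# One score per subject; the highest score gives the predicted subject
ratings <- vapply(databases, rate_headline, numeric(1),
                  input_bigrams = input_bigrams)
predicted <- names(which.max(ratings))
print(ratings)
print(predicted)
```

Dividing by the database size keeps a subject with a larger database from winning just because it has more entries to match against.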