Why text mining

author: Uros Godnov
date: 2014-09-18

University of Primorska
Faculty of Management
Department of Information Science

Data mining

searching for previously unknown patterns
traditionally applied to structured data
as data volumes grow, so does the need to mine unstructured data such as text
R offers excellent tools for text mining

How to inspect metadata with the tm package

The crude dataset from the tm package

Example (first document in the crude corpus)

library(tm)
data("crude")

# Document-level metadata of the first document
meta(crude[[1]])

Metadata:
  author       : character(0)
  datetimestamp: 1987-02-26 17:00:56
  description  : 
  heading      : DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
  id           : 127
  language     : en
  origin       : Reuters-21578 XML
  topics       : YES
  lewissplit   : TRAIN
  cgisplit     : TRAINING-SET
  oldid        : 5670
  places       : usa
  people       : character(0)
  orgs         : character(0)
  exchanges    : character(0)
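
Individual fields from the listing above can also be read one at a time. The following is a minimal sketch, assuming a recent tm version in which meta() accepts a tag name as its second argument; the variable names doc and headings are illustrative only.

```r
library(tm)
data("crude")

doc <- crude[[1]]

# Read single metadata fields by tag name
heading <- meta(doc, "heading")
doc_id  <- meta(doc, "id")
print(heading)
print(doc_id)

# Collect the heading of every document in the corpus
headings <- lapply(crude, function(d) meta(d, "heading"))
print(head(headings, 3))
```

Reading tags one by one is handy when only a few fields, such as the heading or the id, are needed for labelling results.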

Creating a document-term matrix

Inspecting the matrix (rows 1:5, columns 2:5)

We can see that the term "demand (stored with its leading quotation mark, since the corpus has not been preprocessed) occurs once in the document with id 144.

dtm <- DocumentTermMatrix(crude)
inspect(dtm[1:5, 2:5])

<<DocumentTermMatrix (documents: 5, terms: 4)>>
Non-/sparse entries: 1/19
Sparsity           : 95%
Maximal term length: 10
Weighting          : term frequency (tf)

     Terms
Docs  "demand "expansion "for "growth
  127       0          0    0       0
  144       1          0    0       0
  191       0          0    0       0
  194       0          0    0       0
  211       0          0    0       0
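
The leading quotation marks in the terms above appear because the matrix was built from the raw text. A possible cleanup, sketched here under the assumption that stripping all punctuation is acceptable for the analysis, is to preprocess the corpus with tm_map() before building the matrix. The names clean and dtm_clean are illustrative only.

```r
library(tm)
data("crude")

# Assumed preprocessing: remove punctuation, then collapse extra whitespace
clean <- tm_map(crude, removePunctuation)
clean <- tm_map(clean, stripWhitespace)

dtm_clean <- DocumentTermMatrix(clean)

# Terms no longer start with a quotation mark
print("demand" %in% Terms(dtm_clean))

# Terms that occur at least 20 times across the corpus
print(findFreqTerms(dtm_clean, lowfreq = 20))
```

Whether to remove punctuation depends on the task: it merges variants such as "demand and demand, but it also joins hyphenated or abbreviated forms that may be worth keeping apart.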

Conclusion

Text mining is the future.
And so is R.

Let me finish with a joke:
Keep calm and let the professor handle it :).