Text Mining

Marcin Mazurek

2018-12-12

Literature

The tidytext package

https://www.tidytextmining.com/

Source materials

https://rpubs.com/pjmurphy/265713

https://rpubs.com/williamsurles/316682

Loading the data: the document corpus

Data source: bbc-text.csv

https://www.kaggle.com/yufengdev/bbc-fulltext-and-category

Creating the corpus from a data frame:

library(tm)
## Loading required package: NLP
library(data.table)
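# The file is comma-separated; read the text column as character, not factor
docs_df <- read.csv("bbc-text.csv", encoding = "UTF-8", stringsAsFactors = FALSE)
# Remove every character outside printable ASCII (space through tilde)
docs_df$text <- gsub("[^\x20-\x7E]", "", docs_df$text)

# DataframeSource expects the columns doc_id and text, in that order
docs_df$doc_id <- seq_len(nrow(docs_df))

# Build the corpus from the first 100 documents only
docs <- SimpleCorpus(DataframeSource(docs_df[1:100, c("doc_id", "text")]))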

library(tm.plugin.webmining)
## 
## Attaching package: 'tm.plugin.webmining'
## The following object is masked from 'package:base':
## 
##     parse
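The corpus construction above relies on two conventions that are easy to miss: the regular expression keeps only printable ASCII (space through tilde), and `DataframeSource` expects a data frame whose first two columns are named `doc_id` and `text`. The following is a minimal sketch on a made-up two-row data frame; `strip_non_ascii`, `toy_df`, and `toy_corpus` are illustrative names, not part of the original analysis.

```r
library(tm)

# Hypothetical helper: drop everything outside printable ASCII (0x20 to 0x7E).
# Note that this also removes tabs and newlines, not only accented letters.
strip_non_ascii <- function(x) gsub("[^\x20-\x7E]", "", x)

# Made-up data standing in for bbc-text.csv
toy_df <- data.frame(
  doc_id = c(1, 2),
  text = c("caf\u00e9 culture is booming", "plain ascii text"),
  stringsAsFactors = FALSE
)
toy_df$text <- strip_non_ascii(toy_df$text)

# doc_id must be the first column and text the second
toy_corpus <- SimpleCorpus(DataframeSource(toy_df[, c("doc_id", "text")]))

length(toy_corpus)             # number of documents
as.character(toy_corpus[[1]])  # text of the first document
```

Stripping characters this way is lossy: an accented word simply loses its accented letters, which may matter for corpora that are not plain English.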

Checking that the first element of the document corpus was loaded

writeLines(as.character(docs[[1]][[1]]))
## tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high-definition tv sets  which are big business in japan and the us  but slower to take off in europe because of the lack of high-definition programming. not only can people forward wind through adverts  they can also forget about abiding by network and channel schedules  putting together their own a-la-carte entertainment. but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as  brand identity  and viewer loyalty to channels. although the us leads in this technology at the moment  it is also a concern that is being raised in europe  particularly with the growing uptake of services like sky+.  what happens here today  we will see in nine months to a years  time in the uk   adam hume  the bbc broadcast s futurologist told the bbc news website. for the likes of the bbc  there are no issues of lost advertising revenue yet. 
it is a more pressing issue at the moment for commercial uk broadcasters  but brand loyalty is important for everyone.  we will be talking more about content brands rather than network brands   said tim hanlon  from brand communications firm starcom mediavest.  the reality is that with broadband connections  anybody can be the producer of content.  he added:  the challenge now is that it is hard to promote a programme with so much choice.   what this means  said stacey jolna  senior vice president of tv guide tv group  is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks  in us terms  or channels could take a leaf out of google s book and be the search engine of the future  instead of the scheduler to help people find what they want to watch. this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them. but it might not suit everyone  the panel recognised. older generations are more comfortable with familiar schedules and channel brands because they know what they are getting. they perhaps do not want so much of the choice put into their hands  mr hanlon suggested.  on the other end  you have the kids just out of diapers who are pushing buttons already - everything is possible and available to them   said mr hanlon.  ultimately  the consumer will tell the market they want.   of the 50 000 new gadgets and technologies being showcased at ces  many of them are about enhancing the tv-watching experience. high-definition tv sets are everywhere and many new models of lcd (liquid crystal display) tvs have been launched with dvr capability built into them  instead of being external boxes. one such example launched at the show is humax s 26-inch lcd tv with an 80-hour tivo dvr and dvd recorder. 
one of the us s biggest satellite tv companies  directtv  has even launched its own branded dvr at the show with 100-hours of recording capability  instant replay  and a search function. the set can pause and rewind tv for up to 90 hours. and microsoft chief bill gates announced in his pre-show keynote speech a partnership with tivo  called tivotogo  which means people can play recorded programmes on windows pcs and mobile devices. all these reflect the increasing trend of freeing up multimedia so that people can watch what they want  when they want.

The second element and its metadata:

writeLines(as.character(docs[[2]][[1]]))
## worldcom boss  left books alone  former worldcom boss bernie ebbers  who is accused of overseeing an $11bn (5.8bn) fraud  never made accounting decisions  a witness has told jurors.  david myers made the comments under questioning by defence lawyers who have been arguing that mr ebbers was not responsible for worldcom s problems. the phone company collapsed in 2002 and prosecutors claim that losses were hidden to protect the firm s shares. mr myers has already pleaded guilty to fraud and is assisting prosecutors.  on monday  defence lawyer reid weingarten tried to distance his client from the allegations. during cross examination  he asked mr myers if he ever knew mr ebbers  make an accounting decision  .  not that i am aware of   mr myers replied.  did you ever know mr ebbers to make an accounting entry into worldcom books   mr weingarten pressed.  no   replied the witness. mr myers has admitted that he ordered false accounting entries at the request of former worldcom chief financial officer scott sullivan. defence lawyers have been trying to paint mr sullivan  who has admitted fraud and will testify later in the trial  as the mastermind behind worldcom s accounting house of cards.  mr ebbers  team  meanwhile  are looking to portray him as an affable boss  who by his own admission is more pe graduate than economist. whatever his abilities  mr ebbers transformed worldcom from a relative unknown into a $160bn telecoms giant and investor darling of the late 1990s. worldcom s problems mounted  however  as competition increased and the telecoms boom petered out. when the firm finally collapsed  shareholders lost about $180bn and 20 000 workers lost their jobs. mr ebbers  trial is expected to last two months and if found guilty the former ceo faces a substantial jail sentence. he has firmly declared his innocence.
# Metadata:
meta(docs[[2]])
##   author       : character(0)
##   datetimestamp: 2019-01-03 12:06:13
##   description  : character(0)
##   heading      : character(0)
##   id           : 2
##   language     : en
##   origin       : character(0)
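
Individual metadata fields can also be read by name through tm's `meta()` accessor. The sketch below uses a made-up corpus (`toy` is an illustrative name). Since only `doc_id` and `text` are passed to the corpus, the `datetimestamp` shown above most likely reflects when the document object was built, not when the article was published.

```r
library(tm)

# Made-up two-document corpus, for illustration only
toy <- SimpleCorpus(VectorSource(c("first text", "second text")))

doc <- toy[[2]]
meta(doc)              # all metadata fields of the document
meta(doc, "language")  # a single field, selected by name
```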

Preprocessing

  1. Letter case: conversion to lowercase
# content_transformer() keeps the call valid for other corpus types as well
docs <- tm_map(docs, content_transformer(tolower))

writeLines(as.character(docs[1][[1]]))
## tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high-definition tv sets  which are big business in japan and the us  but slower to take off in europe because of the lack of high-definition programming. not only can people forward wind through adverts  they can also forget about abiding by network and channel schedules  putting together their own a-la-carte entertainment. but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as  brand identity  and viewer loyalty to channels. although the us leads in this technology at the moment  it is also a concern that is being raised in europe  particularly with the growing uptake of services like sky+.  what happens here today  we will see in nine months to a years  time in the uk   adam hume  the bbc broadcast s futurologist told the bbc news website. for the likes of the bbc  there are no issues of lost advertising revenue yet. 
it is a more pressing issue at the moment for commercial uk broadcasters  but brand loyalty is important for everyone.  we will be talking more about content brands rather than network brands   said tim hanlon  from brand communications firm starcom mediavest.  the reality is that with broadband connections  anybody can be the producer of content.  he added:  the challenge now is that it is hard to promote a programme with so much choice.   what this means  said stacey jolna  senior vice president of tv guide tv group  is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks  in us terms  or channels could take a leaf out of google s book and be the search engine of the future  instead of the scheduler to help people find what they want to watch. this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them. but it might not suit everyone  the panel recognised. older generations are more comfortable with familiar schedules and channel brands because they know what they are getting. they perhaps do not want so much of the choice put into their hands  mr hanlon suggested.  on the other end  you have the kids just out of diapers who are pushing buttons already - everything is possible and available to them   said mr hanlon.  ultimately  the consumer will tell the market they want.   of the 50 000 new gadgets and technologies being showcased at ces  many of them are about enhancing the tv-watching experience. high-definition tv sets are everywhere and many new models of lcd (liquid crystal display) tvs have been launched with dvr capability built into them  instead of being external boxes. one such example launched at the show is humax s 26-inch lcd tv with an 80-hour tivo dvr and dvd recorder. 
one of the us s biggest satellite tv companies  directtv  has even launched its own branded dvr at the show with 100-hours of recording capability  instant replay  and a search function. the set can pause and rewind tv for up to 90 hours. and microsoft chief bill gates announced in his pre-show keynote speech a partnership with tivo  called tivotogo  which means people can play recorded programmes on windows pcs and mobile devices. all these reflect the increasing trend of freeing up multimedia so that people can watch what they want  when they want.
  2. Removing punctuation
docs <- tm_map(docs, removePunctuation)


writeLines(as.character(docs[1][[1]]))
## tv future in the hands of viewers with home theatre systems  plasma highdefinition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices  one of the most talkedabout technologies of ces has been digital and personal video recorders dvr and pvr these settop boxes  like the us s tivo and the uk s sky system  allow people to record  store  play  pause and forward wind tv programmes when they want  essentially  the technology allows for much more personalised tv they are also being builtin to highdefinition tv sets  which are big business in japan and the us  but slower to take off in europe because of the lack of highdefinition programming not only can people forward wind through adverts  they can also forget about abiding by network and channel schedules  putting together their own alacarte entertainment but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as  brand identity  and viewer loyalty to channels although the us leads in this technology at the moment  it is also a concern that is being raised in europe  particularly with the growing uptake of services like sky  what happens here today  we will see in nine months to a years  time in the uk   adam hume  the bbc broadcast s futurologist told the bbc news website for the likes of the bbc  there are no issues of lost advertising revenue yet it is a more pressing issue at the moment for commercial uk broadcasters  but brand loyalty is important for 
everyone  we will be talking more about content brands rather than network brands   said tim hanlon  from brand communications firm starcom mediavest  the reality is that with broadband connections  anybody can be the producer of content  he added  the challenge now is that it is hard to promote a programme with so much choice   what this means  said stacey jolna  senior vice president of tv guide tv group  is that the way people find the content they want to watch has to be simplified for tv viewers it means that networks  in us terms  or channels could take a leaf out of google s book and be the search engine of the future  instead of the scheduler to help people find what they want to watch this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them but it might not suit everyone  the panel recognised older generations are more comfortable with familiar schedules and channel brands because they know what they are getting they perhaps do not want so much of the choice put into their hands  mr hanlon suggested  on the other end  you have the kids just out of diapers who are pushing buttons already  everything is possible and available to them   said mr hanlon  ultimately  the consumer will tell the market they want   of the 50 000 new gadgets and technologies being showcased at ces  many of them are about enhancing the tvwatching experience highdefinition tv sets are everywhere and many new models of lcd liquid crystal display tvs have been launched with dvr capability built into them  instead of being external boxes one such example launched at the show is humax s 26inch lcd tv with an 80hour tivo dvr and dvd recorder one of the us s biggest satellite tv companies  directtv  has even launched its own branded dvr at the show with 100hours of recording capability  instant replay  and a search function the set can pause and rewind tv for up to 90 hours and microsoft chief bill 
gates announced in his preshow keynote speech a partnership with tivo  called tivotogo  which means people can play recorded programmes on windows pcs and mobile devices all these reflect the increasing trend of freeing up multimedia so that people can watch what they want  when they want
  3. Removing numbers
docs <- tm_map(docs, removeNumbers)

writeLines(as.character(docs[1][[1]]))
## tv future in the hands of viewers with home theatre systems  plasma highdefinition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices  one of the most talkedabout technologies of ces has been digital and personal video recorders dvr and pvr these settop boxes  like the us s tivo and the uk s sky system  allow people to record  store  play  pause and forward wind tv programmes when they want  essentially  the technology allows for much more personalised tv they are also being builtin to highdefinition tv sets  which are big business in japan and the us  but slower to take off in europe because of the lack of highdefinition programming not only can people forward wind through adverts  they can also forget about abiding by network and channel schedules  putting together their own alacarte entertainment but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as  brand identity  and viewer loyalty to channels although the us leads in this technology at the moment  it is also a concern that is being raised in europe  particularly with the growing uptake of services like sky  what happens here today  we will see in nine months to a years  time in the uk   adam hume  the bbc broadcast s futurologist told the bbc news website for the likes of the bbc  there are no issues of lost advertising revenue yet it is a more pressing issue at the moment for commercial uk broadcasters  but brand loyalty is important for 
everyone  we will be talking more about content brands rather than network brands   said tim hanlon  from brand communications firm starcom mediavest  the reality is that with broadband connections  anybody can be the producer of content  he added  the challenge now is that it is hard to promote a programme with so much choice   what this means  said stacey jolna  senior vice president of tv guide tv group  is that the way people find the content they want to watch has to be simplified for tv viewers it means that networks  in us terms  or channels could take a leaf out of google s book and be the search engine of the future  instead of the scheduler to help people find what they want to watch this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them but it might not suit everyone  the panel recognised older generations are more comfortable with familiar schedules and channel brands because they know what they are getting they perhaps do not want so much of the choice put into their hands  mr hanlon suggested  on the other end  you have the kids just out of diapers who are pushing buttons already  everything is possible and available to them   said mr hanlon  ultimately  the consumer will tell the market they want   of the   new gadgets and technologies being showcased at ces  many of them are about enhancing the tvwatching experience highdefinition tv sets are everywhere and many new models of lcd liquid crystal display tvs have been launched with dvr capability built into them  instead of being external boxes one such example launched at the show is humax s inch lcd tv with an hour tivo dvr and dvd recorder one of the us s biggest satellite tv companies  directtv  has even launched its own branded dvr at the show with hours of recording capability  instant replay  and a search function the set can pause and rewind tv for up to  hours and microsoft chief bill gates announced in 
his preshow keynote speech a partnership with tivo  called tivotogo  which means people can play recorded programmes on windows pcs and mobile devices all these reflect the increasing trend of freeing up multimedia so that people can watch what they want  when they want
  4. Stopwords (stop list)
# "s" is added to drop stray possessive fragments such as "google s book"
docs <- tm_map(docs, removeWords, c(stopwords("english"), "s"))
writeLines(as.character(docs[1][[1]]))
## tv future   hands  viewers  home theatre systems  plasma highdefinition tvs   digital video recorders moving   living room   way people watch tv will  radically different  five years  time    according   expert panel  gathered   annual consumer electronics show  las vegas  discuss   new technologies will impact one   favourite pastimes   us leading  trend  programmes   content will  delivered  viewers via home networks   cable  satellite  telecoms companies   broadband service providers  front rooms  portable devices  one    talkedabout technologies  ces   digital  personal video recorders dvr  pvr  settop boxes  like  us  tivo   uk  sky system  allow people  record  store  play  pause  forward wind tv programmes   want  essentially   technology allows  much  personalised tv   also  builtin  highdefinition tv sets    big business  japan   us   slower  take   europe    lack  highdefinition programming   can people forward wind  adverts   can also forget  abiding  network  channel schedules  putting together   alacarte entertainment   us networks  cable  satellite companies  worried    means    terms  advertising revenues  well   brand identity   viewer loyalty  channels although  us leads   technology   moment    also  concern    raised  europe  particularly   growing uptake  services like sky   happens  today   will see  nine months   years  time   uk   adam hume   bbc broadcast  futurologist told  bbc news website   likes   bbc     issues  lost advertising revenue yet     pressing issue   moment  commercial uk broadcasters   brand loyalty  important  everyone   will  talking   content brands rather  network brands   said tim hanlon   brand communications firm starcom mediavest   reality    broadband connections  anybody can   producer  content   added   challenge now     hard  promote  programme   much choice     means  said stacey jolna  senior vice president  tv guide tv group     way people find  content  want  watch    simplified  tv viewers  means  
networks   us terms   channels  take  leaf   google  book    search engine   future  instead   scheduler  help people find   want  watch  kind  channel model might work   younger ipod generation   used  taking control   gadgets    play     might  suit everyone   panel recognised older generations   comfortable  familiar schedules  channel brands   know    getting  perhaps   want  much   choice put   hands  mr hanlon suggested     end     kids just   diapers   pushing buttons already  everything  possible  available     said mr hanlon  ultimately   consumer will tell  market  want       new gadgets  technologies  showcased  ces  many     enhancing  tvwatching experience highdefinition tv sets  everywhere  many new models  lcd liquid crystal display tvs   launched  dvr capability built    instead   external boxes one  example launched   show  humax  inch lcd tv   hour tivo dvr  dvd recorder one   us  biggest satellite tv companies  directtv   even launched   branded dvr   show  hours  recording capability  instant replay    search function  set can pause  rewind tv     hours  microsoft chief bill gates announced   preshow keynote speech  partnership  tivo  called tivotogo   means people can play recorded programmes  windows pcs  mobile devices   reflect  increasing trend  freeing  multimedia   people can watch   want    want
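
The output above shows runs of blanks where stopwords, digits, and punctuation were deleted. Below is a minimal sketch of steps 1 to 4 on a made-up sentence, with tm's `stripWhitespace` added as an optional final step to collapse those gaps. The sentence and the names `toy` and `cleaned` are assumptions for illustration, and the extra step is not part of the original pipeline.

```r
library(tm)

toy <- SimpleCorpus(VectorSource("The 3 cats sat on the mat, quietly!"))

toy <- tm_map(toy, content_transformer(tolower))       # 1. lowercase
toy <- tm_map(toy, removePunctuation)                  # 2. punctuation
toy <- tm_map(toy, removeNumbers)                      # 3. numbers
toy <- tm_map(toy, removeWords, stopwords("english"))  # 4. stopwords

# Optional extra step (not in the original): collapse repeated blanks,
# then trim the blank that may remain at either end
toy <- tm_map(toy, stripWhitespace)
cleaned <- trimws(as.character(toy[[1]]))
cleaned
```

The order matters: lowercasing comes first because the English stop list is lowercase, so "The" would otherwise survive step 4.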

4a. Stemming

A copy of the corpus for later use (as the dictionary for stem completion)

dict_from_docs <- docs
writeLines(as.character(dict_from_docs[[1]][1]))
## tv future   hands  viewers  home theatre systems  plasma highdefinition tvs   digital video recorders moving   living room   way people watch tv will  radically different  five years  time    according   expert panel  gathered   annual consumer electronics show  las vegas  discuss   new technologies will impact one   favourite pastimes   us leading  trend  programmes   content will  delivered  viewers via home networks   cable  satellite  telecoms companies   broadband service providers  front rooms  portable devices  one    talkedabout technologies  ces   digital  personal video recorders dvr  pvr  settop boxes  like  us  tivo   uk  sky system  allow people  record  store  play  pause  forward wind tv programmes   want  essentially   technology allows  much  personalised tv   also  builtin  highdefinition tv sets    big business  japan   us   slower  take   europe    lack  highdefinition programming   can people forward wind  adverts   can also forget  abiding  network  channel schedules  putting together   alacarte entertainment   us networks  cable  satellite companies  worried    means    terms  advertising revenues  well   brand identity   viewer loyalty  channels although  us leads   technology   moment    also  concern    raised  europe  particularly   growing uptake  services like sky   happens  today   will see  nine months   years  time   uk   adam hume   bbc broadcast  futurologist told  bbc news website   likes   bbc     issues  lost advertising revenue yet     pressing issue   moment  commercial uk broadcasters   brand loyalty  important  everyone   will  talking   content brands rather  network brands   said tim hanlon   brand communications firm starcom mediavest   reality    broadband connections  anybody can   producer  content   added   challenge now     hard  promote  programme   much choice     means  said stacey jolna  senior vice president  tv guide tv group     way people find  content  want  watch    simplified  tv viewers  means  
networks   us terms   channels  take  leaf   google  book    search engine   future  instead   scheduler  help people find   want  watch  kind  channel model might work   younger ipod generation   used  taking control   gadgets    play     might  suit everyone   panel recognised older generations   comfortable  familiar schedules  channel brands   know    getting  perhaps   want  much   choice put   hands  mr hanlon suggested     end     kids just   diapers   pushing buttons already  everything  possible  available     said mr hanlon  ultimately   consumer will tell  market  want       new gadgets  technologies  showcased  ces  many     enhancing  tvwatching experience highdefinition tv sets  everywhere  many new models  lcd liquid crystal display tvs   launched  dvr capability built    instead   external boxes one  example launched   show  humax  inch lcd tv   hour tivo dvr  dvd recorder one   us  biggest satellite tv companies  directtv   even launched   branded dvr   show  hours  recording capability  instant replay    search function  set can pause  rewind tv     hours  microsoft chief bill gates announced   preshow keynote speech  partnership  tivo  called tivotogo   means people can play recorded programmes  windows pcs  mobile devices   reflect  increasing trend  freeing  multimedia   people can watch   want    want
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[[1]][1]))
## tv futur hand viewer home theatr system plasma highdefinit tvs digit video record move live room way peopl watch tv will radic differ five year time accord expert panel gather annual consum electron show las vega discuss new technolog will impact one favourit pastim us lead trend programm content will deliv viewer via home network cabl satellit telecom compani broadband servic provid front room portabl devic one talkedabout technolog ces digit person video record dvr pvr settop box like us tivo uk sky system allow peopl record store play paus forward wind tv programm want essenti technolog allow much personalis tv also builtin highdefinit tv set big busi japan us slower take europ lack highdefinit program can peopl forward wind advert can also forget abid network channel schedul put togeth alacart entertain us network cabl satellit compani worri mean term advertis revenu well brand ident viewer loyalti channel although us lead technolog moment also concern rais europ particular grow uptak servic like sky happen today will see nine month year time uk adam hume bbc broadcast futurologist told bbc news websit like bbc issu lost advertis revenu yet press issu moment commerci uk broadcast brand loyalti import everyon will talk content brand rather network brand said tim hanlon brand communic firm starcom mediavest realiti broadband connect anybodi can produc content ad challeng now hard promot programm much choic mean said stacey jolna senior vice presid tv guid tv group way peopl find content want watch simplifi tv viewer mean network us term channel take leaf googl book search engin futur instead schedul help peopl find want watch kind channel model might work younger ipod generat use take control gadget play might suit everyon panel recognis older generat comfort familiar schedul channel brand know get perhap want much choic put hand mr hanlon suggest end kid just diaper push button alreadi everyth possibl avail said mr hanlon ultim consum will tell market want 
new gadget technolog showcas ces mani enhanc tvwatch experi highdefinit tv set everywher mani new model lcd liquid crystal display tvs launch dvr capabl built instead extern box one exampl launch show humax inch lcd tv hour tivo dvr dvd record one us biggest satellit tv compani directtv even launch brand dvr show hour record capabl instant replay search function set can paus rewind tv hour microsoft chief bill gate announc preshow keynot speech partnership tivo call tivotogo mean peopl can play record programm window pcs mobil devic reflect increas trend free multimedia peopl can watch want want
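
Stemming maps inflected forms to a shared stem so that they count as one term; in the output above, for example, "technologies" and "technology" both became "technolog". A short sketch on a made-up word vector follows, assuming the SnowballC package (on which tm's `stemDocument` relies) is installed. The names `words` and `stems` are illustrative.

```r
library(tm)

words <- c("technologies", "technology", "recorders", "recording")

# stemDocument also accepts a plain character vector
stems <- stemDocument(words, language = "english")

# Pair each word with its stem to see which forms collapse together
data.frame(word = words, stem = stems, stringsAsFactors = FALSE)
```

Stems are not dictionary words, which is why the next step tries to map them back to readable forms.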

4b. Completing stems to full words

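stemCompletion2 <- function(x, dictionary) {
  # Split the document into individual stems
  x <- unlist(strsplit(as.character(x), " "))
  # Complete each stem with the most prevalent dictionary word that starts with it
  x <- stemCompletion(x, dictionary = dictionary, type = "prevalent")
  # Join the completed words back into a single string
  paste(x, collapse = " ")
}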

writeLines(
stemCompletion2('tv futur hand viewer home theatr system plasma highdefinit tvs digit video record move live room way peopl watch tv will radic differ five year time accord expert panel gather annual consum electron show las vega discuss new technolog will impact one favourit pastim us lead trend programm content will deliv viewer via home network cabl satellit telecom compani broadband servic provid front room portabl devic one talkedabout technolog ces digit person video record dvr pvr settop box like us s tivo uk s sky system allow peopl record store play paus forward wind tv programm want essenti technolog allow much personalis tv also builtin highdefinit tv set big busi japan us slower take europ lack highdefinit program can peopl forward wind advert can', dict_from_docs)[[1]][1]
)
## tv future hand viewer home theatre system plasma highdefinition tvs digital video record move live room way people watch tv will radical difference five year time accordance expert panel gathered annual consumer electronic show las vegas discuss new technological will impact one favourite pastimes us lead trend programme content will deliver viewer via home network cable satellite telecommunications companies broadband service provide front room portable device one talkedabout technological ces digital person video record dvr pvr settop box like us saab tivo uk saab sky system allow people record store play pause forward wind tv programme want essentially technological allow much personalised tv also builtin highdefinition tv set big business japan us slower take europe lack highdefinition program can people forward wind advert can
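
In the result above the stray token "s" was completed to "saab", and "technolog" came back as "technological" although the article used "technologies": completion only looks for dictionary words that begin with the stem, so the result need not be the form that appeared in the text. Stems with no such word at all (for example "loyalti", since "loyalty" does not start with it) have nothing to be completed to. The sketch below is a hypothetical variant, not the function used in this report, that leaves such stems unchanged. It assumes the dictionary is a plain character vector; `complete_or_keep` and `toy_dictionary` are illustrative names.

```r
library(tm)

# Hypothetical variant of stemCompletion2: stems without any matching
# dictionary word are kept as they are instead of being passed to completion.
complete_or_keep <- function(x, dictionary) {
  stems <- unlist(strsplit(as.character(x), " ", fixed = TRUE))
  stems <- stems[nzchar(stems)]  # drop empty strings left by repeated blanks
  # A stem can be completed only if some dictionary word starts with it
  has_match <- vapply(stems, function(s) any(startsWith(dictionary, s)),
                      logical(1), USE.NAMES = FALSE)
  completed <- stems
  if (any(has_match)) {
    completed[has_match] <- stemCompletion(stems[has_match],
                                           dictionary = dictionary,
                                           type = "prevalent")
  }
  paste(completed, collapse = " ")
}

# Made-up dictionary with exactly one candidate per stem
toy_dictionary <- c("technology", "record")
complete_or_keep("technolog record loyalti", toy_dictionary)
```

Whether an uncompleted stem is better than a dropped word depends on the later analysis; keeping it at least preserves the term count for that document.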
docs_completed <- lapply(docs, stemCompletion2, dictionary=dict_from_docs)

docs_completed_df <- data.frame(text=unlist(docs_completed, recursive=FALSE), doc_id=seq_along(docs_completed))
docs_completed_df$text<-as.character(docs_completed_df$text)

docs <- SimpleCorpus(DataframeSource(docs_completed_df[c('doc_id', 'text')]))   


writeLines(as.character(docs[[1]][1]))
## tv future hand viewer home theatre system plasma highdefinition tvs digital video record move live room way people watch tv will radical difference five year time accordance expert panel gathered annual consumer electronic show las vegas discuss new technological will impact one favourite pastimes us lead trend programme content will deliver viewer via home network cable satellite telecommunications companies broadband service provide front room portable device one talkedabout technological ces digital person video record dvr pvr settop box like us tivo uk sky system allow people record store play pause forward wind tv programme want essentially technological allow much personalised tv also builtin highdefinition tv set big business japan us slower take europe lack highdefinition program can people forward wind advert can also forget abided network channel schedule put together alacarte entertain us network cable satellite companies worried mean term advertisers revenue well brand identification viewer  channel although us lead technological moment also concern raise europe particular grow uptake service like sky happen today will see nine month year time uk adam hume bbc broadcast futurologist told bbc news website like bbc issue lost advertisers revenue yet press issue moment commercial uk broadcast brand  importance everyone will talk content brand rather network brand said tim hanlon brand communicate firm starcom mediavest  broadband connected  can produce content ad challenge now hard promote programme much choice mean said stacey jolna senior vice preside tv guidance tv group way people find content want watch simplified tv viewer mean network us term channel take leaf google book search engine future instead schedule help people find want watch kind channel model might work younger ipod generated use take control gadgets play might suit everyone panel recognise older generated comfortable familiar schedule channel brand know get perhaps want much choice 
put hand mr hanlon suggest end kidney just diapers push buttons  everything possible available said mr hanlon ultimate consumer will tell market want new gadgets technological showcased ces manipulated enhance tvwatching experience highdefinition tv set everywhere manipulated new model lcd liquid crystal display tvs launch dvr capable built instead external box one example launch show humax inch lcd tv hour tivo dvr dvd record one us biggest satellite tv companies directtv even launch brand dvr show hour record capable instant replay search function set can pause rewind tv hour microsoft chief bill gates announce preshow keynote speech partnership tivo call tivotogo mean people can play record programme window pcs mobile device reflect increase trend free multimedia people can watch want want
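The sample output in step 4b contains "us saab tivo": the stray one-letter token "s" was expanded into a full dictionary word. The sketch below shows in base R how prefix-based completion can produce this. It assumes that completion picks the most frequent dictionary word starting with the stem. The helper `complete_stem` and the toy dictionary are hypothetical and are not the tm implementation.

```r
# Hypothetical helper: complete a stem to the most frequent dictionary
# word that starts with it; return the stem unchanged if nothing matches
# (that fallback is a choice made for this sketch).
complete_stem <- function(stem, dictionary) {
  candidates <- dictionary[startsWith(dictionary, stem)]
  if (length(candidates) == 0) return(stem)
  counts <- table(candidates)
  names(counts)[which.max(counts)]
}

# Toy dictionary; repeated entries stand for more frequent words.
dictionary <- c("technology", "technology", "technological",
                "companies", "saab")

stems <- c("technolog", "compani", "s", "tivo")
completed <- sapply(stems, complete_stem, dictionary = dictionary)
print(completed)
```

Very short stems match almost any prefix, so it may be worth dropping one-letter tokens before completion.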

Document-Term-Matrix (dtm)

dtm <- DocumentTermMatrix(docs)   
dtm 
## <<DocumentTermMatrix (documents: 100, terms: 4058)>>
## Non-/sparse entries: 12588/393212
## Sparsity           : 97%
## Maximal term length: 22
## Weighting          : term frequency (tf)

Term Document Matrix

tdm <- TermDocumentMatrix(docs)   
tdm 
## <<TermDocumentMatrix (terms: 4058, documents: 100)>>
## Non-/sparse entries: 12588/393212
## Sparsity           : 97%
## Maximal term length: 22
## Weighting          : term frequency (tf)
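The two matrices hold the same counts and differ only in orientation: one is the transpose of the other. The base-R sketch below builds both by hand for a toy corpus (the three documents are made up for illustration) and computes the sparsity figure as the share of zero cells.

```r
# Toy corpus (hypothetical documents, already cleaned and lower-cased).
docs_small <- c(d1 = "tv show tv", d2 = "market share", d3 = "tv market")

tokens <- strsplit(unname(docs_small), " ")
vocab  <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document: result is terms x documents.
tdm_small <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)
dimnames(tdm_small) <- list(Terms = vocab, Docs = names(docs_small))

# The document-term matrix is simply the transpose.
dtm_small <- t(tdm_small)

sparsity  <- mean(dtm_small == 0)   # share of empty cells
term_freq <- colSums(dtm_small)     # total count of each term in the corpus

print(dtm_small)
print(sparsity)
print(term_freq)
```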

Word frequencies

freq <- colSums(as.matrix(dtm))   
freq[1:20]
##      abided  accordance        adam      advert advertisers    alacarte 
##           2          12           1           3           6           1 
##       allow        also    although    announce      annual   available 
##          19          82           9          23          10           5 
##         bbc         big     biggest        bill        book         box 
##          37           9          10          10           8           5 
##       brand   broadband 
##           9          16

Sorted in descending order:

freq <- sort(freq, decreasing=TRUE)
freq[1:40]
##       said       will       year     people       also        new 
##        281        154        118         90         82         71 
##        one governance  companies       last        can    partial 
##         69         69         64         63         62         62 
##       time        use       make       game        now        say 
##         60         60         59         58         56         55 
##       want       call       firm       back      music       need 
##         54         53         53         51         49         48 
##       told       work     market       play        two      right 
##         47         47         46         46         46         46 
##      first       like      month        get       film       take 
##         46         45         45         44         44         41 
##    england       club       show      three 
##         41         40         39         39

Built-in function:

findFreqTerms(dtm, lowfreq=40) 
##  [1] "also"       "call"       "can"        "companies"  "firm"      
##  [6] "get"        "like"       "market"     "month"      "new"       
## [11] "now"        "one"        "people"     "play"       "said"      
## [16] "take"       "time"       "told"       "use"        "want"      
## [21] "will"       "work"       "year"       "last"       "make"      
## [26] "two"        "back"       "club"       "england"    "game"      
## [31] "say"        "film"       "governance" "partial"    "need"      
## [36] "right"      "first"      "music"
library(ggplot2)   
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
wf <- data.frame(word=names(freq), freq=freq)   
head(wf)  
##          word freq
## said     said  281
## will     will  154
## year     year  118
## people people   90
## also     also   82
## new       new   71
p <- ggplot(subset(wf, freq>30), aes(x = reorder(word, -freq), y = freq)) +
          geom_bar(stat = "identity") + 
          theme(axis.text.x=element_text(angle=90, hjust=1))
p   

Zipf's law
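Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, so log-frequency plotted against log-rank should fall close to a straight line with negative slope. The sketch below fits that line to the ten most frequent words from the sorted frequency listing in this report. Ten words from 100 documents are far too few for a reliable estimate, so treat it only as an illustration of the method.

```r
# Top ten term frequencies copied from the sorted listing in this report.
top_freq <- c(said = 281, will = 154, year = 118, people = 90, also = 82,
              new = 71, one = 69, governance = 69, companies = 64, last = 63)

rank <- seq_along(top_freq)

# Linear fit on the log-log scale: log(freq) = a + b * log(rank).
fit   <- lm(log(top_freq) ~ log(rank))
slope <- unname(coef(fit)[2])

print(slope)   # a negative slope is what Zipf's law predicts
```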

Removing the sparsest terms

dtms <- removeSparseTerms(dtm, 0.9) # drops terms that are absent from more than 90% of the documents
dtms
## <<DocumentTermMatrix (documents: 100, terms: 270)>>
## Non-/sparse entries: 4534/22466
## Sparsity           : 83%
## Maximal term length: 11
## Weighting          : term frequency (tf)
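To make the threshold concrete, here is a base-R sketch of the filtering idea on a toy matrix. It assumes the rule is "keep a term when it occurs in more than (1 - sparse) of the documents". The matrix and the variable names are hypothetical, and the code only approximates what `removeSparseTerms` does internally.

```r
# Toy document-term counts: 8 documents, 3 terms.
counts <- cbind(
  said   = c(1, 2, 1, 3, 0, 0, 0, 0),   # present in 4 of 8 documents
  kylie  = c(0, 0, 0, 0, 0, 0, 0, 4),   # present in 1 of 8 documents
  market = c(0, 1, 0, 0, 2, 0, 1, 0)    # present in 3 of 8 documents
)

sparse <- 0.75

# Per-term sparsity: share of documents in which the term does not occur.
term_sparsity <- colMeans(counts == 0)

# Keep terms that occur in more than (1 - sparse) of the documents.
keep    <- colSums(counts > 0) > nrow(counts) * (1 - sparse)
reduced <- counts[, keep, drop = FALSE]

print(term_sparsity)
print(colnames(reduced))
```

With `sparse = 0.75`, the term that appears in a single document is dropped and the other two remain.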

TF-IDF

Term Frequency - Inverse Document Frequency

\(\mathrm{idf}(\text{term}) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)\)

TF: the number of occurrences of a word in a document divided by the total number of words in that document.

IDF: a weight for the word derived from the number of documents in which it occurs. The fewer documents contain the word, the more important it is. Note that tm's weightTfIdf uses a base-2 logarithm, so its values differ from the natural-log formula above by a constant factor.
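The definitions can be checked by hand on a toy count matrix. The sketch below uses base R only and the natural-log form of idf given in this section. The matrix is made up for illustration.

```r
# Toy document-term counts (hypothetical).
counts <- matrix(c(2, 1, 0,
                   0, 1, 1,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"),
                                 c("tv", "said", "market")))

# TF: count divided by the total number of words in the document.
tf <- counts / rowSums(counts)

# IDF: ln(number of documents / number of documents containing the term).
doc_freq <- colSums(counts > 0)
idf      <- log(nrow(counts) / doc_freq)

# TF-IDF: multiply every column of tf by the idf of its term.
tf_idf <- sweep(tf, 2, idf, "*")

print(round(tf_idf, 3))
```

The term "said" occurs in every document, so its idf is ln(1) = 0 and its tf-idf is zero everywhere, while "market", found in a single document, gets the largest weight.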

tfidf_dtm <- DocumentTermMatrix(docs, control=list(weighting=weightTfIdf))   
tfidf_dtm 
## <<DocumentTermMatrix (documents: 100, terms: 4058)>>
## Non-/sparse entries: 12588/393212
## Sparsity           : 97%
## Maximal term length: 22
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
tfidf_dtms <- removeSparseTerms(tfidf_dtm, 0.9) # drops terms that are absent from more than 90% of the documents
tfidf_dtms
## <<DocumentTermMatrix (documents: 100, terms: 270)>>
## Non-/sparse entries: 4534/22466
## Sparsity           : 83%
## Maximal term length: 11
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
inspect(tfidf_dtms)
## <<DocumentTermMatrix (documents: 100, terms: 270)>>
## Non-/sparse entries: 4534/22466
## Sparsity           : 83%
## Maximal term length: 11
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##     Terms
## Docs   companies   economic       firm governance     match    partial
##   39 0.000000000 0.00000000 0.00000000 0.00000000 0.0000000 0.14394794
##   40 0.012866412 0.09842750 0.00000000 0.00000000 0.0000000 0.00000000
##   43 0.000000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##   52 0.000000000 0.00000000 0.00000000 0.05183380 0.0000000 0.00000000
##   53 0.009649809 0.00000000 0.01111111 0.00000000 0.0000000 0.00000000
##   65 0.032166030 0.03075859 0.05555556 0.06067846 0.0000000 0.02832309
##   68 0.000000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##   7  0.000000000 0.00000000 0.00000000 0.07746186 0.0000000 0.02169428
##   90 0.000000000 0.13503773 0.00000000 0.00000000 0.0000000 0.00000000
##   91 0.000000000 0.00000000 0.00000000 0.00000000 0.1634523 0.00000000
##     Terms
## Docs      people      price      share       will
##   39 0.006446495 0.00000000 0.00000000 0.00000000
##   40 0.000000000 0.02358833 0.00000000 0.00000000
##   43 0.000000000 0.14622358 0.00000000 0.00000000
##   52 0.005572394 0.00000000 0.00000000 0.01129790
##   53 0.000000000 0.01769125 0.04727502 0.00000000
##   65 0.000000000 0.00000000 0.15758340 0.01234400
##   68 0.007209896 0.00000000 0.00000000 0.02046506
##   7  0.000000000 0.00000000 0.00000000 0.01890997
##   90 0.000000000 0.02588963 0.00000000 0.00000000
##   91 0.000000000 0.00000000 0.00000000 0.01017674

Frequency analysis based on the tf-idf matrix:

tf_freq <- colSums(as.matrix(tfidf_dtms))   
head(tf_freq)
## accordance      allow       also   announce        bbc    biggest 
##  0.2013956  0.2225343  0.4071121  0.2969897  0.4196205  0.1836695

Sorted in descending order:

tf_freq <- sort(tf_freq, decreasing=TRUE)
tf_freq[1:40]
##    partial      share   economic     people governance      price 
##  0.8484945  0.7599005  0.7475764  0.7370956  0.6591801  0.6547498 
##       firm      match  companies       will       game       club 
##  0.6277491  0.6165783  0.6115855  0.6092822  0.6064244  0.6009927 
##    england     market        win       year      offer    quarter 
##  0.6007431  0.5870324  0.5442894  0.5321437  0.5093472  0.5084403 
##      month       star     player       play      right     return 
##  0.4998237  0.4989058  0.4898740  0.4890583  0.4849442  0.4804559 
##       hope      music    service       call        new   internal 
##  0.4784270  0.4769572  0.4768794  0.4708966  0.4702842  0.4682189 
##      first        can        use      house       told      final 
##  0.4661323  0.4629000  0.4608068  0.4595336  0.4571241  0.4476756 
##      award       make        say       week 
##  0.4473094  0.4439823  0.4430635  0.4419193
library(ggplot2)   

wf <- data.frame(word=names(tf_freq), freq=tf_freq)   
head(wf)  
##                  word      freq
## partial       partial 0.8484945
## share           share 0.7599005
## economic     economic 0.7475764
## people         people 0.7370956
## governance governance 0.6591801
## price           price 0.6547498
p <- ggplot(subset(wf, freq>0.4), aes(x = reorder(word, -freq), y = freq)) +
          geom_bar(stat = "identity") + 
          theme(axis.text.x=element_text(angle=90, hjust=1))
p   

Association analysis: relationships between words

findAssocs(dtm, c("governance" , "market"), corlimit=0.8) # report only terms whose correlation is at least 0.8
## $governance
##       allotting           angst      appreciate            aren 
##            0.88            0.88            0.88            0.88 
##         avenues    bandsartists             bed      beneficial 
##            0.88            0.88            0.88            0.88 
##           cheap          chopin  commercialised       convinced 
##            0.88            0.88            0.88            0.88 
##        daunting       equipment             etc        exercise 
##            0.88            0.88            0.88            0.88 
##          extent       ferdinand        flourish            fork 
##            0.88            0.88            0.88            0.88 
##           franz    fraternities        frontman       gorbachev 
##            0.88            0.88            0.88            0.88 
##            hate       hostility       idolstyle        idoltype 
##            0.88            0.88            0.88            0.88 
##        kapranos            korn           kylie          louder 
##            0.88            0.88            0.88            0.88 
##       macmillan       megastars           merit         minogue 
##            0.88            0.88            0.88            0.88 
##          modern          moulds        musician          napalm 
##            0.88            0.88            0.88            0.88 
##             nea         nirvana        penalise         pockets 
##            0.88            0.88            0.88            0.88 
##       pollution     prestigious privatelyfunded           raked 
##            0.88            0.88            0.88            0.88 
##      rediscover       reinforce         riddled         scissor 
##            0.88            0.88            0.88            0.88 
##  selfsufficient         shouldn           smell        solution 
##            0.88            0.88            0.88            0.88 
##          soviet     sponsorship     statefunded     stereotypes 
##            0.88            0.88            0.88            0.88 
##      subsidiary     subsidising      sustenance             tea 
##            0.88            0.88            0.88            0.88 
##          thrive          thumbs          travis       twiddling 
##            0.88            0.88            0.88            0.88 
##        upcoming          wagner           waste      wealthiest 
##            0.88            0.88            0.88            0.88 
##      whatsoever            yeah             yes          pursue 
##            0.88            0.88            0.88            0.86 
##          listen           grant            alex             art 
##            0.85            0.84            0.83            0.83 
##           music         lecture            fund           scrap 
##            0.82            0.82            0.81            0.81 
## 
## $market
## numeric(0)
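The scores can be read as correlations between the count vectors of two terms across documents. The base-R sketch below reproduces the idea on a toy matrix. It assumes ordinary Pearson correlation, and the counts are invented. It also shows one plausible reason (not verified against this corpus) why so many terms share the identical score 0.88: terms that occur only in the same single document have proportional count vectors, so their correlation with any other term is equal.

```r
# Toy document-term counts: 5 documents, 4 terms (hypothetical).
toy <- cbind(
  governance = c(6, 1, 0, 1, 0),
  kylie      = c(1, 0, 0, 0, 0),   # occurs only in document 1
  nirvana    = c(2, 0, 0, 0, 0),   # occurs only in document 1
  market     = c(0, 2, 1, 0, 3)
)

# Correlation of every term with "governance".
assoc <- cor(toy)["governance", ]
assoc <- round(assoc[names(assoc) != "governance"], 2)

# Keep only associations at or above the limit.
corlimit <- 0.8
strong   <- assoc[assoc >= corlimit]

print(assoc)
print(strong)
```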

Example visualization

library(wordcloud)
## Loading required package: RColorBrewer
set.seed(142)   
wordcloud(names(freq), freq, min.freq=25) 

Clustering

library(cluster)  

dtmss <- removeSparseTerms(tdm, 0.9) # drops terms that are absent from more than 90% of the documents
dtmss
## <<TermDocumentMatrix (terms: 270, documents: 100)>>
## Non-/sparse entries: 4534/22466
## Sparsity           : 83%
## Maximal term length: 11
## Weighting          : term frequency (tf)
d <- dist(t(dtmss), method="euclidean")   # distances between documents
fit <- hclust(d=d, method="complete")   # for a different look try substituting: method="ward.D"


plot(fit, hang=-1)   

plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=6)   # "k=" sets the number of clusters
rect.hclust(fit, k=6, border="red") # draw red borders around the 6 clusters on the dendrogram

library(fpc)   
d <- dist(t(dtmss), method="euclidean")   
kfit <- kmeans(d, 4)   
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)   
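The same hierarchical workflow (distance matrix, `hclust`, `cutree`) can be tried on a matrix small enough to check by eye. The sketch below uses base R only. The four toy documents are invented so that two groups are obvious.

```r
# Toy document-term counts: rows are documents, columns are terms.
m <- rbind(
  d1 = c(5, 0, 1),
  d2 = c(4, 1, 0),
  d3 = c(0, 6, 5),
  d4 = c(1, 5, 6)
)
colnames(m) <- c("match", "market", "share")

# Euclidean distances between documents, then complete-linkage clustering.
d_small   <- dist(m, method = "euclidean")
fit_small <- hclust(d_small, method = "complete")

# Cut the tree into two clusters; the result maps each document to a cluster id.
groups_small <- cutree(fit_small, k = 2)
print(groups_small)
```

Documents d1 and d2 end up in one cluster and d3 and d4 in the other, because distances within each pair are much smaller than distances across pairs.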

The quanteda library

library(quanteda)
## Package version: 1.3.4
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
## 
##     View
qdocs <-corpus(docs_df)
summary(qdocs)
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Corpus consisting of 2225 documents, showing 100 documents:
## 
##  Text Types Tokens Sentences      category
##     1   352    775         1          tech
##     2   180    321         1      business
##     3   146    257         1         sport
##     4   189    363         1         sport
##     5   177    288         1 entertainment
##     6   295    659         1      politics
##     7   158    284         1      politics
##     8   126    203         1         sport
##     9   110    166         1         sport
##    10   135    249         1 entertainment
##    11   177    323         1 entertainment
##    12   127    223         1      business
##    13   187    363         1      business
##    14   165    311         1      politics
##    15   242    493         1         sport
##    16   175    314         1      business
##    17   215    436         1      politics
##    18   216    459         1         sport
##    19   163    313         1      business
##    20   217    490         1          tech
##    21   115    229         1          tech
##    22   193    367         1          tech
##    23   303    649         1         sport
##    24    99    160         1         sport
##    25   302    703         1          tech
##    26   146    226         1         sport
##    27   146    263         1 entertainment
##    28   163    318         1          tech
##    29   285    609         1      politics
##    30   220    419         1 entertainment
##    31   262    512         1      politics
##    32   307    670         1          tech
##    33   128    235         1 entertainment
##    34   183    318         1 entertainment
##    35   111    157         1      business
##    36   143    275         1      politics
##    37   112    207         1          tech
##    38   180    336         1 entertainment
##    39   242    537         1      politics
##    40   155    273         1      business
##    41   224    389         1      politics
##    42   194    368         1         sport
##    43   208    439         1      business
##    44   114    190         1         sport
##    45   180    328         1          tech
##    46   541   1363         5 entertainment
##    47   236    524         1      politics
##    48   250    612         1      politics
##    49   128    221         1      politics
##    50   121    168         1      business
##    51   140    252         1         sport
##    52   270    563         1      politics
##    53   162    324         1      business
##    54   173    293         1      business
##    55   152    316         1         sport
##    56   114    190         1      politics
##    57   278    537         1      business
##    58   317    676         1         sport
##    59   261    463         1         sport
##    60   262    521         1      business
##    61   261    492         1      business
##    62   176    347         1         sport
##    63   150    247         1      business
##    64   120    181         1         sport
##    65   108    195         1      business
##    66   231    448         1          tech
##    67   172    289         1      business
##    68   189    426         1 entertainment
##    69   142    248         1          tech
##    70   195    379         1      business
##    71   274    648         1      politics
##    72   141    227         1      business
##    73   258    481         1      politics
##    74   132    216         1         sport
##    75   118    203         1      business
##    76   329    640         2          tech
##    77   211    444         1      business
##    78   252    554         1         sport
##    79   130    203         1         sport
##    80   164    288         1      business
##    81   144    248         1      business
##    82   117    186         1         sport
##    83   173    332         1      politics
##    84   140    253         1      business
##    85   149    242         1 entertainment
##    86   165    311         1      politics
##    87   192    344         1      politics
##    88   185    352         1      business
##    89   302    571         2 entertainment
##    90   136    235         1      business
##    91   154    267         1         sport
##    92    94    134         1         sport
##    93   285    687         1      politics
##    94   140    231         1         sport
##    95   233    516         1      politics
##    96    97    140         1         sport
##    97   156    255         1      business
##    98   339    683         1         sport
##    99   124    202         1      business
##   100   292    622         1 entertainment
## 
## Source: C:/Users/mmazurek/Documents/RWorkDir/RDemo/* on x86-64 by mmazurek
## Created: Thu Jan 03 13:06:42 2019
## Notes:
qdtm<-dfm(qdocs)


doc_freq<-docfreq(qdtm)
qdtm_tfidf<-dfm_tfidf(qdtm)
qdtm_tfidf[1,'tv']
## Document-feature matrix of: 1 document, 1 feature (0% sparse).
## 1 x 1 sparse Matrix of class "dfm"
##     features
## docs       tv
##    1 12.47801
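df <- doc_freq['tv']   # number of documents that contain "tv"

# Relative term frequency: count of "tv" in document 1 divided by the
# total number of tokens in document 1.
tf <- as.matrix(qdtm[1, 'tv'])[1, 1] / rowSums(qdtm)[1]

idf <- log(2225 / df)   # 2225 = number of documents in the corpus

# Note: dfm_tfidf() by default uses raw counts and a base-10 logarithm,
# so this value is not expected to equal qdtm_tfidf[1, 'tv'].
tf_idf <- tf * idf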

Text classification

Unsupervised learning: topic extraction (LDA, Latent Dirichlet Allocation)

NLP - Natural Language Processing

Lemmatization

Part-of-speech (POS) tagging

NER: Named Entity Recognition

Dependency parsing

N-grams
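An n-gram is a sequence of n consecutive tokens. As a minimal illustration, the base-R sketch below builds word bigrams from the first tokens of document 1. The helper name `make_ngrams` is hypothetical and is not part of any package used in this report.

```r
# Hypothetical helper: return all runs of n consecutive tokens as strings.
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens  <- c("tv", "future", "in", "the", "hands")
bigrams <- make_ngrams(tokens, n = 2)
print(bigrams)
```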

The tidytext library

library(tidytext)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
text_df <- tibble(doc=1:nrow(docs_df),  text=docs_df$text)
text_df
## # A tibble: 2,225 x 2
##      doc text                                                             
##    <int> <chr>                                                            
##  1     1 tv future in the hands of viewers with home theatre systems  pla~
##  2     2 worldcom boss  left books alone  former worldcom boss bernie ebb~
##  3     3 tigers wary of farrell  gamble  leicester say they will not be r~
##  4     4 yeading face newcastle in fa cup premiership side newcastle unit~
##  5     5 ocean s twelve raids box office ocean s twelve  the crime caper ~
##  6     6 howard hits back at mongrel jibe michael howard has said a claim~
##  7     7 blair prepares to name poll date tony blair is likely to name 5 ~
##  8     8 henman hopes ended in dubai third seed tim henman slumped to a s~
##  9     9 wilkinson fit to face edinburgh england captain jonny wilkinson ~
## 10    10 last star wars  not for children  the sixth and final star wars ~
## # ... with 2,215 more rows
text_df %>%
  unnest_tokens(word, text)
## # A tibble: 873,994 x 2
##      doc word   
##    <int> <chr>  
##  1     1 tv     
##  2     1 future 
##  3     1 in     
##  4     1 the    
##  5     1 hands  
##  6     1 of     
##  7     1 viewers
##  8     1 with   
##  9     1 home   
## 10     1 theatre
## # ... with 873,984 more rows
data("stop_words")
stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows
cleaned_texts<-text_df %>%
  unnest_tokens(word, text)%>% anti_join(stop_words)
## Joining, by = "word"
cleaned_texts%>% count(word, sort=TRUE)
## # A tibble: 29,827 x 2
##    word           n
##    <chr>      <int>
##  1 people      2045
##  2 time        1322
##  3 world       1201
##  4 government  1160
##  5 uk          1104
##  6 told         911
##  7 film         890
##  8 game         871
##  9 music        839
## 10 000          804
## # ... with 29,817 more rows
text_words <- cleaned_texts%>%
  count(doc, word, sort = TRUE) %>%
  ungroup()



total_words <- text_words %>% 
  group_by(doc) %>% 
  summarize(total = sum(n))

text_words <- left_join(text_words, total_words)
## Joining, by = "doc"
text_words
## # A tibble: 283,412 x 4
##      doc word        n total
##    <int> <chr>   <int> <int>
##  1   866 music      71   912
##  2  1616 song       65  1340
##  3  1928 roddick    53   669
##  4   866 urban      52   912
##  5   678 wage       51  1265
##  6   678 minimum    47  1265
##  7  1928 nadal      46   669
##  8  1605 kilroy     44   962
##  9   409 forsyth    37  1671
## 10  1605 silk       36   962
## # ... with 283,402 more rows
freq_by_rank <- text_words %>% 
  group_by(doc) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)

freq_by_rank
## # A tibble: 283,412 x 6
## # Groups:   doc [2,225]
##      doc word        n total  rank `term frequency`
##    <int> <chr>   <int> <int> <int>            <dbl>
##  1   866 music      71   912     1           0.0779
##  2  1616 song       65  1340     1           0.0485
##  3  1928 roddick    53   669     1           0.0792
##  4   866 urban      52   912     2           0.0570
##  5   678 wage       51  1265     1           0.0403
##  6   678 minimum    47  1265     2           0.0372
##  7  1928 nadal      46   669     2           0.0688
##  8  1605 kilroy     44   962     1           0.0457
##  9   409 forsyth    37  1671     1           0.0221
## 10  1605 silk       36   962     2           0.0374
## # ... with 283,402 more rows
library(ggplot2)

cleaned_texts %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

# Compute tf-idf for every (document, word) pair from the per-document counts
text_words %>% bind_tf_idf(word, doc, n) %>% arrange(desc(tf_idf))