Marcin Mazurek
2018-12-12
The tidytext library
https://www.tidytextmining.com/
Source materials
Data source: bbc-text.csv
https://www.kaggle.com/yufengdev/bbc-fulltext-and-category
Creating a corpus from a data frame:
library(tm)
## Loading required package: NLP
library(data.table)
docs_df <- read.csv("bbc-text.csv", encoding = "UTF-8")
docs_df$text<-as.character(docs_df$text)
# Remove non-ASCII characters
docs_df$text <- gsub("[^\x20-\x7E]", "", docs_df$text)
docs_df$doc_id <-as.numeric(rownames(docs_df))
docs <- SimpleCorpus(DataframeSource(docs_df[1:100, c('doc_id', 'text')]))
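The non-ASCII filter above keeps only the printable ASCII range (0x20 to 0x7E). A minimal base-R check on a made-up string (the sample text is an assumption, not taken from bbc-text.csv) suggests what gets dropped: accented letters, non-ASCII currency symbols such as the pound sign, and control characters such as tabs.

```r
# Hypothetical sample string; not taken from bbc-text.csv.
sample_text <- "caf\u00e9 costs \u00a35.8bn\tapprox."

# Same pattern as in the report: drop everything outside printable ASCII
# (space to tilde). Tabs and newlines fall outside that range as well.
cleaned <- gsub("[^\x20-\x7E]", "", sample_text)
print(cleaned)
```

Because tabs and newlines are removed without a replacement, neighbouring words can be glued together; replacing with a space instead of an empty string would be one way to avoid that.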
library(tm.plugin.webmining)
##
## Attaching package: 'tm.plugin.webmining'
## The following object is masked from 'package:base':
##
## parse
Checking that the first element of the document corpus loaded correctly:
writeLines(as.character(docs[[1]][[1]]))
## tv future in the hands of viewers with home theatre systems plasma high-definition tvs and digital video recorders moving into the living room the way people watch tv will be radically different in five years time. that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend programmes and other content will be delivered to viewers via home networks through cable satellite telecoms companies and broadband service providers to front rooms and portable devices. one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes like the us s tivo and the uk s sky+ system allow people to record store play pause and forward wind tv programmes when they want. essentially the technology allows for much more personalised tv. they are also being built-in to high-definition tv sets which are big business in japan and the us but slower to take off in europe because of the lack of high-definition programming. not only can people forward wind through adverts they can also forget about abiding by network and channel schedules putting together their own a-la-carte entertainment. but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as brand identity and viewer loyalty to channels. although the us leads in this technology at the moment it is also a concern that is being raised in europe particularly with the growing uptake of services like sky+. what happens here today we will see in nine months to a years time in the uk adam hume the bbc broadcast s futurologist told the bbc news website. for the likes of the bbc there are no issues of lost advertising revenue yet. 
it is a more pressing issue at the moment for commercial uk broadcasters but brand loyalty is important for everyone. we will be talking more about content brands rather than network brands said tim hanlon from brand communications firm starcom mediavest. the reality is that with broadband connections anybody can be the producer of content. he added: the challenge now is that it is hard to promote a programme with so much choice. what this means said stacey jolna senior vice president of tv guide tv group is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks in us terms or channels could take a leaf out of google s book and be the search engine of the future instead of the scheduler to help people find what they want to watch. this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them. but it might not suit everyone the panel recognised. older generations are more comfortable with familiar schedules and channel brands because they know what they are getting. they perhaps do not want so much of the choice put into their hands mr hanlon suggested. on the other end you have the kids just out of diapers who are pushing buttons already - everything is possible and available to them said mr hanlon. ultimately the consumer will tell the market they want. of the 50 000 new gadgets and technologies being showcased at ces many of them are about enhancing the tv-watching experience. high-definition tv sets are everywhere and many new models of lcd (liquid crystal display) tvs have been launched with dvr capability built into them instead of being external boxes. one such example launched at the show is humax s 26-inch lcd tv with an 80-hour tivo dvr and dvd recorder. 
one of the us s biggest satellite tv companies directtv has even launched its own branded dvr at the show with 100-hours of recording capability instant replay and a search function. the set can pause and rewind tv for up to 90 hours. and microsoft chief bill gates announced in his pre-show keynote speech a partnership with tivo called tivotogo which means people can play recorded programmes on windows pcs and mobile devices. all these reflect the increasing trend of freeing up multimedia so that people can watch what they want when they want.
The second element and its metadata:
writeLines(as.character(docs[[2]][[1]]))
## worldcom boss left books alone former worldcom boss bernie ebbers who is accused of overseeing an $11bn (5.8bn) fraud never made accounting decisions a witness has told jurors. david myers made the comments under questioning by defence lawyers who have been arguing that mr ebbers was not responsible for worldcom s problems. the phone company collapsed in 2002 and prosecutors claim that losses were hidden to protect the firm s shares. mr myers has already pleaded guilty to fraud and is assisting prosecutors. on monday defence lawyer reid weingarten tried to distance his client from the allegations. during cross examination he asked mr myers if he ever knew mr ebbers make an accounting decision . not that i am aware of mr myers replied. did you ever know mr ebbers to make an accounting entry into worldcom books mr weingarten pressed. no replied the witness. mr myers has admitted that he ordered false accounting entries at the request of former worldcom chief financial officer scott sullivan. defence lawyers have been trying to paint mr sullivan who has admitted fraud and will testify later in the trial as the mastermind behind worldcom s accounting house of cards. mr ebbers team meanwhile are looking to portray him as an affable boss who by his own admission is more pe graduate than economist. whatever his abilities mr ebbers transformed worldcom from a relative unknown into a $160bn telecoms giant and investor darling of the late 1990s. worldcom s problems mounted however as competition increased and the telecoms boom petered out. when the firm finally collapsed shareholders lost about $180bn and 20 000 workers lost their jobs. mr ebbers trial is expected to last two months and if found guilty the former ceo faces a substantial jail sentence. he has firmly declared his innocence.
# Metadata:
docs[[2]][[2]]
## author : character(0)
## datetimestamp: 2019-01-03 12:06:13
## description : character(0)
## heading : character(0)
## id : 2
## language : en
## origin : character(0)
docs <- tm_map(docs, tolower)
writeLines(as.character(docs[1][[1]]))
## tv future in the hands of viewers with home theatre systems plasma high-definition tvs and digital video recorders moving into the living room the way people watch tv will be radically different in five years time. that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend programmes and other content will be delivered to viewers via home networks through cable satellite telecoms companies and broadband service providers to front rooms and portable devices. one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes like the us s tivo and the uk s sky+ system allow people to record store play pause and forward wind tv programmes when they want. essentially the technology allows for much more personalised tv. they are also being built-in to high-definition tv sets which are big business in japan and the us but slower to take off in europe because of the lack of high-definition programming. not only can people forward wind through adverts they can also forget about abiding by network and channel schedules putting together their own a-la-carte entertainment. but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as brand identity and viewer loyalty to channels. although the us leads in this technology at the moment it is also a concern that is being raised in europe particularly with the growing uptake of services like sky+. what happens here today we will see in nine months to a years time in the uk adam hume the bbc broadcast s futurologist told the bbc news website. for the likes of the bbc there are no issues of lost advertising revenue yet. 
it is a more pressing issue at the moment for commercial uk broadcasters but brand loyalty is important for everyone. we will be talking more about content brands rather than network brands said tim hanlon from brand communications firm starcom mediavest. the reality is that with broadband connections anybody can be the producer of content. he added: the challenge now is that it is hard to promote a programme with so much choice. what this means said stacey jolna senior vice president of tv guide tv group is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks in us terms or channels could take a leaf out of google s book and be the search engine of the future instead of the scheduler to help people find what they want to watch. this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them. but it might not suit everyone the panel recognised. older generations are more comfortable with familiar schedules and channel brands because they know what they are getting. they perhaps do not want so much of the choice put into their hands mr hanlon suggested. on the other end you have the kids just out of diapers who are pushing buttons already - everything is possible and available to them said mr hanlon. ultimately the consumer will tell the market they want. of the 50 000 new gadgets and technologies being showcased at ces many of them are about enhancing the tv-watching experience. high-definition tv sets are everywhere and many new models of lcd (liquid crystal display) tvs have been launched with dvr capability built into them instead of being external boxes. one such example launched at the show is humax s 26-inch lcd tv with an 80-hour tivo dvr and dvd recorder. 
one of the us s biggest satellite tv companies directtv has even launched its own branded dvr at the show with 100-hours of recording capability instant replay and a search function. the set can pause and rewind tv for up to 90 hours. and microsoft chief bill gates announced in his pre-show keynote speech a partnership with tivo called tivotogo which means people can play recorded programmes on windows pcs and mobile devices. all these reflect the increasing trend of freeing up multimedia so that people can watch what they want when they want.
docs <- tm_map(docs,removePunctuation)
writeLines(as.character(docs[1][[1]]))
## tv future in the hands of viewers with home theatre systems plasma highdefinition tvs and digital video recorders moving into the living room the way people watch tv will be radically different in five years time that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes with the us leading the trend programmes and other content will be delivered to viewers via home networks through cable satellite telecoms companies and broadband service providers to front rooms and portable devices one of the most talkedabout technologies of ces has been digital and personal video recorders dvr and pvr these settop boxes like the us s tivo and the uk s sky system allow people to record store play pause and forward wind tv programmes when they want essentially the technology allows for much more personalised tv they are also being builtin to highdefinition tv sets which are big business in japan and the us but slower to take off in europe because of the lack of highdefinition programming not only can people forward wind through adverts they can also forget about abiding by network and channel schedules putting together their own alacarte entertainment but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as brand identity and viewer loyalty to channels although the us leads in this technology at the moment it is also a concern that is being raised in europe particularly with the growing uptake of services like sky what happens here today we will see in nine months to a years time in the uk adam hume the bbc broadcast s futurologist told the bbc news website for the likes of the bbc there are no issues of lost advertising revenue yet it is a more pressing issue at the moment for commercial uk broadcasters but brand loyalty is important 
for everyone we will be talking more about content brands rather than network brands said tim hanlon from brand communications firm starcom mediavest the reality is that with broadband connections anybody can be the producer of content he added the challenge now is that it is hard to promote a programme with so much choice what this means said stacey jolna senior vice president of tv guide tv group is that the way people find the content they want to watch has to be simplified for tv viewers it means that networks in us terms or channels could take a leaf out of google s book and be the search engine of the future instead of the scheduler to help people find what they want to watch this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them but it might not suit everyone the panel recognised older generations are more comfortable with familiar schedules and channel brands because they know what they are getting they perhaps do not want so much of the choice put into their hands mr hanlon suggested on the other end you have the kids just out of diapers who are pushing buttons already everything is possible and available to them said mr hanlon ultimately the consumer will tell the market they want of the 50 000 new gadgets and technologies being showcased at ces many of them are about enhancing the tvwatching experience highdefinition tv sets are everywhere and many new models of lcd liquid crystal display tvs have been launched with dvr capability built into them instead of being external boxes one such example launched at the show is humax s 26inch lcd tv with an 80hour tivo dvr and dvd recorder one of the us s biggest satellite tv companies directtv has even launched its own branded dvr at the show with 100hours of recording capability instant replay and a search function the set can pause and rewind tv for up to 90 hours and microsoft chief bill gates announced in his preshow 
keynote speech a partnership with tivo called tivotogo which means people can play recorded programmes on windows pcs and mobile devices all these reflect the increasing trend of freeing up multimedia so that people can watch what they want when they want
docs <- tm_map(docs,removeNumbers)
writeLines(as.character(docs[1][[1]]))
## tv future in the hands of viewers with home theatre systems plasma highdefinition tvs and digital video recorders moving into the living room the way people watch tv will be radically different in five years time that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes with the us leading the trend programmes and other content will be delivered to viewers via home networks through cable satellite telecoms companies and broadband service providers to front rooms and portable devices one of the most talkedabout technologies of ces has been digital and personal video recorders dvr and pvr these settop boxes like the us s tivo and the uk s sky system allow people to record store play pause and forward wind tv programmes when they want essentially the technology allows for much more personalised tv they are also being builtin to highdefinition tv sets which are big business in japan and the us but slower to take off in europe because of the lack of highdefinition programming not only can people forward wind through adverts they can also forget about abiding by network and channel schedules putting together their own alacarte entertainment but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as brand identity and viewer loyalty to channels although the us leads in this technology at the moment it is also a concern that is being raised in europe particularly with the growing uptake of services like sky what happens here today we will see in nine months to a years time in the uk adam hume the bbc broadcast s futurologist told the bbc news website for the likes of the bbc there are no issues of lost advertising revenue yet it is a more pressing issue at the moment for commercial uk broadcasters but brand loyalty is important 
for everyone we will be talking more about content brands rather than network brands said tim hanlon from brand communications firm starcom mediavest the reality is that with broadband connections anybody can be the producer of content he added the challenge now is that it is hard to promote a programme with so much choice what this means said stacey jolna senior vice president of tv guide tv group is that the way people find the content they want to watch has to be simplified for tv viewers it means that networks in us terms or channels could take a leaf out of google s book and be the search engine of the future instead of the scheduler to help people find what they want to watch this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them but it might not suit everyone the panel recognised older generations are more comfortable with familiar schedules and channel brands because they know what they are getting they perhaps do not want so much of the choice put into their hands mr hanlon suggested on the other end you have the kids just out of diapers who are pushing buttons already everything is possible and available to them said mr hanlon ultimately the consumer will tell the market they want of the new gadgets and technologies being showcased at ces many of them are about enhancing the tvwatching experience highdefinition tv sets are everywhere and many new models of lcd liquid crystal display tvs have been launched with dvr capability built into them instead of being external boxes one such example launched at the show is humax s inch lcd tv with an hour tivo dvr and dvd recorder one of the us s biggest satellite tv companies directtv has even launched its own branded dvr at the show with hours of recording capability instant replay and a search function the set can pause and rewind tv for up to hours and microsoft chief bill gates announced in his preshow keynote speech a 
partnership with tivo called tivotogo which means people can play recorded programmes on windows pcs and mobile devices all these reflect the increasing trend of freeing up multimedia so that people can watch what they want when they want
docs <- tm_map(docs, removeWords, c(stopwords("english"), 's'))
writeLines(as.character(docs[1][[1]]))
## tv future hands viewers home theatre systems plasma highdefinition tvs digital video recorders moving living room way people watch tv will radically different five years time according expert panel gathered annual consumer electronics show las vegas discuss new technologies will impact one favourite pastimes us leading trend programmes content will delivered viewers via home networks cable satellite telecoms companies broadband service providers front rooms portable devices one talkedabout technologies ces digital personal video recorders dvr pvr settop boxes like us tivo uk sky system allow people record store play pause forward wind tv programmes want essentially technology allows much personalised tv also builtin highdefinition tv sets big business japan us slower take europe lack highdefinition programming can people forward wind adverts can also forget abiding network channel schedules putting together alacarte entertainment us networks cable satellite companies worried means terms advertising revenues well brand identity viewer loyalty channels although us leads technology moment also concern raised europe particularly growing uptake services like sky happens today will see nine months years time uk adam hume bbc broadcast futurologist told bbc news website likes bbc issues lost advertising revenue yet pressing issue moment commercial uk broadcasters brand loyalty important everyone will talking content brands rather network brands said tim hanlon brand communications firm starcom mediavest reality broadband connections anybody can producer content added challenge now hard promote programme much choice means said stacey jolna senior vice president tv guide tv group way people find content want watch simplified tv viewers means networks us terms channels take leaf google book search engine future instead scheduler help people find want watch kind channel model might work younger ipod generation used taking control 
gadgets play might suit everyone panel recognised older generations comfortable familiar schedules channel brands know getting perhaps want much choice put hands mr hanlon suggested end kids just diapers pushing buttons already everything possible available said mr hanlon ultimately consumer will tell market want new gadgets technologies showcased ces many enhancing tvwatching experience highdefinition tv sets everywhere many new models lcd liquid crystal display tvs launched dvr capability built instead external boxes one example launched show humax inch lcd tv hour tivo dvr dvd recorder one us biggest satellite tv companies directtv even launched branded dvr show hours recording capability instant replay search function set can pause rewind tv hours microsoft chief bill gates announced preshow keynote speech partnership tivo called tivotogo means people can play recorded programmes windows pcs mobile devices reflect increasing trend freeing multimedia people can watch want want
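The cleaning steps above (lower-casing, removing punctuation, removing numbers, removing stop words) can be approximated in base R on a toy sentence. This is only a sketch of their effect, not tm's implementation: the sentence and the two-word stop list are made up for illustration, and the final whitespace squeeze is an extra step that the report does not apply.

```r
# Hypothetical input and a tiny stand-in for c(stopwords("english"), "s").
x <- "The US s 26-inch high-definition TVs cost 50 000 units."
toy_stopwords <- c("the", "s")

x <- tolower(x)
x <- gsub("[[:punct:]]+", "", x)   # "high-definition" becomes "highdefinition"
x <- gsub("[[:digit:]]+", "", x)   # "26inch" becomes "inch", "50 000" vanishes
# Remove stop words only when they stand alone (word boundaries on both sides).
stop_pattern <- sprintf("\\b(%s)\\b", paste(toy_stopwords, collapse = "|"))
x <- gsub(stop_pattern, "", x, perl = TRUE)
x <- gsub("\\s+", " ", trimws(x))  # extra step: collapse leftover whitespace
print(x)
```

The word-boundary anchors matter: without them, removing "s" would also strip the letter from words such as "tvs" and "units".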
4a. Word stems
A copy of the corpus for later use
dict_from_docs<- docs
writeLines(as.character(dict_from_docs[[1]][1]))
## tv future hands viewers home theatre systems plasma highdefinition tvs digital video recorders moving living room way people watch tv will radically different five years time according expert panel gathered annual consumer electronics show las vegas discuss new technologies will impact one favourite pastimes us leading trend programmes content will delivered viewers via home networks cable satellite telecoms companies broadband service providers front rooms portable devices one talkedabout technologies ces digital personal video recorders dvr pvr settop boxes like us tivo uk sky system allow people record store play pause forward wind tv programmes want essentially technology allows much personalised tv also builtin highdefinition tv sets big business japan us slower take europe lack highdefinition programming can people forward wind adverts can also forget abiding network channel schedules putting together alacarte entertainment us networks cable satellite companies worried means terms advertising revenues well brand identity viewer loyalty channels although us leads technology moment also concern raised europe particularly growing uptake services like sky happens today will see nine months years time uk adam hume bbc broadcast futurologist told bbc news website likes bbc issues lost advertising revenue yet pressing issue moment commercial uk broadcasters brand loyalty important everyone will talking content brands rather network brands said tim hanlon brand communications firm starcom mediavest reality broadband connections anybody can producer content added challenge now hard promote programme much choice means said stacey jolna senior vice president tv guide tv group way people find content want watch simplified tv viewers means networks us terms channels take leaf google book search engine future instead scheduler help people find want watch kind channel model might work younger ipod generation used taking 
control gadgets play might suit everyone panel recognised older generations comfortable familiar schedules channel brands know getting perhaps want much choice put hands mr hanlon suggested end kids just diapers pushing buttons already everything possible available said mr hanlon ultimately consumer will tell market want new gadgets technologies showcased ces many enhancing tvwatching experience highdefinition tv sets everywhere many new models lcd liquid crystal display tvs launched dvr capability built instead external boxes one example launched show humax inch lcd tv hour tivo dvr dvd recorder one us biggest satellite tv companies directtv even launched branded dvr show hours recording capability instant replay search function set can pause rewind tv hours microsoft chief bill gates announced preshow keynote speech partnership tivo called tivotogo means people can play recorded programmes windows pcs mobile devices reflect increasing trend freeing multimedia people can watch want want
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[[1]][1]))
## tv futur hand viewer home theatr system plasma highdefinit tvs digit video record move live room way peopl watch tv will radic differ five year time accord expert panel gather annual consum electron show las vega discuss new technolog will impact one favourit pastim us lead trend programm content will deliv viewer via home network cabl satellit telecom compani broadband servic provid front room portabl devic one talkedabout technolog ces digit person video record dvr pvr settop box like us tivo uk sky system allow peopl record store play paus forward wind tv programm want essenti technolog allow much personalis tv also builtin highdefinit tv set big busi japan us slower take europ lack highdefinit program can peopl forward wind advert can also forget abid network channel schedul put togeth alacart entertain us network cabl satellit compani worri mean term advertis revenu well brand ident viewer loyalti channel although us lead technolog moment also concern rais europ particular grow uptak servic like sky happen today will see nine month year time uk adam hume bbc broadcast futurologist told bbc news websit like bbc issu lost advertis revenu yet press issu moment commerci uk broadcast brand loyalti import everyon will talk content brand rather network brand said tim hanlon brand communic firm starcom mediavest realiti broadband connect anybodi can produc content ad challeng now hard promot programm much choic mean said stacey jolna senior vice presid tv guid tv group way peopl find content want watch simplifi tv viewer mean network us term channel take leaf googl book search engin futur instead schedul help peopl find want watch kind channel model might work younger ipod generat use take control gadget play might suit everyon panel recognis older generat comfort familiar schedul channel brand know get perhap want much choic put hand mr hanlon suggest end kid just diaper push button alreadi everyth possibl avail said mr hanlon 
ultim consum will tell market want new gadget technolog showcas ces mani enhanc tvwatch experi highdefinit tv set everywher mani new model lcd liquid crystal display tvs launch dvr capabl built instead extern box one exampl launch show humax inch lcd tv hour tivo dvr dvd record one us biggest satellit tv compani directtv even launch brand dvr show hour record capabl instant replay search function set can paus rewind tv hour microsoft chief bill gate announc preshow keynot speech partnership tivo call tivotogo mean peopl can play record programm window pcs mobil devic reflect increas trend free multimedia peopl can watch want want
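4b. Completing stems to base forms
# Complete each stem in a document to a full word form taken from the dictionary.
stemCompletion2 <- function(x, dictionary) {
  # Split the document text into individual stems
  x <- unlist(strsplit(as.character(x), " "))
  # Replace each stem with its most frequent completion in the dictionary
  x <- stemCompletion(x, dictionary = dictionary, type = "prevalent")
  # Join the completed words back into a single string
  paste(x, collapse = " ")
}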
writeLines(
stemCompletion2('tv futur hand viewer home theatr system plasma highdefinit tvs digit video record move live room way peopl watch tv will radic differ five year time accord expert panel gather annual consum electron show las vega discuss new technolog will impact one favourit pastim us lead trend programm content will deliv viewer via home network cabl satellit telecom compani broadband servic provid front room portabl devic one talkedabout technolog ces digit person video record dvr pvr settop box like us s tivo uk s sky system allow peopl record store play paus forward wind tv programm want essenti technolog allow much personalis tv also builtin highdefinit tv set big busi japan us slower take europ lack highdefinit program can peopl forward wind advert can', dict_from_docs)[[1]][1]
)
## tv future hand viewer home theatre system plasma highdefinition tvs digital video record move live room way people watch tv will radical difference five year time accordance expert panel gathered annual consumer electronic show las vegas discuss new technological will impact one favourite pastimes us lead trend programme content will deliver viewer via home network cable satellite telecommunications companies broadband service provide front room portable device one talkedabout technological ces digital person video record dvr pvr settop box like us saab tivo uk saab sky system allow people record store play pause forward wind tv programme want essentially technological allow much personalised tv also builtin highdefinition tv set big business japan us slower take europe lack highdefinition program can people forward wind advert can
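One way to read the output above, where the stray "s" becomes "saab" and "accord" becomes "accordance", is that completion works by prefix matching against the dictionary. The sketch below is a base-R illustration of that idea, not tm's actual code: the mini-dictionary, the stems, and the helper `complete_stem` are all made up. It picks the most frequent matching word, breaks ties alphabetically, and (a design choice of this sketch) keeps the stem when nothing matches.

```r
# Hypothetical mini-dictionary of full word forms and a few stems.
dictionary <- c("saab", "said", "sky", "kidney", "kids", "loyalty", "future")
stems <- c("s", "kid", "loyalti", "futur")

complete_stem <- function(stem, dictionary) {
  # Candidates are the dictionary words that start with the stem.
  candidates <- dictionary[startsWith(dictionary, stem)]
  if (length(candidates) == 0) {
    return(stem)  # fallback: keep the stem instead of dropping it
  }
  counts <- table(candidates)       # names come out sorted alphabetically
  names(counts)[which.max(counts)]  # most frequent; ties go to the first name
}

completed <- vapply(stems, complete_stem, character(1), dictionary = dictionary)
print(completed)
```

Prefix matching explains both kinds of surprise: very short stems match many unrelated words, and a stem such as "loyalti" matches nothing because "loyalty" does not start with it.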
docs_completed <- lapply(docs, stemCompletion2, dictionary=dict_from_docs)
docs_completed_df <- data.frame(text = unlist(docs_completed, recursive = FALSE), doc_id = seq_along(docs_completed))
docs_completed_df$text<-as.character(docs_completed_df$text)
docs <- SimpleCorpus(DataframeSource(docs_completed_df[c('doc_id', 'text')]))
writeLines(as.character(docs[[1]][1]))
## tv future hand viewer home theatre system plasma highdefinition tvs digital video record move live room way people watch tv will radical difference five year time accordance expert panel gathered annual consumer electronic show las vegas discuss new technological will impact one favourite pastimes us lead trend programme content will deliver viewer via home network cable satellite telecommunications companies broadband service provide front room portable device one talkedabout technological ces digital person video record dvr pvr settop box like us tivo uk sky system allow people record store play pause forward wind tv programme want essentially technological allow much personalised tv also builtin highdefinition tv set big business japan us slower take europe lack highdefinition program can people forward wind advert can also forget abided network channel schedule put together alacarte entertain us network cable satellite companies worried mean term advertisers revenue well brand identification viewer channel although us lead technological moment also concern raise europe particular grow uptake service like sky happen today will see nine month year time uk adam hume bbc broadcast futurologist told bbc news website like bbc issue lost advertisers revenue yet press issue moment commercial uk broadcast brand importance everyone will talk content brand rather network brand said tim hanlon brand communicate firm starcom mediavest broadband connected can produce content ad challenge now hard promote programme much choice mean said stacey jolna senior vice preside tv guidance tv group way people find content want watch simplified tv viewer mean network us term channel take leaf google book search engine future instead schedule help people find want watch kind channel model might work younger ipod generated use take control gadgets play might suit everyone panel recognise older generated comfortable familiar schedule channel brand 
know get perhaps want much choice put hand mr hanlon suggest end kidney just diapers push buttons everything possible available said mr hanlon ultimate consumer will tell market want new gadgets technological showcased ces manipulated enhance tvwatching experience highdefinition tv set everywhere manipulated new model lcd liquid crystal display tvs launch dvr capable built instead external box one example launch show humax inch lcd tv hour tivo dvr dvd record one us biggest satellite tv companies directtv even launch brand dvr show hour record capable instant replay search function set can pause rewind tv hour microsoft chief bill gates announce preshow keynote speech partnership tivo call tivotogo mean people can play record programme window pcs mobile device reflect increase trend free multimedia people can watch want want
dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 100, terms: 4058)>>
## Non-/sparse entries: 12588/393212
## Sparsity : 97%
## Maximal term length: 22
## Weighting : term frequency (tf)
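To make the structure of a document-term matrix concrete, the sketch below builds one by hand in base R. The three toy documents and all variable names are hypothetical and are not taken from bbc-text.csv; the sketch only mimics what `DocumentTermMatrix` reports (rows are documents, columns are terms, cells are counts) and is not how tm implements it.

```r
# Toy corpus (hypothetical documents, not taken from bbc-text.csv)
toy_docs <- c(d1 = "tv viewers watch tv",
              d2 = "viewers watch film",
              d3 = "film music")

# Split each document on single spaces to get its tokens
tokens <- strsplit(toy_docs, " ", fixed = TRUE)

# Vocabulary: sorted unique terms across all documents
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document, then put documents in rows
toy_dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab
print(toy_dtm)

# Share of zero cells, comparable to the "Sparsity" line printed by tm
toy_sparsity <- mean(toy_dtm == 0)
print(toy_sparsity)
```

Even in this tiny example almost half of the cells are zero, which is why the real matrix for 100 documents and 4058 terms is stored in a sparse format.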
Term Document Matrix
tdm <- TermDocumentMatrix(docs)
tdm
## <<TermDocumentMatrix (terms: 4058, documents: 100)>>
## Non-/sparse entries: 12588/393212
## Sparsity : 97%
## Maximal term length: 22
## Weighting : term frequency (tf)
freq <- colSums(as.matrix(dtm))
freq[1:20]
##      abided  accordance        adam      advert advertisers    alacarte
##           2          12           1           3           6           1
##       allow        also    although    announce      annual   available
##          19          82           9          23          10           5
##         bbc         big     biggest        bill        book         box
##          37           9          10          10           8           5
##       brand   broadband
##           9          16
Sorted in descending order:
freq <- sort(freq, decreasing=TRUE)
freq[1:40]
##       said       will       year     people       also        new
##        281        154        118         90         82         71
##        one governance  companies       last        can    partial
##         69         69         64         63         62         62
##       time        use       make       game        now        say
##         60         60         59         58         56         55
##       want       call       firm       back      music       need
##         54         53         53         51         49         48
##       told       work     market       play        two      right
##         47         47         46         46         46         46
##      first       like      month        get       film       take
##         46         45         45         44         44         41
##    england       club       show      three
##         41         40         39         39
Built-in function:
findFreqTerms(dtm, lowfreq=40)
##  [1] "also"       "call"       "can"        "companies"  "firm"
##  [6] "get"        "like"       "market"     "month"      "new"
## [11] "now"        "one"        "people"     "play"       "said"
## [16] "take"       "time"       "told"       "use"        "want"
## [21] "will"       "work"       "year"       "last"       "make"
## [26] "two"        "back"       "club"       "england"    "game"
## [31] "say"        "film"       "governance" "partial"    "need"
## [36] "right"      "first"      "music"
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
##          word freq
## said     said  281
## will     will  154
## year     year  118
## people people   90
## also     also   82
## new       new   71
p <- ggplot(subset(wf, freq>30), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x=element_text(angle=90, hjust=1))
p
dtms <- removeSparseTerms(dtm, 0.9) # Keep only terms that occur in more than 10% of the documents (maximal allowed term sparsity 0.9).
dtms
## <<DocumentTermMatrix (documents: 100, terms: 270)>>
## Non-/sparse entries: 4534/22466
## Sparsity : 83%
## Maximal term length: 11
## Weighting : term frequency (tf)
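The sparsity threshold is easy to misread, so here is a small base R sketch of the idea behind `removeSparseTerms`. The count matrix, the term names and the threshold of 0.7 are made up for illustration; the sketch assumes that a term's sparsity is the share of documents in which it does not occur and that terms at or above the threshold are dropped.

```r
# Hypothetical 5-document x 4-term count matrix (values made up for illustration)
m <- matrix(c(1, 0, 2, 0,
              0, 0, 1, 0,
              3, 0, 0, 0,
              1, 1, 0, 0,
              2, 0, 1, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5),
                            c("said", "rare", "film", "never")))

# Per-term sparsity: share of documents in which the term does not occur
term_sparsity <- colMeans(m == 0)
print(term_sparsity)

# Keep only terms whose sparsity is below the chosen threshold
max_sparsity <- 0.7
m_dense <- m[, term_sparsity < max_sparsity, drop = FALSE]
print(colnames(m_dense))
```

The threshold bounds the sparsity of each individual term, not of the resulting matrix, which is why the reduced matrix above still reports an overall sparsity of 83%.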
Term Frequency - Inverse Document Frequency
\(idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\)
TF: the number of occurrences of a word in a document divided by the total number of words in that document.
IDF: the weight of a word, derived from the number of documents in which the word occurs. The fewer documents contain the word, the more important the word is.
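The two definitions above can be checked on a small example. The sketch below uses a hypothetical 3-document count matrix and the natural logarithm from the formula; note that tm's `weightTfIdf` uses, to my knowledge, a base-2 logarithm, so its values differ from these by a constant factor.

```r
# Hypothetical counts: rows = documents, columns = terms
counts <- matrix(c(2, 1, 0,
                   0, 1, 1,
                   0, 1, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"),
                                 c("tv", "said", "film")))

# TF: term count divided by the total number of words in the document
tf <- counts / rowSums(counts)

# IDF: ln(number of documents / number of documents containing the term)
n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)
idf      <- log(n_docs / doc_freq)

# TF-IDF: multiply every column of tf by the idf of its term
tf_idf <- sweep(tf, 2, idf, `*`)
print(round(tf_idf, 3))
```

The term "said" occurs in every document, so its IDF is ln(3/3) = 0 and its TF-IDF is zero everywhere, while "tv", which occurs in a single document, gets the highest weight.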
tfidf_dtm <- DocumentTermMatrix(docs, control=list(weighting=weightTfIdf))
tfidf_dtm
## <<DocumentTermMatrix (documents: 100, terms: 4058)>>
## Non-/sparse entries: 12588/393212
## Sparsity : 97%
## Maximal term length: 22
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
tfidf_dtms <- removeSparseTerms(tfidf_dtm, 0.9) # Keep only terms that occur in more than 10% of the documents.
tfidf_dtms
## <<DocumentTermMatrix (documents: 100, terms: 270)>>
## Non-/sparse entries: 4534/22466
## Sparsity : 83%
## Maximal term length: 11
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
inspect(tfidf_dtms)
## <<DocumentTermMatrix (documents: 100, terms: 270)>>
## Non-/sparse entries: 4534/22466
## Sparsity : 83%
## Maximal term length: 11
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
##     Terms
## Docs   companies   economic       firm governance     match    partial
##   39 0.000000000 0.00000000 0.00000000 0.00000000 0.0000000 0.14394794
##   40 0.012866412 0.09842750 0.00000000 0.00000000 0.0000000 0.00000000
##   43 0.000000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##   52 0.000000000 0.00000000 0.00000000 0.05183380 0.0000000 0.00000000
##   53 0.009649809 0.00000000 0.01111111 0.00000000 0.0000000 0.00000000
##   65 0.032166030 0.03075859 0.05555556 0.06067846 0.0000000 0.02832309
##   68 0.000000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##   7  0.000000000 0.00000000 0.00000000 0.07746186 0.0000000 0.02169428
##   90 0.000000000 0.13503773 0.00000000 0.00000000 0.0000000 0.00000000
##   91 0.000000000 0.00000000 0.00000000 0.00000000 0.1634523 0.00000000
##     Terms
## Docs      people      price      share       will
##   39 0.006446495 0.00000000 0.00000000 0.00000000
##   40 0.000000000 0.02358833 0.00000000 0.00000000
##   43 0.000000000 0.14622358 0.00000000 0.00000000
##   52 0.005572394 0.00000000 0.00000000 0.01129790
##   53 0.000000000 0.01769125 0.04727502 0.00000000
##   65 0.000000000 0.00000000 0.15758340 0.01234400
##   68 0.007209896 0.00000000 0.00000000 0.02046506
##   7  0.000000000 0.00000000 0.00000000 0.01890997
##   90 0.000000000 0.02588963 0.00000000 0.00000000
##   91 0.000000000 0.00000000 0.00000000 0.01017674
Frequency analysis based on the tf-idf matrix:
tf_freq <- colSums(as.matrix(tfidf_dtms))
head(tf_freq)
## accordance      allow       also   announce        bbc    biggest
##  0.2013956  0.2225343  0.4071121  0.2969897  0.4196205  0.1836695
Sorted in descending order:
tf_freq <- sort(tf_freq, decreasing=TRUE)
tf_freq[1:40]
##    partial      share   economic     people governance      price
##  0.8484945  0.7599005  0.7475764  0.7370956  0.6591801  0.6547498
##       firm      match  companies       will       game       club
##  0.6277491  0.6165783  0.6115855  0.6092822  0.6064244  0.6009927
##    england     market        win       year      offer    quarter
##  0.6007431  0.5870324  0.5442894  0.5321437  0.5093472  0.5084403
##      month       star     player       play      right     return
##  0.4998237  0.4989058  0.4898740  0.4890583  0.4849442  0.4804559
##       hope      music    service       call        new   internal
##  0.4784270  0.4769572  0.4768794  0.4708966  0.4702842  0.4682189
##      first        can        use      house       told      final
##  0.4661323  0.4629000  0.4608068  0.4595336  0.4571241  0.4476756
##      award       make        say       week
##  0.4473094  0.4439823  0.4430635  0.4419193
library(ggplot2)
wf <- data.frame(word=names(tf_freq), freq=tf_freq)
head(wf)
##                  word      freq
## partial       partial 0.8484945
## share           share 0.7599005
## economic     economic 0.7475764
## people         people 0.7370956
## governance governance 0.6591801
## price           price 0.6547498
p <- ggplot(subset(wf, freq>0.4), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x=element_text(angle=90, hjust=1))
p
findAssocs(dtm, c("governance" , "market"), corlimit=0.8) # specifying a correlation limit of 0.8
## $governance
## allotting angst appreciate aren
## 0.88 0.88 0.88 0.88
## avenues bandsartists bed beneficial
## 0.88 0.88 0.88 0.88
## cheap chopin commercialised convinced
## 0.88 0.88 0.88 0.88
## daunting equipment etc exercise
## 0.88 0.88 0.88 0.88
## extent ferdinand flourish fork
## 0.88 0.88 0.88 0.88
## franz fraternities frontman gorbachev
## 0.88 0.88 0.88 0.88
## hate hostility idolstyle idoltype
## 0.88 0.88 0.88 0.88
## kapranos korn kylie louder
## 0.88 0.88 0.88 0.88
## macmillan megastars merit minogue
## 0.88 0.88 0.88 0.88
## modern moulds musician napalm
## 0.88 0.88 0.88 0.88
## nea nirvana penalise pockets
## 0.88 0.88 0.88 0.88
## pollution prestigious privatelyfunded raked
## 0.88 0.88 0.88 0.88
## rediscover reinforce riddled scissor
## 0.88 0.88 0.88 0.88
## selfsufficient shouldn smell solution
## 0.88 0.88 0.88 0.88
## soviet sponsorship statefunded stereotypes
## 0.88 0.88 0.88 0.88
## subsidiary subsidising sustenance tea
## 0.88 0.88 0.88 0.88
## thrive thumbs travis twiddling
## 0.88 0.88 0.88 0.88
## upcoming wagner waste wealthiest
## 0.88 0.88 0.88 0.88
## whatsoever yeah yes pursue
## 0.88 0.88 0.88 0.86
## listen grant alex art
## 0.85 0.84 0.83 0.83
## music lecture fund scrap
## 0.82 0.82 0.81 0.81
##
## $market
## numeric(0)
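The association scores above are correlations between term columns of the document-term matrix. The base R sketch below reproduces that idea on a hypothetical 4-document matrix; the counts, the term names and the 0.8 limit applied here are illustrative assumptions, and the sketch uses plain `cor` rather than tm's own implementation.

```r
# Hypothetical document-term counts (rows = documents, columns = terms)
m <- matrix(c(2, 1, 0,
              0, 0, 3,
              4, 2, 1,
              0, 0, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("music", "art", "market")))

# Pearson correlation of the target term's counts with every other term
target <- "music"
others <- setdiff(colnames(m), target)
assoc  <- cor(m[, target], m[, others])[1, ]
print(round(assoc, 2))

# Keep only associations at or above the correlation limit
corlimit <- 0.8
strong <- assoc[assoc >= corlimit]
print(names(strong))
```

Here "art" always appears together with "music" in proportional amounts, so its correlation is 1, while "market" is negatively correlated and is filtered out. An empty result such as `numeric(0)` for "market" above simply means that no term reached the limit.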
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(142)
wordcloud(names(freq), freq, min.freq=25)
library(cluster)
dtmss <- removeSparseTerms(tdm, 0.9) # Keep only terms that occur in more than 10% of the documents.
dtmss
## <<TermDocumentMatrix (terms: 270, documents: 100)>>
## Non-/sparse entries: 4534/22466
## Sparsity : 83%
## Maximal term length: 11
## Weighting : term frequency (tf)
d <- dist(t(dtmss), method="euclidean")
fit <- hclust(d=d, method="complete") # for a different look try substituting: method="ward.D"
plot(fit, hang=-1)
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=6) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=6, border="red") # draw dendrogram with red borders around the 6 clusters
library(fpc)
d <- dist(t(dtmss), method="euclidean")
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
library(quanteda)
## Package version: 1.3.4
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
##
## View
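The hierarchical and k-means clustering steps used earlier can be tried on a matrix small enough to verify by eye. The sketch below uses only base R (`dist`, `hclust`, `cutree`, `kmeans`); the four documents, their labels and their counts are hypothetical.

```r
# Hypothetical document-by-term counts: two "sport" and two "business" documents
m <- matrix(c(5, 4, 0, 0,
              4, 5, 1, 0,
              0, 0, 6, 5,
              0, 1, 5, 6),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("sport1", "sport2", "biz1", "biz2"),
                            c("game", "club", "market", "share")))

# Hierarchical clustering on Euclidean distances between documents
d      <- dist(m, method = "euclidean")
fit    <- hclust(d, method = "complete")
groups <- cutree(fit, k = 2)   # cut the dendrogram into 2 clusters
print(groups)

# k-means with 2 centers; several random starts make the result more stable
set.seed(142)
kfit <- kmeans(m, centers = 2, nstart = 10)
print(kfit$cluster)
```

In this sketch k-means runs on the count matrix itself, whereas the code earlier passes the distance object to `kmeans`; both are possible, but they cluster different representations of the documents.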
qdocs <-corpus(docs_df)
summary(qdocs)
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Corpus consisting of 2225 documents, showing 100 documents:
##
## Text Types Tokens Sentences category
## 1 352 775 1 tech
## 2 180 321 1 business
## 3 146 257 1 sport
## 4 189 363 1 sport
## 5 177 288 1 entertainment
## 6 295 659 1 politics
## 7 158 284 1 politics
## 8 126 203 1 sport
## 9 110 166 1 sport
## 10 135 249 1 entertainment
## 11 177 323 1 entertainment
## 12 127 223 1 business
## 13 187 363 1 business
## 14 165 311 1 politics
## 15 242 493 1 sport
## 16 175 314 1 business
## 17 215 436 1 politics
## 18 216 459 1 sport
## 19 163 313 1 business
## 20 217 490 1 tech
## 21 115 229 1 tech
## 22 193 367 1 tech
## 23 303 649 1 sport
## 24 99 160 1 sport
## 25 302 703 1 tech
## 26 146 226 1 sport
## 27 146 263 1 entertainment
## 28 163 318 1 tech
## 29 285 609 1 politics
## 30 220 419 1 entertainment
## 31 262 512 1 politics
## 32 307 670 1 tech
## 33 128 235 1 entertainment
## 34 183 318 1 entertainment
## 35 111 157 1 business
## 36 143 275 1 politics
## 37 112 207 1 tech
## 38 180 336 1 entertainment
## 39 242 537 1 politics
## 40 155 273 1 business
## 41 224 389 1 politics
## 42 194 368 1 sport
## 43 208 439 1 business
## 44 114 190 1 sport
## 45 180 328 1 tech
## 46 541 1363 5 entertainment
## 47 236 524 1 politics
## 48 250 612 1 politics
## 49 128 221 1 politics
## 50 121 168 1 business
## 51 140 252 1 sport
## 52 270 563 1 politics
## 53 162 324 1 business
## 54 173 293 1 business
## 55 152 316 1 sport
## 56 114 190 1 politics
## 57 278 537 1 business
## 58 317 676 1 sport
## 59 261 463 1 sport
## 60 262 521 1 business
## 61 261 492 1 business
## 62 176 347 1 sport
## 63 150 247 1 business
## 64 120 181 1 sport
## 65 108 195 1 business
## 66 231 448 1 tech
## 67 172 289 1 business
## 68 189 426 1 entertainment
## 69 142 248 1 tech
## 70 195 379 1 business
## 71 274 648 1 politics
## 72 141 227 1 business
## 73 258 481 1 politics
## 74 132 216 1 sport
## 75 118 203 1 business
## 76 329 640 2 tech
## 77 211 444 1 business
## 78 252 554 1 sport
## 79 130 203 1 sport
## 80 164 288 1 business
## 81 144 248 1 business
## 82 117 186 1 sport
## 83 173 332 1 politics
## 84 140 253 1 business
## 85 149 242 1 entertainment
## 86 165 311 1 politics
## 87 192 344 1 politics
## 88 185 352 1 business
## 89 302 571 2 entertainment
## 90 136 235 1 business
## 91 154 267 1 sport
## 92 94 134 1 sport
## 93 285 687 1 politics
## 94 140 231 1 sport
## 95 233 516 1 politics
## 96 97 140 1 sport
## 97 156 255 1 business
## 98 339 683 1 sport
## 99 124 202 1 business
## 100 292 622 1 entertainment
##
## Source: C:/Users/mmazurek/Documents/RWorkDir/RDemo/* on x86-64 by mmazurek
## Created: Thu Jan 03 13:06:42 2019
## Notes:
qdtm<-dfm(qdocs)
doc_freq<-docfreq(qdtm)
qdtm_tfidf<-dfm_tfidf(qdtm)
qdtm_tfidf[1,'tv']
## Document-feature matrix of: 1 document, 1 feature (0% sparse).
## 1 x 1 sparse Matrix of class "dfm"
## features
## docs tv
## 1 12.47801
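df<-doc_freq['tv']
# Relative term frequency: count of 'tv' in document 1 divided by all tokens in document 1
tf<-qdtm[1,'tv'] / sum(qdtm[1,])
idf<-log(2225/df)
# Note: dfm_tfidf() by default uses raw counts and a base-10 logarithm, so this value differs from 12.47801 above
tf_idf<-tf * idf
library(tidytext)
library(dplyr)
##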
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
text_df<-tibble(doc=1:nrow(docs_df), text=docs_df$text) # tibble() replaces the deprecated data_frame()
text_df
## # A tibble: 2,225 x 2
## doc text
## <int> <chr>
## 1 1 tv future in the hands of viewers with home theatre systems pla~
## 2 2 worldcom boss left books alone former worldcom boss bernie ebb~
## 3 3 tigers wary of farrell gamble leicester say they will not be r~
## 4 4 yeading face newcastle in fa cup premiership side newcastle unit~
## 5 5 ocean s twelve raids box office ocean s twelve the crime caper ~
## 6 6 howard hits back at mongrel jibe michael howard has said a claim~
## 7 7 blair prepares to name poll date tony blair is likely to name 5 ~
## 8 8 henman hopes ended in dubai third seed tim henman slumped to a s~
## 9 9 wilkinson fit to face edinburgh england captain jonny wilkinson ~
## 10 10 last star wars not for children the sixth and final star wars ~
## # ... with 2,215 more rows
text_df %>%
  unnest_tokens(word, text)
## # A tibble: 873,994 x 2
## doc word
## <int> <chr>
## 1 1 tv
## 2 1 future
## 3 1 in
## 4 1 the
## 5 1 hands
## 6 1 of
## 7 1 viewers
## 8 1 with
## 9 1 home
## 10 1 theatre
## # ... with 873,984 more rows
data("stop_words")
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
cleaned_texts<-text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
cleaned_texts %>% count(word, sort=TRUE)
## # A tibble: 29,827 x 2
## word n
## <chr> <int>
## 1 people 2045
## 2 time 1322
## 3 world 1201
## 4 government 1160
## 5 uk 1104
## 6 told 911
## 7 film 890
## 8 game 871
## 9 music 839
## 10 000 804
## # ... with 29,817 more rows
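The tokenise, remove stop words, count sequence above can be mimicked in base R to see what each step does. The two texts and the three-word stop list below are hypothetical; the real `stop_words` table has 1,149 entries, and `unnest_tokens` also handles punctuation, which this whitespace split does not.

```r
# Hypothetical input texts and a deliberately tiny stop list
texts     <- c("the tv future in the hands of viewers",
               "viewers watch the film")
stop_list <- c("the", "in", "of")

# One row per token, keeping track of the source document
split_words <- strsplit(tolower(texts), "\\s+")
tokens <- data.frame(
  doc  = rep(seq_along(texts), lengths(split_words)),
  word = unlist(split_words),
  stringsAsFactors = FALSE
)

# Equivalent of anti_join(stop_words): drop rows whose word is on the stop list
cleaned <- tokens[!tokens$word %in% stop_list, ]

# Equivalent of count(word, sort = TRUE)
word_counts <- sort(table(cleaned$word), decreasing = TRUE)
print(word_counts)
```

The one-token-per-row layout is what makes the later grouping and joining steps straightforward.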
text_words <- cleaned_texts %>%
  count(doc, word, sort = TRUE) %>%
  ungroup()
total_words <- text_words %>%
  group_by(doc) %>%
  summarize(total = sum(n))
text_words <- left_join(text_words, total_words)
## Joining, by = "doc"
text_words
## # A tibble: 283,412 x 4
## doc word n total
## <int> <chr> <int> <int>
## 1 866 music 71 912
## 2 1616 song 65 1340
## 3 1928 roddick 53 669
## 4 866 urban 52 912
## 5 678 wage 51 1265
## 6 678 minimum 47 1265
## 7 1928 nadal 46 669
## 8 1605 kilroy 44 962
## 9 409 forsyth 37 1671
## 10 1605 silk 36 962
## # ... with 283,402 more rows
freq_by_rank <- text_words %>%
  group_by(doc) %>%
  mutate(rank = row_number(),
         `term frequency` = n/total)
freq_by_rank
## # A tibble: 283,412 x 6
## # Groups: doc [2,225]
## doc word n total rank `term frequency`
## <int> <chr> <int> <int> <int> <dbl>
## 1 866 music 71 912 1 0.0779
## 2 1616 song 65 1340 1 0.0485
## 3 1928 roddick 53 669 1 0.0792
## 4 866 urban 52 912 2 0.0570
## 5 678 wage 51 1265 1 0.0403
## 6 678 minimum 47 1265 2 0.0372
## 7 1928 nadal 46 669 2 0.0688
## 8 1605 kilroy 44 962 1 0.0457
## 9 409 forsyth 37 1671 1 0.0221
## 10 1605 silk 36 962 2 0.0374
## # ... with 283,402 more rows
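The per-document rank and term frequency computed above can be reproduced in base R with `ave`, which applies a function within groups. The five rows below are hypothetical counts, and the column name `term_frequency` is chosen here only for convenience.

```r
# Hypothetical per-document word counts
text_words <- data.frame(
  doc  = c(1, 1, 1, 2, 2),
  word = c("tv", "viewers", "film", "music", "song"),
  n    = c(4, 2, 2, 6, 3),
  stringsAsFactors = FALSE
)

# Total number of counted words per document
text_words$total <- ave(text_words$n, text_words$doc, FUN = sum)

# Sort by count (descending), then number the rows within each document
text_words <- text_words[order(-text_words$n), ]
text_words$rank <- ave(text_words$n, text_words$doc, FUN = seq_along)

# Term frequency: share of the document's words taken by this word
text_words$term_frequency <- text_words$n / text_words$total
print(text_words)
```

As in the tibble above, the rank depends on the rows already being sorted by count, so the sort has to happen before the ranks are assigned.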
library(ggplot2)
cleaned_texts %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
# text_words %>% bind_tf_idf(word, doc, n) %>% arrange(desc(tf_idf))