Text data is a challenge for machine learning practitioners, as it is
arguably among the hardest kinds of data to organise and analyse. Topic
modelling has become one of the most active areas of research. In this
small study we apply clustering methods to the BBC news summary data.
# loading necessary packages
if (!require('pacman')) install.packages('pacman')
pacman::p_load(DescTools, rmarkdown, tibble, tidytext, tm,
               tokenizers, text2vec, textcat, stopwords, SnowballC,
               dplyr, tidyr, factoextra, Rtsne, lsa, quanteda, tsne, irlba,
               ClusterR, NbClust, flexclust, fpc, dbscan, cluster, wordcloud)
The data set we use was retrieved from kaggle.com and is available under the following link.
As loaded below, it contains 1,096 entries of BBC news summaries from 2004-2005. According to its authors, the articles come from 5 different categories: business, entertainment, politics, sport and tech.
We will ignore the category labels and try to find clusters with a text-mining approach.
# loading data
df <- read.csv('bbc-news-data.csv', sep = '\t')
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## EOF within quoted string
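The warning above deserves attention: it usually means that read.csv met a double quote that was never closed, so everything after it was read as a single field and the following lines were not parsed as separate rows. The sketch below reproduces the effect on a made-up ten-row file (the file contents and the names tmp, rows, with_quotes and no_quotes are assumptions for illustration, not part of the BBC data) and shows that passing quote = "" switches quote handling off. Whether that is the right fix for the BBC file is not verified here.

```r
# Hypothetical tab-separated file with ten data rows.
rows <- sprintf("tech\tTitle %d\tPlain sentence number %d.", 1:10, 1:10)

# Row 8 contains a double quote that is never closed.
rows[8] <- "sport\tTitle 8\tHe said \"never and left."

tmp <- tempfile(fileext = ".csv")
writeLines(c("category\ttitle\tcontent", rows), tmp)

# Default quoting: the stray quote swallows the remaining lines into one field
# (the "EOF within quoted string" warning is silenced here on purpose).
with_quotes <- suppressWarnings(read.csv(tmp, sep = "\t"))

# Quote handling disabled: every physical line becomes one row.
no_quotes <- read.csv(tmp, sep = "\t", quote = "")

nrow(with_quotes)
nrow(no_quotes)
```

Re-reading the BBC file this way would likely change the row counts reported below, so the figures in this study refer to the data as loaded above.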
Desc(df)
## ------------------------------------------------------------------------------
## Describe df (data.frame):
##
## data frame: 1096 obs. of 4 variables
## 1096 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 category character .
## 2 filename character .
## 3 title character .
## 4 content character .
##
##
## ------------------------------------------------------------------------------
## 1 - category (character)
##
## length n NAs unique levels dupes
## 1'096 1'096 0 5 5 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 sport 259 23.6% 259 23.6%
## 2 politics 226 20.6% 485 44.3%
## 3 entertainment 223 20.3% 708 64.6%
## 4 business 214 19.5% 922 84.1%
## 5 tech 174 15.9% 1'096 100.0%
## ------------------------------------------------------------------------------
## 2 - filename (character)
##
## length n NAs unique levels dupes
## 1'096 1'096 0 466 466 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 029.txt 5 0.5% 5 0.5%
## 2 030.txt 5 0.5% 10 0.9%
## 3 098.txt 5 0.5% 15 1.4%
## 4 099.txt 5 0.5% 20 1.8%
## 5 169.txt 5 0.5% 25 2.3%
## 6 171.txt 5 0.5% 30 2.7%
## 7 172.txt 5 0.5% 35 3.2%
## 8 196.txt 5 0.5% 40 3.6%
## 9 209.txt 5 0.5% 45 4.1%
## 10 237.txt 5 0.5% 50 4.6%
## 11 238.txt 5 0.5% 55 5.0%
## 12 298.txt 5 0.5% 60 5.5%
## ... etc.
## [list output truncated]
## ------------------------------------------------------------------------------
## 3 - title (character)
##
## length n NAs unique levels dupes
## 1'096 1'096 0 1'067 1'067 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 'Debate needed' on donations cap 2 0.2% 2 0.2%
## 2 'Ultimate game' award for Doom 3 2 0.2% 4 0.4%
## 3 Apple attacked over sources row 2 0.2% 6 0.5%
## 4 Ask Jeeves joins web log market 2 0.2% 8 0.7%
## 5 Blair dismisses quit claim report 2 0.2% 10 0.9%
## 6 Brown ally rejects Budget spree 2 0.2% 12 1.1%
## 7 Brown names 16 March for Budget 2 0.2% 14 1.3%
## 8 Brown outlines third term vision 2 0.2% 16 1.5%
## 9 Commodore finds new lease of life 2 0.2% 18 1.6%
## 10 Davenport hits out at Wimbledon 2 0.2% 20 1.8%
## 11 Doors open at biggest gadget fair 2 0.2% 22 2.0%
## 12 Downloads enter US singles chart 2 0.2% 24 2.2%
## ... etc.
## [list output truncated]
## ------------------------------------------------------------------------------
## 4 - content (character)
##
## length n NAs unique levels dupes
## 1'096 1'096 0 1'062 1'062 y
## 100.0% 0.0%
##
## level
## 1 A cap on donations to political parties should not be introduced yet, the elections watchdog has said. Fears that big donors can buy political favours have sparked calls for a limit. In a new report, the Electoral Commission says it is worth debating a £10,000 cap for the future but now is not the right time to introduce it. It also says there should be more state funding for political parties and candidates should be able to spend more on election campaigning. There were almost £68m in reported donations to political parties in 2001, 2002 and 2003, with nearly £12m of them from individual gifts worth more than £1m. The rules have already been changed so the public can see who gives how much to the parties but the report says there are still public suspicions. The commission says capping donations would mean taxpayers giving parties more cash - something which would first have to be acceptable to the public and shown to work. While we are not in principle opposed to the introduction of a donation cap, we do not believe that such a major departure from the existing system now would be sensible, says its report. If there was to be a cap, it should be £10,000 - a small enough amount to make a difference but which would have banned £56m in donations between 2001 and 2003. Even without changes the commission does urge political parties to seek out more small-scale donations and suggests there should be income tax relief for gifts under £200. It also suggests increasing state funding for parties to £3m so help can be extended to all parties with at least two members in the House of Commons, European Parliament, Scottish Parliament, Welsh Assembly or Northern Ireland Assembly. And it suggests new ways of boosting election campaigning, seen as a way of improving voter turnout. All local election candidates should be entitled to a free mailshot for campaign leaflets, says the watchdog. 
And there should be a shift in the amount of money allowed to be spent at elections from a national level to a local level to help politicians engage better with voters. The report suggests doubling the money which can be spent by candidates, while cutting national spending limits from £20m to £15m. The commission also says the spending limits for general elections should cover the four months before the poll - as happens with other elections. Electoral Commission chairman Sam Younger said: There is no doubt that political parties have a vital role to play in maintaining the health of our democracy and for this they need to be adequately resourced. Our research has shown that people want to be more informed about party politics and that they want politicians to be more visible and accessible. The public are reluctant for the state to fund parties but at the same time are unhappy with large private donations. He called for a wider public debate on party funding to find the consensus needed for radical changes to the current system.
## 2 Ask Jeeves has bought the Bloglines website to improve the way it handles content from web journals or blogs. The Bloglines site has become hugely popular as it gives users one place in which to read, search and share all the blogs they are interested in. Ask Jeeves said it was not planning to change Bloglines but would use the 300 million articles it has archived to round out its index of the web. How much Ask Jeeves paid for Bloglines was not revealed. Bloglines has become popular because it lets users build a list of the blogs they want to follow without having to visit each journal site individually. To do this it makes use of a technology known as Really Simple Syndication (RSS) that many blogs have adopted to let other sites know when new entries are made on their journals. The acquisition follows similar moves by other search sites. Google acquired Pyra Labs, makers of the Blogger software, in 2003. In 2004 MSN introduced its own blog system and Yahoo has tweaked its technology to do a better job of handling blog entries. Jim Lanzone, vice president of search properties at Ask Jeeves in the US, said it did not acquire Bloglines just to get a foothold in the blog publishing world. He said Ask Jeeves was much more interested in helping people find information they were looking for rather than helping them write it. The universe of readers is vastly larger than the universe of writers, he said. Mr Lanzone said the acquisition would sit well with Ask's My Jeeves service which lets people customise their own web experience and build up a personal collection of useful links. Search engines are about discovering information for the first time and RSS is the ideal way to keep track of and monitor those sites, he said. It would also help drive information and entries from blogs to the portals that Ask Jeeves operates. There would be no instant sweeping changes to Bloglines, said Mr Lanzone. 
Our intent is to take our time to figure out the right business model not to try to monetise it right away, he said. Though Mr Lanzone added that Ask Jeeves would be helping organise the database of 300m blog entries Bloglines holds with its own net indexing technology. Being able to search the blogosphere as one corpus of information will be very useful in its own right, said Mr Lanzone. Rumours about the acquisition were broken by the Napsterization weblog which said it got the hint from Ask Jeeves insiders.
## 3 British and Irish Lions coach Clive Woodward says he is unlikely to select any players not involved in next year's RBS Six Nations Championship. World Cup winners Lawrence Dallaglio, Neil Back and Martin Johnson had all been thought to be in the frame for next summer's tour to New Zealand. I don't think you can ever say never, said Woodward. But I would have to have a compulsive reason to pick any player who is not available to international rugby. Dallaglio, Back and Johnson have all retired from international rugby over the last 12 months but continue to star for their club sides. But Woodward added: The key thing that I want to stress is that I intend to use the Six Nations and the players who are available to international rugby as the key benchmark. My job, along with all the other senior representatives, is to make sure that we pick the strongest possible team. If you are not playing international rugby then it's still a step up to Test rugby. It's definitely a disadvantage. I think it's absolutely critical and with the history of the Lions we have got to take players playing for the four countries. Woodward also revealed that the race for the captaincy was still wide open. It is an open book, he said. There are some outstanding candidates from all four countries. And following the All Blacks' impressive displays in Europe in recent weeks, including a 45-6 humiliation of France, Woodward believes the three-test series in New Zealand will provide the ultimate rugby challenge. Their performance in particular against France was simply awesome, said the Lions coach. Certain things have been suggested about the potency of their front five, but they're a very powerful unit. With his customary thoroughness, Woodward revealed he had taken soundings from Australia coach Eddie Jones and Jake White of South Africa following their tour matches in Britain and Ireland. 
As a result, Woodward stressed his Lions group might not be dominated by players from England and Ireland and held out hope for the struggling Scots. Scotland's recent results have not been that impressive but there have been some excellent individual performances. Eddie in particular told me how tough they had made it for Australia and I will take on board their opinions. And Scotland forward Simon Taylor looks certain to get the call, provided he recovers from knee and tendon problems. I took lessons from 2001 in that they did make a mistake in taking Lawrence Dallaglio when he wasn't fit and went on the trip. Every player has to be looked at on their own merits and Simon Taylor is an outstanding player and I have no doubts that if he gets back to full fitness he will be on the trip. I am told he should be back playing by March and he has plenty of time to prove his fitness for the Lions - and there are other players like Richard Hill in the same boat.
## 4 Chancellor Gordon Brown's closest ally has denied suggestions there will be a Budget giveaway on 16 March. Ed Balls, ex-chief economic adviser to the Treasury, said there would be no spending spree before polling day. But Mr Balls, a prospective Labour MP, said he was confident the chancellor would meet his fiscal rules. He was speaking as Sir Digby Jones, CBI director general, warned Mr Brown not to be tempted to use any extra cash on pre-election bribes. Mr Balls, who stepped down from his Treasury post to stand as a Labour candidate in the election, had suggested that Mr Brown would meet his golden economic rule - with a margin to spare. He said he hoped more would be done to build on current tax credit rules. He also stressed rise in interest rates ahead of an expected May election would not affect the Labour Party's chances of winning. Expectations of a rate rise have gathered pace after figures showed house prices are still rising. Consumer borrowing rose at a near-record pace in January. If the MPC (the Bank of England's Monetary Policy Committee) were to judge that a rate rise was justified before the election because of the strength of the economy - and I'm not predicting that they will - I do not believe that this will be a big election issue in Britain for Labour, he told a Parliamentary lunch. This is a big change in our political culture. During an interview with BBC Radio 4's Today programme, Mr Balls said he was sure Mr Brown's Budget would not put at risk the stability of the economy. I don't think we'll see a pre-election spending spree - we certainly did not see that before 2001, he said. His assurances came after Sir Digby Jones said stability was all important and any extra cash should be spent on improving workers' skills. His message to the chancellor was: Please don't give it away in any form of electioneering. Sir Digby added: I don't think he will. I have to say he has been a prudent chancellor right the way through. 
Stability is the key word - British business needs boring stability more than anything. We would say to him 'don't increase your public spending, don't give it away. But if you are going to anywhere, just add something to the competitiveness of Britain, put it into skilling our people'. That would be a good way to spend any excess. Mr Balls refused to say whether Mr Brown would remain as chancellor after the election, amid speculation he will be offered the job of Foreign Secretary. I think that Gordon Brown wants to be part of the successful Labour government which delivers in the third term for the priorities of the people and sees off a Conservative Party that will take Britain backwards, Mr Balls told Today. Prime Minister Tony Blair has yet to name the date of the election, but most pundits are betting on 5 May.
## 5 Chart-topping pop band Busted have confirmed that they plan to take a break, following rumours that they were on the verge of splitting. A statement from the band's record company Universal said frontman Charlie Simpson planned to spend some time working with his other band, Fightstar. However they said that Busted would reconvene in due course. The band have had eight top three hits, including four number ones, since they first hit the charts in 2002. Their singles include What I Go To School For, Year 3000, Crashed The Wedding, You Said No, and Who's David? The band, which also includes members Matt Jay and James Bourne, made the top ten with their self-titled debut album, as well as the follow-up, A Present For Everyone, in 2003. They won best pop act and best breakthrough act at the 2004 Brit Awards and were nominated for best British group. Most recently they topped the charts with the theme from the live-action film version of Thunderbirds, which was voted Record Of The Year on the ITV1 show. The band have capitalised on a craze for artists playing catchy pop music with rock overtones. The trio are seen as an alternative to more manufactured artists who are not considered credible musicians because they do not write their own songs or play their own instruments. However, recent rumours have suggested that Simpson has been wanting to quit the band to focus on Fightstar. He now plans to take Fightstar on tour.
## 6 Conductor Marcello Viotti, director of Venice's famous La Fenice Theatre, has died in Germany at 50. Viotti, director of La Fenice since 2002, conducted at renowned opera houses worldwide including Milan's La Scala and the Vienna State Opera. His time at La Fenice coincided with its reopening in 2003 after it was destroyed by fire in 1996. He fell into a coma after suffering a stroke during rehearsals for Jules Massenet's Manon last week. He conducted some of the best orchestras in the world including the Berlin Philharmonic and the English Chamber Orchestra. Viotti was born in Switzerland and studied the piano, cello and singing at the Lausanne Conservatory. His career breakthrough came in 1982 when he won first prize at the Gino Marinuzzi conducting competition in Italy. Viotti established himself as chief conductor of the Turin Opera and went on to become chief conductor of Munich's Radio Orchestra. At La Fenice Viotti was widely acclaimed for his production of the French composer Massenet's Thais and some of his other productions included Giuseppe Verdi's La Traviata and Richard Strauss's Ariadne auf Naxos. The last opera he directed at La Fenice was Massenet's Le Roi de Lahore. Viotti's debut at the New York's Metropolitan Opera came in 2000 with Giacomo Puccini's Madame Butterfly, followed by La Boheme, La Traviata and Fromental Halevy's La Juive. Giampaolo Vianello, superintendent of the Fenice Theatre Foundation, said: I am filled with extreme sadness because, other than a great artist, he is missed as a friend - a main character in the latest joyous times, during the rebirth of our theatre. Viotti's last public performance was on 5 February when he conducted Vincenzo Bellini's Norma at the Vienna State Opera.
## 7 Digital music downloads are being included in the main US singles chart for the first time. Billboard's Hot 100 chart now incorporates data from sales of music downloads, previously only assigned to a separate download chart. Green Day's Boulevard of Broken Dreams is currently number two in Billboard's pop chart, and tops its digital chart. Download sales are due to be incorporated into the UK singles chart later this year. Digital sales in the US are already used to compile Billboard's Hot Digital Sales chart. They will now be tallied with sales of physical singles and airplay information to make up its new Hot 100 chart. Its second new chart - the Pop 100 - also combines airplay, digital and physical sales but confines its airplay information to US radio stations which play chart music. In addition to Green Day, other artists in the current US digital sales top 10 include Kelly Clarkson, The Game and the Killers. Sales of legally downloaded songs shot up more than tenfold in 2004, with 200 million track purchased online in the US and Europe in 12 months, the International Federation of the Phonographic Industry (IFPI) reported last month. In the UK sales of song downloads overtook those for physical singles for the first time at the end of last year. The last week of December 2004 saw download sales of 312,000 compared with 282,000 physical singles, according to the British Phonographic Industry. The UK's first official music download chart was launched last September, compiling the most popular tracks downloaded from legal UK sites - including iTunes, OD2, mycokemusic.com and Napster. Westlife's Flying Without Wings - a 1999 track reissued for the occasion - was the first number one of the UK download chart. A spokesman for the British Phonographic Industry (BPI) said the first combined UK download and sales chart was due to be compiled within the first half of this year. 
Work is going on across the music business right now to make sure the new chart works to plan, he said. The BPI spokesman described the UK music download chart, compiled by the Official Charts Company, as having been a great success since its launch. It has provided a focus for the industry and has really driven interest in downloads among music fans, he said.
## 8 Fast web access is encouraging more people to express themselves online, research suggests. A quarter of broadband users in Britain regularly upload content and have personal sites, according to a report by UK think-tank Demos. It said that having an always-on, fast connection is changing the way people use the internet. More than five million households in the UK have broadband and that number is growing fast. The Demos report looked at the impact of broadband on people's net habits. It found that more than half of those with broadband logged on to the web before breakfast. One in five even admitted to getting up in the middle of the night to browse the web. More significantly, argues the report, broadband is encouraging people to take a more active role online. It found that one in five post something on the net everyday, ranging from comments or opinions on sites to uploading photographs. Broadband is putting the 'me' in media as it shifts power from institutions and into the hands of the individual, said John Craig, co-author of the Demos report. From self-diagnosis to online education, broadband creates social innovation that moves the debate beyond simple questions of access and speed. The Demos report, entitled Broadband Britain: The End Of Asymmetry?, was commissioned by net provider AOL. Broadband is moving the perception of the internet as a piece of technology to an integral part of home life in the UK, said Karen Thomson, Chief Executive of AOL UK, with many people spending time on their computers as automatically as they might switch on the television or radio. According to analysts Nielsen//NetRatings, more than 50% of the 22.8 million UK net users regularly accessing the web from home each month are logging on at high speed They spend twice as long online than people on dial-up connections, viewing an average of 1,444 pages per month. The popularity of fast net access is growing, partly fuelled by fierce competition over prices and services.
## 9 French actress Audrey Tautou, star of hit film Amelie, will play the female lead in the film adaptation of The Da Vinci Code, it has been reported. The movie version of Dan Brown's best-selling novel is being directed by Ron Howard and also stars Tom Hanks. Tautou will play Hanks' code-cracking partner, according to various newspapers. She is currently starring in A Very Long Engagement, directed by Jean-Pierre Jeunet. Jeunet was also responsible for directing Tautou in Amelie in 2001, which launched the actress into the mainstream. She also starred as the lead role in critically-acclaimed film Dirty Pretty Things in 2002. Oscar-winning director Ron Howard chose Tautou for the part, preferring a French actress to a big name Hollywood star. UK actress Kate Beckinsale had been widely tipped as a possibility for the role alongside Vanessa Paradis and Juliette Binoche. The thriller upon which the movie is based has sold more than 17 million copies and is centred on a global conspiracy surrounding the Holy Grail mythology. The Louvre Museum, scene of the gruesome murder at the beginning of the novel, recently gave permission for filming to take place there, showbusiness newspaper Variety reported. The $100m movie will be produced by Columbia/Sony Pictures and is due for release on May 19, 2006 in the United States and France.
## 10 Gordon Brown has outlined what he thinks should be the key themes of New Labour's next general election bid. He said ensuring every child in Britain had the best start in life could be a legacy to match the NHS's creation. The chancellor has previously planned the party's election strategy but this time the role will be filled by Alan Milburn - a key ally of Tony Blair. The premier insisted Mr Brown will have a key role in Labour's campaign, and praised his handling of the economy. Writing in the Guardian newspaper, Mr Brown outlined his view of the direction New Labour should be taking. As our manifesto and our programme for the coming decade should make clear, Labour's ambition is not simply tackling idleness but delivering full employment; not just attacking ignorance, disease and squalor but promoting lifelong education, good health and sustainable communities. BBC political editor Andrew Marr said that Mr Brown's article was a warning shot to Mr Blair not to try and cut him out of the manifesto writing process. It was, as always, coded and careful... but entirely deliberate, was Mr Marr's assessment. The prime minister was asked about Mr Brown's article and about his election role when he appeared on BBC Radio 4's Today programme. Mr Blair said a decision had yet to be taken over how the election would be run but the chancellor's role would be central. Mr Blair argued that under New Labour the country had changed for the better and that was in part because of Mr Brown's management of the economy. And he pledged childcare would be a centrepiece of Labour's manifesto. He also predicted the next general election will be a tough, tough fight for New Labour. But the prime minister insisted he did not know what date the poll would take place despite speculation about 5 May. 
Mr Blair said he was taking nothing for granted ahead of the vote - warning that the Tory strategy was to win power via the back door by hinting they were aiming to cut Labour's majority instead of hoping for an outright win.
## 11 Kim Clijsters has denied reports that she has pulled out of January's Australian Open because of her persistent wrist injury. Open chief Paul McNamee had said: Kim's wrist obviously isn't going to be rehabilitated. But her spokesman insisted she had simply delayed submitting her entry. The doctors are assessing her injury on a weekly basis and if there is no risk she could play. But if there's the least risk she will stay away. Despite being absent from the WTA entry list for the tournament, which begins on 17 January, Clijsters would be certain to get a wild card if she requested one. Clijsters is still ranked 22nd in the world despite only playing a handful of matches last season. The Belgian had an operation on her left wrist early in the season but injured it again on her return to the tour. Meanwhile, Jelena Dokic, who used to compete for Australia, has opted out of the first Grand Slam of the season. Dokic has not played in the Australian Open since 2001 when she lost in the first round. But the 21-year-old would have had to rely on a wild card next season because her ranking has tumbled to 127th. Four-time champion Monica Seles, who has not played since last year's French Open, is another absentee because of an injured left foot.
## 12 Listen to the full interview on Sport on Five and the BBC Sport website from 1900 GMT. But Parry, speaking exclusively to BBC Sport, also admits Gerrard, who has been constantly linked with Chelsea, will have the final say on his future. He told BBC Five Live: Steven is above money. He is the future of Liverpool. It doesn't matter if it's £30m, £40m or £50m, we will not accept offers. But we are also realistic enough to know we can't keep Steven against his will. On the subject of Liverpool's finances, Parry also revealed the club is ready to explore the possibility of a sponsorship deal for its proposed new stadium. And responding to criticism from BBC Sport pundit and former Liverpool stalwart Alan Hansen, he insisted talks on new investment are ongoing, but added the door has not closed on shareholder and lifelong fan Steve Morgan. Parry joined Liverpool as chief executive in July 1998 from a similar role at the Premier League. There have been several highs and lows during his time in charge at Anfield - and he had a busy summer, overseeing the arrival of new manager Rafael Benitez and managing to hold on to Steven Gerrard. On the subject of Liverpool's captain and prize asset, Parry revealed Real Madrid did ask for an option on the England midfield man during negotiations for striker Fernando Morientes. He said: They were looking for ways of saying they got more out of the deal for Fernando Morientes, but the response to Real Madrid was the same - Steven is not for sale. But when asked if Gerrard would be a Liverpool player on the first day of next season, Parry said: I sincerely hope he will be. Steven knows my views. He knows Rafa's views. We have re-affirmed recently to Steven that we are trying to build a team around him. We crave success as much as he does. We know he's ambitious and nobody can argue with that. I think Steven would dearly love to win things with Liverpool more than he'd like to do anything else. 
We all want to see progress by next season. He's not alone in that. There are a lot of other players who feel the same, so we all have a common aim. It is expected Chelsea will test Liverpool with a £30m-plus bid in the summer - but Parry claims he will be in no mood to listen. There have been a lot of open secrets about Steven, most of which have been complete myths. It is suggested we had a deal tied up last summer. We didn't had an offer last summer, Parry explained. We had told Chelsea that as far as we were concerned he was not for sale and we didn't want to sell him. In reality it didn't go beyond that. Maybe there will be an offer in the summer. Maybe there won't. Our position is we want Steven to stay, but we are also realistic enough and have enough respect for Steven - and he has enough respect for us - to know that it is his decision that will be crucial. You are not going to keep a player like Steven against his will. That just doesn't work, but any idea we are going to accept offers for Steven and then tell him 'by the way we've decided to sell you' is not on the agenda. You can forget that. Parry is currently in the process of finalising funding for Liverpool's new stadium in Stanley Park, which is set to open in 2007. And he confessed Arsenal's £100m deal with Emirates to sponsor their new ground - complete with naming rights - has given the Anfield club serious food for thought. He said: I have to say historically it is something I have been against, and I have been on record as saying that, but I think the size of the Arsenal deal is a real eye-opener. I would say in the past deals have been done frankly far too cheaply and it just hasn't even been worth contemplating. But the Arsenal deal is the sort of deal that causes you to draw breath and say 'wow - that's interesting.' My personal point of view is that I would find it a hell of a lot more palatable than a shared stadium. 
Some Liverpool fans would find such a move highly controversial, but Parry countered: I recognise it would be an emotive issue for many supporters, but you look at the amount of money available and it could go into the team. If it was the right partner how strong an issue is it? Time will tell. I think the stadium will always be Anfield, not least because of where it is, but do we need to investigate the possibilities of sponsorship? I think it would be remiss not to. That's not to say we have made a decision that we will go down that road, but I think it is clearly something we have to explore. On the subject of possible new investment, Parry revealed Liverpool are still in negotiations with a mystery investor, with rumours of interest from the Middle East. That prompted the withdrawal of tycoon Steve Morgan, who got frustrated by failed bids and what he claimed was indecision by the board. He also accused Liverpool of using him as a stalking horse to attract other bids, but Parry explained: Steve has never been used as a stalking horse. There's no need, and that is not the way we do business. We had discussions with Steve over the course of 2004. I think we came close to concluding a deal in the summer but it didn't happen. Quite genuinely, the new interest did appear relatively late in the day just prior to the AGM in December, and as I have said it was of such potential magnitude, and that potential is so exciting, we felt we had to evaluate it. We are still evaluating it. Steve's interest was taken very much on its own merits. His enthusiasm for the club is there for all to see and who knows what the next few months will hold? The door isn't closed on anything. We had a perfectly sensible dialogue with Steve last year. We have a common interest in making Liverpool successful. That's a dream we all share, so as far as I'm concerned the door is not closed. I would take £50m if we had no investment, but if we did, keep him. 
As for the stadium, if it gets us cash what difference does it make really? £50m for Gerrard? I don't care who you are, the Directors would take the money and it is the way it should be. We cannot let that sum of money go, despite Gerrard's quality. Through a cleverly worded statement, the club has effectively forced Gerrard to publicly make the decision for himself, which I think is the right thing to do. Critical time for Liverpool with regards to Gerrard. Ideally we would want to secure his future to the club for the long term. I am hoping he doesn't walk out of the club like Michael Owen did for very little cash. £50m realistically would allow Rafa to completely rebuild the squad, however, if we can afford to do this AND keep Gerrard we will be better for it. I would however be happy with Gerrard's transfer for any fee over £35m. Parry's statements are clever in that any future Gerrard transfer cannot be construed as a lack of ambition by the club to not try and keep their best players. Upping the ante is another smart move by Parry. I would keep Gerrard. No amount of money could replace his obvious love of the club and determination to succeed. The key is if Gerrard comes out and says that he is happy. Clearly, if he isn't, then we would be foolish not to sell. The worrying thing is who would you buy (or who would come) pending possible non-Champions League football.
## freq perc cumfreq cumperc
## 1 2 0.2% 2 0.2%
## 2 2 0.2% 4 0.4%
## 3 2 0.2% 6 0.5%
## 4 2 0.2% 8 0.7%
## 5 2 0.2% 10 0.9%
## 6 2 0.2% 12 1.1%
## 7 2 0.2% 14 1.3%
## 8 2 0.2% 16 1.5%
## 9 2 0.2% 18 1.6%
## 10 2 0.2% 20 1.8%
## 11 2 0.2% 22 2.0%
## 12 2 0.2% 24 2.2%
## ... etc.
## [list output truncated]
We have duplicated values in the dataset, so we need to omit them.
# omitting duplicated rows and keeping only the title and content columns
df <- df[!duplicated(df$content), c('title', 'content')]
dim(df)
## [1] 1062 2
When processing text data, it is wise to make sure that all
entries are in the same language. We can use the textcat
function, which detects the language of a text.
# checking the language of articles with `textcat` package
df$lang <- textcat(df$content)
summary(as.factor(df$lang))
## english middle_frisian scots
## 1055 1 6
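As we see, nearly all summaries are detected as English. The few entries labelled as Scots or Middle Frisian are most likely English texts misclassified as closely related languages, so we may proceed further.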
Before fitting any model to text data, we need to convert it into a mathematically convenient form, which takes considerable effort.
First of all, we need to condense the content and clean the
text. The tm package is very useful for that.
With tm, we may:
Remove punctuation
Remove stop words
Remove additional spaces and digits
Stem the text
The tm package requires our data to be presented in a
specific structure - a Corpus. One may imagine it as a list object which
contains the contents as documents and some metadata about them.
corpus <- Corpus(VectorSource(as.vector(df$content)))
To access the content, we need to use double brackets, as Corpus is a list in R.
# This is how the text is accessed in Corpus
corpus[[1]]$content
## [1] " Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL. Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding. Time Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility, chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins. 
TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake. "
# And metadata
corpus[[1]]$meta
## author : character(0)
## datetimestamp: 2023-03-01 21:29:13
## description : character(0)
## heading : character(0)
## id : 1
## language : en
## origin : character(0)
Once the Corpus object is created, we may proceed to some
transformations. Since we want to strip specific patterns from the text entries,
the gsub function will help us.
# In case any new lines are persistent, omit them
new_line <- function(x) gsub("[\n]","",x)
corpus <- tm_map(corpus, new_line)
corpus[[5]]$content
## [1] " Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard. Reports in the Wall Street Journal and the Financial Times suggested that the French spirits firm is considering a bid, but has yet to contact its target. Allied Domecq shares in London rose 4% by 1200 GMT, while Pernod shares in Paris slipped 1.2%. Pernod said it was seeking acquisitions but refused to comment on specifics. Pernod's last major purchase was a third of US giant Seagram in 2000, the move which propelled it into the global top three of drinks firms. The other two-thirds of Seagram was bought by market leader Diageo. In terms of market value, Pernod - at 7.5bn euros ($9.7bn) - is about 9% smaller than Allied Domecq, which has a capitalisation of £5.7bn ($10.7bn; 8.2bn euros). Last year Pernod tried to buy Glenmorangie, one of Scotland's premier whisky firms, but lost out to luxury goods firm LVMH. Pernod is home to brands including Chivas Regal Scotch whisky, Havana Club rum and Jacob's Creek wine. Allied Domecq's big names include Malibu rum, Courvoisier brandy, Stolichnaya vodka and Ballantine's whisky - as well as snack food chains such as Dunkin' Donuts and Baskin-Robbins ice cream. The WSJ said that the two were ripe for consolidation, having each dealt with problematic parts of their portfolio. Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains. "
# omit apostrophes
apostr <- function(x) gsub("'","",x)
corpus <- tm_map(corpus, apostr)
corpus[[1]]$content
## [1] " Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL. Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOLs underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOLs existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding. Time Warners fourth quarter profits were slightly better than analysts expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility, chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins. 
TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmanns purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake. "
# omit brackets
bracket <- function(x) gsub("\\[|\\]","",x)
corpus <- tm_map(corpus, bracket)
## Warning in tm_map.SimpleCorpus(corpus, bracket): transformation drops documents
corpus[[5]]$content
## [1] " Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by Frances Pernod Ricard. Reports in the Wall Street Journal and the Financial Times suggested that the French spirits firm is considering a bid, but has yet to contact its target. Allied Domecq shares in London rose 4% by 1200 GMT, while Pernod shares in Paris slipped 1.2%. Pernod said it was seeking acquisitions but refused to comment on specifics. Pernods last major purchase was a third of US giant Seagram in 2000, the move which propelled it into the global top three of drinks firms. The other two-thirds of Seagram was bought by market leader Diageo. In terms of market value, Pernod - at 7.5bn euros ($9.7bn) - is about 9% smaller than Allied Domecq, which has a capitalisation of £5.7bn ($10.7bn; 8.2bn euros). Last year Pernod tried to buy Glenmorangie, one of Scotlands premier whisky firms, but lost out to luxury goods firm LVMH. Pernod is home to brands including Chivas Regal Scotch whisky, Havana Club rum and Jacobs Creek wine. Allied Domecqs big names include Malibu rum, Courvoisier brandy, Stolichnaya vodka and Ballantines whisky - as well as snack food chains such as Dunkin Donuts and Baskin-Robbins ice cream. The WSJ said that the two were ripe for consolidation, having each dealt with problematic parts of their portfolio. Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains. "
# Cut the stopwords
corpus <- tm_map(corpus, removeWords, stopwords('english'))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus[[5]]$content
## [1] " Shares UK drinks food firm Allied Domecq risen speculation target takeover Frances Pernod Ricard. Reports Wall Street Journal Financial Times suggested French spirits firm considering bid, yet contact target. Allied Domecq shares London rose 4% 1200 GMT, Pernod shares Paris slipped 1.2%. Pernod said seeking acquisitions refused comment specifics. Pernods last major purchase third US giant Seagram 2000, move propelled global top three drinks firms. The two-thirds Seagram bought market leader Diageo. In terms market value, Pernod - 7.5bn euros ($9.7bn) - 9% smaller Allied Domecq, capitalisation £5.7bn ($10.7bn; 8.2bn euros). Last year Pernod tried buy Glenmorangie, one Scotlands premier whisky firms, lost luxury goods firm LVMH. Pernod home brands including Chivas Regal Scotch whisky, Havana Club rum Jacobs Creek wine. Allied Domecqs big names include Malibu rum, Courvoisier brandy, Stolichnaya vodka Ballantines whisky - well snack food chains Dunkin Donuts Baskin-Robbins ice cream. The WSJ said two ripe consolidation, dealt problematic parts portfolio. Pernod reduced debt took fund Seagram purchase just 1.8bn euros, Allied improved performance fast-food chains. "
# Removing punctuation
corpus <- tm_map(corpus, removePunctuation)
corpus[[5]]$content
## [1] " Shares UK drinks food firm Allied Domecq risen speculation target takeover Frances Pernod Ricard Reports Wall Street Journal Financial Times suggested French spirits firm considering bid yet contact target Allied Domecq shares London rose 4 1200 GMT Pernod shares Paris slipped 12 Pernod said seeking acquisitions refused comment specifics Pernods last major purchase third US giant Seagram 2000 move propelled global top three drinks firms The twothirds Seagram bought market leader Diageo In terms market value Pernod 75bn euros 97bn 9 smaller Allied Domecq capitalisation £57bn 107bn 82bn euros Last year Pernod tried buy Glenmorangie one Scotlands premier whisky firms lost luxury goods firm LVMH Pernod home brands including Chivas Regal Scotch whisky Havana Club rum Jacobs Creek wine Allied Domecqs big names include Malibu rum Courvoisier brandy Stolichnaya vodka Ballantines whisky well snack food chains Dunkin Donuts BaskinRobbins ice cream The WSJ said two ripe consolidation dealt problematic parts portfolio Pernod reduced debt took fund Seagram purchase just 18bn euros Allied improved performance fastfood chains "
# Remove numbers if persist
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus[[5]]$content
## [1] " Shares UK drinks food firm Allied Domecq risen speculation target takeover Frances Pernod Ricard Reports Wall Street Journal Financial Times suggested French spirits firm considering bid yet contact target Allied Domecq shares London rose GMT Pernod shares Paris slipped Pernod said seeking acquisitions refused comment specifics Pernods last major purchase third US giant Seagram move propelled global top three drinks firms The twothirds Seagram bought market leader Diageo In terms market value Pernod bn euros bn smaller Allied Domecq capitalisation £bn bn bn euros Last year Pernod tried buy Glenmorangie one Scotlands premier whisky firms lost luxury goods firm LVMH Pernod home brands including Chivas Regal Scotch whisky Havana Club rum Jacobs Creek wine Allied Domecqs big names include Malibu rum Courvoisier brandy Stolichnaya vodka Ballantines whisky well snack food chains Dunkin Donuts BaskinRobbins ice cream The WSJ said two ripe consolidation dealt problematic parts portfolio Pernod reduced debt took fund Seagram purchase just bn euros Allied improved performance fastfood chains "
# Changing to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus[[5]]$content
## [1] " shares uk drinks food firm allied domecq risen speculation target takeover frances pernod ricard reports wall street journal financial times suggested french spirits firm considering bid yet contact target allied domecq shares london rose gmt pernod shares paris slipped pernod said seeking acquisitions refused comment specifics pernods last major purchase third us giant seagram move propelled global top three drinks firms the twothirds seagram bought market leader diageo in terms market value pernod bn euros bn smaller allied domecq capitalisation £bn bn bn euros last year pernod tried buy glenmorangie one scotlands premier whisky firms lost luxury goods firm lvmh pernod home brands including chivas regal scotch whisky havana club rum jacobs creek wine allied domecqs big names include malibu rum courvoisier brandy stolichnaya vodka ballantines whisky well snack food chains dunkin donuts baskinrobbins ice cream the wsj said two ripe consolidation dealt problematic parts portfolio pernod reduced debt took fund seagram purchase just bn euros allied improved performance fastfood chains "
# remove currency signs
pounds <- function(x) gsub("\\$|\\£","",x)
corpus <- tm_map(corpus, pounds)
## Warning in tm_map.SimpleCorpus(corpus, pounds): transformation drops documents
corpus[[1]]$content
## [1] " quarterly profits us media giant timewarner jumped bn m three months december m yearearlier the firm now one biggest investors google benefited sales highspeed internet connections higher advert sales timewarner said fourth quarter sales rose bn bn its profits buoyed one gains offset profit dip warner bros less users aol time warner said friday now owns searchengine google but internet business aol mixed fortunes it lost subscribers fourth quarter profits lower preceding three quarters however company said aols underlying profit exceptional items rose back stronger internet advertising revenues it hopes increase subscribers offering online service free timewarner internet customers try sign aols existing customers highspeed broadband timewarner also restate results following probe us securities exchange commission sec close concluding time warners fourth quarter profits slightly better analysts expectations but film division saw profits slump m helped boxoffice flops alexander catwoman sharp contrast yearearlier third final film lord rings trilogy boosted results for fullyear timewarner posted profit bn performance revenues grew bn our financial performance strong meeting exceeding fullyear objectives greatly enhancing flexibility chairman chief executive richard parsons said for timewarner projecting operating earnings growth around also expects higher revenue wider profit margins timewarner restate accounts part efforts resolve inquiry aol us market regulators it already offered pay m settle charges deal review sec the company said unable estimate amount needed set aside legal reserves previously set m it intends adjust way accounts deal german music publisher bertelsmanns purchase stake aol europe reported advertising revenue it now book sale stake aol europe loss value stake "
# remove abbreviations for millions, etc. (note: this pattern also matches these letters inside words)
million <- function(x) gsub("m|bn|a","",x)
corpus <- tm_map(corpus, million)
## Warning in tm_map.SimpleCorpus(corpus, million): transformation drops documents
corpus[[5]]$content
## [1] " shres uk drinks food fir llied doecq risen specultion trget tkeover frnces pernod ricrd reports wll street journl finncil ties suggested french spirits fir considering bid yet contct trget llied doecq shres london rose gt pernod shres pris slipped pernod sid seeking cquisitions refused coent specifics pernods lst jor purchse third us gint segr ove propelled globl top three drinks firs the twothirds segr bought rket leder digeo in ters rket vlue pernod euros sller llied doecq cpitlistion euros lst yer pernod tried buy glenorngie one scotlnds preier whisky firs lost luxury goods fir lvh pernod hoe brnds including chivs regl scotch whisky hvn club ru jcobs creek wine llied doecqs big nes include libu ru courvoisier brndy stolichny vodk bllntines whisky well snck food chins dunkin donuts bskinrobbins ice cre the wsj sid two ripe consolidtion delt probletic prts portfolio pernod reduced debt took fund segr purchse just euros llied iproved perfornce fstfood chins "
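The output above shows that the pattern `m|bn|a` matches those letters anywhere, including inside words ("shares" becomes "shres", "firm" becomes "fir"). If the intent was only to drop stand-alone leftovers such as "m" and "bn", a word-boundary pattern is one option. The following is a minimal sketch under that assumption; `drop_units` is a hypothetical name, and all outputs shown in the rest of this study were produced with the original pattern.

```r
# Hypothetical alternative to `million`: remove "m" and "bn" only when they
# stand alone as whole tokens, leaving letters inside words untouched.
drop_units <- function(x) gsub("\\b(m|bn)\\b", "", x, perl = TRUE)

sample_text <- "profits jumped bn m three months firm market"
drop_units(sample_text)
```

The extra spaces left behind would be collapsed by the stripWhitespace step that follows.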
# Omit white spaces
corpus <- tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation drops
## documents
corpus[[8]]$content
## [1] " indi ttends g eeting seven leding industrilised ntions fridy unlikely cowed newcoer sttus in london thursdy hed eeting indis finnce inister lshed restrictive trde policies g ntions he objected subsidies griculture ke hrd developing ntions like indi copete he lso clled refor united ntions world bnk if plnippn chidbr indis finnce inister rgued orgnistions need tke ccount chnging world order given indi chins integrtion globl econoy he sid issue globlistion ters enggeent globlistion r chidbr ttending g eeting prt g group ntions ccount two thirds worlds popultion t conference developing enterprise hosted uk finnce inister gordon brown fridy sid fvour floting exchnge rtes help countries cope econoic shocks flexible exchnge rte one chnnel bsorbing positive negtive shocks told conference indi long chin brzil south fric russi invited tke prt g eeting tking plce london fridy sturdy chin expected fce renewed pressure bndon fixed exchnge rte g ntions prticulr us bled surge chep chinese exports soe countries tried use fixed exchnge rtes i wish ke judgeents r chidbr sid seprtely if wrned thursdy indis budget deficit lrge hper countrys econoic growth forecst round yer rch in yer rch indin econoy grew "
# Stemming: reduce each word to its root
corpus <- tm_map(corpus, content_transformer(stemDocument), language = 'english')
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(stemDocument), :
## transformation drops documents
corpus[[8]]$content
## [1] "indi ttend g eet seven lede industrilis ntion fridi unlik cow newcoer sttus in london thursdi hed eet indi finnc inist lshed restrict trde polici g ntion he object subsidi gricultur ke hrd develop ntion like indi copet he lso clled refor unit ntion world bnk if plnippn chidbr indi finnc inist rgu orgnist need tke ccount chnging world order given indi chin integrt globl econoy he sid issu globlist ter enggeent globlist r chidbr ttend g eet prt g group ntion ccount two third world popult t confer develop enterpris host uk finnc inist gordon brown fridi sid fvour flote exchng rtes help countri cope econo shock flexibl exchng rte one chnnel bsorb posit negtiv shock told confer indi long chin brzil south fric russi invit tke prt g eet tking plce london fridi sturdi chin expect fce renew pressur bndon fix exchng rte g ntion prticulr us bled surg chep chines export soe countri tri use fix exchng rtes i wish ke judgeent r chidbr sid seprt if wrned thursdi indi budget deficit lrge hper countri econo growth forecst round yer rch in yer rch indin econoy grew"
Having more or less cleaned the text content, we may convert the Corpus
object back into a regular data frame. The cleaned documents are added as a
new column, tidy_text, in our df.
df$tidy_text <- data.frame(tidy_text = sapply(corpus, as.character), stringsAsFactors = FALSE)[,'tidy_text']
To analyse text data, we need to present it in the traditional form of an M x N matrix, with M observations and N features.
Thus, we need to spread the cleaned articles into a Document-Term Matrix, where:
rows stand for documents
columns stand for terms.
At this step, we shall use the tidytext package
developed by Julia Silge et al. The authors also provide a comprehensive
guide to text mining in their book.
We first unnest tokens, i.e. split the documents into
separate words. Next, the term frequencies and inverse document
frequencies are computed, and from them the TF-IDF score of each
term in each document. For this, the bind_tf_idf function is used.
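To make the scoring concrete, here is a small base-R sketch on a toy corpus. It assumes the usual definitions, tf = (count of the term in the document) / (number of terms in the document) and idf = ln(number of documents / number of documents containing the term), which is how the tidytext documentation describes bind_tf_idf. The toy documents and all variable names are made up for illustration.

```r
# Toy corpus: three tiny "documents" (illustrative only)
docs <- c(d1 = "market shares rose",
          d2 = "market profits fell",
          d3 = "club won match")

# Split each document into words and build a long table of (doc, word, n)
tokens <- strsplit(docs, " ")
counts <- do.call(rbind, lapply(names(tokens), function(d) {
  tab <- table(tokens[[d]])
  data.frame(doc = d, word = names(tab), n = as.integer(tab),
             stringsAsFactors = FALSE)
}))

# Term frequency: share of the document's words taken by this term
counts$tf <- counts$n / ave(counts$n, counts$doc, FUN = sum)

# Inverse document frequency: ln(number of docs / docs containing the term)
doc_freq <- table(counts$word)
counts$idf <- log(length(docs) / as.numeric(doc_freq[counts$word]))

# TF-IDF score of each term in each document
counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

In this toy example "market" appears in two of the three documents, so it receives a lower score than terms that occur in only one document.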
set.seed(123)
words_tdm <- df %>%
  # tokenize documents into words with tidytext
  unnest_tokens(word, tidy_text, drop = FALSE) %>%
  # remove stop words once more, this time with the tidytext list
  anti_join(stop_words) %>%
  # count occurrences of each term per document
  count(tidy_text, word, sort = TRUE)
## Joining, by = "word"
words <- words_tdm %>%
  # compute tf-idf values with tidytext
  bind_tf_idf(word, tidy_text, n) %>%
  # reshape with tidyr so that each word becomes a feature column
  select(tidy_text, word, tf_idf) %>%
  spread(key = word, value = tf_idf, fill = 0)
dim(words)
## [1] 1062 21543
as_tibble(words[450:455, 1:5])
## # A tibble: 6 × 5
## tidy_text bb bbb bbc bbcis
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 new europen direct put softwr writer risk legl ctio… 0 0 0.00263 0
## 2 new hoe secretri chrles clrke vow plough plns id cr… 0 0 0.00555 0
## 3 new york bnd scissor sister won gig yer wrd perforn… 0 0 0 0
## 4 new zelnd prove strong ustrlindoint brbrin round un… 0 0 0 0
## 5 newcstl boss gree souness line suer ove englnd rel … 0 0 0.0185 0
## 6 newcstl nger gree souness close sign chels defend c… 0 0 0 0
We have now obtained a matrix with 21542 term features as columns (the first of the 21543 columns holds the document text). Each cell contains the TF-IDF value of a term in a document.
Here we run into the central difficulty of text analysis: the curse of dimensionality.
Such a high number of features does not really improve the quality of classification or clustering; it rather adds noise and computational cost.
Thus, our task now is to extract a much smaller number of features from this matrix without losing much explanatory power.
Generally, the most popular algorithm for matrix factorization is the Singular Value Decomposition (SVD). It is a powerful tool, I would say the star of linear algebra.
According to the theorem, every matrix M can be decomposed into a product of three matrices, M = U Σ V^T, where Σ is a diagonal matrix containing the singular values. So, what we can do is keep only the k largest singular values together with the corresponding singular vectors. Of course, the resulting features will not be columns of the original matrix M, but new, transformed ones.
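The truncation idea can be illustrated on a small matrix with base R's `svd` function. This is only a sketch on random toy data, not the computation used on the news data; the matrix `M`, the choice `k = 2` and the variable names are assumptions made for the example.

```r
set.seed(1)
# Toy 6 x 5 matrix standing in for a document-term matrix (illustrative only)
M <- matrix(rnorm(30), nrow = 6, ncol = 5)

s <- svd(M)   # full decomposition: M = U diag(d) V^T
k <- 2        # number of singular values to keep

# Rank-k approximation built from the k largest singular values
M_k <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k])

# Reduced representation of the rows: k new features per row
features <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k)
dim(features)

# The Frobenius error of the approximation equals the root of the summed
# squares of the discarded singular values
sqrt(sum((M - M_k)^2))
sqrt(sum(s$d[-(1:k)]^2))
```

The last two values coincide, which is why keeping the largest singular values loses as little information as possible for a given k. A truncated solver computes only the leading k components instead of the full decomposition, which matters for a matrix with tens of thousands of columns.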
In R, the truncated SVD can be performed with the help of the
irlba package.
For the decomposition, we set the number of singular values to keep, which equals the number of features we will obtain. We will extract 10 features.
# irlba applies truncated SVD, so we may set the number of features we want to get
set.seed(858)
words <- words[,2:ncol(words)]
svd <- irlba(t(words), nv = 10, maxit = 40)
Now we will cluster the 10 features extracted by Singular Value Decomposition.
options(scipen = 999)
concepts <- as.data.frame(svd$v)*100
Before fitting any clustering algorithm, we usually check the clusterability of the data. The Hopkins statistic is very useful for that. As reported here, the closer the statistic is to 1, the more clusterable the data, while values around 0.5 indicate data without cluster structure.
# calculate Hopkins statistic
get_clust_tendency(concepts, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.9800416
##
## $plot
Hopkins statistic is very close to 1, so we definitely expect to find some clusters.
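To show what the statistic measures, here is a rough base-R sketch of one common formulation, in which values near 1 indicate clusterable data. It compares nearest-neighbour distances of real points with those of uniformly generated points. The helper `hopkins_sketch` and the toy data are hypothetical; this is not the code used inside get_clust_tendency, and other formulations (for example with distances raised to a power) exist.

```r
# Sketch of a Hopkins-type statistic (hypothetical helper)
hopkins_sketch <- function(X, m = 20) {
  X <- as.matrix(X)
  n <- nrow(X)
  idx <- sample(n, m)

  # w: distance from each sampled real point to its nearest other real point
  D <- as.matrix(dist(X))
  diag(D) <- Inf
  w <- apply(D[idx, , drop = FALSE], 1, min)

  # u: distance from uniform random points (inside the data's bounding box)
  # to their nearest real point
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))

  sum(u) / (sum(u) + sum(w))
}

set.seed(42)
# Two tight, well-separated blobs versus uniform noise
blobs <- rbind(matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
               matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2))
noise <- matrix(runif(200), ncol = 2)

hopkins_sketch(blobs)   # clustered data: should come out near 1
hopkins_sketch(noise)   # unstructured data: should come out around 0.5
```

Because the statistic is based on a random sample of m points, it varies from run to run, and more so when m is small.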
Next, we fit K-means clustering and decide upon the optimal number of clusters. We first refer to the average silhouette criterion, the AIC (Akaike information criterion) and the variance explained.
set.seed(123)
library(ClusterR)
## Loading required package: gtools
optimum <-Optimal_Clusters_KMeans(concepts, max_clusters=10, plot_clusters = TRUE, criterion="silhouette")
optimum <-Optimal_Clusters_KMeans(concepts, max_clusters=10, plot_clusters = TRUE, criterion="AIC")
optimum <-Optimal_Clusters_KMeans(concepts, max_clusters=10, plot_clusters = TRUE, criterion="variance_explained")
As we observe, in our case it is difficult to decide which k is optimal. All three measures suggest that the clusters fit better as k grows.
This may occur because our data is scattered in many small groups.
So, let us try different options.
# estimate K-means with different number of k
set.seed(13)
kmeans_3 <- eclust(concepts, "kmeans", k=3)
kmeans_4 <- eclust(concepts, "kmeans", k=4)
kmeans_5 <- eclust(concepts, "kmeans", k=5)
kmeans_6 <- eclust(concepts, "kmeans", k=6)
kmeans_8 <- eclust(concepts, "kmeans", k=8)
Silhouette criterion
Additionally, we check the silhouette width for separate clusters. As we see, the problem is that even if we increase k, one cluster remains in which the points are far from each other, so the average silhouette width of this cluster is negative.
fviz_silhouette(kmeans_3)
## cluster size ave.sil.width
## 1 1 92 0.58
## 2 2 683 -0.02
## 3 3 287 0.37
fviz_silhouette(kmeans_4)
## cluster size ave.sil.width
## 1 1 92 0.56
## 2 2 707 -0.07
## 3 3 194 0.63
## 4 4 69 0.43
fviz_silhouette(kmeans_5)
## cluster size ave.sil.width
## 1 1 92 0.56
## 2 2 620 -0.06
## 3 3 195 0.62
## 4 4 66 0.44
## 5 5 89 0.53
fviz_silhouette(kmeans_6)
## cluster size ave.sil.width
## 1 1 94 0.54
## 2 2 445 -0.06
## 3 3 208 0.60
## 4 4 157 0.36
## 5 5 89 0.52
## 6 6 69 0.42
fviz_silhouette(kmeans_8)
## cluster size ave.sil.width
## 1 1 90 0.49
## 2 2 142 -0.05
## 3 3 275 0.25
## 4 4 136 0.27
## 5 5 80 0.46
## 6 6 59 0.35
## 7 7 134 0.52
## 8 8 146 0.17
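For reference, the silhouette width of a single point can be computed by hand as s = (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below applies this to toy data in base R; `silhouette_sketch` and the toy data are hypothetical names introduced only for this illustration.

```r
# Silhouette width by hand: s(i) = (b - a) / max(a, b)
silhouette_sketch <- function(X, cl) {
  D <- as.matrix(dist(X))
  sapply(seq_len(nrow(D)), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE
    if (!any(own)) return(0)   # singleton cluster: width set to 0
    a <- mean(D[i, own])       # mean distance within the own cluster
    other <- cl != cl[i]
    b <- min(tapply(D[i, other], cl[other], mean))   # nearest other cluster
    (b - a) / max(a, b)
  })
}

set.seed(7)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 10)

sil <- silhouette_sketch(toy, km$cluster)
tapply(sil, km$cluster, mean)   # average silhouette width per cluster
```

A negative average width, as seen for cluster 2 above, means that the points of that cluster are on average closer to the members of a neighbouring cluster than to each other.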
The silhouette can also be checked with the NbClust function.
opt2 <- NbClust(concepts, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="silhouette")
opt2$Best.nc
## Number_clusters Value_Index
## 2.0000 0.7346
opt2$All.index
## 2 3 4 5 6 7 8
## 0.7346 0.5512 0.5278 0.5405 0.3457 0.3446 0.2305
Duda-Hart Index
Another quality measure to check is the Duda-Hart index, which tests the hypothesis that the data within a cluster is homogeneous against the alternative that it is heterogeneous and can be split further. Again, its p-value of 0 suggests that our data is heterogeneous.
dudahart2(concepts, kmeans_4$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.7096542
##
## $compare
## [1] 0.8956854
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Calinski-Harabasz Index
It also measures the dispersion within and between clusters. Usually, we want the CH index to be higher, and we see that the highest value is obtained for 8 clusters, in line with the previous metrics.
round(calinhara(concepts, kmeans_3$cluster), digits = 2)
## [1] 101.43
round(calinhara(concepts, kmeans_4$cluster), digits = 2)
## [1] 111.84
round(calinhara(concepts, kmeans_5$cluster), digits = 2)
## [1] 126.93
round(calinhara(concepts, kmeans_6$cluster), digits = 2)
## [1] 144.12
round(calinhara(concepts, kmeans_8$cluster), digits = 2)
## [1] 167.71
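The index itself is a simple ratio: CH = (B / (k - 1)) / (W / (n - k)), where B is the between-cluster sum of squares, W the within-cluster sum of squares, n the number of points and k the number of clusters. A minimal base-R sketch on toy data follows; `ch_sketch` and the toy data are hypothetical and serve only to illustrate the formula.

```r
# Calinski-Harabasz index by hand (hypothetical helper)
ch_sketch <- function(X, cl) {
  X <- as.matrix(X)
  n <- nrow(X)
  k <- length(unique(cl))
  centre_all <- colMeans(X)
  W <- 0   # within-cluster sum of squares
  B <- 0   # between-cluster sum of squares
  for (g in unique(cl)) {
    Xg <- X[cl == g, , drop = FALSE]
    cg <- colMeans(Xg)
    W <- W + sum(sweep(Xg, 2, cg)^2)
    B <- B + nrow(Xg) * sum((cg - centre_all)^2)
  }
  (B / (k - 1)) / (W / (n - k))
}

set.seed(11)
toy <- rbind(matrix(rnorm(80, mean = 0), ncol = 2),
             matrix(rnorm(80, mean = 4), ncol = 2))

# Compare the index across several values of k
for (k in 2:4) {
  km <- kmeans(toy, centers = k, nstart = 10)
  cat("k =", k, " CH =", round(ch_sketch(toy, km$cluster), 2), "\n")
}
```

The same value can be obtained from the `betweenss` and `tot.withinss` components returned by `kmeans`.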
Shadow Statistic
set.seed(9)
kmeans.6 <- kcca(concepts, 6, kccaFamily('kmeans'))
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
shadow(kmeans.6)
## 1 2 3 4 5 6
## 0.5507418 0.5814969 0.8003789 0.7960655 0.7711653 0.6199907
plot(shadow(kmeans.6))
As the R documentation says: ‘If the shadow values of a point is close to 0, then the point is close to its cluster centroid. If the shadow value is close to 1, it is almost equidistant to the two centroids. Thus, a cluster that is well separated from all other clusters should have many points with small shadow values.’
As we observe, the mean shadow values are above 0.5 in every cluster, so many points in our dataset lie almost as close to a second centroid as to their own, and the clusters generated by K-Means are not well separated.
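To see where these numbers come from, the sketch below computes shadow values by hand, assuming the definition we understand the flexclust documentation to use: twice the distance to the closest centroid, divided by the sum of the distances to the closest and second-closest centroids. The helper `shadow_sketch` and the toy data are hypothetical.

```r
# Shadow value per point: 2 * d1 / (d1 + d2), with d1 and d2 the distances
# to the closest and second-closest centroid (hypothetical helper)
shadow_sketch <- function(X, centers) {
  X <- as.matrix(X)
  # distance of every point to every centroid (n x k matrix)
  D <- sapply(seq_len(nrow(centers)), function(j)
    sqrt(rowSums(sweep(X, 2, centers[j, ])^2)))
  # sort each row so that columns 1 and 2 hold d1 and d2
  d_sorted <- t(apply(D, 1, sort))
  2 * d_sorted[, 1] / (d_sorted[, 1] + d_sorted[, 2])
}

set.seed(3)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 10)

sh <- shadow_sketch(toy, km$centers)
tapply(sh, km$cluster, mean)   # mean shadow value per cluster
```

For two well-separated toy clusters the mean shadow values stay low, in contrast to the values between 0.55 and 0.80 obtained for the news data.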
For this reason, it is likely that our data does not admit a clean separation. We definitely do not want to split the data into too many clusters, as the results would then be very hard to summarize.
What we can do, though, is visualise our clusters in a more informative way by applying dimensionality reduction with t-SNE. The t-SNE algorithm reduces the data to two dimensions, so we can easily plot it together with the clusters obtained.
We will stick to 6 clusters.
set.seed(27)
tsne <- Rtsne(concepts, dims = 2, perplexity = 30)
tsne_df <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2], col = as.factor(kmeans_6$cluster))
ggplot(tsne_df) + geom_point(aes(x=x, y=y, color=col))
As we observe, there are groups of points visible, and the documents can
be generalized to several categories. K-means generally captures the
clusters well, but cluster 2 is too spread out.
As we observed, K-Means spotted the clusters present in the data fairly well, though it definitely could be improved. Another option for clustering text data is hierarchical clustering. Let’s see if it defines better clusters.
# dissimilarity matrix
d <- dist(concepts, method = "euclidean")
# agglomerative clustering
hierarch <- hclust(d, method = 'ward.D')
sub <- cutree(hierarch, k = 6)
tsne_df <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2], col = as.factor(sub))
ggplot(tsne_df) + geom_point(aes(x=x, y=y, color=col))
# divisive clustering (DIANA)
div_hcut <- diana(concepts)
# divisive coefficient
div_hcut$dc
## [1] 0.9714491
sub_d <- cutree(div_hcut, k = 6)
tsne_df <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2], col = as.factor(sub_d))
ggplot(tsne_df) + geom_point(aes(x=x, y=y, color=col))
Having obtained a few clusters for the news summaries data, we may check the most frequent terms in each group. Usually, the most frequent words allow us to summarize the common topic of a particular document group.
One of the best ways to do this is to create word clouds.
So, in cluster 1 the most common root words are ‘club’, ‘sport’, ‘player’, etc., so we may conclude that articles from group 1 are most probably about sports.
For the third cluster, we see ‘group’, ‘govern’, ‘world’, ‘European’, so we expect it to be about economics, politics or international news.
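The term counts behind such word clouds can be obtained with a few lines of base R. The sketch below uses made-up toy texts and labels; `texts`, `cluster_id` and `top_terms` are hypothetical stand-ins for the cleaned documents and the cluster assignments, which must be in the same row order.

```r
# Toy stand-ins for the cleaned documents and their cluster labels
texts <- c("club player match club", "player goal club",
           "govern world market", "market govern tax govern")
cluster_id <- c(1, 1, 2, 2)

# Most frequent words per cluster (hypothetical helper)
top_terms <- function(texts, cluster_id, n = 3) {
  lapply(split(texts, cluster_id), function(docs) {
    words <- unlist(strsplit(docs, "\\s+"))
    words <- words[nzchar(words)]
    head(sort(table(words), decreasing = TRUE), n)
  })
}

top_terms(texts, cluster_id)
```

The names and counts of each resulting table can then be passed as the `words` and `freq` arguments of the wordcloud function to draw one cloud per cluster.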