DAM Assignment 3 - Analysis of Unstructured Data

Introduction

This report presents the insights obtained from applying several text analytics techniques to identify the content, themes and topics of a directory containing several text documents.

Business Understanding

A manager found a mysterious directory named “docs” on his workplace computer. He does not know what information the directory holds and does not want to go through the painstaking task of reading and interpreting each file manually. He has therefore asked me to use text analytics to identify the relevant content, themes and topics hidden in the directory.

Data Preparation

All text files in the “docs” directory were loaded and merged into one corpus, because the analysis needs to combine and compare all of the files. A quick look at one randomly chosen document shows plenty of unwanted characters, so the corpus has to be cleaned before it is in a format suitable for text analytics. The following data cleaning steps were performed :

Remove Punctuation Marks : Punctuation marks are removed from the corpus because they add no value to an analysis based on words.
Transform all words and letters to lower case : Text analytics treats words case-sensitively, so the same word written with different capitalisation would otherwise be counted as 2 separate words.
Remove numbers and digits : Numbers and digits are removed because the analysis is concerned only with words that reveal the hidden topics and themes of the corpus.
Remove Stop Words : Stop words are the most commonly used words in a language. Although they occur often, they carry little meaning of their own, so removing them lets the analysis focus on the important words. Examples of stop words in the document below are to, is, in and the.
Remove Whitespaces : Excess white space (repeated spaces, tabs and line breaks) carries no information and is removed.
Stemming : Stemming trims words to their stem or root by removing suffixes, so that the same word appearing with different suffixes is treated as one word.
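
The cleaning steps above can be sketched in a few lines. The report's own pipeline is not shown, so the following is only a minimal Python illustration using the standard library. The stop-word list, the `crude_stem` suffix stripper (a stand-in, not a real stemming algorithm such as Porter's) and the sample sentence are all assumptions made for the example.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); real lists are far longer.
STOP_WORDS = {"to", "is", "in", "the", "a", "an", "and", "of", "on", "that", "this", "are"}

# Suffixes for a deliberately crude stemmer (assumption, not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Apply the cleaning steps in the order listed above and return the tokens."""
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.lower()                                                # lower case
    text = re.sub(r"\d+", " ", text)                                   # remove numbers
    tokens = text.split()                                              # split, dropping excess whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]                # remove stop words
    return [crude_stem(t) for t in tokens]                             # stemming


sample = "The 2 Projects are managed; the Manager is managing risks in 2020!"
print(clean_text(sample))
```

In this sketch punctuation is deleted rather than replaced by a space, so a term written as creation/sharing collapses into a single token, much like the creationshar token visible in the cleaned document further down.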

Contents of the randomly selected document before the data cleaning steps above were applied :

## Conventional approaches to knowledge management on projects focus on  the cognitive (or thought related) and mechanical  aspects of knowledge creation and capture. There is alternate view, one which considers knowledge as being created through interactions between people who  through their interactions  develop mutually acceptable interpretations of theories and facts in ways that suit their particular needs. That is, project knowledge is socially constructed. If this is true, then project managers need to pay attention to the environmental and social factors that influence knowledge construction.  This is the position taken by Paul Jackson and Jane Klobas in their paper entitled, Building knowledge in projects: A practical application of social constructivism to information systems development, which presents a  knowledge creation / sharing process model based social constructivist theory. This article is a summary and review of the paper.  A social constructivist view of knowledge Jackson and Klobas begin with the observation that engineering disciplines are founded on the belief that knowledge can be expressed in propositions that correspond to a reality which  is independent of human perception.  However, there is an alternate view that knowledge is not absolute, but relative i.e.  it depends on the mental models and beliefs used to interpret facts, objects and events. A  relevant example is how a software product is viewed by business users and software developers. The former group may see an application in terms of its utility whereas the latter may see it as an instance of a particular technology. Such perception gaps can also occur within seemingly homogenous groups such as teams comprised of software developers, for example. This can happen for a variety of reasons such as the differences in the experience and cultural backgrounds of those who make up the group. Social constructivism looks at how such gaps can be bridged.  
The authors' discussion relies on the work of Berger and Luckmann, who described how the gap between perceptions of different individuals can be overcome to create a socially constructed, shared reality. The phrase "socially constructed" implies that reality (as it pertains to a project, for example) is created via a common understanding of issues, followed by mutual agreement between all the players as to what comprises that reality. For me this view strikes a particular chord because of it is akin to the stated aims of dialogue mapping, a technique that I have described in several earlier posts (see this article for an example relevant to projects).  Knowledge in information systems development as a social construct First up, the authors make the point that information systems development (ISD) projects are:  intensive exercises in constructing social reality through process and data modeling. These models are informed with the particular world view of systems designers and their use of particular formal representations. In ISD projects, this operational reality is new and explicitly constructed and becomes understood and accepted through negotiated agreement between participants from the two cultures of business and IT  Essentially, knowledge emerges  through interaction and discussion  as the project proceeds.  However, the methodologies used in design are typically founded on an engineering approach, which takes a positivist view rather than a social one. As the authors suggest,  Perhaps the social constructivist paradigm offers an insight into continuing failure, namely that what is happening in an ISD project is far more complex than the simple translation of a description of an external reality into instructions for a computer. 
It is the emergence and articulation of multiple, indeterminate, sometimes unconscious, sometimes ineffable realities and the negotiated achievement of a consensus of a new, agreed reality in an explicit form, such as a business or data model, which is amenable to computerization.  With this in mind, the authors aim to develop a model that addresses the shortcomings of the traditional, positivist view of knowledge in ISD projects. They do this by representing Berger and Luckmann's theory of social constructivism in terms of a knowledge process model. They then identify management principles that map on to these processes. These principles form the basis of a survey which is used as an operational version of the process model. The operational model is then assessed by experts and tested by a project manager in a real life project.  The knowledge creation/sharing process model The process model that Jackson and Klobas describe is based on Berger and Luckmann's work.  Figure 1: Knowledge creation/sharing model Figure 1: Knowledge creation/sharing model  The model  describes how personal knowledge is created personal knowledge being what an individual knows. Personal knowledge is built up using mental models of the world these models are frameworks that individuals use to make sense of the world.  According to the Jackson Klobas process model, personal knowledge is built up through a number of process including:  Internalisation: The absorption of knowledge by an individual  Knowledge creation: The construction of new knowledge through repetitive performance of tasks (learning skills) or becoming aware of new ideas, ways of thinking or frameworks. The latter corresponds to learning concepts and theories, or even new ways of perceiving the world. These correspond to a change in subjective reality for the individual.  Externalisation: The representation and description of knowledge using speech or symbols so that it can be perceived and internalized by others. 
Think of this as explaining ideas or procedures to other individuals.  Objectivation: The creation of a shared constructs that represent a group's understanding of the world. At this point, knowledge is objectified and is perceived as having an existence independent of individuals.  Legitimation: The authorization of objectified knowledge as being "correct" or "standard."  Reification: The process by which objective knowledge assumes a status that makes it difficult to change or challenge. A familiar example of reified knowledge is any procedure or process that is "hardened" into a system "That's just the way things are done around here," is a common response when such processes are challenged.  The links depicted in the figure show the relationships between these processes.  Jackson and Klobas suggest that knowledge creation in ISD projects is a social process, which occurs through continual communication between the business and IT. Sure, there are other elements of knowledge creation design, prototyping, development, learning new skills etc. but these amount to nought unless they are discussed, argued, agreed on and communicated through social interactions. These interactions occur in the wider context of the organization, so it is reasonable to claim that the resulting knowledge takes on a form that mirrors the social environment of the organization.  Clearly, this model of knowledge creation is very different from the usual interpretation of knowledge having an independent reality, regardless of whether it is known to the group or not.  An operational model The above is good theory, which makes for interesting, but academic, discussions. What about practice? Can the model be operationalised?  Jackson and Klobas describe an approach to creating to testing the utility (rather than the validity) of the model.  I discuss this in the following sections.  
Knowledge sharing heuristics  To begin with, they surveyed the literature on knowledge management to identify knowledge sharing heuristics (i.e. experience based techniques to enable knowledge sharing).  As an example, some of the heuristics associated with the externalization process were:  We have standard documentation and modelling tools which make business requirements easy to understand Stakeholders and IS staff communicate regularly through direct face to face contact We use prototypes The authors identified more than 130 heuristics. Each of these was matched with a process in the model. According to the authors, this matching process was simple: in most cases there was no doubt as to which process a heuristic should be attached to. This suggests that the model provides a natural way to organize the voluminous and complex body of research in knowledge creation and sharing. Why is this important? Well, because it suggests that the conceptual model (as illustrated in Fig. 1) can form the basis for a simple means to assess knowledge creation / sharing capabilities in their work environments, with the assurance that they have all relevant variables covered.  Validating the mapping  The validity of the matching was checked using twenty historical case studies of ISD projects. This worked as follows: explanations for what worked well and what didn't were mapped against the model process areas (using the heuristics identified in the prior step). The aim was to answer the question:  "is there a relationship between project failure and problems in the respective knowledge processes or, conversely, between project success and the presence of positive indicators?"  One of the case studies the authors use is the well known (and possibly over analysed) failure of the automated dispatch system for the London Ambulance Service.  
The paper has a succinct summary of the case study, which I reproduce below:  The London Ambulance Service (LAS) is the largest ambulance service in the world and provides accident and emergency and patient transport services to a resident population of nearly seven million people. Their ISD project was intended to produce an automated system for the dispatch of ambulances to emergencies. The existing manual system was poor, cumbersome, inefficient and relatively unreliable. The goal of the new system was to provide an efficient command and control process to overcome these deficiencies. Furthermore, the system was seen by management as an opportunity to resolve perceived issues in poor industrial relations, outmoded work practices and low resource utilization. A tender was let for development of system components including computer aided dispatch, automatic vehicle location, radio interfacing and mobile data terminals to update the status of any call out. The tender was let to a company inexperienced in large systems delivery. Whilst the project had profound implications for work practices, personnel were hardly involved in the design of the system. Upon implementation, there were many errors in the software and infrastructure, which led to critical operational shortcomings such as the failure of calls to reach ambulances. The system lasted only a week before it was necessary to revert to the manual system.  Jackson and Klobas show how their conceptual model maps to knowledge related factors that may have played a role in the failure project. For example, under the heading of personal knowledge, one can identify at least two potential factors: lack of involvement of end users in design and selection of an inexperienced vendor. Further, the disconnect between management and employees suggests a couple of factors relating to reification: mutual negative perceptions and outmoded (but unchallenged) work practices.  
From their validation, the authors suggest that the model provides a comprehensive framework that explains why these projects failed. That may be overstating the case what's cause and what's effect is hard to tell, especially after the fact. Nonetheless, the model does seem to be able to capture many, if not all, knowledge related gaps that could have played a role in these failures. Further, by looking at the heuristics mapped to each process, one might be able to suggest ways in which these deficiencies could have been addressed. For example, if externalization is a problem area one might suggest the use of prototypes or encourage face to face communication between IS and business personnel.  Survey based tool  Encouraged by the above, the authors created a survey tool which was intended to evaluate knowledge creation/sharing effectiveness in project environments. In the tool, academic terms used in the model were translated into everyday language (for example, the term externalization was translated to knowledge sharing see Fig 1 for translated terms). The tool asked project managers to evaluate their project environments against each knowledge creation process (or capability) on a scale of 1 to 10.  Based on inputs, it could recommend specific improvement strategies for capabilities that were scored low. The tool was evaluated by four project managers, who used it in their work environment over a period of 4 6 weeks. At the end of the period, they were interviewed and their responses were analysed using content analysis to match their experiences and requirements against the designed intent of the tool.  Unfortunately, the paper does not provide any details about the tool, so it's difficult to say much more than paraphrase the authors comments.  Based on their evaluation, the authors conclude that the tool provides:  A common framework for project managers to discuss issues pertaining to knowledge creation and sharing. 
A means to identify potential problems and what might be done to address them. Field testing  One of the evaluators of the model tested the tool in the field. The tester was a project manager who wanted to identify knowledge creation/sharing deficiencies in his work environment, and ways in which these could be addressed.  He answered questions based on his own evaluation of knowledge sharing capabilities in his environment and then developed an improvement plan based on strategies suggested by the tool along with some of his own ideas.  The completed survey and plan were returned to the researchers.  Use of the tool revealed the following knowledge creation/sharing deficiencies in the project manager's environment:  Inadequate personal knowledge. Ineffective externalization Inadequate standardization (objectivation) Strategies suggested by the tool include:  An internet portal to promote knowledge capture and sharing. This included discussion forums, areas to capture and discuss best practices etc. Role playing workshops to reveal how processes worked in practice (i.e. surface tacit knowledge). Based on the above, the authors suggest that:  Technology can be used to promote support knowledge sharing and standardization, not just storage. Interventions that make tacit knowledge explicit can be helpful. As a side benefit, they note that the survey has raised consciousness about knowledge creation/sharing within the team. Reflections and Conclusions In my opinion, the value of the paper lies not in the model or the survey tool, but the conceptual framework that underpins them namely, the idea knowledge depends on, and is shaped by, the social environment in which it evolves. Perhaps an example might help clarify what this means. Consider an organisation that decides to implement project management "best practices" as described by <fill in any of the popular methodologies here>. 
The wrong way to do this would be to implement practices wholesale, without regard to organizational culture, norms and pre existing practices. Such an approach is unlikely to lead to the imposed practices taking root in the organisation. On the other hand, an approach that picks the practices that are useful and tailors these to organizational needs, constraints and culture is likely to meet with more success. The second approach works because it attempts to bridge gap between the "ideal best practice" and social reality in the organisation. It encourages employees to adapt practices in ways that make sense in the context of the organization. This invariably involves modifying practices, sometimes substantially, creating new (socially constructed!) knowledge in the bargain.  Another interesting point the authors make is that several knowledge sharing heuristics (130, I think the number was) could be classified unambiguously under one of the processes in the model. This suggests that the model is a reasonable view of the knowledge creation/sharing process. If one accepts this conclusion, then the model does indeed provide a common framework for discussing issues relating knowledge creation in project environments. Further, the associated heuristics can help identify processes that don't work well.  I'm unable to judge the usefulness of the survey based tool developed by the authors because they do not provide much detail about it in the paper. However, that isn't really an issue;  the field of project management has too many "tools and techniques" anyway.  The key message of the paper, in my opinion, is the that every project has a unique context, and that the techniques used by others have to be interpreted and applied in ways that are meaningful in the context of the particular project. 
The paper is an excellent counterpoint to the methodology oriented practice of knowledge management in projects; it should be required reading for methodologists and  project managers who believe that things need to be done by The Book, regardless of social or organizational context.

Contents of the same document after the data cleaning steps above were applied :

## convent approach knowledg manag project focus cognit thought relat mechan aspect knowledg creation captur altern view  consid knowledg creat interact peopl interact develop mutual accept interpret theori fact  suit   project knowledg social construct true project manag  pay attent environment social factor influenc knowledg construct posit  paul jackson jane kloba paper entitl build knowledg project practic applic social constructiv inform system develop present knowledg creation share process model base social constructivist theori articl summari review paper social constructivist view knowledg jackson kloba begin observ engin disciplin found belief knowledg  express proposit correspond realiti independ human percept howev altern view knowledg absolut relat  depend mental model belief  interpret fact object event relev exampl softwar product view busi user softwar develop  group   applic term util wherea    instanc  technolog percept gap   occur   homogen group team compris softwar develop exampl  happen varieti reason differ experi cultur background make group social constructiv  gap  bridg author discuss reli work berger luckmann describ gap percept differ individu  overcom creat social construct share realiti phrase social construct impli realiti pertain project exampl creat  common understand issu follow mutual agreement player compris realiti view strike  chord akin state aim dialogu map techniqu describ sever earlier post  articl exampl relev project knowledg inform system develop social construct  author make point inform system develop isd project intens exercis construct social realiti process data model model inform  world view system design   formal represent isd project oper realiti  explicit construct becom understood accept negoti agreement particip  cultur busi essenti knowledg emerg interact discuss project proceed howev methodolog  design typic found engin approach  positivist view  social  author suggest perhap social constructivist paradigm 
offer insight continu failur  happen isd project  complex simpl translat descript extern realiti instruct comput emerg articul multipl indetermin sometim unconsci sometim ineff realiti negoti achiev consensus  agre realiti explicit form busi data model amen computer mind author aim develop model address shortcom tradit positivist view knowledg isd project repres berger luckmann theori social constructiv term knowledg process model identifi manag principl map process principl form basi survey  oper version process model oper model assess expert test project manag real life project knowledg creationshar process model process model jackson kloba describ base berger luckmann work figur knowledg creationshar model figur knowledg creationshar model model describ person knowledg creat person knowledg individu  person knowledg built  mental model world model framework individu  make sens world accord jackson kloba process model person knowledg built number process includ internalis absorpt knowledg individu knowledg creation construct  knowledg repetit perform task learn skill becom awar  idea   framework  correspond learn concept theori    perceiv world correspond chang subject realiti individu externalis represent descript knowledg  speech symbol  perceiv intern   explain idea procedur individu objectiv creation share construct repres group understand world point knowledg objectifi perceiv exist independ individu legitim author objectifi knowledg correct standard reific process object knowledg assum status make difficult chang challeng familiar exampl reifi knowledg procedur process harden system    thing   common respons process challeng link depict figur show relationship process jackson kloba suggest knowledg creation isd project social process occur continu communic busi  element knowledg creation design prototyp develop learn  skill  amount nought  discuss argu agre communic social interact interact occur wider context organ reason claim result knowledg  form mirror 
social environ organ clear model knowledg creation differ usual interpret knowledg independ realiti    group oper model good theori make interest academ discuss practic  model operationalis jackson kloba describ approach creat test util  valid model discuss follow section knowledg share heurist begin survey literatur knowledg manag identifi knowledg share heurist  experi base techniqu enabl knowledg share exampl heurist associ extern process standard document model tool make busi requir easi understand stakehold staff communic regular direct face face contact  prototyp author identifi heurist match process model accord author match process simpl case doubt process heurist attach suggest model provid natur  organ volumin complex bodi research knowledg creation share import  suggest conceptu model illustr fig  form basi simpl  assess knowledg creation share capabl work environ assur relev variabl cover valid map valid match check  twenti histor case studi isd project work follow explan work  didnt map model process area  heurist identifi prior step aim answer question relationship project failur problem respect knowledg process convers project success presenc posit indic  case studi author    possibl analys failur autom dispatch system london ambul servic paper succinct summari case studi reproduc london ambul servic las largest ambul servic world provid accid emerg patient transport servic resid popul   million peopl isd project intend produc autom system dispatch ambul emerg exist manual system poor cumbersom ineffici relat unreli goal  system provid effici command control process overcom defici furthermor system  manag opportun resolv perceiv issu poor industri relat outmod work practic low resourc util tender  develop system compon includ comput aid dispatch automat vehicl locat radio interfac mobil data termin updat status call tender  compani inexperienc larg system deliveri whilst project profound implic work practic personnel hard involv design system  
implement mani error softwar infrastructur led critic oper shortcom failur call reach ambul system  week necessari revert manual system jackson kloba show conceptu model map knowledg relat factor  play role failur project exampl head person knowledg   identifi   potenti factor lack involv end user design select inexperienc vendor disconnect manag employe suggest coupl factor relat reific mutual negat percept outmod unchalleng work practic valid author suggest model provid comprehens framework explain project fail  overst case  caus  effect hard  especi fact nonetheless model  abl captur mani knowledg relat gap play role failur  heurist map process   abl suggest  defici address exampl extern problem area   suggest  prototyp encourag face face communic busi personnel survey base tool encourag author creat survey tool intend evalu knowledg creationshar effect project environ tool academ term  model translat everyday languag exampl term extern translat knowledg share  fig translat term tool  project manag evalu project environ knowledg creation process capabl scale base input recommend specif improv strategi capabl score low tool evalu  project manag  work environ period week end period interview respons analys  content analysi match experi requir design intent tool unfortun paper provid detail tool difficult   paraphras author comment base evalu author conclud tool provid common framework project manag discuss issu pertain knowledg creation share  identifi potenti problem   address field test  evalu model test tool field tester project manag  identifi knowledg creationshar defici work environ  address answer question base evalu knowledg share capabl environ develop improv plan base strategi suggest tool  idea complet survey plan return research  tool reveal follow knowledg creationshar defici project manag environ inadequ person knowledg ineffect extern inadequ standard objectiv strategi suggest tool includ internet portal promot knowledg captur share includ discuss 
forum area captur discuss  practic  role play workshop reveal process work practic  surfac tacit knowledg base author suggest technolog   promot support knowledg share standard  storag intervent make tacit knowledg explicit   side benefit note survey rais conscious knowledg creationshar  team reflect conclus opinion valu paper lie model survey tool conceptu framework underpin  idea knowledg depend shape social environ evolv perhap exampl   clarifi  consid organis decid implement project manag  practic describ fill popular methodolog wrong  implement practic wholesal  regard organiz cultur norm pre exist practic approach unlik lead impos practic  root organis hand approach pick practic  tailor organiz  constraint cultur  meet success  approach work attempt bridg gap ideal  practic social realiti organis encourag employe adapt practic  make sens context organ invari involv modifi practic sometim substanti creat  social construct knowledg bargain anoth interest point author make sever knowledg share heurist  number classifi unambigu  process model suggest model reason view knowledg creationshar process  accept conclus model inde provid common framework discuss issu relat knowledg creation project environ associ heurist   identifi process dont work  im unabl judg  survey base tool develop author provid  detail paper howev isnt realli issu field project manag mani tool techniqu  key messag paper opinion everi project uniqu context techniqu   interpret appli  meaning context  project paper excel counterpoint methodolog orient practic knowledg manag project requir read methodologist project manag believ thing   book  social organiz context

Data Understanding

The “docs” directory contains 42 text documents. After the data cleaning steps, the corpus contains 4166 unique terms. The bar chart and word cloud below show the most frequent terms in the corpus. In the word cloud, the more often a term occurs, the larger it is drawn. The three most frequent terms are project, risk and manag, and their occurrence counts are close to one another.

Bar chart of the most frequent words in the corpus

Word cloud of the most frequent words in the corpus

A unigram is a single word. Two consecutive words form a bigram, and three consecutive words form a trigram.
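
As a rough illustration of how such n-grams can be counted, here is a minimal Python sketch. The `ngrams` helper and the token list are hypothetical and only stand in for the cleaned corpus; the report's own figures were not produced with this code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a hyphen."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical cleaned tokens, for illustration only.
tokens = ["project", "manag", "risk", "manag", "project", "manag", "risk"]

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(2))
print(trigram_counts.most_common(1))
```

On a real corpus the counting would normally be done per document, so that an n-gram never spans the end of one document and the start of the next.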

The figure below shows the most frequent bigrams in our corpus; project-manag, risk-manag and complet-time are the three most common.

Most frequent Bigrams in the corpus

The figure below shows the most frequent trigrams in our corpus; monte-carlo-simul is the most frequent.

Most frequent Trigrams in the corpus

Grouping the unigrams into bigrams and trigrams gives a first idea of the hidden themes and topics in our corpus. Here, the bigrams and trigrams indicate that the top topics could be related to :

Project management
Risk management
Monte Carlo simulation

Analysis

Although bigrams and trigrams give a quick glimpse of the likely topics, further text analytics approaches are needed to identify the topics more reliably.

The first text analytics approach we look at is clustering.

Clustering

Clustering is the process of grouping a set of objects so that objects within the same group are more similar to each other than to objects in other groups. In text analytics, clustering places documents that use similar words in the same group and dissimilar documents in different groups. Two widely used clustering methods are :

Hierarchial clustering
K means clustering

Since both the clustering methods compute differently, in our approach we have used both the clustering methods to compare the results from both the methods and decide on the suitable number of clusters which can be formed for the corpus.

Hierarchical clustering

The hierarchical clustering model was run and, after trial and error over the number of subtrees (clusters), either 5 or 6 clusters appeared to be the most suitable grouping for our corpus.

The dendrograms below were obtained for 5 and 6 clusters respectively. In a dendrogram, branch points separated by a large gap indicate well-defined clusters, while branch points that are closely spaced indicate clusters that are similar to each other and therefore poorly separated. In our analysis, cutting the dendrogram into 5 or 6 subtrees gave well-defined clusters, with large gaps between the branch points at the cluster boundaries.
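
The general workflow can be sketched in base R as follows. The small document-term matrix is hypothetical, and the Euclidean distance and Ward linkage are assumptions; the report does not state which distance or linkage was used.

```r
# Hypothetical document-term counts: 6 documents, 4 stemmed terms.
dtm <- matrix(
  c(5, 4, 0, 0,
    4, 5, 1, 0,
    0, 1, 5, 4,
    0, 0, 4, 5,
    5, 5, 0, 1,
    1, 0, 5, 5),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("Doc", 1:6), c("project", "risk", "cluster", "topic"))
)

d  <- dist(dtm, method = "euclidean")  # pairwise document distances
hc <- hclust(d, method = "ward.D2")    # agglomerative clustering (linkage is an assumption)
groups <- cutree(hc, k = 2)            # cut the tree into 2 subtrees

# plot(hc); rect.hclust(hc, k = 2)     # draws the dendrogram with cluster boxes
groups
```

In this toy example Doc1, Doc2 and Doc5 use the same vocabulary and fall into one group, while Doc3, Doc4 and Doc6 fall into the other. Changing k in cutree() is how different numbers of subtrees, such as 5 and 6, can be compared.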

Hierarchical clustering dendrogram for 5 subtrees :

Hierarchical clustering dendrogram for 6 subtrees :

K-means clustering

Since hierarchical clustering suggested 5 or 6 cluster groups for our corpus, K-means clustering was run with k values of 5 and 6 to check which of the two is the better choice.

The clusplot projects the documents onto the first two principal components and outlines each cluster. The percentage reported on the plot is the share of point variability explained by those two components, so it describes the data rather than the clustering. This is why the value is the same, 49.02%, for both k = 5 and k = 6, and the clusplots alone cannot separate the two candidate values.

Clusplot for K=5 :

Clusplot for K=6 :

The clusplots report the same variability for k values of 5 and 6, so an elbow plot was used to choose between them. The preferred number of clusters is the one beyond which the within-group sum of squares (WSS) stops decreasing noticeably as more clusters are added. Here the decrease in WSS clearly slows down and flattens after 5 clusters. This indicates that, of the two candidates, k = 5 is the better number of clusters for grouping our corpus.
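
The values behind an elbow plot can be computed with base R's kmeans(). The sketch below uses simulated two-dimensional data with three well-separated groups as a stand-in for the document-term matrix; the data, the seed and the range of k are all assumptions made for illustration.

```r
set.seed(42)  # k-means starts from random centres

# Hypothetical 2-D data with three well-separated groups of 20 points each.
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# plot(1:6, wss, type = "b", xlab = "k", ylab = "Within-group sum of squares")
round(wss, 1)
```

In this toy data the WSS drops sharply up to k = 3 and changes little afterwards, which is the "elbow". The same reading applied to the corpus is what points to k = 5 in this report.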

Elbow Plot for K-means clustering :

Topic Modelling

So far we have only identified a suitable number of clusters for the corpus; which topics the 5 clusters correspond to is still an open question. Topic modelling addresses this by finding groups of words (topics) that describe the documents. The Latent Dirichlet Allocation (LDA) algorithm has been used for topic modelling in this analysis.
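
A minimal sketch of fitting an LDA model in R is given below, assuming the topicmodels package is installed. The tiny count matrix, the choice of 2 topics, the default fitting method and the seed are assumptions for illustration and are not the settings used in this report.

```r
library(topicmodels)  # assumed to be installed; provides LDA(), terms(), topics()

# Hypothetical term counts for 4 documents over 6 stemmed terms.
counts <- matrix(
  as.integer(c(4, 3, 2, 0, 0, 0,
               3, 4, 1, 0, 0, 1,
               0, 0, 1, 4, 3, 2,
               0, 1, 0, 3, 4, 3)),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Doc", 1:4),
                  c("risk", "project", "manag", "cluster", "topic", "word"))
)

# Fit a 2-topic model; the seed only makes the run repeatable.
lda_fit <- LDA(counts, k = 2, control = list(seed = 1))

top_terms  <- terms(lda_fit, 3)  # top 3 terms per topic
doc_topics <- topics(lda_fit)    # most likely topic for each document
top_terms
doc_topics
```

terms() returns one column of top words per topic and topics() returns one topic number per document, which are the two kinds of output interpreted in the rest of this section.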

Although the preferred k value has been identified as 5, topic modelling was run with both 5 and 6 topics to check whether the sixth topic was distinct from the other five.

Top 8 most frequent words for every topic obtained for K=5 :

##      Topic 1    Topic 2    Topic 3    Topic 4     Topic 5    
## [1,] "knowledg" "risk"     "organis"  "document"  "time"     
## [2,] "issu"     "project"  "practic"  "word"      "task"     
## [3,] "question" "manag"    "chang"    "cluster"   "distribut"
## [4,] "point"    "work"     "mani"     "data"      "probabl"  
## [5,] "idea"     "process"  "model"    "topic"     "complet"  
## [6,] "discuss"  "author"   "design"   "figur"     "figur"    
## [7,] "argument" "approach" "techniqu" "term"      "simul"    
## [8,] "develop"  "paper"    "effect"   "algorithm" "number"

From the topic groups above, the theme of each topic can be interpreted as follows :

Topic 1 deals with asking questions, discussing and developing ideas, and arguing over a point or idea
Topic 2 deals with project management and risk management
Topic 3 deals with organisation practices, model design and technique
Topic 4 deals with clustering of documents and data algorithms
Topic 5 deals with task distribution and task completion

Top 8 most frequent words for every topic obtained for K=6 :

##      Topic 1   Topic 2    Topic 3     Topic 4     Topic 5    Topic 6   
## [1,] "project" "risk"     "task"      "document"  "data"     "issu"    
## [2,] "organis" "manag"    "time"      "word"      "model"    "knowledg"
## [3,] "manag"   "project"  "distribut" "cluster"   "point"    "discuss" 
## [4,] "work"    "base"     "probabl"   "topic"     "plot"     "question"
## [5,] "practic" "author"   "complet"   "term"      "boundari" "point"   
## [6,] "process" "techniqu" "figur"     "figur"     "function" "idea"    
## [7,] "chang"   "social"   "simul"     "corpus"    "decis"    "exampl"  
## [8,] "organ"   "mani"     "number"    "algorithm" "variabl"  "argument"

From the topic groups above, the theme of each topic can be interpreted as follows :

Topic 1 deals with project management, work organisation and process
Topic 2 deals with project management and risk management
Topic 3 deals with task distribution and task completion
Topic 4 deals with clustering of documents and words (that is, text analytics)
Topic 5 deals with data modelling
Topic 6 deals with asking questions, discussing and developing ideas, and arguing over a point or idea

From the above comparisons it can be noted that :

(Topic 1, k=5) is similar to (Topic 6, k=6)
(Topic 2, k=5) is similar to (Topic 2, k=6) and (Topic 1, k=6)
(Topic 3, k=5) is similar to (Topic 1, k=6)
(Topic 5, k=5) is similar to (Topic 3, k=6)
(Topic 4, k=5) is similar to (Topic 4, k=6) and (Topic 5, k=6)

The comparison shows that Topic 2 in k=5, which relates to risk management and project management, has been split into Topic 1 and Topic 2 in k=6. The two resulting topics share many of the same words (such as project and manag), so this split appears unnecessary.

Another interesting comparison is the split of Topic 4 in k=5 into (Topic 4, k=6) and (Topic 5, k=6). Topic 4 in k=5 combines data algorithm terms and text analytics terms, and was split into :

(Topic 4, k=6) which represents terms related to text analytics.
(Topic 5, k=6) which represents terms related to data algorithms.

I consider this split unnecessary as well: although data algorithms and text analytics are two different topics, both fall under the broader category of machine learning algorithms.
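
One way to check these similarity judgements is to count how many of the top 8 terms each pair of topics shares. The sketch below copies the terms from the two tables above; the names k5, k6 and overlap are introduced here only for illustration.

```r
# Top 8 terms per topic for k = 5, copied from the first table.
k5 <- list(
  k5_T1 = c("knowledg", "issu", "question", "point", "idea", "discuss", "argument", "develop"),
  k5_T2 = c("risk", "project", "manag", "work", "process", "author", "approach", "paper"),
  k5_T3 = c("organis", "practic", "chang", "mani", "model", "design", "techniqu", "effect"),
  k5_T4 = c("document", "word", "cluster", "data", "topic", "figur", "term", "algorithm"),
  k5_T5 = c("time", "task", "distribut", "probabl", "complet", "figur", "simul", "number")
)

# Top 8 terms per topic for k = 6, copied from the second table.
k6 <- list(
  k6_T1 = c("project", "organis", "manag", "work", "practic", "process", "chang", "organ"),
  k6_T2 = c("risk", "manag", "project", "base", "author", "techniqu", "social", "mani"),
  k6_T3 = c("task", "time", "distribut", "probabl", "complet", "figur", "simul", "number"),
  k6_T4 = c("document", "word", "cluster", "topic", "term", "figur", "corpus", "algorithm"),
  k6_T5 = c("data", "model", "point", "plot", "boundari", "function", "decis", "variabl"),
  k6_T6 = c("issu", "knowledg", "discuss", "question", "point", "idea", "exampl", "argument")
)

# Rows are k = 5 topics, columns are k = 6 topics, cells are shared term counts.
overlap <- sapply(k6, function(b) sapply(k5, function(a) length(intersect(a, b))))
overlap
```

In the resulting table, (Topic 5, k=5) and (Topic 3, k=6) share all 8 terms, and (Topic 2, k=5) shares 4 terms with each of Topics 1 and 2 in k=6, which is consistent with reading the latter as a split. Shared top terms are only a rough proxy for topic similarity, since they ignore term weights.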

Taken together, these findings again indicate that the ideal number of clusters for the corpus is 5 and not 6.

Network Graphs

Now that the evidence points to k=5 as the ideal number of topics, a network graph offers a final check. In the network graph each document is represented as a node, and an edge joins two nodes when the corresponding documents are similar. In text analytics, the similarity between two documents is calculated from the words they have in common. Additionally, the topic assigned to each document is shown on the network graph by colour: documents which were assigned the same topic by the LDA model are drawn as nodes of the same colour.
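
A rough sketch of how such a graph can be prepared is shown below. It assumes cosine similarity on term counts and an arbitrary threshold of 0.5 for drawing an edge; the report does not state which similarity measure or cut-off was used, and the matrix and topic labels are hypothetical.

```r
# Hypothetical document-term counts: 4 documents, 4 stemmed terms.
dtm <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 1,
    0, 0, 3, 2,
    0, 1, 2, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Doc", 1:4), c("risk", "project", "cluster", "topic"))
)

# Cosine similarity between every pair of documents.
norms   <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)

# Keep one edge per document pair whose similarity exceeds the (assumed) threshold.
adj <- cos_sim > 0.5
idx <- which(adj & upper.tri(adj), arr.ind = TRUE)

topic <- c(2, 2, 4, 4)  # hypothetical topic assignment per document
nodes <- data.frame(id = rownames(dtm), label = rownames(dtm), group = topic,
                    stringsAsFactors = FALSE)
edges <- data.frame(from = rownames(dtm)[idx[, "row"]],
                    to   = rownames(dtm)[idx[, "col"]],
                    stringsAsFactors = FALSE)

# With the visNetwork package installed, visNetwork(nodes, edges) would draw this graph.
edges
```

Here only Doc1 and Doc2, and Doc3 and Doc4, are joined by an edge, so the two hypothetical topics appear as separate groups. The group column is what a plotting package such as visNetwork can use to colour nodes by topic.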

The topic assignment for every document by the LDA model can be found below.

topics(ldaOut)

## Doc01.txt Doc02.txt Doc03.txt Doc04.txt Doc05.txt Doc06.txt Doc07.txt Doc08.txt 
##         2         2         3         2         2         2         2         3 
## Doc09.txt Doc10.txt Doc11.txt Doc12.txt Doc13.txt Doc14.txt Doc15.txt Doc16.txt 
##         2         3         1         1         1         1         1         1 
## Doc17.txt Doc18.txt Doc19.txt Doc20.txt Doc21.txt Doc22.txt Doc23.txt Doc24.txt 
##         1         4         4         4         4         4         2         3 
## Doc25.txt Doc26.txt Doc27.txt Doc28.txt Doc29.txt Doc30.txt Doc31.txt Doc32.txt 
##         3         3         3         2         2         2         3         2 
## Doc33.txt Doc34.txt Doc35.txt Doc36.txt Doc37.txt Doc38.txt Doc39.txt Doc40.txt 
##         3         3         5         5         5         5         5         5 
## Doc41.txt Doc42.txt 
##         5         4

The topics are colour-coded as follows :

Topic 1 - darkblue
Topic 2 - green
Topic 3 - brown
Topic 4 - black
Topic 5 - pink

The network graph shows that Topic 1 (dark blue), Topic 4 (black) and Topic 5 (pink) form distinct clusters. Documents within each of these topics are highly similar to one another: their nodes are connected mostly to nodes of the same topic and do not mix with nodes belonging to other topics.

However, Topics 2 and 3 are not distinctly clustered, as the documents of these two topics are very similar to each other.

Network graph representing topics of documents :

#plot network graph
visNetwork(no, e)

Sentiment Analysis

Although we have identified the topics of the documents, we do not yet know whether the documents carry positive or negative sentiment. This can be identified by performing sentiment analysis on the documents.

The sentiment scores obtained for each document show that documents 1 to 10 and 34 to 42 all had negative sentiment. In contrast, most of the documents from 11 to 34 had positive sentiment scores.
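
A lexicon-based sentiment score can be sketched in base R as follows. The six-word lexicon and the two token lists are hypothetical; a real analysis would rely on a published sentiment lexicon, and the report does not state which one was used.

```r
# Hypothetical mini-lexicon: -1 for negative words, +1 for positive words.
lexicon <- c(risk = -1, fail = -1, problem = -1, good = 1, success = 1, benefit = 1)

# Hypothetical tokenised documents.
docs <- list(
  DocA = c("project", "risk", "fail", "manag", "risk"),
  DocB = c("good", "idea", "success", "discuss", "problem")
)

# Score = sum of the lexicon values of the words found in the lexicon;
# words missing from the lexicon yield NA and are ignored.
score_doc <- function(tokens) sum(lexicon[tokens], na.rm = TRUE)

scores <- sapply(docs, score_doc)
scores
```

A document with a score below zero is read as negative and one above zero as positive. Because every occurrence is counted, a frequent negative word such as risk can dominate a document's score.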

Sentiment analysis for each document :

The plot below shows the top words contributing to positive and negative sentiment. The word risk contributes heavily to the negative sentiment. Most of documents 1 to 10 (7 of the 10) belong to Topic 2, in which risk is the most frequent word, which likely explains why documents 1 to 10 have negative sentiment.

Top words contributing to both positive and negative sentiments :

Evaluation

From the hierarchical clustering dendrograms and the k-means clusplots, the number of topics for the corpus appeared to be either 5 or 6, but it was difficult to choose between the two. The elbow plot for K-means clustering showed that the ideal number of clusters for the corpus is 5, and the LDA model showed a total of 5 relevant topics. Since the number of clusters and the number of topics agreed at 5, there was solid backing for selecting 5 as the number of clusters and topics. The network graph strengthened this claim further: the clusters of Topics 1, 4 and 5 were clearly distinct, and only Topics 2 and 3 lacked distinct clusters because of the similarities observed between them.

Conclusion

It can be concluded from this analysis that the ideal number of topics and clusters for the documents in the “docs” folder is 5. The 5 topics correspond closely to the following :

Topic 1 deals with asking questions, discussing and developing ideas, and arguing over a point or idea
Topic 2 deals with project management and risk management
Topic 3 deals with organisation practices, model design and technique
Topic 4 deals with clustering of documents and data algorithms
Topic 5 deals with task distribution and task completion

Analysis of Unstructured Data - Anatomy of an unknown corpus

Ganesh Arunagiri Rajan

02/06/2020